How we improved OpenLMIS import performance by over 90%

In today’s fast-paced digital landscape, software performance is critical, particularly in systems like OpenLMIS, where efficient data handling directly impacts user experience (UX) and operational workflows. Optimizing the performance of import features within OpenLMIS – such as products (orderable), programs, and facilities – enhances system responsiveness, providing users with quicker, more reliable access to crucial information. 

Following these principles, I led the effort to improve OpenLMIS import. In this post, I’ll share how I achieved a remarkable 90% performance boost, turning a once time-consuming task into a high-speed, efficient asset for users and administrators alike.

Hibernate batches

My first step was optimizing batching configuration. Batching allows you to send multiple insert operations in a single request to the database, which significantly reduces the overhead associated with multiple database round trips. This improvement can be achieved through the configuration of Hibernate’s batch settings.

In Hibernate, spring.jpa.properties.hibernate.jdbc.batch_size works together with spring.jpa.properties.hibernate.order_inserts to enable batching for inserts: the first sets how many statements are grouped into a single batch, and the second orders the inserts so that statements targeting the same table end up next to each other.

For updates, there are spring.jpa.properties.hibernate.jdbc.batch_versioned_data and spring.jpa.properties.hibernate.order_updates [1].
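
For reference, in a Spring Boot application.properties file these settings can look roughly like this (the batch size of 50 is an illustrative value, not necessarily the one used in OpenLMIS):

spring.jpa.properties.hibernate.jdbc.batch_size=50
spring.jpa.properties.hibernate.order_inserts=true
spring.jpa.properties.hibernate.order_updates=true
spring.jpa.properties.hibernate.jdbc.batch_versioned_data=true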

Tip

Take note that the latest Hibernate documentation (6.6 at the time of writing) no longer mentions these properties.

Sorting is crucial when dealing with entities that have @ManyToOne or @OneToOne relationships to other entities with certain cascade options [2]. Without it, Hibernate would execute the insert for each entity individually, followed by inserts for the related entity or entities, which effectively disables batching for that entity. An ID field with GenerationType.IDENTITY would also disable batching – but that’s not something that can be fixed by configuration.
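
To illustrate the ID pitfall, here is a simplified, hypothetical entity (not an actual OpenLMIS class, and assuming a javax.persistence-based stack) that keeps batching possible by using a sequence-based generator:

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.SequenceGenerator;

@Entity
public class ImportedProduct {

    // A sequence generator lets Hibernate pre-allocate IDs and keep batching.
    // GenerationType.IDENTITY would force an immediate INSERT per entity to
    // obtain the generated key, silently disabling JDBC batching.
    @Id
    @GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "imported_product_seq")
    @SequenceGenerator(name = "imported_product_seq",
        sequenceName = "imported_product_seq", allocationSize = 50)
    private Long id;

    private String code;
}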

In this case, with the tested application started locally and the database running on the same machine, these batch settings alone didn’t yield much improvement – the measured difference was negligible.

Merging inserts with PostgreSQL

Upon inspecting the actual SQL statements executed by Hibernate, I noticed that they weren’t quite what I expected. Hibernate was executing individual insert statements, which at first glance seemed odd. However, it’s important to understand that what Hibernate calls “batching” simply means sending multiple insert statements to the database driver at once and waiting for all of them to complete, rather than executing them one by one. That’s different from what I was really after: a batch process where multiple records are handled in a single operation.
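
If you want to observe this behaviour yourself – an assumption about how one might check it, not necessarily the exact setup I used – Hibernate’s SQL logging and statistics can be enabled in application.properties; the statistics output includes how many JDBC batches were executed:

logging.level.org.hibernate.SQL=DEBUG
spring.jpa.properties.hibernate.generate_statistics=true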

The PostgreSQL JDBC driver [3] exposes a boolean property, reWriteBatchedInserts, that can make database inserts more efficient. When this setting is enabled, instead of sending multiple insert statements separately to the database, the driver combines them into a single multi-row insert statement. This reduces the number of operations the database has to handle, improving performance.

To enable this feature, I added reWriteBatchedInserts=true to the connection URL when setting up the connection to the PostgreSQL database:

jdbc:postgresql://host:5432/dbname?user=user&reWriteBatchedInserts=true
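
In a Spring Boot configuration, this typically means appending the flag to the data source URL, along these lines (host, port and database name are placeholders):

spring.datasource.url=jdbc:postgresql://localhost:5432/openlmis?reWriteBatchedInserts=true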

With this change applied, the local test showed a 50% improvement. Unfortunately, the rewrite works for inserts only.

Reducing transaction size

Since the configuration optimizations didn’t provide enough of a boost, a code refactor could not be avoided. The import often works with files of 10,000 rows or more, and good practice for transactions is to keep them small – around 1,000-5,000 entities. For that reason, I chose to introduce simple transaction-level batching as the first code refactor.

Tip

This comes with some issues regarding rollback. Since each batch runs in its own transaction, an error doesn’t roll back the whole file “out-of-the-box”. In this case, since the import works in Create&Update mode, I decided to allow partial imports.

A simple split of the data into chunks of 1,000 rows – in my case safe to do, since the rows of a single import file are unrelated to each other – with each chunk processed in its own transaction didn’t provide any serious boost: a modest 10% improvement.
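
The idea behind the chunking looks roughly like the sketch below. It is a generic illustration, not the actual OpenLMIS code – the class and parameter names are made up – and it assumes Spring’s TransactionTemplate so that each chunk is committed in its own transaction:

import java.util.List;
import java.util.function.Consumer;
import org.springframework.transaction.support.TransactionTemplate;

public class ChunkedImporter<T> {

    private static final int CHUNK_SIZE = 1000;

    private final TransactionTemplate transactionTemplate;

    public ChunkedImporter(TransactionTemplate transactionTemplate) {
        this.transactionTemplate = transactionTemplate;
    }

    // Splits the rows into chunks and commits each chunk in its own transaction.
    // A failure therefore only rolls back the chunk it occurred in, not the
    // whole file – the partial-import behaviour mentioned in the tip above.
    public void importAll(List<T> rows, Consumer<List<T>> chunkImporter) {
        for (int from = 0; from < rows.size(); from += CHUNK_SIZE) {
            List<T> chunk = rows.subList(from, Math.min(from + CHUNK_SIZE, rows.size()));
            transactionTemplate.executeWithoutResult(status -> chunkImporter.accept(chunk));
        }
    }
}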

Note

You might notice that I give a single percentage when talking about each improvement. That’s because, during tests, I limited myself to a single data size – one that could be imported in a reasonable time – and I tested only a selected part of the data. This is not a formal way of performance testing. For formal performance testing see: Connect for Life/Vxnaid Performance Testing.

More refactoring

Sometimes configuration changes or quick refactors are not enough, and you simply have to dive into the code and change the algorithms – especially when some low-quality design decisions are plainly visible.

The general idea of the former import algorithm was to import each line individually. If a related entity was needed, a call to the Entity Manager or a repository’s findBySomething was made. In other words, for each line, the code would look up its dependencies individually, use them, and discard them. This made it easy to integrate transactions, but it is a suboptimal solution in general.

I refactored the code to first inspect the file’s rows for every related entity that will be needed – many of them repeat – and load them with single queries. In that way I build, or pre-heat, an in-memory cache that is then used during the actual importing process.
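
A minimal, generic sketch of the pre-heating step is shown below; it is not the OpenLMIS implementation, just the shape of the idea. The caller would supply a row-to-code extractor and a repository bulk method such as findAllByCodeIn (hypothetical names):

import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.stream.Collectors;

public final class CachePreheater {

    private CachePreheater() {
    }

    // Collects every distinct key referenced by the rows (e.g. a program code),
    // loads the matching entities with a single bulk query, and indexes them by
    // key so the row-by-row import can do in-memory lookups instead of issuing
    // one findBySomething query per row.
    public static <R, K, E> Map<K, E> preheat(List<R> rows,
                                              Function<R, K> keyOfRow,
                                              Function<Set<K>, List<E>> bulkLoader,
                                              Function<E, K> keyOfEntity) {
        Set<K> referencedKeys = rows.stream()
            .map(keyOfRow)
            .collect(Collectors.toSet());

        return bulkLoader.apply(referencedKeys).stream()
            .collect(Collectors.toMap(keyOfEntity, Function.identity()));
    }
}

A call could then look like preheat(rows, CsvRow::getProgramCode, programRepository::findAllByCodeIn, Program::getCode), where CsvRow, Program and the repository method are placeholder names.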

Tip

You should be careful when making a select query with an IN operator and a big list of items. Depending on the database there are different limits. For example, Oracle limits the size to 1000 items [4] but for PostgreSQL, it’s 32768 [5].
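
If the list of items can grow beyond such limits, one common defensive pattern – shown below as a generic sketch, not OpenLMIS code – is to partition the list and merge the partial results:

import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

public final class InClausePartitioner {

    private InClausePartitioner() {
    }

    // Splits a large IN (...) value list into partitions that stay below the
    // database/driver limits and merges the results of the partial queries.
    public static <K, E> List<E> queryInPartitions(List<K> keys,
                                                   int partitionSize,
                                                   Function<List<K>, List<E>> query) {
        List<E> result = new ArrayList<>();
        for (int from = 0; from < keys.size(); from += partitionSize) {
            List<K> partition = keys.subList(from, Math.min(from + partitionSize, keys.size()));
            result.addAll(query.apply(partition));
        }
        return result;
    }
}

For example, queryInPartitions(codes, 1000, programRepository::findAllByCodeIn) would stay within both limits mentioned above (the repository method is again a hypothetical name).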

For the first entity, the cache pre-heating produced another 10% improvement.

For the second entity – the one with the most related entities – the cache resulted in a whopping 91% improvement.

Summary

In this post, we’ve explored the journey of optimizing the import feature within OpenLMIS to achieve over a 90% improvement in performance. By leveraging Hibernate batch configurations, PostgreSQL-specific optimizations, transaction-level batching, and targeted code refactoring, I was able to streamline data handling and drastically reduce import time. 

These combined enhancements transformed the import feature from a cumbersome task into a swift, robust process that benefits both users and administrators. A task that used to take over 15 minutes now completes in under a minute.

