The next release of Livebase will generate applications with better performance on complex queries over large datasets. Here is the story behind this major improvement, and what’s in there for you.
One of our customers managing a complex clinical database (with about 60 classes and almost 100 database tables) has modeled the main Patient class with dozens of derived attributes, i.e. attributes retrieving data from other classes through a query.
To make things worse, many of the Patient’s derived attributes are retrieved from classes connected to Patient class only indirectly, via long paths traversing several model associations, including many-to-many ones.
Apparently, our customer enjoyed the possibility to enrich the Patient class with additional derived “query” attributes (such as average blood parameters across exams) by just dragging attributes from other classes of the model onto the Patient class.
As a consequence, the query generated by Livebase for retrieving the patients included an impressive number of chained joins and grouping operators (avg, std, sum, count…). And no Relational DBMS likes this kind of queries, MySQL making no exception.
In fact, despite all the optimization techniques applied and all the indexes created by Livebase on the database, the query generated by the platform was quite slow as soon as the set of patients to retrieve was larger than a few thousands.
After some experiments with views and temporary tables, we started working on a technique for breaking down the original query into multiple (simpler) sub-queries that could be executed in parallel more efficiently, and for recombining the results in memory.
The idea behind this technique is to execute a different sub-query to retrieve a subset of the patient’s derived attributes (i.e. a subset of the columns of the original recordset) connected through the same path over the data-model (i.e. retrievable via the same chain of relational joins over the database).
The main problem with this technique is how to split/replicate the original WHERE clause into the sub-queries. What if the original WHERE clause contains a condition based on derived attributes retrieved by different sub-queries?
After two weeks spent analyzing the problem and one week of prototyping, we ended up with a sophisticated algorithm for splitting the original query according to the individual conditions in the WHERE clause, and to the chain of joins required to retrieve each derived attribute.
Retrieving nine thousands patients (with all their derived attributes) took about 40 seconds with the old single-query approach. Splitting the original query into multiple simpler sub-queries and combining their results in memory took about 2 seconds. A staggering 20x improvement.
Although such a dramatic performance improvement can be expected only on applications performing really complex queries, we believe that average Livebase applications with a smaller number of derived attributes will still provide sensibly better performances and better scalability over large datasets.
The parallel sub-query technique will be adopted whenever applicable by all the applications generated by the next version of the Livebase platform, which will be released as soon testing will be completed (probably two weeks).