Handling Analytics Generation with a Large Datavalue Table (115GB) for ICRC – Seeking Advice on Optimizing the Refresh Process (DHIS2 2.38.5)

Hello DHIS2 Community,

We’re encountering a challenge with the DHIS2 instance we manage for the ICRC, particularly with the analytics generation process. We are currently using DHIS2 version 2.38.5. Our datavalue table has grown quite large (around 115GB), and generating analytics for the last three years can take up to 6 hours. We need to refresh the data for the current year at least twice a day, while keeping the older data from previous years available in the analytics tables without having to regenerate it every time.

Based on the logs from our database, a significant portion of the time is spent on the creation of indexes. The system creates an index on every column, and for the 2024 data alone, we have around 119 columns in the analytics table, most of which represent categories, as most of our datasets have attributes.
[screenshot: database log excerpt showing time spent creating analytics indexes]
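For anyone who wants to reproduce this observation from the database side, here is a minimal sketch (Python with psycopg2; the connection settings are placeholders, and the 'analytics%' pattern assumes the default DHIS2 table naming) that counts the indexes PostgreSQL holds per analytics table:

```python
# Minimal sketch: count indexes per analytics table/partition.
# Connection settings are placeholders -- adjust for your environment.
import psycopg2

conn = psycopg2.connect("dbname=dhis2 user=dhis2 host=localhost")
cur = conn.cursor()

# pg_indexes lists one row per index; the analytics tables get one index
# per dimension column, which is where most of our generation time goes.
cur.execute("""
    SELECT tablename, count(*) AS index_count
    FROM pg_indexes
    WHERE schemaname = 'public' AND tablename LIKE 'analytics%'
    GROUP BY tablename
    ORDER BY index_count DESC;
""")
for table, index_count in cur.fetchall():
    print(table, index_count)
```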

At the moment, it seems necessary to include all years in the refresh process to keep the older data available. Continuous analytics doesn’t work for us either, as it refreshes everything once a day.
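For reference, a minimal sketch of the kind of call a scheduled refresh makes against the standard analytics endpoint (host, credentials, and the lastYears value are illustrative placeholders rather than our actual cron configuration):

```python
# Minimal sketch: trigger an analytics table run via the Web API.
# 'lastYears' is the documented parameter for limiting the run to recent
# years; on our 2.38 instance, restricting it appears to drop older years
# from analytics, which is exactly the problem described above.
import requests

resp = requests.post(
    "https://dhis.example.org/api/resourceTables/analytics",
    params={"lastYears": 3},
    auth=("admin", "district"),
)
print(resp.status_code, resp.text)
```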

Has anyone faced similar challenges with a large datavalue table? How do you manage to refresh analytics for recent data efficiently while keeping older data accessible? Are there any strategies or configurations that could help us update new data without reprocessing everything from previous years?

I will also attach the cron job configuration we are using.

We would greatly appreciate any insights or shared experiences!

Best regards,
Tafsir
ICRC


I believe the issue may be linked to the version we are using (2.38.5). From my investigation, I found a difference in how the database handles the analytics tables between versions.

For example, in DHIS2 version 2.40, the analytics tables don’t use partitions by year. Instead, each year’s data is stored in its own table. This structure allows more flexibility in refreshing data for a specific year.
[screenshot: per-year analytics tables in DHIS2 2.40]

However, in version 2.38, the analytics table is a single table partitioned by year. This seems to explain why all years need to be regenerated when we refresh the analytics.
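For what it's worth, here is a sketch of how one can compare the two layouts (Python with psycopg2; the DSN is a placeholder). On our 2.38 instance the yearly partitions show up in pg_inherits as children of the parent analytics table, while on the 2.40 dev instance the same query comes back empty because the yearly tables stand alone:

```python
# Hedged sketch: inspect how the analytics tables are laid out.
import psycopg2

conn = psycopg2.connect("dbname=dhis2")
cur = conn.cursor()

# Parent/child pairs: populated on 2.38 (single partitioned table),
# empty on 2.40 (standalone per-year tables).
cur.execute("""
    SELECT inhparent::regclass::text AS parent,
           inhrelid::regclass::text  AS child
    FROM pg_inherits
    WHERE inhparent::regclass::text LIKE 'analytics%';
""")
print(cur.fetchall())

# Tables matching the yearly naming pattern exist in both layouts
# (as partitions on 2.38, as ordinary tables on 2.40).
cur.execute("""
    SELECT tablename FROM pg_tables
    WHERE tablename ~ '^analytics_[0-9]{4}$';
""")
print(cur.fetchall())
```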

I came to this conclusion after upgrading our dev database to version 2.40 and comparing it with our current setup on version 2.38.
[screenshot: comparison of the analytics table structure in 2.38 and 2.40]

Does this conclusion seem correct to anyone with experience in handling large analytics tables? Any input would be appreciated.

cc
@alphasysavane @didate @Gassim


Hi @diallotafsir52

Thank you for sharing the details, and for taking the initiative to investigate and share the outcomes. I believe it'll be a good chance to get insights from @dhis2-platform + @dhis2-analytics about this (so I'm resharing).

Out of interest, I'm curious to know, regarding your use case: how accessible does older data need to be? For example, is the data Tracker data that is only needed for aggregate results, or does it still need to be accessible from the Capture app for updates, review, etc.? Do you need it in the dashboards only? Is it the same for both Tracker and aggregate data? I hope you don't mind my questions. :slight_smile:

Thanks!

Hi @diallotafsir52
It does not surprise me that the analytics generation process is a bit quicker in more recent versions. I think you might also see big differences depending on the version of PostgreSQL you are using, with more recent versions usually performing better. There are other posts in the CoP on those topics, but one potential issue I was thinking about is what seems to be a very large number of dimensions.

You mentioned that you are using categories for most of your data. It is worth pointing out that not all categories need to be enabled as data dimensions. Enabling a category as a data dimension comes with a price, namely longer analytics generation times, since an index needs to be generated for each category which is a dimension. It might be worth reviewing all of the categories to ensure that each of them is actually important enough to need to be a dimension.
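If you want a quick inventory before deciding, something like this (standard metadata API query; host and credentials are placeholders) will list every category currently flagged as a data dimension:

```python
# Hedged sketch: list categories flagged as data dimensions.
import requests

resp = requests.get(
    "https://dhis.example.org/api/categories",
    params={
        "filter": "dataDimension:eq:true",
        "fields": "id,name,dataDimensionType",
        "paging": "false",
    },
    auth=("admin", "district"),
)
for cat in resp.json()["categories"]:
    print(cat["id"], cat["dataDimensionType"], cat["name"])
```

Each category this returns costs you a column and an index in every analytics table, so the shorter this list, the faster your generation should run.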

Adding more categories as dimensions will of course allow you to re-aggregate your data in many different ways (which is perhaps what you need to do). The large number of dimensions in the Data Visualizer can become overwhelming, though. You may be able to keep just the most common dimensions (such as Gender) which are used across many category combinations, and then consider using indicators to slice out particular groups of your data, particularly for those categories which are less commonly used.

"Continuous analytics doesn't work for us either, as it refreshes everything once a day."

I am not sure exactly what you mean by this, but you can use continuous analytics to refresh only data which has changed. That way, you do not need to rebuild the entire analytics tables, only those rows which have been updated. You should be using this scheduling job instead:

[screenshot: the continuous analytics table job in the Scheduler app]

Here you can set the delay as a number of seconds (perhaps 15-30 minutes might be a good starting point). This analytics job will only refresh "dirty" data, meaning rows of data which have changed since the last time the analytics tables were generated. This might speed things up as well.
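For completeness, here is a rough sketch of what that job looks like when created through the Web API instead of the Scheduler app. The job type and parameter names follow the scheduler documentation, but please verify them against your version's /api/jobConfigurations schema before relying on this:

```python
# Hedged sketch: define a continuous analytics table job via the API.
# 'delay' is the number of seconds between incremental ("dirty data")
# refreshes; 'fullUpdateHourOfDay' schedules the single full rebuild per day.
import requests

job = {
    "name": "Continuous analytics table update",
    "jobType": "CONTINUOUS_ANALYTICS_TABLE",
    "delay": 1800,  # e.g. 30 minutes between incremental refreshes
    "jobParameters": {"fullUpdateHourOfDay": 3},
}
resp = requests.post(
    "https://dhis.example.org/api/jobConfigurations",
    json=job,
    auth=("admin", "district"),
)
print(resp.status_code, resp.text)
```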

Let us know how it goes.

Best regards,
Jason


Hi @jason

Thank you for your valuable insights!

Just to clarify, when I mentioned categories, I was actually referring to attributeOptionCombos. Most of our datasets have categoryCombos assigned to them, which results in a large number of columns in the analytics tables. These attributeOptionCombos seem to contribute to the longer analytics generation times, as an index is created for each one. We’ll definitely review whether all of them need to be enabled as dimensions and try to reduce them where possible.
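As a sanity check, a quick way to count the columns those attributeOptionCombos add to a yearly analytics table (psycopg2; the DSN and the table name are placeholders based on the default yearly naming):

```python
# Hedged sketch: count columns in one yearly analytics table.
import psycopg2

conn = psycopg2.connect("dbname=dhis2")
cur = conn.cursor()
cur.execute("""
    SELECT count(*)
    FROM information_schema.columns
    WHERE table_name = 'analytics_2024';
""")
print("columns in analytics_2024:", cur.fetchone()[0])
```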

Regarding continuous analytics, we’ve noticed that the logic of refreshing only specific data (e.g., by year) seems to work in DHIS2 version 2.40 but not in version 2.38. Unfortunately, due to the new Approval app introduced in version 2.39, we cannot upgrade beyond version 2.38 at this time, as it impacts our system configuration.

Lastly, I’ve noticed a structural difference in how the analytics table is handled between versions. In version 2.38, the analytics table is partitioned by year, meaning we have to regenerate all years to keep the data available. However, in version 2.40, each year has its own table, which seems more efficient. Do you know why this change was made and if it offers significant performance improvements?

Thanks again for your support. I’ll keep you posted on how these adjustments work out.

Best regards,
Tafsir

Hi @diallotafsir52
It does not really matter whether the category is used for disaggregation or as part of an attribute combo. What controls whether an analytics index is generated is a property of the category itself:

https://play.im.dhis2.org/stable-2-41-1/dhis-web-maintenance/index.html#/edit/categorySection/category/LFsZ8v5v7rq

So, if "Data dimension" is ticked, the category will become available in the Data Visualizer as a dimension. Because of that, there needs to be a column in the analytics tables, which in turn takes time to calculate. Thus, I think it would be advisable to carefully review all of those categories and make sure you really need all of them. Each category which is enabled as a dimension adds to the overall time it takes to generate analytics, and if it is never used as a dimension for any analysis, then it is really pointless to generate it as a dimension. We have seen many systems where categories are configured as dimensions, but the dimensions are never used in any favorite.
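If, after that review, you decide to switch a category off as a dimension, that is a one-line metadata change. A hedged sketch using the JSON Patch endpoint (the UID is the example category from the link above; host and credentials are placeholders, and since this changes metadata you should test it on a dev instance first):

```python
# Hedged sketch: untick "Data dimension" for one category via JSON Patch.
import json
import requests

uid = "LFsZ8v5v7rq"  # example category UID from the link above
patch = [{"op": "replace", "path": "/dataDimension", "value": False}]
resp = requests.patch(
    f"https://dhis.example.org/api/categories/{uid}",
    data=json.dumps(patch),
    headers={"Content-Type": "application/json-patch+json"},
    auth=("admin", "district"),
)
print(resp.status_code, resp.text)
```

The corresponding column and index should then disappear from the analytics tables on the next full run.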

This issue (implemented in 2.37) describes in more detail how the analytics tables are partitioned. I am not sure exactly what you mean about "partitioned by year" versus "each year has its own table"; perhaps this is the same as this ticket? Regardless, yes, partitioning should speed things up, but if you are refreshing data which has not changed, or building indices which you do not use, I think these are going to have a much larger impact on the time required to generate your analytics tables.

Best regards,
Jason