Data Visualizer Freezing with 2M COCs

I am experiencing an issue with the DHIS2 Data Visualizer, where it continuously loads without displaying any results. I suspect that the problem is caused by the large number of Category Option Combinations (COCs) in my dataset, which exceeds 2 million entries. When I generate pivot tables using other datasets with fewer COCs, the visualizer works without any problems. However, when I attempt to visualize data from this specific dataset with 2M+ COCs, the system gets stuck in an endless loading state.

The DHIS2 instance is hosted on an AWS EC2 m5.4xlarge instance (16 vCPUs, 64GB RAM) with a 1TB volume. Despite this high-performance setup, the dataset does not load successfully in the Data Visualizer, suggesting a potential issue with either the query execution time, system resource allocation, or how DHIS2 handles large COC datasets in analytics processing.

There are 5 categories, with a total of 332 category options.

Thank you for your post, and welcome to the DHIS2 Community of Practice! :tada:

Would it be possible to check the catalina.out log? If there are any errors in the log from the point you click ‘Update’ in the Data Visualizer app, would you please share them (with any authentication/sensitive info removed)?

Thank you!

Hi,

This was the only log entry in catalina.out.

And then I checked whether the analytics tables were still processing.

And then I checked for the long-running transactions:
740 | 00:07:28.146321 | active |
select categoryop0_.categoryoptioncomboid as category2_14_0_,
       categoryop0_.categoryoptionid as category1_14_0_,
       categoryop1_.categoryoptionid as category1_11_1_,
       categoryop1_.uid as uid2_11_1_,
       categoryop1_.code as code3_11_1_,
       categoryop1_.created as created4_11_1_,
       categoryop1_.lastUpdated as lastupda5_11_1_,
       categoryop1_.lastupdatedby as lastupda6_11_1_,
       categoryop1_.name as name7_11_1_,
       categoryop1_.shortname as shortnam8_11_1_,
       categoryop1_.description as descript9_11_1_,
       categoryop1_.formName as formnam10_11_1_,
       categoryop1_.startDate as startda11_11_1_,
       categoryop1_.endDate as enddate12_11_1_,
       categoryop1_.style as style13_11_1_,
       categoryop1_.translations as transla14_11_1_,
       categoryop1_.attributeValues as attribu15_11_1_,
       categoryop1_.userid as userid16_11_1_,
       categoryop1_.sharing as sharing17_11_1_
from categoryoptioncombos_categoryoptions categoryop0_
inner join categoryoption categoryop1_
        on categoryop0_.categoryoptionid = categoryop1_.categoryoptionid
where categoryop0_.categoryoptioncomboid = $1
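
For anyone wanting to reproduce this check, the output above came from PostgreSQL’s pg_stat_activity view, via a query roughly along these lines (the one-minute threshold is just an illustration):

-- List non-idle transactions that have been running for more than a minute
select pid,
       now() - xact_start as duration,
       state,
       query
from pg_stat_activity
where state <> 'idle'
  and now() - xact_start > interval '1 minute'
order by duration desc;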

Is it possible to put Glowroot on this server (https://glowroot.org)? It would reveal more detail on where the system is spending its time. Certainly, 2 million COCs is going to be tough to swallow, but the query above shouldn’t really be taking 7 minutes. Is it possible you could run this same query through EXPLAIN ANALYZE manually?
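
Something along these lines, substituting a real categoryoptioncomboid for the $1 placeholder (the id 12345 below is only a stand-in, and I have shortened the Hibernate column list for readability):

-- Run the logged query manually to see the actual plan and timing
explain analyze
select co.*
from categoryoptioncombos_categoryoptions link
inner join categoryoption co
        on link.categoryoptionid = co.categoryoptionid
where link.categoryoptioncomboid = 12345;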

Hi, here’s some of the logs.


Is there an alternative to this approach that doesn’t result in 2M COCs? Any suggestions would be greatly appreciated!

Honestly, any category combination with 2 million COCs is going to crash either the server or the browser. While it may be theoretically possible to support such a CC, we have implemented (based on experience) a reasonable default limit of around 50k total COCs in recent versions of DHIS2. This limit can be configured if you need to increase it, but over the years we have seen that extremely large CCs/COCs lead to many performance problems, and usually result in very sparse data.

I don’t really know the nature of your implementation, but here is an example. Let’s suppose we have a data element called “Inpatient cases” with the following category combination:

ICD10: Roughly 14k codes
Age: Let’s say five-year age bands (0-100), which is about 20 age bands
Gender: Male/Female (2)
Location: Urban/Rural (2)

This seems fairly simple, but such a category combo would produce some 1,120,000 category option combos (14,000 × 20 × 2 × 2). You do not expect (at least on an aggregate basis) to have non-zero values in every single one of these COCs each month, so your data becomes very “sparse”, with many COCs never being used. DHIS2 does not create COCs dynamically based on whether they are actually needed; it creates them all upfront, as the Cartesian product of all of the category options.
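
If you want to check how many COCs each of your category combinations actually generates, you can count the rows in the link table directly in the database. This is just a sketch assuming the standard DHIS2 table names (categorycombo and categorycombos_optioncombos); verify them against your schema version:

-- Count generated category option combos per category combination
select cc.name,
       count(*) as coc_count
from categorycombo cc
inner join categorycombos_optioncombos link
        on link.categorycomboid = cc.categorycomboid
group by cc.name
order by coc_count desc;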

When loading this enormous number of COCs into your browser, you then need to combine them with periods and perhaps organisation units, and the browser simply runs out of memory. If you need to analyze datasets of this magnitude, then a browser is just not the right tool.

As I mentioned before, I think that having that many COCs in any category combination is simply unreasonable, and points to a modelling flaw in your design. Perhaps you could consider using different data elements to decrease the cardinality of your COCs. What I mean by this is that you should strive to remove one of the dimensions (perhaps the one with the largest cardinality).

In the example I gave above, you could create 14k data elements (one data element for each ICD10 code) and then use this category combination:

Age: Five-year age bands (0-100), which is about 20 age bands
Gender: Male/Female (2)
Location: Urban/Rural (2)

We now have a category combo with only three dimensions and a total of 80 COCs (20 × 2 × 2), which seems totally reasonable.

Over the years, we have seen the category model often “abused”, and usually one of the categories should be replaced with some other metadata element, either an attribute option combo (Urban/Rural is a good candidate in this example) or a data element.

Good luck and best regards,
Jason

Thank you for your response! I have a question: will having thousands of data elements and indicators impact the performance and manageability of the Data Visualizer? Are there any other downsides to this kind of setup? Thank you!

It’s important to remember that the Data Visualizer is a web page. It is not a full-fledged data analysis environment like Stata/SPSS/R/Python. I think you should take a step back and ask yourself: “How are my users going to use a table in a web browser with thousands of rows and maybe hundreds of columns?” Anytime you are trying to create a complex table more than a few dozen lines long and a few columns wide, you are probably creating a table that will be very difficult for users to interpret and use. I must admit, however, that I am of the Tufte school of thought.

I think having thousands of indicators and data elements in a table is going to negatively impact performance. Why would it not? Indicators must be calculated in real time, and the results must then be rendered as HTML in your browser, likely consuming a lot of memory and resources.

Often when we see these types of performance problems, it is precisely because the wrong tool is being used. Analyzing large amounts of data can of course be done with the right tool, but humans are not great at scanning through large tables to detect patterns. This is why I think you should first define the problem you are actually trying to solve, rather than trying to display all the data at once.

Best regards,
Jason
