Hi everyone,
As the title suggests, I'm looking for recommendations on how to properly perform analytics performance tests when trying out new versions of DHIS2.
In the past, we faced problems with our production instance because we had not properly tested the analytics performance of a new DHIS2 version. After a lot of trial and error, we concluded that the issues were related either to known bugs in the DHIS2 version we upgraded to, or to new versions of Postgres that had a feature turned on by default which greatly increased the response time of analytics queries.
A small description of our DHIS2 implementation: it is a Tracker-based implementation with just 1 tracker program. Data entry is done mostly (90% of the time) using DHIS2 mobile on Android devices and then synced to the DHIS2 server. Data is synced at least 3 times a day: in the morning, in the afternoon and at night.
Reporting is also done mostly 3 times a day: in the first hours of the day to catch the latest information from the previous day, at mid-day to catch morning data, and in the afternoon to catch morning and afternoon data.
We also run analytics 3 times a day to support that reporting need.
The reporting is based mostly on program indicators.
So, to avoid running into these issues when upgrading to a newer version of DHIS2, we tried the following:
- create some shell scripts that could simulate a peak period, when many users are executing more than one report at the same time.
- we also had a list of the most used reports; these reports were sets of program indicators (some with few program indicators, others heavier with many program indicators).
- in the scripts we would randomly pick a user (from a list of users/passwords), select a random organisation unit (at a random level), and assign a random period (we tested monthly periods only)
- after randomly getting the user, location, period and report, we would perform a call to the DHIS2 analytics API endpoint and record the execution time in a log file (a minimal sketch of such a script is shown after this list).
- Then we created cronjobs to execute the scripts: 10 instances at the same time, every 5 minutes, for 2 hours. In other words, we simulated 10 users each executing 1 report at the same time, then another 10 users 5 minutes later, repeating the process for 2 hours (see the crontab sketch after this list).
- Since we had timed the same reports at different levels beforehand, we could compare the logged times against that baseline and determine whether DHIS2 was performing as expected.
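
For reference, here is a minimal sketch of what one iteration of such a script could look like. It assumes basic authentication and plain-text input files; the base URL, file names and UIDs are all placeholders, not our actual setup:

```bash
#!/usr/bin/env bash
# run_one_report.sh -- one simulated user running one report.
# All file names, UIDs and the base URL are placeholders.

BASE_URL="https://dhis2.example.org"
LOG_FILE="analytics_times.log"

# users.txt: one "username:password" per line
CRED=$(shuf -n 1 users.txt)
USER=${CRED%%:*}

# orgunits.txt: one org unit UID per line (mixed levels)
OU=$(shuf -n 1 orgunits.txt)

# periods.txt: one monthly period per line, e.g. 202401
PE=$(shuf -n 1 periods.txt)

# reports.txt: one report per line, written as a ;-separated
# list of program indicator UIDs
DX=$(shuf -n 1 reports.txt)

URL="${BASE_URL}/api/analytics.json?dimension=dx:${DX}&dimension=pe:${PE}&dimension=ou:${OU}"

# curl's %{time_total} reports the full request time in seconds
SECS=$(curl -s -o /dev/null -u "$CRED" -w '%{time_total}' "$URL")

echo "$(date -Iseconds),${USER},${OU},${PE},${SECS}" >> "$LOG_FILE"
```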
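And for the scheduling part, a small wrapper plus a crontab entry could reproduce the "10 users, every 5 minutes, for 2 hours" pattern (the 08:00–10:00 window and paths below are just examples):

```bash
#!/usr/bin/env bash
# simulate_peak.sh -- launch 10 concurrent simulated users
for i in $(seq 1 10); do
  ./run_one_report.sh &
done
wait   # let all 10 requests finish before exiting
```

```
# crontab: every 5 minutes between 08:00 and 09:55 (2 hours)
*/5 8-9 * * * /path/to/simulate_peak.sh
```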
This took us some time to create, implement and turn into results we could base decisions on.
I would like to know:
- Does the approach we took make any sense?
- Was it a correct approach?
- Has anyone faced the same issues and come up with a different approach?
- What could we improve?
- What could we do differently?
Since we will need to upgrade to a newer version, I was thinking of creating an application (React, PHP, Python) to perform these tests and store the results in a database, to facilitate both the interpretation of the results and the maintenance of the testing process.
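
In the meantime, even the current shell-based setup could push its log into a queryable database. A rough sketch with SQLite (the schema, file names and version tag are my own assumptions, not a fixed design):

```bash
#!/usr/bin/env bash
# load_results.sh -- import the CSV log into SQLite so runs can be
# compared across DHIS2 versions.

DB="analytics_tests.db"
VERSION="$1"   # e.g. "2.40", passed when loading a test round

sqlite3 "$DB" "CREATE TABLE IF NOT EXISTS runs (
  ts TEXT, username TEXT, orgunit TEXT, period TEXT,
  seconds REAL, dhis2_version TEXT);"

# analytics_times.log columns: timestamp,user,orgunit,period,seconds
while IFS=',' read -r ts user ou pe secs; do
  sqlite3 "$DB" "INSERT INTO runs VALUES
    ('$ts','$user','$ou','$pe',$secs,'$VERSION');"
done < analytics_times.log
```

A query like `SELECT dhis2_version, AVG(seconds) FROM runs GROUP BY dhis2_version;` would then give a quick before/after comparison between versions.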
Thanks in advance
Best regards