Dhis2 Server Downtime

lnfregos · 5 February 2021 14:53

Hello All,

We are experiencing dhis2 downtime on dhis2 v2.33.5 (and previously on v2.33.4).

Each time this happens, we see a CPU spike on our separate AWS RDS instance, which maxes out at 100% and becomes unresponsive for several hours. This has happened five times since July 2020. Twice the system recovered on its own; on the other three occasions we had to restart the RDS instance. We cannot see a pattern shared between the downtime occurrences, or understand why the system went down.

In the most recent downtime occurrence, we saw the following in the log right before dhis2 went down:

INFO 2021-01-29 12:00:00,000 ‘system-process’ update org.hisp.dhis.scheduling.JobConfiguration, name: Remove expired reserved values, uid: uwWCT2BMmlq (AuditLogUtil.java [taskScheduler-10])
INFO 2021-01-29 12:00:00,010 ‘system-process’ update org.hisp.dhis.scheduling.JobConfiguration, name: Remove expired reserved values, uid: uwWCT2BMmlq (AuditLogUtil.java [taskScheduler-10])

It seems that the job ‘Remove Expired Reserved Values’ could be the issue, but this job runs every hour, and we do not have tracked entity attributes configured to generate reserved values.
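
To look for a pattern across occurrences, one option is to tally which jobs were logged shortly before each outage. Below is a minimal Python sketch of that idea. The line layout is inferred only from the two log lines quoted above, and the function names (parse_line, jobs_before), the outage time, and the 10-minute window are assumptions for illustration, not anything dhis2 provides.

```python
import re
from collections import Counter
from datetime import datetime, timedelta

# Layout inferred from the two audit log lines quoted above (an assumption).
# The user name may be wrapped in straight or curly quotes.
LINE_RE = re.compile(
    r"^(?P<level>[A-Z]+)\s+"
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"['\u2018\u2019](?P<user>[^'\u2018\u2019]+)['\u2018\u2019]\s+"
    r"(?P<action>\w+)\s+(?P<cls>[\w.]+), "
    r"name: (?P<name>.*?), uid: (?P<uid>\w+)"
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    rec = m.groupdict()
    rec["ts"] = datetime.strptime(rec["ts"], "%Y-%m-%d %H:%M:%S,%f")
    return rec


def jobs_before(lines, outage_start, window_minutes=10):
    """Count job names logged in the window leading up to outage_start."""
    window_start = outage_start - timedelta(minutes=window_minutes)
    counts = Counter()
    for line in lines:
        rec = parse_line(line)
        if rec and window_start <= rec["ts"] <= outage_start:
            counts[rec["name"]] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "INFO 2021-01-29 12:00:00,000 'system-process' update "
        "org.hisp.dhis.scheduling.JobConfiguration, name: Remove expired "
        "reserved values, uid: uwWCT2BMmlq (AuditLogUtil.java [taskScheduler-10])",
    ]
    # The outage time below is a placeholder, not a measured value.
    print(jobs_before(sample, datetime(2021, 1, 29, 12, 5)))
```

Running this over the log window before each of the five outages and comparing the tallies would show whether the same job appears every time, or whether it is simply the hourly job that happens to be the last thing logged.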

Has anyone else experienced dhis2 downtime? What was the issue you faced? Was there a solution, or workaround? @Scott has this come up before?

Many Thanks,
Lauren

Mohammad_Ullah · 6 February 2021 17:34

Hi Lauren,

We had a similar issue in one of our instances. As a workaround, we increased the CPU cores from 2 to 4 on that server, where three other (medium-sized) instances are also running. We don’t know the exact cause so far and are still digging through the logs.

We also recommend that our users run data integrity checks, imports/exports, large pivot table generation, and other resource-heavy activities either early in the morning or at the end of the day.

Edward_Robinson · 8 February 2021 09:53

Are you able to ssh into the server and run the top command to see which process is responsible? I am guessing it’s the java executable, as I’ve often seen this behavior in Java apps. If it is, what is the output of java -version ?
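
If it does turn out to be java, a common follow-up is to list its threads with top -H -p <pid>, take a thread dump with jstack <pid>, and match the busiest thread ID from top against the hexadecimal nid= field in the dump. A rough Python sketch of that matching step is below; the helper names and the sample dump line are made up for illustration.

```python
import re


def tid_to_nid(tid: int) -> str:
    """Convert a decimal thread ID (as shown by top -H) to jstack's hex nid form."""
    return hex(tid)


# Thread header lines in a jstack dump start with the quoted thread name
# and carry a nid=0x... field further along the same line.
NID_RE = re.compile(
    r'^"(?P<name>[^"]+)".*\bnid=(?P<nid>0x[0-9a-f]+)', re.MULTILINE
)


def threads_by_nid(dump: str) -> dict:
    """Map each nid found in a thread dump to its thread name."""
    return {m.group("nid"): m.group("name") for m in NID_RE.finditer(dump)}


# Illustrative dump fragment only; the thread name and addresses are invented.
SAMPLE_DUMP = '''
"example-worker-1" #42 prio=5 os_prio=0 tid=0x00007f0a1c2d3000 nid=0x3039 runnable [0x00007f0a0b1fe000]
   java.lang.Thread.State: RUNNABLE
'''

if __name__ == "__main__":
    hot_tid = 12345  # placeholder for a thread ID read from top -H
    print(threads_by_nid(SAMPLE_DUMP).get(tid_to_nid(hot_tid)))
```

The stack trace printed under the matching thread in the real dump is what shows which code is burning the CPU.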

Mohammad_Ullah · 11 February 2021 13:01

Hi Edward,

Yes, it’s the java process that is occupying the CPU. We are using the following versions:

Tomcat Server number: 8.5.30.0
OS Name: Linux
OS Version: 5.4.95
Architecture: amd64
JVM Version: 1.8.0_171-8u171-b11-1~deb9u1-b11

After increasing the number of CPU cores, it seems to be more stable now.

lnfregos · 11 February 2021 19:34

Hello both,

Thank you for the replies.

@Mohammad_Ullah we have a separate AWS RDS instance with PostgreSQL, which should be sufficient for our activities.

@Edward_Robinson we are not able to ssh into our server, but we do have monitoring tools that inform us about the activity. None have been able to identify the cause of the CPU spike to 100%.

Edward_Robinson · 22 February 2021 10:17

Apologies for the late response, I’ve been out of action after eye surgery and am back at work today. If the issue persists, check your JVM environment settings including:
-Xms (initial heap size)
-Xmx (maximum heap size)
Also, how much RAM is available on the machine? How much is available to services such as PostgreSQL and Tomcat?
Are there other Java apps running on this server (e.g. monitoring tools)?
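
As a quick sanity check on those settings, something like the Python sketch below could pull -Xms and -Xmx out of the options string and compare the maximum heap against the RAM left after other services. The option string, RAM figures, and function names are placeholders I made up; substitute your own values.

```python
import re

# Multipliers for the k/m/g size suffixes accepted by -Xms and -Xmx.
UNITS = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}

# Match -Xms<size> / -Xmx<size> as whole whitespace-delimited options.
HEAP_RE = re.compile(r"(?<!\S)-(Xms|Xmx)(\d+)([kKmMgG]?)(?!\S)")


def heap_settings(java_opts: str) -> dict:
    """Return the heap sizes found in a JVM options string, in bytes."""
    found = {}
    for flag, number, unit in HEAP_RE.findall(java_opts):
        found[flag] = int(number) * UNITS[unit.lower()]
    return found


def heap_fits(java_opts: str, total_ram_bytes: int, reserved_bytes: int):
    """True if -Xmx plus RAM reserved for other services fits in total RAM.

    Returns None when no -Xmx option is present. Note that the JVM also
    uses memory outside the heap, so this is only a lower-bound check.
    """
    xmx = heap_settings(java_opts).get("Xmx")
    if xmx is None:
        return None
    return xmx + reserved_bytes <= total_ram_bytes


if __name__ == "__main__":
    GIB = 1024 ** 3
    opts = "-Xms2g -Xmx4g"  # placeholder options, not taken from this thread
    print(heap_settings(opts))
    print(heap_fits(opts, total_ram_bytes=8 * GIB, reserved_bytes=3 * GIB))
```

If the check fails, or the maximum heap leaves very little headroom, the machine may be swapping or the JVM may be spending most of its time in garbage collection, both of which can show up as sustained high CPU.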