DHIS2 2.30 tracked entity synchronisation takes forever and finally starves the DHIS2 connection pool

Is it possible that there’s a synchronisation deadlock? Could it be that the tablets try to synchronise multiple program stage instances linked to the same tracked entity instance?
Or does the user simply have too many trackedEntityInstances?

The updates and inserts on trackedentityinstance seem to take far too long and appear to deadlock.

All connections in the datasource pool end up taken, which eventually kills DHIS2.
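To confirm the pool exhaustion on the database side, a quick check like the sketch below can help. It assumes a PostgreSQL database named dhis2 and shell access to the database host; adjust names to your setup.

```sh
# Count Postgres sessions for the DHIS2 database grouped by state.
# A pool-sized pile of sessions stuck in "active" or "idle in transaction"
# matches the "all connections are taken" symptom.
sudo -u postgres psql -d dhis2 -c \
  "SELECT state, count(*)
     FROM pg_stat_activity
    WHERE datname = 'dhis2'
    GROUP BY state
    ORDER BY count(*) DESC;"
```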

Note that some thread stack traces look similar to the “caffeine” issue:
https://jira.dhis2.org/browse/DHIS2-7197

Is there a way to get more info in the logs?
E.g. the number of program stage / tracked entity instances, or the payload posted?
The nginx logs don’t say much:

/var/log/nginx/error.log:2019/08/13 06:06:41 [error] 5052#0: *87 upstream timed out (110: Connection timed out) while reading response header from upstream, client: ...., server: , request: "POST /api/trackedEntityInstances?strategy=SYNC HTTP/1.1", upstream: "http://17.17.0.17:8080/api/trackedEntityInstances?strategy=SYNC", host: ...


I’ve tcpdumped the network and extracted the request body with Wireshark; I’m now able to reproduce this on my server with curl (without the mobile app).

The payload “looks” trivial in number of instances:

  • 14 trackedEntityInstances
  • 14 enrollments
  • 135 events (so around 10 per entity)

As the nginx logs suggest, the request gets an HTTP 504 timeout.
Run the command enough times and you get the symptom where DHIS2 looks dead.
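For anyone wanting to do the same, a minimal sketch of counting what is in the captured body and replaying it is below. The file name payload.json, the credentials and the host are placeholders, and it assumes the sync payload nests enrollments and events under each tracked entity instance.

```sh
# Count the entities in the captured request body (structure assumed:
# trackedEntityInstances -> enrollments -> events).
jq '.trackedEntityInstances | length' payload.json
jq '[.trackedEntityInstances[].enrollments[]] | length' payload.json
jq '[.trackedEntityInstances[].enrollments[].events[]] | length' payload.json

# Replay the POST the app sends and print the HTTP status and total time
# (user, password and host are placeholders).
curl -s -o /dev/null -w '%{http_code} %{time_total}s\n' \
  -u admin:district \
  -H 'Content-Type: application/json' \
  -d @payload.json \
  'https://dhis2.example.org/api/trackedEntityInstances?strategy=SYNC'
```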

I don’t see the deadlock right now, but there is a ton of SQL running, so I assume the POSTed data had already been applied by a previous curl. DHIS2 then does a lot of selects to verify that the data matches and to conclude that nothing should be updated,
but it doesn’t reach that conclusion within one minute.
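To see which statements the sync is actually spending its time on, a sketch like this can help (again assuming a PostgreSQL database named dhis2):

```sh
# List the longest-running non-idle statements; during a stuck sync these
# tend to be selects against the tracked entity / program stage tables.
sudo -u postgres psql -d dhis2 -c \
  "SELECT pid, now() - query_start AS runtime, state, left(query, 120) AS query
     FROM pg_stat_activity
    WHERE datname = 'dhis2' AND state <> 'idle'
    ORDER BY runtime DESC
    LIMIT 20;"
```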


The synchronisation logs at the DHIS2 level are here: dhis2-logs.txt · GitHub

start 2019-08-14 13:23:14,079
end 2019-08-14 13:24:44,846
=> 1 minute 30 seconds

The import takes about 1 minute 30 seconds, longer than the nginx timeout. So the user is told the synchronisation didn’t work while DHIS2 is still processing the import… they press “synchronise” again… and so on, until they kill the server or deadlock themselves.


Can someone share their numbers?
What is the biggest tracker deployment out there?
Should users really be limited to one orgunit to keep this performant?
Will upgrading DHIS2 save the project?


@Emma_Kassy - are you able to assist @Stephan_Mestach with this please?

The server requires multiple restarts a day…


Bumping this because it is still an issue.


We’ve tracked this down to the Java side doing too much. My hypothesis is that it opens a DB transaction but never closes it, because the Java side can’t finish quickly enough and just gets swamped with requests and DB locks.

We’ve “fixed” this by upgrading our server to have more CPU available, which seems to help a lot, but this will not scale endlessly.
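If the hypothesis is transactions that are opened but never closed, the database should show them directly. A minimal sketch, assuming PostgreSQL 9.6 or later and a database named dhis2:

```sh
# Transactions that are open but not doing anything (the classic
# "opened a transaction and never committed" pattern).
sudo -u postgres psql -d dhis2 -c \
  "SELECT pid, now() - xact_start AS open_for, left(query, 80) AS last_query
     FROM pg_stat_activity
    WHERE state = 'idle in transaction'
    ORDER BY open_for DESC;"

# Sessions blocked behind another session's locks
# (pg_blocking_pids requires PostgreSQL 9.6 or later).
sudo -u postgres psql -d dhis2 -c \
  "SELECT pid, pg_blocking_pids(pid) AS blocked_by, wait_event_type, left(query, 80) AS query
     FROM pg_stat_activity
    WHERE cardinality(pg_blocking_pids(pid)) > 0;"
```

If such sessions do show up, PostgreSQL’s idle_in_transaction_session_timeout setting can at least stop them from starving the pool indefinitely, although it doesn’t fix the root cause on the Java side.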
