Dhis2 deadlock it self

Stephan_Mestach · 29 June 2019 10:32

We have regular dead lock of dhis2 instances.

    "http-nio-8080-exec-357" #3226 daemon prio=5 os_prio=0 tid=0x00007f9cc0043000 nid=0xcad waiting for monitor entry [0x00007f9c80cd9000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1868)
	- waiting to lock <0x00000006032d1bc8> (a java.util.concurrent.ConcurrentHashMap$Node)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.doComputeIfAbsent(BoundedLocalCache.java:2029)
	at com.github.benmanes.caffeine.cache.BoundedLocalCache.computeIfAbsent(BoundedLocalCache.java:2012)
	at com.github.benmanes.caffeine.cache.LocalCache.computeIfAbsent(LocalCache.java:113)
	at com.github.benmanes.caffeine.cache.LocalManualCache.get(LocalManualCache.java:54)
	at org.hisp.dhis.cache.LocalCache.get(LocalCache.java:101)
	at org.hisp.dhis.setting.DefaultSystemSettingManager.getSystemSetting(DefaultSystemSettingManager.java:207)
	at org.hisp.dhis.setting.DefaultSystemSettingManager.getSystemSettings(DefaultSystemSettingManager.java:348)
	at sun.reflect.GeneratedMethodAccessor1764.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe

strangely the db do nothing, all connections are idle

datid,datname,pid,usesysid,usename,application_name,client_addr,client_hostname,client_port,backend_start,xact_start,query_start,state_change,wait_event_type,wait_event,state,backend_xid,backend_xmin,query
16396,dhis2,14679,16395,dhis2,PostgreSQL JDBC Driver,172.31.19.128,,43942,2019-06-29 06:29:11.552247+00,,2019-06-29 09:38:51.691524+00,2019-06-29 09:38:51.69154+00,,,idle,,,SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE
16396,dhis2,14678,16395,dhis2,PostgreSQL JDBC Driver,172.31.19.128,,43940,2019-06-29 06:29:11.551823+00,,2019-06-29 09:38:51.679668+00,2019-06-29 09:38:51.679689+00,,,idle,,,SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE
16396,dhis2,12653,16395,dhis2,PostgreSQL JDBC Driver,172.31.19.128,,59430,2019-06-27 12:27:00.339296+00,,2019-06-29 09:38:51.689919+00,2019-06-29 09:38:51.689933+00,,,idle,,,SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE
16396,dhis2,14680,16395,dhis2,PostgreSQL JDBC Driver,172.31.19.128,,43944,2019-06-29 06:29:11.552594+00,,2019-06-29 09:38:46.623479+00,2019-06-29 09:38:46.623511+00,,,idle,,,SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE
16396,dhis2,14855,16395,dhis2,PostgreSQL JDBC Driver,172.31.19.128,,34296,2019-06-27 12:27:59.67825+00,,2019-06-29 09:38:51.707286+00,2019-06-29 09:38:51.7073+00,,,idle,,,SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE

some threads seem go a bit further in the “compute” part

"http-nio-8080-exec-510" #3512 daemon prio=5 os_prio=0 tid=0x00007f9cbc0cc800 nid=0xdcb in Object.wait() [0x00007f9c6d0a0000]
java.lang.Thread.State: WAITING (on object monitor)
at java.lang.Object.wait(Native Method)
at com.mchange.v2.resourcepool.BasicResourcePool.awaitAvailable(BasicResourcePool.java:1465)
at com.mchange.v2.resourcepool.BasicResourcePool.prelimCheckoutResource(BasicResourcePool.java:644)
- locked <0x00000004c26b6220> (a com.mchange.v2.resourcepool.BasicResourcePool)
at com.mchange.v2.resourcepool.BasicResourcePool.checkoutResource(BasicResourcePool.java:554)
at com.mchange.v2.c3p0.impl.C3P0PooledConnectionPool.checkoutAndMarkConnectionInUse(C3P0PooledConnectionPool.java:758)
at com.mchange.v2.c3p0.impl.C3P0PooledConnectionPool.checkoutPooledConnection(C3P0PooledConnectionPool.java:685)
at com.mchange.v2.c3p0.impl.AbstractPoolBackedDataSource.getConnection(AbstractPoolBackedDataSource.java:146)

most threads are involved in user and system settings access

org.hisp.dhis.setting.DefaultSystemSettingManager.getSystemSetting
org.hisp.dhis.interceptor.SystemSettingInterceptor.intercept(SystemSettingInterceptor.java:104)
org.hisp.dhis.user.DefaultUserSettingService.getUserSettingOptional

Is there anyway to specify a timeout on the connection pool in order to let this resolve (with a few errors) by it self instead of requiring a tomcat restart ?
Is this a known bug ? (dhis2 2.30)

Thanks,

for developers : a full thread dump is available here : threaddump.txt · GitHub

Stephan_Mestach · 30 June 2019 06:57

Is it possible that it’s related to this :

Issue with hanging lookups when migrating guava cache to caffeine · Issue #209 · ben-manes/caffeine · GitHub or this
Deadlock in asynchronous refresh · Issue #69 · ben-manes/caffeine · GitHub

Stephan_Mestach · 30 June 2019 08:24

https://jira.dhis2.org/browse/DHIS2-7197

jomutsani · 1 July 2019 07:55

Hi @Stephan_Mestach,

Our team from @dhis2-backend will have a look and advise on this. Kindly be patient and available in case they need more information from you.

Stephan_Mestach · 1 July 2019 11:03

Thanks, no problem.

Lars · 1 July 2019 14:12

Could potentially be JDK level bug:

https://bugs.openjdk.java.net/browse/JDK-8161372

Stephan_Mestach · 1 July 2019 18:17

Thanks for pointing this.

What are my options ?

morning/evening restart to prevent at least this during the day (12h cache from what I’ve seen in the codebase) ?
switching cache implementation ? Is redis cache advised ? (don’t see a lot of references in the doc except in cluster support)
is there a plan to support latest java versions ?

Stephan_Mestach · 9 July 2019 05:58

The live locking seem easy to reproduce if your computeIfAbsent call another cache entry

is this possible to run dhis2 on newer jdk ?

Stian · 10 July 2019 14:42

Hi @Stephan_Mestach,

just to keep the thread up to date, I want to report that we are able to reproduce the issue. We have come up with several ways to fix or mitigate the issue and are in the process of testing out the most reasonable way to fix this issue for existing versions of DHIS2.

pjaspers · 11 September 2019 09:15

Some follow-up: [DHIS2-7197] - Jira

Which resulted in: feat: Cache2k implementation by ameenhere · Pull Request #3664 · dhis2/dhis2-core · GitHub (switch to cache2k)

So this will probably get resolved in a future DHIS version. Thanks all involved!