Problems with detailed metadata export..

jason · 21 January 2015 15:00

In one DHIS2 instance, we have a rather large number of data elements, more than 18,000. During the detailed metadata export, a request is made to “api/dataElements.json?paging=false&links=false”. This throws the CPU of the server through the roof, with the operation taking many minutes (has been running for 15 minutes now with no response).There are no errors in the logs or the browser. Build revision is not known, but it is 2.17 from from 2015-01-17 02:03.

Any idea what might be going on here?

Regards,

Jason

···

Jason P. Pickering
email: jason.p.pickering@gmail.com
tel:+46764147049

bobj · 21 January 2015 17:42

Can you try and get a jstack dump while it is busy chomping cpu. That will probably indicate where in the code it is spinning. There was a problem earlier with a weakness in the paging code but should have been fixed by that build.

···

On 21 January 2015 at 15:00, Jason Pickering jason.p.pickering@gmail.com wrote:

In one DHIS2 instance, we have a rather large number of data elements, more than 18,000. During the detailed metadata export, a request is made to “api/dataElements.json?paging=false&links=false”. This throws the CPU of the server through the roof, with the operation taking many minutes (has been running for 15 minutes now with no response).There are no errors in the logs or the browser. Build revision is not known, but it is 2.17 from from 2015-01-17 02:03.

Any idea what might be going on here?

Regards,

Jason

–

Jason P. Pickering
email: jason.p.pickering@gmail.com
tel:+46764147049

Mailing list: https://launchpad.net/~dhis2-devs

Post to : dhis2-devs@lists.launchpad.net

Unsubscribe : https://launchpad.net/~dhis2-devs

More help : https://help.launchpad.net/ListHelp

morten · 21 January 2015 18:26

18k is not really that much… is it giving you any kind of errors? given the SL demo db… we can easily export much more objects that that…

···

On Thu, Jan 22, 2015 at 12:42 AM, Bob Jolliffe bobjolliffe@gmail.com wrote:

Can you try and get a jstack dump while it is busy chomping cpu. That will probably indicate where in the code it is spinning. There was a problem earlier with a weakness in the paging code but should have been fixed by that build.

Mailing list: https://launchpad.net/~dhis2-devs

Post to : dhis2-devs@lists.launchpad.net

Unsubscribe : https://launchpad.net/~dhis2-devs

More help : https://help.launchpad.net/ListHelp

–
Morten

On 21 January 2015 at 15:00, Jason Pickering jason.p.pickering@gmail.com wrote:

In one DHIS2 instance, we have a rather large number of data elements, more than 18,000. During the detailed metadata export, a request is made to “api/dataElements.json?paging=false&links=false”. This throws the CPU of the server through the roof, with the operation taking many minutes (has been running for 15 minutes now with no response).There are no errors in the logs or the browser. Build revision is not known, but it is 2.17 from from 2015-01-17 02:03.

Any idea what might be going on here?

Regards,

Jason

–

Jason P. Pickering
email: jason.p.pickering@gmail.com
tel:+46764147049

Mailing list: https://launchpad.net/~dhis2-devs

Post to : dhis2-devs@lists.launchpad.net

Unsubscribe : https://launchpad.net/~dhis2-devs

More help : https://help.launchpad.net/ListHelp

bobj · 21 January 2015 18:42

If it is running for 15 minutes then almost certainly it is stuck in some kind of loop. Find the pid of the running jvm and also the owner of the java process. Then run:

sudo -u jstack > jstack.txt

That might give an idea of what the threads are up to …

···

On 21 January 2015 at 18:26, Morten Olav Hansen mortenoh@gmail.com wrote:

18k is not really that much… is it giving you any kind of errors? given the SL demo db… we can easily export much more objects that that…

–
Morten

On Thu, Jan 22, 2015 at 12:42 AM, Bob Jolliffe bobjolliffe@gmail.com wrote:

Can you try and get a jstack dump while it is busy chomping cpu. That will probably indicate where in the code it is spinning. There was a problem earlier with a weakness in the paging code but should have been fixed by that build.

Mailing list: https://launchpad.net/~dhis2-devs

Post to : dhis2-devs@lists.launchpad.net

Unsubscribe : https://launchpad.net/~dhis2-devs

More help : https://help.launchpad.net/ListHelp

On 21 January 2015 at 15:00, Jason Pickering jason.p.pickering@gmail.com wrote:

In one DHIS2 instance, we have a rather large number of data elements, more than 18,000. During the detailed metadata export, a request is made to “api/dataElements.json?paging=false&links=false”. This throws the CPU of the server through the roof, with the operation taking many minutes (has been running for 15 minutes now with no response).There are no errors in the logs or the browser. Build revision is not known, but it is 2.17 from from 2015-01-17 02:03.

Any idea what might be going on here?

Regards,

Jason

–

Jason P. Pickering
email: jason.p.pickering@gmail.com
tel:+46764147049

Mailing list: https://launchpad.net/~dhis2-devs

Post to : dhis2-devs@lists.launchpad.net

Unsubscribe : https://launchpad.net/~dhis2-devs

More help : https://help.launchpad.net/ListHelp

jason · 21 January 2015 19:28

Not really sure what is going on, but here is the stack dump. Maybe it will help.

And some more info below.

top - 21:26:40 up 477 days, 16:20, 2 users, load average: 2.99, 2.05, 1.03

Tasks: 171 total, 1 running, 170 sleeping, 0 stopped, 0 zombie

Cpu(s): 38.3%us, 0.0%sy, 0.0%ni, 61.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.1%st

Mem: 16433100k total, 15416512k used, 1016588k free, 86276k buffers

Swap: 262140k total, 18872k used, 243268k free, 12412748k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

28091 dhis 20 0 11.8g 2.1g 13m S 307 13.2 29:39.99 java

Regards,

Jason

stack3.txt (55.5 KB)

···

On Wed, Jan 21, 2015 at 7:42 PM, Bob Jolliffe bobjolliffe@gmail.com wrote:

If it is running for 15 minutes then almost certainly it is stuck in some kind of loop. Find the pid of the running jvm and also the owner of the java process. Then run:

sudo -u jstack > jstack.txt

That might give an idea of what the threads are up to …

–

On 21 January 2015 at 18:26, Morten Olav Hansen mortenoh@gmail.com wrote:

18k is not really that much… is it giving you any kind of errors? given the SL demo db… we can easily export much more objects that that…

–
Morten

On Thu, Jan 22, 2015 at 12:42 AM, Bob Jolliffe bobjolliffe@gmail.com wrote:

Can you try and get a jstack dump while it is busy chomping cpu. That will probably indicate where in the code it is spinning. There was a problem earlier with a weakness in the paging code but should have been fixed by that build.

Mailing list: https://launchpad.net/~dhis2-devs

Post to : dhis2-devs@lists.launchpad.net

Unsubscribe : https://launchpad.net/~dhis2-devs

More help : https://help.launchpad.net/ListHelp

On 21 January 2015 at 15:00, Jason Pickering jason.p.pickering@gmail.com wrote:

In one DHIS2 instance, we have a rather large number of data elements, more than 18,000. During the detailed metadata export, a request is made to “api/dataElements.json?paging=false&links=false”. This throws the CPU of the server through the roof, with the operation taking many minutes (has been running for 15 minutes now with no response).There are no errors in the logs or the browser. Build revision is not known, but it is 2.17 from from 2015-01-17 02:03.

Any idea what might be going on here?

Regards,

Jason

–

Jason P. Pickering
email: jason.p.pickering@gmail.com
tel:+46764147049

Mailing list: https://launchpad.net/~dhis2-devs

Post to : dhis2-devs@lists.launchpad.net

Unsubscribe : https://launchpad.net/~dhis2-devs

More help : https://help.launchpad.net/ListHelp

Jason P. Pickering
email: jason.p.pickering@gmail.com
tel:+46764147049

bobj · 21 January 2015 22:05

Jason the cpu is not really “through the roof” here. Can you try another dump (while the cpu is through the roof)?

The interesting threads to look at are the http-bio-8080 threads, which are the tomcat worker threads. If the server is not being accessed by dozens of users most of them should be sitting in a WAITING state. So then it is interesting to look at those which are marked as RUNNABLE. These are the ones which are trying as hard as they can to get the attention of the scheduler in order to chew cpu cycles.

So if your process has become cpu bound then sometimes its possible to figure out which thread it is and what it is trying to do. This attached stack dump doesn’t reveal much unfortunately, but then again the cpu (from top) is showing 62% idle. So try and get the cpu burning up again and grab 2 or 3 jstack samples. If it is cpu bound then it might show.

···

On 21 January 2015 at 19:28, Jason Pickering jason.p.pickering@gmail.com wrote:

Not really sure what is going on, but here is the stack dump. Maybe it will help.

And some more info below.

top - 21:26:40 up 477 days, 16:20, 2 users, load average: 2.99, 2.05, 1.03

Tasks: 171 total, 1 running, 170 sleeping, 0 stopped, 0 zombie

Cpu(s): 38.3%us, 0.0%sy, 0.0%ni, 61.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.1%st

Mem: 16433100k total, 15416512k used, 1016588k free, 86276k buffers

Swap: 262140k total, 18872k used, 243268k free, 12412748k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

28091 dhis 20 0 11.8g 2.1g 13m S 307 13.2 29:39.99 java

Regards,

Jason

On Wed, Jan 21, 2015 at 7:42 PM, Bob Jolliffe bobjolliffe@gmail.com wrote:

If it is running for 15 minutes then almost certainly it is stuck in some kind of loop. Find the pid of the running jvm and also the owner of the java process. Then run:

sudo -u jstack > jstack.txt

That might give an idea of what the threads are up to …

–
Jason P. Pickering
email: jason.p.pickering@gmail.com
tel:+46764147049

On 21 January 2015 at 18:26, Morten Olav Hansen mortenoh@gmail.com wrote:

18k is not really that much… is it giving you any kind of errors? given the SL demo db… we can easily export much more objects that that…

–
Morten

On Thu, Jan 22, 2015 at 12:42 AM, Bob Jolliffe bobjolliffe@gmail.com wrote:

Can you try and get a jstack dump while it is busy chomping cpu. That will probably indicate where in the code it is spinning. There was a problem earlier with a weakness in the paging code but should have been fixed by that build.

Mailing list: https://launchpad.net/~dhis2-devs

Post to : dhis2-devs@lists.launchpad.net

Unsubscribe : https://launchpad.net/~dhis2-devs

More help : https://help.launchpad.net/ListHelp

On 21 January 2015 at 15:00, Jason Pickering jason.p.pickering@gmail.com wrote:

In one DHIS2 instance, we have a rather large number of data elements, more than 18,000. During the detailed metadata export, a request is made to “api/dataElements.json?paging=false&links=false”. This throws the CPU of the server through the roof, with the operation taking many minutes (has been running for 15 minutes now with no response).There are no errors in the logs or the browser. Build revision is not known, but it is 2.17 from from 2015-01-17 02:03.

Any idea what might be going on here?

Regards,

Jason

–

Jason P. Pickering
email: jason.p.pickering@gmail.com
tel:+46764147049

Mailing list: https://launchpad.net/~dhis2-devs

Post to : dhis2-devs@lists.launchpad.net

Unsubscribe : https://launchpad.net/~dhis2-devs

More help : https://help.launchpad.net/ListHelp

jason · 22 January 2015 08:09

OK, is this better? Stack attached.

Tasks: 178 total, 1 running, 177 sleeping, 0 stopped, 0 zombie

Cpu(s): 99.1%us, 0.1%sy, 0.0%ni, 0.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.3%st

Mem: 16433100k total, 16330904k used, 102196k free, 71828k buffers

Swap: 262140k total, 25108k used, 237032k free, 11411252k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

28091 dhis 20 0 12.0g 3.6g 5916 S 796 23.0 328:48.52 java

Command used was

curl --verbose “http://XXXX/dhis/api/dataElements.json?links=false&paging=false” -u XXXX:XXXX

Regards,

Jason

stack4.txt (119 KB)

···

On Wed, Jan 21, 2015 at 11:05 PM, Bob Jolliffe bobjolliffe@gmail.com wrote:

Jason the cpu is not really “through the roof” here. Can you try another dump (while the cpu is through the roof)?

The interesting threads to look at are the http-bio-8080 threads, which are the tomcat worker threads. If the server is not being accessed by dozens of users most of them should be sitting in a WAITING state. So then it is interesting to look at those which are marked as RUNNABLE. These are the ones which are trying as hard as they can to get the attention of the scheduler in order to chew cpu cycles.

So if your process has become cpu bound then sometimes its possible to figure out which thread it is and what it is trying to do. This attached stack dump doesn’t reveal much unfortunately, but then again the cpu (from top) is showing 62% idle. So try and get the cpu burning up again and grab 2 or 3 jstack samples. If it is cpu bound then it might show.

–

On 21 January 2015 at 19:28, Jason Pickering jason.p.pickering@gmail.com wrote:

Not really sure what is going on, but here is the stack dump. Maybe it will help.

And some more info below.

top - 21:26:40 up 477 days, 16:20, 2 users, load average: 2.99, 2.05, 1.03

Tasks: 171 total, 1 running, 170 sleeping, 0 stopped, 0 zombie

Cpu(s): 38.3%us, 0.0%sy, 0.0%ni, 61.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.1%st

Mem: 16433100k total, 15416512k used, 1016588k free, 86276k buffers

Swap: 262140k total, 18872k used, 243268k free, 12412748k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

28091 dhis 20 0 11.8g 2.1g 13m S 307 13.2 29:39.99 java

Regards,

Jason

On Wed, Jan 21, 2015 at 7:42 PM, Bob Jolliffe bobjolliffe@gmail.com wrote:

If it is running for 15 minutes then almost certainly it is stuck in some kind of loop. Find the pid of the running jvm and also the owner of the java process. Then run:

sudo -u jstack > jstack.txt

That might give an idea of what the threads are up to …

–
Jason P. Pickering
email: jason.p.pickering@gmail.com
tel:+46764147049

On 21 January 2015 at 18:26, Morten Olav Hansen mortenoh@gmail.com wrote:

18k is not really that much… is it giving you any kind of errors? given the SL demo db… we can easily export much more objects that that…

–
Morten

On Thu, Jan 22, 2015 at 12:42 AM, Bob Jolliffe bobjolliffe@gmail.com wrote:

Can you try and get a jstack dump while it is busy chomping cpu. That will probably indicate where in the code it is spinning. There was a problem earlier with a weakness in the paging code but should have been fixed by that build.

Mailing list: https://launchpad.net/~dhis2-devs

Post to : dhis2-devs@lists.launchpad.net

Unsubscribe : https://launchpad.net/~dhis2-devs

More help : https://help.launchpad.net/ListHelp

On 21 January 2015 at 15:00, Jason Pickering jason.p.pickering@gmail.com wrote:

In one DHIS2 instance, we have a rather large number of data elements, more than 18,000. During the detailed metadata export, a request is made to “api/dataElements.json?paging=false&links=false”. This throws the CPU of the server through the roof, with the operation taking many minutes (has been running for 15 minutes now with no response).There are no errors in the logs or the browser. Build revision is not known, but it is 2.17 from from 2015-01-17 02:03.

Any idea what might be going on here?

Regards,

Jason

–

Jason P. Pickering
email: jason.p.pickering@gmail.com
tel:+46764147049

Mailing list: https://launchpad.net/~dhis2-devs

Post to : dhis2-devs@lists.launchpad.net

Unsubscribe : https://launchpad.net/~dhis2-devs

More help : https://help.launchpad.net/ListHelp

Jason P. Pickering
email: jason.p.pickering@gmail.com
tel:+46764147049

bobj · 22 January 2015 10:58

Sorry Jason. A quick look through isnt shedding light on me

Is this a test environment or did you run that against a running server with other users? There doesn’t seem to be any tight loop like we saw with the paging bug, so I guess what we have here is not so much as a logical error as some sort of gross inefficiency somewhere.

One thing I notice is that almost all RUNNABLE threads are somehow engaged with operations relating to user service. getCurrentUser, getUserCredentials etc. Given that this is a static snapshot in time of where exactly each thread is at, it seems odd that so many are caught up chasing users instead of doing real work. This might be a red herring or a pointer to that gross inefficiency - it might be that if the user credentials are being retrieved and tested each time the user is touching a metadata object (and it it is maybe touching many of them with this metadata export) that that is leading to excessive activity. Do you have a vast number of users in the system by any chance?

···

On 22 January 2015 at 08:09, Jason Pickering jason.p.pickering@gmail.com wrote:

OK, is this better? Stack attached.

Tasks: 178 total, 1 running, 177 sleeping, 0 stopped, 0 zombie

Cpu(s): 99.1%us, 0.1%sy, 0.0%ni, 0.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.3%st

Mem: 16433100k total, 16330904k used, 102196k free, 71828k buffers

Swap: 262140k total, 25108k used, 237032k free, 11411252k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

28091 dhis 20 0 12.0g 3.6g 5916 S 796 23.0 328:48.52 java

Command used was

curl --verbose “http://XXXX/dhis/api/dataElements.json?links=false&paging=false” -u XXXX:XXXX

Regards,

Jason

On Wed, Jan 21, 2015 at 11:05 PM, Bob Jolliffe bobjolliffe@gmail.com wrote:

Jason the cpu is not really “through the roof” here. Can you try another dump (while the cpu is through the roof)?

The interesting threads to look at are the http-bio-8080 threads, which are the tomcat worker threads. If the server is not being accessed by dozens of users most of them should be sitting in a WAITING state. So then it is interesting to look at those which are marked as RUNNABLE. These are the ones which are trying as hard as they can to get the attention of the scheduler in order to chew cpu cycles.

So if your process has become cpu bound then sometimes its possible to figure out which thread it is and what it is trying to do. This attached stack dump doesn’t reveal much unfortunately, but then again the cpu (from top) is showing 62% idle. So try and get the cpu burning up again and grab 2 or 3 jstack samples. If it is cpu bound then it might show.

–
Jason P. Pickering
email: jason.p.pickering@gmail.com
tel:+46764147049

On 21 January 2015 at 19:28, Jason Pickering jason.p.pickering@gmail.com wrote:

Not really sure what is going on, but here is the stack dump. Maybe it will help.

And some more info below.

top - 21:26:40 up 477 days, 16:20, 2 users, load average: 2.99, 2.05, 1.03

Tasks: 171 total, 1 running, 170 sleeping, 0 stopped, 0 zombie

Cpu(s): 38.3%us, 0.0%sy, 0.0%ni, 61.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.1%st

Mem: 16433100k total, 15416512k used, 1016588k free, 86276k buffers

Swap: 262140k total, 18872k used, 243268k free, 12412748k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

28091 dhis 20 0 11.8g 2.1g 13m S 307 13.2 29:39.99 java

Regards,

Jason

On Wed, Jan 21, 2015 at 7:42 PM, Bob Jolliffe bobjolliffe@gmail.com wrote:

If it is running for 15 minutes then almost certainly it is stuck in some kind of loop. Find the pid of the running jvm and also the owner of the java process. Then run:

sudo -u jstack > jstack.txt

That might give an idea of what the threads are up to …

–
Jason P. Pickering
email: jason.p.pickering@gmail.com
tel:+46764147049

On 21 January 2015 at 18:26, Morten Olav Hansen mortenoh@gmail.com wrote:

18k is not really that much… is it giving you any kind of errors? given the SL demo db… we can easily export much more objects that that…

–
Morten

On Thu, Jan 22, 2015 at 12:42 AM, Bob Jolliffe bobjolliffe@gmail.com wrote:

Can you try and get a jstack dump while it is busy chomping cpu. That will probably indicate where in the code it is spinning. There was a problem earlier with a weakness in the paging code but should have been fixed by that build.

Mailing list: https://launchpad.net/~dhis2-devs

Post to : dhis2-devs@lists.launchpad.net

Unsubscribe : https://launchpad.net/~dhis2-devs

More help : https://help.launchpad.net/ListHelp

On 21 January 2015 at 15:00, Jason Pickering jason.p.pickering@gmail.com wrote:

In one DHIS2 instance, we have a rather large number of data elements, more than 18,000. During the detailed metadata export, a request is made to “api/dataElements.json?paging=false&links=false”. This throws the CPU of the server through the roof, with the operation taking many minutes (has been running for 15 minutes now with no response).There are no errors in the logs or the browser. Build revision is not known, but it is 2.17 from from 2015-01-17 02:03.

Any idea what might be going on here?

Regards,

Jason

–

Jason P. Pickering
email: jason.p.pickering@gmail.com
tel:+46764147049

Mailing list: https://launchpad.net/~dhis2-devs

Post to : dhis2-devs@lists.launchpad.net

Unsubscribe : https://launchpad.net/~dhis2-devs

More help : https://help.launchpad.net/ListHelp

bobj · 23 January 2015 12:10

Hi Jason

Just spent a little time looking further at this. I think the stack dump has the answer after all.

I believe the inefficiency is coming from the fact that for each object access there is a call to getAccess( T object ) in DefaultAclService. The trouble is this in turn is calling currentUserService.getCurrentUser() each time. You wouldn’t think this is such a load but each time I look at a snapshot of the thread running it is busy in that method so its obviously significant. I guess it is doing a string comparison search each time (userService.getUserCredentialsByUsername( username )), so with sufficient number of users in the system its accounting for the bulk of the time - a lot of users multiplied by a lot of dataelements and its going to be more problematic. Do you have a lot of users?

The solution has got to be to call currentUserService.getCurrentUser() just once, and then use the getAccess( T object, User user ) method instead. I’m guessing that will reduce the computation load for returning a list of dataelements hugely.

The trouble is I am not 100% of where that getAccess call is coming from in this object sharing code.

Morten will know better but I guess everything starts in the AbstractCrudController. Given that currentUserService.getCurrentUser() is not such a lightweight operation we clearly need to be much more conservative in how often we use it. Currently it is peppered all ove the place. Can’t we do this just once per session? And from there on make sure to use getAccess( T object, User user ) ?

I’d bet a pint of best that this would reduce the time for that call to list dataelements by 90%.

Bob

···

On 22 January 2015 at 10:58, Bob Jolliffe bobjolliffe@gmail.com wrote:

Sorry Jason. A quick look through isnt shedding light on me

Is this a test environment or did you run that against a running server with other users? There doesn’t seem to be any tight loop like we saw with the paging bug, so I guess what we have here is not so much as a logical error as some sort of gross inefficiency somewhere.

One thing I notice is that almost all RUNNABLE threads are somehow engaged with operations relating to user service. getCurrentUser, getUserCredentials etc. Given that this is a static snapshot in time of where exactly each thread is at, it seems odd that so many are caught up chasing users instead of doing real work. This might be a red herring or a pointer to that gross inefficiency - it might be that if the user credentials are being retrieved and tested each time the user is touching a metadata object (and it it is maybe touching many of them with this metadata export) that that is leading to excessive activity. Do you have a vast number of users in the system by any chance?

On 22 January 2015 at 08:09, Jason Pickering jason.p.pickering@gmail.com wrote:

OK, is this better? Stack attached.

Tasks: 178 total, 1 running, 177 sleeping, 0 stopped, 0 zombie

Cpu(s): 99.1%us, 0.1%sy, 0.0%ni, 0.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.3%st

Mem: 16433100k total, 16330904k used, 102196k free, 71828k buffers

Swap: 262140k total, 25108k used, 237032k free, 11411252k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

28091 dhis 20 0 12.0g 3.6g 5916 S 796 23.0 328:48.52 java

Command used was

curl --verbose “http://XXXX/dhis/api/dataElements.json?links=false&paging=false” -u XXXX:XXXX

Regards,

Jason

On Wed, Jan 21, 2015 at 11:05 PM, Bob Jolliffe bobjolliffe@gmail.com wrote:

Jason the cpu is not really “through the roof” here. Can you try another dump (while the cpu is through the roof)?

The interesting threads to look at are the http-bio-8080 threads, which are the tomcat worker threads. If the server is not being accessed by dozens of users most of them should be sitting in a WAITING state. So then it is interesting to look at those which are marked as RUNNABLE. These are the ones which are trying as hard as they can to get the attention of the scheduler in order to chew cpu cycles.

So if your process has become cpu bound then sometimes its possible to figure out which thread it is and what it is trying to do. This attached stack dump doesn’t reveal much unfortunately, but then again the cpu (from top) is showing 62% idle. So try and get the cpu burning up again and grab 2 or 3 jstack samples. If it is cpu bound then it might show.

–
Jason P. Pickering
email: jason.p.pickering@gmail.com
tel:+46764147049

On 21 January 2015 at 19:28, Jason Pickering jason.p.pickering@gmail.com wrote:

Not really sure what is going on, but here is the stack dump. Maybe it will help.

And some more info below.

top - 21:26:40 up 477 days, 16:20, 2 users, load average: 2.99, 2.05, 1.03

Tasks: 171 total, 1 running, 170 sleeping, 0 stopped, 0 zombie

Cpu(s): 38.3%us, 0.0%sy, 0.0%ni, 61.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.1%st

Mem: 16433100k total, 15416512k used, 1016588k free, 86276k buffers

Swap: 262140k total, 18872k used, 243268k free, 12412748k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

28091 dhis 20 0 11.8g 2.1g 13m S 307 13.2 29:39.99 java

Regards,

Jason

On Wed, Jan 21, 2015 at 7:42 PM, Bob Jolliffe bobjolliffe@gmail.com wrote:

If it is running for 15 minutes then almost certainly it is stuck in some kind of loop. Find the pid of the running jvm and also the owner of the java process. Then run:

sudo -u jstack > jstack.txt

That might give an idea of what the threads are up to …

–
Jason P. Pickering
email: jason.p.pickering@gmail.com
tel:+46764147049

On 21 January 2015 at 18:26, Morten Olav Hansen mortenoh@gmail.com wrote:

18k is not really that much… is it giving you any kind of errors? given the SL demo db… we can easily export much more objects that that…

–
Morten

On Thu, Jan 22, 2015 at 12:42 AM, Bob Jolliffe bobjolliffe@gmail.com wrote:

Can you try and get a jstack dump while it is busy chomping cpu. That will probably indicate where in the code it is spinning. There was a problem earlier with a weakness in the paging code but should have been fixed by that build.

Mailing list: https://launchpad.net/~dhis2-devs

Post to : dhis2-devs@lists.launchpad.net

Unsubscribe : https://launchpad.net/~dhis2-devs

More help : https://help.launchpad.net/ListHelp

On 21 January 2015 at 15:00, Jason Pickering jason.p.pickering@gmail.com wrote:

In one DHIS2 instance, we have a rather large number of data elements, more than 18,000. During the detailed metadata export, a request is made to “api/dataElements.json?paging=false&links=false”. This throws the CPU of the server through the roof, with the operation taking many minutes (has been running for 15 minutes now with no response).There are no errors in the logs or the browser. Build revision is not known, but it is 2.17 from from 2015-01-17 02:03.

Any idea what might be going on here?

Regards,

Jason

–

Jason P. Pickering
email: jason.p.pickering@gmail.com
tel:+46764147049

Mailing list: https://launchpad.net/~dhis2-devs

Post to : dhis2-devs@lists.launchpad.net

Unsubscribe : https://launchpad.net/~dhis2-devs

More help : https://help.launchpad.net/ListHelp

jason · 23 January 2015 12:22

Hi Bob
In this system, there are only 277 users, but about 18000 data
elements. This problem surfaced after the upgrade to 2.17 and after
the injection of about 14,000 new data elements into the system.
Before, it was fine.

Hope it can be solved quickly, as even the loading of data elements in
the analysis tools is excruciatingly slow.

Best regards,
Jason

···

On Fri, Jan 23, 2015 at 1:10 PM, Bob Jolliffe <bobjolliffe@gmail.com> wrote:

Hi Jason

Just spent a little time looking further at this. I think the stack dump has
the answer after all.

I believe the inefficiency is coming from the fact that for each object
access there is a call to getAccess( T object ) in DefaultAclService. The
trouble is this in turn is calling currentUserService.getCurrentUser() each
time. You wouldn't think this is such a load but each time I look at a
snapshot of the thread running it is busy in that method so its obviously
significant. I guess it is doing a string comparison search each time
(userService.getUserCredentialsByUsername( username )), so with sufficient
number of users in the system its accounting for the bulk of the time - a
lot of users multiplied by a lot of dataelements and its going to be more
problematic. Do you have a lot of users?

The solution has got to be to call currentUserService.getCurrentUser() just
once, and then use the getAccess( T object, User user ) method instead. I'm
guessing that will reduce the computation load for returning a list of
dataelements hugely.

The trouble is I am not 100% of where that getAccess call is coming from in
this object sharing code.

Morten will know better but I guess everything starts in the
AbstractCrudController. Given that currentUserService.getCurrentUser() is
not such a lightweight operation we clearly need to be much more
conservative in how often we use it. Currently it is peppered all ove the
place. Can't we do this just once per session? And from there on make sure
to use getAccess( T object, User user ) ?

I'd bet a pint of best that this would reduce the time for that call to list
dataelements by 90%.

Bob

On 22 January 2015 at 10:58, Bob Jolliffe <bobjolliffe@gmail.com> wrote:

Sorry Jason. A quick look through isnt shedding light on me

Is this a test environment or did you run that against a running server
with other users? There doesn't seem to be any tight loop like we saw with
the paging bug, so I guess what we have here is not so much as a logical
error as some sort of gross inefficiency somewhere.
One thing I notice is that almost all RUNNABLE threads are somehow engaged
with operations relating to user service. getCurrentUser,
getUserCredentials etc. Given that this is a static snapshot in time of
where exactly each thread is at, it seems odd that so many are caught up
chasing users instead of doing real work. This might be a red herring or a
pointer to that gross inefficiency - it might be that if the user
credentials are being retrieved and tested each time the user is touching a
metadata object (and it it is maybe touching many of them with this metadata
export) that that is leading to excessive activity. Do you have a vast
number of users in the system by any chance?

On 22 January 2015 at 08:09, Jason Pickering <jason.p.pickering@gmail.com> >> wrote:

OK, is this better? Stack attached.

Tasks: 178 total, 1 running, 177 sleeping, 0 stopped, 0 zombie
Cpu(s): 99.1%us, 0.1%sy, 0.0%ni, 0.4%id, 0.0%wa, 0.0%hi, 0.0%si,
0.3%st
Mem: 16433100k total, 16330904k used, 102196k free, 71828k buffers
Swap: 262140k total, 25108k used, 237032k free, 11411252k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28091 dhis 20 0 12.0g 3.6g 5916 S 796 23.0 328:48.52 java

Command used was

curl --verbose
"http://XXXX/dhis/api/dataElements.json?links=false&paging=false" -u
XXXX:XXXX

Regards,
Jason

On Wed, Jan 21, 2015 at 11:05 PM, Bob Jolliffe <bobjolliffe@gmail.com> >>> wrote:

Jason the cpu is not really "through the roof" here. Can you try
another dump (while the cpu is through the roof)?

The interesting threads to look at are the http-bio-8080 threads, which
are the tomcat worker threads. If the server is not being accessed by
dozens of users most of them should be sitting in a WAITING state. So then
it is interesting to look at those which are marked as RUNNABLE. These are
the ones which are trying as hard as they can to get the attention of the
scheduler in order to chew cpu cycles.

So if your process has become cpu bound then sometimes its possible to
figure out which thread it is and what it is trying to do. This attached
stack dump doesn't reveal much unfortunately, but then again the cpu (from
top) is showing 62% idle. So try and get the cpu burning up again and grab
2 or 3 jstack samples. If it is cpu bound then it might show.

On 21 January 2015 at 19:28, Jason Pickering >>>> <jason.p.pickering@gmail.com> wrote:

Not really sure what is going on, but here is the stack dump. Maybe it
will help.

And some more info below.

top - 21:26:40 up 477 days, 16:20, 2 users, load average: 2.99, 2.05,
1.03
Tasks: 171 total, 1 running, 170 sleeping, 0 stopped, 0 zombie
Cpu(s): 38.3%us, 0.0%sy, 0.0%ni, 61.6%id, 0.0%wa, 0.0%hi, 0.0%si,
0.1%st
Mem: 16433100k total, 15416512k used, 1016588k free, 86276k
buffers
Swap: 262140k total, 18872k used, 243268k free, 12412748k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28091 dhis 20 0 11.8g 2.1g 13m S 307 13.2 29:39.99 java

Regards,
Jason

On Wed, Jan 21, 2015 at 7:42 PM, Bob Jolliffe <bobjolliffe@gmail.com> >>>>> wrote:

If it is running for 15 minutes then almost certainly it is stuck in
some kind of loop. Find the pid of the running jvm and also the owner of
the java process. Then run:

sudo -u <process owner> jstack <pid> > jstack.txt

That might give an idea of what the threads are up to ...

On 21 January 2015 at 18:26, Morten Olav Hansen <mortenoh@gmail.com> >>>>>> wrote:

18k is not really that much.. is it giving you any kind of errors?
given the SL demo db... we can easily export much more objects that that..

--
Morten

On Thu, Jan 22, 2015 at 12:42 AM, Bob Jolliffe >>>>>>> <bobjolliffe@gmail.com> wrote:

Can you try and get a jstack dump while it is busy chomping cpu.
That will probably indicate where in the code it is spinning. There was a
problem earlier with a weakness in the paging code but should have been
fixed by that build.

On 21 January 2015 at 15:00, Jason Pickering >>>>>>>> <jason.p.pickering@gmail.com> wrote:

In one DHIS2 instance, we have a rather large number of data
elements, more than 18,000. During the detailed metadata export, a request
is made to "api/dataElements.json?paging=false&links=false". This throws the
CPU of the server through the roof, with the operation taking many minutes
(has been running for 15 minutes now with no response).There are no errors
in the logs or the browser. Build revision is not known, but it is 2.17
from from 2015-01-17 02:03.

Any idea what might be going on here?

Regards,
Jason

--
Jason P. Pickering
email: jason.p.pickering@gmail.com
tel:+46764147049

_______________________________________________
Mailing list: DHIS 2 developers in Launchpad
Post to : dhis2-devs@lists.launchpad.net
Unsubscribe : DHIS 2 developers in Launchpad
More help : ListHelp - Launchpad Help

_______________________________________________
Mailing list: DHIS 2 developers in Launchpad
Post to : dhis2-devs@lists.launchpad.net
Unsubscribe : DHIS 2 developers in Launchpad
More help : ListHelp - Launchpad Help

--
Jason P. Pickering
email: jason.p.pickering@gmail.com
tel:+46764147049

--
Jason P. Pickering
email: jason.p.pickering@gmail.com
tel:+46764147049

--
Jason P. Pickering
email: jason.p.pickering@gmail.com
tel:+46764147049

Lars · 23 January 2015 13:58

Hi Bob,

I think you are right. I've seen this before, and reducing the number of
calls to one usually helps a lot. The problem is that getCurrentUser() is
transactional and makes a (cached) db query. The overhead of a huge number
of transaction causes things to slow down. If you or someone could write a
patch/fix it would be great.

Lars

morten · 23 January 2015 14:08

Yes, we have already fixed this a few places. I’m making a note of it, and will fix it later.

···

On Fri, Jan 23, 2015 at 8:58 PM, Lars Helge Øverland larshelge@gmail.com wrote:

Mailing list: https://launchpad.net/~dhis2-devs

Post to : dhis2-devs@lists.launchpad.net

Unsubscribe : https://launchpad.net/~dhis2-devs

More help : https://help.launchpad.net/ListHelp

–
Morten

Hi Bob,

I think you are right. I’ve seen this before, and reducing the number of calls to one usually helps a lot. The problem is that getCurrentUser() is transactional and makes a (cached) db query. The overhead of a huge number of transaction causes things to slow down. If you or someone could write a patch/fix it would be great.

Lars

bobj · 23 January 2015 14:59

I guess you will also have to backport to 2.17.

···

On 23 January 2015 at 14:08, Morten Olav Hansen mortenoh@gmail.com wrote:

Yes, we have already fixed this a few places. I’m making a note of it, and will fix it later.

Mailing list: https://launchpad.net/~dhis2-devs

Post to : dhis2-devs@lists.launchpad.net

Unsubscribe : https://launchpad.net/~dhis2-devs

More help : https://help.launchpad.net/ListHelp

–
Morten

On Fri, Jan 23, 2015 at 8:58 PM, Lars Helge Øverland larshelge@gmail.com wrote:

Mailing list: https://launchpad.net/~dhis2-devs

Post to : dhis2-devs@lists.launchpad.net

Unsubscribe : https://launchpad.net/~dhis2-devs

More help : https://help.launchpad.net/ListHelp

Hi Bob,

I think you are right. I’ve seen this before, and reducing the number of calls to one usually helps a lot. The problem is that getCurrentUser() is transactional and makes a (cached) db query. The overhead of a huge number of transaction causes things to slow down. If you or someone could write a patch/fix it would be great.

Lars

morten · 23 January 2015 15:26

I have backportet to 2.17

···

On Fri, Jan 23, 2015 at 9:59 PM, Bob Jolliffe bobjolliffe@gmail.com wrote:

I guess you will also have to backport to 2.17.

–
Morten

On 23 January 2015 at 14:08, Morten Olav Hansen mortenoh@gmail.com wrote:

Yes, we have already fixed this a few places. I’m making a note of it, and will fix it later.

Mailing list: https://launchpad.net/~dhis2-devs

Post to : dhis2-devs@lists.launchpad.net

Unsubscribe : https://launchpad.net/~dhis2-devs

More help : https://help.launchpad.net/ListHelp

–
Morten

On Fri, Jan 23, 2015 at 8:58 PM, Lars Helge Øverland larshelge@gmail.com wrote:

Mailing list: https://launchpad.net/~dhis2-devs

Post to : dhis2-devs@lists.launchpad.net

Unsubscribe : https://launchpad.net/~dhis2-devs

More help : https://help.launchpad.net/ListHelp

Hi Bob,

I think you are right. I’ve seen this before, and reducing the number of calls to one usually helps a lot. The problem is that getCurrentUser() is transactional and makes a (cached) db query. The overhead of a huge number of transaction causes things to slow down. If you or someone could write a patch/fix it would be great.

Lars