Just spent a little time looking further at this. I think the stack dump has the answer after all.
I believe the inefficiency comes from the fact that for each object access there is a call to getAccess( T object ) in DefaultAclService. The trouble is this in turn calls currentUserService.getCurrentUser() each time. You wouldn’t think this would be such a load, but each time I look at a snapshot the running thread is busy in that method, so it’s obviously significant. I guess it is doing a string comparison search each time (userService.getUserCredentialsByUsername( username )), so with a sufficient number of users in the system it accounts for the bulk of the time - a lot of users multiplied by a lot of dataelements and it’s going to be more problematic. Do you have a lot of users?
The solution has got to be to call currentUserService.getCurrentUser() just once, and then use the getAccess( T object, User user ) method instead. I’m guessing that will reduce the computation load for returning a list of dataelements hugely.
The trouble is I am not 100% sure where that getAccess call is coming from in this object sharing code.
Morten will know better, but I guess everything starts in the AbstractCrudController. Given that currentUserService.getCurrentUser() is not such a lightweight operation, we clearly need to be much more conservative in how often we use it. Currently it is peppered all over the place. Can’t we do this just once per session, and from there on make sure to use getAccess( T object, User user )?
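To illustrate what I mean, here is a minimal sketch of the pattern. The classes below are hypothetical stand-ins (the real CurrentUserService and DefaultAclService APIs obviously differ); the point is just hoisting the expensive user lookup out of the per-object loop:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the real service: the only thing modelled
// is that resolving the current user is expensive, so we count calls.
class User {
    final String username;
    User(String username) { this.username = username; }
}

class CurrentUserService {
    int lookups = 0; // counts how often the expensive lookup runs

    User getCurrentUser() {
        lookups++; // in DHIS2 this ends up in userService.getUserCredentialsByUsername(...)
        return new User("admin");
    }
}

class AclService {
    private final CurrentUserService currentUserService;
    AclService(CurrentUserService cus) { this.currentUserService = cus; }

    // Expensive form: resolves the current user on every call.
    String getAccess(String object) {
        return getAccess(object, currentUserService.getCurrentUser());
    }

    // Cheap form: the caller supplies the already-resolved user.
    String getAccess(String object, User user) {
        return user.username + " -> rw " + object;
    }
}

public class AccessDemo {
    public static void main(String[] args) {
        CurrentUserService cus = new CurrentUserService();
        AclService acl = new AclService(cus);

        List<String> dataElements = new ArrayList<>();
        for (int i = 0; i < 18000; i++) dataElements.add("de" + i);

        // Resolve the user once per request ...
        User user = cus.getCurrentUser();
        // ... then reuse it for every data element.
        for (String de : dataElements) acl.getAccess(de, user);

        // One credential lookup instead of 18000.
        System.out.println("lookups=" + cus.lookups);
    }
}
```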
I’d bet a pint of best that this would reduce the time for that call to list dataelements by 90%.
On 22 January 2015 at 10:58, Bob Jolliffe bobjolliffe@gmail.com wrote:
Sorry Jason. A quick look through isn’t shedding much light for me.
Is this a test environment or did you run that against a running server with other users? There doesn’t seem to be any tight loop like we saw with the paging bug, so I guess what we have here is not so much a logical error as some sort of gross inefficiency somewhere.
One thing I notice is that almost all RUNNABLE threads are somehow engaged in operations relating to the user service: getCurrentUser, getUserCredentials etc. Given that this is a static snapshot in time of where exactly each thread is at, it seems odd that so many are caught up chasing users instead of doing real work. This might be a red herring or a pointer to that gross inefficiency - it might be that the user credentials are being retrieved and tested each time a metadata object is touched (and this metadata export may be touching many of them), and that is leading to excessive activity. Do you have a vast number of users in the system by any chance?
On 22 January 2015 at 08:09, Jason Pickering jason.p.pickering@gmail.com wrote:
OK, is this better? Stack attached.
Tasks: 178 total, 1 running, 177 sleeping, 0 stopped, 0 zombie
Cpu(s): 99.1%us, 0.1%sy, 0.0%ni, 0.4%id, 0.0%wa, 0.0%hi, 0.0%si, 0.3%st
Mem: 16433100k total, 16330904k used, 102196k free, 71828k buffers
Swap: 262140k total, 25108k used, 237032k free, 11411252k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28091 dhis 20 0 12.0g 3.6g 5916 S 796 23.0 328:48.52 java
Command used was
curl --verbose "http://XXXX/dhis/api/dataElements.json?links=false&paging=false" -u XXXX:XXXX
Regards,
Jason
On Wed, Jan 21, 2015 at 11:05 PM, Bob Jolliffe bobjolliffe@gmail.com wrote:
Jason, the cpu is not really “through the roof” here. Can you try another dump (while the cpu is through the roof)?
The interesting threads to look at are the http-bio-8080 threads, which are the tomcat worker threads. If the server is not being accessed by dozens of users most of them should be sitting in a WAITING state. So then it is interesting to look at those which are marked as RUNNABLE. These are the ones which are trying as hard as they can to get the attention of the scheduler in order to chew cpu cycles.
So if your process has become cpu bound then sometimes it’s possible to figure out which thread it is and what it is trying to do. The attached stack dump doesn’t reveal much unfortunately, but then again the cpu (from top) is showing 62% idle. So try and get the cpu burning up again and grab 2 or 3 jstack samples. If it is cpu bound then it might show.
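For a quick read of a dump, something like the following works. The sample dump contents here are invented for illustration (thread names, class names and line numbers are made up); point the greps at your real jstack.txt instead:

```shell
# A tiny invented excerpt of a jstack dump, just to demonstrate the filtering.
cat > jstack.txt <<'EOF'
"http-bio-8080-exec-1" daemon prio=10 tid=0x0001 nid=0x0002 runnable
   java.lang.Thread.State: RUNNABLE
	at org.hisp.dhis.user.DefaultCurrentUserService.getCurrentUser(DefaultCurrentUserService.java:58)
"http-bio-8080-exec-2" daemon prio=10 tid=0x0003 nid=0x0004 waiting on condition
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
EOF

# Count threads by state -- lots of RUNNABLE workers means they are
# competing for cpu rather than waiting for requests.
grep -o 'Thread.State: [A-Z_]*' jstack.txt | sort | uniq -c

# Show the RUNNABLE threads together with their top stack frame.
grep -A2 'Thread.State: RUNNABLE' jstack.txt
```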
–
Jason P. Pickering
email: jason.p.pickering@gmail.com
tel:+46764147049
On 21 January 2015 at 19:28, Jason Pickering jason.p.pickering@gmail.com wrote:
Not really sure what is going on, but here is the stack dump. Maybe it will help.
And some more info below.
top - 21:26:40 up 477 days, 16:20, 2 users, load average: 2.99, 2.05, 1.03
Tasks: 171 total, 1 running, 170 sleeping, 0 stopped, 0 zombie
Cpu(s): 38.3%us, 0.0%sy, 0.0%ni, 61.6%id, 0.0%wa, 0.0%hi, 0.0%si, 0.1%st
Mem: 16433100k total, 15416512k used, 1016588k free, 86276k buffers
Swap: 262140k total, 18872k used, 243268k free, 12412748k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
28091 dhis 20 0 11.8g 2.1g 13m S 307 13.2 29:39.99 java
Regards,
Jason
On Wed, Jan 21, 2015 at 7:42 PM, Bob Jolliffe bobjolliffe@gmail.com wrote:
If it is running for 15 minutes then almost certainly it is stuck in some kind of loop. Find the pid of the running jvm and also the owner of the java process. Then run:
sudo -u <owner> jstack <pid> > jstack.txt
That might give an idea of what the threads are up to …
–
Jason P. Pickering
email: jason.p.pickering@gmail.com
tel:+46764147049
On 21 January 2015 at 18:26, Morten Olav Hansen mortenoh@gmail.com wrote:
18k is not really that much… Is it giving you any kind of errors? Given the SL demo db, we can easily export many more objects than that…
–
Morten
On Thu, Jan 22, 2015 at 12:42 AM, Bob Jolliffe bobjolliffe@gmail.com wrote:
Can you try and get a jstack dump while it is busy chomping cpu. That will probably indicate where in the code it is spinning. There was a problem earlier with a weakness in the paging code, but that should have been fixed by that build.
Mailing list: https://launchpad.net/~dhis2-devs
Post to : dhis2-devs@lists.launchpad.net
Unsubscribe : https://launchpad.net/~dhis2-devs
More help : https://help.launchpad.net/ListHelp
On 21 January 2015 at 15:00, Jason Pickering jason.p.pickering@gmail.com wrote:
In one DHIS2 instance, we have a rather large number of data elements, more than 18,000. During the detailed metadata export, a request is made to “api/dataElements.json?paging=false&links=false”. This sends the CPU of the server through the roof, with the operation taking many minutes (it has been running for 15 minutes now with no response). There are no errors in the logs or the browser. Build revision is not known, but it is 2.17 from 2015-01-17 02:03.
Any idea what might be going on here?
Regards,
Jason
–
Jason P. Pickering
email: jason.p.pickering@gmail.com
tel:+46764147049