DHIS version 2.8 released

Congratulations on the new release; it is nice to see the new features, especially the CSV import functionality.
However ...

I've just had a go at importing a small set of historical data (Monthly outpatient attendance by district) going back six years (2005-2011). This is just two data elements (Outpatient attendance, Outpatient reattendance) with four combo categories (Male, Female, Below 5 yrs old, 5 yrs and older) for each of our 112 districts. The CSV data file is about 4.6 megabytes in size.
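As an editorial aside, the expected size of the dataset described above can be sanity-checked with a little arithmetic (this is an illustration, not from the original message):

```python
# Rough size check of the dataset described above.
data_elements = 2     # Outpatient attendance, Outpatient reattendance
category_combos = 4   # Male, Female, below 5, 5 and older
districts = 112

rows_per_month = data_elements * category_combos * districts
print(rows_per_month)            # 896 data values per monthly report
print(rows_per_month * 12 * 7)   # 75264 values if all of 2005-2011 (seven calendar years) were complete
```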

I gave up on watching the import process after an hour. At that point it had been using 90% of our test server's memory (4GB) and burning 100% of one CPU core almost the entire time. This is just one of the smallest datasets we would be looking to import. It appears that the CSV import in its current state is not able to cope with reasonably large data. Or am I getting this wrong? Do you have any ideas/workarounds?

- Edward -

Hi, approximately how many records are in your import file?

Lars

The import file contains 46800 records (individual CSV lines).

- Edward -


Hi to all,

In Rwanda we have four months of data in our DHIS2 database, and when downloading data for the whole four months, Excel gives an overload error or can even crash. Any advice on using mydatamart? How are you running it over there? This issue happens when we download aggregated data from facility level. I would suggest adding a selection box in mydatamart where a user can select only the data elements or indicators to be analysed, instead of getting everything.

In Rwanda, users are used to raw data from MYDATAMART; now with four months it is already impossible to download to Excel at the central level. What will happen in upcoming months?

···

Muhire Andrew

HMIS/Ministry of Health

First say to yourself what you would be; and then do what you have to do..

Hi Andrew,

I do not know all of the circumstances of your data, but in general Excel is limited to a certain number of rows and columns. I think Excel 2003 is limited to around 65,000 rows and Excel 2007 to over 1 million rows. PivotTables are limited by the amount of available memory on your machine. If your dataset has more than this, you can do a few things.

1) Use a higher level of aggregation. Instead of pulling data at the facility level for the entire country, use the district level (or something higher); this will decrease the number of rows which must be pulled into Excel. Similarly, looking at quarterly rather than monthly data will decrease the amount of data which needs to be retrieved by Excel.

2) Filter the rows which are pulled from MyDatamart with some type of filter in the pivot source query. This is normally done using a filter on either the _dataelementgroupsetstructure or _indicatorgroupsetstructure table of the MyDatamart database. This will allow you, for instance, to pull only PMTCT indicators into an Excel sheet, without pulling unrelated indicators and then having to filter them in the Excel table. You will need to modify and install some new SQL views in MyDatamart to use this approach.

3) Consider using other tools, such as R/SPSS/Stata, for analysis; these do not have such limitations.

We regularly analyze rather largish datasets (for Excel, anyway) with several million rows of data using these approaches. The best way, however, is just to bring in the data which you need for a specific PivotTable, without returning everything that may not be relevant for a particular analysis.
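The filtering idea in option 2 can be sketched with a toy in-memory database. The table, column, and view names below are invented for illustration; the real MyDatamart schema and group-set structure tables will differ:

```python
# Toy sketch of option 2: filter pivot-source rows in SQL before Excel sees
# them, via a view that restricts to one indicator group.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE pivotsource (indicator TEXT, indicatorgroup TEXT, value REAL)")
con.executemany("INSERT INTO pivotsource VALUES (?, ?, ?)", [
    ("PMTCT coverage", "PMTCT", 0.82),
    ("ANC 1st visit",  "ANC",   0.91),
    ("PMTCT testing",  "PMTCT", 0.77),
])

# A view like this could serve as the pivot source query instead of the full table.
con.execute(
    "CREATE VIEW pmtct_only AS SELECT * FROM pivotsource WHERE indicatorgroup = 'PMTCT'"
)
n = con.execute("SELECT COUNT(*) FROM pmtct_only").fetchone()[0]
print(n)  # 2 -- only the PMTCT rows reach the pivot table
```

The point is simply that the filter runs in the database, so unrelated rows never have to be loaded into Excel at all.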

Best regards,
Jason

_______________________________________________
Mailing list: https://launchpad.net/~dhis2-users
Post to: dhis2-users@lists.launchpad.net
Unsubscribe: https://launchpad.net/~dhis2-users
More help: https://help.launchpad.net/ListHelp

Hello,

I did a test here with a CSV file containing 50 000 records and it took 16 minutes.

I suspect this has to do with your Java configuration - have you, for example, set the JAVA_OPTS environment variable to allocate more memory to Java? A reasonable value would be:

-Xms500m -Xmx1000m -XX:PermSize=250m -XX:MaxPermSize=500m

Lars


Hi Andrew,

Do you know how many rows are returned by your pivot source query?

The mydatamart tool is designed for what Jason describes in option 1) in his email.

Also note that we require Excel 2007 or newer for pivot tables to work properly with mydatamart (for metadata updates and re-linking of mydatamart files and Excel).

The user selects their orgunit and then the analysis level (the lowest level of the orgunit hierarchy needed for analysis). Typically you will not need to look at facility level for general data analysis at the national level, and instead you can probably pick district level.

At the districts, users can still select facility as the analysis level, download facility level data to their local mydatamart files, and load that into Excel.

To accommodate this difference in analysis level in mydatamart/Excel you will typically need at least two different Excel files (pivot table template files): one for national level users, with pivot tables that pull district level data (e.g. the pivotsource_routinedata_ou2_m query), and another Excel file for district users, with facility level data (using pivotsource_routinedata_ou4_m, where facility level = level 4).

Note that Excel will automatically aggregate the data for you in the pivot tables, so you can easily look at district totals also in the pivot table with facility level data. The difference is how deep you can drill down, i.e. what the lowest level of analysis is; the district-level pivot table will never be able to show facility data.

When you have data for more than one year you can also consider breaking up the pivot table by calendar or financial year, by inserting custom SQL queries in your mydatamart file that filter the data rows on periods.

In my experience the pivot tables start to get really slow when you get above 700,000-800,000 rows, but that of course also depends on your hardware.

I have promised to write up a section on setting up pivot tables with mydatamart for the user documentation, so look out for commits to the documentation branch in the next few weeks if you're interested in more details on this topic.

And, to make your life a lot easier, I would consider getting rid of most of those zero values Lars just referred to; you probably only need those for a few selected data elements.

Ola

···

Ola Hodne Titlestad (Mr)
HISP
Department of Informatics

University of Oslo

Mobile: +47 48069736
Home address: Vetlandsvn. 95B, 0685 Oslo, Norway.


Thanks for the reply. I'll check our Java setup, retry the import and report back.

However there is still a larger issue. Taking 16 minutes as a representative timing and assuming linear scaling (unlikely but for the sake of discussion), importing the full dataset ("District Outpatient monthly report"), of which the previous data is just a small part, would take at least 10 hours of solid computation. I'm not sure our servers would survive that kind of abuse.
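The throughput implied by these figures can be checked quickly (an editorial illustration; the full-dataset size below is inferred from the "at least 10 hours" estimate, not stated in the thread):

```python
# Sanity check of the linear-scaling estimate above.
records, minutes = 50_000, 16      # Lars's benchmark: 50,000 records in 16 minutes
rate = records / minutes
print(rate)                        # 3125.0 records per minute

# Records importable in 10 hours at that rate, i.e. the rough full-dataset
# size implied by the "at least 10 hours" figure:
implied = rate * 10 * 60
print(implied)                     # 1875000.0
```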

And then there is "Inpatients Monthly", "Weekly Disease Outbreak" and about four smaller reports (each roughly an order of magnitude smaller). This would be about 35 hours of computation, with an even longer wall-clock time. Are any of the other import methods faster? Do we have to resort to generating SQL statements instead?

- Edward -


Hi again.

Thanks for making us aware of this. I have now corrected a flaw which caused slow handling of periods for each import record.

After applying the fix, my test with 50,000 records is down to 20 seconds. Note that I cleared out existing records from my database first - new records import faster since we can use multi-row SQL insert statements rather than individual updates.

I have also added a new import option in the user interface for an "existing records check". This indicates whether the system should check for existing data value records during import. If you turn it off, you are responsible for ensuring that the records in your import file do not already exist in the database. This can be useful for database bootstraps and other situations where you are certain there are no existing records. With that setting turned off (skip), my test is down to 12 seconds.
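The performance difference described above can be sketched with a toy schema (the table and column names are invented for illustration, not the real DHIS2 tables): one batched multi-row INSERT versus a SELECT-then-INSERT round trip per record.

```python
# Toy comparison: bulk insert vs. check-then-insert per record.
import sqlite3
import time

def setup():
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE datavalue (de INT, pe INT, ou INT, value INT)")
    return con

# 5,000 rows with unique (de, pe, ou) keys.
rows = [(i % 10, (i // 10) % 100, i // 1000, i) for i in range(5_000)]

# Variant 1: no existence check, one batched multi-row insert.
con = setup()
t0 = time.perf_counter()
con.executemany("INSERT INTO datavalue VALUES (?, ?, ?, ?)", rows)
bulk = time.perf_counter() - t0

# Variant 2: check for an existing record before each insert.
con2 = setup()
t0 = time.perf_counter()
for de, pe, ou, v in rows:
    exists = con2.execute(
        "SELECT 1 FROM datavalue WHERE de=? AND pe=? AND ou=?", (de, pe, ou)
    ).fetchone()
    if exists is None:
        con2.execute("INSERT INTO datavalue VALUES (?, ?, ?, ?)", (de, pe, ou, v))
checked = time.perf_counter() - t0

print(f"bulk insert: {bulk:.3f}s, check-then-insert: {checked:.3f}s")
```

Both variants end up with the same 5,000 rows; the per-record existence check is what the new "skip" option avoids.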

I have backported these fixes to 2.8. You can download the latest build (r6830) here:

http://dhis2.org/download/releases/2.8/dhis.war

Let me know how it goes.

regards, Lars


Thanks all for the quick, supportive advice. I think doing 1) (aggregation at district level) and also removing zero values will help.

Dear Ola, please publish this section; it will be helpful to us.

Thanks.




Sorry, not as quick as Ola and Lars... yes, do try those approaches first.

We can also look at creating custom views in mydatamart if you want to filter the data elements/indicators further.
