Analytics and disk space

Hi Devs,
I am seeking some advice on how to decrease the amount of disk space used by DHIS2.

Here is a list of the biggest tables in the system.

public.datavalue | 2316 MB

public.datavalue_pkey | 1230 MB

public.in_datavalue_lastupdated | 680 MB

There are a lot more tables, and all in all, the database occupies about 5.4 GB without analytics.
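
A listing like the one above can be produced with a query along these lines (just a sketch against the standard PostgreSQL catalog; the database name is a placeholder):

    # Sketch: list the largest tables and indexes in the DHIS2 database.
    # Adjust the database name and connection options to your own setup.
    psql -d dhis2 -c "
      SELECT n.nspname || '.' || c.relname AS relation,
             pg_size_pretty(pg_relation_size(c.oid)) AS size
      FROM pg_class c
      JOIN pg_namespace n ON n.oid = c.relnamespace
      WHERE c.relkind IN ('r', 'i')
      ORDER BY pg_relation_size(c.oid) DESC
      LIMIT 20;"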

This represents about 30 million data rows, so not that big of a database really. The server is a Digital Ocean virtual server with 60 GB of disk space, running nothing but Linux, PostgreSQL and Tomcat. Without analytics, and with everything installed for the system, about 23% of that 60 GB is free.

When analytics runs, it keeps the main analytics tables (analytics_XXXX) in place and creates temp tables like analytics_temp_2004. When everything is finished and the indexes are built, the tables are swapped. This ensures that analytics remains available while the tables are being rebuilt, but the downside is that a lot more disk space is required, as we effectively end up with two copies of the tables along with all their indexes, which are themselves quite large (up to 60% of the size of the table). Here’s what happens when analytics is run:

public.analytics_temp_2015 | 1017 MB

public.analytics_temp_2014 | 985 MB

public.analytics_temp_2011 | 952 MB

public.analytics_temp_2010 | 918 MB

public.analytics_temp_2013 | 885 MB

public.analytics_temp_2012 | 835 MB

public.analytics_temp_2009 | 804 MB

Each analytics table now takes about 1 GB of space. In the end, the total disk usage exceeds the 60 GB available and analytics fails to complete.

So, while I understand the need for this functionality, I am wondering if we need a system option to allow the analytics tables to be dropped prior to regenerating them, or to have more control over the order in which they are generated (for instance to generate specific periods). I realize this can be done from the API or the scheduler, but only for the past three relative years.
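
As an illustration, limiting the run from the command line would look something like this (just a sketch; it assumes the lastYears parameter on the resourceTables analytics endpoint, and the URL and credentials are placeholders):

    # Sketch: rebuild analytics for only the last 2 years rather than all periods.
    # URL and credentials are placeholders for your own instance.
    curl -u admin:district -X POST \
      "https://server/dhis/api/resourceTables/analytics?lastYears=2"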

The reason I am asking for this is that it’s a bit of a pain (at the moment) when using Digital Ocean as a service provider, since their stock disk allocation is 60 GB. With other VPS providers (Amazon, Linode), it’s a bit easier to add disk, but DigitalOcean only supports block storage in two regions at the moment. Regardless, it seems somewhat wasteful to need such a large amount of disk space for such a relatively small database.

Is this something we just need to plan for and maybe provide better documentation on, or should we think about trying to offer better functionality for people running smaller servers?

Regards,

Jason

Hi,

One (presumably) simple solution is to stagger analytics on a year-by-year basis, i.e. run and complete 2009 before processing 2010. That would reduce temp disk space requirements significantly while (presumably) not changing the general design.

Regards

Calle



Calle Hedberg

46D Alma Road, 7700 Rosebank, SOUTH AFRICA

Tel/fax (home): +27-21-685-6472

Cell: +27-82-853-5352

Iridium SatPhone: +8816-315-19119

Email: calle.hedberg@gmail.com

Skype: calle_hedberg


+1 to Calle’s idea of staggering analytics year by year

I also like Jason’s suggestion of being able to configure the time period for which analytics is regenerated. If the general use case has data being entered only for the current year, is it perhaps unnecessary to regenerate data for previous years?

Cheers,


-doh



Hi,

It’s not only analytics that would benefit from segmented/staggered processing: I exported around 100 million data values yesterday from a number of instances, and found that the export process was (seemingly) exponentially slower as the number of records exported increased. Most of the export files contained well under 10 million records, and those were pretty fast. In comparison, the largest export file, with around 30 million data values, probably took 20 times as long as an 8 million value export. Based on just keeping an eye on the “progress bar”, it seemed like some kind of cache staggering was taking place - the amount exported would increase quickly by 2-3 MB, then “hang” for a good while, then increase quickly by 2-3 MB again.

Note also that there are several fundamental strategies one could use to reduce heavy work processes like analytics, exports (and thus imports), etc.:

  • the ability to specify a sub-period, as Jason suggests

  • the ability to specify the “dirty” part of the instance by using e.g. LastUpdated >= xxxxx (see the sketch below)

  • the ability to specify a sub-OrgUnit area

These partial strategies are of course mostly relevant for very large instances, but such large instances are also the ones where changes are typically made to only a small segment of the total - for example, if you have data for 30 years, 27 of those might be locked down and no longer available for updates.
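
As a rough sketch of the second and third points, an export restricted by lastUpdated and org unit could look like this (parameter names follow the dataValueSets export API; the uids, dates, URL and credentials are placeholders):

    # Sketch: export only data values changed since 1 Sep 2016 for one org unit
    # subtree. Uids, dates, URL and credentials are placeholders.
    curl -u admin:district \
      "https://server/dhis/api/dataValueSets.json?dataSet=pBOMPrpg1QX&orgUnit=ImspTQPwCqd&children=true&startDate=2016-01-01&endDate=2016-12-31&lastUpdated=2016-09-01"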

Regards

Calle



I think Jason also pointed out that this could be achieved from the API, but the question is whether it needs to be more user-friendly, i.e. customisable using the web application as opposed to requiring a custom script triggered by a cron job.

Cheers,


On Sun, Sep 11, 2016 at 8:36 PM, Dan Cocos dcocos@gmail.com wrote:

Hi All,

You could run this:

/api/24/maintenance/analyticsTablesClear

and possibly this:

/api/24/maintenance/periodPruning

I don’t see it in the documentation, but we call /api/resourceTables/analytics?lastYears=2 quite often for clients with a lot of historical data.
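
As a sketch, the whole sequence would be something like this (assuming POST and basic auth; URL and credentials are placeholders):

    # Sketch: clear the analytics tables, run period pruning, then rebuild
    # analytics for the last 2 years only. URL and credentials are placeholders.
    curl -u admin:district -X POST "https://server/dhis/api/24/maintenance/analyticsTablesClear"
    curl -u admin:district -X POST "https://server/dhis/api/24/maintenance/periodPruning"
    curl -u admin:district -X POST "https://server/dhis/api/resourceTables/analytics?lastYears=2"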

Good luck,

Dan

Dan Cocos
Principal, BAO Systems

dcocos@baosystems.com | http://www.baosystems.com | 2900 K Street, Suite 404, Washington D.C. 20007



Hi there,

thanks for the feedback. Most of what’s requested is available in the API. It’s on our list to rewrite the import-export app and to write a better scheduling manager for background tasks such as analytics table generation.

In the meantime:

  • Analytics table generation for the last x years

  • Data value export (lastUpdated, lastUpdatedDuration, orgUnit params)
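
For example, an export of only recently changed values could look like this (a sketch; it assumes a version that supports lastUpdatedDuration, and the uids, URL and credentials are placeholders):

    # Sketch: export data values changed during the last day for one data set
    # and org unit. Uids, URL and credentials are placeholders.
    curl -u admin:district \
      "https://server/dhis/api/dataValueSets.json?dataSet=pBOMPrpg1QX&orgUnit=ImspTQPwCqd&lastUpdatedDuration=1d"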

regards,

Lars


Lars Helge Øverland

Lead developer, DHIS 2

University of Oslo

Skype: larshelgeoverland

lars@dhis2.org

http://www.dhis2.org

Hi Lars,

You are right that most of this is available via the API, but making it a little easier for people through the UI would seem to be a quick win really. Previously, when the data mart was in use, one could choose specific time periods to regenerate data for, but that is no longer possible. Currently, users of the UI can only execute a full analytics run, which is really the basis of this request. In this particular situation, a full analytics run caused the server to choke, due to the lack of disk space and the manner in which the analytics tables are (re)generated.

Hopefully the new scheduling functionality will allow for better control over analytics, without users always having to resort to a curl call to the API.

Regards,

Jason

Jason P. Pickering
email: jason.p.pickering@gmail.com
tel:+46764147049