You are right that most of this is available via the API, but making it a little easier for people through the UI would seem to be a quick win. Previously, when the datamart was in use, one could choose specific time periods to regenerate data for, but that is no longer possible. Currently, users of the UI can only execute a full analytics run, which is really the basis of this request. In this particular situation, a full analytics run caused the server to choke, due to lack of disk space and the manner in which the analytics tables are (re)generated.
Hopefully the new scheduling functionality will allow for better control over analytics, without users always having to resort to a curl call to the API.
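For anyone needing it today, the curl call in question looks roughly like this (a sketch - the server URL and credentials are placeholders, and lastYears limits the rebuild as discussed further down the thread):

# Placeholders: substitute your own server URL and credentials.
curl -X POST -u admin:district \
  "https://dhis.example.org/api/resourceTables/analytics?lastYears=2"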
On Mon, Sep 12, 2016 at 8:42 AM, Lars Helge Øverland lars@dhis2.org wrote:
Hi there,
thanks for the feedback. Most of what’s requested is available in the API. It’s on our list to rewrite the import-export app and to write a better scheduling manager for background tasks such as analytics generation.
In the meantime:
- Data value export supports the lastUpdated, lastUpdatedDuration and orgUnit parameters, for example:
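A sketch of such an export (server, credentials and UIDs are placeholders; exact parameter support depends on your DHIS2 version):

# Pull only values changed in the last 24 hours for one org unit.
# <dataSetUid> and <orgUnitUid> are placeholders.
curl -u admin:district \
  "https://dhis.example.org/api/dataValueSets.json?dataSet=<dataSetUid>&orgUnit=<orgUnitUid>&lastUpdatedDuration=24h"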
regards,
Lars
On Sun, Sep 11, 2016 at 5:20 PM, David Siang Fong Oh doh@thoughtworks.com wrote:
I think Jason also pointed out that this could be achieved from the API, but the question is whether it needs to be more user-friendly, i.e. customisable using the web application as opposed to requiring a custom script triggered by a cron job.
Cheers,
–
Lars Helge Øverland
Lead developer, DHIS 2
University of Oslo
Skype: larshelgeoverland
lars@dhis2.org
http://www.dhis2.org
-doh
On Sun, Sep 11, 2016 at 8:36 PM, Dan Cocos dcocos@gmail.com wrote:
Hi All,
You could run this:
/api/24/maintenance/analyticsTablesClear
and possibly this:
/api/24/maintenance/periodPruning
I don’t see it in the documentation, but we call /api/resourceTables/analytics?lastYears=2 quite often for clients with a lot of historical data.
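With curl, those calls would look roughly like this (placeholder server URL and credentials; the maintenance and resource-table endpoints are POSTed):

# Placeholder server URL and credentials throughout.
curl -X POST -u admin:district "https://dhis.example.org/api/24/maintenance/analyticsTablesClear"
curl -X POST -u admin:district "https://dhis.example.org/api/24/maintenance/periodPruning"
# Then rebuild analytics for the last two years only:
curl -X POST -u admin:district "https://dhis.example.org/api/resourceTables/analytics?lastYears=2"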
Good luck,
Dan
Dan Cocos
Principal, BAO Systems
dcocos@baosystems.com | http://www.baosystems.com | 2900 K Street, Suite 404, Washington D.C. 20007
On Sep 11, 2016, at 10:05 AM, Calle Hedberg calle.hedberg@gmail.com wrote:
Hi,
It’s not only analytics that would benefit from segmented/staggered processing: I exported around 100 million data values yesterday from a number of instances, and found that the export process was (seemingly) exponentially slower as the number of records exported increased. Most of the export files contained well under 10 million records, and those were pretty fast. In comparison, the largest export file, with around 30 million data values, probably took 20 times as long as an 8 million value export. Based on just keeping an eye on the “progress bar”, it seemed like some kind of cache staggering was taking place - the amount exported would increase quickly by 2-3 MB, then “hang” for a good while, then increase quickly by 2-3 MB again.
Note also that there are several fundamental strategies one could use to reduce heavy work processes like analytics, exports (and thus imports), etc. - combined in the sketch after this list:
- being able to specify a sub-period, as Jason suggests
- being able to specify the “dirty” part of the instance, using e.g. LastUpdated >= xxxxx
- being able to specify a sub-OrgUnit area
These partial strategies are of course mostly relevant for very large instances, but such large instances are also the ones where typically only a small segment of the total changes - for instance, if you have data for 30 years, 27 of those might be locked down and no longer available for updates.
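A hedged sketch of combining all three filters against the data value export endpoint (the UIDs and dates are invented, and parameter support varies by DHIS2 version):

# sub-period (startDate/endDate), dirty-only (lastUpdated), sub-tree (orgUnit + children)
curl -u admin:district \
  "https://dhis.example.org/api/dataValueSets.json?dataSet=<dataSetUid>&startDate=2016-01-01&endDate=2016-06-30&lastUpdated=2016-09-01&orgUnit=<orgUnitUid>&children=true"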
Regards
Calle
+1 to Calle’s idea of staggering analytics year by year.
I also like Jason’s suggestion of being able to configure the time period for which analytics is regenerated. If the general use case has data being entered only for the current year, is it perhaps unnecessary to regenerate data for previous years?
Cheers,
On 11 September 2016 at 15:47, David Siang Fong Oh doh@thoughtworks.com wrote:
–
Calle Hedberg
46D Alma Road, 7700 Rosebank, SOUTH AFRICA
Tel/fax (home): +27-21-685-6472
Cell: +27-82-853-5352
Iridium SatPhone: +8816-315-19119
Email: calle.hedberg@gmail.com
Skype: calle_hedberg
-doh
On Tue, Jul 26, 2016 at 2:36 PM, Calle Hedberg calle.hedberg@gmail.com wrote:
Hi,
One (presumably) simple solution is to stagger analytics on a year-by-year basis - i.e. run and complete 2009 before processing 2010. That would reduce temp disk space requirements significantly while (presumably) not changing the general design.
Regards
Calle
On 26 July 2016 at 10:24, Jason Pickering jason.p.pickering@gmail.com wrote:
Hi Devs,
I am seeking some advice on how to try and decrease the amount of disk usage with DHIS2.
Here is a list of the biggest tables in the system.
public.datavalue | 2316 MB
public.datavalue_pkey | 1230 MB
public.in_datavalue_lastupdated | 680 MB
There are a lot more tables, and all in all, the database occupies about 5.4 GB without analytics.
This represents about 30 million data rows, so not that big a database really. The server runs on a DigitalOcean virtual machine with 60 GB of disk space. The only things on the server are Linux, PostgreSQL and Tomcat - nothing else. Without analytics, and with everything installed for the system, we have about 23% of that 60 GB free.
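For anyone wanting to reproduce this kind of listing, a catalog query along these lines should do it (the database name is a placeholder):

# List the ten largest tables and indexes by on-disk size.
psql -d dhis2 -c "SELECT relname, pg_size_pretty(pg_relation_size(oid)) AS size
                  FROM pg_class WHERE relkind IN ('r','i')
                  ORDER BY pg_relation_size(oid) DESC LIMIT 10;"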
When analytics runs, it keeps the main analytics tables (analytics_XXXX) in place and creates temp tables like analytics_temp_2004. When things are finished and the indexes are built, the tables are swapped. This ensures that analytics resources remain available while the tables are being built, but the downside is that A LOT more disk space is required, as we effectively have two copies of the tables along with all their indexes, which are quite large themselves (up to 60% of the size of the table itself). Here’s what happens when analytics is run:
public.analytics_temp_2015 | 1017 MB
public.analytics_temp_2014 | 985 MB
public.analytics_temp_2011 | 952 MB
public.analytics_temp_2010 | 918 MB
public.analytics_temp_2013 | 885 MB
public.analytics_temp_2012 | 835 MB
public.analytics_temp_2009 | 804 MB
Now each analytics table is taking about 1 GB of space. In the end, it adds up to more than 60 GB and analytics fails to complete.
So, while I understand the need for this functionality, I am wondering if we need a system option to allow the analytics tables to be dropped prior to regenerating them, or to have more control over the order in which they are generated (for instance to generate specific periods). I realize this can be done from the API or the scheduler, but only for the past three relative years.
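For what it’s worth, the drop-first approach can already be scripted against the maintenance API - an untested sketch, with placeholder server and credentials, and with the obvious cost that analytics is unavailable until the rebuild completes:

# Drop the old analytics tables first so the rebuild never holds two full copies on disk.
curl -X POST -u admin:district "https://dhis.example.org/api/24/maintenance/analyticsTablesClear"
curl -X POST -u admin:district "https://dhis.example.org/api/resourceTables/analytics"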
The reason I am asking for this is that it’s a bit of a pain (at the moment) when using DigitalOcean as a service provider, since their stock disk storage is 60 GB. With other VPS providers (Amazon, Linode), it’s a bit easier, but DigitalOcean only supports block storage in two regions at the moment. Regardless, it would seem somewhat wasteful to need such a large amount of disk space for such a relatively small database.
Is this something we just need to plan for and maybe provide better documentation on, or should we think about trying to offer better functionality for people running smaller servers?
Regards,
Jason
–
Calle Hedberg
46D Alma Road, 7700 Rosebank, SOUTH AFRICA
Tel/fax (home): +27-21-685-6472
Cell: +27-82-853-5352
Iridium SatPhone: +8816-315-19119
Email: calle.hedberg@gmail.com
Skype: calle_hedberg
Jason P. Pickering
email: jason.p.pickering@gmail.com
tel:+46764147049