Pivot tables

Hi Lars, Ola and all

Thinking about exporting pivot tables … we need to write up some
requirements which we can capture in a blueprint. While we are
considering that, I've given some thoughts below.

My initial thought was that we would construct a pivot table with a
jdbc connection like we would normally do manually, and then simply
populate the pivot cache so that the spreadsheet would be
"free-standing". On reflection that is probably neither the best nor
the simplest thing to do. Playing around with excel (in wine :slight_smile: I
figured it is actually much more straightforward to have two sheets;
one for the data and one for the pivot table. The pivot table would
operate on the data in the other worksheet. This is more explicit
than simply hiding the data in the pivot cache. Though I need to test
this a bit more. If the pivotcache is an obligatory data store then
it might not make sense to carry the data twice. The pivot table
constructs are fairly complex but not outlandishly so - once I'm done
unpicking, I will write up a short description of the anatomy of a
pivot table so we can see what needs to be done.

One possibility would be that the pivot table could be generated as
part of the export of a report table, i.e. optionally export to Excel or
export to excel-with-pivot.

Given that there is no advance indication of the columns in a report
table, a challenge will be how and when to define the pivot model -
i.e. the pivot fields - and how to persist that model (we don't want
users to have to recreate the pivot model each time). This will
probably require an extra object in our data model (PivotModel) which
defines the row and column pivot fields and the data field, where a
ReportTable can have a 0..* relationship with PivotModels. A
possible beneficial side effect of this could be that we also leave
open the ability to generate the pivot table rendition with things
other than Excel. I don't see this as an immediate requirement, but
it is useful to have the PivotModel abstraction anyway. In reality
the design of the PivotModel will be based on what Excel requires.
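A minimal sketch of what such a PivotModel object might look like. The class and field names here are illustrative assumptions, not actual DHIS2 code:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical PivotModel: row/column pivot fields plus a data field.
// A ReportTable would then hold a List<PivotModel> (the 0..* relationship).
class PivotModel {
    private final List<String> rowFields = new ArrayList<String>();
    private final List<String> columnFields = new ArrayList<String>();
    private String dataField;

    public void addRowField(String field) { rowFields.add(field); }
    public void addColumnField(String field) { columnFields.add(field); }
    public void setDataField(String field) { this.dataField = field; }

    public List<String> getRowFields() { return rowFields; }
    public List<String> getColumnFields() { return columnFields; }
    public String getDataField() { return dataField; }
}
```

A pivot definition for, say, org units by periods would then just be `addRowField("orgunit")`, `addColumnField("period")`, `setDataField("value")`.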

Generating the Excel spreadsheet off the ReportTable+PivotModel will
produce an OOXML xlsx file - actually a bundle of XML streams which
need to be zipped. It's a bit unfortunate that M$ defines a single
flat-file XML form for other OPC office documents, but sadly not for
Excel. Dealing with a single flat file is much easier than zip
containers with multiple streams, but so be it. At least they end up
smaller. Given that these report tables can be very large, I don't
think an XSLT approach is really the most efficient, at least for
producing the data worksheet. It might make sense if the source data
was XML, but even then probably not. So simple iteration through the
table rows with a StAX writer will work best. The second sheet (the
pivot sheet) would just be a serialization of the pivot model.
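A rough sketch of the zip-of-XML-streams approach with a StAX writer, assuming inline-string cells. Only the worksheet part is shown; a complete xlsx also needs [Content_Types].xml, workbook.xml and the relationship parts, which are omitted here:

```java
import java.io.FileOutputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

// Sketch: stream report-table rows straight into the worksheet part of an
// xlsx zip container, one row at a time, with no DOM held in memory.
class SheetWriterSketch {
    public static void writeSheet(String file, String[][] rows) throws Exception {
        ZipOutputStream zip = new ZipOutputStream(new FileOutputStream(file));
        zip.putNextEntry(new ZipEntry("xl/worksheets/sheet1.xml"));

        XMLStreamWriter xml =
            XMLOutputFactory.newInstance().createXMLStreamWriter(zip, "UTF-8");
        xml.writeStartDocument("UTF-8", "1.0");
        xml.writeStartElement("worksheet");
        xml.writeDefaultNamespace(
            "http://schemas.openxmlformats.org/spreadsheetml/2006/main");
        xml.writeStartElement("sheetData");

        for (String[] row : rows) {
            xml.writeStartElement("row");
            for (String value : row) {
                xml.writeStartElement("c");
                xml.writeAttribute("t", "inlineStr"); // inline string cell
                xml.writeStartElement("is");
                xml.writeStartElement("t");
                xml.writeCharacters(value);
                xml.writeEndElement(); // t
                xml.writeEndElement(); // is
                xml.writeEndElement(); // c
            }
            xml.writeEndElement(); // row
        }

        xml.writeEndElement(); // sheetData
        xml.writeEndElement(); // worksheet
        xml.writeEndDocument();
        xml.flush();

        zip.closeEntry();
        zip.close();
    }
}
```

The point of the StAX writer is that memory use stays constant regardless of how many rows the report table has, which matters once we are in the hundreds of thousands of rows.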

We will probably have to implement the Excel optimization of
shared strings (where literal strings are not written into the
spreadsheet - just index values into a table). This adds complexity,
but I think we are talking pretty large files here. If we don't do
the shared-strings optimization, Excel will do it anyway the first
time it opens and saves the file, but I suspect we will have to do it
up front.
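A sketch of how the shared-strings table could be built while streaming rows (the class is illustrative, not existing code; in the xlsx format the table is written to xl/sharedStrings.xml and cells of type t="s" reference entries by index):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Shared-strings optimisation: each distinct string is stored once and
// cells carry only its index. LinkedHashMap preserves insertion order,
// which is the order entries must be written to xl/sharedStrings.xml.
class SharedStrings {
    private final Map<String, Integer> index = new LinkedHashMap<String, Integer>();

    // Returns the table index for a string, adding it on first sight.
    public int indexOf(String s) {
        Integer i = index.get(s);
        if (i == null) {
            i = index.size();
            index.put(s, i);
        }
        return i;
    }

    public int uniqueCount() { return index.size(); }
}
```

A cell then becomes something like `<c t="s"><v>17</v></c>` instead of carrying the literal string, which is a big win when the same data element and orgunit names repeat on every row.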

Regarding UI and PivotModel, I want to take a look at our existing
work on web pivot tables to see whether we can't leverage some of
this. If we follow roughly the outline above, the user would have the
option to define 0 or more pivot table definitions at the time of
creating, or editing, a report table. What do you think? Is the above
close to what you guys have in mind? Let's hammer out some
requirements and create the blueprint.

Regards
Bob

Hi Bob

Sorry for not returning to this earlier - I've been battling with Windows 64-bit ODBC drivers.

Your two-sheet approach (one worksheet for the data and one for the pivot table) sounds just fine to me.


We won't base this on report tables, as (at least I) can't see any huge benefits. The Excel pivot tables themselves are more capable of having dimensions on columns, rows and filters than our report tables. In fact we usually use an SQL view as the data source for the pivot tables, which has a fixed number of columns. The view is based on the aggregateddatavalue table and joins in the dataelement, orgunit, period and periodtype tables. In addition we join in a series of resource tables to get information about the orgunit structure, dataelement groupset structure and orgunit groupset structure (additional dimensional information!) to make the pivot table more valuable for analysis. (This also goes for indicators.)

Since this should be more of an offline analysis tool, I think exporting all data elements/indicators will be most appropriate. The user interface could simply request a start and end date (and/or relative periods), indicators vs data elements, parent organisation unit and organisation unit level. Ola will maybe have some views here…

When it comes to SQL views, we have recently implemented a function where SQL views can be persisted as an application object and re-generated at any time. This means we can compile in a few default SQL views which can be used for this in DHIS. As you say, an iteration over the rows in this view with e.g. a StAX writer would work.
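To make this concrete, here is a hedged sketch of the kind of filtered query one might run against such a view. The view and column names are assumptions for illustration, not the actual DHIS2 schema; a real implementation should use a PreparedStatement with bound parameters, and set a JDBC fetch size so a several-hundred-thousand-row result streams instead of loading into memory:

```java
// Builds a filtered query against a hypothetical pivot source view.
// The orgunit-level and period-type filters mirror the where clauses
// described for the pivot views.
class PivotQueryBuilder {
    public static String buildPivotQuery(int orgUnitLevel, String periodTypeName) {
        return "SELECT * FROM pivot_source_view"
            + " WHERE level = " + orgUnitLevel
            + " AND periodtypename = '" + periodTypeName + "'";
    }
}
```

The result set would then be walked row by row and handed straight to the StAX writer producing the data worksheet.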

Ola and I have made some samples.

Demo database with SQL views: http://folk.uio.no/larshelg/files/dhis2sl.backup

Small pivot table based on the aggregateddatavalue orgunit level 2 view: http://folk.uio.no/larshelg/files/dataou2.xlsx

Bigger pivot table based on the aggregateddatavalue orgunit level 3 view: http://folk.uio.no/larshelg/files/dataou3.xlsx

If you want to create more pivot tables, install Postgres on Windows from here: http://www.postgresql.org/download/

Then install psqlODBC for Windows from here: http://www.postgresql.org/ftp/odbc/versions/msi/

You can create a data source directly from Excel by going to Data - From other sources - From Microsoft Query and choosing the Postgres ODBC driver under 2. (Stay clear of x64! :-))

Your thoughts on shared-string optimization and zipped stream bundles sound fine to me; in any case you know better here :-)

regards, Lars

···

2010/10/22 Bob Jolliffe bobjolliffe@gmail.com


The two-sheet approach might work; just bear in mind that we will then need several pairs of data+pivot worksheets, as we want multiple pivot tables in an Excel file.

Also bear in mind that we can be talking about up to around 1,500,000 rows of data here (that's when my laptop starts to complain, at least), so these data sheets will be large. Typically, the number of rows of data will be somewhere around 200-600K.


Agree with Lars - I don't think report tables should be used as the source for pivot tables, at least not for the standard "ALL data" pivot tables. For more ad-hoc use, the pivots generated off a report table might be useful as an export option, but let's start with the most common use case: the big pivot tables with all the data for browsing and data analysis.

The pivot source views (sql) that we have used e.g. in the newly distributed pivots for the demo database can be found in an email to the list on Sep 21, together with the pivots and the database dumps:

http://www.mail-archive.com/dhis2-devs@lists.launchpad.net/msg07341.html

These views have been stable for a long time and describe the fields of the basic pivot tables that are needed for DHIS data analysis.

This set of fields goes back as far as DHIS 1.3, so the basic pivot tables are stable and we know their structure: basically the core data dimensions of DHIS2 (see chapter 2 in the user manual).

What will always vary are the orgunit level of aggregation for the data or indicator values and the period type. You will see these two are specified at the end of every where clause in the pivot views. These are used as filters when fetching data from the data mart.

If you open the pivot table for download in that email you can follow my reasoning below:

Typically we have two tables for monthly data (raw data + indicator data) for the lowest level, plus at least two more for a higher level. E.g. in the demo based on SL we use the health facility level (level 4) and the Chiefdom level (level 3). The reason for including the Chiefdom level as well is that those tables are much faster to manipulate, since there is a lot less data when shaving away the lowest level. How many levels, and which levels, to put in the pivot table will vary from place to place.

In addition to these at least four tables, there is often at least one table for population data, which has a yearly period type and therefore needs a separate view and table. So for the demo database we have 5 basic tables which cover all the data for data elements and indicators. This set of pivot tables is what I would define as a minimum pivot table setup for a DHIS database.

Over time (depending on the amount of data collected) it might be necessary to split these basic tables up by year, since they can get too big for Excel to handle. Such a split can be facilitated by a simple from/to filter on the start date in the period field in the SQL.

Then there can be many more customised pivot tables which make use of additional dimensions of the data, like the data element categories and the data element and orgunit group sets. The simplest approach there is (still talking SQL) to do a select * from the various resource tables. Then you have a data source for all other customised tables (using the same period type and aggregation level).

When working with data element categories in pivot tables it doesn't make sense to use too many at the same time; rather, work on data elements that share the same categories (within or across datasets). You can see in the demo pivot table file that we have set up multiple tables making use of data element categories (EPI, RCH, HIV etc.). These all use the same source data (Chiefdom-level monthly raw data) but different pivot fields (columns in the result set). The total number of available fields is the same for all these custom Chiefdom tables, but we are only using a small subset of the fields in each table. This means that the same data source could also be used for the basic tables (the two Chiefdom tables at least), just selecting even fewer fields. Users can also easily add more fields to a table in Excel using drag and drop.

It seems to me that there is a need to separate the source data from the pivot structure, in order to reuse the source data in multiple tables. I think that is how Excel thinks about it too, but it doesn't always seem to work that way, e.g. when creating a copy of an existing worksheet with a pivot table.

I am not sure how much of this customisation we need to support inside DHIS2 and how much we can leave to Excel (which has nice tools for copying tables, drag and drop pivot fields etc.). First we need to come up with something that works for the basic tables and then look at these more custom use cases.

Pivot tables using period types other than monthly or yearly (both covered in the basic tables listed above) - six-monthly, quarterly or weekly - are also important to support. The pivot fields are the same as in the basic tables; it is just a different period type filter on the data source, so these might be easier to accommodate.

Looking at the next steps, to me the key new features that I would like to see with regards to pivot tables are:

1) a pivot table update service, where users who are on the online server (no local install) can update their local pivot tables against the online DHIS server

Since a pivot table can contain months and years of data, we do not want to download the full tables every month, but simply append the latest month of data to the existing local pivot tables

2) a pivot generator, where users can download a new pivot table file directly off a DHIS2 server without having to do any local configuration etc., and then later use the update service to update their tables

with 1) being the most critical and urgent, but I realise that we might need 2) first, in order to deal with a standard set of tables and pivot data sources in 1).

Ola

···

2010/10/22 Lars Helge Øverland larshelge@gmail.com

2010/10/22 Bob Jolliffe bobjolliffe@gmail.com




Mailing list: https://launchpad.net/~dhis2-devs

Post to : dhis2-devs@lists.launchpad.net

Unsubscribe : https://launchpad.net/~dhis2-devs

More help : https://help.launchpad.net/ListHelp

Hi

> Agree with Lars, I don't think report tables should be used as source
> for pivot tables, at least not for the standard "ALL data" pivot tables.

Yes, I understand that now.

> It seems to me that there is a need to separate between the source data
> and the pivot structure, to reuse the source data in multiple tables. I
> think that is how Excel is thinking too, but it doesn't always seem to
> work that way, e.g. when creating a copy of an existing worksheet with
> a pivot table.

Yes - we can ask Excel pivot tables to use a common pivot table cache.

I am not sure how much of this customisation we need to support inside DHIS2
and how much we can leave to Excel (which has nice tools for copying tables,
drag and drop pivot fields etc.). First we need to come up with something
that works for the basic tables and then look at these more custom use
cases.
Pivot tables using period types other than monthly or yearly (both covered
in the basic tables listed above), such as six-monthly, quarterly or weekly,
are also important to support. The pivot fields are the same as in the basic
tables; it is just a different period type filter on the data source, so
these might be easier to accommodate.
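Ola's point that only the period type filter changes between these tables can be sketched as a tiny query builder. This is purely illustrative: the view name `pivotsource_datavalue` and the column names `periodtypename` and `level` are assumptions, not the actual schema of the demo views.

```java
// Sketch: the same pivot source view, parameterised by period type and
// orgunit level, as described above. All names here are hypothetical.
public class PivotSourceQuery {
    static String forPeriodType(String view, String periodType, int level) {
        return "SELECT * FROM " + view
            + " WHERE periodtypename = '" + periodType + "'"
            + " AND level = " + level;
    }

    public static void main(String[] args) {
        // e.g. the quarterly chiefdom-level source
        System.out.println(forPeriodType("pivotsource_datavalue", "Quarterly", 3));
    }
}
```

In practice the parameters would be bound via a prepared statement rather than concatenated, but the point is that only the filter varies; the field list stays fixed.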
Looking at the next steps, to me the key new features that I would like to
see with regards to pivot tables are:
1) a pivot table update service, where users that are on the online server
(no local install) can update their local pivot tables against the online
DHIS server. Since a pivot table can contain months and years of data we do
not want to download the full tables every month, but simply append the
latest month of data to the existing local pivot tables.
2) a pivot generator, where users can download a new pivot table file
directly off a DHIS2 server without having to do any local configuration
etc., and then later use the update service to update their tables.
with 1) being the most critical and urgent, but I realise that we might need
2) first, in order to deal with a standard set of tables and pivot data
sources in 1).

Yes, I think 1 can only really work in concert with 2. Since we can't
ask DHIS to update any old pivot table, it will need to be a
pivot table which it is intimately familiar with - either a
fixed-format one as you have discussed or one that is created through
DHIS. Naturally we start with the former. BTW, looking at the size of the
tables, I have dropped the idea of holding data in a worksheet. This does
in fact duplicate data which is also held in the cache, and both copies
are unfortunately also held in memory. So we work only with the hidden
pivot table cache as the local store.

Thinking about how and where data is being moved, it might not make sense to
pass the (massive) Excel file back and forth to DHIS in order to
get updates injected. That might defeat the purpose. Which raises
some questions. Either we have a standalone tool which grabs
incremental updates from DHIS and appends them to the Excel pivot table
cache, or we have some sort of VBA macro defined within Excel. I think the
latter might prove problematic - like performing circumcision on
oneself whilst awake, it might cause Excel to break down. So probably
a standalone tool ..
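The standalone-tool idea could work by rewriting the xlsx package (which is just a zip) and splicing new records into the pivot cache part. A minimal sketch, with several assumptions: the part name `xl/pivotCache/pivotCacheRecords1.xml` and the record XML are illustrative, and a real tool would also have to update the cache's record count and refresh flags in `pivotCacheDefinition1.xml`.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Sketch of a standalone updater: copy the xlsx zip entry by entry,
// appending fetched <r> records inside the hidden pivot cache part.
public class PivotCacheAppender {

    static String appendRecords(String cacheXml, String newRecords) {
        // splice new records in just before the closing tag
        return cacheXml.replace("</pivotCacheRecords>",
                newRecords + "</pivotCacheRecords>");
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) return;
        Path xlsx = Paths.get(args[0]);
        Path out = Paths.get(args[1]);
        String target = "xl/pivotCache/pivotCacheRecords1.xml"; // assumed part name
        String fetched = "<r><x v=\"0\"/><n v=\"42\"/></r>";    // would come from DHIS
        try (ZipInputStream in = new ZipInputStream(Files.newInputStream(xlsx));
             ZipOutputStream zout = new ZipOutputStream(Files.newOutputStream(out))) {
            ZipEntry e;
            while ((e = in.getNextEntry()) != null) {
                zout.putNextEntry(new ZipEntry(e.getName()));
                byte[] data = in.readAllBytes();
                if (e.getName().equals(target)) {
                    String xml = new String(data, StandardCharsets.UTF_8);
                    data = appendRecords(xml, fetched).getBytes(StandardCharsets.UTF_8);
                }
                zout.write(data);
                zout.closeEntry();
            }
        }
    }
}
```

The appeal of this approach is that only the increment travels over the wire; the big file never leaves the user's machine.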

I haven't worked through the samples yet. I think I will buy excel on
the weekend and invoice Jorn.

Cheers
Bob


When it comes to SQL views we have recently implemented a function for
this where SQL views can be persisted as an application object and
re-generated at any time. This means we can compile in a few default SQL
views which can be used for this in DHIS. As you say an iteration over the
rows in this view with e.g. a StAX writer would work.
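The StAX approach mentioned here can be sketched with the JDK's `XMLStreamWriter`: iterate the rows and stream them out, never holding the whole sheet in memory. This is a minimal illustration only; real worksheet cells also carry reference and type attributes (`r`, `t`), and the row values here come from a list rather than the SQL view.

```java
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;
import java.io.StringWriter;
import java.util.List;

// Sketch: stream result-set rows into a worksheet's <sheetData> element.
public class SheetDataWriter {

    static String write(List<List<String>> rows) throws XMLStreamException {
        StringWriter sw = new StringWriter();
        XMLStreamWriter w = XMLOutputFactory.newFactory().createXMLStreamWriter(sw);
        w.writeStartElement("sheetData");
        for (List<String> row : rows) {
            w.writeStartElement("row");
            for (String value : row) {
                w.writeStartElement("c");   // cell (attributes omitted)
                w.writeStartElement("v");   // cell value
                w.writeCharacters(value);
                w.writeEndElement();
                w.writeEndElement();
            }
            w.writeEndElement();
        }
        w.writeEndElement();
        w.flush();
        return sw.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(write(List.of(List.of("2009", "Jan", "4000"))));
    }
}
```

With a JDBC `ResultSet` in place of the list, memory use stays flat regardless of how many hundred thousand rows the view returns.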
Ola/I have made some samples.
Demo database with SQL
views: http://folk.uio.no/larshelg/files/dhis2sl.backup
Small pivot table based on the aggregateddatavalue orgunit level 2
view: http://folk.uio.no/larshelg/files/dataou2.xlsx
Bigger pivot table based on the aggregateddatavalue orgunit level 3
view: http://folk.uio.no/larshelg/files/dataou3.xlsx
If you want to create more pivot tables install postgres on windows
here: http://www.postgresql.org/download/
Then install psql odbc windows
here: http://www.postgresql.org/ftp/odbc/versions/msi/
You can create a data source directly from Excel by going to Data - From
other sources - From Microsoft Query and choosing the postgres odbc driver
under 2. (Stay clear of x64! :-)
Your thoughts on shared string optimizations and zipped stream bundles
sound fine to me, in any case you know better here :)
regards, Lars

_______________________________________________
Mailing list: https://launchpad.net/~dhis2-devs
Post to : dhis2-devs@lists.launchpad.net
Unsubscribe : https://launchpad.net/~dhis2-devs
More help : https://help.launchpad.net/ListHelp

Definitely let's start with 2). We can think of 1) later, maybe even a script will do the job. Bob, I will put up Excel with the key in a separate mail.

Did some quick investigations on size and it seems Excel is using around 30 KB per 1000 data values. The file is already compressed, btw. If a typical pivot file has 600,000 values that means about 18 MB.
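As a sanity check on that arithmetic: 600,000 values at ~30 KB per 1,000 values is 18,000 KB, i.e. roughly 18 MB (17.6 MiB if dividing by 1024).

```java
// Quick check of the size estimate above: ~30 KB per 1000 data values.
public class SizeEstimate {

    static double megabytes(long values, double kbPer1000) {
        return values / 1000.0 * kbPer1000 / 1024.0;
    }

    public static void main(String[] args) {
        System.out.printf("600k values ~ %.1f MiB%n", megabytes(600_000, 30));
    }
}
```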

Lars


Ok. Working on the uncompressed stream will then be considerably
greater than that (100 MB?). And at up to 1.5 million rows it might be
closer to 300 MB.

A few thoughts stemming from that:
(i) if you are populating a pivot table from a jdbc connection backed
by a query, then the fields which are returned in that resultset
should be carefully and minimally (optimally) trimmed. All the data
is retained in the pivot cache of the Excel file, and all of that is
loaded into memory when Excel opens the file (great though Excel might
be, and much though we all love it(!), it's not a database). So for
example in the xlsx case, two important streams in the representation
of the pivot cache are the pivotCacheDefinition(n).xml and the actual
pivotCacheRecords(n). The cache definition is something like the
database metadata definitions, but including "selection options" for
common strings, eg:

<cacheField name="year" numFmtId="0" sqlType="-10">
    <sharedItems count="2">
        <s v="2009"/>
        <s v="2010"/>
    </sharedItems>
</cacheField>
<cacheField name="month" numFmtId="0" sqlType="-10">
    <sharedItems count="12">
        <s v="Jan"/>
        <s v="Feb"/>
        <s v="Mar"/>
        <s v="Apr"/>
        <s v="May"/>
        <s v="Jun"/>
        <s v="Jul"/>
        <s v="Aug"/>
        <s v="Sep"/>
        <s v="Oct"/>
        <s v="Nov"/>
        <s v="Dec"/>
    </sharedItems>
</cacheField>
<cacheField name="period" numFmtId="0" sqlType="-10">
    <sharedItems count="17">
        <s v="Jan-09"/>
        <s v="Feb-09"/>
        <s v="Mar-09"/>
        <s v="Apr-09"/>
        <s v="May-09"/>
        <s v="Jun-09"/>
        <s v="Jul-09"/>
        <s v="Aug-09"/>
        <s v="Sep-09"/>
        <s v="Oct-09"/>
        <s v="Nov-09"/>
        <s v="Dec-09"/>
        <s v="Jan-10"/>
        <s v="Feb-10"/>
        <s v="Mar-10"/>
        <s v="Apr-10"/>
        <s v="May-10"/>
    </sharedItems>
</cacheField>

So a resultset of a query with 13 fields will be represented by 13
cacheField elements in the pivotCacheDefinition. This in turn will
translate into each row in the resultset being represented by a record
in the pivotCacheRecords with 13 child elements, eg

<r>
        <x v="0"/>
        <x v="0"/>
        <x v="0"/>
        <x v="0"/>
        <x v="0"/>
        <s v="ARI/Pneu.DeathM"/>
        <x v="0"/>
        <x v="0"/>
        <x v="0"/>
        <x v="0"/>
        <x v="16"/>
        <n v="4000"/>
        <x v="0"/>
        <n v="8318"/>
        <s v="ARI/Pneumonia death male (2009)"/>
        <x v="0"/>
        <x v="0"/>
        <x v="0"/>
        <x v="0"/>
        <x v="0"/>
        <x v="0"/>
        <x v="0"/>
        <x v="0"/>
        <x v="0"/>
        <x v="0"/>
        <x v="0"/>
    </r>
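To make the indexing concrete: each <x v="n"/> is an index into the corresponding field's sharedItems list, while <n> and <s> carry literal numbers and strings. Here is a rough sketch of how a resultset row maps onto a cache record - illustrative only, the field lists and values are made up, and a real exporter would also need the namespaces and count attributes Excel expects:

```python
import xml.etree.ElementTree as ET

# Hypothetical shared-items lists, one per cached field. Index
# positions correspond to the <s v="..."/> entries in the
# pivotCacheDefinition's <sharedItems>.
shared = {
    "year":  ["2009", "2010"],
    "month": ["Jan", "Feb", "Mar"],
}

def cache_record(row):
    """Build one <r> element: shared strings become <x v="index"/>,
    numbers become <n v="..."/>, other strings become <s v="..."/>."""
    r = ET.Element("r")
    for field, value in row:
        if field in shared:
            ET.SubElement(r, "x", v=str(shared[field].index(value)))
        elif isinstance(value, (int, float)):
            ET.SubElement(r, "n", v=str(value))
        else:
            ET.SubElement(r, "s", v=value)
    return r

# "2010" and "Feb" come out as index references, 4000 stays literal
demo = cache_record([("year", "2010"), ("month", "Feb"), ("value", 4000)])
print(ET.tostring(demo, encoding="unicode"))
```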

So if we are talking about really, really large resultsets (as it
seems we are) then we should pay great attention to eliminating
excess fields in the query, as these will get replicated potentially
1.5 million times in the pivotCacheRecords, all of which has to be
loaded into memory when excel loads the file. That's just a word of
caution in case folk haven't realized it. So for example, having year,
month and period might be excessive. Similarly I have seen dataelement
name being duplicated in some of the sample queries sent to me. I
guess this is probably a matter of getting the query right first, then
fine tuning ruthlessly. People should be aware that everything (not
just the fields actually used in the pivot) is stored in the
pivotCacheRecords.
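Back of envelope, the uncompressed cache size scales linearly with fields x rows, which is why trimming fields matters. The bytes-per-element figure below is a pure guess for illustration:

```python
# Rough size estimate for an uncompressed pivotCacheRecords stream.
# ~15 bytes per child element is an averaged guess across the short
# <x v="n"/> references and the occasional inline string or number.
fields_per_row = 13
bytes_per_element = 15
rows = 1_500_000

total_mb = rows * fields_per_row * bytes_per_element / (1024 * 1024)
print(f"roughly {total_mb:.0f} MB uncompressed")  # comes out near 280 MB
```

Dropping even one redundant field from the query removes that field from every one of the 1.5 million records, so the saving is proportional.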

(ii) If I understand the use cases correctly, the main issue is that
when we want to refresh a pivottable which uses an odbc connection,
the pivotCacheRecords are flushed and repopulated with the latest
query result. Given the size of the pivotCacheRecords this is a
non-trivial bulk of data movement. And if it's over a network it's
likely to be intolerable - particularly if all we want to do is to
append new data from, say, the next month.

Looking at the example of a pivotcache record above, it should be
obvious that it is quite trivial to generate new records and append
them to the cache (though if we are talking about a couple of hundred
MB of cache I'd be reluctant to do this in java, tcl or python - some
c++ code using libxerces-c is probably going to make most sense). I
have succeeded so far in adding records into the pivot cache without
causing excel to complain or die, which is good. But I haven't yet
persuaded it to update the displayed table based on the modified
cache. I hope to play some more with this, but tomorrow is a public
holiday here in Ireland so it might be delayed till Tuesday.
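For what it's worth, the append itself can be sketched in a few lines - python here just to show the shape of it; as noted, for multi-hundred-MB caches a streaming c++ approach would be saner. The root element's count attribute being kept in sync is my assumption about what Excel checks:

```python
import xml.etree.ElementTree as ET

NS = "http://schemas.openxmlformats.org/spreadsheetml/2006/main"
ET.register_namespace("", NS)

def append_records(cache_path, new_rows):
    """Append <r> records to a pivotCacheRecords part in place.
    new_rows is a list of (tag, value) lists, e.g.
    [[("x", 0), ("x", 3), ("n", 4000)], ...]."""
    tree = ET.parse(cache_path)
    root = tree.getroot()
    for row in new_rows:
        r = ET.SubElement(root, f"{{{NS}}}r")
        for tag, value in row:
            ET.SubElement(r, f"{{{NS}}}{tag}", v=str(value))
    # keep the record count attribute consistent with the children
    root.set("count", str(len(root)))
    tree.write(cache_path, xml_declaration=True, encoding="UTF-8")
```

In the real xlsx the part lives inside the zip package, so the stream would need to be rewritten into the archive afterwards.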

If it proves difficult or impossible to make this happen then we might
need to think some more. An odbc connection is only one way (and
potentially a slow way) for a pivottable to get hold of its data -
particularly if it has to do it to, say, a postgres instance through
windoze's crappy tcp layer. Pivottables can also read from excel data
sources and I think even csv (using the odbc csv driver). If we are
thinking about offline use, then it might make sense, instead of doing
surgery on missing db connections, to explicitly pass the data in an
excel or csv file (starting to look a bit like my earlier reporttable
export). The point being that there might be an initial enormous
transfer, but subsequent appended amendments would be more reasonable.
Then - contrary to my earlier suggestion - the pivottable should be in
a separate workbook from the data, because if a pivottable makes use
of excel data in the same workbook, it's going to transfer to the
pivotcache and you will end up with two copies of the data in the
workbook, which excel will also try to hold in memory. Which can be
bad. (I would have to look into whether it's more efficient to
represent data as compressed csv or excel, but the principle is the
same.)

But if pivottable and data are separate then the process of refreshing
the pivottable will involve copying the data from the external data
source workbook into the pivotcache of the pivottable. This might
still turn out to be unacceptably slow if we are talking hundreds of
MB of data, but at least we can update the data workbook incrementally
from dhis. Then transferring into the pivotcache will take how long it
takes :slight_smile:
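The monthly update to the data side is then just an append - a sketch, with the filename and row shape made up for illustration:

```python
import csv

def append_month(data_file, new_rows):
    """Append a month's worth of exported rows to the local csv that
    the pivottable's data connection points at. new_rows is a list of
    row lists, e.g. [["ARI/Pneu.DeathM", "Jun-10", "district X", 17]]."""
    with open(data_file, "a", newline="") as f:
        csv.writer(f).writerows(new_rows)
```

Only the new month's rows cross the network; the full copy into the pivotcache then happens locally on refresh.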

I still think directly appending rows to the pivot cache will be the
optimal approach, but I'm talking through alternatives out loud in
case it doesn't work out. I hope to answer this question definitively
by Tuesday. Meanwhile I'm sharing more info on excel internals than
you probably are interested in knowing - partly because it's probably
good for you designing queries, and partly because it helps me
straighten my thoughts :slight_smile:

Regards
Bob

···

2010/10/22 Lars Helge Øverland <larshelge@gmail.com>:

On Fri, Oct 22, 2010 at 2:32 PM, Bob Jolliffe <bobjolliffe@gmail.com> wrote:

2010/10/22 Ola Hodne Titlestad <olati@ifi.uio.no>:
> 2010/10/22 Lars Helge Øverland <larshelge@gmail.com>
>> 2010/10/22 Bob Jolliffe <bobjolliffe@gmail.com>

Lars

Hi thanks for the nice update, interesting. The thinking sounds very reasonable to me.

Ola and I will work a bit on the optimization of SQL view. One issue here is that when writing “generic” sql queries for tables like orgunitgroupsetstructure, one needs to select * (all) columns in order to be portable because the table has each column per groupset, ie. its dynamic. But I guess we can fix these things when generating the SQL from Java / DHIS.

Re size, a good thing is that a typical pivot table won’t contain the whole database, rather the data for a district and its facilities. We want to be able to compare with other districts (maybe inside the same province only) but for those we only need the aggregated data at the district level (not facility data) - which means a lot less data.

It's also interesting to see how excel uses a "minimal" xml (<r><x v="0"/></r> etc). We could consider that for DXF 2 datavalues. Size/performance vs readability.

Lars

> Hi thanks for the nice update, interesting. The thinking sounds very
> reasonable to me.
> Ola and I will work a bit on the optimization of SQL view. One issue here is
> that when writing "generic" sql queries for tables like
> orgunitgroupsetstructure, one needs to select * (all) columns in order to be
> portable because the table has each column per groupset, ie. its dynamic.
> But I guess we can fix these things when generating the SQL from Java /
> DHIS.

Yes. The more we optimize these (in terms of number of fields) the
less enormous the excel file will be. And given that excel holds all
in memory we must do the best we can.

> Re size, a good thing is that a typical pivot table won't contain the whole
> database, rather the data for a district and its facilities. We want to be
> able to compare with other districts (maybe inside the same province only)
> but for those we only need the aggregated data at the district level (not
> facility data) - which means a lot less data.

I'm trying to focus on the particular problems related to enormous
spreadsheets. If they are small we don't really have much of a
problem either way. But if we can deal with the really large ones
then we can maybe deal with medium large ones quite snappily.

> Its also interesting to see how excel uses a "minimal" xml (<r><x
> v="0"/></r> etc). We could consider that for DXF 2 datavalues.
> Size/performance vs readability.

This is an OOXML "feature" which has been heavily criticized by almost
all in the xml world. Not being religious, I am in two minds about
it. I think it might be reasonable for the datavalues in dxf, but I'd
stick my heels in if we started doing the same thing for metadata. In
fact I think there is just such a minimal syntax which I have
suggested to Jo some time back as being maybe useful for a lightweight
mobile version of sdmx. I wouldn't have a problem with:

<dv de='32' o='3' p='4' d='2' /> <!-- where 'd' = disaggregation -->

(I'd still like to see the other attributes, comment, createdby etc
being aggregated up to a 'datavalueset' but that's another story)

Regarding ooxml I think there is a better justification for doing it
in a data oriented document format like a spreadsheet than there is
for word docs and presentations. Certainly M$ has achieved significant
performance benefits for Excel in terms of load times. File size is
not that significant a benefit because the compressed files shouldn't
be that different. What also really helps is the use of
sharedstrings. You can see that they do a similar shared-strings type
optimization in the pivotcache. All of which allows us to stretch the
use of spreadsheets into domains where they shouldn't really go.

Cheers
Bob


···

2010/10/25 Bob Jolliffe bobjolliffe@gmail.com

2010/10/25 Lars Helge Øverland larshelge@gmail.com:

Attached is a simple skeleton overview diagram of how it *can* be
done. I'm not sure if I'm 100% comfortable with the approach yet but
I can verify that it works (caveat below).

From the diagram, the point ultimately is to populate the
pivottablecache of the pivottable within the local excel spreadsheet
with the data from the view in the remote database. This is what you
would normally do (in a connected setting) by setting up an odbc-type
connection to the postgres database.

What I have done instead is to create a connection to a local csv
file. This is done by creating an odc connection description file and
pointing the pivottable at it. Selected snippet from an example odc
connection file:
<xml id='msodc'>
  <odc:OfficeDataConnection
  xmlns:odc="urn:schemas-microsoft-com:office:odc"
  xmlns="http://www.w3.org/TR/REC-html40">
  <odc:Connection odc:Type="OLEDB">
   <odc:ConnectionString>Provider=MSDASQL.1;Persist Security
Info=False;Data Source=csv;Extended
Properties=&quot;DSN=csv;DefaultDir=C:\Users\bobj\Documents\excel;DriverId=27;FIL=text;MaxBufferSize=2048;PageTimeout=5;&quot;;Initial
Catalog=C:\Users\bobj\Documents\excel</odc:ConnectionString>
   <odc:CommandType>Table</odc:CommandType>
   <odc:CommandText>`C:\Users\bobj\Documents\excel\test.csv`</odc:CommandText>
  </odc:Connection>
</odc:OfficeDataConnection>
</xml>

I've taken a 30MB csv file dumped from the
_VIEW_PIVOTSOURCEROUTINEDATAOU3ALL view of Ola's sample data.
(Grabbed from h2 with: CALL CSVWRITE('test.csv', 'SELECT * FROM
_VIEW_PIVOTSOURCEROUTINEDATAOU3ALL'); )

Using the odc above, refreshing the pivottable causes a copy from the
csv file into the pivot cache which seems pretty quick (4-5 seconds) -
certainly quicker than pulling it via tcp-odbc-jdbc or whatever.

Note that a worksheet can have more than one pivottable and more than
one pivotcache, and that more than one pivottable can share the same
pivot cache. I haven't gone into those details.

So the upside is that local pivottable refresh is quick and appending
monthly csv snippets is trivial. Another positive spinoff of having
the dataset as a local csv is that you can also pivot with all kinds
of tools other than excel (ranging from primitive awk to fancy 'r'
reshaping - both of which I think I prefer to excel :slight_smile: ). But
managing this setup is fairly straightforward - 3 files: the excel
file, the odc file and the csv file, which grows with monthly updates
(I know there must also be a facility to amend changed data but we'll
cross that bridge ..). And there are of course downsides.

These csv files are really pretty big and hugely sub-optimal. The
normalization of the database has been flattened out for convenience
of pivoting, so we are dealing with tens of thousands of repeated
strings (dataelement names, ou names etc). From a transport
perspective this is maybe not too bad - I'm guessing a reasonable
compression might factor out some of this. And appending monthly
chunks wouldn't be too bad. But they are pretty big files to manage.
Still, I think they are maybe more robust than depending solely on the
pivotcache as a local primary store.
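A quick toy measurement gives a feel for how much of that repetition compresses away - the names are made up, but the shape mimics the flattened pivot source:

```python
import gzip
import random

random.seed(1)

# A toy flattened pivot source: a handful of distinct strings
# repeated across a hundred thousand rows.
names = ["ARI/Pneumonia death male", "Measles cases", "BCG doses given"]
rows = "".join(
    f"{random.choice(names)},Jan-10,district {i % 50},{i % 100}\n"
    for i in range(100_000)
)
raw = rows.encode()
packed = gzip.compress(raw)
print(f"raw {len(raw) // 1024} KB -> gzipped {len(packed) // 1024} KB")
```

On data this repetitive the ratio is large, which supports the idea that transporting compressed csv is tolerable even when the flat file itself is big.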

That's it for now. A thought to sleep on is whether we couldn't
somehow just maintain a local database which replicates the necessary
tables from the online server. But it's complicated. Maintaining a
couple of (hopefully not too many) csv files might be easier.

Bob

PS. The caveat. My csv source for pivottables sort of works, but all
my datavalues are being treated as zero :frowning: I think this is
probably to do with setting the datatype of the field somehow. I think
it's fixable.

[attached: skeleton overview diagram]

···

2010/10/26 Lars Helge Øverland <larshelge@gmail.com>:

2010/10/25 Bob Jolliffe <bobjolliffe@gmail.com>

2010/10/25 Lars Helge Øverland <larshelge@gmail.com>:
>
> Hi thanks for the nice update, interesting. The thinking sounds very
> reasonable to me.
> Ola and I will work a bit on the optimization of SQL view. One issue
> here is
> that when writing "generic" sql queries for tables like
> orgunitgroupsetstructure, one needs to select * (all) columns in order
> to be
> portable because the table has each column per groupset, ie. its
> dynamic.
> But I guess we can fix these things when generating the SQL from Java /
> DHIS.

Yes. The more we optimize these (in terms of number of fields) the
less enormous the excel file will be. And given that excel holds all
in memory we must do the best we can.

> Re size, a good thing is that a typical pivot table won't contain the
> whole
> database, rather the data for a district and its facilities. We want to
> be
> able to compare with other districts (maybe inside the same province
> only)
> but for those we only need the aggregated data at the district level
> (not
> facility data) - which means a lot less data.

I'm trying to focus on the particular problems related to enormous
spreadsheets. If they are small we don't really have much of a
problem either way. But if we can deal with the really large ones
then we can maybe deal with medium large ones quite snappily.

Yes definitely.

> Its also interesting to see how excel uses a "minimal" xml (<r><x
> v="0"/></r> etc). We could consider that for DXF 2 datavalues.
> Size/performance vs readability.

This is an OOXML "feature" which has been heavily criticized by almost
all in the xml world. Not being religious, I am in two minds about
it. I think it might be reasonable for the datavalues in dxf, but I'd
stick my heels in if we started doing the same thing for metadata. In
fact I think there is just such a minimal syntax which I have
suggested to Jo some time back as being maybe useful for a lightweight
mobile version of sdmx. I wouldn't have a problem with:

<dv de='32' o='3' p='4' d='2' /> <!-- where 'd' = disaggregation -->

(I'd still like to see the other attributes, comment, createdby etc
being aggregated up to a 'datavalueset' but that's another story)

Regarding ooxml I think there is a better justification doing it for a
data oriented document format like a spreadsheet, than there is for
word docs and presentations. Certainly M$ has achieved significant
performance benefits to Excel in terms of load times. File size is
not that significant a benefit because the compressed files shouldn't
be that different. What also really helps is the use of
sharedstrings. You can see that they do a similar shared strings type
optimization in the pivotcache. All of which allows us to stretch the
use of spreadsheets into domains where they shouldn't really go.

Yes, was only thinking about data values. We can leave this for later.

Nice. Explicitly separating presentation from data with csv sounds more
manageable. I think we can live with "yearly" pivot tables, in that these
things won't keep growing forever.

> PS. The caveat. My csv source for pivottables sort of works but all
> my datavalues are being treated as zero :frowning: I think this is probably
> to do with setting the datatype of the field somehow. I think its
> fixable.

Might be related to the fact that DataValue.value is a String. Maybe Ola
has some experience?

···

2010/10/27 Bob Jolliffe bobjolliffe@gmail.com


> Nice. Explicitly separating presentation from data with csv sounds more
> manageable. I think we can live with "yearly" pivot tables, in that these
> things won't keep growing forever.
>
> > PS. The caveat. My csv source for pivottables sort of works but all
> > my datavalues are being treated as zero :frowning: I think this is probably
> > to do with setting the datatype of the field somehow. I think its
> > fixable.
>
> Might be related to the fact that DataValue.value is a String. Maybe Ola has
> some experience?

Yes, it has to do with interpreting fields as numeric rather than
string. The csv driver defaults to a string assumption. I haven't
figured out the best place to tell it otherwise yet. But I think Ola has ..
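One candidate for "the best place" might be a schema.ini file sitting next to the csv, which the Microsoft text odbc driver consults for column names and types. A sketch with made-up column names - I haven't verified this against our exact setup:

```ini
[test.csv]
ColNameHeader=True
Format=CSVDelimited
Col1=dataelement Text
Col2=period Text
Col3=orgunit Text
Col4=value Double
```

If that works, the connection and command text could stay untouched and the driver would hand Excel the value column as a number from the start.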

···

2010/10/27 Lars Helge Øverland <larshelge@gmail.com>:

2010/10/27 Bob Jolliffe <bobjolliffe@gmail.com>

Yes, I cast the Value field to number using the val() function, val(Value). I modified the query used by the pivot table in:

Pivot Tables top menu->Change Data Source->Connection Properties->Definition->Command text

Not sure where this query is stored as these changes did not show in the .odc after saving the Excel file.

There is probably connection information also in Excel somewhere, since the query is on top of the .odc.

If possible, it sounds like a better deal to modify this in the .odc file and just make sure Excel gets the data in the correct format.

···

Ola Hodne Titlestad (Mr)
HISP
Department of Informatics
University of Oslo

Mobile: +47 48069736
Home address: Vetlandsvn. 95B, 0685 Oslo, Norway. Googlemaps link

2010/10/27 Bob Jolliffe bobjolliffe@gmail.com

2010/10/27 Lars Helge Øverland larshelge@gmail.com:

2010/10/27 Bob Jolliffe bobjolliffe@gmail.com

Attached is a simple skeleton overview diagram of how it can be

done. I’m not sure if I’m 100% comfortable with the approach yet but

I can verify that it works (caveat below).

From the diagram, the point ultimately is to populate the

pivottablecache of the pivottable within the local excel spreadsheet

with the data from the view in the remote database. This is what you

would normally do (in a connected setting) by setting an odbc type

connection to the postgres database.

What I have done is instead to create a connection to a local csv

file. This is done by creating a odc connection description file and

pointing the pivottable at that. Selected nippet from example odc

connection file:

<odc:OfficeDataConnection

xmlns:odc=“urn:schemas-microsoft-com:office:odc”

xmlns=“http://www.w3.org/TR/REC-html40”>

<odc:Connection odc:Type=“OLEDB”>

odc:ConnectionStringProvider=MSDASQL.1;Persist Security

Info=False;Data Source=csv;Extended

Properties="DSN=csv;DefaultDir=C:\Users\bobj\Documents\excel;DriverId=27;FIL=text;MaxBufferSize=2048;PageTimeout=5;";Initial

Catalog=C:\Users\bobj\Documents\excel</odc:ConnectionString>

odc:CommandTypeTable</odc:CommandType>

odc:CommandTextC:\Users\bobj\Documents\excel\test.csv</odc:CommandText>

</odc:Connection>

</odc:OfficeDataConnection>

I’ve taken a 30MB csv file dumped from the

_VIEW_PIVOTSOURCEROUTINEDATAOU3ALL view of Ola’s sample data.

(Grabbed from h2 with: CALL CSVWRITE(‘test.csv’, 'SELECT * FROM

_VIEW_PIVOTSOURCEROUTINEDATAOU3ALL’); )

Using the odc above, refreshing the pivottable causes a copy from the

csv file into the pivot cache which seems pretty quick (4-5 seconds) -

certainly quicker than pulling it via tcp-odbc-jdbc or whatever.

Note that worksheet can have more than one pivottable and more than

one pivotcache and that more than one pivottable can share the same

pivot cache. I haven’t gone into those details.

So the upside is local pivottable refresh is quick and appending

monthly csv snippets is trivial. Another positive spinoff in havng

the datset as a local csv is that you can also pivot with all kinds of

other tools other than excel (ranging from primitive awk to fancy ‘r’

reshaping - both of which I think I prefer to excel :slight_smile: ). But

managing this setup is fairly straightforward - 3 files: the excel

file, the odc and the csv files which grow with monthly updates (I

know there must also be a facility to amend changed data but we’ll

cross that bridge …) . And there are of course downsides.

These csv files are really pretty big and hugely sub optimal. The

normalization of the database has been flattened out for convenience

of pivoting so we are dealing with tens of thousands of repeated

strings (datelement names, ou names etc). From a transport

perspective this is maybe not too bad - I’m guessing a reasonable

compression might factor out some of this. And appending monthly

chunks wouldn’t be too bad. But they are pretty big files to manage.

But I think they are maybe more robust than depending solely on the

pivotcache as a local primary store.

That’s it for now. Thought to sleep on is whether we can’t somehow

just maintain a local database which replicates the necessary tables

from the online server. But its complicated. Maintaining a couple

(hopefully not too many) csv files might be easier.

Nice. Explicitly separating presentation from data with csv sounds more

manageable. I think we can live with “yearly” pivot tables, in that these

things won’t keep growing forever.

Bob

PS. The caveat. My csv source for pivottables sort of works but all

my datavalues are being treated as zero :frowning: I think this is probably

to do with setting the datatype of the field somehow. I think its

fixable.

Might be related to the fact that DataValue.value is a String. Maybe Ola has

some experience?

Yes, it has to do with interpreting fields as numeric rather than
string - the csv driver defaults to assuming strings. I haven't
figured out the best place to tell it otherwise yet. But I think Ola
has ...
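(One candidate place, if the csv is going through the Microsoft text ODBC driver as in the odc above: that driver reads column types from a schema.ini file sitting in the same directory as the csv. The column names and positions below are guesses at the view's layout, not the real thing:)

```ini
; schema.ini next to test.csv - column names/positions are guesses
[test.csv]
ColNameHeader=True
Format=CSVDelimited
Col1=dataelement Text
Col2=orgunit Text
Col3=period Text
Col4=value Float
```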

Yes, I cast the Value field to a number using the val() function,
val(Value). I modified the query used by the pivot table in:
Pivot Tables top menu -> Change Data Source -> Connection
Properties -> Definition -> Command text
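(So instead of just the csv file name, the command text becomes a query - something along these lines, with the column names guessed rather than taken from the actual view:)

```sql
-- command text: cast the text-typed Value column to a number;
-- column names are guesses, the file path is from the odc above
SELECT dataelement, orgunit, period, VAL(Value) AS Value
FROM `C:\Users\bobj\Documents\excel\test.csv`
```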

I'm a bit confused. The 'command text' in my connections is just the
csv file name.

···

2010/10/27 Ola Hodne Titlestad <olati@ifi.uio.no>:

Not sure where this query is stored as these changes did not show in the
.odc after saving the Excel file.
There is probably connection information also in Excel somewhere, since the
query is on top of the .odc.
If possible, it sounds like a better deal to modify this in the .odc file
and just make sure Excel gets the data in the correct format.
----------------------------------
Ola Hodne Titlestad (Mr)
HISP
Department of Informatics
University of Oslo

Mobile: +47 48069736
Home address: Vetlandsvn. 95B, 0685 Oslo, Norway.
