Multidimensional data analysis and Postgresql scripts

Hi there. Lars set me off on a winding path the other day, and asked me to whip up some SQL for the representation of a specific structure of data.

I have added two Postgresql scripts to the resources/sql folder which contain a partial solution to two separate problems. Lars and I have had many mails back and forth privately, but now that I have some results, I thought I would try and describe what has been done.

First, the ct_source_period_from_deid.sql script will transform the datavalue table for a given data element into a different form, with categoryoptioncombos on a seperate column. . I needed to constrain it to some dimension, as there is no way to know how many columns we would end up with, as the table is crosstabbed now essentially on the category options. Without this contrstraint, if you tried to crosstab the entire table, you might end up with a very “wide” table with too many columns. Anyway, hopefully this meets Lars needs, but this procedure would of course need to be generalized to Java/Hibernate.

As as example of how to execute this script, just so "SELECT * FROM create_ct_source_period(36) where 36 is a given data element id. A new table will be created _ct_source_period_36 which will contain all datavalues for each source/period combination with a seperate column for each category option.

The second (interesting) by product of this little journey into the depths of the DHIS2 data model, is a second script to decompose the datavalue table into an entity-attribute-value representation. This has been a nagging problem with me for several years when dealing with data models such as the one that DHIS employs, namely how to decompose the data into a format that can then be pulled into an aribtrary crosstab function. This script will do this for a given dataelement, sourceid combination. Again, I have employed somewhat of an artifical contraint here, as of course you would likely want to combine multiple dataelements, orgunits and periods, likely with some sort of filters as well. I think this could be achieved fairly easily with some modifications to the script or better yet through some other procedural login, but since we are staying to try database agnostic here, I really wanted to make this more as an example of we can achieve this representation of the data. Once this script is executed with the right parameters, a separate data table is generated _eav_dataset that looks something like this…

“0”;“dataelementname”;“Fully Immunized child”
“0”;“enddatte”;“2008-01-31”
“0”;“Fixed+Outreach”;“Outreach”
“0”;“orgunitname”;“Gbonkobana CHP”

“0”;“startdate”;“2008-01-01”
“0”;“Under12mths+12to59mths”;“12 to 59 mths”
“1”;“dataelementname”;“Fully Immunized child”
“1”;“enddatte”;“2008-01-31”

“1”;“Fixed+Outreach”;“Outreach”
“1”;“orgunitname”;“Gbonkobana CHP”
“1”;“startdate”;“2008-01-01”
“1”;“Under12mths+12to59mths”;“< 12 mths”

“1”;“value”;“15”

So, various “dimensions” have been transformed into separate attributes and value pairs, and assigned an arbitrary ObjectID. EAV is a useful data representation, with advantages and disadvantages.

Why would anyone want to see the data like this? Well, the reason is that the “crosstab” function of Postgresql likes this type of data, and can generate an crosstab table on any dimension once the data is represented in this format.

Once I get my data looking like this, I can utilize the crosstab function of postgres like this…

SELECT * from crosstab( 'select objectid::text, attribute, value
from _eav_dataset where attribute = ‘‘orgunitname’’

or attribute = ‘‘dataelementname’’
or attribute = ‘‘startdate’’
or attribute = ‘‘Under12mths+12to59mths’’
or attribute = ‘‘value’’
ORDER BY 1,2,3’)

as ct(row_name text, dataelementname text, orgunit text, startdate text, age text, value text)

which returns this

“0”;“Fully Immunized child”;“Gbonkobana CHP”;“2008-01-01”;“12 to 59 mths”;“”

“1”;“Fully Immunized child”;“Gbonkobana CHP”;“2008-01-01”;“< 12 mths”;“15”
“10”;“Fully Immunized child”;“Gbonkobana CHP”;“2008-02-01”;“12 to 59 mths”;“2”

“11”;“Fully Immunized child”;“Gbonkobana CHP”;“2008-02-01”;“< 12 mths”;“4”

Of course, the crosstab query could be rewritten in essentially any way, with different dimensions as columns as needed.

So, in conclusion, I just wanted to bring this up the the devs/community for more discussion.

Best regards,
Jason