DataElement -> PeriodType association

Lars · 13 March 2009 06:52

Hi,

I see the need for introducing an association between DataElement and PeriodType in the DHIS 2 data model. Currently you can look up the PeriodType of the DataSet of which the DataElement is a member, but this is not enforced as a DataElement can be a member of many DataSets. The need is based on a few new requirements:

Sierra Leone: Regression analysis where missing values are left out of the aggregated indicator value. In order to define a missing value for a data element, we need to know the PeriodType of the Periods to look for.
South Africa: Gap analysis. In order to define a gap, we need to know the PeriodType of the Periods to look for.
South Africa: Alignment of the DHIS 2 and DHIS 1.4 data model which is necessary for migration and data dictionary use.
India: Aggregated data export. We must know the PeriodType of the Periods that should be used to aggregate data for each DataElement in order to avoid duplication of data on higher levels.
General: Improved performance in datamart as you can reduce the number of crosstabulated Periods.
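
To make the gap-analysis requirement concrete, here is a minimal Java sketch, assuming a data element whose PeriodType is known to be monthly. The names `GapSketch` and `findMissingMonths` are hypothetical and not part of the DHIS 2 API; the only point is that the expected periods cannot be enumerated unless the period type of the data element is known.

```java
import java.time.YearMonth;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not DHIS 2 code: gap detection for a data element
// whose PeriodType is assumed to be monthly.
class GapSketch {

    // Returns every month in [first, last] for which no value was reported.
    static List<YearMonth> findMissingMonths(YearMonth first, YearMonth last,
                                             Set<YearMonth> reported) {
        List<YearMonth> gaps = new ArrayList<>();
        for (YearMonth m = first; !m.isAfter(last); m = m.plusMonths(1)) {
            if (!reported.contains(m)) {
                gaps.add(m);
            }
        }
        return gaps;
    }
}
```

For a quarterly data element the same loop would have to step by three months, so the step size depends on the PeriodType of the data element.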

The downside of this is that all current DHIS 2 databases must be updated by assigning a PeriodType to each DataElement. Also this will affect places where data is captured with different frequencies for the same DataElement (this might be considered a bad practice anyway).

As an intermediate step we could disable this functionality for databases where DataElements registered for DataSets with different PeriodTypes exist.

I would prefer introducing this association. Please comment on this.

regards, Lars

bobj · 14 March 2009 22:03

Hi Lars

If I read you correctly, you are effectively moving the association
down from the dataset to the dataelement. I do agree, this looks like
a move in the right direction. From: <dataset name='aDataset'
periodtype='xxx'> to <dataelement name='aDataelement'
periodtype='xxx'>.

I can't help wondering at the merits of taking it a step further, ie
associating the periodtype with the datavalue:
<dataelement name='aDataelement'>
   <value periodtype='xxx' >23</value>
   <value periodtype='yyy' >2</value>
   <value periodtype='xxx' >3</value>
</dataelement>

This would obviously be more costly in terms of efficiency (an extra
FK field per value rather than per element) but with the upside that
you can flexibly cater for the 'bad practice'. I am too inexperienced
in the domain to know whether the added flexibility would justify the
cost, and what does or does not constitute good practice, but it is
perhaps something to consider. In the end what you gain on the swings
you lose on the roundabouts.

Regards
Bob


bobj · 14 March 2009 22:05

Hi Calle

Hi,

I would strongly support this – it’s been part of the 1.4 data model from
the start, and I think it was a mistake to weaken this in the 2.0 data model
(I’m saying “weaken” because this type of constraint should be regarded as a
form of “strong data typing”) due to lack of standardised data collection
frequencies in Vietnam. As far as I can recall from previous discussions
with Ola around this, the initial requirement decision in 2.0 to allow
multiple data collection frequencies for single data elements was caused by
the fact that there was no consistency between districts in Vietnam – some
collected monthly data, some collected the data quarterly.

In 1.4, we are regularly introducing new “layers” of user-defined
constraints, in order to reduce the possibility of errors but also to
enhance analysis and customisation. Two of the recent ones are the perceived
need for DataElementAndIndicatorGroup SETS, and the need to fully support
multiple Organisational Hierarchies with multiple OU level names etc.

By the way, two other pieces of news that relate to what I’ve previously
denoted as “DHIS 3” (the next generation DHIS which incorporates both 1.4
modules and dhis 2):

- We recently got the source code (written in Delphi) to the HR
Admin, a human resource database developed here in SA. Its release was
predicated on a Memorandum of Understanding between the developers and HISP.
The developers have also designed a new more limited version of it for
Nigeria (mostly using C# and .NET) that will be interfaced to DHIS 1.4 (and
thus indirectly to DHIS 2).

- The new DHIS Referral Module – a hybrid of the CORE and PAT
modules with specific functionality to capture and analyse patient referral
data – will be ready for piloting in early April.

- The DHIS Client Satisfaction Survey module has finally been
upgraded from 1.3 to 1.4.

- The use of DHIS 2 for the national South African data dictionary
seems to be successful, even if there’s considerable work remaining to
actually create and refine all the relevant dictionaries.

- The Debo ART patient module is more or less ready for piloting in
the Eastern Cape province (this version of Debo is customised to handle the
so-called “adult clinical record” used in the Eastern Cape).

Would love to get an update on what has happened with Debo
customisation. Where can I get it?

Regards
Bob

···

2009/3/14 Calle Hedberg <chedberg@telkomsa.net>:

Regards

Calle


Lars · 17 March 2009 08:38

If I read you correctly, you are effectively moving the association
down from the dataset to the dataelement.

Actually this implies keeping the DataSet-PeriodType association as you need it to enforce that only DataElements with corresponding PeriodTypes are members.

Lars

bobj · 18 March 2009 12:30

Hi Lars

OK, after much poking around the schema, I realize that I have
misinterpreted something. You have to impose one of two constraints.
Either:
1. A DataElement can only (must) be a member of one DataSet; or
2. A DataElement can only (must) have one PeriodType.

And also a DataValue already has a period associated, so there is no
need to also associate the PeriodType explicitly - ie. my suggestion
above is nonsense.

If you could impose (1) it would be cleaner as this would implicitly
enforce (2) but I can see it might be more disruptive to the existing
corpus of data.

Though I am still nagged by doubt that there seems to be some
redundancy in your proposal and also that it will be hard to enforce
constraints at the database level ie. you require a well behaved
application. And in the end it is the datavalues which are to be
aggregated, reported on etc.

If we cannot impose (1), or we shouldn't or we don't want to, then is
it instead possible to think of a "fully qualified" DataElement as
DataSet::DataElement? Practically this means having DataSet,
DataElement and Period association with the DataValue. The advantage
would be that we still only need to associate the PeriodType with the
DataSet.

Of course this probably does not work if DataElements are meant to be
aggregated across different DataSets as it assumes a DataElement in
one DataSet is not the same apple or orange as the same DataElement in
another DataSet. If this is the case, then I do agree, the only
solution is to make the schema change you suggest.

Regards
Bob

PS. I am still looking for sample datasets. I tried the link at
http://208.76.222.114/confluence/display/DHIS2/Downloads but it is
broken. Can anyone please point me to some?


Lars · 18 March 2009 12:51

OK, after much poking around the schema, I realize that I have
misinterpreted something. You have to impose one of two constraints.
Either:
1. A DataElement can only (must) be a member of one DataSet; or
2. A DataElement can only (must) have one PeriodType.

True.

If we cannot impose (1), or we shouldn’t or we don’t want to, then is
it instead possible to think of a “fully qualified” DataElement as
DataSet::DataElement? Practically this means having DataSet,
DataElement and Period association with the DataValue. The advantage
would be that we still only need to associate the PeriodType with the
DataSet.

Enforcing 1) cannot be done, as data elements frequently appear in multiple datasets, which is a part of the HISP “approach”.

PS. I am still looking for sample datasets. I tried the link at
http://208.76.222.114/confluence/display/DHIS2/Downloads but it is
broken. Can anyone please point me to some.

You can find a postgres backup of the sample database here: http://folk.uio.no/larshelg/files/dhis2sample.backup.

I will create a mysql dump later today and let you know:)

Lars

bobj · 18 March 2009 13:18

Hi

OK, after much poking around the schema, I realize that I have
misinterpreted something. You have to impose one of two constraints.
Either:
1. A DataElement can only (must) be a member of one DataSet; or
2. A DataElement can only (must) have one PeriodType.

True.

If we cannot impose (1), or we shouldn't or we don't want to, then is
it instead possible to think of a "fully qualified" DataElement as
DataSet::DataElement? Practically this means having DataSet,
DataElement and Period association with the DataValue. The advantage
would be that we still only need to associate the PeriodType with the
DataSet.

Enforcing 1) cannot be done, as data elements frequently appear in multiple
datasets, which is a part of the HISP "approach".

Then we must indeed do as you suggest.

Regards
Bob


Lars · 18 March 2009 16:16

MySQL dump of sample data can be found here:

http://folk.uio.no/larshelg/files/dhis2sample.zip

bobj · 19 March 2009 14:10

Hi all

I am still grappling a bit to make complete sense of the data model.
I would be grateful if someone with greater domain knowledge might
enlighten me on my query below.

OK, after much poking around the schema, I realize that I have
misinterpreted something. You have to impose one of two constraints.
Either:
1. A DataElement can only (must) be a member of one DataSet; or
2. A DataElement can only (must) have one PeriodType.

True.

If we cannot impose (1), or we shouldn't or we don't want to, then is
it instead possible to think of a "fully qualified" DataElement as
DataSet::DataElement? Practically this means having DataSet,
DataElement and Period association with the DataValue. The advantage
would be that we still only need to associate the PeriodType with the
DataSet.

Enforcing 1) cannot be done, as data elements frequently appear in multiple
datasets, which is a part of the HISP "approach".

OK. I've taken a really long look at the schema and here's my first
pass at understanding the HISP approach. It seems to me that if data
elements appear in different datasets, there is therefore no way to be
able to link any particular datavalue with a dataset.

For example if there is DataSet DS_a with DataElements DE_a, DE_b and DE_c.

And there is DataSet DS_b with DataElements DE_a and DE_d.

If we were to try and export DataValues from dataset DS_a, the best we
can do is identify all DataValues with DataElement types DE_a, DE_b
and DE_c. There will be overlap with the elements of DS_b. Put
differently, if there are 100 values in total and we export DS_a and
DS_b we might end up with 110 values.

The DataValue only knows its source, dataelement and period, not the
dataset it belongs to.
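
The overlap can be sketched in Java as follows. This is an illustration only: `ExportOverlapSketch` and its members are made-up names rather than DHIS 2 classes, and the counts are much smaller than the 100 and 110 used in the example above.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical sketch, not the DHIS 2 classes: a value knows its source,
// data element and period, but not the data set it was entered through.
class ExportOverlapSketch {

    static final class DataValue {
        final String source;
        final String dataElement;
        final String period;
        final String value;

        DataValue(String source, String dataElement, String period, String value) {
            this.source = source;
            this.dataElement = dataElement;
            this.period = period;
            this.value = value;
        }
    }

    // The best available "export of a data set": every stored value whose
    // data element is a member of that set.
    static List<DataValue> export(Set<String> members, List<DataValue> all) {
        return all.stream()
                  .filter(v -> members.contains(v.dataElement))
                  .collect(Collectors.toList());
    }
}
```

With three stored values for DE_a, DE_b and DE_d, exporting DS_a (DE_a, DE_b, DE_c) and DS_b (DE_a, DE_d) yields two values each, four in total, because the DE_a value is picked up by both exports.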

In fact the idea of a DataSet becomes quite a "weak" one for grouping
datavalues.

In which case the only sensible way of identifying sets of datavalues
is by organisation unit and/or by period. Is this correct?

Regards
Bob


Lars · 19 March 2009 14:39

OK. I’ve taken a really long look at the schema and here’s my first
pass at understanding the HISP approach. It seems to me that if data
elements appear in different datasets, there is therefore no way to be
able to link any particular datavalue with a dataset.

For example if there is DataSet DS_a with DataElements DE_a, DE_b and DE_c.

And there is DataSet DS_b with DataElements DE_a and DE_d.

If we were to try and export DataValues from dataset DS_a, the best we
can do is identify all DataValues with DataElement types DE_a, DE_b
and DE_c. There will be overlap with the elements of DS_b. Put
differently, if there are 100 values in total and we export DS_a and
DS_b we might end up with 110 values.

The DataValue only knows its source, dataelement and period, not the
dataset it belongs to.

In fact the idea of a DataSet becomes quite a “weak” one for grouping
datavalues.

In which case the only sensible way of identifying sets of datavalues
is by organisation unit and/or by period. Is this correct?

All of this is correct.

Though there is no data model enforcement, one orgunit cannot enter data for the same data element in different data sets. This would simply overwrite the existing data, which again implies that we won’t end up with duplicates.
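
A minimal Java sketch of why the second entry overwrites rather than duplicates, assuming a value is identified by (source, data element, period) and that the data set is not part of that key. `DataValueStoreSketch` is a hypothetical stand-in, not the DHIS 2 storage layer.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: values are assumed to be identified by
// (source, data element, period); the data set is not part of the key.
class DataValueStoreSketch {

    private final Map<List<String>, String> values = new HashMap<>();

    // Saving through any data set writes to the same key, so a second
    // entry replaces the first instead of creating a duplicate.
    void save(String source, String dataElement, String period, String value) {
        values.put(List.of(source, dataElement, period), value);
    }

    String get(String source, String dataElement, String period) {
        return values.get(List.of(source, dataElement, period));
    }

    int size() {
        return values.size();
    }
}
```

Saving DE_a for the same orgunit and period twice, once via each data set, leaves a single stored value holding the later entry.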

bobj · 19 March 2009 14:49

All of this is correct.

Though there is no data model enforcement, one orgunit cannot enter data for
the same data element in different data sets. This would simply overwrite
the existing data, which again implies that we won't end up with duplicates.

OrgUnit+children? So for India or for South Africa or Sri Lanka data
elements are "effectively" only members of one dataset?


Lars · 19 March 2009 15:05

So for India or for South Africa or Sri Lanka data
elements are “effectively” only members of one dataset?

Not quite, a data element can be a member of many data sets, but those data sets cannot be entered by the same orgunit. I.e. data sets containing common data elements should be assigned to different orgunits in the hierarchy.

bobj · 20 March 2009 11:23

It seems to me that adding a PeriodType to the DataElement is
definitely redundant.

DataElement should always be a member of at least one DataSet - and
there is already an (implicit) constraint that DataElements of a
DataSet will be of the same PeriodType.

What is probably required (if it doesn't yet exist) is an application
enforced constraint that a DataElement can only be a member of
different DataSets which share the same PeriodType. Given that some
of those do exist (the bad practice) then it is considerably better to
rename (and re-id) the dataElement. This will anyway be necessary if
DataElements become directly associated with PeriodType.

I think the only other thing which is achieved by associating the
PeriodType with the DataElement is that it would allow for DataSets
which have a heterogeneous mix of DataElement PeriodTypes, which I don't
think is a design goal.

Regards
Bob


Lars · 20 March 2009 11:38

It seems to me that adding a PeriodType to the DataElement is
definitely redundant.

DataElement should always be a member of at least one DataSet - and
there is already an (implicit) constraint that DataElements of a
DataSet will be of the same PeriodType.

What is probably required (if it doesn’t yet exist) is an application
enforced constraint that a DataElement can only be a member of
different DataSets which share the same PeriodType. Given that some
of those do exist (the bad practice) then it is considerably better to
rename (and re-id) the dataElement. This will anyway be necessary if
DataElements become directly associated with PeriodType.

Yes, but this is exactly why we want to do this - create the mentioned application enforced constraint. Firstly, there is no constraint saying a DataElement MUST be a member of a DataSet. Secondly, we are making things a whole lot easier by making this association explicit; think of assigning data elements to datasets (is the DataElement already a member of another DataSet with a different PeriodType?), gap analysis, regression analysis, datamart (which PeriodType is the DataElement associated with?), alignment with the legacy DHIS 1.4 model (how do we manage import?).

I agree that there is a slight redundancy associated with this, but I think what we gain in regard to simplicity and performance exceeds the minor cost of this association.

I think the only other thing which is achieved by associating the
PeriodType with the DataElement is that it would allow for DataSets
which have a heterogeneous mix of DataElement PeriodTypes, which I don’t
think is a design goal.

This is definitely something we don’t want to allow.

bobj · 20 March 2009 12:25

Yes but this is exactly why we want to do this - create the mentioned
application enforced constraint. Firstly there is no constraint saying a
DataElement MUST be a member of a DataSet. Secondly we are making things a
whole lot easier by making this association explicit; think of assigning
data elements to datasets (is the DataElement already a member of another
DataSet with a different PeriodType?), gap analysis, regression analysis,
datamart (which PeriodType is the DataElement associated with?), alignment
with the legacy DHIS 1.4 model (how do we manage import?).

I agree that there is a slight redundancy associated with this, but I think
what we gain in regard to simplicity and performance exceeds the minor cost
of this association.

There is not much to be gained in terms of simplicity if you drive the
functionality down to the data model API. The way you make the
relationship explicit is to provide the DataElement class with a
method called getPeriodType(). The detail of how it implements that
is invisible to the user of the method. True the implementation of
getPeriodType() will be slightly less efficient, but how many times
will it be called in a data intensive operation anyway? In most cases
you could probably call it once in the constructor and reuse the
result as much as you can.

I'm thinking that DataElements which are not members of any DataSet
should probably be considered as "inactive" or not yet assigned in the
application (I might be wrong - see below). Assigning a DataElement
to a DataSet would involve calling getPeriodType() on it. If it
returns NULL then the assignment will always succeed. If it returns a
PeriodType then it will succeed if the DataSet you are assigning to is
of the same PeriodType():

DataSet::assignDataElement(DataElement de)
{
   // NULL means the element is not yet a member of any DataSet
   PeriodType pt = de.getPeriodType();
   if (pt != NULL && pt != this.getPeriodType()) {
       throw new PeriodTypeMisMatch();
   }

   // etc ...
}
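
A fuller, compilable Java version of the same check might look like the sketch below. It rests on assumptions that are not in the current model: the classes are simplified stand-ins rather than the DHIS 2 classes, the exception name is made up, and the element is assumed to be able to see the sets it belongs to, so that getPeriodType() can be derived instead of stored.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the check described above; these are simplified
// stand-ins, not the DHIS 2 model classes.
class AssignSketch {

    enum PeriodType { MONTHLY, QUARTERLY, YEARLY }

    static class PeriodTypeMismatchException extends RuntimeException {
        PeriodTypeMismatchException(String message) { super(message); }
    }

    static class DataElement {
        final String name;
        // Assumption: the element can see the sets it belongs to.
        final Set<DataSet> dataSets = new HashSet<>();

        DataElement(String name) { this.name = name; }

        // Derived, not stored: null while the element is unassigned.
        PeriodType getPeriodType() {
            return dataSets.isEmpty() ? null : dataSets.iterator().next().periodType;
        }
    }

    static class DataSet {
        final String name;
        final PeriodType periodType;
        final Set<DataElement> members = new HashSet<>();

        DataSet(String name, PeriodType periodType) {
            this.name = name;
            this.periodType = periodType;
        }

        // Rejects the element if it already has a different period type.
        void assignDataElement(DataElement de) {
            PeriodType pt = de.getPeriodType();
            if (pt != null && pt != periodType) {
                throw new PeriodTypeMismatchException(
                    de.name + " is " + pt + ", " + name + " is " + periodType);
            }
            members.add(de);
            de.dataSets.add(this);
        }
    }
}
```

Because every successful assignment goes through the check, all sets an element belongs to share one period type, so reading it from any one of them is sufficient in this sketch.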

Of course all of this does fall apart if you are saying that it is
acceptable that active DataElements can indeed be legitimately not
part of any DataSet. Is this the case or is it to be considered a
data integrity problem?

My fear of introducing redundancy in the datamodel is driven more by
the fear of introducing accidental complexity. You can always make a
relational database more efficient (and arguably simpler) by removing
the relations or just using a flat file, but it is rarely a good idea.
If what you really require is a getPeriodType() method on a DataElement
and that information can already be obtained from the current schema
then I think it is better not to duplicate in the schema.

Forgive me if I am misinterpreting the rationale of having DataSets.
It's possible.

Regards
Bob


Lars · 20 March 2009 12:51

There is not much to be gained in terms of simplicity if you drive the
functionality down to the data model API. The way you make the
relationship explicit is to provide the DataElement class with a
method called getPeriodType(). The detail of how it implements that
is invisible to the user of the method. True the implementation of
getPeriodType() will be slightly less efficient, but how many times
will it be called in a data intensive operation anyway? In most cases
you could probably call it once in the constructor and reuse the
result as much as you can.

Yes, I agree that putting this in the API makes it more usable - but currently there is no association from DataElement → DataSet, only DataSet → DataElement, meaning you cannot access a DataElement’s DataSets directly without using a query. Of course we could make this association bi-directional, but then again this involves another association, which you want to avoid.

I’m thinking that DataElements which are not members of any DataSet
should probably be considered as “inactive” or not yet assigned in the
application (I might be wrong - see below). Assigning a DataElement
to a DataSet would involve calling getPeriodType() on it. If it
returns NULL then the assignment will always succeed. If it returns a
PeriodType then it will succeed if the DataSet you are assigning to is
of the same PeriodType():

DataSet::assignDataElement(DataElement de)
{
   PeriodType pt = de.getPeriodType();
   if (pt != NULL && pt != this.getPeriodType()) {
       throw new PeriodTypeMisMatch();
   }

   // etc ...
}

Of course all of this does fall apart if you are saying that it is
acceptable that active DataElements can indeed be legitimately not
part of any DataSet.

A DataElement can exist perfectly without a DataSet. On the other hand without a DataSet there wouldn’t be any data registered for it, and things like gap analysis, regression analysis become irrelevant. It would be a “soft” data integrity problem, but not a “database error”.

My fear of introducing redundancy in the datamodel is driven more by
the fear of introducing accidental complexity.

I still believe deriving PeriodType from DataSet would be more complex; think of a scenario where you, e.g., want to edit a monthly DataSet and list all available “monthly” DataElements. How do you query this?

bobj · 20 March 2009 13:30

Yes I agree that putting this in the API makes it more usable - but
currently there is no association from DataElement -> DataSet, only DataSet
-> DataElement, meaning you cannot access a DataElement's DataSets directly
without using a query. Of course we could make this association
bi-directional but then again this involves another association, which you
want to avoid.

Agreed. The modal scenario is that you have a DataSet and you want to
list DataElements. The reverse scenario, finding the DataSet (in fact
Sets) which an Element is a member of, is less common and doesn't
justify a direct lookup.

But yes, it would be very useful to have a method in the DataElement
API something like a

Collection<DataSet> DataElement::getDataSets()

Of course all of this does fall apart if you are saying that it is
acceptable that active DataElements can indeed be legitimately not
part of any DataSet.

A DataElement can exist perfectly without a DataSet. On the other hand
without a DataSet there wouldn't be any data registered for it, and things
like gap analysis, regression analysis become irrelevant. It would be a
"soft" data integrity problem, but not a "database error".

So that is ok. It can exist but doesn't really become relevant unless
its part of at least one set. So I guess it doesn't need to have a
PeriodType associated with it while it is in this state.

My fear of introducing redundancy in the datamodel is driven more by
the fear of introducing accidental complexity.

I still believe deriving PeriodType from DataSet would be more complex;
think of a scenario where you, e.g., want to edit a monthly DataSet and list all
available "monthly" DataElements. How do you query this?

You would simply return all DataElements which are members of
"monthly" DataSets (easy) plus the unassigned ones (less easy).

You are right that not having a reverse association makes selecting
all the "unassigned" DataElements a bit tricky. Perhaps the most
robust solution is to have a default unassigned DataSet - with a NULL
periodType. All new DataElements start life in this DataSet. This
makes finding the available dataElements pretty trivial and we can
also rigorously enforce that a dataElement MUST be a member of a
dataSet which is a good thing.
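
The "members of monthly DataSets plus the unassigned ones" lookup can be sketched in Java as below. This is only an illustration under stated assumptions: data sets are modelled as plain maps navigable from DataSet to DataElement only, period types are plain strings, and `AvailableElementsSketch` is a hypothetical name, not a DHIS 2 class.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: data sets are modelled as a name-to-members map plus
// a name-to-period-type map, navigable only from DataSet to DataElement.
class AvailableElementsSketch {

    // Elements that may be added to a data set of the given period type:
    // members of sets with that type, plus elements in no set at all.
    static Set<String> available(String periodType,
                                 List<String> allElements,
                                 Map<String, Set<String>> membersBySet,
                                 Map<String, String> periodTypeBySet) {
        Set<String> matching = new LinkedHashSet<>();
        Set<String> assigned = new LinkedHashSet<>();
        for (Map.Entry<String, Set<String>> e : membersBySet.entrySet()) {
            assigned.addAll(e.getValue());
            if (periodType.equals(periodTypeBySet.get(e.getKey()))) {
                matching.addAll(e.getValue());
            }
        }
        for (String de : allElements) {
            if (!assigned.contains(de)) {
                matching.add(de); // unassigned elements have no period type yet
            }
        }
        return matching;
    }
}
```

Finding the unassigned elements needs a pass over every data set, which is the "less easy" part; a default unassigned DataSet as suggested above would remove that pass.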

You might be right on the performance of dataMart etc. It will really
come down to how often you have to call getPeriodType() which
shouldn't be too much - the real data crunching meat is in the
datavalues not the dataelements I think. If indeed it proves very
costly then it might justify hacking the schema, but I think we
should try implementing at the API level first. Reimplementing
DataElement::getPeriodType() can be done if necessary.

Cheers
Bob


bobj · 20 March 2009 13:47

There is not much to be gained in terms of simplicity if you drive the
functionality down to the data model API. The way you make the
relationship explicit is to provide the DataElement class with a
method called getPeriodType(). The detail of how it implements that
is invisible to the user of the method. True the implementation of
getPeriodType() will be slightly less efficient, but how many times
will it be called in a data intensive operation anyway? In most cases
you could probably call it once in the constructor and reuse the
result as much as you can.
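The "call it once and reuse the result" idea could look something like the sketch below. `PeriodTypeCache` is a hypothetical helper, DataElements are identified by name, and the lookup function stands in for whatever (possibly slow) implementation sits behind getPeriodType().

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

enum PeriodType { MONTHLY, YEARLY }

// Hypothetical helper for a data intensive job: each DataElement's
// PeriodType is resolved once and then served from memory.
class PeriodTypeCache {
    private final Function<String, PeriodType> lookup; // potentially slow
    private final Map<String, PeriodType> cache = new HashMap<>();
    private int lookups = 0;

    PeriodTypeCache(Function<String, PeriodType> lookup) {
        this.lookup = lookup;
    }

    PeriodType get(String dataElement) {
        // containsKey rather than computeIfAbsent, so that a null result
        // (element in no DataSet) is cached as well.
        if (!cache.containsKey(dataElement)) {
            lookups++;
            cache.put(dataElement, lookup.apply(dataElement));
        }
        return cache.get(dataElement);
    }

    // Number of times the underlying lookup was actually invoked.
    int getLookups() {
        return lookups;
    }
}
```

However many data values are processed, the underlying lookup then runs once per distinct DataElement.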

Yes I agree that putting this in the API makes it more usable - but
currently there is no association from DataElement -> DataSet, only DataSet
-> DataElement, meaning you cannot access a DataElement's DataSets directly
without using a query. Of course we could make this association
bi-directional but then again this involves another association, which you
want to avoid.

Agreed. The modal scenario is that you have a DataSet and you want to
list DataElements. The reverse scenario, finding the DataSet (in fact
Sets) which an Element is a member of, is less common and doesn't
justify a direct lookup.

But yes, it would be very useful to have a method in the DataElement
API something like a

Collection<DataSet> DataElement::getDataSets()

On second pass this is actually quite trivial to implement because we
have the go-between DataSetMembers(?). Finding the DataSets
associated with a DataElement should be as straightforward as finding
the DataElements associated with a DataSet.
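To illustrate why the lookup is symmetric at the mapping level, here is a small in-memory sketch of a membership table. The `DataSetMembers` class and its row layout are assumptions made for the example, not the actual DHIS 2 schema or object model.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for a DataSet/DataElement mapping table.
class DataSetMembers {
    private static final class Row {
        final String dataSet;
        final String dataElement;

        Row(String dataSet, String dataElement) {
            this.dataSet = dataSet;
            this.dataElement = dataElement;
        }
    }

    private final List<Row> rows = new ArrayList<>();

    void add(String dataSet, String dataElement) {
        rows.add(new Row(dataSet, dataElement));
    }

    // Forward direction: DataSet -> DataElements.
    List<String> getDataElements(String dataSet) {
        List<String> result = new ArrayList<>();
        for (Row row : rows) {
            if (row.dataSet.equals(dataSet)) {
                result.add(row.dataElement);
            }
        }
        return result;
    }

    // Reverse direction: DataElement -> DataSets, same scan on the other column.
    List<String> getDataSets(String dataElement) {
        List<String> result = new ArrayList<>();
        for (Row row : rows) {
            if (row.dataElement.equals(dataElement)) {
                result.add(row.dataSet);
            }
        }
        return result;
    }
}
```

Both directions are the same filter over the same rows; whether the object model exposes the reverse direction is a separate question.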


Lars · 20 March 2009 14:24

You are right that not having a reverse association makes selecting all the “unassigned” DataElements a bit tricky. Perhaps the most robust solution is to have a default unassigned DataSet - with a NULL periodType. All new DataElements start life in this DataSet. This makes finding the available dataElements pretty trivial and we can also rigorously enforce that a dataElement MUST be a member of a dataSet which is a good thing.

This will be a complexity trade off between data model / database schema and application logic. Getting DataElements for a PeriodType will definitely involve a more complex and slower query or additional application logic.

As for accidental complexity / future implications the DataElement → PeriodType association has been in use in DHIS 1.4 for several years and proved to be working. What do we know about this approach?

You might be right on the performance of dataMart etc. It will really come down to how often you have to call getPeriodType() which shouldn’t be too much - the real data crunching meat is in the datavalues not the dataelements I think. If indeed it proves very costly then it might justify hacking the schema, but I think we should try implementing at the API level first. Reimplementing DataElement::getPeriodType() can be done if necessary.

I guess Hibernate will cache the DataSet lookup anyway. But remember we cannot implement getPeriodType in the DataElement object if there is no association DataElement → DataSet.

On second pass this is actually quite trivial to implement because we have the go-between DataSetMembers(?). Finding the DataSets associated with a DataElement should be as straightforward as finding the DataElements associated with a DataSet.

Yes, but “datasetmembers” is a mapping table in the database; we still need an association in the object model.

The “implicit” approach can be done, but it involves a DataElement → DataSet association and a more complex query/programming model. (If the DataElement → DataSet association is omitted, it takes a call to the service layer to get the PeriodType for a DataElement.)
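A rough sketch of the service-layer variant of the “implicit” approach, where the PeriodType is found by scanning DataSets rather than through a DataElement → DataSet association. `PeriodTypeService` and its methods are hypothetical names, DataElements are identified by name, and the handling of elements captured with different frequencies is one possible choice among several.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

enum PeriodType { MONTHLY, QUARTERLY, YEARLY }

// Simplified stand-in: only the DataSet -> DataElement direction exists.
class DataSet {
    final PeriodType periodType;
    final Set<String> members = new HashSet<>();

    DataSet(PeriodType periodType, String... elements) {
        this.periodType = periodType;
        for (String element : elements) {
            members.add(element);
        }
    }
}

// Hypothetical service; a real one would query the database instead.
class PeriodTypeService {
    private final List<DataSet> dataSets = new ArrayList<>();

    void addDataSet(DataSet dataSet) {
        dataSets.add(dataSet);
    }

    // All PeriodTypes of the DataSets the element is a member of.
    Set<PeriodType> getPeriodTypes(String dataElement) {
        Set<PeriodType> types = EnumSet.noneOf(PeriodType.class);
        for (DataSet ds : dataSets) {
            if (ds.members.contains(dataElement)) {
                types.add(ds.periodType);
            }
        }
        return types;
    }

    // null if the element is in no DataSet; fails if the answer is ambiguous.
    PeriodType getPeriodType(String dataElement) {
        Set<PeriodType> types = getPeriodTypes(dataElement);
        if (types.isEmpty()) {
            return null;
        }
        if (types.size() > 1) {
            throw new IllegalStateException(
                dataElement + " is captured with several PeriodTypes: " + types);
        }
        return types.iterator().next();
    }
}
```

The ambiguous case is exactly the situation where a DataElement is registered for DataSets with different PeriodTypes, which the derived lookup cannot resolve on its own.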

The “explicit” approach also involves an association (DataElement → PeriodType). I am not sure if one could say it involves redundancy, as we don’t have a model enforcement of one-or-more DataSet memberships for a DataElement.

I opt for the latter approach. Of course I might be wrong:)

Lars · 20 March 2009 14:37

The “explicit” approach also involves an association (DataElement → PeriodType). I am not sure if one could say it involves redundancy, as we don’t have a model enforcement of one-or-more DataSet memberships for a DataElement.

Another matter that favours the “implicit” approach is that we will avoid updating existing databases.