dhis2 dxf data import

I've been re-looking at dxf stuff in line with this blueprint:
https://blueprints.launchpad.net/dhis2/+spec/separation-of-meta-data-and-data-values.

A driving use case here is the import of HR data from iHRIS in Kenya.
With over 8000 orgunits, the current scheme of importing metadata then
mapping will scale badly and introduce fragility.

So I started to review this thread,
https://blueprints.launchpad.net/dhis2/+spec/separation-of-meta-data-and-data-values,
where Jo introduced the concept of datavalueset and persisting
datavalues individually rather than via multiple inserts. That's a
really useful construct which I want to reuse. (BTW one of the reasons
the datavalueset is useful is where you want to check credentials for
importing. Currently it's an admin-only task. Checking for each
individual datavalue would be unworkable. Checking for permission to
import a datavalueset for an orgunit will scale better).
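A per-set check along these lines is easy to sketch (all names here are hypothetical - this is not the real DHIS2 security API, just the shape of the idea):

```java
import java.util.Set;

// Sketch: authorize once per dataValueSet (per orgUnit) instead of once per
// dataValue. Class and method names are hypothetical, not the DHIS2 API.
public class DataValueSetAuthorizer {

    public static boolean mayImport(Set<String> userAuthorizedOrgUnits, String setOrgUnit) {
        // One membership test covers every dataValue in the set.
        return userAuthorizedOrgUnits.contains(setOrgUnit);
    }
}
```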

I've attached a schema which is extracted from Jo's existing annotated
code ( schemagen src/main/java/org/hisp/dhis/importexport/dxf2/model/
). XML schema language is ugly compared to RELAX NG, but never mind
that for now :slight_smile:

Much of what Jo has done here is geared towards solving his particular
requirements re the mobile interface, and much of it could be reused.

As a first step I am interested in reusing the DataValueSet stuff from
the model rather than the representation of metadata, which I think
needs to be done more completely and not until changes to the
dimensional model are realized in the not too distant future.

So before airing a few thoughts about the DataValueSet, let's look at
an example conforming to the existing schema for those of you less
comfortable with reading schema:

<dataValueSet orgUnit="550e8400-e29b-41d4-a716-446655440000" period="201004">
  <dataValue dataElement="550e8400-e29b-41d4-a716-446655440000" value="4"
             categoryOptionCombo="550e8400-e29b-41d4-a716-446655440000" storedBy="Bob"/>
  <dataValue dataElement="550e8400-e29b-41d4-a716-446655440000" value="54"
             categoryOptionCombo="3" storedBy="Bob"/>
  <dataValue dataElement="550e8400-e29b-41d4-a716-446655440000" value="43"
             categoryOptionCombo="550e8400-e29b-41d4-a716-446655440000" storedBy="Bob"/>
  <dataValue dataElement="550e8400-e29b-41d4-a716-446655440000" value="44"
             categoryOptionCombo="550e8400-e29b-41d4-a716-446655440000" storedBy="Bob"/>
  <dataValue dataElement="550e8400-e29b-41d4-a716-446655440000" value="67"
             categoryOptionCombo="550e8400-e29b-41d4-a716-446655440000" storedBy="Bob"/>
  <dataValue dataElement="550e8400-e29b-41d4-a716-446655440000" value="100"
             categoryOptionCombo="550e8400-e29b-41d4-a716-446655440000" storedBy="Bob"/>
</dataValueSet>

1. We should shift storedBy up to the dataValueSet level. I'm
assuming all datavalues in a datavalueset will be stored by the same
user. I'd put back an optional Comment attribute here as well.
Currently it's only useful for rolling back imports. Not the most
efficient way to implement it but still useful.

2. I don't think categoryOptionCombo should *necessarily* be exposed
to the external world. It's very much an internal arrangement of DHIS.
It's useful enough in cases where HISP folk are involved on both
producer and consumer side of the equation, but for other 3rd parties
in the world it is best to hide this internal arrangement. I suggest
that dataElement and value are *required* attributes,
categoryOptionCombo is optional, and in addition we have an
<xs:anyAttribute> extension point which allows for additional
attributes. The implication would be that the above dataset will
remain valid (so existing stuff is still working).

3. On the question of identifiers .... the schema as it stands
accepts any string identifier. The current model implementation makes
use of uuids for this. As we have all come to understand, the outside
world is more complex and there are many possible ways that different
systems will identify things. This can be via uuid, urn or some
mutually exchanged codelists of integer or other identifiers or even
identification by name. Try as we might to coerce the world into
using our one true identifier, all of the above might/will crop up
from time to time. For example we have a case in Kenya, where there
is a nationally agreed upon set of facility codes, which will be used
in data exchange.

So I am going to suggest two additional attributes, probably at the
dataValueSets level, which indicate the id system to use. Currently I
can think of internal, code, uuid and map as possible candidates for
these attribute values, where map would imply that ids need to be
resolved using an aliases table keyed by a naming context, possibly
using some of Lars' objectmapper or perhaps something simpler. To
maintain compatibility with the existing web service api this
attribute can be optional and default to uuid.
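To make the idea concrete, here is a rough sketch of how such an attribute might drive id resolution on import. The scheme names come from the proposal above; the lookup maps are hypothetical stand-ins for the real orgunit/dataelement stores and aliases table:

```java
import java.util.Map;

// Hypothetical sketch of resolving an identifier according to the proposed
// id-scheme attribute (internal, code, uuid, map). The maps stand in for the
// real object stores / aliases table.
public class IdResolver {

    public enum IdScheme { INTERNAL, CODE, UUID, MAP }

    public static Integer resolve(IdScheme scheme, String id,
                                  Map<String, Integer> byCode,
                                  Map<String, Integer> byUuid,
                                  Map<String, Integer> aliases) {
        switch (scheme) {
            case INTERNAL: return Integer.valueOf(id); // id is the database key itself
            case CODE:     return byCode.get(id);      // e.g. nationally agreed facility code
            case UUID:     return byUuid.get(id);      // the current default behaviour
            case MAP:      return aliases.get(id);     // aliases table keyed by naming context
            default:       throw new IllegalArgumentException("Unknown scheme: " + scheme);
        }
    }
}
```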

The implication of adding all the above will be that whereas the
datavalueset above will remain valid (except perhaps shifting to
storedBy), the following would also be valid:

<dataValueSets orgUnitId="code" dataElementId="internal">
  <dataValueSet orgUnit="23" period="201004" storedBy="Bob" >
    <dataValue dataElement="2" value="4" Sex="1" />
    <dataValue dataElement="2" value="5" Sex="2"/>
    <dataValue dataElement="4" value="43" Sex="1" Age="3" />
    <dataValue dataElement="5" value="44" Sex="1" Age="3" />
  </dataValueSet>
</dataValueSets>
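To show that such a document is unproblematic to consume, here is a sketch of an attribute-agnostic reader using plain JDK StAX: any attribute on a dataValue, known or not (Sex, Age, ...), is simply collected. This is illustrative only, not the importer's actual code:

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Illustrative attribute-agnostic reader: every attribute on <dataValue>
// ends up in a map, so extra dimension attributes never break the parse.
// Not the actual import/export module code.
public class AgnosticDataValueReader {

    public static List<Map<String, String>> read(String xml) {
        List<Map<String, String>> values = new ArrayList<>();
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (r.hasNext()) {
                if (r.next() == XMLStreamReader.START_ELEMENT
                        && "dataValue".equals(r.getLocalName())) {
                    Map<String, String> dv = new LinkedHashMap<>();
                    for (int i = 0; i < r.getAttributeCount(); i++) {
                        dv.put(r.getAttributeLocalName(i), r.getAttributeValue(i));
                    }
                    values.add(dv);
                }
            }
        } catch (XMLStreamException e) {
            throw new RuntimeException(e);
        }
        return values;
    }
}
```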

I am pretty sure I can implement the above without breaking what is
currently there. One possible but minor breaking change I would
suggest, to improve parsing of very large datasets, might be to
abbreviate some well known element names to dv, de and v for
compactness.

Please give me some feedback. I'll do up a temp branch with model
changes shortly.

Cheers
Bob

instance1.xml (307 Bytes)

Great that you're looking at this. Some immediate feedback (pardon the lack of structure:)

As a first step I am interested in reusing the DataValueSet stuff from
the model rather than the representation of metadata, which I think
needs to be done more completely and not until changes to the
dimensional model are realized in the not too distant future.

The only metadata stuff I've done was basically just to serve up some basic html, that should not be used, and certainly not reused :slight_smile:

1. We should shift storedBy up to the dataValueSet level. I'm
assuming all datavalues in a datavalueset will be stored by the same
user. I'd put back an optional Comment attribute here as well.
Currently its only useful for rolling back imports. Not the most
efficient way to implement it but still useful.

I agree it would be nice if we could move it up. I am a bit unsure of the semantics of our data model and the use cases for this. If this were to be used to communicate between dhis instances, I guess it won't be an unthinkable situation that I have edited/added a value in a set that you originally stored, and that granularity would be lost. Whether that is something we should rethink in our data model rather than carry over into the xml structure, I don't know.

2. I don't think categoryOptionCombo should *necessarily* be exposed
to the external world. Its very much an internal arrangement of DHIS.
Its useful enough in cases where HISP folk are involved on both
producer and consumer side of the equation, but for other 3rd parties
in the world it is best to hide this internal arrangement. I suggest
that dataElement and value are *required* attributes,
categoryOptionCombo is optional and in addition we have an
<xs:anyAttribute> extension point which allows for additional
attributes. The implication would be that the above dataset will
remain valid (so existing stuff is still working),

I think I agree that we need another model to better "externalize" dimensions. But it would become a bit more complex to implement if dataElement+optionCombo is not a "simple" identifier to the datavalue any more. It would be good to hear a little more about how you plan to implement it in the short run and if you think it should be combined with changes inside dhis.

- Are you thinking of modelling this anyattribute extension point on the sdmx model in some way?
- If there is a more explicit way to describe this in the schema than just anyattribute, I think it could help?
- And I think it would be advantageous if we could rework the internal data model to better fit this more general "schema" at the same time, or at least know a little bit more about how the internal changes would look.
- We need to stay backwards compatible with existing meta models; are we sure that the rules for names of dimensions (Sex, Age) are compatible with xml attribute names?
- We might need to think through how these dimensions would look in the metamodel xml, and how the link between this anyAttribute space and that model would be?

Overall I guess allowing the two identifier schemes to coexist for a while seems like a good idea. Though we should probably look to get rid of optionComboId asap, then.

3. On the question of identifiers ....

So I am going to suggest two additional attributes, probably at the
dataValueSets level, which indicate the id system to use. Currently I
can think of internal, code, uuid and map as possible candidates for
these attribute values. Where map would imply that ids need to be
resolved using an aliases table keyed by a naming context, possibly
using some of Lars' objectmapper or perhaps simpler. To maintain
compatibility with existing web service api this attribute can be
optional and default to uuid.

Yep. I'm not sure what should be the default, though. Maybe just the internal id? For simple cases that looks easier than uuids (at least if we are thinking about the metamodel and how to communicate these ids *to* other systems?). Since we would maybe want to reuse this id model for the meta model as well, do you think it would fit there?

I am pretty sure I can implement the above without breaking what is
currently there. One possible but minor breaking change I would
suggest to improving parsing of very large datasets might be to
abbreviate some well known element names to dv, de and v for
compactness.

I am not sure if these element names would really be that well known and obvious for the target people having to work with the schema.
- Is there any alias mechanism for xml easily used with jaxb?
- Wouldn't we want explicit streaming/"batch" handling for use cases where sizes grew to this size, anyway?

Overall, though, if you think abbreviated names are better, I'm all for it.

Jo

···

On 1 Sep 2011, at 13:55, Bob Jolliffe wrote:

Great that you're looking at this. Some immediate feedback (pardon the lack of structure:)

Thanks for feedback ...

As a first step I am interested in reusing the DataValueSet stuff from
the model rather than the representation of metadata, which I think
needs to be done more completely and not until changes to the
dimensional model are realized in the not too distant future.

The only metadata stuff I've done was basically just to serve up some basic html, that should not be used, and certainly not reused :slight_smile:

1. We should shift storedBy up to the dataValueSet level. I'm
assuming all datavalues in a datavalueset will be stored by the same
user. I'd put back an optional Comment attribute here as well.
Currently its only useful for rolling back imports. Not the most
efficient way to implement it but still useful.

I agree it would be nice if we could move it up. I am a bit unsure of the semantics of our data model and the use cases for this. If this were to be used to communicate between dhis instances, I guess it won't be an unthinkable situation that I have edited/added a value in a set that you originally stored, and that granularity would be lost. Whether that is something we should rethink in our data model rather than carry over into the xml structure, I don't know.

Me neither so it was a bit of a tentative suggestion :slight_smile: I think the
semantics have their origin in an earlier era of standalone and
isolated dhis. My thinking would be that what is relevant is who has
stored this value in *this* database. Usernames from strange
databases wouldn't make much sense anyway. And if one wanted to audit
its absolute origin, one would have to follow the trail back to the
producer of the datavalueset - which might or might not be a dhis
instance. Of course persisting the datavalueset would be immensely
helpful for this, but as Lars has pointed out, no requirement for this
has emerged yet so we hold off on that for now.

It's not critical either way at the moment - just looks a bit untidy.
Thus far, unless I hear compelling argument to the contrary, it seems
better to move it up. Will wait and listen.

2. I don't think categoryOptionCombo should *necessarily* be exposed
to the external world. Its very much an internal arrangement of DHIS.
Its useful enough in cases where HISP folk are involved on both
producer and consumer side of the equation, but for other 3rd parties
in the world it is best to hide this internal arrangement. I suggest
that dataElement and value are *required* attributes,
categoryOptionCombo is optional and in addition we have an
<xs:anyAttribute> extension point which allows for additional
attributes. The implication would be that the above dataset will
remain valid (so existing stuff is still working),

I think I agree that we need another model to better "externalize" dimensions. But it would become a bit more complex to implement if dataElement+optionCombo is not a "simple" identifier to the datavalue any more. It would be good to hear a little more about how you plan to implement it in the short run and if you think it should be combined with changes inside dhis.

I think that the simple {dataElement,optionCombo} tuple will remain
the internal identifier to the datavalue for the foreseeable future.
There's a lot of stuff built on top of it, it has some merits and it
can be coerced to behave reasonably well with some tightening of
constraints at the level of our java model.

- Are you thinking of modelling this anyattribute extension point on the sdmx model in some way?

Well, similar enough I guess.

- If there is a more explicit way to describe this in the schema than just anyattribute, I think it could help?

Schema languages are better at some things than others. The problem
here is that we would be required to constrain the attributes on the
basis of a dynamic list which would vary from the concept list of one
application to another. This would not be friendly to annotating
bindings for use on any system. This is also the sdmx-hd problem. As
it is, the XmlAnyAttribute annotation would bind to a map like:

// In the DataValue bean (using javax.xml.bind.annotation.XmlAnyAttribute,
// javax.xml.namespace.QName, java.util.HashMap and java.util.Map):

private Map<QName, Object> any;

// Any attribute not matched by an explicit binding lands in this map.
@XmlAnyAttribute
public Map<QName, Object> getAny() {
    if (any == null) {
        any = new HashMap<QName, Object>();
    }
    return any;
}

The datavalue service can determine whether attributes are invalid or
not (in much the same way it determines whether orgunits, dataelements
etc. really exist). It could do this fairly painlessly by looking at
the categorycombo of the dataelement - which I think we need to do now
anyway to determine if the optioncombo is valid.
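The check itself could be as small as comparing the leftover attribute names against the concept names derived from the dataelement's categorycombo. A hypothetical sketch:

```java
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: validate the anyAttribute leftovers against the
// concept names belonging to the data element's categorycombo.
public class DimensionValidator {

    public static boolean attributesValid(Map<String, String> extraAttributes,
                                          Set<String> comboConceptNames) {
        // Every supplied dimension attribute must name a concept of the combo.
        return comboConceptNames.containsAll(extraAttributes.keySet());
    }
}
```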

Of course it would be fairly trivial for a running instance of dhis to
generate a *strict* schema with the anyattributes replaced by fixed
attributes, which might be of value to producers. But the internal
parser would have to be a bit agnostic.

- And I think it would be advantageous if we could rework the internal data model to better fit this more general "schema" at the same time, or at least know a little bit more about how the internal changes would look.

Internally I want to change very little. The most fundamental change
being to implement the category-concept-categoryoption binding in the
model and put strict constraints on concept names so that they are
obliged to conform to the intersection of requirements for sql column
names and xml attribute names.

Breaking mcdonalds and replacing with a star or snowflake type schema
is not really a sensible option at this juncture.

- We need to stay backwards compatible with existing meta models; are we sure that the rules for names of dimensions (Sex, Age) are compatible with xml attribute names?

That we must impose through inspired regex on concept names which
should be relatively easy. Category names can remain whatever they like.

- We might need to think through how these dimensions would look in the metamodel xml, and how the link between this anyAttribute space and that model would be?

Will get to metamodel.xml next. But the link between anyattribute
space and the model is essentially a fairly trivial one through
categorycombo and conceptname.

Overall I guess allowing the two identifier schemes to coexist for a while seems like a good idea. Though we should probably look to get rid of optionComboId asap, then.

Don't know. Could well be that optionComboId has long legs. It has
its uses between dhis systems which both understand the notion.

3. On the question of identifiers ....

So I am going to suggest two additional attributes, probably at the
dataValueSets level, which indicate the id system to use. Currently I
can think of internal, code, uuid and map as possible candidates for
these attribute values. Where map would imply that ids need to be
resolved using an aliases table keyed by a naming context, possibly
using some of Lars' objectmapper or perhaps simpler. To maintain
compatibility with existing web service api this attribute can be
optional and default to uuid.

Yep. I'm not sure what should be the default, though. Maybe just the internal id? For simple cases that looks easier than uuids (at least if we are thinking about the metamodel and how to communicate these ids *to* other systems?). Since we would maybe want to reuse this id model for the meta model as well, do you think it would fit there?

I agree that uuid is not the most gentle default. I just suggested it
because you were already using it.

I am pretty sure I can implement the above without breaking what is
currently there. One possible but minor breaking change I would
suggest to improving parsing of very large datasets might be to
abbreviate some well known element names to dv, de and v for
compactness.

I am not sure if these element names would really be that well known and obvious for the target people having to work with the schema.
- Is there any alias mechanism for xml easily used with jaxb?

Not really. There is a standard called DSRL which is designed to
alias/transform element names, but it's not really applicable here. It's
not that important. I can live with long names or short.

- Wouldn't we want explicit streaming/"batch" handling for use cases where sizes grew to this size, anyway?

I think for really large cases, database dumps and other tools are
maybe more appropriate anyway. Of course one problem is that you
don't know the size of the stream when you start consuming it from the
head ... I am sure some snakes have this problem :slight_smile:

Bob

···

On 1 September 2011 15:02, Jo Størset <storset@gmail.com> wrote:

On 1 Sep 2011, at 13:55, Bob Jolliffe wrote:

Overall, though, if you think abbreviated names are better, I'm all for it.

Jo

1. We should shift storedBy up to the dataValueSet level.

My thinking would be that what is relevant is who has
stored this value in *this* database. Usernames from strange
databases wouldn't make much sense anyway.

Its not critical either way at the moment - just looks a bit untidy.
Thus far, unless I hear compelling argument to the contrary, it seems
better to move it up. Will wait and listen.

We haven't really talked about who is "doing the storing" in cases like this, but I would think a system user scenario, where a system user ("systemX") does the storing, would be the most correct?

In the current handling, this attribute is an optional value, if it's empty currentUserService.getCurrentUsername() will be used. So I think maybe just removing this option for the time being (until it is actually needed) is just as well. Unless we have a use case where we explicitly need to allow the external system to set it, in which case we could let that use case decide.

2. I don't think categoryOptionCombo should *necessarily* be exposed
to the external world.

Trusting you to have oversight of the mapping to the internal model, I think this makes sense; it should make the meta model easier to articulate in a rest api, as well.

3. On the question of identifiers ....

I agree that uuid is not the most gentle default. I just suggested it
because you were already using it.

I just used uuids because you were already using them :slight_smile: I guess choosing the default is not an immediate problem, anyway.

I can live with long names or short.

Me too (with an ever so slight preference for understandability vs. "premature" optimization)

I certainly support the changes. With a couple of cases of external mobile solutions needing an api for this, it is a good time to stabilize this and get it clearly documented. What is your use case, btw? Thought you were all sdmx :slight_smile:

Jo

···

On 1 Sep 2011, at 17:04, Bob Jolliffe wrote:

1. We should shift storedBy up to the dataValueSet level.

My thinking would be that what is relevant is who has
stored this value in *this* database. Usernames from strange
databases wouldn't make much sense anyway.

Its not critical either way at the moment - just looks a bit untidy.
Thus far, unless I hear compelling argument to the contrary, it seems
better to move it up. Will wait and listen.

We haven't really talked about who is "doing the storing" in cases like this, but I would think a system user scenario, where a system user ("systemX") does the storing, would be the most correct?

In the current handling, this attribute is an optional value, if it's empty currentUserService.getCurrentUsername() will be used. So I think maybe just removing this option for the time being (until it is actually needed) is just as well. Unless we have a use case where we explicitly need to allow the external system to set it, in which case we could let that use case decide.

+1 to that

2. I don't think categoryOptionCombo should *necessarily* be exposed
to the external world.

Trusting you to have oversight of the mapping to the internal model, I think this makes sense; it should make the meta model easier to articulate in a rest api, as well.

3. On the question of identifiers ....

I agree that uuid is not the most gentle default. I just suggested it
because you were already using it.

I just used uuids because you were already using them :slight_smile: I guess choosing the default is not an immediate problem, anyway.

I can live with long names or short.

Me too (with an ever so slight preference for understandability vs. "premature" optimization)

right. we'll stick with long for now.

I certainly support the changes. With a couple of cases of external mobile solutions needing an api for this, it is a good time to stabilize this and get it clearly documented. What is your use case, btw? Thought you were all sdmx :slight_smile:

The sdmx mapping strategy is never far in the background. But mapping
sdmx to categorycombos is a terrible headache. Regarding flexible
identifiers, we have a (good) situation in Kenya where there is an
authoritative 3rd party master facility registry which dictates the
codes for facilities - the same codes we use in the code field in our
orgunits. External systems are going to identify facilities using
these codes so the little attribute can help greatly. Truth is we
will often not be able to play god when dictating identifiers so we
need a flexible mechanism to deal with what's out there. It's always
possible to translate these externally (using xslt for example) and
that will remain an option, but translating codes to ids when we
already have the codes in our model is a bit redundant.

The data will actually be coming in as an sdmx-hd dataset, but the
closer our datavalueset looks like an sdmx dataset, the easier it is.

Cheers
Bob

···

On 1 September 2011 20:49, Jo Størset <storset@gmail.com> wrote:

On 1 Sep 2011, at 17:04, Bob Jolliffe wrote:

Jo

Hi Bob,

This is super neat, and it's a use case that is very useful.
I’ve had to generate all that extra stuff I didn’t ‘need’.

This will also streamline metadata generation because one will only need to generate and pass metadata for de/ou/period.

But I wonder: what’s the difference between orgUnitId and orgUnit in your example?
Also, some elements don’t use any categories, but the model references a default categorycombo. How will that look in your proposed schema?

Would you branch Jo’s code in a way we could use to easily test yours as a module? Or…

Thanks

Ime

···

— On Thu, 9/1/11, Bob Jolliffe bobjolliffe@gmail.com wrote:

The implication of adding all the above will be that whereas the
datavalueset above will remain valid (except perhaps shifting to
storedBy), the following would also be valid:

<dataValueSets orgUnitId="code" dataElementId="internal">

I am pretty sure I can implement the above without breaking what is
currently there. One possible but minor breaking change I would
suggest to improving parsing of very large datasets might be to
abbreviate some well known element names to dv, de and v for
compactness.

Hi Ime

Hi Bob,

This is super neat, and its a use case that is very useful.
I’ve had to generate all that extra stuff I didnt ‘need’.

This will also streamline metadata generation because one will only need to generate and pass metadata for de/ou/period.

But I wonder; what’s the difference between orgunitId and orgunit in your example.

That’s a typo. Please take a look at the pdf file I sent out earlier this week as that is more correct.

Also, some elements don’t use any categories, but the model references a default categorycombo. How will that look in your proposed schema?

The default categorycombo is just that - default. So in the absence of any categories the categorycombo is automatically set to default when saving datavalues.

Would you branch Jo’s code in a way we could use easily test yours as a module? or…

The reading of this format is already implemented in the import/export module. It is tightly coupled with Jo’s code in the sense of making use of the same element/attribute name strings defined in his beans. So you can already use it by just importing the xml file. To test you should ideally set up some codes in your database. We should try and do this in the demo instance so people can try it there. Meanwhile I would suggest to test:

(i) pick an orgunit and assign it a code if it does not already have one (eg ou1)
(ii) pick a small dataset and assign it a code (eg dataset1)
(iii) assign codes to the dataelements within the dataset
(iv) assign the dataset to the orgunit

Then you should be able to import datavaluesets according to the examples given.

Alternatively you can use the existing uuids instead of the codes.

(It might be worth having a startup routine which automatically assigns codes based on the existing internal ids where they do not already exist.)
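Such a routine might be no more than this (a sketch, assuming a prefix-plus-internal-id convention like ou23; nothing here reflects real DHIS2 code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the suggested startup routine: where no code exists, derive one
// from the internal id (e.g. "ou" + 23 -> "ou23"). Purely illustrative.
public class CodeAssigner {

    public static Map<Integer, String> assignMissingCodes(Map<Integer, String> codesById,
                                                          String prefix) {
        Map<Integer, String> result = new LinkedHashMap<>();
        for (Map.Entry<Integer, String> e : codesById.entrySet()) {
            String code = e.getValue();
            result.put(e.getKey(),
                    (code == null || code.isEmpty()) ? prefix + e.getKey() : code);
        }
        return result;
    }
}
```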

Regards
Bob

···

On 15 September 2011 06:06, Ime Asangansi asangansi@yahoo.com wrote:

Thanks

Ime

— On Thu, 9/1/11, Bob Jolliffe bobjolliffe@gmail.com wrote:

The implication of adding all the above will be that whereas the
datavalueset above will remain valid (except perhaps shifting to
storedBy), the following would also be valid:

<dataValueSets orgUnitId="code" dataElementId="internal">

<dataValue dataElement="2" value="4" Sex="1" />
<dataValue dataElement="2" value="5" Sex="2"/>
<dataValue dataElement="4" value="43" Sex="1" Age="3" />

<dataValue dataElement="5" value="44" Sex="1" Age="3" />

I am pretty sure I can implement the above without breaking what is

currently there. One possible but minor breaking change I would
suggest to improving parsing of very large datasets might be to
abbreviate some well known element names to dv, de and v for
compactness.

Thanks Bob, the pdf is useful.

When you say codes, do you mean the unique id for the record?

Secondly, +1 for an internal routine to assign ids

Thirdly, how are you generating the dxf for iHRIS, please?

Thanks

Ime

···

— On Thu, 9/15/11, Bob Jolliffe bobjolliffe@gmail.com wrote:

This will also streamline metadata generation because one will only need to generate and pass metadata for de/ou/period.

But I wonder; what’s the difference between orgunitId and orgunit in your example.

That’s a typo. Please take a look at the pdf file I sent out earlier this week as that is more correct.

Also, some elements don’t use any categories, but the model references a default categorycombo. How will that look in your proposed schema?

The default categorycombo is just that - default. So in the absence of any categories the categorycombo is automatically set to default when saving datavalues.

Would you branch Jo’s code in a way we could use easily test yours as a module? or…

The reading of this format is already implemented in the import/export module. It is tightly coupled with Jo’s code in the sense of making use of the same element/attribute name strings defined in his beans. So you can already use it by just importing the xml file. To test you should ideally set up some codes in your database. We should try and do this in the demo instance so people can try it there. Meanwhile I would suggest to test:

(i) pick an orgunit and assign it a code if it does not already have one (eg ou1)
(ii) pick a small dataset and assign it a code (eg dataset1)
(iii) assign codes to the dataelements within the dataset
(iv) assign the dataset to the orgunit

Then you should be able to import datavaluesets according to the examples given.

Alternatively you can use the existing uuids instead of the codes.

(It might be worth having a startup routine which automatically assigns codes based on the existing internal ids where they do not already exist.)

Regards
Bob

Thanks

Ime

— On Thu, 9/1/11, Bob Jolliffe bobjolliffe@gmail.com wrote:

The implication of adding all the above will be that whereas the
datavalueset above will remain valid (except perhaps shifting to
storedBy), the following would also be valid:

<dataValueSets orgUnitId="code" dataElementId="internal">

<dataValue dataElement="2" value="4" Sex="1" />
<dataValue dataElement="2" value="5" Sex="2"/>
<dataValue dataElement="4" value="43" Sex="1" Age="3" />

<dataValue dataElement="5" value="44" Sex="1" Age="3" />

I am pretty sure I can implement the above without breaking what is
currently there. One possible but minor breaking change I would
suggest to improving parsing of very large datasets might be to
abbreviate some well known element names to dv, de and v for
compactness.

Thanks Bob, the pdf is useful.

When you mean codes, you mean the unique id for the record?

I wish life were so simple :slight_smile: There are quite a few ways that an identifiable object (eg Orgunit) can be judged unique:

  1. the primary database key
  2. the name
  3. the uuid
  4. the code

1 and 2 are not good for a number of reasons.

3 is quite ok except that (a) it’s a bit long and (b) we might have to map to data from elsewhere which doesn’t use a uuid.

This latter case is quite common - if dhis was the central authority in the world for assigning metadata (sometimes it feels like it is designed as if it is :slight_smile:) life might be better - but the reality is that sometimes there are other authorities, and it is good that there are. The case we have been dealing with in Kenya is an example - they have an official Master Facility List which is responsible for registering facilities and assigning codes, in which case we use these official codes in the code field of orgunit.

Secondly, +1 for an internal routine to assign ids

I’m in two minds about this. For sure it might be better to have generated codes ou23, de456 etc. rather than leave the field blank. But codes generally work best when deliberately assigned, as in the MFL case above.

Thirdly, please how are you generating the dxf for ihris?

iHRIS is generating the dxf for us, i.e. they are generating HR dataelement values (number of doctors, nurses etc)

Bob

···

On 15 September 2011 10:01, Ime Asangansi asangansi@yahoo.com wrote:

Thanks

Ime

— On Thu, 9/15/11, Bob Jolliffe bobjolliffe@gmail.com wrote:

This will also streamline metadata generation because one will only need to generate and pass metadata for de/ou/period.

But I wonder; what’s the difference between orgunitId and orgunit in your example.

That’s a typo. Please take a look at the pdf file I sent out earlier this week as that is more correct.

Also, some elements don’t use any categories, but the model references a default categorycombo. How will that look in your proposed schema?

The default categorycombo is just that - default. So in the absence of any categories the categorycombo is automatically set to default when saving datavalues.

Would you branch Jo's code in a way that we could easily test yours as a module? or…

The reading of this format is already implemented in the import/export module. It is tightly coupled with Jo's code in the sense of making use of the same element/attribute name strings defined in his beans. So you can already use it by just importing the xml file. To test, you should ideally set up some codes in your database. We should try and do this in the demo instance so people can try it there. Meanwhile I would suggest the following test:

(i) pick an orgunit and assign it a code if it does not already have one (eg ou1)
(ii) pick a small dataset and assign it a code (eg dataset1)
(iii) assign codes to the dataelements within the dataset
(iv) assign the dataset to the orgunit

Then you should be able to import datavaluesets according to the examples given.
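Assuming the codes from the steps above (ou1 for the orgunit, codes for the dataelements), a minimal importable payload can be sketched like this; the dataelement codes and the period are invented for the example:

```python
import xml.etree.ElementTree as ET

# Sketch of a minimal payload keyed by assigned codes rather than uuids.
# "ou1" follows step (i) above; "de1"/"de2" and the period are invented.
dvs = ET.Element("dataValueSet", orgUnit="ou1", period="201109")
for de_code, value in [("de1", "12"), ("de2", "7")]:
    ET.SubElement(dvs, "dataValue", dataElement=de_code, value=value)
print(ET.tostring(dvs, encoding="unicode"))
```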

Alternatively you can use the existing uuids instead of the codes.

(It might be worth having a startup routine which automatically assigns codes based on the existing internal ids where they do not already exist.)
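That startup routine could be as simple as the following sketch (a hypothetical helper, not existing DHIS2 code), generating codes like ou23 or de456 from the internal ids while leaving externally assigned codes such as MFL codes untouched:

```python
# Hypothetical sketch: fill in missing codes from internal ids (ou23,
# de456 style) without overwriting codes that were assigned externally.
def assign_codes(rows, prefix):
    for row in rows:
        if not row.get("code"):
            row["code"] = f"{prefix}{row['id']}"
    return rows

orgunits = [{"id": 23, "code": None}, {"id": 7, "code": "MFL-14023"}]
assign_codes(orgunits, "ou")
print(orgunits)  # the blank code becomes "ou23"; the MFL code survives
```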

Regards
Bob

Thanks

Ime


Hi Bob,
Thanks for this.
This will be very useful.

what do you think about openmrs generating dxf like ihris?

Ime

···

--- On Thu, 9/15/11, Bob Jolliffe <bobjolliffe@gmail.com> wrote:

Hi Bob,
Thanks for this.
This will be very useful.

what do you think about openmrs generating dxf like ihris?

Hi Ime

That's a pertinent question and one that has preoccupied me for some time. If you have been following the discussion on the openmrs devs list some weeks back, you might have picked up that:

  1. openmrs can already generate all manner of flexible renderings of what they call indicator reports using its reporting framework

  2. one of the renderings which already exist for openmrs is the SDMX-HD module developed by Jembi. SDMX-HD cross sectional data is very similar to this dxf2 data - to the extent that it is easily transformed from one to the other. Having had something of a hand in both, I can assure you this convergence is not entirely accidental :slight_smile: This is what we have done with iHRIS in Kenya for example. iHRIS is actually producing (well it can produce) SDMX-HD data and we simply transform it to dxf2 during import. The process is very efficient and we can comfortably import 300 000 datavalues in 1 minute.

  3. we have similarly imported SDMX-HD data from openmrs in Sierra Leone, though this was very much a proof of concept. An important detail which escaped me then, and which I have only come to understand fairly recently, is that the type of “indicators” which can be generated through the openmrs reporting module framework and rendered with the Jembi module are something called “Cohort Indicators”, and these, while being very powerful, are also quite restricted in what they can measure.

  4. A cohort indicator is an aggregation or calculation based upon a cohort of patients - so you can easily generate dataelements like “number of patients with reduced CD4 count this month”. These are really useful indicators from a practitioner or clinical research perspective. But many (perhaps most) of our HMIS indicators are not cohort based - they are often based on a measure of service delivery, like for example “how many malaria cases treated this month”.

  5. so we are a bit back to the drawing board here :frowning: The SDMX-HD capability in openmrs is restricted to cohort indicators, but in general these form only a subset of what a facility might typically have to routinely report. So we can currently read data from OpenMRS, but only a small subset of what is realistically required for a typical facility.

Having said that, all is far from lost. But it does mean that further customisation of openmrs is required in order to produce typical dataelements. Developers in Rwanda, for example, have created a solution to produce more flexible reports. There is work underway at present to define a new SQL Indicator type in the openmrs reporting module which will allow these to be mainstreamed into the core reporting framework. Though I am not optimistic this will easily tie in to the existing openmrs sdmx integration module, the critical thing is to be able to produce the right data.

The data format, as you have observed, is fairly trivial to render. Chuyen, a Vietnamese developer working with HISP India, is also working on a (hopefully) simple openmrs module to create a basic aggregate reporting capability. The format here is not the major issue. Once we have the means to generate the dataelement values required, they can easily enough be rendered in this dxf2 format or in sdmx-hd. In fact, as long as it's got a dataelement code, a period and a value, I think we can (and happily will) swallow any xml representation.
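The kind of transform described above is simple enough to sketch; note that the source element and attribute names below are simplified stand-ins, not the real SDMX-HD cross-sectional vocabulary:

```python
import xml.etree.ElementTree as ET

# Illustrative only: a cross-sectional record with an orgunit, a period
# and a set of observations, rewritten as a dxf2-style dataValueSet.
# The input element/attribute names are simplified, not actual SDMX-HD.
src = ET.fromstring(
    '<Section orgUnit="ou1" period="201104">'
    '<Obs dataElement="de1" value="12"/>'
    '<Obs dataElement="de2" value="7"/>'
    '</Section>')

dvs = ET.Element("dataValueSet", orgUnit=src.get("orgUnit"),
                 period=src.get("period"))
for obs in src.findall("Obs"):
    ET.SubElement(dvs, "dataValue",
                  dataElement=obs.get("dataElement"), value=obs.get("value"))
print(ET.tostring(dvs, encoding="unicode"))
```

Since it is a one-to-one rewrite of elements and attributes it streams well, which is consistent with the import throughput mentioned above.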

There you go … long answer to short question.

Regards
Bob

PS. short answer to another point you raised right at the start re streamlining of metadata export: I think the requirement is actually even simpler than you imagine. Whereas a system can determine the codes it needs from a “complete” dhis metadata export, it's actually much simpler for both parties for dhis to simply export codelists for dataelements and orgunits, rather than the kitchen sink of shortnames, alternative names, geo-coordinates etc.

···

On 16 September 2011 02:36, Ime Asangansi asangansi@yahoo.com wrote:

Ime

— On Thu, 9/15/11, Bob Jolliffe bobjolliffe@gmail.com wrote:

From: Bob Jolliffe bobjolliffe@gmail.com
Subject: Re: [Dhis2-devs] dhis2 dxf data import
To: “Ime Asangansi” asangansi@yahoo.com

Cc: “dhis2-devs” dhis2-devs@lists.launchpad.net
Date: Thursday, September 15, 2011, 11:36 AM
