[Branch ~dhis2-devs-core/dhis2/trunk] Rev 2851: Added spike for storing dataValueSets through a simple http post (see <dhis-root-url>/api/rpc)

Hi,

One thing to keep in mind here is that a data element can be captured in multiple datasets (but still refer to the same value).
This has been a popular mechanism to help implementers work around the many duplicating forms and still keep the database as clean and consistent as possible.

In this scenario it would be difficult to know which form/dataset actually collected the value I guess.

I know there have been some requests to add more properties to the values, e.g. how they were captured, who is the owner etc. but it should be possible to accommodate this and still keep the original primary keys / references for data values (orgunit, period, data element, catoptioncombo).

If your grouping of values only concerns how data values are collected and transferred and not how they are persisted, then it seems fine to me. This is the role of the dataset in today’s model.

Ola

···

On 16 February 2011 11:57, Bob Jolliffe bobjolliffe@gmail.com wrote:


Hi,
One thing to have in mind here is that a data element can be captured in
multiple datasets (but still refer to the same value).
This has been a popular mechanism to help implementers work around the many
duplicating forms and still keep the database as clean and consistent as
possible.
In this scenario it would be difficult to know which form/dataset actually
collected the value I guess.
I know there have been some requests to add more properties to the values,
e.g. how they were captured, who is the owner etc. but it should be possible
to accommodate this and still keep the original primary keys / references
for data values (orgunit, period, data element, catoptioncombo).
If your grouping of values only concerns how data values are collected and
transferred and not how they are persisted, then it seems fine to me. This
is the role of the dataset in today's model.

No it's not, and that is the shortcoming. The dataset is a grouping of
dataelements, not datavalues. Hence there is no explicit relationship
between datavalue and dataset. You can implicitly figure it out by
"calculating" the dataset as f(dataelement, orgunit), where the
periodType you calculate as f(periodid). But bearing in mind what you
start off with above, even this might be considered a best effort.
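
(For illustration, a rough sketch of that "best effort" lookup; the class, method and map names below are invented for this example and are not actual DHIS2 code - they just spell out the f(dataelement, orgunit, periodtype) guess described above.)

import java.util.*;

// Illustrative only: the implicit "which data set did this value come from" guess.
// A data element may belong to several data sets, so the answer can be ambiguous.
public class BestEffortDataSetLookup
{
    // dataElement -> data sets it belongs to
    static Map<String, Set<String>> dataSetsByElement = new HashMap<>();
    // dataSet -> org units reporting it
    static Map<String, Set<String>> orgUnitsByDataSet = new HashMap<>();
    // dataSet -> period type
    static Map<String, String> periodTypeByDataSet = new HashMap<>();

    static String guessDataSet( String dataElement, String orgUnit, String periodType )
    {
        List<String> candidates = new ArrayList<>();

        for ( String ds : dataSetsByElement.getOrDefault( dataElement, Set.of() ) )
        {
            if ( orgUnitsByDataSet.getOrDefault( ds, Set.of() ).contains( orgUnit )
                && periodType.equals( periodTypeByDataSet.get( ds ) ) )
            {
                candidates.add( ds );
            }
        }

        // "Best effort": with overlapping data sets there may be zero or several
        // candidates, in which case the origin of the value cannot be recovered.
        return candidates.size() == 1 ? candidates.get( 0 ) : null;
    }
}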

All of this is much more entangled than it needs to be but it has
evolved that way and, as Lars says, it won't be changed in a hurry.

Anyway, more later ..

Bob

···

On 16 February 2011 11:41, Ola Hodne Titlestad <olati@ifi.uio.no> wrote:

Ola
-------

On 16 February 2011 11:57, Bob Jolliffe <bobjolliffe@gmail.com> wrote:

On 16 February 2011 07:37, Abyot Gizaw <abyota@gmail.com> wrote:
>
>
> 2011/2/15 Bob Jolliffe <bobjolliffe@gmail.com>
>>
>> On 15 February 2011 14:09, Lars Helge Øverland <larshelge@gmail.com> >> >> wrote:
>> >
>> >
>> > On Tue, Feb 15, 2011 at 2:34 PM, Jo Størset <storset@gmail.com> >> >> > wrote:
>> >>
>> >> Den 15. feb. 2011 kl. 18.40 skrev Bob Jolliffe:
>> >>
>> >> > Simple validation seems to work ok. I get an "Aw, Snap! ..." when
>> >> > posting twice with the same period but that is probably something
>> >> > you
>> >> > are not catching yet.
>> >>
>> >> Should work now.
>> >>
>> >> > I don't agree with this.
>> >>
>> >> I know :slight_smile: I don't necessarily agree myself, but it is also a matter
>> >> of
>> >> what is practically possible.. (And it might make sense to have a
>> >> simpler
>> >> json-oriented web api vs. a more fullfledged xml format for heavy
>> >> imp/exp.
>> >> Are you coming to Oslo in March by any chance, then we can fight it
>> >> out! )
>> >>
>> >
>> > Can you please explain why it is not practically possible to have dxf
>> > as
>> > the
>> > root element?
>> > I don't have anything against grouping datavalues in sets to make the
>> > format
>> > more compact. But, first, we currently don't have any real
>> > requirements
>> > or
>> > use-case where we want to persist the "datavalueset". Second we
>> > currently
>> > have no support for it in the model. So whats the point of modeling
>> > our
>> > exchange format this way?
>>
>> Well partly because this structure models the way data is produced.
>> In sets. Off a form or off an import. SDMX data for example also
>> arrives in sets. While there is no support in the model it simply
>> means that we can lose information regarding the set. It becomes
>> important where you might want to rollback a set or identify where
>> some particular value has come from. Currently this is sort of implicitly
>> keyed but there are benefits in making it explicit. For example you
>> can't currently trace a datavalue back to whether it was entered
>> through a dhis form, whether it arrived from one of Jo's 5000 phone's
>> or whether it was imported from iHRIS (or whatever). You can populate
>> the comment of all the datavalues but that's really expensive.
>>
>> There are also savings to be had on storage by inheriting attributes
>> like period and orgunit from a dataset rather than replicating in each
>> datavalue.
>>
>> It's not a model change I would propose immediately (I think we have
>> enough zooks to sort out) but surely it is hard to argue that its not
>> the (proverbial) right thing to do. Meanwhile the way Jo has it in
>> his xml looks fine to me.
>
>
> Hmmm... I think if we are to go with this datavalueset concept and take
> away orgunit&period from datavalue and leave it with the groupset - we
> will be hitting big trouble!
>
> The flexibility we have right now - the way we define Indicators, design
> reports,... - is down to the independence of the datavalue. Each and
> every
> piece of datavalue can stand by itself and make sense - allowing
> monitoring
> and evaluation people greater flexibility and harmonization.
>
> Again, with the datavalueset approach mentioned here, we will be against
> the
> minimum-dataset concept. For me the minimum dataset concept worked
> because
> users/healthprograms can share dataelement/datavalue.

I doubt the trouble would be as big as you think. But you might be
right and could be I'm missing something. But regarding just posting
of data it makes no difference at all other than making the message
more efficient.

What is minimum-dataset concept?

Bob

BTW its not really to do with groupset. But I guess that was a typo.

>
> Abyot.
>
>>
>> Cheers
>> Bob
>>
>> > Yes we might need it sometime in the future but
>> > then we should implement it when we need it.
>> > I also find it weird that we really need to implement two parsers for
>> > this.
>> > More work and more code to maintain.
>> > The uuids will go for a new Identifier property for version 2.2 and
>> > make
>> > things less verbose btw.
>> > Lars
>> >
>> >>
>> >> > Your use of DataValueSet here is very welcome - as you know I have
>> >> > been advocating this for a while. Would be nice also to persist
>> >> > it
>> >> > to
>> >> > provide audit (and simplify dtavalue store) but that is maybe too
>> >> > much
>> >> > for now.
>> >>
>> >> Yes, that would have to be the next topic. Let's see if anyone else
>> >> take
>> >> the bait :slight_smile:
>> >>
>> >> Jo

Ola's point here is important and that's why it is wrong and ambiguous to include the dataset on the datavalueset like Jo has implemented it: A data element can appear in multiple datasets. So there is no guarantee that a data value comes from data set a, since it could just as well have been received from a datavalueset b. A datavalue might very well be subsequently updated from any number of other data sets/datavaluesets. So a datavalue can be added from a dataset a, updated from a dataset b, updated again from a dataset c… Where would you say it comes from?

And if we had a one-to-one relationship between data element and dataset it would be unnecessary to add the dataset to the datavalueset since it could be derived from the data element. I was trying to explain this before this was committed but it was ignored.

That said I don't have anything against grouping datavalues in the exchange format to save space, which is a different question.

The dataset thing works quite well and let's not complicate this more than necessary. If users one day require improved tracking of datavalues, let's deal with it then.

Lars

Ola's point here is important

Agree.

and that's why it is wrong and ambiguous to include the dataset on the datavalueset like Jo has implemented it:

If anything is "wrong and ambiguous" it is inherited from the design already there, it's not something I'm implementing. And it is certainly not something that comes from including a dataset identifier on *POSTED* datavaluesets.

A data element can appear in multiple datasets. So there is no guarantee that a data value is coming from data set a since it was received from a datavalueset b. A datavalue might very well be subsequently updated from any number of other data set/datavalueset. So a datavalue can be added from a dataset a, updated from a dataset b, updated again from a dataset c... Where would you say it comes from?

I would say when the user has just edited and posted the form for dataset A it comes from dataset A. Do you seriously mean to say that that is ambiguous while *guessing* is unambiguous? DataSet A might be locked while dataSet B is not. You are saying that guessing what datavalueset to check for locking on is *the right way*, while knowing is wrong? I mean, seriously... It's not that there aren't plenty of real concerns here, this is just sour grapes.

And if we had a one-to-one relationship between data element and dataset it would be unnecessary to add the dataset to the datavalueset since it could be derived from data element. I was trying to explain this before this was commited but it was ignored.

And of course everyone obviously agrees... if you have a one-to-one relation you can then deduce one from the other. But we don't, and if we had we wouldn't have this discussion, so then the point is rather moot, wouldn't you say?

That said I don't have anything against grouping datavalues in the exchange format to save space, which is a different question.

The dataset thing works quite well and let's not complicate this more than necessary. If users one day require improved tracking of datavalues, let's deal with it then.

k

Jo

···

Den 16. feb. 2011 kl. 22.07 skrev Lars Helge Øverland:

Ok, maybe I was unreasonably blaming you for this, sorry about that. Including dataset in the exchange format for completeness and locking purposes is fine and makes sense. It's the idea of directly linking datavalue to dataset on the persistence side for tracking purposes that I am against. Lars

···

On 16 Feb 2011 13:48, “Jo Størset” storset@gmail.com wrote:


Ok maybe i was unreasonably blaming you for this, sorry about that.
Including dataset in the exchange format for completeness and locking
purposes is fine and makes sense. Its the idea of directly linking datavalue
to dataset on the persistence side for tracking purposes i am against. Lars

And you are right to be against that. But if not the dataset then we
are short of something else ... :) Whether it's important or not is
another question.

···

On 16 February 2011 21:23, Lars Helge Øverland <larshelge@gmail.com> wrote:


And you are right to be against that. But if not the dataset then we

are short of something else … :slight_smile: Whether it’s important or not is

another question.

Yes we might be short of something else. I am just explaining that the way dataset is today, it is unsuitable for tracking purposes. And as Ola says, there are good reasons for having dataset the way it is.

And whether it's important depends on whether users require it. When they do, we will have to invent something.

···

On 16 Feb 2011 16:32, “Bob Jolliffe” bobjolliffe@gmail.com wrote:


tl;dr.. I should have cleaned this up, but have to get it sent before other work takes precedence..

So, the intention of this prototyping was to get some kind of general discussion going on what I see as challenges with the current system (in addition to hopefully ending up with a working api of some sort). And I wanted to base it on concrete examples, so that it would hopefully not be as abstract. I have committed a rework of the api in terms of how I understand a "common dxf format" version could look. The xml would basically look like this:

<dxf xmlns="http://dhis2.org/schema/dxf/x.x">
  <dataValues>
    <dataValue
      dataSet="uuid - only required when there is an actual ambiguity in the system"
      period="period"
      orgUnit="uuid"
      storedBy="string"
      dataElement="uuid"
      value="value" />
  </dataValues>
  <dataValueSets>
    <dataValueSet
      dataSet="uuid"
      orgUnit="uuid"
      period="period in iso format"
      complete="date (yyyymmdd)"/>
  </dataValueSets>
</dxf>

dataValueSets here is more or less just a renamed completeDataSetRegistrations (not necessary if we don't want to accommodate locks, they seem missing?), so basically this should almost exactly mirror current dxf (except uuids instead of ids). This is quite verbose, and it doesn't really mirror what the api clients we currently know of want to do, but that is not a fundamental problem (making usage and implementation a bit more difficult and messy, but not overly so).

In my view dxf's primary objective is as a complete serialization of the domain model as it is implemented. Its basic mode is that it should represent the system state exactly, so that you can take the serialized format and put up more or less a clone of the exported system. An api, on the other hand, is a message oriented protocol for changing state on the system. So there will be competing interests here.

The significant thing to notice is that I have ended up adding dataSet as an attribute on dataValue, even if it doesn't belong there for regular dxf use. This is difficult to avoid in a clean way: A client sending a message to the system to request a state change needs to be able to tell the system things that do not necessarily belong in the stable state description of the system. And if we want to avoid the possibility of ambiguity in certain cases, it is necessary to have dataSet as an attribute on datavalues, even though it does not belong there in dxf.

The best I have been able to do in my spike is make it optional (only required when the system actually has an ambiguity). This is of course not the only possible solution to this problem, I can at least think of the following approaches to try to handle this:

1) Do like above and have the rule about not using the dataSet attribute for regular uses of dxf specified somewhere else than in the format.
2) Not insist on one xml format for "incompatible" use cases.
3) Not allow functionality that is not representable in "canonical" dxf.
4) Disallow uses that are ambiguous (i.e. you can't post values if it is ambiguous how it should be treated).
5) "Approximate" functionality - allow all uses and make a "qualified" guess as to how it should be resolved.
6) Rework the domain model to accommodate the revealed inconsistencies.
7) Rework the api so that knowledge not in the dxf model is represented in other ways (as a contrived example, have a http header with the dataset uuid instead of in the dxf).

Still, my current preference ends up being for 2 in the short term and more towards 6 in the longer term (I'm not ruling out the other approaches, just emphasizing what I think is the best bet as of now).

Of course 2 has the obvious drawback that more formats means more maintenance, and for 6 it is always the case that it is much easier to see the problems with what you have rather than with what you think you want. And there are of course a whole lot of other concerns and implications whatever way we want to go. But overall I think something like my initial proposal [1] makes more sense than what I have above for clients needing to send (at least in my experience) dataValueSets.

I also think it could be a good idea to just admit that dataValueSets exist and see if we can introduce the concept to the domain model in an unobtrusive way (to group completeness and locking?), not really changing much, just stating more clearly a concept that is implicitly there. But I'm guessing this might have a hard time justifying priority over other things, and it is not essential (but would be a more iterative approach to domain model changes than having to do a big bang redesign when it gets critical, which in the long run could have significant advantages in itself).

In the longer term I think that it is pretty clear that as dhis2 is moving to be more and more of a "datawarehouse" rather than a self-contained system, we have to find a way to keep reevaluating the solution and the needs we have to support (like data values in multiple dataValueSets). I'm not saying that we don't need to keep supporting existing needs, and I'm not saying we should lightly part with what we have. But we must strive to have a core domain model that is reasonably simple, consistent and answers new needs and developments as they come along. And the growing integration requirements I think justify changes to the core model (even if it might be a while, and it makes supporting some older requirements a bit harder).

Btw, if we someday ended up with a model where datavalues belong to a datavalueset, I think my initial xml [1] would in effect be the new dxf..

Jo

[1] For reference, my initial proposal was something like this:

<dataValueSet xmlns="http://dhis2.org/schema/dataValueSet/0.1">
    dataSet="dataSet UUID"
    period="periodInIsoFormat"
    orgUnit="unit UUID">
  <dataValue dataElement="data element UUID" categoryOptionCombo="UUID, only specify if used" storedBy="string" value="value"/>
</dataValueSet>
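
(For concreteness, a minimal client-side sketch of posting such a dataValueSet. The /api/rpc path is taken from the subject of this thread; the base URL, credentials, content type and period value below are placeholder assumptions, not part of the spike.)

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class PostDataValueSet
{
    public static void main( String[] args ) throws Exception
    {
        String xml =
            "<dataValueSet xmlns=\"http://dhis2.org/schema/dataValueSet/0.1\"\n" +
            "    dataSet=\"dataSet UUID\" period=\"201102\" orgUnit=\"unit UUID\">\n" +
            "  <dataValue dataElement=\"data element UUID\" value=\"10\"/>\n" +
            "</dataValueSet>";

        URL url = new URL( "http://localhost:8080/dhis/api/rpc" ); // placeholder base URL + the spike's path
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod( "POST" );
        con.setRequestProperty( "Content-Type", "application/xml" );
        con.setRequestProperty( "Authorization", "Basic " + Base64.getEncoder()
            .encodeToString( "admin:district".getBytes( StandardCharsets.UTF_8 ) ) ); // placeholder credentials
        con.setDoOutput( true );

        try ( OutputStream out = con.getOutputStream() )
        {
            out.write( xml.getBytes( StandardCharsets.UTF_8 ) );
        }

        System.out.println( "HTTP " + con.getResponseCode() );
    }
}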

Keeping this short. I am on quota of a few email lines per day ...

tl;dr.. I should have cleaned this up, but have to get it sent before other work takes precedence..

So, the intention of this prototyping was to get some kind of general discussion going on what I see as challenges with the current system (in addition to hopefully ending up with a working api of some sort). And I wanted to base it on concrete examples, so that it would hopefully not be as abstract. I have commited a rework of the api in terms of how I understand a "common dxf format" version could look. The xml would basically look like this:

<dxf xmlns="http://dhis2.org/schema/dxf/x.x">
  <dataValues>
    <dataValue
      dataSet="uuid - only required when there is an actual ambiguity in the system"
      period="period"
      orgUnit="uuid"
      storedBy="string"
      dataElement="uuid"
      value="value" />
  </dataValues>
  <dataValueSets>
    <dataValueSet
      dataSet="uuid"
      orgUnit="uuid"
      period="period in iso format"
      complete="date (yyyymmdd)"/>
  </dataValueSets>
</dxf>

This is clearly ugly. I don't like it. You don't like it. I doubt
anyone would like it.

dataValueSets here is more or less just a renamed completeDataSetRegistrations (not necessary if we don't want to accommodate locks, they seem missing?), so basically this should almost exactly mirror current dxf (except uuids instead of ids). This is quite verbose, and it doesn't really mirror what the api clients we currently know of want to do, but that is not a fundamental problem (making usage and implementation a bit more difficult and messy, but not overly so).

In my view dxf's primary objective is as complete serialization of the domain model as it is implemented. It's basic mode is that it should represent the system state exactly, so that you can take the serialized format and put up more or less a clone of the exported system. An api, on the other hand is a message oriented protocol for changing state on the system. So, there will be competing interest, here.

There you are wrong. You keep making this point to the extent that I
think you now thoroughly believe it :) Complete cloning of systems
by serializing to dxf was (a long time ago) what I thought was the
primary objective, and perhaps it even was once, but now it definitely
is not. For one thing, practice seems to show that people tend to
share their databases via postgres/mysql dumps in the real world.
Where complete serialization is important is in the migrating of data
from say postgres to h2 or mysql to postgres or what have you. There
it would be really good if dxf had a more complete representation but
it always lags a bit and probably always will. I don't want to
downplay the importance too much but my recent experience has been
that judicious use of sed might be a more reasonable and efficient way
of managing these sql dialect conversions anyway. Or a more inward
looking xml serialization like xstream or something along the lines of
Murod's recent ideas.

The real principal use of dxf, as it has emerged, is as an
interoperable medium of exchange between systems - which I think
includes your use case. But you should appreciate it also includes
others such as iHRIS or OpenMRS data, and so the format needs to be
complete enough to be able to map effectively against things like sdmx
and maybe also the google xml Knut recently drew our
attention to, the Dataset Publishing Language (DSPL).

Now you shouldn't be surprised to notice that structurally all of
these have a tendency to look like what you have below labelled [1]
with some extra richness here and there. With the exception of
dataset which is I think a bit of a peculiarity.

[The dataset (and the categorycombo for that matter) has only a very
weak relationship with the structure of data. In fact, from a MVC
perspective, both of these constructs have more to do with the view
layer. They determine what appears on forms. If they were renamed
Form and FormDimensions or FormSectionDimensions we might be a lot
clearer. But we live with names.

So far as datavalues are concerned, when we capture a bundle from the
wild, what is important is the dataelement and the categoryoptioncombo
(and period, orgunit etc). It's really not that important what the
categorycombo was on the form, nor, as (Ola and Abyot have pointed out),
it seems, even necessarily which dataset it was part of when
it was collected. At least not from an analysis perspective.]

The significant thing to notice, is that I have ended up adding dataSet as an attribute on dataValue, even if it doesn't belong there for regular dxf use. This is difficult to avoid in a clean way: A client sending a message to the system to request a state change needs to be able to tell the system things that does not necessarily belong in the stable state description of the system. And if we want to avoid the possibility for ambiguity in certain cases, it is necessary to have dataSet as an attribute on datavalues, even though it does not belong there in dxf.

The best I have been able to do in my spike is make it optional (only required when the system actually have an ambiguity). This is of course not the only possible solution to this problem, I can at least think of the following approaches to try to handle this:

1) Do like above and have the rule about not using the dataSet attribute for regular uses of dxf specified somewhere else than in the format.
2) Not insist on one xml format for "incompatible" use cases.
3) Not allow functionality that is not representable in "canonical" dxf.
4) disallow uses that are ambigous (i.e. you cant post values if it is ambigous how it should be treated).
5) "Approximate" functionality - allow all uses and make a "qualified" guess as to how it should be resolved.
6) rework the domain model to accomodate the revealed inconsistencies.
7) rework the api so that knowledge not in the dxf model is represented in other ways (as a contrived example, have a http header with the dataset uuid instead of in the dxf).

Still, my current preference ends up being for 2 in the short term and more 6 in the longer term (I'm not ruling out the other approaches, just emphasizing what I think is the best bet as of now).

Of course 2 has the obvious drawback that more formats means more maintainance and for 6 it is always the case that it is much easier to see the problems with what you have rather than what think you want. And there are of course a whole lot of other concerns and implications whatever way we want to go. But overall I think something like my initial proposal [1] makes more sense than what I have above for clients needing to send (at least in my experience) dataValueSets.

I also think it could be a good idea to just admit that dataValueSets exists and see if we can introduce the concept to the domain model in an unobtrusive way (to group completeness and locking?), not rally changing much, just stating more clearly a concept that is implicitly there. But I'm guessing this might have a hard time justifying priority over other things, and it is not essential (but would be a more iterative approach to domain model changes than having to do a big bang redesign when it gets critical, which in the long run could have significant advantages in itself).

In the longer term I think that it is pretty clear that as dhis2 is moving to be more and more of a "datawarehouse" rather than a selfcontained system, we have to find a way to keep reevaluating the solution and the needs we have to support (like data values in multiple dataValueSets). I'm not saying that we don't need to keep supporting existing needs, and I'm not saying we should lightly part with what we have. But we must strive to have a core domain model that is reasonably simple, consistent and answers new needs and developments as they come along. And the growing integration requirements I think justifies changes to the core model (even if it might be a while, and it makes supporting some older requirements a bit harder).

Btw, if we someday ended up with a model where datavalues belong to a datavalueset, I think my initial xml [1] would in effect be the new dxf..

Jo

[1] For reference, my initial proposal was something like this:

<dataValueSet xmlns="http://dhis2.org/schema/dataValueSet/0.1">
    dataSet="dataSet UUID"
    period="periodInIsoFormat"
    orgUnit="unit UUID">
  <dataValue dataElement="data element UUID" categoryOptionCombo="UUID, only specify if used" storedBy="string" value="value"/>
</dataValueSet>

So this really looks fine enough. Except you have an extra '>' in
line 1 and maybe a few missing optional attributes. And there is
still some discomfort around the dataSet attribute. If you made that
optional, would you not find yourself with the same benefit as above,
i.e. looking substantially like legacy dxf, and still meet your original
requirement simply and elegantly? And in the short term we don't
persist the datavalueset but just use it as a convenience for
datavalues to inherit repeated attributes from.
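
(A sketch of that convenience on the receiving side: the posted set is simply expanded into ordinary data values that inherit period and orgUnit from the wrapper, and nothing about the set itself needs to be persisted. The type and field names are made up for illustration.)

import java.util.ArrayList;
import java.util.List;

public class DataValueSetExpansion
{
    // Hypothetical incoming representation of a posted dataValueSet
    record IncomingValue( String dataElement, String categoryOptionCombo, String value ) {}
    record IncomingSet( String dataSet, String period, String orgUnit, List<IncomingValue> values ) {}

    // Hypothetical persisted shape - the usual (dataElement, period, orgUnit, ...) key
    record DataValue( String dataElement, String categoryOptionCombo,
                      String period, String orgUnit, String value ) {}

    static List<DataValue> expand( IncomingSet set )
    {
        List<DataValue> result = new ArrayList<>();

        for ( IncomingValue v : set.values() )
        {
            // period and orgUnit are stated once on the set and copied onto every value
            result.add( new DataValue( v.dataElement(), v.categoryOptionCombo(),
                set.period(), set.orgUnit(), v.value() ) );
        }

        return result;
    }
}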

Anyway .. that's my quota up.

Regards
Bob

···

On 17 February 2011 12:34, Jo Størset <storset@gmail.com> wrote:

Hi,

first of all thanks for working on this Jo, had a look at the code and it seems quite fine.

Second I think your first version "[1]" was good, almost just fine and much better than the second. I would prefer having dxf as root but if you really really want to have isolated datavaluesets I guess we can pipe the message through an xslt and end up with dxf since the datavalueset part can be similar. The important thing for me is that we have one parser. We will see enough complexity through maintenance of new minor/major versions of the format/parser and I would really like to only have one of these things to maintain :)

Re the dxf objective there is no doubt that there is currently more use than complete serialization. To be concrete we can already create “tailored” export files through the detailed meta-data export UI, we do almost-full meta-data exports to update meta-data in different databases and in India/SL they do lots of exports of datavalues only between offline and online systems. Now we are sending datavaluesets. And I think all of these messages can belong in a common DXF2 schema.

Anyway, I think this will be good. I am hoping you can take the effort to put it back to [1]. We can then build on the parser you have made to make it work with DXF 2 when we have more time.

cheers

Lars

There you are wrong. You keep making this point to the extent that I
think you now thoroughly believe it :slight_smile:

Let me try to rephrase (without wanting to continue the semantics debate in any way :)

I think I view the uses you describe as api uses. They are external systems, and I'm not at all surprised that they tend to group values in sets :)

The point I was trying to make is that as long as
- we have values belonging to multiple sets in dhis
- *one* of the uses of dxf is to be able to serialize/deserialize the dhis structure
it is difficult to group values in datavaluesets in dxf.

I haven't thought it through, but I guess it would be possible to just duplicate the value across datavaluesets and let the metadata guide us to the fact that this is a duplicated value on deserialization. But this sounds a bit complicated to me.

Secondly, I think it might be a matter of performance and scaling. The jaxb parser approach is not really tailored for large imports (at least not my vanilla use of it). I guess there is quite a
bit of room for improvement, but I am not sure that it will be an equal replacement for the batch handler stuff (which seems to have performance as its primary concern).

If we were to replace the dxf parser with a new one based on something like what I have made here, that is at least a concern we should explicitly think through. While it is easy to write a "validating parser" for small stuff (like I have done here), I'm not sure how well it would scale. I think there is a possibility that we would end up having two parsers with different models of "content validation"? I think we should be able to avoid this, but I don't know it yet. I don't have much experience with this kind of jaxb use, and I'm not sure my approach to "manual" content validation is the right way to make that happen.

I guess another thing is the metadata and data mangling. That confuses me quite a bit, and I guess I might "incorrectly" label api uses as cases where those are "separate" concerns. And before this turns into an argument, yes, I agree that a split would probably be advantageous for all use cases (and stable ids is the first step towards that). So thinking of that as "api uses" is not really useful, but as a first step they sort of are..

In the end I think it reasonable to start down this road with a small (in the worst case totally reversible) implementation now, and hopefully we can look at how best to redesign the internals of this implementation as we hopefully have more time for this later this year. But I felt I needed some kind of general acceptance before doing this, and it is good to try to be explicit about the risks before making such a choice.

So far as datavalues are concerned, when we capture a bundle from the
wild, what is important is the dataelement and the categoryoptioncombo
(and period, orgunit etc). It's really not that important what the
categorycombo was on the form, nor, as (Ola and Abyot have pointed out),
it seems, even necessarily which dataset it was part of when
it was collected. At least not from an analysis perspective.]

I would of course agree, were it not for the fact that we *do* have the concept of locks and complete (and things like the required dataElements in the community module, that I think people want in general dhis as well). Yes it is a "data input" concern, but that *is* the concern when external systems send data..

And there is still some discomfort around the dataSet attribute. If you made that
optional would you not find yourself with the same benefit as above,
ie looking substantially like legacy dxf, still meet your original
requirement simply and elegantly.

Yes, it would be possible to make it optional. It wouldn't be very elegant to get "right", though.

I would have to iterate through the elements to find one with only one dataset attached, and then validate/operate on that. If all elements are in more than one data set (like with mobile specific data sets, where all elements could easily be in another "web" set as well), I would have to find the potential data sets that cover all data elements, and in effect validate against all (locking, possibly other things like required elements) before storing the values. And if things like complete (or other value set level properties) were set, I would probably have to deny it even if all values would be ok to save..

So yes, it would probably be optional if this grouping were to be used generally, but I'm just not sure of the value of promoting that for the current api uses? It should be easy enough to relax that constraint later?
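
(To make that concrete, a sketch of the resolution described above, assuming simple in-memory lookups; none of these names come from the actual codebase. The point is only that without an explicit dataSet, every candidate set has to pass validation.)

import java.util.*;

public class CandidateDataSetResolver
{
    // dataSet -> data elements it contains
    static Map<String, Set<String>> elementsByDataSet = new HashMap<>();
    // locked (dataSet, orgUnit, period) combinations, encoded as "ds:ou:period"
    static Set<String> locks = new HashSet<>();

    static Set<String> candidateDataSets( Collection<String> postedElements )
    {
        Set<String> candidates = new HashSet<>();

        for ( Map.Entry<String, Set<String>> e : elementsByDataSet.entrySet() )
        {
            if ( e.getValue().containsAll( postedElements ) ) // the set must cover every posted element
            {
                candidates.add( e.getKey() );
            }
        }
        return candidates;
    }

    static boolean canStore( Collection<String> postedElements, String orgUnit, String period )
    {
        Set<String> candidates = candidateDataSets( postedElements );

        if ( candidates.isEmpty() )
        {
            return false; // no data set covers the posted elements
        }

        // Not knowing which set the client meant, every candidate has to pass the lock check
        for ( String ds : candidates )
        {
            if ( locks.contains( ds + ":" + orgUnit + ":" + period ) )
            {
                return false;
            }
        }
        return true;
    }
}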

Jo

···

Den 17. feb. 2011 kl. 19.18 skrev Bob Jolliffe:

Hi,

Thanks for the heads-up, Lars, that's what I hoped for :) I'll try to make a first cut then. And thanks to everyone for the feedback, it has at least cleared up things a bit in my head.

I would prefer having dxf as root but if you really really want to have isolated datavalusets I guess we can pipe the message through an xslt and end up with dxf since the datavalueset part can be similar.

If we think dataValueSets are workable for a future dxf, it is enough to put it in the new dxf version namespace (what would we want that to be, btw?). I don't really see the advantage of a wrapper for the "post datavalueset" case, so I'd like to keep it simple if this isn't a blocking preference :) We should easily enough be able to accept receiving both wrapped and not, though. Would that cover your preference? It would probably require a little bit of consideration and it should be easy enough to add later, so I don't think I'll use time on it right now.

The important thing for me is that we have one parser. We will see enough complexity through maintenance of new minor/major versions of the format/parser and I would really like to only have one of these things to maintain:)

Yes, I tried saying something about this in my answer to Bob. There is always the chance that this strategy doesn't work out as well as hoped, but testing it with real code in smaller use cases first should at least make things a bit easier to evaluate (and in the worst case reversible). And as long as we think the new xml format will be ok..

Jo

···

Den 18. feb. 2011 kl. 01.58 skrev Lars Helge Øverland:

There you are wrong. You keep making this point to the extent that I
think you now thoroughly believe it :slight_smile:

Let me try to rephrase (without wanting to continue the semantics debate in any way :slight_smile:

I think I view the uses you describe as api uses. They are external systems, and I'm not at all surprised that they tend to group values in sets :slight_smile:

The point I was trying to make, is that as long as
- we have values belonging to multiple sets in dhis
- *one* of the uses of dxf is to be able to serialize/deserialize the dhis structure
it is difficult to group values in datavaluesets in dxf.

I haven't thought it through, but I guess it would be possible to just duplicate the value accross datavaluesets and let the metadata guide us to the fact that this is a duplicated value on deserialization. But this sounds a bit complicating to me.

Secondly, I think it might be a matter of performance and scaling. The jaxb parser approach is not really tailored for large imports (at least not my vanilla use of it). I guess there is quite a
bit of room for improvement, but I am not sure that it will be an equal replacement for the batch handler stuff (which seem to have performance as it's primary concern).

If we were to replace the dxf parser with a new one based on something like what I have made here, that is at least a concern we should explicitly think through. While it is easy to write a "validating parser" for small stuff (like I have done here), I'm not sure how well it would scale.I think there is a possibility that we would end up having two parsers with different models of "content validation"? I think we should be able to avoid this, but at I don't know it. I don't have much experience with this kind of jaxb use, and I'm not sure my approach to "manual" content validation is the right way to make that happen.

Using jaxb on small chunks is easy enough to do, so it should scale
reasonably. The problem is less about jaxb and more about hibernate
not handling multiple inserts, so yes, we might always need some sort
of batch handler for really large data dumps, but a compromise which
would allow some validation and yet bypass hibernate and allow some
performance gain might look more like having a mode flag or something
on the datavalue(set) save. We can worry about that later.
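
(A rough sketch of that kind of chunked handling - StAX to walk the document, JAXB to unmarshal one dataValue element at a time, and a separate flush step standing in for the batch insert. The DTO, element name and batch size are illustrative assumptions, not the spike's actual code; it uses the pre-Jakarta javax.xml.bind API.)

import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

import javax.xml.bind.JAXBContext;
import javax.xml.bind.Unmarshaller;
import javax.xml.bind.annotation.XmlAccessType;
import javax.xml.bind.annotation.XmlAccessorType;
import javax.xml.bind.annotation.XmlAttribute;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;

public class ChunkedDataValueReader
{
    @XmlAccessorType( XmlAccessType.FIELD )
    public static class DataValueDto
    {
        @XmlAttribute String dataElement;
        @XmlAttribute String period;
        @XmlAttribute String orgUnit;
        @XmlAttribute String value;
    }

    static final int BATCH_SIZE = 500;

    public static void read( InputStream in ) throws Exception
    {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader( in );
        Unmarshaller unmarshaller = JAXBContext.newInstance( DataValueDto.class ).createUnmarshaller();

        List<DataValueDto> buffer = new ArrayList<>();

        while ( reader.hasNext() )
        {
            if ( reader.getEventType() == XMLStreamConstants.START_ELEMENT
                && "dataValue".equals( reader.getLocalName() ) )
            {
                // Unmarshal only this element; the cursor ends up just past its end tag
                buffer.add( unmarshaller.unmarshal( reader, DataValueDto.class ).getValue() );

                if ( buffer.size() >= BATCH_SIZE )
                {
                    flush( buffer );
                }
            }
            else
            {
                reader.next();
            }
        }
        flush( buffer );
    }

    static void flush( List<DataValueDto> buffer )
    {
        // placeholder for the batch handler / bulk insert step discussed above
        buffer.clear();
    }
}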

I guess another thing is the medatadata and data mangling. That confuses me quite a bit, and I guess I might "uncorrectly" label api uses as cases where those are "separate" concerns. And before this turns into an argument, yes, I agree that a split would probably be advantageous for all use cases (and stable ids is the first step towards that). So thinking of that as "api uses" is not really useful, but as a first step they sort of are..

In the end I think it reasonable to start down this road with a small (in the worst case totally reversible) implementation now, and hopefully we can look at how best to redesign the internals of this implementation as we hopefully have more time for this later this year. But I felt I needed some kind of general acceptance before doing this, and it is good to try to be explicit about the risks before making such a choice.

Agree. My sense is that you should do what is sensible for your use
case without being shackled too much by what has been done before, but
with an eye on the fact that we will reuse the elements you create
(rather than create yet another set). So you have quite a free hand
but .. :)

So far as datavalues are concerned, when we capture a bundle from the
wild, what is important is the dataelement and the categoryoptioncombo
(and period, orgunit etc). It's really not that important what the
categorycombo was on the form, nor, as (Ola and Abyot have pointed out),
it seems, even necessarily which dataset it was part of when
it was collected. At least not from an analysis perspective.]

I would of course agree, where it not for the fact that we *do* have the concept of locks and complete (and things like the required dataElements in the community module, that I think people want in general dhis as well). Yes it is a "data input" concern, but that *is* the concern when external systems send data..

And there is still some discomfort around the dataSet attribute. If you made that
optional would you not find yourself with the same benefit as above,
ie looking substantially like legacy dxf, still meet your original
requirement simply and elegantly.

Yes, it would be possible to make it optional. It wouldn't be very elegant to get "right", though.

I would have to iterate through the elements to find one with only one dataset attached, and then validate/operate on that. If all elements are in more than one data set (like with mobile specific data sets, where all elements could easily be in another "web" set as well), I would have to find the potential data sets that cover all data elements, and in effect validate against all (locking, possibly other things like required element) before storing the values. And if things like complete (or other value set level properties) was set, I would probably have to deny it even if all values would be ok to save..

So yes, it would probably be optional if this grouping were to be used generally, but I'm just not sure of the value of promoting that for the current api uses? It sould be easy enough to relax that constraint later?

Sure. It would be optional at the schema level, but your api can
certainly insist upon it.

Cheers
Bob

···

On 18 February 2011 05:29, Jo Størset <storset@gmail.com> wrote:

Den 17. feb. 2011 kl. 19.18 skrev Bob Jolliffe:

Jo

Hi,

Thanks for the heads-up, Lars, that’s what I hoped for :slight_smile: I’ll try to make a first-cut then. And thanks to everyone for the feedback, it has at least has cleared up things a bit in my head.

I would prefer having dxf as root but if you really really want to have isolated datavalusets I guess we can pipe the message through an xslt and end up with dxf since the datavalueset part can be similar.

If we think datavalueSets are workable for a future dxf, it is enough to put it in the new dxf version namespace (what would we want that to be, btw?). I don’t really see the advantage of a wrapper for the “post datavalueset” case, so I’d like to keep it simple if this isn’t an blocking preference :slight_smile: We should easily enough be able to accept receiving both wrapped and not, though. Would that cover your preference? It would probably require a little bit of consideration and it should be easy enough to add later, so I don’t think I’ll use time on it right now.

This is just fine with me.

The important thing for me is that we have one parser. We will see enough complexity through maintenance of new minor/major versions of the format/parser and I would really like to only have one of these things to maintain:)

Yes, I tried saying something about this in my answer to Bob. There are always the chance that this strategy doesn’t work out as good as hoped, but testing it with real code in smaller use cases first should at least make things a bit easier to evaluate (and in the worst case reversible). And as long as we think the new xml format will be ok…

Yes I also share the performance concern, but also think that persistence will be the bottleneck. The only way to find out is to test it…

···

On Fri, Feb 18, 2011 at 7:58 AM, Jo Størset storset@gmail.com wrote:

Den 18. feb. 2011 kl. 01.58 skrev Lars Helge Øverland: