Reconstructing a categoryoptioncombo (long story)

Hi

Here’s a problem. Apologies, its a long mail, but its a serious business and needs to be untangled.

Two or more systems have matching dataelements, categorycombos, categories and categoryoptions. They could be matched on uid, name, code or what ever. Assuming they also have matching orgunit identifiers, those two systems should be able to exchange data. There is really no need for either of them to know anything about the other’s categoryoptioncombos. Which is a good thing on a number of fronts. Not least being that if either one of the two is not dhis2 then it won’t have the faintest notion of a categoryoptioncombo anywat. And even if they were both dhis2, we all know that keeping these catoptcombos in synch is notoriously difficult.

So I’ve been over some of this ground before, but now thinking about implementation, there are some missing pieces in our model (and some shortcomings of the java language) which makes this a bit trickier than it should be. Picture this datavalue being imported (using codes for legibility):

<datavalue dataElement=‘MalariaCases’ sex=‘M’ age=‘under5’ … />

  1. Once we know the dataelement we can immediately retrieve the categorycombo, which tells us to expect two more attributes: sex and age in this case.

  2. We could go the database at this point and query from the categoryoptioncombos_categoryoptions table, having first retrieved the primary ids for the categoryoptions. This would certainly work, but the table might be quite big and the query would be required many times for a large datavalueset. Given that we know the categorycombo from 1 above, we should only need to query from a very much smaller set of data contained in an in-memory data structure.

  3. But what would such a data structure look like? Essentially what is required is a multidimensional associative array which is keyed along each of its dimensions using the categoryoptions of a category. For most of our categorycombos this would be a 1 or 2 dimensional array, but with some rarer cases of 3 or 4 categories. That would allow lookups of the sort getCatOptCombo(sex=‘M’, age=‘u5’, …)

Such a dynamic associative array is a natural paradigm in languages like perl, tcl, php, javascript, and probably R, but java leaves us a bit short. The structure is not easily expressed, at least not efficiently.

  1. One alternative is to model it as a tree structure. This has a minor drawback that a tree has to put the categories (the layers of the tree) in some order which is not implicit in our model, but that’s not a very big problem. If you know the order they were put in, you can use the same order to search them out. A bit of xml below shows more or less what the structure of that tree would be like for a typical age-sex combo:

Note the xml is incidental. The point is the tree structure. Mind you, java doesn’t have a built in tree type but it does have a DOM model which could be used very adequately for this kind of structure (tip off stackoverflow). Assuming we had created such a DOM model for a particular categorycombo, then we can answer our question of what the catoptcombo is for age=‘<5’ and sex=‘M’ with a relatively simple XPath query like:

//categoryOption[@code=‘M’]/category/categoryOption[@code=‘u5’]/catoptcombo/@id

Given these categorycombo trees will each individually be relatively small, these will in fact be quite efficient lookups.

So I’m left with a few questions.

  1. does it make sense to use a DOM tree for this rather than invent our own custom tree structure? I’m inclined to do the DOM first because its easy and will definitely work. But won’t be the fastest. Could optimize a custom tree later. Mind you, after considering 3 below, the DOM tree might make more sense than appears at first pass.

  2. Am I missing something. Is a tree the right way to do this? Is there something about java’s apparent lack of multi-associative-arrays which I am just not getting?

  3. Looking at the XML above (which was meant to be incidental), I now think it makes a lot more sense than what we actually output currently from our resources api. For example, cutting a few bits from https://apps.dhis2.org/demo/api/categoryCombos/dzjKKQq0cSO:

This representation in itself is actually not very useful and asks more questions than it answers. In particular, and this is important, how is that list of categoryoptioncombos related to that list of categories. And even more particularly their categoryoptions. A not inconsiderable number of additional api requests would have to be made before reaching sensible answers to this pretty fundamental question. In fact the arbitrary presentation of these two elements (optioncombos and categories) is even a bit odd in the sense that both of these lists are entirely derivable from the other. The more verbose but explicit tree model above is IMHO a much more useful representation of the resource.

So that’s my thinking on the problem at present. We need to make a new object (eg CategoryComboTree) to allow for simple lookups of the local catoptcombos. As we read in a datavalueset, we check each dataelement for its categorycombo. If we haven’t yet instantiated a tree for that combo, we instantiate one. (Typical datavaluesets might only have a small handful of these). Using our tree, we look up the catoptcombo for each set of category attributes in the datavalue.

Any better ideas?

Bob

Hi Bob,

Good question. I like the idea of an in-memory cache for speed, as you suggest. You might try using a HashTable where the key is an array of option value Strings and the value of the HashTable is the optionCombo. As you process the import, each time you get from the dataElement a categoryCombo you haven’t seen before, then get all the optionCombos for this categoryCombo and put them into your HashTable. The order you put them into the key array can be the same as the order of the DataElementCategoryCombo.getCategories() method, since it returns a list. When looking up a bunch of category values, just put them in the same order into the array.

Obviously once you’ve built the values->combo lookup, you will want to reuse it as much as possible. You could put this into a com.google.common.cache.Cache so that it can be resued not only by subsequent record in the same import, but by other imports that come before the cache entry ages out. The only danger of this in theory is that someone could extend a category combo or add new option values, and then try an import before the cache expires. Although this is extremely unlikely, you can protect against it: If a values->combo lookup fails, remove the cached HashTable for this categoryCombo and rebuild it. If it still fails, then you’ve got a real error. :slight_smile:

Cheers,

Jim

···

On Thu, Jan 29, 2015 at 1:38 PM, Bob Jolliffe bobjolliffe@gmail.com wrote:

Hi

Here’s a problem. Apologies, its a long mail, but its a serious business and needs to be untangled.

Two or more systems have matching dataelements, categorycombos, categories and categoryoptions. They could be matched on uid, name, code or what ever. Assuming they also have matching orgunit identifiers, those two systems should be able to exchange data. There is really no need for either of them to know anything about the other’s categoryoptioncombos. Which is a good thing on a number of fronts. Not least being that if either one of the two is not dhis2 then it won’t have the faintest notion of a categoryoptioncombo anywat. And even if they were both dhis2, we all know that keeping these catoptcombos in synch is notoriously difficult.

So I’ve been over some of this ground before, but now thinking about implementation, there are some missing pieces in our model (and some shortcomings of the java language) which makes this a bit trickier than it should be. Picture this datavalue being imported (using codes for legibility):

<datavalue dataElement=‘MalariaCases’ sex=‘M’ age=‘under5’ … />

  1. Once we know the dataelement we can immediately retrieve the categorycombo, which tells us to expect two more attributes: sex and age in this case.

  2. We could go the database at this point and query from the categoryoptioncombos_categoryoptions table, having first retrieved the primary ids for the categoryoptions. This would certainly work, but the table might be quite big and the query would be required many times for a large datavalueset. Given that we know the categorycombo from 1 above, we should only need to query from a very much smaller set of data contained in an in-memory data structure.

  3. But what would such a data structure look like? Essentially what is required is a multidimensional associative array which is keyed along each of its dimensions using the categoryoptions of a category. For most of our categorycombos this would be a 1 or 2 dimensional array, but with some rarer cases of 3 or 4 categories. That would allow lookups of the sort getCatOptCombo(sex=‘M’, age=‘u5’, …)

Such a dynamic associative array is a natural paradigm in languages like perl, tcl, php, javascript, and probably R, but java leaves us a bit short. The structure is not easily expressed, at least not efficiently.

  1. One alternative is to model it as a tree structure. This has a minor drawback that a tree has to put the categories (the layers of the tree) in some order which is not implicit in our model, but that’s not a very big problem. If you know the order they were put in, you can use the same order to search them out. A bit of xml below shows more or less what the structure of that tree would be like for a typical age-sex combo:

Note the xml is incidental. The point is the tree structure. Mind you, java doesn’t have a built in tree type but it does have a DOM model which could be used very adequately for this kind of structure (tip off stackoverflow). Assuming we had created such a DOM model for a particular categorycombo, then we can answer our question of what the catoptcombo is for age=‘<5’ and sex=‘M’ with a relatively simple XPath query like:

//categoryOption[@code=‘M’]/category/categoryOption[@code=‘u5’]/catoptcombo/@id

Given these categorycombo trees will each individually be relatively small, these will in fact be quite efficient lookups.

So I’m left with a few questions.

  1. does it make sense to use a DOM tree for this rather than invent our own custom tree structure? I’m inclined to do the DOM first because its easy and will definitely work. But won’t be the fastest. Could optimize a custom tree later. Mind you, after considering 3 below, the DOM tree might make more sense than appears at first pass.

  2. Am I missing something. Is a tree the right way to do this? Is there something about java’s apparent lack of multi-associative-arrays which I am just not getting?

  3. Looking at the XML above (which was meant to be incidental), I now think it makes a lot more sense than what we actually output currently from our resources api. For example, cutting a few bits from https://apps.dhis2.org/demo/api/categoryCombos/dzjKKQq0cSO:

This representation in itself is actually not very useful and asks more questions than it answers. In particular, and this is important, how is that list of categoryoptioncombos related to that list of categories. And even more particularly their categoryoptions. A not inconsiderable number of additional api requests would have to be made before reaching sensible answers to this pretty fundamental question. In fact the arbitrary presentation of these two elements (optioncombos and categories) is even a bit odd in the sense that both of these lists are entirely derivable from the other. The more verbose but explicit tree model above is IMHO a much more useful representation of the resource.

So that’s my thinking on the problem at present. We need to make a new object (eg CategoryComboTree) to allow for simple lookups of the local catoptcombos. As we read in a datavalueset, we check each dataelement for its categorycombo. If we haven’t yet instantiated a tree for that combo, we instantiate one. (Typical datavaluesets might only have a small handful of these). Using our tree, we look up the catoptcombo for each set of category attributes in the datavalue.

Any better ideas?

Bob


Mailing list: https://launchpad.net/~dhis2-devs

Post to : dhis2-devs@lists.launchpad.net

Unsubscribe : https://launchpad.net/~dhis2-devs

More help : https://help.launchpad.net/ListHelp

Yes I’ve been thinking a bit about HashTables and hashMaps. Effectively that’s the closest thing you get to an associative array. And can simulate a tree. If we concede the ordering of keys then a simple HashMap<String,Integer> would even do the trick if we concatenate the categoryoption identifiers together, like for example (a 2 dimensional categorycombo):

catcomboMap.put( “HbIugXRzrcK.mBl2TUqeODx”,43);

catcomboMap.put( “HbIugXRzrcK.h5uidbruf8nJ”,44);

where the “value” is the internal id of the catoptcombo, which is all we need to perform the insert.

This is effectively a tree with keys on the branches :slight_smile: A more efficient variant would be some sort of radix tree like what routers use to process IP addresses, but maybe the above would be “good enough”.

We could leave these things around in a cache, but I doubt if its a huge overhead to build up a few small maps during the import. One reason for doing them on the fly is that you need to cater for different identifier schemes being used. So the above will work for uids, but if the import was using names or codes you’d want to make different maps.

Agree with what you say about constructing them on encountering each new categorycombo. This is also what I was suggesting.

···

On 29 January 2015 at 22:07, Jim Grace jimgrace@gmail.com wrote:

Hi Bob,

Good question. I like the idea of an in-memory cache for speed, as you suggest. You might try using a HashTable where the key is an array of option value Strings and the value of the HashTable is the optionCombo. As you process the import, each time you get from the dataElement a categoryCombo you haven’t seen before, then get all the optionCombos for this categoryCombo and put them into your HashTable. The order you put them into the key array can be the same as the order of the DataElementCategoryCombo.getCategories() method, since it returns a list. When looking up a bunch of category values, just put them in the same order into the array.

Obviously once you’ve built the values->combo lookup, you will want to reuse it as much as possible. You could put this into a com.google.common.cache.Cache so that it can be resued not only by subsequent record in the same import, but by other imports that come before the cache entry ages out. The only danger of this in theory is that someone could extend a category combo or add new option values, and then try an import before the cache expires. Although this is extremely unlikely, you can protect against it: If a values->combo lookup fails, remove the cached HashTable for this categoryCombo and rebuild it. If it still fails, then you’ve got a real error. :slight_smile:

Cheers,

Jim

On Thu, Jan 29, 2015 at 1:38 PM, Bob Jolliffe bobjolliffe@gmail.com wrote:

Hi

Here’s a problem. Apologies, its a long mail, but its a serious business and needs to be untangled.

Two or more systems have matching dataelements, categorycombos, categories and categoryoptions. They could be matched on uid, name, code or what ever. Assuming they also have matching orgunit identifiers, those two systems should be able to exchange data. There is really no need for either of them to know anything about the other’s categoryoptioncombos. Which is a good thing on a number of fronts. Not least being that if either one of the two is not dhis2 then it won’t have the faintest notion of a categoryoptioncombo anywat. And even if they were both dhis2, we all know that keeping these catoptcombos in synch is notoriously difficult.

So I’ve been over some of this ground before, but now thinking about implementation, there are some missing pieces in our model (and some shortcomings of the java language) which makes this a bit trickier than it should be. Picture this datavalue being imported (using codes for legibility):

<datavalue dataElement=‘MalariaCases’ sex=‘M’ age=‘under5’ … />

  1. Once we know the dataelement we can immediately retrieve the categorycombo, which tells us to expect two more attributes: sex and age in this case.

  2. We could go the database at this point and query from the categoryoptioncombos_categoryoptions table, having first retrieved the primary ids for the categoryoptions. This would certainly work, but the table might be quite big and the query would be required many times for a large datavalueset. Given that we know the categorycombo from 1 above, we should only need to query from a very much smaller set of data contained in an in-memory data structure.

  3. But what would such a data structure look like? Essentially what is required is a multidimensional associative array which is keyed along each of its dimensions using the categoryoptions of a category. For most of our categorycombos this would be a 1 or 2 dimensional array, but with some rarer cases of 3 or 4 categories. That would allow lookups of the sort getCatOptCombo(sex=‘M’, age=‘u5’, …)

Such a dynamic associative array is a natural paradigm in languages like perl, tcl, php, javascript, and probably R, but java leaves us a bit short. The structure is not easily expressed, at least not efficiently.

  1. One alternative is to model it as a tree structure. This has a minor drawback that a tree has to put the categories (the layers of the tree) in some order which is not implicit in our model, but that’s not a very big problem. If you know the order they were put in, you can use the same order to search them out. A bit of xml below shows more or less what the structure of that tree would be like for a typical age-sex combo:

Note the xml is incidental. The point is the tree structure. Mind you, java doesn’t have a built in tree type but it does have a DOM model which could be used very adequately for this kind of structure (tip off stackoverflow). Assuming we had created such a DOM model for a particular categorycombo, then we can answer our question of what the catoptcombo is for age=‘<5’ and sex=‘M’ with a relatively simple XPath query like:

//categoryOption[@code=‘M’]/category/categoryOption[@code=‘u5’]/catoptcombo/@id

Given these categorycombo trees will each individually be relatively small, these will in fact be quite efficient lookups.

So I’m left with a few questions.

  1. does it make sense to use a DOM tree for this rather than invent our own custom tree structure? I’m inclined to do the DOM first because its easy and will definitely work. But won’t be the fastest. Could optimize a custom tree later. Mind you, after considering 3 below, the DOM tree might make more sense than appears at first pass.

  2. Am I missing something. Is a tree the right way to do this? Is there something about java’s apparent lack of multi-associative-arrays which I am just not getting?

  3. Looking at the XML above (which was meant to be incidental), I now think it makes a lot more sense than what we actually output currently from our resources api. For example, cutting a few bits from https://apps.dhis2.org/demo/api/categoryCombos/dzjKKQq0cSO:

This representation in itself is actually not very useful and asks more questions than it answers. In particular, and this is important, how is that list of categoryoptioncombos related to that list of categories. And even more particularly their categoryoptions. A not inconsiderable number of additional api requests would have to be made before reaching sensible answers to this pretty fundamental question. In fact the arbitrary presentation of these two elements (optioncombos and categories) is even a bit odd in the sense that both of these lists are entirely derivable from the other. The more verbose but explicit tree model above is IMHO a much more useful representation of the resource.

So that’s my thinking on the problem at present. We need to make a new object (eg CategoryComboTree) to allow for simple lookups of the local catoptcombos. As we read in a datavalueset, we check each dataelement for its categorycombo. If we haven’t yet instantiated a tree for that combo, we instantiate one. (Typical datavaluesets might only have a small handful of these). Using our tree, we look up the catoptcombo for each set of category attributes in the datavalue.

Any better ideas?

Bob


Mailing list: https://launchpad.net/~dhis2-devs

Post to : dhis2-devs@lists.launchpad.net

Unsubscribe : https://launchpad.net/~dhis2-devs

More help : https://help.launchpad.net/ListHelp