Aggregated data sets, representing suppressed data

Hello all!

I have a situation where some of the data I get will be “suppressed”: For example, due to privacy concerns/regulations, statistics with numbers (data-elements) less than 5 cannot be given to DHIS2. Usually this is done by coding the element with a special value (or missing).
So you will have these possible values:

  • A number >=5
  • Missing value/NULL (no observation)
  • ‘.’ - to indicate 0-4
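
A minimal sketch of how the three cases above could be told apart before import, in case it helps the discussion. The names `Cell`, `parse_cell`, `SUPPRESSED_MARK` and `THRESHOLD` are made up for illustration and are not part of DHIS2. It assumes the raw values arrive as text.

```python
from dataclasses import dataclass
from typing import Optional

SUPPRESSED_MARK = "."  # marker for "0-4", as in the list above
THRESHOLD = 5          # smallest count that may be reported exactly


@dataclass(frozen=True)
class Cell:
    kind: str             # "exact", "missing" or "suppressed"
    value: Optional[int]  # only set when kind == "exact"


def parse_cell(raw: Optional[str]) -> Cell:
    """Classify one raw value from the source data."""
    if raw is None or raw.strip() == "":
        return Cell("missing", None)
    text = raw.strip()
    if text == SUPPRESSED_MARK:
        return Cell("suppressed", None)
    number = int(text)  # raises ValueError for anything non-numeric
    if number < THRESHOLD:
        # Assumption: an exact value below the threshold means the
        # source did not apply its own suppression rule, so we stop.
        raise ValueError(f"unsuppressed small count: {number}")
    return Cell("exact", number)
```

Keeping "suppressed" separate from "missing" at this stage means the choice of what to store (for example 2, or a flag) can be made later without losing information.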

Another use of suppressed data is if the indicator (number of positive tests/total tests) is very close to 1 (indicating for example that everyone that was tested has a given disease).

I see no such datatypes available, and it of course causes problems when the values are used to make indicators or aggregations.

What are the best practices here?
The obvious candidate in the first case is to use 2, with the convention that it could be any number between 0 and 4.
For the second case (all tested are positive) - I see the need for the extra field.

Has there been any discussion of, or need for, another datatype that supports the setting “it’s a small number, but I can’t tell you what”?

Hi @gutorm

Perhaps, if the actual value of a number greater than or equal to five is not needed, your use case can be solved using an option set? The data element will have the following options:

Would it work if we simply assume that if it’s not the first (>=5) nor the second (null/no observation), then it’s the last (0-4)?

If so then you could have a data element controlled by a program rule that doesn’t allow numbers less than five, and another data element that’s either yes or no with the question, “is it a small number?”

In this case, if the first data element isn’t empty (>=5) then automatically the second data element will be a ‘No’ (using a program rule), but if it’s empty then the second data element must be selected with Yes or No…
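
To make the intended behaviour of the two data elements concrete, here is a rough sketch of the logic in Python. This is not DHIS2 program rule syntax; `apply_rules`, `count` and `is_small` are hypothetical names used only to illustrate the idea.

```python
from typing import Optional, Tuple

THRESHOLD = 5  # numbers below this are not allowed in the first data element


def apply_rules(count: Optional[int],
                is_small: Optional[bool]) -> Tuple[Optional[int], bool]:
    """Return the (count, is_small) pair after applying the two rules."""
    if count is not None:
        if count < THRESHOLD:
            raise ValueError("count must be empty or at least 5")
        # A real number was entered, so "is it a small number?" is No.
        return count, False
    if is_small is None:
        raise ValueError("is_small must be Yes or No when count is empty")
    # Empty count: Yes means 0-4 (suppressed), No means no observation.
    return None, is_small
```

With this, `(None, True)` stands for a suppressed 0-4 value and `(None, False)` for a missing observation.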

Just some thoughts; I don’t see my comment above as an exact or perfect response. Please feel free to explain further or comment on my response if I misunderstood something.

It’s been a long time since I worked with the Australian Census data sets, but when I used to, they would always randomise numbers below 4. But they stored the same data at multiple levels, so that if you wanted to search the data at a more aggregated level, you would get true/exact numbers, rather than an aggregation of randomised numbers.

Before implementing an average of ‘2’, I would look at how sparse your indicator is, and try to estimate what level of error might be introduced when these ‘averages’ are then aggregated up to higher geographies. (If there aren’t many numbers much higher than 4, then it’s quite possible that the most frequent values would in fact be 0 or 1, and you’ll therefore be inflating your higher-level figures quite a lot with this ‘average’ of 2…)
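
One way to get a feel for that error is to compare the total you get with ‘2’ against the lowest and highest totals the suppressed cells allow. A small sketch, assuming suppressed cells are marked with ‘.’ and missing ones with `None`; the function name and the sample numbers are invented for illustration.

```python
from typing import Iterable, Tuple, Union

Raw = Union[int, str, None]


def total_with_bounds(values: Iterable[Raw], imputed: int = 2,
                      low: int = 0, high: int = 4) -> Tuple[int, int, int]:
    """Return (lowest possible, imputed, highest possible) totals."""
    exact_sum = 0
    suppressed = 0
    for v in values:
        if v is None:      # no observation: contributes nothing
            continue
        if v == ".":       # suppressed: true value is between low and high
            suppressed += 1
        else:
            exact_sum += int(v)
    return (exact_sum + suppressed * low,
            exact_sum + suppressed * imputed,
            exact_sum + suppressed * high)


# Made-up commune values for one district
communes = [12, ".", ".", None, 7, "."]
print(total_with_bounds(communes))  # (19, 25, 31)
```

If the gap between the lowest and highest totals is large compared with the imputed total, the higher-level figure is mostly a product of the convention rather than of the data.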

If you do have a lot of suppressed data, it might be worth asking for the data for that indicator at a higher level of aggregation (eg districts instead of communes), which will give you much more accurate totals at higher levels. (If a lot of the data is suppressed, the lower-level data won’t be much real use anyway.) You could do this indicator-by-indicator, with more frequent indicators (eg ANC visits) stored in DHIS2 at a very granular level, and more sparse data (eg maternal deaths) stored in DHIS2 at a higher/more aggregated level.

I don’t know your use case, so this suggestion mightn’t be practical, but it’s probably still worth keeping the overall principle in mind.

Cheers, Sam.

Thanks both!

@Gassim I will look into option sets, but I fear that I will end up with a lot of hand coding later :) And for the time being it is not worth it (for me).

@SamuelJohnson - The randomization is a really nice little trick and can be very versatile!
For the inverse problem (the numbers are exact but should not necessarily be shown), it is also possible to (always) add a little random noise ‘wherever there are calculations’.
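
A toy sketch of what I mean by adding a little noise, using only the Python standard library. The function name and the noise range are my own choices for illustration, and this is not a vetted disclosure-control method.

```python
import random


def add_noise(value: int, rng: random.Random, max_noise: int = 2) -> int:
    """Shift a count by a random integer in [-max_noise, max_noise]."""
    noisy = value + rng.randint(-max_noise, max_noise)
    return max(0, noisy)  # counts cannot be negative


rng = random.Random()  # pass a seeded Random(...) for repeatable tests
shown = [add_noise(v, rng) for v in [0, 3, 17, 42]]
```

Each shown value stays within `max_noise` of the true value, so large counts remain useful while small ones can no longer be read off exactly.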

My use case is actually ‘find a quick and dirty solution while we get our legal/security stuff together’ - but it is always nice to hear community wisdom - AND someone might stumble upon it later.