While analyzing data accuracy, the following question came to mind.
At higher levels of the data hierarchy (e.g., provincial or national level), assessing external consistency is almost impossible without assuming that every piece of data stored in DHIS2 is accurate. In that case, I suppose all data analysts can do is assess internal consistency alone.
If there are three internal consistency checks in total (outliers, consistency between indicators, consistency over time), and two indicate inconsistency while one indicates consistency, what should I conclude?
Thank you for posting this question to the community.
I would like to point out that while the questions in the post are fairly general with regard to data quality, data accuracy is an aspect that DHIS2 implementers can exert considerable control over from the very beginning, at data entry. Later, when analyzing data quality, the analysis should help indicate whether the data is accurate or inaccurate, rather than resting on an assumption that it is accurate.
Examples of these controls include the ones mentioned in the docs:
You can verify data quality in different ways, for example:
At point of data entry, DHIS 2 can check the data entered to see if it falls within the minimum maximum value ranges of that data element (based on all previous data registered).
By defining validation rules, which can be run once the user has finished data entry. The user can also check the entered data for a particular period and organization unit(s) against the validation rules, and display the violations for these validation rules.
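To make those two checks concrete, here is a rough Python sketch of the idea behind them. This is purely illustrative and not the actual DHIS2 implementation; the historical values and the ANC rule are made up for the example.

```python
# Illustrative sketch only; DHIS2 performs these checks internally at data entry.

def within_minmax(value, history):
    """Min/max check: does the new value fall inside the range of previous data?"""
    return min(history) <= value <= max(history)

def anc_rule(anc1_visits, anc4_visits):
    """A hypothetical validation rule: 1st ANC visits should be >= 4th ANC visits."""
    return anc1_visits >= anc4_visits

previous_months = [120, 135, 128, 142, 130]      # made-up historical values
print(within_minmax(131, previous_months))       # True: inside the historical range
print(within_minmax(950, previous_months))       # False: flagged for review
print(anc_rule(anc1_visits=80, anc4_visits=95))  # False: a rule violation to display
```

Note that both checks flag data for review rather than proving it wrong, which matters for the "potential vs definite problems" point below.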
The lists you mentioned each provide different information for analysis, so if one indicates inconsistency it is important to check what is causing it. Each list requires a different sort of follow-up: for outliers, you need to understand which data is "numerically distant from the rest of the data" and why; for inconsistency between indicators, you need to understand the difference between those indicators and what is causing it; and for inconsistency over time, you may need a closer look at the time periods involved. My point is that each analysis method requires a different follow-up strategy to ensure overall coverage of data quality.
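As a simple illustration of the outlier case, here is a sketch (my own, not a DHIS2 tool) of flagging values that are "numerically distant from the rest of the data" using a basic z-score; the series and threshold are made up.

```python
import statistics

def flag_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    if sd == 0:
        return []  # all values identical: nothing is distant from the rest
    return [v for v in values if abs(v - mean) / sd > threshold]

series = [100, 98, 105, 102, 99, 101, 400]  # made-up monthly counts; 400 looks suspicious
print(flag_outliers(series, threshold=2.0))  # [400]
```

The follow-up question is then why 400 appears: it could be a data entry error, or a genuine campaign-month spike, which is exactly why each flag needs investigation rather than automatic correction.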
It’s important to bear in mind that any issues identified by data quality tools are just “potential problems”, not “definite problems”, and can in fact sometimes be real, accurate data. (For example, an outlier could be a genuine jump in the data rather than an error, and a poorly configured validation rule can generate occasional false positives.)
So in terms of your question around how to proceed, data validation itself should be done by staff as close to the frontline as possible (ideally by the staff capturing the data). This is not only so the staff doing corrections understand the data and have access to the original sources when correcting it; it also ensures there is a feedback loop to data capture staff, helping to prevent the same error being made in future.
If you’re not able to do this, and can only passively analyse data quality at a high level, I would (for the reasons described above) avoid looking at individual DQ issues and instead analyse only aggregate trends (i.e. larger numbers of errors over time). That way, any false positives should hopefully be averaged out in the larger DQ statistics.
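One way to implement that aggregate view is to count flagged issues per period rather than inspecting each one, and watch the trend. A minimal sketch, with a hypothetical issue list and made-up field names:

```python
from collections import Counter

# Hypothetical flagged DQ issues; the "period" and "type" fields are illustrative.
issues = [
    {"period": "2023-01", "type": "outlier"},
    {"period": "2023-01", "type": "validation_rule"},
    {"period": "2023-02", "type": "outlier"},
    {"period": "2023-03", "type": "outlier"},
    {"period": "2023-03", "type": "validation_rule"},
    {"period": "2023-03", "type": "outlier"},
]

# Aggregate counts per period: a rising trend matters more than any single flag,
# since occasional false positives wash out in the totals.
per_period = Counter(issue["period"] for issue in issues)
for period in sorted(per_period):
    print(period, per_period[period])
```

With real data you would plot these counts over time and investigate periods or facilities where the totals climb, rather than chasing individual flags.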