I am currently assessing the internal consistency of DHIS2 maternal and child health data using R and an internal data quality check toolkit that references the WHO data quality toolkit documents. However, the documents are not fully functional, so I would like to ask a few questions.
How should I determine the threshold? Some pages indicate a threshold of 20%, while others suggest 33%. Which standard should I follow for both consistency over time and consistency between indicators?
Is there any statistical evidence of internal consistency?
I suggest you review our YouTube series on data quality and our online data quality academy.
To answer your questions,
A threshold should be calculated using deviations from the median (the modified Z-score, which scales each value's distance from the median by the median absolute deviation) as opposed to percentages. The cutoff is usually 2 or 3 deviations, depending on how severe an outlier you want to find. Another method is to use interquartile ranges, which is available in the scatter plot chart in the Data Visualizer application. If you're looking for outliers in seasonal data, these methods will not be appropriate, and you will need to use a time-series approach, for example one based on the Mean Absolute Scaled Error (MASE). You can see an example of how this is done in R using data from DHIS2 in this presentation: https://www.youtube.com/watch?v=65GKAC64qIg
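To make the two non-seasonal methods concrete, here is a minimal sketch in Python (the same logic maps onto `median()`, `mad()` and `quantile()` in R). It is not the code from the presentation or from any DHIS2 app. The function names, the monthly counts, the cutoff of 3, and the 1.5 x IQR fence multiplier are all assumptions chosen for illustration.

```python
from statistics import median, quantiles

# Hypothetical monthly counts for one facility (illustrative values only).
monthly_counts = [120, 131, 125, 118, 127, 122, 410, 129, 124, 119, 126, 123]


def modified_z_scores(values):
    """Modified Z-score: 0.6745 * (x - median) / MAD, where MAD is the
    median absolute deviation from the median."""
    med = median(values)
    mad = median([abs(x - med) for x in values])
    if mad == 0:
        # More than half the values are identical; the score is undefined.
        raise ValueError("MAD is zero; modified Z-score cannot be computed")
    return [0.6745 * (x - med) / mad for x in values]


def flag_by_modified_z(values, cutoff=3.0):
    """Indices of values whose absolute modified Z-score exceeds the cutoff.
    The default cutoff of 3 is an assumption; 2 gives a stricter check."""
    scores = modified_z_scores(values)
    return [i for i, z in enumerate(scores) if abs(z) > cutoff]


def flag_by_iqr(values, k=1.5):
    """Indices of values outside [Q1 - k*IQR, Q3 + k*IQR].
    k = 1.5 is the conventional fence multiplier, assumed here."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, x in enumerate(values) if x < low or x > high]


mz_flags = flag_by_modified_z(monthly_counts)
iqr_flags = flag_by_iqr(monthly_counts)
print("Modified Z-score outliers at positions:", mz_flags)
print("IQR outliers at positions:", iqr_flags)
```

With these made-up counts, both methods flag only the value 410 (position 6, counting from 0). Neither accounts for seasonality, so a regular seasonal peak could be flagged in the same way.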
I'm not sure there is a straightforward answer to your question, unfortunately. DHIS2 includes a number of built-in data quality tools that align with the guidelines and recommendations of the WHO Data Quality Review toolkit and facilitate analysis of DHIS2 data, but I see you are using an external R tool. The resources from Scott are excellent for exploring further the types of analyses that can be performed within DHIS2.