Optimal Outlier Detection Method for Left-Skewed Indicator Data: Exploring Modified Z-Score vs. Mean and Standard Deviation

Tammy_Kim · 20 August 2023 05:31

I am currently attempting to identify outliers in my dataset.
The majority of the data I am working with exhibits a left-skewed distribution.
According to the data quality assessment toolkit, there are two approaches to detecting outliers: using the mean and standard deviation, or using the modified z-score.
Due to the left-skewed distribution, I have chosen the modified z-score method for greater accuracy.

However, when applying the modified z-score, no outliers are detected, even though the data’s characteristics appear unusual.
For instance, the range of values spans from 1.1 to 56.8, and it appears reasonable to anticipate the presence of outliers.

Prior studies have predominantly used the mean and SD for outlier detection.
Yet, I believe that the modified z-score method is more robust than the mean and SD.
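
For readers comparing the two approaches, here is a minimal Python sketch of how each score is usually computed. The function names and the sample values are made up for illustration (they are not my actual data), and the cutoffs of 3 and 3.5 are simply the commonly quoted ones.

```python
import statistics

def z_scores(values):
    """Classic z-score: distance from the mean in units of the sample SD."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def modified_z_scores(values):
    """Modified z-score: distance from the median in units of the MAD."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero, so the modified z-score is undefined")
    return [0.6745 * (v - med) / mad for v in values]

# Made-up, left-skewed values for illustration only.
values = [1.1, 38.0, 42.5, 45.0, 47.2, 49.0, 50.3, 52.1, 53.4, 54.0, 55.2, 56.8]

flagged_sd = [v for v, z in zip(values, z_scores(values)) if abs(z) > 3]
flagged_mod = [v for v, m in zip(values, modified_z_scores(values)) if abs(m) > 3.5]
print("beyond 3 SD from the mean:", flagged_sd)
print("modified z-score beyond 3.5:", flagged_mod)
```

Running both on the same series makes it easy to see how much the choice of method and cutoff changes what gets flagged.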

Given this context, I am uncertain about how to proceed with outlier detection.
Here are a few potential options:

Use mean and SD method instead of the modified z-score
Use a modified z-score but adjust the threshold
Can you suggest an alternative outlier detection method?

I deeply appreciate your valuable insights and opinions on this matter.
Thank you!

LutfullahShifaa · 20 August 2023 13:53

Hello Kim
Thank you for posting good questions that help us learn together. Please see my response below:

I think that identifying outliers only mathematically will not be very helpful. For example, if you are assessing a disease variable, the disease might be seasonal or subject to epidemics. A range from 1.1 to 56.8, however, suggests that the variable is either something else or a disease rate. So it is important to know which variable we are dealing with.
It's important to choose a fair threshold, whether for the z-score or for SD. If we choose 3 SD or a z-score > 3 for a variable that is not normally distributed, we most likely won't detect outliers for most variables, especially for time-series data (the trend of a variable).
Besides those two methods, you can also use the limits below (values outside them are treated as outliers).

Q3 + 1.5 * IQR
Q1 - 1.5 * IQR
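
A minimal Python sketch of these two limits, in case it helps. The function name `iqr_fences`, the parameter `k`, and the sample values are assumptions made up for illustration. Note that quartile definitions differ between tools; `statistics.quantiles` uses its "exclusive" method by default, so results may differ slightly from a spreadsheet or R.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up, left-skewed values for illustration only.
values = [1.1, 38.0, 42.5, 45.0, 47.2, 49.0, 50.3, 52.1, 53.4, 54.0, 55.2, 56.8]

lower, upper = iqr_fences(values)
outliers = [v for v in values if v < lower or v > upper]
print("lower limit:", lower, "upper limit:", upper)
print("values outside the limits:", outliers)
```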

jason · 21 August 2023 07:47

Hi @Tammy_Kim

If your data is skewed, then I do not think that the modified z-score method is really going to help you much. The method uses the median as opposed to the mean, which helps in situations where the outlier itself influences the mean. However, if your data is not normalized to begin with, the median is also going to be skewed. From the description of your data, it sounds like it does not follow a normal distribution, in which case the use of another distribution (possibly a Poisson) would be more appropriate. However, this type of outlier analysis is not currently supported by DHIS2, and I am not sure there is a great solution here inside of DHIS2. Pulling the data out into R, Python, or the statistical tool of your choice might be the only option at this point.

Best regards
Jason

Tammy_Kim · 21 August 2023 08:24

Dear @LutfullahShifaa ,

Thank you for your valuable answer!

The value I mentioned was TT1 coverage (coverage of the first dose of tetanus toxoid vaccine among pregnant women attending their first antenatal care visit). I believe these values are mildly influenced by seasonal and/or epidemic fluctuations.
According to the WHO guidelines, values more than 2 SD from the mean are categorized as mild outliers, while values more than 3 SD from the mean are regarded as extreme outliers. The recommended threshold for the modified z-score is 3.5. Interestingly, the sources I have seen on the modified z-score consistently recommend a threshold of 3.5.
Given this, would it be acceptable for me to adjust the threshold independently?
I am curious about the foundational basis of your formula.
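
For reference, the modified z-score behind the 3.5 threshold mentioned above is usually written as follows. The notation is my own choice rather than the WHO module's: x̃ is the sample median and MAD is the median absolute deviation.

```latex
M_i = \frac{0.6745\,(x_i - \tilde{x})}{\mathrm{MAD}},
\qquad
\mathrm{MAD} = \operatorname{median}_j \,\lvert x_j - \tilde{x} \rvert
```

The constant 0.6745 is approximately the 0.75 quantile of the standard normal distribution, which puts M_i on roughly the same scale as an ordinary z-score when the data are normal. As far as I know, the cutoff |M_i| > 3.5 comes from Iglewicz and Hoaglin and is a rule of thumb calibrated on roughly normal data.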

I’ve contemplated this matter from various angles, and I truly appreciate your collaborative thought process. I eagerly await your response.

Tammy_Kim · 21 August 2023 08:33

Dear @jason

Thank you for your amazing insight!

The data I am currently working with does not exhibit a normal distribution.
Do you think it would be appropriate to normalize the data first and then calculate the median or mean along with the standard deviation?

I have limited expertise in this area, so I am relying on the WHO guideline (Data quality assurance: module 2: discrete desk review of data quality).
It mentions that the ‘modified z-score method is useful for small samples (which is my case) and is more tolerant than the z-score to extreme values’.

I also agree with your approach; ‘normalization’ seems like a promising solution.
Your insights would be greatly appreciated.
Thank you very much!

LutfullahShifaa · 21 August 2023 14:29

Hello again @Tammy_Kim
I suggest applying your formula to the TT1 and 1st ANC raw data separately instead of to the TT1 coverage. Try it and see the result. Also, outlier detection will be more sensitive at a lower level (facility level).
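
A rough Python sketch of that idea, checking the numerator and the denominator as separate series for one facility. The monthly counts, the series names, and the helper `iqr_fences` are all hypothetical, and the limits used are the Q1 - 1.5 * IQR and Q3 + 1.5 * IQR ones from my earlier post.

```python
import statistics

def iqr_fences(values, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Hypothetical monthly counts for a single facility (12 months).
monthly = {
    "TT1 doses": [40, 42, 38, 45, 41, 39, 44, 43, 5, 40, 42, 41],
    "ANC 1st visits": [80, 85, 78, 90, 82, 79, 88, 86, 84, 81, 83, 82],
}

flagged = {}
for name, series in monthly.items():
    lower, upper = iqr_fences(series)
    # Keep (month number, value) pairs that fall outside the limits.
    flagged[name] = [(month, v) for month, v in enumerate(series, start=1)
                     if v < lower or v > upper]
    print(name, "->", flagged[name])
```

Checking the two series separately shows whether an odd coverage value comes from the numerator, the denominator, or both.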

See the details of the formula I provided:
Q1: the first quartile
Q3: the third quartile
IQR: the interquartile range (Q3 - Q1)
1.5: the multiplier applied to the IQR, so the limits sit 1.5 * IQR below Q1 and above Q3

It's the same formula used to calculate the whiskers of a box plot, but it's not available in DHIS2 so far.

Good luck