This article covers the statistical calculation of DetectOutliers for data scientists. For information about the implementation, see the documentation.
Overview
DetectOutliers provides users with a list of the outliers detected in an array of numbers or strings.
This is useful in any case where you wish to identify unusual values in a set of values to extract them, delete them, or otherwise examine their properties.
DetectOutliers uses three common outlier detection algorithms to identify unusual values. Only values flagged by all three algorithms are returned. Our current simulations suggest this 'hybrid' method is more effective at detecting synthetic outliers than much more computationally intensive methods. More efficient methods are added as they are published in peer-reviewed research.
In all cases, the mathematical difference in these algorithms lies in how the range of 'normal' values is calculated. The upper and lower bounds of this range are commonly called 'fences' in outlier detection literature.
Interquartile Range
The Interquartile Range (IQR) algorithm treats as outliers those values that lie more than 1.5 times the IQR below the first quartile or above the third quartile of the data.
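We can formulate this method as:
$$ \text{fences}_{IQR} = \left[\, Q_1 - 1.5 \cdot IQR,\;\; Q_3 + 1.5 \cdot IQR \,\right] $$
The IQR can be written as:
$$ IQR = Q_3 - Q_1 $$
where $Q_1$ and $Q_3$ are the first and third quartiles of the data.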
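The IQR rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the DetectOutliers implementation: the function name `iqr_outliers` is hypothetical, and the quartile convention (`method="inclusive"`) is an assumption, since this article does not say how quartiles are interpolated.

```python
import statistics

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # Assumption: "inclusive" quartile interpolation; other conventions
    # can shift the fences slightly on small samples.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

data = [10, 12, 11, 13, 12, 11, 14, 95]
print(iqr_outliers(data))  # [95]
```

For this sample, the quartiles are 11.0 and 13.25, so the fences sit at 7.625 and 16.625 and only 95 falls outside them.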
Standard Deviation
The standard deviation algorithm assumes an approximately normal distribution of scores, returning a positive result for any value more than 1.96 standard deviations above or below the mean of the data. As is conventional in frequentist statistics, we use 1.96 because, under a normal distribution, about 95% of values fall within 1.96 standard deviations of the mean (i.e. 2.5% of the distribution on each side will be considered outliers).
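We can formulate this method as:
$$ \text{fences}_{SD} = \left[\, \mu - 1.96\,\sigma,\;\; \mu + 1.96\,\sigma \,\right] $$
Where the standard deviation of the data is taken as:
$$ \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2} $$
with $\mu$ the mean of the data and $N$ the number of values.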
Note that in the case of DetectOutliers we use the SD for a population rather than for a sample. This is appropriate for the API, as the submitted data is the full population of points to be considered in the analysis, and we are not trying to extend our inference to other datasets.
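As a rough sketch of the standard deviation rule, the example below uses the population SD (dividing by N rather than N - 1), as the note above describes. The function name `sd_outliers` is hypothetical and the code is illustrative only, not the DetectOutliers source.

```python
import statistics

def sd_outliers(values, z=1.96):
    """Return the values more than z population SDs from the mean."""
    mean = statistics.fmean(values)
    # pstdev is the population standard deviation (divides by N),
    # unlike stdev, which is the sample version (divides by N - 1).
    sd = statistics.pstdev(values)
    lower, upper = mean - z * sd, mean + z * sd
    return [v for v in values if v < lower or v > upper]

data = [10, 12, 11, 13, 12, 11, 14, 95]
print(sd_outliers(data))  # [95]
```

Here the mean is 22.25 and the population SD is roughly 27.5, so the upper fence is about 76.2. The single large value inflates both the mean and the SD, which is the sensitivity that the median-based method is designed to avoid.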
Median Absolute Deviation
Median Absolute Deviation (MAD) is a useful algorithm in the case where data are skewed, as it is based on medians rather than means (which can be easily affected by large values). It can be seen as the median of deviations from the median value of the data.
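We can formulate this method as:
$$ \text{fences}_{MAD} = \left[\, \tilde{x} - k \cdot MAD,\;\; \tilde{x} + k \cdot MAD \,\right] $$
for a chosen threshold multiplier $k$, where $\tilde{x}$ is the median of the data. Where the MAD is defined as:
$$ MAD = \operatorname{median}\left( \left| x_i - \tilde{x} \right| \right) $$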
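The MAD calculation can be illustrated with the short Python sketch below. The names `mad` and `mad_outliers` are hypothetical, and the threshold multiplier `k=3.0` is purely an assumption for demonstration: this article does not state which multiplier (or scaling constant) DetectOutliers applies.

```python
import statistics

def mad(values):
    """Median of the absolute deviations from the median."""
    med = statistics.median(values)
    return statistics.median([abs(v - med) for v in values])

def mad_outliers(values, k=3.0):
    """Return values outside median +/- k * MAD.

    Assumption: k=3.0 is an illustrative threshold, not a documented one.
    """
    med = statistics.median(values)
    spread = k * mad(values)
    return [v for v in values if v < med - spread or v > med + spread]

data = [10, 12, 11, 13, 12, 11, 14, 95]
print(mad(data))           # 1.0
print(mad_outliers(data))  # [95]
```

Because both the centre and the spread are medians, the value 95 has no effect on the fences (9 and 15 here). Note that when more than half of the values are identical the MAD is 0, and this sketch would then flag every value that differs from the median.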
Calculating the result
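Elements of the data are considered outliers only if all three algorithms detect them. We can write this in set notation as:
$$ O = O_{IQR} \cap O_{SD} \cap O_{MAD} $$
where $O_{IQR}$, $O_{SD}$ and $O_{MAD}$ are the sets of elements flagged by the interquartile range, standard deviation and median absolute deviation algorithms respectively.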
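The intersection step can be sketched as follows. This is a self-contained illustration under stated assumptions, not the DetectOutliers implementation: all function names are hypothetical, the quartile convention is assumed to be "inclusive", and the MAD multiplier `k=3.0` is an arbitrary choice for the example. Indices are intersected rather than values so that duplicate values are handled correctly.

```python
import statistics

def fences_iqr(x, k=1.5):
    # Assumption: "inclusive" quartile interpolation.
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    return q1 - k * (q3 - q1), q3 + k * (q3 - q1)

def fences_sd(x, z=1.96):
    mean, sd = statistics.fmean(x), statistics.pstdev(x)  # population SD
    return mean - z * sd, mean + z * sd

def fences_mad(x, k=3.0):
    # Assumption: k=3.0 is illustrative; the article gives no multiplier.
    med = statistics.median(x)
    mad = statistics.median([abs(v - med) for v in x])
    return med - k * mad, med + k * mad

def detect_outliers_sketch(x):
    """Return the values flagged by all three rules, in input order."""
    flagged = [
        {i for i, v in enumerate(x) if v < lower or v > upper}
        for lower, upper in (fences_iqr(x), fences_sd(x), fences_mad(x))
    ]
    common = set.intersection(*flagged)
    return [x[i] for i in sorted(common)]

data = [10, 12, 11, 13, 12, 11, 14, 20, 95]
print(detect_outliers_sketch(data))  # [95]
```

In this sample, 20 lies outside the IQR fences (6.5 to 18.5) and the MAD fences (9 to 15) but inside the SD fences, so it is not returned. Only 95 is flagged by all three rules.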
References
- Yang, J., Susanto, R. and Franti, P. Outlier detection: how to threshold outlier scores? AIIPCC '19: Proceedings of the International Conference on Artificial Intelligence, Information Processing and Cloud Computing, 2019, 37:1-6.
- Leys, C., Klein, O., Bernard, P. and Licata, L. Detecting outliers: Do not use standard deviation around the mean, use absolute deviation around the median. Journal of Experimental Social Psychology, 2013, 49:4.