This article covers the statistical calculation of MissingBias for data scientists. For information about the implementation, see the documentation.
Overview
MissingBias uses probability thresholds to decide whether missing data are biased. Bias is a persistent problem in statistical applications such as machine learning and AI, and data can be biased in many ways. This algorithm focuses specifically on whether the probability of a cell being missing (NA) depends on the value of another column in the data.
MissingBias transforms Y, the vector with missing data, into a binary indicator. The indicator takes only two values: 0 if the corresponding value in Y is observed, and 1 if it is missing.
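The transformation can be sketched in a few lines of Python. This is an illustration only, not the MissingBias implementation: the function name `missing_indicator` is hypothetical, and the sketch assumes that missing values in Y are represented as `None` or NaN.

```python
import math


def missing_indicator(y):
    """Return 1 where a value of Y is missing, 0 where it is observed.

    Assumption: a missing value is represented as None or a float NaN.
    """
    def is_missing(value):
        return value is None or (isinstance(value, float) and math.isnan(value))

    return [1 if is_missing(value) else 0 for value in y]


y = [3.2, None, 4.1, float("nan"), 5.0]
print(missing_indicator(y))  # [0, 1, 0, 1, 0]
```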
Distribution of X
MissingBias selects a probability test based on the probability distribution most suitable for the type of data in X, the column that may predict missingness. MissingBias uses the probability test to calculate the probability of missing values in Y, given the value of X. The result is evaluated against the null hypothesis that missing values in Y are randomly distributed over all values of X.
Ratio Calculation for Categorical Variables
The ratio probability test uses the G test of independence to calculate whether missing values in Y are randomly distributed over all categories of X.
G is the log likelihood ratio statistic. It compares the likelihood of the null hypothesis (missing values are randomly distributed over all categories of X) with the likelihood of the alternative hypothesis (missing values are more likely in some categories of X than others): G is -2 times the natural logarithm of the ratio of the first likelihood to the second.
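We can formulate this method as:
$$G = 2 \sum_{i} \sum_{j} O_{ij} \ln\left(\frac{O_{ij}}{E_{ij}}\right)$$
Where:
- O_{ij} = The observed number of rows in category i of X with missing status j (0 = observed, 1 = missing).
- E_{ij} = The number of rows expected in that cell under the null hypothesis: the total for category i multiplied by the total for missing status j, divided by the total number of rows.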
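MissingBias calculates a p value for the ratio probability test by comparing the G statistic to a chi-square distribution with degrees of freedom equal to the number of unique categories in X minus 1. Because the missing indicator has only two levels, the usual rule of (rows - 1) multiplied by (columns - 1) reduces to this value.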
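As a rough illustration of the ratio probability test, the sketch below computes the standard G statistic (twice the sum of observed count times the log of observed over expected count) and its chi-square p value using only the Python standard library. The names `g_test` and `chi2_sf` are hypothetical and are not part of MissingBias. The sketch assumes X is a list of category labels and that the missing indicator is a list of 0/1 values of the same length.

```python
import math
from collections import Counter


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution with integer df."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{j < df/2} (x/2)^j / j!
        term, total = 1.0, 1.0
        for j in range(1, df // 2):
            term *= (x / 2) / j
            total += term
        return math.exp(-x / 2) * total
    # Odd df: erfc term plus a finite series in x^j / (2j+1)!!
    term, total = 1.0, 0.0
    for j in range((df - 1) // 2):
        if j > 0:
            term *= x / (2 * j + 1)
        total += term
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * total)


def g_test(x, missing):
    """G test of independence between categories x and a 0/1 missing indicator.

    Returns (G, degrees of freedom, p value).
    """
    n = len(x)
    observed = Counter(zip(x, missing))  # counts per (category, status) cell
    category_totals = Counter(x)
    status_totals = Counter(missing)
    g = 0.0
    # Cells with an observed count of zero contribute nothing to G.
    for (category, status), count in observed.items():
        expected = category_totals[category] * status_totals[status] / n
        g += 2 * count * math.log(count / expected)
    df = len(category_totals) - 1
    return g, df, chi2_sf(g, df)


# Made-up data: category "a" has 5 of 10 values missing, "b" has 1 of 10.
x = ["a"] * 10 + ["b"] * 10
missing = [1] * 5 + [0] * 5 + [1] * 1 + [0] * 9
g, df, p = g_test(x, missing)
print(f"G = {g:.3f}, df = {df}, p = {p:.4f}")
```

A small p value suggests that missingness in Y is not spread evenly over the categories of X. With very small cell counts the chi-square approximation is less reliable.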
Regression model based calculation for continuous variables
The continuous probability test uses logistic regression to calculate whether missing values in Y are statistically predicted by the value of a continuous X. MissingBias calculates a p value for the continuous probability test using the Wald test.
We can formulate this method as:
$$P(\text{Missing}_i \mid X_i) = \frac{1}{1 + e^{-(a + b X_i)}}$$
Where:
- P ( Missing i | X i ) = Probability that observation i is missing in variable Y, given the value of variable X.
- e = The base of the natural logarithm (about 2.718).
- a = The constant (the log odds of a missing value when X is zero).
- b = The coefficient of variable X: how much the log odds of a missing value change for a unit change in X.
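The continuous probability test can be illustrated with a small, dependency-free sketch that fits the logistic model by Newton-Raphson and applies a Wald test to the coefficient b. The function names `fit_logistic` and `wald_p_value` are hypothetical, and the fitting method is an assumption: this article does not state how MissingBias estimates a and b. The sketch also assumes the data are not perfectly separable; if they are, the estimates diverge.

```python
import math


def fit_logistic(x, missing, iterations=25):
    """Fit P(missing | x) = 1 / (1 + exp(-(a + b*x))) by Newton-Raphson.

    Returns (a, b, se_b), where se_b is the standard error of b taken
    from the information matrix of the final iteration.
    """
    a, b = 0.0, 0.0
    for _ in range(iterations):
        # Score (gradient) and information matrix entries at the current a, b.
        g_a = g_b = h_aa = h_ab = h_bb = 0.0
        for xi, mi in zip(x, missing):
            p = 1.0 / (1.0 + math.exp(-(a + b * xi)))
            w = p * (1.0 - p)
            g_a += mi - p
            g_b += (mi - p) * xi
            h_aa += w
            h_ab += w * xi
            h_bb += w * xi * xi
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve the 2x2 system (information matrix) * step = score.
        a += (h_bb * g_a - h_ab * g_b) / det
        b += (h_aa * g_b - h_ab * g_a) / det
    se_b = math.sqrt(h_aa / det)
    return a, b, se_b


def wald_p_value(b, se_b):
    """Two-sided p value for the Wald statistic z = b / se_b."""
    z = b / se_b
    return math.erfc(abs(z) / math.sqrt(2))


# Made-up data: missingness becomes more common as X grows.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
missing = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
a, b, se_b = fit_logistic(x, missing)
print(f"a = {a:.3f}, b = {b:.3f}, Wald p = {wald_p_value(b, se_b):.4f}")
```

A positive b with a small Wald p value suggests that values of Y are more likely to be missing at higher values of X.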
Calculating the result