This article covers the statistical calculation of MissingRowsCols for data scientists. For information about the implementation, see the documentation.


Overview

MissingRowsCols provides users with a summary of the location and extent of missing data in the submitted dataset.

This summary is useful when multiple datasets need to be ingested and combined, when monitoring automated pipelines, and when validating datasets during database migrations.

The API response includes two tables formatted as JSON. Each table has two columns: the index of the row or column (‘row’ or ‘column’, depending on the table) and ‘pct_missing’, the proportion of data missing in that row or column, expressed as a decimal between 0 and 1. To convert to a percentage, multiply the value by 100 (e.g. 0.45 = 45%).
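As an illustration of the conversion described above, the following Python sketch turns the decimal values into percentages. The overall response shape, the top-level keys ‘rows’ and ‘columns’, and the sample values are assumptions made for this example; only the field names ‘row’, ‘column’ and ‘pct_missing’ come from this article.

```python
# Hypothetical response: the top-level keys "rows" and "columns" and all values
# are assumptions for illustration, not the documented response format.
response = {
    "rows": [
        {"row": 1, "pct_missing": 0.5},
        {"row": 3, "pct_missing": 0.25},
    ],
    "columns": [
        {"column": "price", "pct_missing": 0.45},
    ],
}


def to_percent(table):
    """Return a copy of a table with pct_missing scaled from 0-1 to 0-100."""
    return [
        {**entry, "pct_missing": round(entry["pct_missing"] * 100, 2)}
        for entry in table
    ]


percent_rows = to_percent(response["rows"])
percent_columns = to_percent(response["columns"])
```

Rounding to two decimal places is a presentation choice in this sketch and avoids floating-point noise in the scaled values.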

Missing data calculation by row

The missing data by row is based on the assumed number of columns in the submitted dataset, which is in turn based on the number of column names in the data.

Any cell in the table that is recorded as NA will be included in the reported percentage of missing data for that row.

The percentage of missing data is calculated as:

pct_missing(r) = N_missing(r) / N_columns

where r is the row under consideration, N_missing(r) is the number of missing cells in row r, and N_columns is the number of columns.
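The row calculation can be sketched in a few lines of Python. This is a minimal illustration rather than the service's implementation: the function name, the sample data, and the use of None to stand in for NA are all assumptions.

```python
# Hypothetical sample data; None stands in for a cell recorded as NA.
column_names = ["id", "price", "qty", "region"]
rows = [
    [1, 9.99, 3, "EU"],
    [2, None, None, "US"],
    [3, None, 1, None],
]


def pct_missing_by_row(rows, column_names):
    """Share of missing cells per row, using the number of column names
    as the denominator."""
    n_columns = len(column_names)
    if n_columns == 0:
        raise ValueError("at least one column name is required")
    return [
        {"row": r, "pct_missing": sum(cell is None for cell in row) / n_columns}
        for r, row in enumerate(rows)
    ]


row_summary = pct_missing_by_row(rows, column_names)
```

In this sample, the first row has no missing cells and the other two rows each have two of four cells missing, giving values of 0.0, 0.5 and 0.5.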

Missing data calculation by column

The calculation of missing data by column is based on the assumed number of rows in the submitted dataset, which is taken from the longest array sent in the JSON body.

NOTE: All columns are assumed to have the same number of rows as the longest column, so the cells by which a shorter column falls short are treated as missing.

The percentage of missing data in each column is calculated as:

pct_missing(c) = N_missing(c) / N_rows

where c is the column under consideration, N_missing(c) is the number of missing cells in column c, and N_rows is the number of rows.
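A minimal Python sketch of the column calculation follows. It assumes the body is a mapping from column name to an array of values, that None stands in for NA, and that cells absent from a shorter column count as missing; the function name and sample data are hypothetical.

```python
# Hypothetical body: one array per column; None stands in for NA.
body = {
    "id": [1, 2, 3, 4],
    "price": [9.99, None, 5.0],      # one row shorter than the longest column
    "qty": [3, None, None, 1],
}


def pct_missing_by_column(columns):
    """Share of missing cells per column, using the longest array as the
    assumed number of rows."""
    n_rows = max((len(values) for values in columns.values()), default=0)
    result = {}
    for name, values in columns.items():
        shortfall = n_rows - len(values)          # assumed to count as missing
        missing = sum(v is None for v in values) + shortfall
        result[name] = missing / n_rows if n_rows else 0.0
    return result


column_summary = pct_missing_by_column(body)
```

Here ‘price’ has one NA value and is one cell short of the four-row length, so two of its four assumed cells are missing.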

If all columns except one show missing data, it may be worth validating that there are no extra cells at the bottom of the submitted sheet (e.g. a ‘Total’ cell at the bottom of a financial dataset).
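One way to check for this situation before submitting data is to look for a single column that is longer than every other column. The sketch below is an assumption about how such a pre-check might be written on the client side; the function name and the sample sheet are hypothetical.

```python
def longest_column_outlier(columns):
    """Return the name of the only column longer than all others, or None
    if the longest length is shared by more than one column."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(lengths) < 2:
        return None
    longest = max(lengths.values())
    at_longest = [name for name, n in lengths.items() if n == longest]
    return at_longest[0] if len(at_longest) == 1 else None


# Hypothetical financial sheet with a 'Total' cell under the amount column.
sheet = {
    "account": ["a", "b", "c"],
    "region": ["EU", "US", "EU"],
    "amount": [10, 20, 30, 60],
}

outlier = longest_column_outlier(sheet)
```

A non-None result is only a hint that the column may contain an extra trailing cell; it does not prove that the extra value is a total.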

To keep the response as small as possible, only columns that contain missing data are included in the response. An empty response indicates there is no missing data in the dataset.
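The filtering described above can be sketched as follows. The input mapping, the function name and the sample values are assumptions for illustration; only the ‘column’ and ‘pct_missing’ field names come from this article.

```python
def only_missing(pct_by_column):
    """Keep only the columns whose share of missing data is above zero."""
    return [
        {"column": name, "pct_missing": pct}
        for name, pct in pct_by_column.items()
        if pct > 0
    ]


# Hypothetical per-column results before filtering.
with_gaps = only_missing({"id": 0.0, "price": 0.5, "qty": 0.25})
complete = only_missing({"id": 0.0, "price": 0.0})
```

In this sketch a dataset with no missing data yields an empty list, mirroring the empty response described above.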
