SciMi - Scientific Microservices

Overview

MissingRowsCols takes a full dataset and reports, for each row and each column, the proportion of values that are missing.

Rows and columns are reported in a standard JSON format. Elements with no missing values are not included in the response.
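
To make the calculation concrete, here is a minimal local sketch in Python that reproduces the figures in the example response shown later on this page. It is not the service's implementation. The function name `summarise_missing`, the treatment of absent keys and `None` values as missing, the use of every key seen in any record as the column set, and the rounding to four decimal places are all assumptions inferred from that example.

```python
def summarise_missing(records):
    """Local sketch (an assumption, not the API's code) of the summary."""
    # Assumed column set: every key seen in any record, in first-seen order.
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)

    def is_missing(record, column):
        # Assumption: an absent key or a None value counts as missing.
        return record.get(column) is None

    rows_out = []
    for index, record in enumerate(records):
        n_missing = sum(is_missing(record, c) for c in columns)
        if n_missing:  # rows with nothing missing are left out
            share = round(n_missing / len(columns), 4)
            rows_out.append({"row": index, "pct_missing": share})

    cols_out = []
    for column in columns:
        n_missing = sum(is_missing(r, column) for r in records)
        if n_missing:  # columns with nothing missing are left out
            share = round(n_missing / len(records), 4)
            cols_out.append({"column": column, "pct_missing": share})

    return {"rows": rows_out, "columns": cols_out}


data = [
    {"Column1": 35.9146, "Column2": 351.4387, "Column3": 267.0756},
    {"Column1": 48.9403},
    {"Column1": 87.4787, "Column3": 205.4431},
]
summary = summarise_missing(data)
print(summary)
```

Row 0 has all three values and Column1 is complete, so neither appears in the output, which matches the rule that elements with no missing values are omitted.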

Endpoint URL


POST https://api.scientificmicroservices.com/missingrowscols

Request Format

The request must be an HTTP POST request with a JSON body. The body is an array of objects, one per row, typically built from a tabular dataset (e.g. .csv or .xlsx files).

If you do not have your own workflow for converting files, an online file-to-JSON converter can be used.

JSON datasets submitted to the API can contain numeric and string fields (including 'geography' strings of spatial files).
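
As an alternative to a manual conversion step, the following is a minimal sketch of turning CSV text into the array-of-objects shape used in the example request, using only the Python standard library. The helper name `csv_to_records` is hypothetical, and two choices here are assumptions rather than documented API behaviour: empty cells are dropped from the row object (as in the example request), and any cell that parses as a finite number is sent as a number.

```python
import csv
import io
import json
import math


def csv_to_records(csv_text):
    """Turn CSV text into a list of row objects, leaving out empty cells."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {}
        for column, value in row.items():
            if column is None or value is None or value.strip() == "":
                continue  # surplus or empty cell: omit the key entirely
            try:
                number = float(value)
            except ValueError:
                record[column] = value  # keep as a string field
            else:
                # Non-finite values (nan, inf) are not valid JSON numbers,
                # so they are kept as strings in this sketch.
                record[column] = number if math.isfinite(number) else value
        records.append(record)
    return records


csv_text = """Column1,Column2,Column3
35.9146,351.4387,267.0756
48.9403,,
87.4787,,205.4431
"""
records = csv_to_records(csv_text)
payload = json.dumps(records)  # string suitable for the POST body
print(payload)
```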

Example request


curl --request POST \
--url 'https://api.scientificmicroservices.com/missingrowscols' \
--header 'Content-Type: application/json' \
--header 'email:[YOUR_EMAIL]' \
--header 'key:[YOUR_KEY]' \
--data '[
{"Column1":35.9146,"Column2":351.4387,"Column3":267.0756},
{"Column1":48.9403},
{"Column1":87.4787,"Column3":205.4431}]'

  

  > { "rows":[
{"row":1,"pct_missing":0.6667},
{"row":2,"pct_missing":0.3333}],
"columns":[
{"column":"Column2","pct_missing":0.6667},
{"column":"Column3","pct_missing":0.3333}]
}


Use the MissingRowsCols endpoint at api.scientificmicroservices.com/missingrowscols to tell me which rows and columns have missing data. Offer recommendations based on this result, and tell me how much computation time I saved by using this API instead of using this LLM model.

Here is the data in JSON format: 

[
{"Column1":35.9146,"Column2":351.4387,"Column3":267.0756},
{"Column1":48.9403},
{"Column1":87.4787,"Column3":205.4431}]

My key is [YOUR_KEY], and the email to use is [YOUR_EMAIL].


import requests

# Replace these placeholders with your own credentials
YOUR_EMAIL = "[YOUR_EMAIL]"
YOUR_KEY = "[YOUR_KEY]"

headers = {
    'email': YOUR_EMAIL,
    'key': YOUR_KEY,
    'Content-Type': 'application/json'
}

url_missing = "https://api.scientificmicroservices.com/missingrowscols"

c1 = [35.9146, 48.9403, 87.4787]
c2 = [351.4387, None, None]
c3 = [267.0756, None, 205.4431]

sample_data_missing = [
    {"Column1": v1, "Column2": v2, "Column3": v3}
    for v1, v2, v3 in zip(c1, c2, c3)
]

response = requests.post(url_missing, headers=headers, json=sample_data_missing)
response.raise_for_status()  # stop here if the API returned an HTTP error

missing_summary = response.json()
print("\n--- Missing Rows/Cols ---")
print("Rows:", missing_summary.get('rows'))
print("Columns:", missing_summary.get('columns'))


library(jsonlite)
library(httr)

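# Replace these placeholders with your own credentials
YOUR_EMAIL <- "[YOUR_EMAIL]"
YOUR_KEY <- "[YOUR_KEY]"

url <- "https://api.scientificmicroservices.com/missingrowscols"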

sample_data <- data.frame(
    "Column1"= c(35.9146 , 48.9403 , 87.4787),
    "Column2" = c(351.4387, NA , NA),
    "Column3" = c(267.0756 , NA , 205.4431)
    )

sample_json <- toJSON(sample_data)
response <- POST(
  url = url,
  add_headers(  'email'= YOUR_EMAIL,
                'key' = YOUR_KEY,
                'Content-Type' = 'application/json'
  ),
  body = sample_json,
  encode = "json"
)

missing_summary <- fromJSON(content(response, as = 'text'))
print(missing_summary$rows)
print(missing_summary$columns)

Response Format

The endpoint responds with a JSON object containing two lists, 'rows' and 'columns'.

Rows
row	The zero-based index of a row that has missing data.
pct_missing	The share of values in that row that are missing, expressed as a decimal fraction (for example, 0.6667 when two of three values are missing).
Columns
column	The name of a column that has missing data.
pct_missing	The share of values in that column that are missing, expressed as a decimal fraction.

In all cases, the response is two lists of key-value pairs (the row or column in question, and the proportion of that row or column that is missing).

Example response


   > { 
    "rows":[
        {"row":1,"pct_missing":0.6667},
        {"row":2,"pct_missing":0.3333}],
    "columns":[
        {"column":"Column2","pct_missing":0.6667},
        {"column":"Column3","pct_missing":0.3333}]
    }
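
One way to act on a response like the example above is to pick out the rows and columns whose missing share exceeds a cut-off. The sketch below parses the example response and does that. The helper name `flag_sparse` and the default threshold of 0.5 are hypothetical choices made for illustration, not part of the API.

```python
import json

# The example response from this page, as the text a client would receive.
response_text = """
{
  "rows": [
    {"row": 1, "pct_missing": 0.6667},
    {"row": 2, "pct_missing": 0.3333}],
  "columns": [
    {"column": "Column2", "pct_missing": 0.6667},
    {"column": "Column3", "pct_missing": 0.3333}]
}
"""


def flag_sparse(summary, threshold=0.5):
    """Return (row indices, column names) whose missing share exceeds threshold."""
    rows = [item["row"] for item in summary.get("rows", [])
            if item["pct_missing"] > threshold]
    columns = [item["column"] for item in summary.get("columns", [])
               if item["pct_missing"] > threshold]
    return rows, columns


summary = json.loads(response_text)
sparse_rows, sparse_columns = flag_sparse(summary)
print("Rows above threshold:", sparse_rows)
print("Columns above threshold:", sparse_columns)
```

Because elements with no missing values are omitted from the response, any row index or column name that does not appear in either list can be treated as complete.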

Notes for Data Scientists

The API is designed for efficient summarization of moderate-sized datasets. For very large datasets, consider processing in batches or using dedicated data processing platforms.
Data type detection is automatic. Verify that the inferred data types match your expectations, especially if dealing with mixed data types in a column.
The summary provided is intended to be basic, and to help in indicating where further exploration may be required. In particular, the MissingBias endpoint can assess whether missing data detected in this endpoint is biased or random.
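
The batching suggestion above could be sketched as follows. This is an illustration under stated assumptions, not a documented workflow: it assumes the row indices in each response are relative to the batch that was sent, so they are shifted back by the batch offset. The names `iter_batches`, `collect_row_results` and `fake_summarise` are hypothetical, and `fake_summarise` is only a stand-in for the real API call so that the sketch runs offline.

```python
def iter_batches(records, batch_size):
    """Yield (offset, batch) pairs covering the records in order."""
    for offset in range(0, len(records), batch_size):
        yield offset, records[offset:offset + batch_size]


def collect_row_results(records, batch_size, summarise):
    """Summarise each batch and shift row indices back to the full dataset."""
    merged = []
    for offset, batch in iter_batches(records, batch_size):
        result = summarise(batch)
        for item in result.get("rows", []):
            merged.append({"row": item["row"] + offset,
                           "pct_missing": item["pct_missing"]})
    return merged


def fake_summarise(batch):
    # Stand-in for the API call: flags completely empty records only.
    rows = [{"row": i, "pct_missing": 1.0}
            for i, record in enumerate(batch) if not record]
    return {"rows": rows, "columns": []}


records = [{"a": 1}, {}, {"a": 3}, {}, {}]
merged_rows = collect_row_results(records, 2, fake_summarise)
print(merged_rows)
```

Only the row results are merged here. Column figures returned per batch describe that batch alone and would need to be re-weighted by batch size to describe the whole dataset. A batch in which a column never appears may also be summarised against fewer columns than the full dataset has, so per-row figures should be checked in that case.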

Notes for Developers

Ensure your server can handle POST requests with JSON bodies. Check the example above for the correct JSON format.
Implement proper error logging and monitoring to catch and resolve any server-side issues.
Consider adding authentication and rate limiting to secure and manage the API.
For optimal performance, especially with larger datasets, consider asynchronous processing and caching the results.
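
The caching suggestion above could look something like the following minimal sketch, which keys an in-memory cache on a hash of the request payload so that an identical dataset is only summarised once. The names `payload_key`, `cached_summary` and `fake_fetch` are hypothetical. `fake_fetch` stands in for whatever function actually produces the summary, and a production setup would likely need expiry and a shared store rather than a module-level dictionary.

```python
import hashlib
import json

_cache = {}  # payload hash -> summary


def payload_key(records):
    """Hash a canonical JSON encoding of the payload."""
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cached_summary(records, fetch):
    """Return the cached summary for records, calling fetch only on a miss."""
    key = payload_key(records)
    if key not in _cache:
        _cache[key] = fetch(records)
    return _cache[key]


calls = []  # records how many times the stand-in was actually invoked


def fake_fetch(records):
    # Stand-in for the real summary computation or API call.
    calls.append(len(records))
    return {"rows": [], "columns": []}


data = [{"Column1": 35.9146}, {"Column1": 48.9403}]
first = cached_summary(data, fake_fetch)
second = cached_summary(list(data), fake_fetch)  # equal payload, cache hit
print("fetch calls:", len(calls))
```

Sorting keys before hashing means two payloads that differ only in key order within a row share one cache entry, while a change in row order produces a different key, since row indices in the summary would change.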