Most organisations are risk-averse. As a result, most corporate professions follow documented, transparent procedures, so that when something goes wrong it is easy to find out what happened. Lawyers, accountants, doctors, logistics managers - all have standardised, routine documentation that is produced as part of the job. Accounts, contracts, shipping notices, and invoices provide a comforting paper trail.
For data science, however, this is not the case.
Most data science risk mitigation strategies are idiosyncratic and improvisatory - an afterthought in the name of “getting things done”. In the industry, we call this “agile” and “flexible”, because we see it as a positive. But improvisation can also mean a lack of clear documentation about what exactly was done, and why. The improvisatory nature of data science is a problem for impact, because it makes it hard for non-specialists to trust the answers.
In short - if your data science program doesn’t follow a standard procedure, or produce routine documentation, you have no way of knowing whether its results are the product of high-quality work or just random keyboard mashing.
Your data scientists can produce any number of products within the bounds of their skill set - it is the way they produce them that makes them scientists.
What elevates data science above technobabble is the “science” part - the scientific method, which involves patient and deliberate documentation of hypotheses, methods, and results. If your methods, assumptions, and results are documented properly, and stored somewhere that other data scientists can access them, then it’s possible for other (data) scientists to build on your breakthroughs, to catch your mistakes, and to iterate on your improvements.
In data science - as in all science - there is a series of routine steps that can be documented. Hypothesis, data, analysis technique selection, data checks - these steps are often skipped by data scientists trying to get an answer out quickly. It’s only when an answer seems wrong that we go back and check the data and the analysis technique.
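To make that concrete, here is a minimal sketch of what one entry in an analysis log might look like in Python. The structure, field names, and example values are ours, invented for illustration - they are not a formal standard:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AnalysisRecord:
    """One entry in an analysis log: the routine steps, written down."""
    hypothesis: str         # what we expected to find, stated before looking
    data_source: str        # where the data came from
    data_checks: list[str]  # checks run before the analysis
    technique: str          # the analysis technique selected
    result: str             # what we actually found
    run_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

# Illustrative values only - every figure below is made up.
record = AnalysisRecord(
    hypothesis="Churn is higher for customers on the legacy plan",
    data_source="warehouse.customers snapshot, 2024-01-01",
    data_checks=["no duplicate customer IDs", "plan field has no nulls"],
    technique="two-proportion z-test",
    result="legacy churn 14.2% vs 9.8%, p < 0.01",
)
```

Stored next to the analysis code, a log of records like this gives a reviewer the same kind of paper trail an accountant’s books provide.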
If we, as data scientists, want to be taken seriously in a commercial context, then we need to produce the same kind of routine documentation as other professions. Every time we do an analysis, we need to have a set of standardised information - like you would see in an accountant’s books or a lawyer’s contracts - recorded somewhere for people to refer to later.
What exactly should be reported in data science documentation? It should be possible to establish a professional standard that’s consistent across organisations. At a minimum, your documentation should report enough information for a non-specialist to understand broadly what was done and why, and enough detail that another data scientist could replicate your analyses with the same data and get exactly the same result.
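What does “exactly the same result” demand in practice? At minimum, a fingerprint of the exact data, the random seed, and the environment the analysis ran in. Here is one possible sketch, assuming a Python workflow - `reproducibility_stamp` and the file name are hypothetical, not part of any library:

```python
import hashlib
import platform
import sys

def reproducibility_stamp(data_path: str, seed: int) -> dict:
    """Record what another data scientist needs in order to rerun the
    analysis on the same data and get exactly the same result."""
    with open(data_path, "rb") as f:
        data_hash = hashlib.sha256(f.read()).hexdigest()
    return {
        "data_sha256": data_hash,                  # fingerprint of the exact input file
        "random_seed": seed,                       # the seed the analysis ran with
        "python_version": sys.version.split()[0],  # interpreter version
        "platform": platform.platform(),           # operating system and architecture
    }

print(reproducibility_stamp("customers.csv", seed=42))  # hypothetical file
```

Attach that dictionary to every reported result and the replication half of the standard is close to automatic.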
If you’re interested in concrete advice, there are reporting standards for academic disciplines, such as psychology, that will cover most bases. As an alternative, there are automated products, such as our Laboratory, that produce standard statistical reports on datasets. Using products or technical standards like this, you can start communicating your method more clearly without confusing your stakeholders.
Correct reporting on data science is hard work. It requires time, attention, and patience, all of which are in short supply in a corporate context where a new question is generated every minute. Knowing what to check, what to report, and how to write it up is a skill that academic researchers spend decades perfecting. Like any skill, however, scientific reporting can be learned, especially if it’s done routinely.
Producing routine documentation for data science programs doesn’t have to be difficult. Most data scientists should have the skills to automate a procedure that performs basic checks and reports the necessary information in a standard format. If not, there are tools available online that can produce standardised reports about your data and analysis.
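As a rough sketch of how little code that automation needs - we’re assuming a pandas workflow here, and the checks and report format are illustrative choices of ours, not a standard:

```python
import pandas as pd

def basic_checks_report(df: pd.DataFrame) -> str:
    """Run routine data checks and return a report in a standard format."""
    lines = [
        f"rows: {len(df)}",
        f"columns: {len(df.columns)}",
        f"duplicate rows: {int(df.duplicated().sum())}",
        "missing values per column:",
    ]
    for col, n_missing in df.isna().sum().items():
        lines.append(f"  {col} ({df[col].dtype}): {int(n_missing)} missing")
    return "\n".join(lines)

df = pd.read_csv("customers.csv")  # hypothetical dataset
print(basic_checks_report(df))
```

Run on every dataset before analysis, a report like this becomes the routine documentation this piece argues for.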
If you are interested in AI tools for improving confidence in your data, check out our Laboratory - a beta tool we’re working on to help the community improve its data health and data confidence.
Jac Davis and Jegar Pitchforth are data science consultants with a combined 15 years’ experience in data science. We’ll be following up with a series of investigations into data anxiety and data leadership. If you’re interested, follow along to see posts on: strategies that excellent teams use to make data decisions, why performance reviews aren’t fixing your data problem, risk factors for a data-anxious organisation, and treatments for data anxiety.