Stop Being a “Janitorial” Data Scientist

My coworker and I recently gave a talk at the Data Intelligence Conference hosted by Capital One. The talk explained the complications of working with patient clinical data and how data scientists often play the role of “janitor,” regularly massaging their data sets with type assertions and value checks. Not only is this work cumbersome, it is also repetitive, which leads to frustrated data scientists.

Allow me to begin by saying that consuming data from electronic medical records (EMRs) is not a trivial task. Oftentimes, the consumer of said data is challenged with verifying its validity. For example, the temperature of a patient is assumed to be a numeric type; one would expect an acceptable value to be something like 98.6 degrees Fahrenheit. Unfortunately, this is not always the case. Imagine receiving a value such as “Unable to take temperature, the patient just ate ice.” What is the data scientist supposed to do with this value? How is this even considered a legitimate temperature? There are a lot of unknowns.

So, again, the question remains: what is the data scientist supposed to do with this value? Sure, one can try to cast it to a numeric type such as an integer or a float, but they would be faced with an exception like the following:
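(A minimal reproduction in Python; the variable name is illustrative.)

temperature = "Unable to take temperature, the patient just ate ice."
float(temperature)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: could not convert string to float: 'Unable to take temperature, the patient just ate ice.'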

Not to mention, continuously performing type assertions and type casting is not only “janitorial” work; these operations are also computationally expensive. As the size of the data set grows, which it almost always does, the time spent on said operations inflates the runtime of the program considerably. Furthermore, is the data scientist even aware of the kind of data they are working with? This is a common problem amongst data scientists and the data sets they consume, especially in the medical domain.
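To make the problem concrete, here is a sketch of the kind of defensive casting that tends to pile up in model code; the record structure and field name are hypothetical:

records = [
    {"temperature": 98.6},
    {"temperature": "Unable to take temperature, the patient just ate ice."},
]

def clean_temperature(record):
    """Hypothetical "janitorial" helper: coerce a raw EMR value to a float, or drop it."""
    try:
        return float(record.get("temperature"))
    except (TypeError, ValueError):
        # Free-text values such as "Unable to take temperature..." end up here.
        return None

temperatures = [t for t in (clean_temperature(r) for r in records) if t is not None]
print(temperatures)  # [98.6]

This pattern has to be repeated for every field, in every pipeline, for every data set.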

Start with the source

Where is the data coming from? Typically, data is fetched from upstream resources. These resources can include web services, enterprise data warehouses, or even streaming telemetry data. It would certainly be helpful to have a way to validate upstream data before it even reaches the pipeline, right? How about the idea of validating an HTTP response body or a database cursor? Are any of these ideas even possible? Absolutely. And I would encourage anyone facing these challenges (not just data scientists, but data engineers as well) to read on.

Let us start with a simple example. Consider the following JSON data:

{
  "id": 1,
  "name": "A green door",
  "price": 12.50,
  "tags": ["home", "green"]
}

Simple? Yes. However, there are some uncertainties with this data. What is “id”? Is “name” required? Can “price” be zero? Are all “tags” strings? Unless provided with some sort of documentation that explains the above example, it is difficult to understand what the data represents and whether or not it is considered “valid.” Fortunately, there is a specification designed for describing JSON formatted data.

Enter JSON Schema

JSON Schema is a vocabulary that allows you to annotate and validate JSON documents.

Now, look at the following JSON document. You should begin to see similarities to the previous example.
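A schema along these lines describes the instance above (the “title” and “description” annotations are illustrative):

{
  "$schema": "http://json-schema.org/schema#",
  "title": "Product",
  "type": "object",
  "properties": {
    "id": {
      "description": "The unique identifier for a product",
      "type": "integer"
    },
    "name": {
      "description": "Name of the product",
      "type": "string"
    },
    "price": {
      "type": "number",
      "minimum": 0
    },
    "tags": {
      "type": "array",
      "items": {
        "type": "string"
      }
    }
  },
  "required": ["id", "name", "price"]
}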

Another way to think of JSON Schema is as a way of “describing” data. It helps create a strong object model when working with data from upstream sources. The specification tells the user that the JSON instance is of type object with four properties, three of which are required (“id,” “name,” and “price”). It also requires that “price” has a minimum value of 0. These are examples of validation keywords, which are available for seven primitive types: object, array, boolean, null, integer, number, and string. Therefore, the previous JSON instance is considered to be valid. However, the instance defined below is not.

{
  "name": "A green door",
  "price": -15.50,
  "tags": []
}

Note: Before you continue reading, keep in mind that JSON Schema does not actually validate client-submitted data on its own. It is important to realize that JSON is not a programming language but rather a data-interchange format; JSON Schema annotates and defines validation constraints using the JSON format. The actual validation is performed by a separate validator implementation.
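For example, a validator such as the Python jsonschema package can enforce the constraints. A minimal sketch, restating the constraints from the schema above and checking the invalid instance:

from jsonschema import validate, ValidationError  # a third-party validator implementation

schema = {
    "type": "object",
    "properties": {
        "id": {"type": "integer"},
        "name": {"type": "string"},
        "price": {"type": "number", "minimum": 0},
        "tags": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["id", "name", "price"],
}

instance = {"name": "A green door", "price": -15.50, "tags": []}

try:
    validate(instance=instance, schema=schema)
except ValidationError as error:
    # The instance is invalid: "id" is missing and "price" is below the minimum of 0.
    print(error.message)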

Another question arises: Can JSON Schema describe a web service like an HTTP API? Unfortunately, JSON Schema itself cannot. However, the team at SmartBear has created the Swagger specification, which uses JSON Schema.

Swagger™ is a project used to describe and document RESTful APIs.

For the sake of brevity, I will not discuss all of the components of a Swagger specification (the documentation does an excellent job of that), but I will address a few significant parts.

Take, for example, the following Swagger specification. Note: This was the specification used for the talk and can be found here. It also uses version 3 of the Swagger specification, not version 2.
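The full document is not reproduced here; a sketch of its paths section, consistent with the behavior described below, might look like the following (the info block and the description string are illustrative):

{
  "openapi": "3.0.0",
  "info": {
    "title": "Demographics API",
    "version": "1.0.0"
  },
  "paths": {
    "/v1/demographics": {
      "get": {
        "responses": {
          "200": {
            "description": "An array of Demographic objects",
            "content": {
              "application/json": {
                "schema": {
                  "type": "array",
                  "items": {
                    "$ref": "#/components/schemas/Demographic"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}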

The paths object defines the available paths and operations for the API.
For brevity, the rest of the Swagger specification is omitted

The example above can be read like so, “Any request issued to /v1/demographics that yields a 200 status code is expected to return an array of Demographic objects in a JSON formatted response.” This is the typical behavior of a RESTful API.

Another important section is the components object.
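A sketch of that section, showing only the admit_source property discussed below (the actual Demographic schema from the talk defines more properties):

{
  "components": {
    "schemas": {
      "Demographic": {
        "type": "object",
        "properties": {
          "admit_source": {
            "type": "string",
            "enum": ["EMERGENCY", "ROUTINE", "TRANSFER"]
          }
        }
      }
    }
  }
}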

The components object holds a set of reusable objects for different aspects of the Swagger specification.
For brevity, the rest of the Swagger specification is omitted

In the /v1/demographics endpoint, each element in the array is a pointer to a Demographic object, which is, in fact, a JSON schema. Using references promotes reusability and reduces the amount of code that is written. It is also worth mentioning that the Demographic object has some validation constraints for individual properties such as admit_source. This means that if a value from an upstream source does not match any of the elements defined in the enum keyword (“EMERGENCY,” “ROUTINE,” “TRANSFER”), then the instance is considered to be invalid. To reiterate, the Swagger specification itself does not perform the validation of the data but rather defines the API.

So, what does this look like in code?
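The gist from the talk is not reproduced here; a minimal sketch of the same idea, assuming the specification is saved locally as swagger.json, that the service lives at a placeholder URL, and that the requests and jsonschema packages stand in for the HTTP client and validator, might look like this:

import json
import time

import requests
from jsonschema import RefResolver, ValidationError, validate

# Illustrative assumptions: the OpenAPI document lives in swagger.json and the
# service is reachable at this placeholder URL.
with open("swagger.json") as specification:
    swagger = json.load(specification)

# The response schema uses $ref pointers into #/components/schemas, so the
# resolver is anchored at the full specification document.
resolver = RefResolver(base_uri="", referrer=swagger)

URL = "http://localhost:8080/v1/demographics"

while True:  # run until an interrupt signal (Ctrl+C) is received
    response = requests.get(URL)
    # Look up the schema the specification declares for this endpoint and for
    # the status code that was actually returned.
    responses = swagger["paths"]["/v1/demographics"]["get"]["responses"]
    schema = responses[str(response.status_code)]["content"]["application/json"]["schema"]
    instance = response.json()
    try:
        validate(instance=instance, schema=schema, resolver=resolver)
    except ValidationError as error:
        print("Invalid upstream data:", error.message)  # error handling goes here
    else:
        pass  # hand the validated instance off to the pipeline
    time.sleep(5)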

While this Python code is elementary (and incomplete), it is a good starting point. The idea is to leverage the Swagger specification as a means of annotating and validating data from upstream sources. A flowchart is provided to diagram the workflow.

Python sample code workflow

The sample code does several things. It continues to issue HTTP requests to a URL indefinitely, or until it receives an interrupt signal. From there, it uses the swagger object (the loaded specification) to index the /v1/demographics endpoint and retrieve the available responses for said endpoint (in this case, the only response is 200 OK). It then indexes the responses object using the returned HTTP response status code and retrieves a schema object. The schema object is then used to validate the instance. If the instance is not valid, an exception is raised and error handling work is performed; otherwise, the instance is passed to the pipeline.

What benefit does this have?

That is a great question. For starters, it allows data scientists to define validation constraints for data returned from upstream sources. Second, it largely eliminates the need to perform type assertions and casts in the model code. This is incredibly important, especially in dynamically typed languages like Python. Doing these sorts of operations does not save the data scientist any time, eventually becomes computationally expensive and wasteful, and clutters the model code with the “janitorial” work that most data scientists try to avoid. Finally, the Swagger specification is not only self-documenting, but it can also easily be updated if the API changes. Specifications can be shared with other collaborators and do not leave data scientists scratching their heads in search of API documentation.

Closing Remarks

As a final note, I fully acknowledge that this solution may not be suitable for all workflows. And yes, this article focuses on data being consumed from web services. However, the key takeaway of this post is to define a process for catching “bad” data before it ever makes it to the model creation phase of the pipeline.

With that being said, I hope this article has been useful, and I would love to see work developed using Swagger or JSON Schema in the data science domain. Feel free to post your thoughts, concerns, or opinions in the comments. I am always open to feedback and interested in hearing what others have to say.

Update: I have written a Python module called aptos for validating JSON documents using JSON Schema. Contributions are welcome.