Creating Business Data Tools With Jupyter
When it comes to bulk inserting data into your database, there’s nothing worse than getting bad data. Poor formatting, mismatched IDs and other mistakes in the flat file will lead to errors that could cause the import to fail or worse, insert data you never wanted in the first place. As part of the Data Ops team at Namely, this is a challenge I faced every day. One of my first programming challenges when I joined the team was to find a way to ensure that client data about to be inserted was as clean as possible. With a lot of help from my team, we built tools that can transform messy external files into easily importable data in seconds.
In this article, I will show you how to build a sample of a Python CSV file validation tool we use at Namely. This lightweight application will be able to check each row of a CSV file and see if they adhere to your business logic such as making sure ID numbers are within a certain range or if the length of a string is too long for your database field.
*Please note that these instructions assume that you have some Python experience.
What Is Jupyter Notebook?
Jupyter Notebook is a powerful web application that can be described as an HTML wrapper around a terminal or command prompt. It was built with data science in mind, using notebooks as a way for data scientists to share their code, their ideas and to visualize their results. As such, it supports major data science languages such as Python and R. Our Data Ops team operates primarily in Python, which is well-suited for data processing due to a wealth of specialized libraries such as numpy and pandas. Another benefit of Jupyter is it allows us to build simple, yet effective GUIs using widgets. This allows us to deploy some of our notebooks to colleagues without requiring extensive programming knowledge to understand the input.
Setting Up Your First Jupyter Project
Since our notebooks serve as the interface for users via widgets, we try to keep them separated from the Python files to maintain a clean divide between front-end and back-end for ease of testing. We built the validator tool to be extensible for future use so our example folder will copy our repo’s folder structure, which looks something like this:
+ -- Notebook Folder
| + File Validator Notebook
+ -- File Validator App Folder
+ File Type 1 Folder
The workflow consists of the user loading a .csv file via the notebook, which is transformed into a pandas dataframe, passed into the
app.py, which then transforms and routes the data to the appropriate file type for processing. Finally, the results are returned by the
app.py to the notebook/user.
Now that the file structure is established, we can start building.
First, let’s install the required Python modules for this example. You can do this globally or in a virtual environment. You will need:
- Jupyter Notebook
- Pandas — Python data processing library
- Pyrsistent — Python functional data structure library
Pandas is the most popular data processing library in Python. It borrows the dataframe concept from R (think Excel but programmatic and fast) and allows you to perform relatively large data operations such as calculating standard deviations, pivoting data or joining multiple dataframes on an index. Because it easily consumes .csv files through the
read_csv() function, this was an easy choice for this application.
We chose Pyrsistent over other Python validation libraries like Cerberus because it was syntactically simpler to understand, allowed for other tasks like transformations in one function and has excellent error catching.
Creating Your First File Validation
With your environment setup, we can now write the file validation! Open up your favorite IDE and let’s write the sample validator for numbers less than 10 and the length of a string that cannot exceed 20 characters.
The field function is one of the best features of Pyrsistent. It’s an almost one line transformer and validator. From the example above, here’s what some of the arguments do:
- mandatory — This variable must be present if True. If not present, the PRecord returns an error.
- factory — Attempts to transform the data being assigned to a variable to whatever type or custom transform function you write. For example, if you wanted a factory function to transform a string to capitalize the first letter, you would write
- invariant — This is essentially where you would place the business logic of whatever you are trying to validate. It requires a function (in this case a lambda) that returns a tuple with a boolean if the invariant held or not
(is x > 10?)and a message if the invariant failed (‘Number exceeds range’). We’ve used list compares, length ranges and other similar methods for this argument.
Creating The Main Backend Application
With the file type in hand, the back-end app is relatively simple to write (and you only have to write it once!) Since we know we are consuming a pandas dataframe from the notebook, the app needs to transform this one more time into a dictionary that can be used by our new Validate class. Here’s our app should look like (you can also install tqdm if you want a cool status bar):
Building Your Jupyter Notebook GUI
The backend is now complete. Now we just need a way for a user to interface with our brand new app.
- In your terminal, navigate to the project folder and start the notebook service by typing
$ jupyter notebook
- Your default browser should pop up with the project folder like this!
*Note that you cannot go beyond the “home folder” where you started the service. If you need to navigate to another folder, simply kill the service
(Ctrl + C on Mac), move your terminal to the new base folder and start it again.
- With the notebook service active, click into the Notebook folder and start a new notebook by clicking on New → Python.
- You will be greeted by with the blank notebook GUI and are now ready to write a simple front-end for our app with some nice widgets. These widgets will allow users to select the files and convert them into dataframe objects that our app can consume. The code looks like this:
- Once you’ve got all that code in the block, hit
Shift + Enterto execute your code. The output of your notebook’s code block should now look like this:
- Now you just have to pass a path to a test file. Here’s what the cells would look like:
- Run that file path through and your notebook will look like this:
- Now go check the folder where your file is and you should have the output file with your
testfilename_validated.csvand it should now have this awesome status and error message column:
That’s it! You’ve successfully created a CSV validator! Of course, this is pretty basic but you can easily modify the
type.py to accommodate any headers/data to fit your needs.
The validator you just created is one of many possible tools you can easily create and distribute via Jupyter. By creating quick tools that can be easily distributed, you can make data tasks more efficient for your organization with reduced development time until either a proper application is developed for long-term needs or until the tool has fulfilled its function and is no longer required.