From Jupyter Notebooks to Production Data Pipelines: Our Framework for Delivering Data Projects

Marco Gagliano
AlayaLabs

--

At AlayaLabs, we deliver insightful dashboards and analytics to AlayaCare’s clients. Our data scientists are the ones who work with the clients to develop the machinery that extracts these valuable insights. Our data projects may include machine learning, graph search with specialized heuristics, or complex statistical analysis. When it comes time to bundle and deliver these projects, production-grade infrastructure is required to handle everything from data warehousing to compute resources and job scheduling to visualizing findings. I am a data engineer, and as such it is my job to provide not only this infrastructure, but also the tools, frameworks, and services that let data scientists make the best use of it. This article presents the main elements of the framework that lets our data scientists focus more on their models than on how they run in production.

The infrastructure

Data warehousing via RDS, S3, and Snowflake

Because our data sources are quite diverse, we have multiple solutions in place for making data available. To give data scientists the freedom to find any application data they need, our warehouse includes a layer of read-only replicas. We mostly get these replicas via AWS RDS, but for some databases it is easier to roll out lighter-weight, in-house replication mechanisms that load a copy of the tables into Snowflake. We love Snowflake because it is dead simple to manage while remaining performant and packed with features. Since an instance can be deployed on our choice of cloud provider, it integrates perfectly with the rest of our AWS infrastructure.

Navigating the application databases requires making sense of often cryptic table names and relationships. Therefore, we provide data scientists with a layer of denormalized business-entity tables, which also end up in Snowflake. This higher level of data modelling decreases the time it takes to get to an interesting dashboard, but requires regularly scheduled workflows to run the SQL joins and insert the resulting rows into Snowflake.
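A regularly scheduled task along these lines could build one of those business-entity tables. The sketch below is only illustrative: the table, column, and connection names are made up, and in practice the query would run from a scheduled workflow rather than a standalone script.

import snowflake.connector

# Hypothetical denormalization step: join raw application tables into a
# business-entity table. Table, column, and connection names are illustrative.
DENORMALIZE_VISITS = """
    INSERT INTO ANALYTICS.CLIENT_VISITS
    SELECT v.id, v.start_time, v.end_time,
           c.name AS client_name, e.name AS employee_name
    FROM RAW.VISITS v
    JOIN RAW.CLIENTS c ON c.id = v.client_id
    JOIN RAW.EMPLOYEES e ON e.id = v.employee_id
"""

conn = snowflake.connector.connect(
    user="etl_user", password="...", account="example_account",
    warehouse="ETL_WH", database="WAREHOUSE",
)
try:
    conn.cursor().execute(DENORMALIZE_VISITS)
finally:
    conn.close()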

Data flows — Putting pieces together with Airflow

At the beginning of a project, the data pipeline that a data scientist creates can be quite manageable. By running code on their computer and storing transformed or aggregated data on its file system, they can demonstrate the potential of the project. The problem comes when it is time to scale to larger data sets or to automate the pipeline so it runs continuously for a client. Translating an exploratory Jupyter Notebook into an automated pipeline, for instance, can be tricky.

We use Apache Airflow for the scheduling, execution, and monitoring of tasks. The data scientist now only has to adapt their code to run on Airflow for it to be executed automatically on a specified schedule. Airflow is our keystone for the bridge between development and deployment of data science projects. It is flexible, can be extended, and lets data scientists focus on the most interesting and valuable parts of their work.

Delivering Impactful Insights with Looker

A powerful aspect of a data science project is the visualization of the findings. This is where all the hard work pays off, and the data scientist can finally tell their story. Many tools exist for plotting data within a Jupyter Notebook, but this approach lacks many aspects of a production-grade deliverable: the image files produced are cumbersome to share, static, and don’t always make for striking visuals. While it is possible to produce dynamic, embeddable visual content with Python, we chose Looker as our visualization tool. Looker is targeted at technical and non-technical people alike by exposing multiple ways of interacting with data. We can let teams explore data freely via a graphical interface, or we can share dashboards highlighting a story or valuable insights. Looker also offers multiple embedding options, allowing us to make dashboards available to our clients directly via the AlayaCare home care application.

The Framework: Where are we going with this?

With a stable, fault-tolerant infrastructure in place, data engineers are now collaborating with data scientists to design a framework that bridges the gap between development and deployment of AlayaLabs data projects.

As a start, we are creating a protocol. For instance, we standardized what staged data should look like: any records produced between tasks or at the end of a workflow are written in batches as compressed Newline Delimited JSON documents. The stages of a natural language processing pipeline, for example, might look like this:

  • A copy of database rows including the targeted text fields. This data results from queries to the replica databases.
  • A cleaned version of that data that has gone through spellchecking or categorization, for example.
  • Results of applying a trained model to cleaned data.

Staged data also acts as our data lake, where all the raw data and transformed data can be stored and accessed later.
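In code, staging a batch might look something like the following sketch. It is only an illustration: the helper function, bucket, and key are invented for this example, with boto3, gzip, and orjson doing the heavy lifting.

import gzip
import boto3
import orjson

def write_batch(records, bucket, key):
    # Serialize a batch of dicts as newline-delimited JSON, gzip it, and stage it on S3.
    body = b"\n".join(orjson.dumps(record) for record in records)
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=gzip.compress(body))

# Bucket and key are placeholders; in practice the key encodes the pipeline stage
# and the execution interval (see “Why batches?” below).
write_batch(
    [{"id": 1, "note": "Client prefers morning visits"}],
    "example-staging-bucket",
    "nlp/cleaned/2021-01-01T00:00:00/part-0000.ndjson.gz",
)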

Because we know what data looks like in our data lake, we can provide helpful Python packages around it. We created a set of Airflow extensions that adapt these packages into operators, and a data scientist can just as easily use them in their notebooks or scripts. The following is roughly what an Airflow DAG might end up looking like:
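Since only the structure matters here, this sketch uses stock PythonOperators, with made-up task names and empty callables standing in for our in-house operators.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract_text_fields(**context):
    # Query the replica database and stage the raw rows as an ndjson batch.
    pass

def clean_text(**context):
    # Spellcheck / categorize the staged text and write a cleaned batch.
    pass

def apply_model(**context):
    # Run the trained model over the cleaned batch and stage the results.
    pass

with DAG(
    dag_id="nlp_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_text_fields)
    clean = PythonOperator(task_id="clean", python_callable=clean_text)
    score = PythonOperator(task_id="score", python_callable=apply_model)

    extract >> clean >> score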

Why batches?

Using batches for staging makes it possible to get many records with one call. For instance, a batch of 30 records on S3 requires one HTTP request rather than 30. Another important aspect is that batches fit how Airflow likes to do things, which is to operate atomically on specific intervals of time. By encoding the interval’s execution time into the names of the batches, we make it really easy for tasks to find the data they need. An article written by the creator of Airflow offers a much better description of why batched data plays well with Airflow. A third advantage is that it makes it easier to ensure an optimal size for the files we are asking Snowflake, an Airflow worker, or a Jupyter notebook to load. We cap the uncompressed size of batches at 500MB so nobody ever requests more data than they can chew on at once.
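As a rough sketch of what that naming can look like (the exact key layout here is invented for illustration, not our actual convention):

from datetime import datetime

# Derive the S3 key for a staged batch from the Airflow execution date
# so that downstream tasks can find the interval’s data.
def batch_key(stage, execution_date, part=0):
    return f"nlp/{stage}/{execution_date.isoformat()}/part-{part:04d}.ndjson.gz"

batch_key("cleaned", datetime(2021, 1, 1))
# -> 'nlp/cleaned/2021-01-01T00:00:00/part-0000.ndjson.gz'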

Why ndjson?

Newline Delimited JSON is supported by Snowflake. In fact, when specifying the JSON file format type, Snowflake expects ndjson by default. It is also supported out of the box by Pandas, another tool commonly used by our data scientists:

pd.read_json(ndjson_bytes_stream, orient="records", lines=True)

JSON is also a much better defined standard than CSV, and comes with better tooling. JSON serialization is not always very performant, though, so we found that switching to orjson provides significant benefits.
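Putting the pieces together, reading a staged batch back into a DataFrame can be as simple as the sketch below; the bucket and key are placeholders matching the earlier examples.

import gzip
import io

import boto3
import pandas as pd

# Fetch the gzipped ndjson object from S3 and load it straight into a DataFrame.
obj = boto3.client("s3").get_object(
    Bucket="example-staging-bucket",
    Key="nlp/cleaned/2021-01-01T00:00:00/part-0000.ndjson.gz",
)
ndjson_bytes_stream = io.BytesIO(gzip.decompress(obj["Body"].read()))
df = pd.read_json(ndjson_bytes_stream, orient="records", lines=True)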

As a data engineer at AlayaLabs, my job is to make sure our data scientists can spend most of their time developing models and gaining insight, and less of it figuring out fault tolerance, job scheduling, infrastructure, and other aspects of software delivery. Our cloud native stack, with Airflow at its core, helps make this possible. This post gave a high-level view of our framework for developing data projects; subsequent posts will likely describe individual aspects in greater detail. Our infrastructure, software, and tools are constantly evolving, and going forward our goal is to continue closing the gap between development and delivery until anyone with Python knowledge can leverage the power of our tech stack.
