Production Code for Data Science and Our Experience with Kedro
Debunking the myth: “Data Scientists can’t write production-ready code”
I think most of us have heard something along the lines of “Data Scientists can’t write production-ready code” or worse, that they throw bad code over the fence for software engineers to fix and optimize! Whilst I can’t deny that these murmurings are partially correct, we can’t generalize these issues to the vast task space of data science. In this post, I would like to discuss the issue of production-level code for data science teams, drawing on our own experience at Beamery.
The term data science has been used loosely to capture many domains all working with some aspects of data. I’d like to separate these domains into 3 distinct groups:
- Data engineering: building reliable and scalable data pipelines
- Data analytics: business analytics and decision making support
- Data science: mainly statistics and machine learning
Most of the work around data can be safely placed into one of these groups, and the personas for each group are consistent in their skills and experience. If you are interested, you can read further on data science roles in this great post. Among these groups, I will focus on the need for production-level code in data science as a sub-category.
Even though there is a lot of hype and attention around data science, the practice is still relatively new and doesn’t have the maturity and ecosystem that software engineering enjoys. Given the ever-increasing demand for data scientists, recruits have had to come from various disciplines such as academia, analytics, and software engineering. Data science bootcamps, master’s degrees, and online courses have rightly concentrated on the theory and practical applications of machine learning algorithms. Software engineering practices and writing good code are significant parts of data science, but they are not the core, and these skills are honed with experience. As a result, data science teams have been built from members with varying degrees of software engineering expertise, leading to inconsistent code quality and engineering practices.
What constitutes production code for data science?
Let’s accept the claim that data scientists write messy code. Is this necessarily a bad thing? Isn’t much of data science experimentation? After all, only a small fraction of the code data scientists write ends up in production. Not exactly! The output of data science is information. It can be argued that any information that is useful enough to inform and influence the rest of the company is part of production; therefore, the code that produces the information should be considered production code¹. The focus may shift to a clear narrative rather than computational efficiency, but it still requires the same care.
Is it production code all the time for data scientists? Data scientists wear different hats in various parts of data science projects. Exploration and experimentation are integral parts of the job. You forage for ideas, go through literature, lay everything out on your hypothetical table, and start interacting with them. For most data scientists, this process happens in a notebook environment (e.g. Jupyter Notebook). You want to test, iterate, and break things at this stage, and you want to do it fast. Only when the skeleton of the experiments becomes clear and a narrative is established do reproducibility, readability, and documentation become a necessity.
The arguments above make the case for production code; however, it still needs to be defined properly in the context of data science. After all, you need clear expectations if you ever expect to hold your team accountable to them. At Beamery, we consider that production code should adhere to the following principles:
Reproducibility
A non-reproducible analysis is nothing more than speculation. The code should always be version-controlled, and findings should be presented alongside the code that produced them.
Readability
As the famous quote from Robert C. Martin goes:
“… the ratio of time spent reading versus writing is well over 10 to 1 …” ²
A common project structure makes every project easy to navigate, and enforcing shared style conventions makes the code easier to read.
Documentation
Code should carry a clear narrative through comments, docstrings, and README files where necessary. Reinforcing the code with business case write-ups, discovery/exploration analysis, and qualitative and quantitative evaluations is crucial to presenting ideas in a cohesive manner. We also support and encourage visual explanations in the form of flowcharts and mind maps, as they complement the rest of the documentation. This documentation process allows for better collaboration and knowledge sharing.
Modularity
Modular code/pipelines are easier to understand, design, and test than large scripts/notebooks.
If the majority of the team members accept and uphold the principles above, the rest of the team adjusts accordingly, which in turn results in a consistent and cohesive codebase. These principles can be highlighted in collaborative work, and pull requests can be leveraged as a tool to enforce style and structure. As a result, new members can get up to speed quickly, with expectations and good examples clearly defined. This is how we address the problem of establishing common ground in a team whose members come from different disciplines and have varying levels of software engineering expertise. The principles above are a distillation of our experience at Beamery, a scale-up with a growing data science team. However, the problems they address should be familiar to most data science teams.
Adopting Kedro As A Good Habit Enforcer
At Beamery, the data science team is growing rapidly, and so are the codebase and the number of artifacts created. If we are to foster communication and collaboration within and between projects, we need to set a common language (metaphorically) in terms of project structure and code practices. Looking for responses to similar issues in the industry, we came across Kedro, an open-source Python framework that borrows concepts from software engineering practices such as modularity and separation of concerns. After a couple of successful trials involving multiple team members, we adopted Kedro as our project structure framework for data science.
What is Kedro?
In essence, Kedro acts as a structure enforcer. It builds on the Cookiecutter Data Science project. The uniform file structure brings consistency across projects, which improves readability, fosters collaboration, and makes it easier to learn from and build on top of previous work.
Kedro allows for modularity in the code so that experimentation can be partitioned into independent and interchangeable modules. This is enabled through concepts such as nodes and pipelines. A node is a wrapper around a Python function that specifies the function’s inputs and outputs; it is the building block of a pipeline. A pipeline organizes the dependencies and execution order of a collection of nodes and connects their inputs and outputs while keeping your code modular.
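To make the node and pipeline idea more concrete, here is a minimal, dependency-free sketch. It is not Kedro’s implementation or API: the `Node` class and the `run_pipeline` function are hypothetical stand-ins, and the toy functions and dataset names are assumptions chosen purely for illustration. It only shows how an execution order can be derived from the input and output names that each node declares.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Sequence, Tuple


@dataclass
class Node:
    """A plain function plus the names of the datasets it reads and writes."""
    func: Callable[..., Any]
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]


def run_pipeline(nodes: Sequence[Node], data: Dict[str, Any]) -> Tuple[Dict[str, Any], List[str]]:
    """Run every node once, in an order derived only from dataset names."""
    data = dict(data)  # do not mutate the caller's dictionary
    pending = list(nodes)
    order: List[str] = []
    while pending:
        # A node is ready as soon as all of its inputs are available.
        ready = [n for n in pending if all(name in data for name in n.inputs)]
        if not ready:
            raise ValueError("Some inputs can never be produced (missing data or a cycle).")
        for n in ready:
            result = n.func(*(data[name] for name in n.inputs))
            values = result if len(n.outputs) > 1 else (result,)
            data.update(zip(n.outputs, values))
            order.append(n.func.__name__)
            pending.remove(n)
    return data, order


def split_data(rows, test_ratio):
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]


def train_model(train):
    # Stand-in "model": always predict the mean of the training data.
    return sum(train) / len(train)


def evaluate(model, test):
    # Mean absolute error of the constant prediction on the test rows.
    return sum(abs(x - model) for x in test) / len(test)


# Nodes are deliberately declared out of order; names alone decide the order.
nodes = [
    Node(evaluate, ("model", "test"), ("mae",)),
    Node(train_model, ("train",), ("model",)),
    Node(split_data, ("rows", "params:test_ratio"), ("train", "test")),
]
results, order = run_pipeline(
    nodes, {"rows": [1.0, 2.0, 3.0, 5.0], "params:test_ratio": 0.25}
)
print(order)
print(results["mae"])
```

In Kedro itself, the corresponding building blocks are `node` and `Pipeline` from `kedro.pipeline`, and the framework resolves the execution order for you. The `params:` prefix in the sketch mirrors the way Kedro refers to configuration parameters as node inputs.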
Another useful feature of Kedro is the concept of a DataCatalog. It’s essentially a Python class that takes one or more configuration files and registers the data sources the project can access. The framework takes care of the parameterized saving and loading of the data. Coupled with distinct data folder layers, it provides a clean and readable way to manage the datasets used in a project.
Configuration management in Kedro allows for easy parameterization of the pipelines. A properly defined configuration file can describe all the decisions made in a specific iteration of the experiment; therefore, it is useful in experiment tracking as well.
Our Experience with Kedro
Kedro has proved to be quite useful at Beamery, primarily as a project structure enforcer. Increased readability and modularity allow for easy collaboration and efficient knowledge exchange. Pipelines and proper configuration give a boost to reproducibility, especially if enough care is given to data validation steps.
There will be a learning curve, and the curve is steeper if the user is coming primarily from using notebooks. There are Kedro-specific concepts that may take time to grasp. For the first few projects, attention will be split between the actual problem and the project setup. However, once this hurdle is overcome, finalized projects will resemble the production-level code that is expected of any work that produces useful information.
An important point is that not every part of the project needs a strict structure. The focus of exploration is not efficiency or modularity. Most of that work will still happen in notebooks, and there is a dedicated space for it in the project structure. When one of our POCs with Kedro was finalized, 90% of the code was still in notebooks. This is where discovery and messy exploration happen. Only when ideas had matured did we start moving the code into pipelines and nodes, with the expectation of production-level code.
Extending Kedro: Experiment Tracking with MLflow
Depending on the use case, the functionalities of Kedro can be extended to support new behaviours. The desired functionalities could be data source integration, data validation, experiment tracking, or deployment of produced ML artifacts.
There are a few ways to inject new behaviours into Kedro projects. If you are looking to add data connectors to access various data sources, the DataSet abstraction allows for a wide list of data connectors, ranging from a CSV DataSet to a DataSet reading from Google BigQuery.
Another option is to inject additional CLI commands via plugins. There are 3 officially supported plugins: Kedro-Docker for packaging and shipping Kedro projects within containers; Kedro-Airflow for converting your Kedro project into an Airflow project; and Kedro-Viz for visualizing your Kedro pipelines. The open-source community around Kedro has been developing useful plugins such as Kedro-Great, a Great Expectations integration enabling catalog-based expectation generation and data validation on pipeline run (see for the list of plugins).
Lastly, the option that we found easiest to develop and extend is Hooks. As the name suggests, a Hook attaches additional behaviour to Kedro’s main execution. The example we’ll focus on adds experiment tracking functionality using MLflow. Kedro offers two main types of Hooks: execution timeline and component registration. The first allows for the injection of additional behaviour at particular execution points, such as after running a certain node or at the end of the pipeline. The second is for registering library components to be made available for use within the project. For experiment tracking, execution timeline Hooks are the more intuitive choice, as we expect the pipeline to log at different stages of the execution.
Setting up hooks
Kedro readily defines certain specifications for particular execution points. These include after_pipeline_run and others. For our use case, we need to define where we would like to log information and select the hook specifications that can reach those points of execution.
Let’s assume that we have a node named split_data where the dataset is split into train and test sets, and another node called train_model that outputs the trained model artifact. Lastly, we’ll envision another node called evaluate which returns the accuracy metrics of the trained model on the test set.
Looking at the plan for logging actions, we will need the following after_pipeline_run hook specifications in the implementation. Next, we need to define the implementation in the following file in your Kedro project: <your_project>/src/<your_project>/hooks.py. We will use the @hook_impl decorator to declare a Hook implementation. The name of the implementation has to match the specification name for Kedro to understand where to call the hook. For example, naming the implementation method before_pipeline_run allows Kedro to associate it with the specification of the same name.
Inspired by the official documentation: https://kedro.readthedocs.io/en/stable/07_extend_kedro/02_hooks.html
The next step is to register the Hook implementations in <your_project>/src/<package_name>/settings.py. You can register multiple Hook implementations as shown below.
Source (Official implementation): https://kedro.readthedocs.io/en/stable/07_extend_kedro/02_hooks.html#registering-your-hook-implementations-with-kedro
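The Hook implementation described above could look roughly like the following sketch, modelled on the pattern in the Kedro documentation. Several details are assumptions rather than facts from this post: the nodes are assumed to have been created with the explicit names `split_data`, `train_model`, and `evaluate`; the dataset names `params:test_ratio`, `model`, and `metrics` are hypothetical; the model is assumed to be a scikit-learn estimator; and the `tracker` constructor argument is an illustrative addition (not part of Kedro or MLflow) that makes it possible to pass in a stand-in object when testing. The import fallback exists only so that the sketch can be read and exercised without Kedro installed.

```python
from typing import Any, Dict

try:
    from kedro.framework.hooks import hook_impl
except ImportError:
    # Fallback for reading/testing this sketch without Kedro installed.
    def hook_impl(func):
        return func


class ModelTrackingHooks:
    """Logs parameters, the model, and metrics of a pipeline run."""

    def __init__(self, tracker: Any = None) -> None:
        # `tracker` is a hypothetical injection point; by default the real
        # mlflow module is used.
        if tracker is None:
            import mlflow
            import mlflow.sklearn  # makes mlflow.sklearn.log_model available

            tracker = mlflow
        self._tracker = tracker

    @hook_impl
    def before_pipeline_run(self) -> None:
        # Open a new tracking run before any node executes.
        self._tracker.start_run()

    @hook_impl
    def after_node_run(self, node: Any, inputs: Dict[str, Any], outputs: Dict[str, Any]) -> None:
        # Dataset names below are assumptions about the project's pipeline.
        if node.name == "split_data":
            self._tracker.log_params({"test_ratio": inputs["params:test_ratio"]})
        elif node.name == "train_model":
            self._tracker.sklearn.log_model(outputs["model"], "model")
        elif node.name == "evaluate":
            self._tracker.log_metrics(outputs["metrics"])

    @hook_impl
    def after_pipeline_run(self) -> None:
        # Close the tracking run once the whole pipeline has finished.
        self._tracker.end_run()
```

With a class like this in hooks.py, registration in settings.py would amount to importing it and adding an instance to the `HOOKS` tuple, for example `HOOKS = (ModelTrackingHooks(),)`. A more defensive version might also end the run when the pipeline fails, which this sketch does not handle.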
Assuming that you have an MLflow server running, after running the pipeline you’ll find that a new experiment has been registered, the parameters, model, and metrics have been logged, and the run has ended. This simple tutorial shows how easy, yet powerful, Hook implementations in Kedro can be.
In this post, we have discussed the need for production-level code for data science projects. We defined our expectations of production code at Beamery. We maintain that any code producing information that is consumed in decision making is production code. That said, certain aspects of data science, such as exploration and early experimentation, focus on fast iteration and fast failure. Once the skeleton of the experiments becomes clear and a narrative is established, reproducibility, readability, and documentation become a necessity.
To provide a common ground for the data science projects in our growing team, we have looked at ways to enforce structure in our codebase. Kedro has fulfilled this need and provided additional capabilities for increased readability, reproducibility, and modularity in our projects. In the small Kedro Hooks tutorial, we have looked at extending Kedro with experiment tracking capabilities using MLflow. Thanks to this extensibility, Kedro can be tailored to data science teams with their own ways of working and expectations.
- Robert C. Martin, Clean Code: A Handbook of Agile Software Craftsmanship