Five Software Engineering Principles for Collaborative Data Science
Reproducible data science projects need project organization and clean code
A traditional software engineer sets out rules in code. In contrast, a data scientist relies on learning algorithms that identify patterns in data. But analytics projects are still bound together with conventional code, and as a data scientist, you can benefit from best practices first pioneered by software engineering.
Using a metaphor, let’s say you’re a chef and you’re making a fancy dinner. In the kitchen, you have different ingredients and tools, and your job is to use the tools and combine the ingredients in the right way to make a delicious meal. You and your customers want your meal to be tasty and not burnt or undercooked. A disorganized kitchen makes it hard for anyone else to cook your meal, so a good chef takes time to keep things neat, label ingredients and document the meal in a set of recipes.
Likewise, as a data scientist, you want your output and insights to be correct and reproducible. Following best practices in your “code kitchen” helps you create well-ordered code, which means that you and others can pick up the analytics project in the future to understand, extend and reuse it.
In this article, we’ll talk about some approaches so that your “glue code” is reliable, efficient and easy to understand. When you start to cut code on a prototype, you may not prioritize maintainability and consistency, but adopting a culture and way of working that is already proven can get your prototype to production-ready faster.
“Well-engineered data science code can help you unlock valuable insights.”
1. Use a standard and logical project structure
Data science aims to generate a set of insights, such as reports and visualizations, but it’s essential to consider the quality of the programmatic code that produces them. While an experiment with data may have a serendipitous outcome, you and your potential colleagues need to be able to extend the experiment and re-run it in the future.
It’s a good idea to start each experiment with the same, consistent and logical project structure. Anyone looking at the project can understand the layout without requiring extensive documentation. Well-organized code tends to be self-documenting because it provides context.
Using the same structure for every project helps reproducible collaboration, which means you can be confident in the conclusions drawn from the analysis. Have you ever tried to reproduce something you did a few months ago? You may have absorbed all the details then but returning later means starting almost from scratch if you left the project disorganized.
A consistent and reliable project structure makes it easier to return to your work and to share it with others, so that your teammates can maintain and modify the project.
There’s no single right or wrong way, but you should adopt a semantic folder structure whereby the location encodes meaning (for example, with folders for configuration, data, documentation, notebooks and source code). This approach makes project navigation easy since the location of an object describes its purpose.
If you’re looking for inspiration, look at DrivenData’s page on CookieCutter Data Science, which they describe as “A logical, reasonably standardized, but flexible project structure for doing and sharing data science work”. And take a look at the open-source project Kedro, which is built on the learnings of CookieCutter Data Science with modifiable project starters to customize your templates.
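As an illustration, here’s a minimal sketch of a semantic layout, loosely modelled on the CookieCutter Data Science template; the folder names are only an assumption, so adapt them to your own project:

```text
my-analysis/
├── conf/             # configuration, e.g. parameters.yml
├── data/
│   ├── raw/          # immutable input data
│   └── processed/    # intermediate and final outputs
├── docs/             # project documentation
├── notebooks/        # exploratory Jupyter notebooks
├── src/              # reusable source code (pipelines, utilities)
├── tests/            # unit and integration tests
├── README.md
└── requirements.txt
```

Anyone opening the project can guess where the configuration lives and which code is exploratory versus reusable, without reading a line of documentation.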
2. Make your environment reproducible with dependency management
Most data science Python code imports third-party packages that offer reusable functionality. These packages come in different versions and typically have their own dependencies on specific versions of other Python packages.
Dependency management documents the exact working environment for your project so it is easy to reproduce the setup on a different, clean environment by installing the equivalent set of packages.
One option would be to write a list of every package dependency and its sub-dependencies in the documentation. The recommended approach is to set out the information using a standardized, reproducible, widely-accepted format, such as the input to pip install.
For each package your project directly depends upon, list it together with the version you need, to ‘pin’ it. Packages may be updated frequently; pinning protects you from potential disruption when a change introduces bugs or non-backwards-compatible changes.
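As a sketch, a pinned requirements.txt might look like the example below; the package names and version numbers are purely illustrative, so pin whatever your project actually imports:

```text
# requirements.txt -- illustrative pinned dependencies
pandas==1.5.3
scikit-learn==1.2.2
PyYAML==6.0

# Recreate the environment elsewhere with:
#   pip install -r requirements.txt
```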
Virtual environments
If you are working with Python, always use a virtual environment for your project to protect the environment against potential changes to the global environment. For example, if you depend on a feature of the Pandas project introduced in 2021, but another project you are working on needs an older version, there will be a conflict over which is needed if you work in a single, global space.
Keeping a separate clean environment for each project, for example, using conda or venv, ensures greater project reproducibility since you avoid version clashes.
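As a minimal sketch, setting up and using a venv from the command line looks something like this (conda offers equivalent create and activate commands):

```bash
# Create an isolated environment in a local .venv folder
python -m venv .venv

# Activate it (on Windows, run .venv\Scripts\activate instead)
source .venv/bin/activate

# Install the pinned dependencies into the clean environment
pip install -r requirements.txt
```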
Find out more about why you need a virtual environment
3. Make your code reusable by making it readable
Clean Code: A Handbook of Agile Software Craftsmanship is a well-regarded software engineering book from 2008 that sets out the best practices to follow for sound and efficient code, regardless of the programming language you use or its purpose. It sets out several principles to write good code from scratch and improve bad code as you encounter it, and it describes “code smells” that indicate when your code is “off”.
Besides reading the book, you can find numerous videos, training courses and summaries of the book, depending on how much detail you need. It’s not my intention to reproduce it all here but to consider one aspect the book describes that facilitates collaboration: code readability.
“Code is read much more often than it is written.”
Although there doesn’t seem to be a single source of this quote, it is often ascribed to Guido van Rossum, creator of the Python programming language and contributor to the milestone PEP8 document that provides guidance on writing readable code.
You can make your code more readable by following common standards and conventions and asking your team to adopt code reviews. You may want to concentrate initially on the code’s functionality, but if you make code readable when writing it, you will find it simpler to work with later. Clarity helps if you need to debug it, and you’ll find it easier to maintain if other people have checked it and confirmed they understand your approach and that it follows some basic rules.
Here are a few pointers when you look at your code (or someone else’s):
- It is common for novice programmers to use abbreviations or short names for functions and variables. It’s difficult to interpret them if you didn’t write the code; even if you did, you’d find it hard to understand a few months after writing. Create meaningful names.
- The best code is self-documenting, meaning it requires few comments to understand, but comments are helpful to document non-trivial code at a function level. Just don’t write a big block of text that duplicates the code.
- Make code readable by using whitespace. You’ll find this straightforward if you use Python, which gives syntactic meaning to whitespace.
- Write small functions that do just one task, with single return paths and a limited number of arguments.
- Don’t use hardcoded values; instead, use precisely-named constants and put them all into a single configuration file so you can find and update them easily. Configuration management tools like OmegaConf, python-anyconfig or PyYAML are designed to help with this; the sketch after this list pulls a few of these pointers together.
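The snippet below illustrates meaningful names, small single-purpose functions and constants loaded from one configuration file rather than hardcoded. The file name and the cutoff_year key are assumptions made for illustration.

```python
import yaml  # PyYAML, one of the configuration tools mentioned above


def load_config(path: str = "conf/parameters.yml") -> dict:
    """Read project parameters from a single YAML configuration file."""
    # Assumed file contents, e.g.:  cutoff_year: 2021
    with open(path) as config_file:
        return yaml.safe_load(config_file)


def filter_recent_orders(orders: list, cutoff_year: int) -> list:
    """Return only the orders placed on or after the cutoff year."""
    return [order for order in orders if order["year"] >= cutoff_year]
```

With a conf/parameters.yml containing cutoff_year: 2021, a caller writes filter_recent_orders(orders, load_config()["cutoff_year"]) and the threshold stays out of the code, easy to find and easy to change.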
Don’t forget documentation
Documentation also helps code readability and varies in levels of detail:
- Basic inline comments
- API documentation from docstrings that explain how to use/reuse a function
- Markdown files, such as the README page in the root of a GitHub repository, that explain project setup or specific usage details.
Keep your documentation up to date; outdated documentation is misleading and worse than no documentation at all. Invest some time in learning how to build and publish your docs, using a tool such as Jekyll or Sphinx.
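As a small sketch, a docstring like the one below (the function itself is hypothetical) can be rendered into API documentation by tools such as Sphinx:

```python
def split_train_test(rows: list, test_fraction: float = 0.2) -> tuple:
    """Split rows into training and test sets.

    Args:
        rows: The full dataset, one record per element.
        test_fraction: Proportion of rows to hold out for testing.

    Returns:
        A (train, test) tuple of lists, preserving the input order.
    """
    split_index = int(len(rows) * (1 - test_fraction))
    return rows[:split_index], rows[split_index:]
```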
4. Refactor notebook code into pipelines
Up to this point, the advice has been sufficiently generic that it could be picked up by a junior software engineer as much as a data scientist. This point is more specific to data science, which typically works through a sequence of steps for data ingestion, transformation, model training, scoring and evaluation.
Using Python functions and packages to form a pipeline makes it possible to encode the sequence of task execution. There are several open-source solutions to assist in constructing these types of pipelines, such as GNU Make, a general, legacy tool that still meets many of the needs of modern data science, and Snakemake for Python. Other popular pipeline tools include Kedro, Luigi, Metaflow, Airflow, Prefect, and Ploomber. When selecting a tool, you should consider the learning curve and whether you need additional features such as being able to schedule pipeline runs.
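As a minimal, hand-rolled sketch (deliberately not tied to any of the tools above), a pipeline is just a set of small, testable functions chained in an explicit order; the step names, columns and file path here are hypothetical:

```python
import pandas as pd


def ingest(path: str) -> pd.DataFrame:
    """Load the raw data from disk."""
    return pd.read_csv(path)


def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Drop incomplete rows and normalise column names."""
    clean = raw.dropna()
    clean.columns = [column.lower() for column in clean.columns]
    return clean


def train(features: pd.DataFrame):
    """Fit a model on the transformed data (placeholder for real training)."""
    ...


def run_pipeline(path: str):
    """Execute the steps in an explicit, reproducible order."""
    return train(transform(ingest(path)))
```

Calling run_pipeline("data/raw/orders.csv") runs every step in order, and each function can also be imported and tested on its own; dedicated pipeline tools add features like caching, scheduling and visualization on top of this basic idea.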
Benefits of pipelines
Reproducibility: Anyone can reproduce the results from the raw data with little effort
Correctness: The outcomes are testable
Readability: A new team member can pick up and understand the pipeline
Extensibility: You can take a small pipeline and extend it to ingest multiple data sources, use different models and create reports.
Maintainability: You can edit and retest.

As we’ve previously described, Jupyter notebooks are great for a quick prototype, but they’re like the table by the entrance of your home or the drawer of odd bits and pieces. However good your intentions are, it’s where clutter, like hard-coded constants, print-statement debugging and unused code, accumulates. The more code in a notebook, the harder it is to know whether the code you’re writing functions as you expect.
Testing, testing
Using pipelines allows you to put your functionality in Python modules that you can test, update, and test again, with no side effects from zombie code to complicate the interpretation. The pytest framework is there for you.
Write tests! If you include writing tests (such as unit tests, integration tests and data validation tests) in your definition of done, you can’t skip them, and you’ll account for them when you estimate the size of a piece of work. What’s more, good tests also function as documentation because reading the tests will help with understanding what the code does.
In most data science projects, much of the code transforms data, while only a small part of the codebase is actual machine learning. Data transformation code in a pipeline is straightforward to test (by definition, it should return the same output for the same input). Even machine learning code can be tested to confirm it works as expected: you can write functional tests that check the model’s metrics (e.g. accuracy, precision) exceed an expected threshold.
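As an illustration with pytest, the sketch below checks a small (hypothetical) transformation for a known input/output pair and asserts that a model metric clears a minimum threshold; the function, data and threshold are all assumptions:

```python
# test_transform.py -- run with `pytest`


def clean_amounts(rows: list) -> list:
    """Hypothetical transformation step: strip whitespace and cast to int."""
    return [{"amount": int(str(row["amount"]).strip())} for row in rows]


def test_clean_amounts_is_deterministic():
    """Same input, same output: transformation code is easy to test."""
    sample = [{"amount": " 10 "}, {"amount": "20"}]
    assert clean_amounts(sample) == [{"amount": 10}, {"amount": 20}]


def test_model_accuracy_meets_threshold():
    """Functional test: the model's accuracy should clear an agreed floor."""
    labels = [1, 0, 1, 1, 0]
    predictions = [1, 0, 1, 0, 0]  # in a real test, score your trained model here
    accuracy = sum(p == a for p, a in zip(predictions, labels)) / len(labels)
    assert accuracy >= 0.8
```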
“Move code out of notebooks into Python modules and packages to form pipelines as early as possible to manage complexity.”
5. Invest some time in mastering version control
Version control systems (VCS) like Git, Mercurial or Subversion allow you to store checkpoint versions of your code so that you can modify it but later roll back to a previous version. It’s like having a series of backups, and, what’s more, you can share them with other developers.
It’s wise to invest some time in learning the principles of version control to maximize the value they offer you, so you can deal with more complex scenarios such as awkward merges. There are some excellent hands-on learning materials available, such as the quick start guide to Git on GitHub and the Learn Git Branching tutorial.
Some data scientists learn the basics of commit, but here are a few best practices to consider:
- Commit often: If you develop your code substantially without committing the changes, you could risk losing the time you’ve already spent on it by inadvertently adding a change that breaks it and then being unable to roll it back. Also, be careful when drinking that coffee near your laptop! The codebase may move on if others are working on it too, and when you do come to commit, you’ll potentially face conflicts and merge hell.
- Only commit what you need: Not all files should be stored, particularly your personal local configuration, secrets (e.g. database login credentials) or intermediate files that result from a build. Learn how to use .gitignore (see the sketch after this list).
- Don’t store raw data in version control: The raw data won’t change, so you don’t need to version it. That’s also true for intermediate data files if you can regenerate them from raw data and your code. If you need to keep track of transformed data, there are data/artifact/workflow version control tools like DVC (Data Version Control) or Pachyderm that extend your Git code versioning.
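As a sketch, a data science .gitignore often contains entries like these; the exact paths are assumptions that depend on your project layout:

```text
# .gitignore -- illustrative entries
# Virtual environment and Python bytecode
.venv/
__pycache__/
# Local secrets such as database credentials
.env
# Raw and regenerable intermediate data stay out of code version control
data/raw/
data/processed/
# Notebook autosaves
.ipynb_checkpoints/
```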
The benefit of using version control, within your own projects and across a team, is that you can experiment with confidence: whatever risk you take in the code, you can swiftly revert to a known, working state. Combining a solid version control process with testing gives you a powerful way of working. When your code produces the right results, commit it; when you next change your code, rerun the tests. If they pass, you probably didn’t break anything. If they don’t, you can rework or revert.
Summary
It’s been over ten years since Harvard Business Review reflected upon the growing need for data scientists: people who could combine programming, analytics, and experimentation skills. There still isn’t a single well-defined career path into data science, but there are now plenty of ways to learn. Typically, the role attracts people who are confident with math, happy to experiment with complex, messy data sets, and able to code.
Data science may have now established itself as a role, but software development has a few decades of maturity and lessons learned. Some of the most valuable techniques a data scientist can pick up are those that generations of software engineers have established, such as combining version control, testing, readable, clean code and a well-organized folder structure. These can make the difference between a successful production-level project and one that stalls after prototyping.
If you’re a data scientist looking for inspiration in 2023, borrow these engineering best practices to achieve long-term analytic success.