Machine Learning in Production: 9 Best Practices to make you a more effective Data Scientist

Data Science in Production
Feb 18, 2020 · 11 min read


The job of a Data Scientist has become so complex and multi-disciplinary that most of us are experiencing similar productivity problems. In this blog post, we introduce 9 recommendations to help you become a better data science practitioner.

The material presented here is a brief summary of the lessons we have learned through our professional experience. For a more complete and detailed treatment, you can apply to our ML In Production Training, February 28–29th and March 6–7th in Barcelona, Spain.

1. Organize your project meaningfully

Like it or not, our discipline’s work is usually a team effort. Whether we are collaborating with a fellow Data Scientist, peer-reviewing a colleague’s contribution, or working with engineers to deploy our models, ours is seldom a one-person job. As a consequence, it is important to structure our projects in a way that makes collaboration easier. To that end, our first recommendation is to:

“Organize your projects in a clear, consistent, meaningful manner”

Let’s define each term in more detail:

  • Clear means that the purpose of each file and folder is easily understood from its context and name.
  • Consistent means that every project follows the same logic, so that there are no surprises as you work with different people.
  • Meaningful means that every choice has a proper motivation.

Luckily for us, we don’t need to reinvent the wheel and think through the structure necessary for each use-case. Instead, we can rely on tools like Cookiecutter or pyScaffold to create or adapt templates that someone else has already built, and use them for each new project so that the cognitive load of maintaining consistency is close to zero.

We may even have several templates, for example: one template designed for one-off analyses and prototypes, and another for projects related to the training and deployment of ML models. These templates can include common setup patterns encoded in Makefiles (see for instance an example from our training’s capstone project) and pre/post hooks that automate other tasks like the initialization of a git repository or the configuration of pre-commit hooks.
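
For illustration, here is a minimal sketch of bootstrapping a new project from a template through cookiecutter’s Python API (the template URL and the project name below are just examples, not an endorsement of a specific template):

```python
# A minimal sketch: create a new project from a template with cookiecutter's
# Python API. The template URL and the project name are illustrative only;
# in practice you would point to your team's own template repository.
from cookiecutter.main import cookiecutter

cookiecutter(
    "https://github.com/drivendata/cookiecutter-data-science",  # example template
    no_input=True,  # accept the template defaults instead of prompting
    extra_context={"project_name": "churn-prediction"},  # hypothetical project name
)
```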

2. Use Version Control to manage changes

Figure: Version Control will help you stay sane in a collaborative environment, credits: http://phdcomics.com/comics/archive_print.php?comicid=1531

Data Science projects are typically iterative in nature: you try and implement a few ideas, test their effectiveness, and use the learnings to implement subsequent iterations that are more complete or better performing.

As a result, we need to manage changes in our code (and possibly, our data) in a way that allows us to maintain our sanity. Version Control Systems (VCS) like git or mercurial are the industry standard in Software Development, and mastering them is a powerful skill to have in your toolbox. In addition, there are a number of practices that we should promote in our teams to best leverage VCS:

  • Keeping one or several repositories for all your data science projects, integrated with the production environment of the rest of the company.
  • Introducing pair programming to reinforce high quality code standards, which also has the benefit of helping more senior data scientists mentor less experienced members of the team.
  • Enforcing peer review for code and data science modeling choices.
  • Setting repository permissions to require one or more pull request approvers whenever features are merged into the master branch.
  • Adding proper package versioning and integrating it with a CI/CD (Continuous Integration / Continuous Deployment) framework.
  • Using tools like DVC to manage dataset and artefact versioning and provenance (see the sketch below).
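
For example, here is a minimal sketch of loading a DVC-tracked dataset at a specific git revision through DVC’s Python API, so that an experiment is tied to an exact data version (the repository URL, file path and tag below are hypothetical):

```python
# A minimal sketch: read a DVC-tracked file at a given git revision so that
# the data version used by an experiment is explicit and reproducible.
# Repository URL, path and revision are hypothetical.
import pandas as pd
import dvc.api

with dvc.api.open(
    "data/train.csv",                       # path tracked by DVC
    repo="https://github.com/org/project",  # hypothetical repository
    rev="v1.2.0",                           # git tag or commit of the data version
) as f:
    train = pd.read_csv(f)
```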

3. Improve your Python knowledge

Python is the most widely used programming language in Data Science, for two fundamental reasons:

  1. It enables fast development and prototyping of new projects,
  2. It can also be used in production-grade environments and data pipelines.

Even though it is easy to learn basic Python, most Data Scientists have a limited knowledge of its advanced features. In order to successfully use or extend existing Machine Learning tools, we believe it is important to learn and master some of its key features, a few of which are illustrated in the sketch after this list:

  • Modularity: Objects, Functions, modules, packages
  • Object Oriented Programming: Classes, Inheritance, Operator Overloading
  • Other programming patterns: generators, iterators, map, reduce, decorators
  • Proper implementation tools: logging, caching, testing, run-time configuration.
  • The richness of its standard libraries, see for example: “Python 3 Module of The Week” or the “official” “Brief Tour of the Standard Library”.
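
As a small illustration of some of these features, the self-contained sketch below combines a decorator, a generator and two standard-library modules (logging and functools); the function names are just examples:

```python
# A small sketch touching several of the features above: a logging decorator,
# a generator that yields batches lazily, and caching from functools.
import logging
from functools import lru_cache, wraps

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def log_call(func):
    """Decorator that logs every call to the wrapped function."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        logger.info("Calling %s with args=%s kwargs=%s", func.__name__, args, kwargs)
        return func(*args, **kwargs)
    return wrapper

def batches(items, size):
    """Generator yielding fixed-size batches without materializing them all at once."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

@log_call
@lru_cache(maxsize=None)
def square(x):
    return x * x

if __name__ == "__main__":
    for batch in batches(list(range(10)), size=4):
        print([square(x) for x in batch])
```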

4. Make your code portable and reproducible

Whether you work at a 10-person start-up or at a big corporation, your code will, at some point, need to run in different computation environments. Examples of these could be your colleague’s computer, a server in the cloud or your own laptop at a later point in time.

We define the computation environment as everything involved in executing your code: the architecture of the computer(s) where it runs, its operating system, the system libraries it calls, as well as any other services that your code may interact with (databases, other micro-services, etc.). Our work’s output consists of a collection of artifacts resulting from processing a number of inputs (data, environment) with our software (defined by our source code).

Figure: The computation environment is one more input to your data processing work, which processes data to generate artifacts.

As we integrate our work with the rest of the software developed in our company, we will need to make sure our software runs properly in at least three different environments:

  • Development: this is where we develop and test our code, for example: your or your colleague’s laptop.
  • Staging/Testing: this is where your code is integrated with the other resources or services it relies on, in an environment emulating that of production.
  • Production: this is the real world, where your code will interact with the rest of the infrastructure and the universe in a (hopefully) predictable way.

Even if you don’t think your code ever gets to “Production”, let me ask you: is there anybody making decisions based on your analyses? Are users or customers being impacted by your work’s output? If you answered “yes” to any of these questions, then your code is already running in a “Production” environment!

In order to make sure that our code runs predictably in all these different environments (i.e. generates the same output for the same input), we need to make it portable and reproducible. To make it portable we need to ensure, to the best of our abilities, that the changes in the environment do not translate into unexpected changes in the code’s output. To make it reproducible, we need to make it easy for other actors in different environments to run our code and generate the same results.
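
One simple, tool-agnostic way to improve portability is to keep environment-specific details out of the code and read them from configuration instead, for instance from environment variables. A minimal sketch (variable names and defaults are hypothetical):

```python
# A minimal sketch: read environment-specific settings from environment
# variables so the same code runs unchanged in development, staging and
# production. Variable names and defaults are hypothetical.
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    environment: str
    database_url: str
    model_bucket: str

def load_settings() -> Settings:
    return Settings(
        environment=os.environ.get("APP_ENV", "development"),
        database_url=os.environ.get("DATABASE_URL", "sqlite:///local.db"),
        model_bucket=os.environ.get("MODEL_BUCKET", "local-artifacts"),
    )

settings = load_settings()  # the same call works in every environment
```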

Luckily, there are several tools we can leverage to make our life easier. For example, we can use conda or virtual environments to manage our dependencies, combined with Docker containers to manage OS-level requirements, and Makefiles to automate an otherwise cumbersome setup process (see the example from our training capstone project). These tools also allow us to specify our environment as code, which can be checked into a Version Control System together with the rest of our software. In addition, the usage of containers enables reproducible, orchestrated deployments via Kubernetes.

5. Know when to use Notebooks (and when to stop using them!)

Despite their shortcomings, Notebook tools like Jupyter Notebooks, Binder or Google Colab have become the de-facto IDE for quick prototyping and for sharing analyses with colleagues. When we need to quickly load data and train a baseline model within 15 minutes, the fastest way is to spin up a notebook and get a quick validation going.

However, as has been extensively reported (see for instance this nice JupyterCon talk by Notebook’s nemesis Joel Grus or this piece from ThoughtWorks), Notebooks are not good software development tools because, among other reasons:

  • They maintain state in an error-inducing manner.
  • They are hard to code-review without specific plugins.
  • They incentivize action over reflection on how the code should work.
  • They are hard to extend in a sustainable way.

This means that as Software Developers (yes, most Data Scientists are also Software Developers), we must know when we need to switch from the right tool for a quick prototype to the right tool for software development. And usually that moment is sooner than most of us would dare to admit!

Figure: Are you starting to define non-trivial helper functions to process your dataset…? It may be time to think about moving away from the Notebook, putting the functions into their own module and start writing unit tests!
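
For instance, a helper that started life in a notebook cell can be promoted to a module and covered by a unit test; the file names and the feature logic below are hypothetical:

```python
# features.py -- a hypothetical helper moved out of a notebook into a module.
import pandas as pd

def add_ratio_feature(df, numerator, denominator, name="ratio"):
    """Return a copy of df with a ratio column; a zero denominator yields NaN."""
    out = df.copy()
    out[name] = out[numerator] / out[denominator].replace(0, float("nan"))
    return out


# test_features.py -- the matching unit test, runnable with pytest
# (in the real project it would start with: from features import add_ratio_feature).
def test_add_ratio_feature_handles_zero_denominator():
    df = pd.DataFrame({"clicks": [10, 5], "impressions": [100, 0]})
    result = add_ratio_feature(df, "clicks", "impressions", name="ctr")
    assert result.loc[0, "ctr"] == 0.1
    assert pd.isna(result.loc[1, "ctr"])
```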

6. Accept your own cognitive biases and write tests!

“Beware of bugs in the above code; I have only proved it correct, not tried it.”

D. Knuth

In Software Development, testing is the norm. However, Data Science has traditionally followed a different development workflow, and testing has usually been a neglected subject.

From getting a dataset, cleaning it, training a model, and finally deploying it to Production, there is a vast number of potential bugs that can (and will) creep into your work. It is hence very important to spend time not only on the most basic unit testing, but also on thinking about how to test the full, end-to-end project pipeline.

For example, when you deploy data pipelines and machine learning models to a Cloud environment, such as Google Cloud Platform (GCP) or Amazon Web Services (AWS), you must also account for the interaction between your software and the cloud infrastructure’s components:

  • Managing dependencies on data sources (SQL/NoSQL Databases, Cloud Storage,…)
  • Testing micro-service endpoints like RESTful interfaces, where one should mock parts of the architecture so that the endpoints can be tested in isolation (see the sketch after this list).
  • Keeping in mind that the underlying environment may change due to security patches, software updates, changes in the underlying infrastructure, etc.
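
As a small sketch of the mocking idea, consider a hypothetical function that fetches features from an external store over HTTP with requests; a unit test can patch the HTTP call so the surrounding logic is exercised in isolation, without touching the real service:

```python
# A minimal sketch: test logic that depends on an external service by mocking
# the HTTP call. The URL, payload and function are hypothetical.
from unittest import mock

import requests

FEATURE_STORE_URL = "https://features.internal/api/v1/users"  # hypothetical endpoint

def fetch_user_features(user_id):
    """Fetch the features of a user from an external feature store."""
    response = requests.get(f"{FEATURE_STORE_URL}/{user_id}", timeout=5)
    response.raise_for_status()
    return response.json()

def test_fetch_user_features_parses_response():
    fake_response = mock.Mock()
    fake_response.json.return_value = {"age": 42, "country": "ES"}
    fake_response.raise_for_status.return_value = None
    with mock.patch("requests.get", return_value=fake_response) as fake_get:
        features = fetch_user_features("user-123")
    fake_get.assert_called_once()
    assert features["age"] == 42
```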

In addition, testing Machine Learning components typically requires additional considerations, for example:

  • Computing metrics from the old and new models on new data, so that one can automatically decide whether to deploy the new model or keep the existing one (as sketched after this list).
  • Keeping in mind how to reproduce results when non-determinism is introduced, for example by making sure that you use the same random seed in all your deployments, and that you can also reproduce them even when running on top of different hardware architectures.
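
A minimal sketch of such an automatic promotion check, assuming both models expose a scikit-learn-like predict_proba interface and using a fixed random seed so the evaluation split is reproducible (the metric, names and threshold are just examples):

```python
# A minimal sketch of an automatic promotion check: evaluate the current and
# candidate models on the same held-out data, with a fixed seed, and only
# promote the candidate if it is at least as good. Names are hypothetical.
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

RANDOM_SEED = 42  # fixed so the evaluation split is the same across runs

def should_promote(current_model, candidate_model, X, y, min_gain=0.0):
    """Return True if the candidate beats the current model on held-out AUC."""
    _, X_eval, _, y_eval = train_test_split(
        X, y, test_size=0.2, random_state=RANDOM_SEED, stratify=y
    )
    current_auc = roc_auc_score(y_eval, current_model.predict_proba(X_eval)[:, 1])
    candidate_auc = roc_auc_score(y_eval, candidate_model.predict_proba(X_eval)[:, 1])
    return candidate_auc >= current_auc + min_gain
```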

For all the above reasons, it is important to have a properly set up CI/CD system in place that not only automatically runs unit tests upon every Pull Request or new branch, but also runs functional and integration tests prior to deploying the artefacts to your Testing or Production environment. See for instance the extensive post on the topic on Martin Fowler’s blog.

7. Automation is your friend

At this point, after only 7 of our 9 recommendations, you may already be wondering: “How am I supposed to take care of all this and still do my regular statistical modeling and research tasks?”

The answer is: use automation tools. Many of the tasks and concerns expressed in this blog post can be automated with a variety of tools, like the Makefiles, cookiecutter or pre-commit tools we have already seen. More specific to data processing and Machine Learning tasks, there are also a variety of frameworks for writing your processing steps as tasks, as well as orchestration and execution frameworks. Examples of these are:

  • Pydoit: An all-purpose build automation tool,
  • Pachyderm: A framework to easily build, train, and deploy your data science workloads on a Kubernetes cluster,
  • MLflow Projects: packages and chains Data Science workflows in a reproducible and reusable way

For more generic data-processing and workflow orchestration, the de-facto standards are:

  • Apache Airflow: In Airflow, workflows are described as Directed Acyclic Graphs (DAGs). Each node in the graph is a task, and edges define dependencies among the tasks (see the sketch after this list).
  • Luigi: Developed by Spotify, Luigi allows you to build ETLs as a collection of inter-dependent tasks.
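
As an illustration, a minimal Airflow DAG with two dependent tasks might look like the sketch below (written against the Airflow 1.10.x API that was current at the time of writing; the task functions are hypothetical placeholders):

```python
# A minimal Airflow DAG sketch: two Python tasks, where model training only
# runs after data extraction succeeds. Task logic is a hypothetical placeholder.
from datetime import datetime

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

def extract_data(**context):
    print("extracting data...")

def train_model(**context):
    print("training model...")

with DAG(
    dag_id="churn_training",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = PythonOperator(
        task_id="extract_data",
        python_callable=extract_data,
        provide_context=True,
    )
    train = PythonOperator(
        task_id="train_model",
        python_callable=train_model,
        provide_context=True,
    )
    extract >> train  # the edge defines the dependency between the two tasks
```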

8. Leverage Cloud Platforms

The old times of running your workloads on your own machine or your on-premise cluster are now gone… Due to their flexibility, their power and their ability to reduce capital expenses, Cloud Platforms have become an essential tool in our toolbox. In addition, they provide long-term advantages for building data-powered systems:

  • You can easily access virtually unlimited computing and storage resources,
  • They enable fast and easy deployment of ML models
  • They offer a variety of well-tested products for many different use cases

For example, Google Cloud Platform offers a number of tools that greatly reduce time-to-market for Data Science products:

  • Google Cloud Storage (GCS), which can be used to store large datasets securely and reliably,
  • BigQuery: Used for processing, querying, and working with huge datasets. It is extremely fast, even for complex queries over terabytes of data (see the sketch after this list).
  • Google Compute Engine (GCE): on-demand computing to deploy your Docker containers, orchestrate them, and build the full ML deployment solution.
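
For instance, querying BigQuery from Python takes only a few lines with the official google-cloud-bigquery client (the project, dataset and table names below are hypothetical, and credentials are assumed to come from the environment):

```python
# A minimal sketch: run a query on BigQuery with the official Python client.
# Project, dataset and table names are hypothetical; authentication is taken
# from the environment (e.g. GOOGLE_APPLICATION_CREDENTIALS).
from google.cloud import bigquery

client = bigquery.Client(project="my-analytics-project")

query = """
    SELECT user_id, COUNT(*) AS n_events
    FROM `my-analytics-project.events.pageviews`
    WHERE DATE(event_timestamp) = "2020-02-01"
    GROUP BY user_id
    ORDER BY n_events DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(row.user_id, row.n_events)
```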

Other large Cloud providers like Amazon Web Services or Microsoft Azure provide very similar managed services, relieving us from the need to build and maintain costly infrastructure in-house.

9. Learn how to deploy models as micro-services

After playing around with different hyper-parameters and model architectures, you’ve finally trained an awesome ML model… Congratulations!

Unfortunately, even though you may have already invested a lot of energy, you are not yet done! If you want your model to have an impact on your organization’s bottom line, it needs to go live and interact with users or customers. This means that you are left with two alternatives:

  1. You depend on other Engineers to make your model come to life, or
  2. You learn how to deploy your model so that the work left to other Engineers (which will likely still exist) is minimal. In our day and age, this means knowing about:
  • Docker
  • Orchestration with Kubernetes
  • Web Application frameworks like Flask or FastAPI (see the sketch after this list)
  • Production-grade considerations: auto-scaling, canary rollouts, load-testing…
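
To make this concrete, below is a minimal sketch of serving a trained model behind a REST endpoint with FastAPI; the model file and feature names are hypothetical:

```python
# A minimal sketch: serve a trained model as a micro-service with FastAPI.
# The model artifact and feature names are hypothetical; in practice this app
# would be packaged in a Docker image and deployed (e.g. on Kubernetes).
import joblib
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="churn-model")
model = joblib.load("model.joblib")  # hypothetical artifact produced at training time

class Features(BaseModel):
    tenure_months: float
    monthly_charges: float

@app.post("/predict")
def predict(features: Features):
    proba = model.predict_proba([[features.tenure_months, features.monthly_charges]])[0, 1]
    return {"churn_probability": float(proba)}
```

Such an app can be run locally with uvicorn (for example, uvicorn app:app if the file is named app.py), containerized with Docker, and then rolled out and scaled with Kubernetes.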

Conclusions

Bringing Data Science projects to life is a hard endeavor. In this post we have summarized a collection of 9 recommendations that have proved to be effective in our professional life. In summary:

  1. Use templates to organize your projects meaningfully
  2. Use Version Control
  3. Know your Python!
  4. Make your code portable and reproducible
  5. Use notebooks judiciously
  6. Don’t trust your code, test it!
  7. Automate as much as possible
  8. Leverage Cloud Platforms
  9. Learn how to deploy models as micro-services

You can find practical applications of some of our recommendations in our training’s capstone project, which we have made publicly available.

If you are interested in learning more about the topics presented in this post, you can join us at our Machine Learning in Production training, February 28–29th and March 6–7th in Barcelona, Spain.

The Barcelona ML in Production team is a group of highly experienced Data Scientists based in Barcelona, focused on helping data practitioners become more effective at bringing Data Science and Machine Learning projects to life.

Bernat Garcia Larrosa, Aleix Ruiz de Villa Robert, Tristana Sondon, Arnau Tibau Puig

Webpage: https://mlinproduction.github.io/

Twitter: https://twitter.com/ds_in_prod
