Introducing Kedro: The open source library for production-ready Machine Learning code
Improved business performance is increasingly driven by Data Science and Machine Learning. For that reason, it is crucial that the code powering key business decisions is of production quality. Machine learning models which can be deployed effortlessly and operate unattended are far more likely to achieve commercial objectives. At QuantumBlack, we’ve always asserted that the only useful data science code is production-level.
Expectation Vs Reality
Every data scientist follows their own workflow when solving analytics problems. When working in teams, a common ground needs to be agreed on for efficient collaboration. However, distractions and shifting deadlines may introduce friction, ultimately resulting in incoherent processes and poor code quality. This can be alleviated by adopting an unbiased standard which captures industry best practices and conventions.
Moreover, while the background of today’s data scientists and data engineers has become more varied, many continue to have a grounding in mathematics, statistics and modelling, rather than software engineering. Code is seen as a tool to be employed when solving a problem, not the end goal. This often leads to little effort being spent applying software engineering principles dictating that production-ready code must be:
· Reproducible in order to be trusted
· Modular in order to be maintainable and extensible
· Monitored to make it easy to identify errors
· Tested to prevent failure in a production environment
· Well documented and easy to operate by non-experts
Code written during a pilot phase rarely meets these specifications and can sometimes require weeks of re-engineering work before it can be used in a production environment, causing project timelines to shift. QuantumBlack’s new tool, Kedro, has been designed to address these common missteps and guarantee production-ready data analytics code.
What is Kedro?
Kedro is a development workflow framework which aims to become the industry standard for developing production-ready code. Kedro helps structure your data pipeline using software engineering principles, eliminating project delays due to code rewrites and thereby providing more time to focus on building robust pipelines. Additionally, the framework provides a standardised approach to collaboration for teams building robust, scalable, deployable, reproducible and versioned data pipelines.
The features provided by Kedro include:
- A standard and easy-to-use project template, allowing your collaborators to spend less time understanding how you’ve set up your analytics project
- Data abstraction, managing how you load and save data so that you don’t have to worry about the reproducibility of your code in different environments
- Configuration management, helping you keep credentials out of your code base
- Support for test-driven development and industry-standard code quality, decreasing operational risks for businesses
- Modularity, allowing you to break large chunks of code into smaller self-contained and understandable logical units
- Pipeline visualisation, making it easy to see how your data pipeline is constructed
- Seamless packaging, allowing you to ship your projects to production, e.g. using Docker or Airflow
- Versioning for your datasets and machine learning models whenever your pipeline runs
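To make the ideas of modularity and data abstraction more concrete, here is a minimal sketch in plain Python. It is not Kedro's API: the names `Node`, `run_pipeline`, `drop_missing` and `mean`, and the dictionary used as a stand-in data catalog, are all hypothetical and chosen only for illustration. The sketch assumes each step is a small pure function that declares its inputs and output by name, so a runner can resolve the order of execution and the functions never need to know where the data lives.

```python
from typing import Any, Callable, Dict, List, NamedTuple


class Node(NamedTuple):
    """One self-contained step: a function plus named inputs and one named output."""
    func: Callable[..., Any]
    inputs: List[str]
    output: str


def run_pipeline(nodes: List[Node], catalog: Dict[str, Any]) -> Dict[str, Any]:
    """Run nodes in dependency order, reading from and writing to a copy of the catalog."""
    data = dict(catalog)  # copy so the caller's catalog is left untouched
    pending = list(nodes)
    while pending:
        # A node is ready once every dataset it names as input is available.
        ready = [n for n in pending if all(name in data for name in n.inputs)]
        if not ready:
            missing = sorted({i for n in pending for i in n.inputs if i not in data})
            raise ValueError(f"Unresolvable inputs: {missing}")
        for n in ready:
            data[n.output] = n.func(*(data[name] for name in n.inputs))
            pending.remove(n)
    return data


def drop_missing(rows):
    """Remove missing (None) readings."""
    return [r for r in rows if r is not None]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Hypothetical in-memory "catalog"; a real project would map names to files or tables.
catalog = {"raw_readings": [1.0, None, 3.0, 5.0]}

# Nodes are declared out of order on purpose: the runner sorts out dependencies.
pipeline = [
    Node(mean, ["clean_readings"], "average_reading"),
    Node(drop_missing, ["raw_readings"], "clean_readings"),
]

result = run_pipeline(pipeline, catalog)
print(result["average_reading"])  # 3.0
```

Because each function only sees the values passed to it, `drop_missing` and `mean` can be unit tested in isolation, and the same pipeline can be pointed at different data simply by swapping the catalog.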
Kedro is suitable for a wide range of applications, from single-user projects to enterprise-level software driving business decisions backed by machine learning models.
We have deployed Kedro internally in QuantumBlack and McKinsey & Company across more than 50 projects to date. Experienced Kedro users assert that the library has set a best practice standard for how teams should collaborate when writing code which is meant to be production-ready from the beginning, by delivering a disciplined starting point and workflow for analytics projects.
Kedro is the result of two years of work by a multi-disciplinary team of QuantumBlack colleagues. During this time it has revolutionised our workflows and, following a fantastic response from clients, we made the decision to convert it into an open source tool. This was supported by the recent formation of QuantumBlack Labs, our technical innovation group which aims to produce an ecosystem of products facilitating advanced analytics at scale.
We’re tremendously excited to see how the community responds to this release and how we can work together. QuantumBlack Labs is committed to improving and enriching the tool by continuously providing new functionality through frequent updates.