Keeping up with Kedro — the latest developments in our development workflow framework
Merel Theisen, tech lead for Kedro Framework
At QuantumBlack, we’ve always asserted that the only useful data science code is production-level. That’s why we developed Kedro, a development workflow framework which aims to become the industry standard for developing production-ready data science code.
Since its inception in 2019, Kedro has evolved to become easier to use. In the upcoming 0.19.0 release we will move datasets from the “extras” directory in Kedro into a separate package called Kedro-datasets. This change is one of the initiatives that will bring us closer to a stable release of Kedro 1.0.
For those new to Kedro, we have developed an introduction article that details all that you need to embrace the Python framework.
At the start of 2022 QuantumBlack donated Kedro to the Linux Foundation (LF), one of the world’s largest vendor-neutral consortiums for open-source projects.
Now, developers can explore what the team is working on, view the backlog or status of the current sprint and contribute according to the Kedro roadmap.
This new standardised contribution workflow means that anyone can join in Kedro’s continued development and eventually progress into becoming an official maintainer on the project, writing code to continually improve the central framework. In return for a weekly time commitment, maintainers will join Kedro’s Technical Steering Committee and help shape product strategy and roadmap decisions through regular voting.
Datasets moving into a separate package
With so many people around the world now using Kedro, our primary goal is to make it easy for users to keep their packages up to date. We want to remove sources of breaking changes and achieve a stable framework.
Datasets are Kedro’s way of dealing with input and output in a data and machine-learning pipeline. Kedro supports numerous datasets out of the box to allow you to process different data formats including Pandas, Plotly, Spark and more.
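To make the idea concrete, here is a rough standalone sketch of what "dealing with input and output" means: a dataset hides where and how data is stored behind a load and a save operation. This is not Kedro's API. The class name JSONFileDataset and its load/save methods are hypothetical and were chosen only for illustration, using the Python standard library.

```python
import json
from pathlib import Path
from typing import Any


class JSONFileDataset:
    """Hypothetical dataset: wraps a JSON file behind load() and save().

    Illustrative only; this is not a class shipped by Kedro.
    """

    def __init__(self, filepath: str) -> None:
        self._filepath = Path(filepath)

    def load(self) -> Any:
        # Read the file and return the decoded Python object.
        with self._filepath.open("r", encoding="utf-8") as f:
            return json.load(f)

    def save(self, data: Any) -> None:
        # Create parent directories if needed, then write the data as JSON.
        self._filepath.parent.mkdir(parents=True, exist_ok=True)
        with self._filepath.open("w", encoding="utf-8") as f:
            json.dump(data, f)
```

Pipeline code that receives such an object only calls load() and save(), so the storage format can change without touching the pipeline logic.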
In Kedro 0.19.0, we will move datasets into a separate package to reduce breaking changes associated with dataset dependencies.
In the past, Kedro users were impacted by datasets being part of the Kedro framework in several ways:
- Users had to wait until all of our dataset dependencies had released Python 3.9 and 3.10 support before the Kedro framework was compatible with Python 3.9 and 3.10.
- Users could not receive new dataset functionality frequently when it involved breaking changes, because the Kedro framework is versioned more conservatively: on average we ship a breaking release once a year, so users had to wait that long for such changes to the datasets.
The benefits to Kedro users of having datasets in a separate package include:
- Kedro becomes more modular, making it possible for users to upgrade only the Kedro-datasets dependency in production rather than modifying the entire template.
- Users can deploy newer datasets with older versions of Kedro and can even use datasets without Kedro.
Go with the flow
Once datasets are moved to the new repository and removed from the core Kedro package in the 0.19.0 release, users will need to make three changes:
- In addition to installing Kedro, users will need to install Kedro-datasets to make use of datasets. Specifically, to install a dataset they will need to run pip install kedro-datasets[SomeDataSet] instead of pip install kedro[SomeDataSet].
- Use import kedro_datasets instead of import kedro.extras.datasets to import the datasets package into code.
- Make any dataset updates and contributions to the new Kedro-datasets repository instead of the core Kedro one.
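The import change above can be applied mechanically across a codebase. The following is a minimal sketch of such a rewrite, assuming that sub-modules keep the same relative path under the new package (for example, that kedro.extras.datasets.pandas becomes kedro_datasets.pandas). The helper name migrate_source is hypothetical and is not part of Kedro.

```python
import re

NEW_PREFIX = "kedro_datasets"

# Match the old package path only as a complete dotted name, so that
# look-alikes such as "mykedro.extras.datasets" or
# "kedro.extras.datasets_old" are left untouched.
_OLD_PATH = re.compile(r"(?<![\w.])kedro\.extras\.datasets(?!\w)")


def migrate_source(source: str) -> str:
    """Return `source` with the old dataset package path replaced.

    Hypothetical helper: assumes sub-module paths are unchanged
    under the new package.
    """
    return _OLD_PATH.sub(NEW_PREFIX, source)


if __name__ == "__main__":
    old_line = "from kedro.extras.datasets.pandas import CSVDataSet"
    print(migrate_source(old_line))
```

Because it is a plain text substitution, the same function can also be run over configuration files that spell out the full dataset path. Review the resulting diff before committing it.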
Stages of roll-out
Moving kedro.extras.datasets from the framework into a separate package is a breaking change, so we will introduce it in stages by:
- Making a separate Kedro-datasets package and having Kedro instantiate datasets from kedro_datasets with higher priority than kedro.extras.datasets.
- Removing kedro.extras.datasets from the core Kedro package when we ship Kedro 0.19.0, early in 2023.
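The first stage, looking in the new package before falling back to the old one, can be pictured as a prioritised lookup. The sketch below is a simplified illustration under stated assumptions and is not Kedro's actual implementation: the function resolve_dataset_class and the DEFAULT_PREFIXES tuple are hypothetical names, and only importlib.import_module from the standard library is used.

```python
import importlib
from typing import Any, Sequence

# Hypothetical search order: the new package first, the old location second.
DEFAULT_PREFIXES = ("kedro_datasets.", "kedro.extras.datasets.")


def resolve_dataset_class(
    type_name: str, prefixes: Sequence[str] = DEFAULT_PREFIXES
) -> Any:
    """Return the first class found for `type_name`, trying prefixes in order.

    `type_name` is a dotted name such as "pandas.CSVDataSet".
    """
    for prefix in prefixes:
        module_path, _, class_name = (prefix + type_name).rpartition(".")
        if not module_path:
            continue  # no module part to import
        try:
            module = importlib.import_module(module_path)
            return getattr(module, class_name)
        except (ImportError, AttributeError):
            # Not available under this prefix; try the next one.
            # Note: this also swallows import errors caused by a missing
            # third-party dependency, which a real implementation may
            # want to report separately.
            continue
    raise ImportError(
        f"Could not find dataset type {type_name!r} under {list(prefixes)}"
    )
```

With this ordering, a project that has both packages installed picks up the class from the new package, while a project that has not yet installed it keeps working from the old location until that location is removed.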
Watch this recording of the October 2022 Kedro Showcase to learn more about the changes to datasets, new features in the 0.18.x releases, and community updates.
We are excited about the release of this structure in Kedro 0.19.0 in early 2023.
If you have questions or feedback, comment on our existing GitHub issue about this work or raise a new GitHub issue.