New features available with Kedro
We’ve added datasets and documentation enhancements to the recent 0.18.4 release of Kedro
Jo Stichbury, Technical Writer, QuantumBlack Labs
Kedro has passed some major milestones since its inception, from being open-sourced in 2019 to being donated to the Linux Foundation.
Kedro is constantly being developed and the latest release, made in December 2022, brings a raft of changes as the rest of this post describes.
The new release of Kedro (0.18.4) focuses on improving datasets to enhance input and output in a data and machine-learning pipeline.
Kedro datasets are used in combination with the Kedro Data Catalog, which is the registry of all the data sources a project can use. The catalog maps the names of node inputs and outputs to datasets, which are specialised classes for a range of data storage types. For example:
- Load a Spark DataFrame on S3
- Save an image created with Matplotlib on Google Cloud Storage
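The idea of a catalog that maps names to datasets can be illustrated with a toy sketch. This is not Kedro's implementation: `JSONFileDataSet` and `ToyCatalog` are hypothetical stand-ins written with the standard library only, and they use local JSON files rather than S3 or Google Cloud Storage. The point is that the node function works only with names and in-memory data, while the dataset classes own the reading and writing.

```python
import json
import tempfile
from pathlib import Path


class JSONFileDataSet:
    """Hypothetical stand-in for a dataset: knows how to load and save one file."""

    def __init__(self, filepath):
        self._filepath = Path(filepath)

    def load(self):
        with self._filepath.open() as f:
            return json.load(f)

    def save(self, data):
        with self._filepath.open("w") as f:
            json.dump(data, f)


class ToyCatalog:
    """Hypothetical registry that maps dataset names to dataset objects."""

    def __init__(self, datasets):
        self._datasets = dict(datasets)

    def load(self, name):
        return self._datasets[name].load()

    def save(self, name, data):
        self._datasets[name].save(data)


def double_scores(scores):
    """A 'node': a plain function that never touches the filesystem."""
    return {key: value * 2 for key, value in scores.items()}


# Register two named datasets backed by files in a temporary directory.
workdir = Path(tempfile.mkdtemp())
catalog = ToyCatalog(
    {
        "raw_scores": JSONFileDataSet(workdir / "raw.json"),
        "doubled_scores": JSONFileDataSet(workdir / "doubled.json"),
    }
)

# The pipeline refers to data by name; the catalog handles the I/O.
catalog.save("raw_scores", {"a": 1, "b": 2})
catalog.save("doubled_scores", double_scores(catalog.load("raw_scores")))
result = catalog.load("doubled_scores")
```

Swapping a storage backend in this sketch means changing only the catalog entry, not the node function, which is the separation that the Data Catalog is designed to provide.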
Kedro provides numerous built-in datasets for various file types and file systems, including Pandas, Spark, Dask, NetworkX, Pickle, and more, so you don't have to write the logic for reading or writing data yourself.
As we mentioned in “Keeping up with Kedro”, the upcoming Kedro 0.19.0 release (expected in early 2023) will move Kedro’s datasets from the extras directory to a separate package called Kedro-Datasets.
In preparation, within this recent release, we’ve added framework code that prioritises datasets from the kedro_datasets namespace over those in the existing extras directory (kedro_datasets is the namespace for the new package).
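One way to picture this kind of prioritisation is as an ordered list of package prefixes that are tried in turn when a dataset type name is resolved. The sketch below is an assumption about the general technique, not Kedro's actual code: `resolve_class` is a hypothetical helper, and the prefixes in the demonstration are standard-library stand-ins so that the example runs without Kedro installed. In a real project the list would presumably put the new package's prefix ahead of the old one.

```python
import importlib


def resolve_class(type_name, prefixes):
    """Return the first class found by trying each package prefix in order.

    Prefixes are expected to end with a dot, for example "json.".
    """
    for prefix in prefixes:
        module_path, _, class_name = (prefix + type_name).rpartition(".")
        try:
            module = importlib.import_module(module_path)
        except ImportError:
            # This package (or submodule) is not installed: try the next prefix.
            continue
        if hasattr(module, class_name):
            return getattr(module, class_name)
    raise LookupError(f"Class '{type_name}' not found under any of {prefixes}")


# The first prefix does not exist, so resolution falls back to the second.
fallback = resolve_class(
    "decoder.JSONDecoder", ("package_that_does_not_exist.", "json.")
)

# When the class exists under the first prefix, later prefixes are never used.
preferred = resolve_class("OrderedDict", ("collections.", "typing."))
```

Because earlier prefixes win, installing the new package is enough to switch over, and projects that have not installed it keep resolving to the old location.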
We’ve also added some datasets:
- svmlight.SVMLightDataSet, to work with svmlight/libsvm files using the scikit-learn library
- video.VideoDataSet, to read and write video files from a filesystem
- video.video_dataset.SequenceVideo, to create a video object from an iterable sequence to use with VideoDataSet
- video.video_dataset.GeneratorVideo, to create a video object from a generator to use with VideoDataSet
In addition, pandas.SQLQueryDataSet now takes the optional argument execution_options to reduce memory usage when dealing with large datasets.
Finally, we’ve updated the MatplotlibWriter dataset docs with working examples.
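For readers who have not met the svmlight/libsvm format mentioned above, each line holds a label followed by sparse `index:value` pairs, with an optional trailing `#` comment. The toy parser below only illustrates that layout. `parse_svmlight_line` is a hypothetical helper, it ignores extensions such as `qid:` fields, and it is no substitute for the scikit-learn functions `load_svmlight_file` and `dump_svmlight_file`, which handle the format in full.

```python
def parse_svmlight_line(line):
    """Parse one '<label> <index>:<value> ...' line into (label, {index: value})."""
    body = line.split("#", 1)[0].strip()  # drop any trailing comment
    label, *pairs = body.split()
    features = {}
    for pair in pairs:
        index, value = pair.split(":")
        features[int(index)] = float(value)
    return float(label), features


sample = """\
1 1:0.5 3:1.25
-1 2:2.0 # only feature 2 is non-zero
"""

# Only non-zero features are stored, which keeps sparse data compact.
rows = [parse_svmlight_line(line) for line in sample.splitlines() if line.strip()]
```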
To accelerate the process of getting Kedro up and running, we’ve made some changes to our documentation to improve it for new users.
We have revised the early sections of the documentation to simplify them and clarify the learning path. The spaceflights tutorial is now more straightforward, and we’ve moved advanced materials into more appropriate sections. We’ve improved the experience by streamlining the navigation between pages. The table of contents is now sticky, to make it easier to find your way around.
Contributions from the Kedro community
The release also includes some configuration improvements and numerous bug fixes and minor enhancements in response to reports from our users on Kedro’s Slack organisation. Take a look at the full release notes on GitHub for details. We’re proud of the fact that 14 of the PRs included in this release are contributions by members of Kedro’s open-source community. We’d particularly like to thank the following GitHub users:
jstammers, FlorianGD, yash6318, carlaprv, dinotuku, williamcaicedo, avan-sh, Kastakin, amaralbf, BSGalvan, levimjoseph, daniel-falk, clotildeguinard and picklejuicedev (for comments and input to documentation changes).
Our standardised contribution workflow means that anyone can join Kedro’s continued development and eventually progress into becoming an official maintainer on the project, writing code to improve the framework. In return for a weekly time commitment, maintainers may join Kedro’s Technical Steering Committee and help shape product strategy and roadmap decisions through regular voting.
Our community is thriving, as can be seen from the proliferation of third-party plugins that the Kedro community has recently created, including:
- kedro-kubeflow, kedro-airflow-k8s, kedro-vertexai, and kedro-azureml by GetInData
- kedro-neptune, by Jakub Czakon and Rafał Jankowski
- kedro-mlflow, by Yolan Honoré-Rougé and Takieddine Kadiri
For more insights into the Kedro community, check out this recording of the October 2022 Kedro Showcase, which includes more information about the changes to datasets, and new features in the 0.18.x releases, as well as community updates.
To ask us questions, meet the community and stay up to date with Kedro news, why not join our Slack organisation?