How Kedro is Leading the Data Science Framework Revolution

Waylon Walker explains the challenges data scientists face when their code moves into production, and how data science frameworks like Kedro are changing that.

Photo by Christina Morillo

Our conversation with Waylon Walker is the first part of an ongoing series about Kedro and how it is used around the world.

Almost two years ago, Waylon Walker created a framework to streamline the process of creating data and machine-learning pipelines. He did not set out to design a framework, but after working on many small-to-medium-sized data projects, one had taken shape. It had no name and no real users, but the need for software to support Waylon’s data science workflow was clear.

Waylon soon discovered Kedro, QuantumBlack’s first open-source product, and has since become a prolific member of the open-source Data Science community. The Python Developer, Engineer and Senior Data Scientist has been busy: in addition to contributing to feature development on Kedro, he has built plugins such as steel-toes, find-kedro and kedro-static-viz, all available on his GitHub profile. He follows a philosophy of learning in public and finds time between work, marriage and kids to maintain active profiles on DEV.to and Twitter.

We caught up with Waylon to talk about trends in the Data Science space, his philosophies on sharing knowledge and his experiences using Kedro.

The conversation has been edited for length and clarity.

What problems are Data Scientists and Data Engineers currently facing?

The biggest challenge that we have is that using a framework for data science is not yet the norm.

Let me clarify with an analogy to the state of web development before frameworks like React were released. Ten years ago, most people used jQuery, a library designed to make it much easier to use JavaScript on a website, and thought that frameworks were madness.

Look at where we are now; we have pivoted. At some point, frameworks became the norm, and now if you don’t have an npm run build step, newbies are entirely lost. We also soon discovered that if you don’t use a framework, you will always end up building one yourself to solve the same problems.

Another problem is that the scope of data science has increased. I am struggling to onboard new data scientists, and I feel that there is a sign that says “You must be this high to enter” (to steal a phrase from Anjana Vakil on the latest Party Corgi podcast). Most folks are quite familiar with wrangling DataFrames, but getting that work to production involves so many more steps than a few lines of pandas in a Jupyter notebook. The focus has been on Data Scientists being good at exploratory data science, while things like linting, formatting, type checking, documentation, CI/CD, Docker, AWS (and other cloud services), logging, monitoring and version control have been left out.

What are your impressions of Kedro, a framework designed to exist in this space?

I feel like I am able to do more complex machine learning projects faster because of Kedro.

In the past, I had built something that resembled what Kedro would look like if you built it in an afternoon. Moving to Kedro felt like going from a very primitive version of it to something well-designed by a whole team of people. I have been using Kedro for over a year now, and building pipelines has only gotten easier. New features arrive continuously, many of which I haven’t tried yet.

Let’s circle back to my comments about how data science frameworks are not yet the norm, because this affects Kedro. Data Scientists are still stuck in the Jupyter notebook era, and that creates challenges for collaborating on complex data science projects.

For data scientists not using Kedro, I can’t count how many times I have said: “Why haven’t you done this in Kedro?”, to which they reply: “Well, because it is just a Jupyter notebook and I didn’t need all of that”.

Then, as we problem-solve, they run into all sorts of issues that wouldn’t be there if they leveraged the framework to simplify their problems. Trying to run a whole project from a Jupyter notebook is hard, yet it is the norm for so many, especially when the line between an ad-hoc analysis and a full project is unclear at the start.

Kedro reduces the mental overhead required to construct and run a data and machine-learning pipeline. If we go back to where I said I made my own framework, essentially [laughs], it was all-or-nothing: you had to run the whole pipeline. I had to keep a mental model of every step in the pipeline just to run it; I did not have the flexibility of running just a specific part of it. That mental overhead takes up a lot of space and slows you down.

This is why I enjoy how Kedro allows you to jump into the middle of a pipeline, load a dataset and output a dataset. Kedro-Viz is an essential component that helps you visualize exactly the part you are collaborating on. It makes it easy to iterate on one little chunk of the pipeline without having to think about the entire project. It’s the same way that someone else on my team could say: “I need someone with expertise on this thing.” I don’t need to know about the whole project to help.
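To make that “run just a slice” idea concrete for readers who have not used Kedro, here is a minimal sketch of a pipeline with named nodes and catalogued datasets. The function and dataset names are invented for illustration, and the exact API differs slightly between Kedro versions:

# A hedged sketch of a small Kedro pipeline; the names are hypothetical.
import pandas as pd
from kedro.pipeline import Pipeline, node


def clean(raw: pd.DataFrame) -> pd.DataFrame:
    # Drop incomplete rows before any feature engineering.
    return raw.dropna()


def add_features(clean_df: pd.DataFrame) -> pd.DataFrame:
    # Derive a simple illustrative feature column.
    return clean_df.assign(total=clean_df.sum(axis=1, numeric_only=True))


data_engineering = Pipeline(
    [
        node(clean, inputs="raw_data", outputs="clean_data", name="clean"),
        node(add_features, inputs="clean_data", outputs="features", name="add_features"),
    ]
)

Because every node and dataset is named, you can run only part of the pipeline from the command line (for example with the --from-nodes and --to-nodes options of kedro run; flag spellings vary by version), and you can load an intermediate result such as clean_data in a Kedro-aware notebook session with catalog.load("clean_data"), which is the workflow Waylon describes.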

Kedro also helps with scaffolding the additional requirements that you might have for taking data science code into production.
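For readers less familiar with Kedro, the scaffolding Waylon mentions comes from the kedro new command, which generates a project template with configuration, documentation and a tests package already in place (the exact contents vary by Kedro version and starter). Because Kedro nodes are plain Python functions that leave loading and saving to the data catalog, they slot straight into that test scaffolding. A minimal sketch, with a hypothetical node function rather than one from Waylon’s projects:

# A hedged sketch of unit-testing a Kedro-style node; the node is hypothetical.
import pandas as pd


def drop_missing_rows(raw: pd.DataFrame) -> pd.DataFrame:
    # A typical node: a pure function with no I/O of its own, because
    # loading and saving are handled by the project's data catalog.
    return raw.dropna()


def test_drop_missing_rows():
    # In a real project this would live in the generated tests package
    # and run under pytest as part of CI.
    df = pd.DataFrame({"a": [1.0, None, 3.0]})
    assert drop_missing_rows(df)["a"].isna().sum() == 0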

What has it been like building a public profile as a blogger?

It is the “learn in public” idea, you know? I have that mentality of trying to learn all this stuff and share what I am learning. I often go back and grab snippets, or I get a lot of questions, and now I have a place to point people. I have learned so much from the community, and it brings me joy to be able to give back in my small way.

And that’s it for our conversation with Waylon Walker…

This conversation touched on the challenges of using Jupyter notebooks for involved, collaborative Data Science projects. You can watch Joel Grus’ perspective on why he doesn’t like Jupyter notebooks for more thoughts on this topic.


Follow Waylon on DEV.to or Twitter for more of his perspectives on Kedro and the Data Science community, and subscribe to Waylon’s newsletter.

Stay tuned for more posts in this series!

Authored by: Lais Carvalho — Developer Advocate, Jo Stichbury — Technical Writer, Yetunde Dada — Principal Product Manager

QuantumBlack, AI by McKinsey

We are the AI arm of McKinsey & Company. We are a global community of technical & business experts, and we thrive on using AI to tackle complex problems.