Becoming a Full-Stack Data Scientist

Interview with Alexey Grigorev on AI in Action

Alexey Grigorev
Data Science Insider
May 4, 2020


I had the great pleasure of talking to Anthony Kelly on the AI in Action podcast. Anthony asked me many interesting questions, so I decided to summarize our conversation in a blog post.

We covered many things, including:

  • The work I’m doing at OLX Group
  • What a lead data scientist does
  • How to become better at productionizing machine learning models
  • What a full-stack data scientist is
  • What motivated me to write books

Let’s start!

Welcome to the AI in Action podcast. I’m your host, Anthony Kelly, and today I’m delighted to have on the show Alexey Grigorev.

Alexey is a Lead Data Scientist with OLX. Alexey, welcome!

Hi, it’s a pleasure to be here.

Tell us a bit about OLX. What do you do there?

OLX is a platform for online classified advertisements. This is a place where you go if you want to sell or buy something.

We use machine learning extensively across many areas. One of the applications is moderation: we want to detect when somebody is trying to sell something they shouldn’t — like drugs or weapons — and stop that. To do it, we analyze the content of each listing and determine whether it’s safe to sell or not.

Many of our services rely on deep learning, but with a load of 10 million images per day, serving deep learning models is challenging. One of the projects I worked on was building an infrastructure for serving image models. It took some time to get it right, but now we can deploy a new model in a matter of days instead of months. You can read more about this project in our tech blog.

Infrastructure for serving machine learning models (source)

What sort of responsibilities does a lead data scientist have?

Companies define “lead” differently. At OLX, a lead is an expert, not a manager: this is a technical role above “senior”.

As a lead data scientist, I help teams that need help. There are people who have a lot of ideas, but they don’t always have enough expertise. I help them by suggesting the best way of achieving what they need.

I try to be on top of everything related to machine learning. If somebody is developing a service, I help them align with other teams and make sure we don’t reinvent the wheel if there’s an existing solution that solves a similar problem.

Infrastructure is another area where I assist. Training a model is usually not difficult for data scientists, but deploying it often is. I help by guiding teams and suggesting best practices. One of the projects I’m working on right now is standardizing the way we deploy ML models across the company.

What are the potential problems of bringing ML models to production?

The most important question we need to ask ourselves before starting an ML project is “do we actually need a model?”.

Often, we don’t: it may turn out that we can solve the problem with a simple heuristic. We should always start with a simple baseline and only then improve it with machine learning. I have made this mistake many times. Don’t reach for heavy machinery first. Instead, try simple things and iterate. If we start with a complex model, it will be too difficult to integrate.
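As a minimal sketch of what such a baseline could look like for the moderation example, here is a keyword heuristic together with a way to measure it. The keyword list, the toy listings, and the function names are all made up for illustration; this is not how moderation works at OLX.

```python
# Hypothetical keyword list, for illustration only.
BLOCKED_KEYWORDS = {"pistol", "rifle", "ammunition"}


def is_suspicious(listing_text: str) -> bool:
    """Flag a listing if any of its words is a blocked keyword."""
    words = listing_text.lower().split()
    return any(word.strip(".,!?") in BLOCKED_KEYWORDS for word in words)


def accuracy(predictions, labels) -> float:
    """Fraction of predictions that match the labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Toy labeled data (made up): (listing text, should it be blocked?)
LISTINGS = [
    ("Vintage bicycle, good condition", False),
    ("Hunting rifle for sale", True),
    ("Kids water pistol toy", False),  # the heuristic wrongly flags this one
    ("Sofa, barely used", False),
]

predictions = [is_suspicious(text) for text, _ in LISTINGS]
labels = [label for _, label in LISTINGS]
baseline_accuracy = accuracy(predictions, labels)
print(f"Baseline accuracy: {baseline_accuracy:.2f}")
```

A number like this gives a model something to beat: if a trained model does not clearly outperform the heuristic, the extra complexity is probably not worth integrating.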

Take end-to-end ownership: be ready to get your hands dirty (Photo by Alex Jones on Unsplash)

Another problem is ownership. Some data scientists don’t want to write code. They say “I didn’t spend so many years writing my PhD to end up doing software development”. If a data scientist doesn’t take care of a model, then somebody else has to. When that falls to software engineers, they usually have many other priorities, so by the time they get a chance to productionize the model, it may be too late.

To develop successful machine learning products, we should take end-to-end ownership of our models.

How can data scientists become better at productionizing machine learning models?

Attitude matters. Never say “not my job”: be prepared to go the extra mile, deploy a model, and learn from the experience.

Knowing the tools needed for deploying models is beneficial. At OLX, we use AWS, Kubernetes, and Terraform. Being able to use these tools makes you more independent and lets you move faster. At other companies, the tools might be different, but knowing them will still help you deliver faster.

Knowing infrastructure tools is helpful to be more independent (image sources: AWS logo, Kubernetes logo, Terraform logo)

What do you think about full-stack data scientists?

A full-stack data scientist can do a machine learning project end-to-end.

A typical project involves many steps. First, we need to understand the problem and try to frame it in terms of machine learning. Next, we need to know what data is available, and if some data is not available, we need to acquire or produce it. Then, we prepare the data — and only after that, we can use it to train a model. After training, we evaluate the model and then deploy it.

A machine learning project consists of six steps: from business understanding to deployment (source)
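To make these steps concrete, here is a toy sketch that walks through them in code. Everything in it is an assumption made for illustration: the data is invented, the "model" is a single price threshold, and "deployment" is just a function that returns predictions.

```python
def load_data():
    # Data understanding: hypothetical (price, is_suspicious) pairs.
    # None marks a listing with a missing price.
    return [(5, 1), (8, 1), (None, 0), (120, 0), (300, 0), (10, 1), (150, 0)]


def prepare(rows):
    # Data preparation: drop rows with a missing feature.
    return [(x, y) for x, y in rows if x is not None]


def evaluate(threshold, rows):
    # Evaluation: accuracy of the rule "suspicious if price < threshold".
    return sum((x < threshold) == bool(y) for x, y in rows) / len(rows)


def train(rows):
    # Modeling: pick the observed price that works best as a threshold.
    candidates = sorted(x for x, _ in rows)
    return max(candidates, key=lambda t: evaluate(t, rows))


def make_predictor(threshold):
    # Deployment stand-in: in a real project this would be a service.
    return lambda price: int(price < threshold)


rows = prepare(load_data())
train_rows, test_rows = rows[:4], rows[4:]  # simple hold-out split
threshold = train(train_rows)
test_accuracy = evaluate(threshold, test_rows)
predict = make_predictor(threshold)
print(threshold, test_accuracy, predict(20), predict(200))
```

Framing the business problem, the first step, happens before any code is written, which is why it does not appear in the sketch.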

Each of these steps needs a different set of skills, and often a different person is involved. First, a product manager frames the problem. Then a data engineer prepares a dataset. After that, a data scientist trains a model, and finally, a machine learning engineer deploys it.

It’s quite difficult to find a single person that can do all the steps. Yet, I think such people exist — and we can call them “full-stack data scientists”.

They don’t have to be excellent in every area, but they should have a T-shaped profile: deep expertise in one area and a reasonably good level in the others. For example, expert-level knowledge of machine learning and good knowledge of the other topics. With this profile, they can do a project on their own if needed, but they can also work well as part of a team.

T-shaped profile: deep expertise in one area (machine learning) and a good level in others

You have written a couple of books. What inspired you to write them?

Mastering Java for Data Science was the first book I wrote. Prior to being a data scientist, I was a Java developer, so I thought the intersection of Java and data science was a good niche area. There were very few available resources about this topic, but since I had both skills, I decided to fill this gap and write a book about it.

Unfortunately, not many people wanted to use Java for data science: today, Python dominates the world of machine learning. The main reasons for Python’s popularity are its ecosystem and interactivity: it had tools like NumPy and IPython long before data science became popular. Other languages didn’t have that. Because of that, my book wasn’t in high demand.

That’s why I decided to write Machine Learning Bookcamp — this time, using Python.

Machine Learning Bookcamp: learn machine learning by doing projects

Machine Learning Bookcamp is about learning by doing projects. We start with a problem, and the book guides you through solving it. This is different from traditional machine learning courses, where you first learn the theory and only then get to see some applications.

There are 10 chapters planned and right now I’m working on the 5th. It should be ready by the end of 2020.

Thanks, Alexey, that’s all for today. Great to have you on the show.

Thank you for inviting me!

Thank you for reading. If you liked the article, you can listen to the entire episode on the AI in Action podcast page.

Follow me on Twitter (@Al_Grigor) for updates!
