What They Don’t Teach You in Machine Learning Courses

Maksim Butsenko, Data Scientist at Bolt

Dec 21, 2018

Data science is an integral part of building an efficient ride-hailing platform. At Taxify, it took us just one year to build a strong, agile data science function that works on state-of-the-art solutions and optimises millions of rides happening in real time.

While interviewing hundreds of candidates, we've realised that even those with a strong technical background very often lack some essential skills. In this article, we talk about the things they don't teach you in Machine Learning courses.

Defining the role of a data scientist

Asko Seeba explains very nicely how a business might look at a data science project, and argues that it is mainly a research project and should be treated as such. If even people in the industry are still figuring out how to utilise a data scientist's expertise most efficiently, how should new joiners know which skills to focus on?

Building a team

But who exactly are we looking for? Proper academic degree programs in data science are only just emerging, and the industry itself is not yet sure how to define a neatly framed profile of a data scientist.

As a matter of fact, the pool of data scientists currently consists of individuals with various backgrounds. Our team includes people with backgrounds in computer science and AI, but also people who come from signal processing, econometrics, chemistry, complex systems, sociology and so on. Our common denominator is usually a good understanding of the scientific method and the design of experiments; technical skills are much more straightforward to acquire. However, because we come from different fields, our understanding of the processes around delivering data-based products can differ. It takes some effort to integrate all these experiences and deliver strong team results, and here's how we handle it.

Data Science Excellence

The reason I describe our Data Science Excellence guide here is that it provides useful insight into what it takes to deliver a data product. While most of this is self-evident to an experienced data scientist, you won't learn it from ML courses or books, so it is useful for anyone starting their career or moving into data science from another field. While interviewing candidates and reviewing their test assignments, we constantly see beginners who have strong technical skills and a sufficient understanding of the ML pipeline, yet fail at asking the right question or at testing their model in production. Our Data Science Excellence guide is here to help.

Problem statement

Goals and metrics

Before building anything, we ask:

  • What is the actual value of your model for the product and the people using it?
  • What is the impact? Does it affect 10% of your user base or 40%?
  • How do you plan to measure the efficiency of the model on the product level?

Collecting and measuring all the KPIs and metrics you can think of is good practice: a positive impact on your main objective might come with a negative impact in other domains.
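
As an illustration, here is a minimal sketch (in Python, with a hypothetical pandas DataFrame and column names, not our actual schema) of tracking the primary objective alongside a few guardrail metrics:

    import pandas as pd

    def kpi_report(rides: pd.DataFrame) -> pd.Series:
        # Primary objective: accuracy of the ETA prediction, in seconds.
        # Guardrails: metrics a better ETA model should not make worse.
        return pd.Series({
            "eta_mae_s": (rides["eta_predicted_s"] - rides["eta_actual_s"]).abs().mean(),
            "cancellation_rate": rides["cancelled"].mean(),  # guardrail
            "median_wait_s": rides["wait_s"].median(),       # guardrail
        })

Computing the guardrails in the same report as the main objective makes the trade-offs visible instead of leaving them to be discovered later.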

Timeboxing

Research tasks can expand indefinitely, so it is easy to keep polishing a model past the point of diminishing returns. For example, predicting how long it takes a driver to reach a rider (the estimated time of arrival, or ETA) is a crucial element of our service. After delivering a successful ETA prediction model that considerably improved the mean absolute prediction error compared to the existing solution, we had to ask ourselves whether it was reasonable to spend the effort now on reducing the error by a few more percent, or to come back to it in a few iterations. We knew it would require considerable engineering effort, and the size of the possible improvement was impossible to estimate ahead of time.
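
For a concrete picture of the comparison behind such a decision, here is a small sketch with made-up numbers standing in for real hold-out predictions:

    import numpy as np

    def mae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
        # Mean absolute error, here in seconds of ETA.
        return float(np.mean(np.abs(y_true - y_pred)))

    # Made-up hold-out data: actual arrival times and two sets of predictions.
    y_true = np.array([240.0, 410.0, 180.0, 600.0, 320.0])
    baseline = np.array([300.0, 360.0, 240.0, 520.0, 280.0])
    candidate = np.array([270.0, 380.0, 210.0, 560.0, 300.0])

    improvement = 1 - mae(y_true, candidate) / mae(y_true, baseline)
    print(f"Relative MAE improvement: {improvement:.1%}")  # ~48% on this toy data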

Considering this, we strive to timebox our efforts, whether for exploratory analysis or for optimising an existing model. We set a limited amount of time for a particular task and try to deliver results within that time slot, even if it means not choosing the fanciest model available or omitting some interesting feature engineering ideas. Timeboxing is also useful in the ideation stage: for example, to evaluate possible areas to work on, we spend one day per idea in a quick pair-hackathon mode to understand the possible outcomes of that path, how quickly we could get to a deployable model, and how much it would improve a simple baseline.

Tooling

Whenever a piece of analysis code proves useful beyond a single project, we invest in making it reusable. For example, a method for comparing two heatmaps of geospatial data might be relevant to other people's analyses as well, so it makes sense to spend some time generalising the function and making it part of our internal data-stack library. This ensures we move faster as a team. The team as a whole also benefits when code and notebooks are structured in a readable, easily reusable way.
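
As an illustration, a minimal sketch of what such a generalised helper could look like; the function name and interface are hypothetical, not our actual library:

    import numpy as np

    def heatmap_diff(lat_a, lon_a, lat_b, lon_b, bins=50):
        # Bin both point sets onto a shared lat/lon grid so the two
        # heatmaps are directly comparable, then return the difference
        # of the normalised densities (positive where A is denser).
        lat_edges = np.linspace(min(lat_a.min(), lat_b.min()),
                                max(lat_a.max(), lat_b.max()), bins + 1)
        lon_edges = np.linspace(min(lon_a.min(), lon_b.min()),
                                max(lon_a.max(), lon_b.max()), bins + 1)
        h_a, _, _ = np.histogram2d(lat_a, lon_a, bins=[lat_edges, lon_edges])
        h_b, _, _ = np.histogram2d(lat_b, lon_b, bins=[lat_edges, lon_edges])
        # Normalising by total counts keeps differently sized datasets comparable.
        return h_a / h_a.sum() - h_b / h_b.sum()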

In general we aim to give our scientists and engineers the best tools available. We build what we must and buy what we can.

Code review

Reviewing data science code is not the same as reviewing ordinary software. For example, in tasks typical for our field you might want to discuss questions such as "How do you define demand from the customers?" or "Why were missing ride price fields imputed with the average price?". This means that a software developer without a good understanding of the data science process cannot fully evaluate the code or notice mistakes in data-related assumptions. Therefore, it makes sense either to split the code review into separate stages (software/model) or to use reviewers knowledgeable in both areas. For a great overview of software development for data scientists, I recommend this blog post by Trey Causey.
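
To make the imputation example concrete, here is a hedged sketch of the kind of modelling choice a reviewer should question (the DataFrame and column names are hypothetical):

    import pandas as pd

    def impute_price(rides: pd.DataFrame) -> pd.DataFrame:
        out = rides.copy()
        # Mean imputation is the simplest option, but ride prices are
        # typically right-skewed, so a reviewer might push for the median,
        # or for a group-wise fill (e.g. per city or ride category) instead.
        out["price"] = out["price"].fillna(out["price"].mean())
        return out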

Code review is also a very good way to improve knowledge sharing inside the team, especially when team members are working on separate projects.

A/B testing

Offline metrics alone don't tell you how a feature performs in the real world. A/B testing, if conducted properly, can reliably measure its impact. By A/B testing we mean an experimentation setup with randomised assignment into control and treatment groups. However, some experiments influence the whole city at once (for example, improving the dispatching algorithm), making a proper A/B test impossible; for these cases, our simulation engine comes in handy. Running several experiments at once is another complexity to account for.

The solution is to build a sophisticated A/B experimentation engine that tracks all the experiments, handles randomised allocation into test and control groups, collects observational statistics, and calculates the corresponding p-values. Nevertheless, even the most sophisticated engine cannot account for errors made in the setup of the actual test, so best practices have to be shared inside the company.
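
For illustration, a minimal sketch of two of those building blocks, deterministic group allocation and a significance test; the hashing scheme and the metric values are assumptions for the example, not our engine's actual implementation:

    import hashlib
    import numpy as np
    from scipy import stats

    def assign_group(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
        # Hashing the user id together with the experiment name keeps each
        # user in one group for the whole test, while keeping allocations
        # of concurrent experiments independent of each other.
        digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
        bucket = int(digest, 16) % 10_000 / 10_000
        return "treatment" if bucket < treatment_share else "control"

    # Made-up per-user metric values collected during the experiment.
    rng = np.random.default_rng(42)
    control = rng.normal(300, 40, size=1_000)
    treatment = rng.normal(295, 40, size=1_000)
    _, p_value = stats.ttest_ind(treatment, control)
    print(f"p-value: {p_value:.3f}")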

Visibility and communication

Keeping everyone's work visible across the team has clear benefits:

  • Everyone knows what others are doing and whom to ask for advice or collaboration.
  • It reduces the probability of double work, with several people unknowingly solving the same task.
  • It provides feedback from others on your work (especially important for models closely tied to non-technical domains, where operations teams have much more domain expertise).
  • It may sound counterintuitive, but communication visibility reduces the overhead of explaining the same things several times, and it helps reduce miscommunication.

It was important for us to establish a constant flow of tracking and sharing team progress across several channels, so we explicitly defined good practices for sharing the results of your work:

  • Slack: regular updates in the corresponding channels about the state of the project.
  • Weekly and monthly meetings: discussing and prioritising our work inside the team as well as with stakeholders, and re-evaluating priorities continuously, since even a weekly cycle can be too slow.
  • Research Notes: each data scientist tracks the state of the project, important findings and plans, mostly for themselves. This is a good place to gather the main findings, which are then easy to share in other channels, slides, meetings and so on.

When it comes to visibility in the team, less is definitely not more. Oversharing is usually not a big issue, whereas not sharing enough can seriously hinder your team's progress.

Storytelling

Most of what you are trying to tell your audience should be self-evident from the plot itself, conveyed by the choice of plot type, colours, legend and axis labels. If you feel that you need to work on your storytelling, check this guide from AnalyticsVidhya and see some great examples from FlowingData here.
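
As a small illustration of that idea, here is a sketch (with made-up data) where the plot carries the message on its own: the title states the takeaway, the axes are labelled with units, and the region that matters is highlighted:

    import matplotlib.pyplot as plt
    import numpy as np

    hours = np.arange(24)
    # Made-up hourly ETA error with a single morning peak.
    eta_error = 60 + 25 * np.cos((hours - 8) / 24 * 2 * np.pi)

    fig, ax = plt.subplots(figsize=(8, 4))
    ax.plot(hours, eta_error, color="tab:blue")
    ax.axvspan(7, 9, alpha=0.2, color="tab:orange", label="morning peak")
    ax.set_xlabel("Hour of day")
    ax.set_ylabel("Mean absolute ETA error (s)")
    ax.set_title("ETA error peaks during the morning rush hour")  # title = takeaway
    ax.legend()
    plt.show()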

Conclusion

At Taxify, we’ve made this process more transparent, unified and efficient by having an initiative we call Data Science Excellence — it helps to build our work around the best practices established by the team. In addition, sharing the best practices with the new data science team members is beneficial both to the company and to the new joiners. Finally, we hope that these practices can be useful for anyone starting their path in data science.

Do you have your own version of a Data Science Excellence project in your company? What best practices do you believe are worth sharing? Let us know in the comments, or come talk to us at the North Star AI conference. It would be awesome to know what you think.
