101 advice for data scientists

Johann D Harnoss | Imagine
Published in Imagine · 4 min read · Nov 7, 2018

What languages, frameworks & tech do you need to succeed in Germany as a young data scientist?

A conversation with data scientist Mohamed from careem.com. Hosted by Egyptian Techies in Berlin & Imagine Foundation e.V.

Johann: Hi, Mohamed. Great to have you here. Who are you, and what do you do?

Mohamed: Hi! I’ve been working in Germany for two years now. I spent my first year at GoEuro as a data scientist working on ranking search results. For the past year, I’ve been with Careem.com, a ride-sharing company, where I’m mostly involved with dynamic pricing and machine learning.

Johann: [Joking] So, you’re increasing the price of my rides?

Mohamed: Well, I am trying to make your ride available. High prices are better than “not available”, don’t you think?
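The trade-off Mohamed describes can be sketched in a few lines: scale the price with the demand/supply ratio so a ride is always offered, just at a higher price when cars are scarce. This is purely illustrative; the function name, the cap, and the formula are assumptions for the sketch, not Careem’s actual pricing model.

```python
def surge_multiplier(demand, supply, base=1.0, cap=3.0):
    """Scale the price with the demand/supply ratio, capped at `cap`.

    Illustrative only: a real dynamic-pricing model would be learned
    from data, but the idea is the same -- raise the price instead of
    returning "not available".
    """
    if supply <= 0:
        return cap  # no cars nearby: charge the cap rather than refuse the ride
    ratio = demand / supply
    # Never drop below the base fare; never exceed the cap.
    return min(cap, max(base, base * ratio))

# 150 ride requests chasing 100 cars -> price rises, but a ride is available.
print(surge_multiplier(demand=150, supply=100))  # 1.5
# Excess supply -> the multiplier stays at the base fare.
print(surge_multiplier(demand=50, supply=100))   # 1.0
```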

Johann: Of course, cool. Can you also comment on the key languages, technologies and frameworks that you are using at Careem as a data scientist?

Mohamed: Yeah, so mainly we are using Jupyter Notebook. Let’s focus first on targets. The target is to build a model and analyze the data.

The most important part is your communication ability.

So you create a model, and then you have to explain to stakeholders like product managers and engineers why we should use this model, what trends appear in the data, how the price will increase or decrease, and how this affects the customer. All these analyses have to be embedded in the code.

We use Python or R as the main programming language for machine learning. In addition, we use Apache Spark for data retrieval and for scaling the machine learning models.

So let’s say the first part is creating the model using Python / R, and the second part is the scaling, where we use Apache Spark, which is written in Scala but also has a nice Python interface.

To sum up, for the programming and engineering part, the skills are Python, R, SQL, Scala, and Apache Spark.

Regarding the machine learning part, I think you need good knowledge of different classification techniques and statistical analysis: broad knowledge rather than deep mathematical knowledge.
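As an example of the kind of broad, practical classification knowledge Mohamed means, here is a minimal nearest-centroid classifier in plain Python. It is a sketch for illustration, not code from Careem; the function names and the toy data are made up.

```python
def fit_centroids(X, y):
    """Compute one centroid per class by averaging its feature vectors."""
    sums, counts = {}, {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
        counts[yi] = counts.get(yi, 0) + 1
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}

def predict(centroids, x):
    """Assign x to the class whose centroid is nearest (squared Euclidean)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(centroids[c], x))
    return min(centroids, key=sq_dist)

# Toy data: two well-separated clusters of ride demand.
X = [[0, 0], [1, 0], [9, 9], [10, 8]]
y = ["low", "low", "high", "high"]
centroids = fit_centroids(X, y)
print(predict(centroids, [0.5, 0.2]))  # low
print(predict(centroids, [9, 8]))      # high
```

In practice you would reach for a library such as scikit-learn, but knowing what the algorithm does underneath is exactly the broad knowledge being described.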

Johann: Thanks for your super detailed answer. Next question. There is a lot of confusion out there in the world about the role of the data scientist. How much modeling do you do, how much data engineering do you do, and what are the typical data science roles at Careem?

Mohamed: Yeah, that’s a very good question. Actually I don’t have the perfect answer, but I will answer from my perspective. Data science is based on statistical analysis and machine learning, which are quite old fields, but because of the booming availability of data, the field itself is booming.

So data science is split into two roles, and every company draws the line between machine learning and product differently. Data science / analysis is focused more on analyzing, and typically you work in a cross-functional team. For example, on a pricing team, the analysts study what happens if we increase the price: how would this affect demand, how are we going to personalize pricing, how do we customize the solution for each cluster of customers. So let’s say this first part focuses on analyzing the data, presenting the conclusions, and supporting statistical consistency, etc.
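The "what if we increase the price" analysis above can be sketched with a constant-elasticity model: demand changes by a fixed percentage for every percent of price change. The function name and the elasticity value of -1.2 are assumptions for illustration, not figures from Careem.

```python
def projected_demand(current_demand, price_change_pct, elasticity=-1.2):
    """Estimate demand after a price change under constant elasticity.

    `elasticity` is the % change in demand per 1% change in price
    (negative: demand falls when price rises). The -1.2 value is
    illustrative; an analyst would estimate it from historical data.
    """
    demand_change_pct = elasticity * price_change_pct
    return current_demand * (1 + demand_change_pct / 100)

# What if we raise prices by 10%? With elasticity -1.2, demand drops ~12%.
print(projected_demand(current_demand=1000, price_change_pct=10))  # 880.0
```

A real analysis would estimate separate elasticities per customer cluster, which is exactly the personalization Mohamed mentions.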

And the other part is data science / machine learning, which works more closely with back-end engineers: after all the analysis, we come up with an algorithm, and then we focus on the scalability of the machine learning algorithms, how we collect data, and how the processes are automated. These are the two main roles, plus the data engineer, who provides the muscle to scale the algorithm to millions or trillions of records, so that, for example, when you search for something, you instantly see the answer related to your search.

Johann: Thanks. What recommendation do you have for people who want to understand the frontier of applied data science?

Mohamed: My recommendation is something that I did myself: side projects. Technologies like Apache Spark are not that mature. They have a lot of gaps where you can contribute code, and there are a lot of open issues where you can help fix a problem.

From a data science perspective, I think Kaggle is a great resource where you can get a flavour of machine learning.

There are competitions issued by companies: they post a fraction of their data set and invite people to compete.

So I think side projects are a good tool for increasing your level from junior to senior in the minimal amount of time.

Johann: Mohamed, thanks for your time and detailed advice. Thank you very much.

— Your friends at Imagine. Apply now.

A special thanks to our partner Egyptian Techies in Berlin for their contributions to this article.

This post is part of a longer series. For more visit us here: https://medium.com/imagine-foundation


Johann D Harnoss | Imagine

PhD @SorbonneParis1, MPA/ID @Harvard, @celtics fan. Economic migrant.