Data Science Skills That Will Not Get You a Job But Will Get the Job Done

Piero Cinquegrana
Published in motive-eng
6 min read · Jul 18, 2019

The worst-kept secret in the data science profession

1. Intro

At KeepTruckin, Machine Learning (ML) is front and center in our product offering. We have tons and tons of sensor data streaming in from the Electronic Logging Devices (ELDs) installed across our large fleet of vehicles. The abundance of data means that our customers demand an ever-increasing array of products and services powered by ML.

One such project I worked on is the prediction of dwell time at logistics facilities (henceforth “Facility Insights”), so drivers and fleet dispatchers can more accurately plan when to arrive at the facility for loading or unloading the goods they are transporting.

During this project, I spent about 90% of my time on tasks such as data processing and extraction, optimizing the runtime and memory consumption of the codebase, mapping and converting data types when moving from one database to the next, or spinning up Amazon Web Services (AWS) EC2 virtual machines.

In our collective imagination, these tasks are more closely associated with data engineers and machine learning engineers, the former specializing in data preparation and the latter in the deployment of ML applications. In reality, very few companies can afford to have such specialized roles.

This is the worst-kept secret in the data science industry: the skills above are rarely tested during the interview process but are invaluable to get the job done. I will review the various aspects of the job that data scientists rarely talk about.

2. Cloud

If your company or prospective employer uses cloud services, or is in the process of adopting or migrating to the cloud, knowing how to use cloud infrastructure is of paramount importance. At KeepTruckin, our cloud provider of choice is AWS, but similar concepts apply to Microsoft Azure or Google Cloud Platform. While working on Facility Insights, I interacted heavily with S3, where all of our data lives. I had to list files in specific S3 locations, delete them, or move them around. Knowing how to use the AWS CLI is very useful, for instance to list the contents of a bucket:

aws s3 ls s3://<your-bucket>
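
Moving and deleting objects follows the same pattern. For example (the bucket and key names below are placeholders):

aws s3 mv s3://<your-bucket>/<old-prefix>/file.csv s3://<your-bucket>/<new-prefix>/file.csv
aws s3 rm s3://<your-bucket>/<old-prefix>/file.csv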

You can even spin up an EC2 box programmatically (more on that below). It is also very important to know how to gain remote command-line access to an EC2 machine via ssh. For some large jobs that I could not execute locally, I used an EC2 instance:

ssh -i <your-pem-file> ec2-user@<ip-address-of-ec2-machine>

Of course, the instance's security group needs to allow inbound traffic on port 22 before you can ssh into it. You have to be familiar with such concepts when interacting with remote cloud machines.
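
As for spinning up an EC2 box programmatically, a minimal sketch with boto3 might look like the following; the AMI, instance type, key pair, and security group are placeholders you would swap for your own:

import boto3

# Launch a single EC2 instance; every identifier below is a placeholder
ec2 = boto3.client('ec2', region_name='us-east-1')
response = ec2.run_instances(
    ImageId='<your-ami-id>',
    InstanceType='m5.xlarge',
    MinCount=1,
    MaxCount=1,
    KeyName='<your-key-pair>',                   # matches the .pem file you ssh with
    SecurityGroupIds=['<sg-with-port-22-open>'],
)
print(response['Instances'][0]['InstanceId'])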

3. Big Data: Apache Spark

KeepTruckin is part of the movement towards the Internet of Things (IoT). Our sensors located inside our fleet of vehicles produce data around fuel consumption, GPS location, revolutions per minute (RPMs), tire pressure and other important indicators that help us build ML products to increase the efficiency and safety of the logistics industry. All of this means the volume and velocity of data are very large and we need big data engines such as Apache Spark to process data at scale.

Apache Spark has a steep learning curve, with its complex master/worker architecture, its thousands of knobs to configure (e.g. spark.driver.memory, spark.executor.memory), and its very hairy deployment process.
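
Those memory settings are typically supplied when the job is submitted. For example, sizing the driver and executors explicitly (the memory values and script name here are purely illustrative):

spark-submit \
    --driver-memory 8g \
    --executor-memory 16g \
    your_etl_job.py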

The first job I wrote at KeepTruckin was about 10 lines of Apache Spark code to convert gzip-compressed text files into structured Parquet files.

df = spark.read \
    .format('csv') \
    .option('delimiter', ',') \
    .option('quote', '"') \
    .option('header', 'true') \
    .schema(schema) \
    .load(<s3-location-of-your-files>)
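
The schema variable above is an ordinary PySpark StructType. A minimal sketch, with made-up column names (the real schema has many more fields), might look like this:

from pyspark.sql.types import StructType, StructField, StringType, TimestampType, DoubleType

# Hypothetical columns for illustration only
schema = StructType([
    StructField('vehicle_id', StringType(), True),
    StructField('recorded_at', TimestampType(), True),
    StructField('latitude', DoubleType(), True),
    StructField('longitude', DoubleType(), True),
])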

The read above ingested ~10TB of data in one go; after some manipulation, the job then wrote partitioned Parquet files:

df.write \
    .format('parquet') \
    .partitionBy(<partition-columns>) \
    .save(<s3-write-path>)
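
The manipulation in between was mostly column cleanup plus deriving the partition columns. For instance, a date partition column can be derived from a timestamp like this (the column names are hypothetical):

from pyspark.sql.functions import to_date, col

# Derive a 'date' column to partition by; 'recorded_at' is a made-up column name
df = df.withColumn('date', to_date(col('recorded_at')))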

4. Linux/Docker/Kubernetes

For reasons of space, I will lump Linux, Docker, and Kubernetes into a single category, even though that does not do justice to these three very different components of a tech company's stack. After working on the extract, transform, and load (ETL) pipelines in Spark, my data was ready and I moved on to the ML model to predict dwell time. I used pandas and scikit-learn, traditionally the workhorses of data science interviews. The total time spent on these two technologies for the entire project was maybe 10–15%.
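
For context, the training step itself was nothing exotic. A minimal sketch of what such a job might look like, with hypothetical feature names and paths, is:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Hypothetical features, target, and path for illustration only
df = pd.read_parquet('<path-to-feature-file>.parquet')
features = ['hour_of_day', 'day_of_week', 'historical_median_dwell', 'facility_visit_count']
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df['dwell_time_minutes'], test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print('MAE (minutes):', mean_absolute_error(y_test, model.predict(X_test)))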

After my model was ready and validated, I had to deploy the job to Kubernetes (K8S). K8S is particularly suited for stateless jobs like this one, where I had to read the features produced by the Spark jobs, train the model, and write the predictions to our application DB to surface the insights in a customer-facing application.

To deploy in K8S, I had to build my Docker image with the latest code and push it to AWS Elastic Container Registry (ECR). Here are some sample commands:

docker build -t ds-python-3-7 -f Dockerfile --no-cache .

The above command builds an image called ds-python-3-7 using the file Dockerfile, which contains the instructions for how to build the image.

After tagging your image with your AWS ECR repo, you can go ahead and push it:

docker tag ds-python-3-7 <your-ecr-repo>.dkr.ecr.us-east-1.amazonaws.com/ds-python-3-7:latest

docker push <your-ecr-repo>.dkr.ecr.us-east-1.amazonaws.com/ds-python-3-7:latest
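
Note that the push only succeeds after Docker has authenticated against ECR; on a recent version of the AWS CLI, that looks something like:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin <your-ecr-repo>.dkr.ecr.us-east-1.amazonaws.com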

K8S will create a pod with the image specified and you can list the pods this way:

kubectl get pods

During the pod run or after completion, you can retrieve the logs:

kubectl logs <name-of-the-pod>
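
If a pod misbehaves, kubectl describe is also handy; it surfaces events such as image pull failures or out-of-memory kills:

kubectl describe pod <name-of-the-pod>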

5. Application DB: PostgreSQL

For backend processing, Spark and S3 provide cost-effectiveness and scalability, but for our front-end application KeepTruckin needs transactions and millisecond latency. Traditional relational database technologies provide advanced support for inserts, deletes, and updates, much better latency for concurrent queries, and support for replicas, federated queries, and much more.

Our technology of choice is PostgreSQL. After writing the Spark jobs and running pandas/scikit-learn in a Kubernetes cluster, I had to write the Facility Insights predictions to PostgreSQL.

While I did not need specialized knowledge of the application DB, it was useful to know how to map data types from Spark to PostgreSQL and work alongside software engineers to structure front-end tables and make suggestions about schemas.
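
The write step itself does not require anything fancy. A minimal sketch with pandas and SQLAlchemy, using placeholder connection details, table name, and columns, might look like this:

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string (requires the psycopg2 driver) and table name
engine = create_engine('postgresql://<user>:<password>@<host>:5432/<database>')

# Hypothetical example rows; in practice this DataFrame comes out of the model job
predictions = pd.DataFrame({
    'facility_id': [101, 102],
    'predicted_dwell_minutes': [42.5, 87.0],
})

predictions.to_sql('facility_dwell_time_predictions', engine, if_exists='append', index=False)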

6. Git

Writing an ad-hoc analysis is very different from writing a production application. Like it or not, in order to build ML applications, data scientists increasingly need hard software engineering skills.

Facility Insights is an ML application that required code versioning, feature branches, logging, and fault-tolerance. Without a git workflow, it would have been very hard to correct bugs, make incremental changes to the existing code base and deploy in dev, staging and production environments.

At KeepTruckin Data Science, we follow a strict process of feature branches and pull request reviews, and we use a monorepo to host our entire codebase.

  • Feature branches help us to make small incremental code changes to our codebase, usually associated with a single JIRA ticket;
  • Pull request reviews (PRs) require at least one reviewer to comment on and approve the changes. We try to limit our PRs to a reasonable size so that they elicit meaningful comments and reviews stay manageable;
  • The monorepo helps us to integrate our codebase and promote the use of utilities across the company.
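
As a concrete illustration of the feature-branch workflow above, a typical change might go through steps like these (the branch name, ticket ID, and file path are made up):

git checkout -b piero/DS-123-facility-insights-retries
git add facility_insights/train.py
git commit -m "DS-123: add retries around the PostgreSQL write"
git push origin piero/DS-123-facility-insights-retries
# then open a pull request and get at least one approval before merging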

7. Product Management

Last but not least, data scientists need to work alongside product managers (PMs) to prioritize and make suggestions on ML products. Data scientists bring specialized knowledge about what is and is not possible to build, given a certain stage of the company and its data. Data science, unlike software engineering for example, is a new function in many companies, and a lot of PMs do not have specialized knowledge of ML applications. Thus, data scientists can greatly contribute to internal debates around data products.

For instance, for Facility Insights I worked with our PM to provide ad-hoc analyses and discuss results in order to segment the customer roll-out of this feature. I also worked with him to discuss potential new features, such as letting customers request dwell-time predictions for warehouses that do not have one yet.

8. Conclusion

In this article, I walked the reader through the Facility Insights project at KeepTruckin. Throughout this project, I used a variety of skills across different stacks to get the job done. However, these skills are often not part of the data science interview process. Should they be? I do not know the answer to that question.

On the one hand, we are asking junior data scientists to know an increasingly large array of technologies and skills to build ML applications. Should you ask data engineering and DevOps questions when interviewing data scientists? That would seem absurd. On the other hand, because companies with nascent ML requirements lack a mature ML platform, they need data scientists with such diverse skill sets. Perhaps this is a transitional phase in the industry, and with time automation will reduce the surface area needed to build ML applications. For the time being, we need to learn to live with this paradox!


Piero is a Data Engineer at Facebook Reality Labs. Piero held prior data science roles at KeepTruckin, Qubole, Neustar and MarketShare.