A Typical Data Engineering Project — Sharing From Netflix Data Engineering Team

Hui Zhu
Published in BoringPPL, Jul 16, 2018

Last week, I was lucky enough to attend the WiBD Workshop hosted by the Netflix data engineering team. I have previously spoken to data engineers from many top tech companies, such as LinkedIn and Facebook, as well as their counterparts in high-growth startups. Not surprisingly, the role of a data engineer varies significantly from one company to another, so the path to data engineering can seem shrouded in mystery. I was excited to learn what data engineers at Netflix primarily focus on and to get a flavor of the work through the hands-on exercise.

A typical data engineering project

Senior data engineer Rashmi Shamprasad was kind enough to spend her evening teaching us. Here is a summary of the key points she shared:

  • Always start by understanding the problem statement from your stakeholder
  • Data exploration: Data comes from log files, data warehouses, third-party APIs and so on. It is important to explore the structure, volume, granularity and frequency of your data
  • Data modeling: Decide what the eventual output should look like, taking your consumers into account. Consider the dimensionality of your data, the key metrics to be reported and the relationships across data sets
  • Data transformation: Filter, enrich, standardize and aggregate the data
  • Data quality: Check data trends, and look for gaps from missing data and for anomalies
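To make the exploration, transformation and quality steps above a little more concrete, here is a minimal sketch in plain Python. It is not the notebook from the workshop, and it deliberately uses only the standard library rather than Spark so that it runs anywhere. The records, the field names (`user_id`, `country`, `event_date`, `minutes`) and the region lookup are all hypothetical, invented purely for illustration.

```python
from collections import Counter, defaultdict
from datetime import date, timedelta

# Hypothetical playback log records; every field name here is invented for illustration.
RAW_EVENTS = [
    {"user_id": "u1", "country": "us", "event_date": "2018-07-01", "minutes": 42},
    {"user_id": "u2", "country": "US", "event_date": "2018-07-01", "minutes": 30},
    {"user_id": "u3", "country": "ca", "event_date": "2018-07-02", "minutes": None},
    {"user_id": "u1", "country": "us", "event_date": "2018-07-04", "minutes": 18},
]

# Hypothetical dimension table used to enrich each event with a region.
REGION_BY_COUNTRY = {"US": "North America", "CA": "North America"}


def explore(events):
    """Data exploration: row count, which fields appear, and the date range covered."""
    fields = Counter(key for event in events for key in event)
    dates = sorted({event["event_date"] for event in events})
    return {"rows": len(events), "fields": dict(fields), "first": dates[0], "last": dates[-1]}


def transform(events):
    """Filter, standardize, enrich, then aggregate minutes per (date, region)."""
    totals = defaultdict(int)
    for event in events:
        if event["minutes"] is None:  # filter: drop rows with no measurement
            continue
        country = event["country"].upper()  # standardize: consistent country codes
        region = REGION_BY_COUNTRY.get(country, "Unknown")  # enrich: add a dimension
        totals[(event["event_date"], region)] += event["minutes"]  # aggregate
    return dict(totals)


def missing_dates(totals):
    """Data quality: calendar days with no output rows between the first and last date."""
    seen = {date.fromisoformat(day) for day, _region in totals}
    current, last = min(seen), max(seen)
    gaps = []
    while current <= last:
        if current not in seen:
            gaps.append(current.isoformat())
        current += timedelta(days=1)
    return gaps


if __name__ == "__main__":
    print(explore(RAW_EVENTS))
    daily = transform(RAW_EVENTS)
    print(daily)
    print(missing_dates(daily))
```

In this toy data set, the row for July 2 is filtered out because its `minutes` value is missing, so the quality check reports both July 2 and July 3 as gaps. A real pipeline would express the same ideas with DataFrame operations at much larger scale, but the order of the steps stays the same.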

Collaboration with other teams is a big part of a data engineer’s daily job:

  • UI engineers: logging and instrumentation
  • Other data engineers working on the same data sets
  • Data scientists and data analysts: understand their experiments and analysis to prepare data for insights
  • Data platform team: Considerations on efficiency and scalability

Hands-on exercise

Don’t worry if the summary above sounds a bit theoretical. We then worked through a hands-on exercise, building a Spark pipeline in Python that followed the steps above. The exercise is excellent for anyone who is new to PySpark and wants to get a flavor of a beginner-level project. A Jupyter notebook like this is not easy to find on Google, so it is a real gem that you may want to bookmark.

Although this is far from a fair representation of the actual projects handled by data engineers at Netflix, it is helpful to see the methodology above in action.

How Do I Break Into Data Engineering?

Many attendees asked how to break into data engineering without prior work experience handling data volumes at Netflix’s scale. The advice given by the Netflix data engineers was pretty consistent:

  • Build a portfolio: practice building something on your local machine, read the documentation, and spin up a few EC2 instances to get some exposure to the cloud environment
  • Yes, big data tools change all the time and you can’t possibly be an expert in every single one. However, you can find any company’s main data stack from its engineering blog or its presentations at conferences. Make sure you have experience with those tools if you are interviewing with specific companies
  • Always lean on your strengths. Companies love people with data intuition too!

I wrote this post because being in the Bay Area gives me so much access to valuable industry insights and information. I thought that those of you who are big data enthusiasts but do not live close to a tech hub could benefit from this kind of sharing. Hope you enjoyed it! Kindly note that this post is my personal opinion and does not represent the official opinion of Netflix.

I have also started a GitHub repo capturing more insights from events in the Bay Area and my past conversations with data engineers here. I am adding content over time and would love more contributions and comments! Send us a PR or leave your thoughts below :)
