Three Questions to Help You Prepare for a Data Engineering Interview

Jun 27, 2018 · 7 min read
Image for post
Image for post
Photo by Glenn Cartsens-Peter from Unsplash

Data science is just one of the modern data-driven fields in our new data world. Another job that is even more prevalent than data scientist is data engineer. Now, being a data engineer does not have all the hype behind it of being a data scientist. However, companies like Google, Facebook, Amazon, and Netflix all need great data engineers!

Data engineering requires a combination of knowledge, from data warehousing to programming, in order to ensure the data systems are designed well and are as automated as possible.

The question is: How do you prepare for an interview for a data engineering position?

Many of the questions will require you to understand data warehouses, scripting, ETL development, and possibly some NO-SQL if the company uses a different form of data storage system like CouchDB.

In case you are preparing for a data engineering interview, here are some questions that might help you. We are focusing on conceptual questions. However, you should also work on some technical skills like SQL, Python, and etc.

How Would You Approach Developing a New Analytical Product as a Data Engineer?

This means a data engineer needs to understand the entire product. A data engineer can’t just get away with building systems based off of requirements. They need to ask why they are building certain tables and objects.

It’s helpful if the stakeholders already have a general outline of what they want. If they don’t have an outline, we would want to work with them to develop a general idea of what metrics and algorithms will exist. This drives all the major decisions, including what data should be pulled, how long should it be stored, if should it be archived, and etc.

Once a general outline exists, the next step would be drilling into the why of each metric. This is because as you’re building different tables at different data granularities, certain issues might arise. Should the unique key be on columns A and B, or A, B, and C. Well, that depends, why is this important? What does that row signify? Is it customer level, store level, or maybe brand level?

Once your team has gone through the process of working on the outline with your stakeholders and gained an understanding of the why, the next step is to think through as many operational scenarios as possible.

Will you ever need to reload data? Do your ETLs allow for it? Is it efficient? What happens when X occurs? How do you handle case Y?

You can’t spend all day doing this, but trying to think of all the issues that could occur will help you develop a more robust system. It also helps create a system that actually meets requirements.

From there, it’s about developing the design, creating test cases, and testing the tables, stored procedures, and scripts, and then pushing to production. How that occurs usually changes from team to team.

What Is the Difference Between an Operational Database and a Data Warehouse?

A data warehouse is not concerned as much with dealing with millions of fast transactions every second. Instead, a data warehouse is usually built to support a data analytics product and analysis. This means performance is not geared towards transactions — instead, it’s aimed at aggregations, calculations, and select statements. A data warehouse will have a slightly denormalized structure compared to an operational database. In most data warehouses, a majority of tables will take on two different attributes: a historical transaction table and tables that contain categorical style data. We reference these as fact and dimension tables.

The fact table is essentially in the center, unlike in a normalized database where you might have to join across several tables to get one data point. A standard data warehouse usually has a focus on the fact tables, and all the dimension tables join to provide categorical information to the fact table. It’s also typically bad practice to join fact table to the fact table, but sometimes it can occur if the data is created correctly. Here is an example of the structure of a data warehouse:

Image for post
Image for post

These are not the only tables that exist in a data warehouse. There are aggregate tables, snapshots, partitions, and more. The goal is usually a report or dashboard that can be automatically updated quickly.

Data warehouses also have a lot of other nuances, like slowly changing dimensions. However, that is a whole other can of worms.

Tell Us About a Time You Had Performance Issues With an ETL and How Did You Fix It?

Hopefully, whoever set up your ETL has an ETL log table somewhere that tracks when components finish. This makes it easy to spot bottlenecks and the biggest time sucks. If not, it will not be easy to find the issue. Depending on the urgency of the issue, we would recommend setting up an ETL log table and then rerunning to identify the issue. If the fix is needed right away, then you will probably just have to go piece-by-piece through the ETL to try to track down the long-running component. This also depends on how long the ETL takes to run. There are ways you can approach that as well depending on what the component relies on.

Issues can vary wildly, and they can include table locks, slow transactions, loops getting stuck, and etc. Once you have identified the issue, then you need to figure out a fix. This depends on the problem, but the solutions could require adding an index, removing an index, partitioning tables, and batching the data in smaller pieces (or sometimes even larger pieces — it seems counterintuitive, but this would depend on table scans and indexes). Depending on the storage system you are using, it’s good to look into the activity monitor to see what is happening on an I/O level. This will give you a better idea of the problem.

When you look at the activity monitor, you can see if there is any data being processed at all. Is there too much data being processed, none, or table locks? Any of these issues can choke an ETL and would need to be addressed.

If you Google some of the performance issues, then you will find some people blaming the architecture for a lot of the problems. We don’t disagree with them. However, this doesn’t mean you should throw in the towel. There are always various ways to manage performance if you can’t touch the actual structure. Even beyond indexes, there are some parallel processing methods that can be used to speed up the ETL. You can also add temporary support tables to lighten the load.

Image for post
Image for post

Question Number Four: Are You Experienced in Python, PowerShell, Bash, and or Java?

So we do recommend having some scripting abilities as well. It allows you to easily automate data flow and analytical tasks.

These are just some general questions to help get ready for your data engineering interview. Besides these questions, you should also look into the concept of slowly changing dimensions, automation with python or PowerShell, some basic Linux commands, and design concepts. If you do have a data engineer interview, then we wish you good luck!

Better Programming

Advice for programmers.

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store