Before I started as a Data Engineer two years ago, I had no idea what the role entailed or how it differed from data science and data analytics. Job titles with the word “data” in them can be an enigmatic black box, even for folks in technical roles. This post is what I would have wanted to read when I was trying to fit the pieces of the data pipeline together.
Can you give me an overview first?
How do you relate to the other data teams?
At Gusto, Data Platform Engineers support the other data teams — Data Analysts and Data Scientists. If those teams need access to data, we move and adjust it to a usable format for them. If they need a server that can crunch numbers for them quickly, we maintain that server. If they need a tool to explore data interactively, we make sure they have it — whether it’s an external tool that we keep connected or a tool we build ourselves.
What part of the business does your work impact?
Primarily, our work impacts the business side of Gusto. For example, we recently switched our customer satisfaction (CSAT) surveys to a new vendor, and the customer-facing teams who rely on the surveys needed to keep accessing the data in much the same way as before.
My job was to collect the raw responses from the third-party vendor, transform the data to make it query-able, and then make sure a partner on the Data Analytics team had access to it. From there the Data Analyst made graphs and tables so that the customer-facing teams were able to see their stats in a way that was already familiar to them.
Why do you need to do all this to data?
If you only need to use data from one database, you may not need to move it beyond creating a read-only cluster specifically for this purpose. Even then, the database is likely structured in a way that works for the application or the engineers, rather than for the people who need to analyze the data.
If your users (i.e. data analysts and business teams) need information from more than one database or want to use data from third-party tools, you’ll need some way to group all the data together (i.e. a data warehouse). If your database doesn’t keep a record of each write, you may need some way to allow users to get a historical view of the database.
I could keep going, but you get the idea. The data is very unlikely to be in the most ideal state for queries that yield fast insights, which is why it needs to be copied and restructured.
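To make that concrete, here is a toy sketch of the kind of reshaping analysts need. All table and column names are made up, and an in-memory sqlite database stands in for an application database:

```python
import sqlite3

# Toy, in-memory stand-in for an application database; the schema and
# names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE payrolls (id INTEGER PRIMARY KEY, company_id INTEGER,
                           amount_cents INTEGER, run_at TEXT);
    INSERT INTO companies VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO payrolls VALUES (1, 1, 500000, '2019-08-01'),
                                (2, 1, 510000, '2019-08-15'),
                                (3, 2, 120000, '2019-08-01');
""")

# The normalized form suits the app; an analyst wants one wide, aggregated
# answer. Baking this reshaping into a table is exactly what a pipeline does.
rows = conn.execute("""
    SELECT c.name, COUNT(*) AS payroll_runs, SUM(p.amount_cents) AS total_cents
    FROM payrolls p JOIN companies c ON c.id = p.company_id
    GROUP BY c.name ORDER BY c.name
""").fetchall()
print(rows)  # [('Acme', 2, 1010000), ('Globex', 1, 120000)]
```

Run once against a read replica this is just a query; run on a schedule and stored as its own table, it becomes a piece of a warehouse.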
So you get data from more than one place? What are all the data sources?
In our case, we gather data from our four internal databases, which contain the data created by our apps. For example, whenever a customer runs a payroll, multiple rows are created or updated across multiple tables.
We also gather data from external tools via their APIs. These include Qualtrics, Marketo, and Zendesk Chat, among others.
Another way to get data from an external tool is to connect to the database directly. We do this with Salesforce, but we also copy data into certain Salesforce tables from our internal databases.
Ok, but what product do you create?
In our case, we expose a data lake to the Data Analytics and Data Science teams. The data lake is meant to be a place of discovery for these teams. Since the data is raw, it takes less work for the Data Engineering team to manage, but it doesn’t eliminate data that could be useful for skilled explorers.
More broadly within Gusto, we expose a data warehouse of tables that are structured to be queried quickly and only contain a subset of all the data in the lake. For Gusto, all of our data goes through the lake before it gets to the warehouse, and only the data that we know is useful and worth cleaning gets to the warehouse. These tables are meant to be more easily understood and allow for varying levels of access to sensitive data through different schemas.
Overall, we can think of the difference as the one between lake water and tap water. Lake water in some places is safe to drink, but usually it’s not, and in most cases getting the water is a lot less convenient than drinking from the tap. Tap water, on the other hand, is enriched with fluoride, is stripped of lead and bacteria, and is at your fingertips when you want it.
Similarly, the lake is a huge amount of data (all that my team collects). To make any use of it, you need to know what’s in it and how to treat it to write a query that returns anything a human can make sense of. The warehouse has been cleaned for you; it’s in tables that make sense for known use cases; and you can get answers out of it quickly.
How do you make the lake and the warehouse?
First, we run a scheduled Ruby CLI command to get raw data from the data sources and store it in S3 in the rawest possible format. Storing raw means that if we decide later on that we want an additional field or a mapping changes, we won’t have to reload everything; we’ll just change the next step.
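A minimal sketch of that “store raw” step, with a local directory standing in for S3 and a canned payload standing in for a vendor API. The key layout, function, and field names are all hypothetical:

```python
import json
from datetime import date
from pathlib import Path

# Sketch of the "store raw" step: S3 is replaced by a local directory and
# the vendor API by a canned payload. Key layout and names are hypothetical.
def store_raw(payload, source, day, root):
    # Partitioning raw files by source and date means a later backfill or
    # mapping change only has to reprocess the affected days.
    key = root / source / day.isoformat() / "responses.json"
    key.parent.mkdir(parents=True, exist_ok=True)
    # Store the payload exactly as received: no field selection, no mapping.
    key.write_text(json.dumps(payload))
    return key

path = store_raw({"responses": [{"score": 9}]}, "csat_vendor",
                 date(2019, 8, 22), Path("raw_data"))
print(path)  # raw_data/csat_vendor/2019-08-22/responses.json
```

The important design choice is what the function does not do: no cleaning, no renaming, no dropped fields, so later stages can always be rebuilt from this copy.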
Next, we use Apache Spark or Amazon Athena to perform a simple ETL (Extract, Transform, Load) on the data that we want in the data lake. An ETL could be as simple as reading a table from JSON files, making small changes to the structure, and writing the new table to CSV files. The data lake is also stored in S3, but all of its files are Parquet, whereas the raw data could be collected in any format, like JSON or CSV. For our purposes, this ETL stage leaves the data with the same columns as in raw and switches the file format to Parquet. Because Parquet is a compact columnar format, Presto (or, more specifically in our case, Athena) can query it very quickly.
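The read-transform-write shape of that lake stage can be sketched in a few lines. CSV stands in for Parquet here so the example needs only the standard library, and the field names are made up:

```python
import csv
import io
import json

# Raw stage: newline-delimited JSON exactly as collected (field names made up).
raw = "\n".join([
    json.dumps({"survey_id": 1, "score": 9, "comment": "great"}),
    json.dumps({"survey_id": 2, "score": 6, "comment": "ok"}),
])

# Lake stage: same columns, different file format. CSV stands in for Parquet,
# which would need a library like pyarrow or a Spark job.
records = [json.loads(line) for line in raw.splitlines()]
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["survey_id", "score", "comment"])
writer.writeheader()
writer.writerows(records)
table = out.getvalue()
print(table.splitlines()[0])  # survey_id,score,comment
```

In the real pipeline, Spark or Athena does this same read-and-rewrite at scale, with Parquet as the output so queries only scan the columns they need.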
Finally, once we know what to put into the data warehouse, we run Redshift SQL queries to perform another ETL which loads the new arrangement into a table in the data warehouse. This ETL stage does things like combine tables, unnest complex column types like arrays or structs, and remove sensitive fields.
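What that warehouse stage does to a single record can be sketched in plain Python. The real stage is Redshift SQL, and every field name here is hypothetical:

```python
# One lake record, with a nested array column and a sensitive field.
# (The real stage is Redshift SQL; all names here are hypothetical.)
lake_record = {
    "employee_id": 42,
    "ssn": "000-00-0000",  # sensitive: must not reach the warehouse
    "paychecks": [
        {"period": "2019-07", "net_cents": 310000},
        {"period": "2019-08", "net_cents": 315000},
    ],
}

SENSITIVE = {"ssn"}

def to_warehouse_rows(record):
    # Drop sensitive fields, then unnest the array: one flat row per element.
    base = {k: v for k, v in record.items()
            if k not in SENSITIVE and k != "paychecks"}
    return [{**base, **check} for check in record["paychecks"]]

rows = to_warehouse_rows(lake_record)
print(rows)
# [{'employee_id': 42, 'period': '2019-07', 'net_cents': 310000},
#  {'employee_id': 42, 'period': '2019-08', 'net_cents': 315000}]
```

The flat rows are what land in a warehouse table: no nesting to unpack, no sensitive columns to guard, and a shape that joins and aggregates cleanly.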
How do you decide what is going to be in the Data Warehouse?
The short answer is we don’t. For the most part, the Data Analytics team decides what should be available and in what format. Sometimes we are given the requirement and then write the code to move it. Sometimes we build the tools that allow a Data Analyst to build and maintain a table on their own.
I think I followed most of that… is that what all Data Engineers do?
Not all Data Engineering teams will do the same work. Some Data Platform teams focus exclusively on infrastructure, while ETL teams might exclusively handle the process of moving and cleaning data.
That sounds interesting; how can I learn more?
I can’t say exactly what you’ll need to know, but here’s what helped me feel confident in my role:
- Take time to learn more about infrastructure. We use Docker to deploy most of the projects that we own, so being able to work with the CLI, read a Dockerfile, and dig into containers has been immensely helpful. I’ve also gotten more comfortable building pipelines in Buildkite, writing Terraform files, and navigating the AWS console.
- Brush up on your SQL. You might not be writing a ton of raw SQL right now, but as a Data Engineer you’ll likely be writing more complicated SQL and will need to understand the queries that other data teammates write.
- Ask lots of questions, and be fearless. There are still projects that my team owns but that I haven’t worked on a ton. I push myself to pick up tickets in those domains with the knowledge that I can ask my teammates questions, even though it is usually scary to work on new code and admit what I don’t know.
- Keep looking for more resources.
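For a taste of the “more complicated SQL” mentioned above, here is a small window-function query, a construct that comes up constantly in analytics work. The data is made up, and sqlite stands in for a warehouse (any dialect has an equivalent):

```python
import sqlite3

# Window-function example: number each company's payroll runs by date.
# (sqlite 3.25+ supports window functions; the data here is made up.)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE payrolls (company TEXT, run_at TEXT, amount_cents INTEGER);
    INSERT INTO payrolls VALUES
        ('Acme',   '2019-08-01', 500000),
        ('Acme',   '2019-08-15', 510000),
        ('Globex', '2019-08-01', 120000);
""")
rows = conn.execute("""
    SELECT company, run_at,
           ROW_NUMBER() OVER (PARTITION BY company ORDER BY run_at) AS run_no
    FROM payrolls ORDER BY company, run_at
""").fetchall()
print(rows)
# [('Acme', '2019-08-01', 1), ('Acme', '2019-08-15', 2),
#  ('Globex', '2019-08-01', 1)]
```

Being comfortable reading a query like this, and reasoning about what the `PARTITION BY` and `ORDER BY` clauses do, goes a long way toward understanding the SQL that analysts write.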
Originally published at https://engineering.gusto.com on August 22, 2019.