Data engineering needs to be specialized — here’s how we did it

Published in Creditas Tech

5 min read · Nov 29, 2019

In my last post, I talked about a misconception that many applicants to our data area have: that it’s necessary to be a data science unicorn to have a career. In it, I explained how unrealistic this standard is and how a “unicorn” profile doesn’t really contribute to the acceleration of data initiatives.

We don’t need everyone to be a Data Scientist

The distinguishing feature of a Data Scientist is their specialization in mathematics and statistics. When we take a step back from this focus on math, the mix that’s left is popularly known as Data Engineering. This term covers all our demands and contexts in one way or another; however, their diversity leads us to the same dilemma I described in my previous post: we want racehorses, not unicorns.

It’s a question of focus. If we want to move faster, we need people who are focused on specific problems and who become really good (and fast!) at solving them. At Creditas, we wanted to speed up our deliveries, so we split the role of a Data Engineer into three others.

The needs of a large data department

When we talk about an organization’s data initiatives, especially from a management point of view, the needs are many: from the storage, distribution, and availability of data to its discovery, governance, and understanding. It’s in this flow that different professionals, with a variety of focuses and abilities, are meant to participate.

The following image depicts a diagram of how we can arrange these different demands according to the four major areas of knowledge needed to extract value from data: business, database, programming, and mathematics (statistics).

Necessary activities to extract value from an organization’s data

Under the way we used to organize Data Engineering here at Creditas (and perhaps the way you have organized it as well), the Data Engineering team was responsible for most of these demands. In my first post, I listed some of the responsibilities the team had:

construction of workflows for data processing;
crawlers for data capture;
data ingestion via streaming;
construction and organization of the data lake;
data warehouse modeling and construction;
deployment of data science models;

…to which we might add, among others:

database access governance;
database documentation;
user data capture systems;
provision of analytical data in the operational environment.

The problem with this mix

If we were to look at the way we now structure these responsibilities among teams, the diagram would look something like this:

The organization of our teams alongside data demands

Data Science: a team of data scientists focused on constructing models that add value to the business, especially regarding innovation and the automation of repetitive tasks.
Product Development: teams made up of developers and product managers focused on delivering technologies that add value to the business. From a data point of view, they are the producers of operational data and are the technology teams closest to the business.
Analytics: teams of data analysts focused on tracking operational KPIs and answering business questions. From a data point of view, they are the analytical environment’s main consumers and the team that demands the most from the Data Engineering team.
Data Engineering, the mix: our team of superheroes/heroines (not to say the others aren’t) who do their best to meet the data needs of all the other teams in the analytical environment.

Although this wide range of demands brings a lot of fun and learning to our daily lives, many problems arise from this scenario, often generating friction, rework, and inefficiency. To name a few:

The lack of focus reduces the in-depth mastery that people can achieve in an area or technology;
A change of context is a toll that is always being paid;
It’s not rare that work needs to be redone due to an incomplete understanding of the domain being addressed;
Interpersonal relationships are diluted since each task requires interaction with a different person;
It’s hard to find people with all the necessary qualifications, so it’s hard for the team to grow;
Wages become inflated for professionals who have little depth or experience.

Moreover, our company is in a stage of speeding things up. Simply put, the old system needed to go — the unicorn had to be dismembered. (poor thing 😭)

The right amount of multidisciplinarity

If we go back to the beginning of everything (relax, not that far back), to the conceptualization of data science, we see that an ability to overcome obstacles was crucial. What was once done by different people according to their capabilities and limitations came to be done in a new and more creative way, which had fewer restrictions, by someone with multidisciplinary knowledge (a unicorn).

How do we solve this dilemma, then? It’s simple: with people who are able to clear one barrier very well, rather than all of them. The following diagram shows how we dissected the mix that was the role of a Data Engineer into three distinct new roles.

The new dissected roles of Data Engineering

Analytics Data Engineering: a team focused on enabling and accelerating the acquisition of answers to business questions that demand data analysis. They demand infrastructure from the Data Platform Engineering team and collaborate closely with Analytics.
Data Platform Engineering: a team focused on the automation of data capture and redistribution, both in the operational and analytical environment. They collaborate with the Product Development team for operational data capture and with Machine Learning Engineering to organize the data lake.
Machine Learning Engineering: a team focused on helping data scientists, working with a model throughout its entire life cycle, from capturing data for training to producing that same data in the production environment.

Included in the diagram below is the flow of value creation, showing how each team collaborates with those beside it to overcome the obstacles at every stage, from the production and storage of data to its use in decision making and business innovation.

In the posts to come, I’ll go into detail about the function of each one of these “new” roles and what skills are expected for each one. Stay tuned, we have vacancies opening up! See you next time!

Interested in working with us? We’re always looking for people passionate about technology to join our crew! You can check out our openings here.