Published in Creditas Tech

Data engineering needs to be specialized — here’s how we did it

In my last post, I talked about a misconception that many applicants to our data area have: that it’s necessary to be a data science unicorn to have a career. In it, I explained how unrealistic this standard is and how a “unicorn” profile doesn’t really contribute to the acceleration of data initiatives.

The distinguishing feature of a Data Scientist is their specialization in mathematics and statistics. When we take a step back from this focus on math, the mix that’s left is popularly known as Data Engineering. The term encapsulates all our demands and contexts in one way or another; that very diversity, however, leads us to the same dilemma I described in my previous post: we want racehorses, not unicorns.

It’s a question of focus. If we want to move faster, we need people who are focused on specific problems and who become really good (and fast!) at solving them. At Creditas, we wanted to speed up our deliveries, so we split the role of a Data Engineer into three others.

The needs of a large data department

When we talk about the initiatives related to an organization’s data, especially from the point of view of management, the needs are many: from storage, distribution, and availability to discovery, governance, and understanding of the data. It’s in this flow that different professionals, with a variety of focuses and abilities, are meant to participate.

The following image depicts a diagram of how we can arrange these different demands according to the four major areas of knowledge needed to extract value from data: business, databases, programming, and mathematics (statistics).

Necessary activities to extract value from an organization’s data

The way we used to organize things here at Creditas (and perhaps how you have organized yourselves as well), the Data Engineering team was responsible for most of these demands. In my first post, I included a list of responsibilities (among others) that the team had:

  • construction of workflows for data processing;
  • crawlers for data capture;
  • data ingestion via streaming;
  • construction and organization of the data lake;
  • data warehouse modeling and construction;
  • deployment of data science models;

…to which we might add, among others:

  • database access governance;
  • database documentation;
  • user data capture systems;
  • provision of analytical data in the operational environment.
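To make the first item on that list concrete, a data-processing workflow is essentially a set of tasks plus "runs after" dependencies, executed in topological order. In practice we would reach for an orchestrator like Airflow; the sketch below is a minimal, hypothetical stand-in using only the standard library, with illustrative task names and data:

```python
from graphlib import TopologicalSorter

warehouse = []  # stand-in for a real warehouse table

def extract():
    # In a real pipeline: read from an operational database or event stream.
    return [{"id": 1, "amount": 100}, {"id": 2, "amount": 250}]

def transform(rows):
    # Illustrative business rule: keep only records above a threshold.
    return [r for r in rows if r["amount"] > 150]

def load(rows):
    # In a real pipeline: write to the data warehouse.
    warehouse.extend(rows)

# Dependencies: transform runs after extract, load runs after transform.
graph = {"extract": set(), "transform": {"extract"}, "load": {"transform"}}

results = {}
for task in TopologicalSorter(graph).static_order():
    if task == "extract":
        results["extract"] = extract()
    elif task == "transform":
        results["transform"] = transform(results["extract"])
    elif task == "load":
        load(results["transform"])
```

An orchestrator adds scheduling, retries, and monitoring on top of exactly this dependency-graph idea, which is why "construction of workflows" is a discipline of its own.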

The problem with this mix

If we look at how these responsibilities used to be structured among our teams, the diagram would look something like this:

The organization of our teams alongside data demands
  • Data Science: a team of data scientists focused on constructing models that aggregate value to the business, especially regarding innovation and the automation of repetitive tasks.
  • Product Development: teams made up of developers and product managers focused on delivering technologies that add value to the business. From a data point of view, they are the producers of operational data and the technology teams closest to the business.
  • Analytics: teams of data analysts focused on tracking operational KPIs and answering business questions. From a data point of view, they are the analytical environment’s main consumers and the team that demands the most from the Data Engineering team.
  • Data Engineering, the mix: our team of superheroes/heroines (not to say the others aren’t) who do their best to meet the data needs of all the other teams in the analytical environment.

Although this wide range of demands brings a lot of fun and learning to our daily lives, many problems arise from this scenario, often generating friction, rework, and inefficiency. To name a few:

  1. The lack of focus reduces the in-depth mastery that people can achieve in an area or technology;
  2. A change of context is a toll that is always being paid;
  3. It’s not rare that work needs to be redone due to an incomplete understanding of the domain being addressed;
  4. Interpersonal relationships are diluted since each task requires interaction with a different person;
  5. It’s hard to find people with all the necessary qualifications, so it’s hard for the team to grow;
  6. Wages become inflated for professionals who have little depth or experience.

Moreover, our company is in a stage of speeding things up. Simply put, the old system needed to go — the unicorn had to be dismembered. (poor thing 😭)

The right amount of multidisciplinarity

If we go back to the beginning of everything (relax, not that far back), to the conceptualization of data science, we see that an ability to overcome obstacles was crucial. What was once done by different people according to their capabilities and limitations came to be done in a new and more creative way, which had fewer restrictions, by someone with multidisciplinary knowledge (a unicorn).

How to solve this dilemma, then? It’s simple: with people who are able to hurdle one barrier very well, not all of them. The following diagram shows how we dissected the mix that was the role of a Data Engineer into three distinct new roles.

The new dissected roles of Data Engineering
  • Analytics Data Engineering: a team focused on enabling and accelerating answers to business questions that demand data analysis. They demand infrastructure from the Data Platform Engineering team and collaborate closely with Analytics.
  • Data Platform Engineering: a team focused on automating data capture and redistribution, in both the operational and analytical environments. They collaborate with the Product Development teams on operational data capture and with Machine Learning Engineering to organize the data lake.
  • Machine Learning Engineering: a team focused on supporting data scientists, working with a model throughout its entire life cycle, from capturing data for training to serving the model in production.

The diagram below includes the flow of value creation, showing how each team collaborates with those beside it to overcome the obstacles along the way, from the production and storage of data to its use in decision making and business innovation.
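The life cycle that Machine Learning Engineering owns can be summed up in three steps: capture training data, train, serve. The toy sketch below illustrates only that shape; the "model" (a simple threshold between class means), the feature, and all names are invented for illustration:

```python
def capture_training_data():
    # In practice: pulled from the data lake built by Data Platform Engineering.
    # Each pair is (feature value, label), purely illustrative.
    return [(120, 0), (480, 1), (300, 1), (90, 0)]

def train(rows):
    # Toy "model": the midpoint between the mean feature value of each class.
    positives = [x for x, y in rows if y == 1]
    negatives = [x for x, y in rows if y == 0]
    mean = lambda xs: sum(xs) / len(xs)
    return (mean(positives) + mean(negatives)) / 2

def serve(threshold, value):
    # In production this prediction would sit behind an API endpoint.
    return 1 if value >= threshold else 0

model = train(capture_training_data())
```

The point is not the model itself, but that each step crosses a team boundary: data capture touches the platform, training touches Data Science, and serving touches Product Development.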

In the posts to come, I’ll go into detail about the function of each of these “new” roles and the skills expected for each one. Stay tuned, we have vacancies opening up! See you next time!

Interested in working with us? We’re always looking for people passionate about technology to join our crew! You can check out our openings here.

André Casimiro

Lead Data Engineer at Creditas. Building infrastructure to provide data-driven decisions and innovation.
