Data Organization: why are there so many roles? And why it is important to understand them

Furcy Pin
Published in YounitedTech
11 min read · Oct 19, 2021

Welcome, fellow reader! Since you are reading this post, I assume you probably clicked on a link shared by a colleague or in a weekly digest of blog posts about Data. If so, it won’t be long before you look at the diagrams below (perhaps even before reading this text, I know you do), and start thinking “oh no, not this. Not yet another blog post about the differences between Data Analysts, Scientists, and Engineers, I know that all too well already”. Perhaps, fellow reader, you are even acquainted already with all these new hybrid roles that are coming up (ML Engineer, ML Ops, DataSecOps and Analytics Engineer, perhaps), as numerous and varied as the mythical hybrid creatures that the Ancient Greeks used to summon in their stories? Perhaps you are already thinking: “I don’t have time to waste on such nonsense, there is no need to be so specific about roles. Anyone can do anything if they set their mind to it, assigning roles simply raises artificial and unnecessary walls between what each individual can do.”? If so, I don’t have the power to stop you from leaving right now, but let me tell you that I agree with you. Coining names isn’t an exercise that should be done to tell people what they must or must not do, but I strongly believe it should be done to better understand what each one does. So please stay with me for a little while, and let me share my understanding with you.

Fig 2. A DataSecOps; Fig 3. An Analytics Engineer. Almost as rare today as unicorns or five-legged sheep.

Organic growth of an organization

Younited, like many companies, started by hiring Data Scientists, Data Engineers and Data Analysts. Most fast-growing companies start off by hiring either a Data Scientist or Analyst before realizing they spend 80% of their time on Data Engineering, and then decide to hire Data Engineers too. Depending on the company’s priorities, technological choices, and individual mindsets, each of these three roles will delimit its perimeter by sharing, or not, some tools, skills and responsibilities. Because of this typical organic growth, we noticed that whenever we tried to compare our roles and job descriptions with those of other companies, we found discrepancies.

For example, at Younited, Data Engineers use PySpark to ingest, transform and model data into a data warehouse exposed on Google BigQuery, while our Data Scientists have always been “full-stack”, handling everything from feature engineering and model training to deploying their trained models on Cloud Run to serve scores used in our applications’ critical path. Meanwhile, our Data Analysts use dbt on BigQuery and are tasked with owning all our Power BI dashboards. When we talk with another company, or just look at their job descriptions, we see that they are hiring Data Analysts who write Spark jobs in Scala, Data Engineers who put ML models in production, or Data Scientists who perform ad-hoc analyses but have never gone near any ML library.

When Data Engineers from company A meet with Data Engineers from company B: same but different.

I believe that all this confusion around the different data roles is the main reason why ever-more-specific role names keep showing up. And it’s a good thing! It means that, slowly, people are gaining a better understanding of each part of the data chain, and more importantly they are starting to understand that each one requires specific skills and tools to be perfected. Just like the 25 distinct instruments composing an orchestra, each role puts its main focus on a different aspect of data and implementation.

This is why I decided to write this post: to share with you my interpretation of each role, main or specialized, but more importantly to propose a visual representation that helps everyone wrap their mind around it. Hopefully others will reuse this canvas to present their own vision.

4 Core Roles for 4 Core Skills

Let’s start simple by introducing the 4 core roles, all centered around 4 core skills:

The four core technical roles in a Data team, and their associated core skills

I feel that you are already starting to disagree here, reader: “I consider myself a Data Analyst, but I don’t do BI. If a Data Engineer’s core skill is programming, why don’t you call them Developers or Software Engineers? And by the holy data trinity, what is this yellow horror doing there? Isn’t DevSecOps a chimera on its own anyway?”. Please indulge me, reader. I am only trying to present my understanding of things. Feel free to mistreat my model and apply your own words, if it makes you more comfortable. Do not worry, I myself believe that each of these roles is much broader than the core skill I tagged it with, and that we could pinpoint them with much more specific names too. If it appeases you, I will try to give alternate, more specific names for each core skill practitioner. Let me present each of them to you:

Data Analyst: (Core skill: BI tools, more specific name for this skill: BI Engineer)

At Younited, what defines our Data Analysts the most are the dashboards they produce, own, or help produce. They also write their own transformation pipelines with dbt, and are able to produce ad-hoc analyses for the business stakeholders and share them with Jupyter, but as of today, a talent for BI is what distinguishes them the most from other teams.

Data Scientist: (Core skill: Machine Learning & Statistics, more specific name for this skill: Data Scientist…)

What our Data Scientists have that other Data team members don’t is a knowledge of machine learning algorithms and statistics. They are full stack Data Scientists, meaning that they own everything from feature engineering and model training to production model deployments. They also spend a fair amount of their time performing various statistical analyses to better understand their models. With the continued growth of the team and increasing complexity and requirement levels of their work, they started to feel the need for a new specific role to handle this part, which we will see in the next part.
I would also like to point out a common overextension: people often use the term “Data Scientist” to designate all technical data roles, including Engineers and Analysts. Many companies still hire Data Scientists expecting them to perform all the data processing by themselves: ETL, data wrangling, automation, and all that. Don’t get me wrong, accomplished Data Scientists are perfectly capable of doing it, but that doesn’t mean they enjoy it. To make things worse, I also failed to find the right name to describe more precisely “an ML and Statistics practitioner”. Should we call them Machine Learners? Statisticians? Those feel even less accurate than Data Scientist.

Data Engineer: (Core skill: programming, more specific name for this skill: Developer or Software Engineer?)

While our Data Scientists and Analysts obviously have good programming skills, it is not what defines them. A Data Engineer’s main strong point is generally programming skills. But they also differ from the “standard” software engineer because of their complete focus on data. They have a very good knowledge of SQL (Data’s lingua franca), very good programming skills and apply development best practices. Ideally, they should also have a good understanding of the company’s business, as any good software engineer should have, but even when they don’t have this appetite for business understanding, they can bring a high value to the company by applying generic solutions to common problems.

How juniors must feel when considering what data skills they should learn first.

DevSecOps: (Core skill: Infrastructure, more specific name for this skill: DevSecOps)

Unlike the other three, this role is not especially focused on data. They are responsible for deploying and managing the cloud infrastructure, using Infrastructure as Code tools like Terraform, but also for its security, monitoring and stability. Some also specialize in CI/CD pipeline automation. DevSecOps is a world of its own, with many sub-specializations, which can focus more on development, security, or operations.

At Younited, the DevSecOps team has historically been far from the Data teams, firstly because they already have hundreds of things to take care of for our numerous (non-data) developers (don’t forget we’re a Software company before being a Data company), and secondly because up to now the Data Engineers and Scientists have managed to handle their own stack and infrastructure mostly by themselves. With the increase in size and requirements, we felt that this had to change too.

Introducing new hybrid roles

Now, this is where things get spicy … and a little wobbly. I am not sure myself of the accuracy of each name, nor that each one deserves a specific role in our organisation (at least not yet), but I feel that each one fits a specific need. This need can be (and currently is) covered by the 4 core roles, but as the teams grow we identify new ways to specialize. Please keep in mind, dear reader, that this is just a way to point out the different needs, not to confine people in so many small boxes. I will further discuss this at the end of the post, but this aims to provide a common understanding and help us exchange with people about what their main focus is.

Introducing five hybrid roles. Each hybrid role answers a specific need listed below.

Analytics Engineer: (in-between: BI and Programming, specific need: Data Transformation)

With dbt’s sudden rise in popularity, you have probably already heard about this new hybrid. As dbt Labs explain it themselves, now that companies have shifted from ETL to ELT and increasingly rely on pure-SQL transformations on distributed SQL warehouses (BigQuery, Redshift, Snowflake, Presto), some Data Analysts and Engineers have started to specialize in SQL-only data transformations, using Software Engineering’s best practices: multiple environments for testing/prod, CI/CD validation, DRY principle, etc.
Tools like dbt helped them get more familiar with these principles and apply them more easily, but did not solve the core of the problem: transforming the data and organizing it properly requires grey matter, lots of it. While the Extract and Load steps (the E and L from ELT and ETL) are done quite easily, the Transform remains the most difficult (and valuable) part. For this reason, I would say that this part of the work currently represents at least 50% of our Data Engineers’ and Analysts’ work.
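
To make the ELT idea above a little more concrete, here is a minimal sketch in Python, using the standard library's sqlite3 as a stand-in for a real warehouse such as BigQuery. Everything in it is an assumption made for illustration: the raw_loans table, its columns and the sample rows are hypothetical, and this is not how Younited's pipelines or dbt itself are implemented. It only shows the pattern: load the raw data untouched, transform it with plain SQL, then check the result with queries that should return no rows.

```python
import sqlite3

# Hypothetical raw rows, as they might arrive from a source system.
RAW_LOANS = [
    ("L-1", "fr", "1000.50"),
    ("L-2", "FR", "250"),
    ("L-3", "it", None),  # deliberately incomplete row
]


def load_raw(conn):
    """Extract + Load: copy the source rows as-is, with no cleaning."""
    conn.execute(
        "CREATE TABLE raw_loans (loan_id TEXT, country TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO raw_loans VALUES (?, ?, ?)", RAW_LOANS)


# Transform: a SQL-only model built on top of the raw table.
CLEAN_LOANS_SQL = """
CREATE VIEW clean_loans AS
SELECT
    loan_id,
    UPPER(country) AS country,
    CAST(amount AS REAL) AS amount
FROM raw_loans
WHERE amount IS NOT NULL
"""

# Data tests: each query returns the rows that violate an expectation.
NOT_NULL_AMOUNT = "SELECT loan_id FROM clean_loans WHERE amount IS NULL"
UNIQUE_LOAN_ID = (
    "SELECT loan_id FROM clean_loans GROUP BY loan_id HAVING COUNT(*) > 1"
)


def failing_rows(conn, check_sql):
    """Run one data test and return the offending rows (empty means pass)."""
    return conn.execute(check_sql).fetchall()


def run_pipeline():
    """Load the raw data, build the SQL model, and return the connection."""
    conn = sqlite3.connect(":memory:")
    load_raw(conn)
    conn.execute(CLEAN_LOANS_SQL)
    return conn


if __name__ == "__main__":
    conn = run_pipeline()
    for name, check in [("not_null", NOT_NULL_AMOUNT), ("unique", UNIQUE_LOAN_ID)]:
        status = "ok" if not failing_rows(conn, check) else "FAILED"
        print(name, status)
```

The hard part stays exactly where the paragraph above says it is: deciding what clean_loans should contain is the grey-matter work, while the load step is a few lines. dbt ships generic tests named unique and not_null that express the same kind of expectation declaratively, which is part of why such tools make these practices easier to adopt.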

Business Analyst?: (in-between: BI and ML, specific need: Ad-hoc analyses)

I’m not sure Business Analyst is the correct name for it, but the core task I associate with this position is performing ad-hoc analyses. Depending on the company and the analyst’s background, they will use Jupyter or other tools to produce and share ad-hoc analyses that answer questions from various business stakeholders. At Younited, our Data Analysts and Data Scientists may both perform such tasks, as well as our dedicated Risk Analysts. To avoid creating diverging KPIs or simply reinventing the wheel, it is important to have them work hand in hand with the Analytics Engineers to produce common information sources for all analysts. Otherwise, we end up living out that age-old saying that I’m sure you are familiar with, dear reader: “Ask 1 question to 2 analysts, you’ll get 3 different answers”.

DataSecOps: (in-between: Data Engineer and DevSecOps, specific need: Data Infrastructure)

While on one side of the company, the marketing and sales teams’ need for self-service data rises like a tsunami, on the other side, the CISO (Chief Information Security Officer) and the company’s board demand more security guarantees on our data. To address those needs, we need people who understand cloud infrastructures and cyber security, but also data needs, tools and usages. The DataSecOps are all that. They are capable of securing networks, deploying and configuring Big Data tools such as Spark, Airflow, or dbt, and providing users with CI/CD pipelines to help them be more efficient. (If you think you match this role, we have an open position for you ;-) )

Me when someone tells me that cyber-security is a top priority, simply because there have never been so many dangers and so much at stake.

ML Ops: (in-between: Data Scientist and DevSecOps, specific need: Data Science infrastructure)

Similarly to the DataSecOps, this role is a DevSecOps specialized in Data Science infrastructure needs. They help the Data Scientists deploy and monitor their real-time scoring infrastructure, which is crucial for two reasons: the scoring models are on the critical path of our customers’ journey, so they must be highly available; and the scoring models may use sensitive data, so they must be highly secured.

ML Engineer: (in-between: Data Scientist and Data Engineer, specific need: Data Science industrialization)

While some Data Scientists are capable of being “full stack”, it doesn’t mean that deploying and maintaining tens of models in production is their favorite activity, or that they wouldn’t prefer to spend more time doing actual data “science” than industrialization. For this reason, some Data Scientists who realized they prefer coding and automating, or more commonly some Software Engineers hyped by the Data Science train, joined the community and devoted most of their time to building code and tool belts to better automate, deploy, and maintain Data Science pipelines and models. (If you think you match this role, we have an open position for you ;-) )

Bridges, not walls

After introducing the 4 common main technical roles involved in a Data team, we introduced 5 hybrid specializations, each responding to a specific need. Hopefully this will help you, reader, to better formalize your own organization’s needs and roles, and to compare them with those of other organizations.

As a last word, I would like to insist that each of the hybrid roles represents a bridge between teams, not a wall. In fact, in our current organisation, an important part of the needs addressed by those roles lies in areas where the Data Scientists’, Analysts’ and Engineers’ responsibilities and skills overlap. Here is a representation of our 4 data teams’ current areas of responsibility/expertise:

Of course, this is only a snapshot of our current Data Team organisation, and it will keep on evolving in the coming years. For instance, we might create an Analytics Engineering team composed of Data Analysts and Data Engineers. In the future, we will probably open positions for many other data roles that I didn’t even take the time to bother you with, dear reader, and that you surely know already. Data Architects, Data Stewards, Data Product Managers, Data Support Engineers, perhaps?

P.S.: Younited has many open positions in our Data team: whether you are a Data Analyst, Engineer, Scientist, Steward, Architect, or any of the mythical creatures presented in this article, don’t hesitate to check our open positions.
At the time of writing this post we have open positions for Data Engineers, Analysts, Scientists, DataSecOps, ML Engineers, and Data Product Managers.

P.P.S.: I apologize if my writing style threw you off, dear reader. I have been reading a science-fiction masterpiece lately, called Terra Ignota, by Ada Palmer. It’s truly amazing, and the 4th (and last) book just came out. I have been so hooked that the writing style rubbed off on me. It was just for this post, promise.
