We don’t need everyone to be a Data Scientist

André Casimiro
Published in Creditas Tech
5 min read · Aug 7, 2019

I recently spoke to Bárbara Barbosa, leader of the Creditas Data Science team, about the difficulty I was having in finding good candidates for the Data Engineering team. The data field is quite competitive: many candidates are inexperienced and salaries are too high. Besides, the candidates who really were a good fit for our team were seeking to become data scientists. Since then, whenever I talk about this, I say:

Everyone wants to be a Data Scientist!!!

Needless to say, I'm not the only one who thinks so. See this question on Quora about whether we are heading toward a talent bubble in Data Science.

The truth is that in order to be a true data science talent you need to be that fantastic being everyone keeps talking about: someone who has hacking skills, math and statistics knowledge, and mastery of business. A true unicorn!

That familiar image, always nice to remember

Unicorns are not born unicorns

Yes, unicorns do exist. True data science talents do exist. People who master everything needed to impact the business by using data, from framing the business problem to delivering the production model, do exist. But they did not start their careers like that. It is simply not possible to gather enough experience to perform such a complex role after only a few years of work.

Studying hard, being dedicated and taking online courses on all kinds of algorithms and techniques seem to be the standard strategy for becoming a unicorn. But the truth is that you need much more than that to become relevant in solving real problems. You need experience, and you can only get experience with time and work.

In her “Data Science is different now” post, Vicki Boykis, a data scientist since 2012, says that the current job market is full of candidates for junior data science positions, with competition reaching up to 100 candidates for a single opening. She advises those who wish to become data scientists to follow this career plan:

1. Don’t shoot for a data science job
2. Be prepared for most of your data scientist work to not be data science. Adjust your skillset for that.

Then, regarding this PwC report, she says:

… the number of data science positions is estimated at 50k. The number of data engineering postings is 500k. The number of data analysts is 125k.

It’s much easier to come into a data science and tech career through the “back door”…

And that is exactly the reality we see in applications to Creditas Data Science and Data Engineering positions. DS positions receive many more applications, while we have many more open DE positions.

We do not need unicorns anymore; we want racehorses

Even though unicorns are extremely valuable to our teams, we understand that relying on them is not the best way to work. Times have changed. Creditas now wants to specialize its Data Science and Data Engineering teams, so that both can become more efficient.

It is often said that data scientists spend 70% to 80% or more of their time just preparing data. This task is clearly the responsibility of a Data Engineer.

Other examples are organizing data in the data lake, building crawlers for data capture, and packaging and deploying the generated models (a role now best known as machine learning engineer).

We do not need unicorns who are capable of solving problems on their own, because that model is neither sustainable nor scalable. What we need most today are data engineers who can create infrastructure that maximizes the amount of time data scientists can spend on data analysis and model training.

The relation between Data Engineering and Data Science

Data Engineering is the part of Data Science focused mainly on technological and analytical infrastructure, in order to collect and organize data and enable the analyses needed. To better explain this, I have created a new version of the Venn diagram of the skills a data scientist needs, splitting “hacking skills” into “programming” and “database”, which are specific to and present in the Data Engineering routine.

Diagram comparing Data Engineering and Data Science skills

When we look at the diagram, we see that the only skill not used in Data Engineering is math and statistics. What we do every day is code pipelines that move, prepare and transform data, seeking to democratize access to data within the organization in a simple and intuitive way.

As I said before, the need to accelerate delivery requires our teams to be specialized, and the usage of Math and statistical models has become a well-defined boundary between DE and DS responsibilities; and it is probable that this is a trend for the market as a whole.

Moreover, to be fair, it must be noted that a Data Engineer is not a Data Scientist who does not know Math; this is not a matter of qualification, but rather of specialization. The data engineer focuses on infrastructure so that the data scientist can focus on research and modelling. A good data engineering team unlocks the entire company’s productivity in data use, including, obviously, the data science team. (This paragraph was suggested by Jéssika Darambaris, thank you!)

Nice! Now, can you explain a little more about what you do in Data Engineering?

In short, everything related to programs that manipulate databases (create, migrate, transform, convert, etc.), such as, but not limited to:

  • building data pipelines (Python, Airflow, Spark, pandas);
  • crawlers for data capture (Scrapy);
  • data ingestion via streaming (Kafka);
  • data lake construction and organization (S3, Athena);
  • data warehouse modeling and construction (Kimball model, Redshift);
  • deployment of data science models.

In the list above, the technologies in parentheses are those we use here at Creditas.
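To make the first item in the list a little more concrete, here is a minimal sketch of an extract, transform and load pipeline. It is only an illustration, not how our pipelines at Creditas are actually written: it uses nothing but Python's standard library (csv and sqlite3) instead of the tools named above, and the sample data, the `loans` table and the function names are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical raw export: stray whitespace, inconsistent casing,
# a missing amount and a duplicated row.
RAW_CSV = """loan_id,customer,amount,status
1, Ana ,1000.50,APPROVED
2,bruno,,pending
3,Carla,250.00,approved
1, Ana ,1000.50,APPROVED
"""


def extract(text):
    """Read CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Normalize fields, drop rows without an amount, de-duplicate by loan_id."""
    seen = set()
    clean = []
    for row in rows:
        loan_id = int(row["loan_id"])
        amount = row["amount"].strip()
        if not amount or loan_id in seen:
            continue
        seen.add(loan_id)
        clean.append((
            loan_id,
            row["customer"].strip().title(),
            float(amount),
            row["status"].strip().lower(),
        ))
    return clean


def load(rows, conn):
    """Write the cleaned rows into a table that analysts can query."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS loans "
        "(loan_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, status TEXT)"
    )
    conn.executemany("INSERT INTO loans VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def run_pipeline(text, conn):
    """Chain the three steps: extract, then transform, then load."""
    load(transform(extract(text)), conn)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    run_pipeline(RAW_CSV, conn)
    query = "SELECT status, COUNT(*), SUM(amount) FROM loans GROUP BY status"
    for row in conn.execute(query):
        print(row)
```

The point of the sketch is the division of labor: once the cleaned table exists, a data scientist can query it directly instead of spending time fixing whitespace, casing and duplicates.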

Profiles we want for our Data Engineering team

Professionals from different backgrounds can be valuable to our team in many ways, and the most common are:

  • Software developers interested in data and distributed processing;
  • Data and BI analysts with knowledge of and interest in programming;
  • DBAs with knowledge of and interest in programming;
  • Experienced data engineers;
  • Professionals who wish to become data scientists one day and want to start their careers working with Data Engineering. :)

If you found this article interesting and would like to apply for a job position on our Data Engineering team, click here.

André Casimiro
Creditas Tech

Experienced leader in the data engineering field. I offer consultancy services; learn more at andrecasimiro.com