How asking a simple question bridged the gap between data science and data engineering

Anne-Sophie Schröder
Published in Axel Springer Tech
Dec 2, 2021 · 3 min read
Photo by Christina on Unsplash

The disconnect between data scientists’ and data engineers’ code

Data science is an integral part of StepStone’s tech stack and powers several products: showing candidates the best-matching job, recommending better job titles for listings, and suggesting to sales agents which companies might need to post a job ad.

One of the typical challenges data science faces in many organizations is bringing projects from experiment into production. This was no different at StepStone. The key was to collaborate more closely with data engineers.
For the last few years at StepStone, data scientists and data engineers worked in separate teams and departments. This setup has recently changed to interdisciplinary agile teams, but our insights remain relevant to anyone working in a separated structure.
Typically, a data science project was started and driven from the data science side. The data engineering department helped choose the technology based on the requirements from data science. Only once the data science team had finished experimenting and built a first version of a pipeline or API was the code handed over to the data engineering teams. This means that the primary communication tool was “code + calls”.

We knew that this type of interface created some friction, as data scientists and data engineers have quite different styles and objectives when writing code. One remark from our data engineering colleagues really hit home:

“While data engineers use code to build extendable systems based on abstractions, rules and design patterns, data scientists use code to translate mathematical models into computer language.”

The simple question that uncovered the pain points

It all started when we opened a round of feedback from data engineers to data scientists by asking: “What happens when data scientists have prepared their side of the code?” It was like opening Pandora’s box. This simple question about how data engineers process our code, and which pain points they run into, led to a full roadmap of possible improvements that we want to tackle in the future. They range from using the infrastructure properly (PySpark vs. Pandas might ring a bell with some of you!), to “Testing, Testing, Testing”, to cleanly separating the important parts of the code that data engineers modify from the rest.

How we started lifting the pain

Just asking for feedback is of course not enough; we needed actions to follow up. After prioritizing the topics by urgency, the two most important ones were “separation of pipeline stages in code” and “testing (units, business logic & integration)”.
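To make these two topics more concrete, here is a minimal sketch of what “separation of pipeline stages” plus unit testing could look like. It is not StepStone’s actual code: the stage names (`extract`, `clean`, `score`), the `id;title` input format and the title-based rule are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, List

# Hypothetical record type: one job listing as a plain dict.
Listing = Dict[str, object]


def extract(raw_rows: Iterable[str]) -> List[Listing]:
    """Stage 1: parse raw 'id;title' lines into records."""
    listings = []
    for row in raw_rows:
        listing_id, title = row.split(";", 1)
        listings.append({"id": int(listing_id), "title": title})
    return listings


def clean(listings: List[Listing]) -> List[Listing]:
    """Stage 2: normalise whitespace and case, drop listings without a title."""
    cleaned = []
    for listing in listings:
        title = " ".join(str(listing["title"]).split()).lower()
        if title:
            cleaned.append({**listing, "title": title})
    return cleaned


def score(listings: List[Listing]) -> List[Listing]:
    """Stage 3: stand-in for the model (assumption: a trivial keyword rule)."""
    return [
        {**listing, "is_data_role": "data" in str(listing["title"]).split()}
        for listing in listings
    ]


def run_pipeline(raw_rows: Iterable[str]) -> List[Listing]:
    """Wire the stages together; each stage stays testable on its own."""
    return score(clean(extract(raw_rows)))


def test_clean_drops_empty_titles() -> None:
    """Unit test for a single stage, independent of the other stages."""
    assert clean([{"id": 1, "title": "   "}]) == []


def test_pipeline_end_to_end() -> None:
    """Small integration test across all stages."""
    result = run_pipeline(["1;  Senior   Data Engineer ", "2;Sales Agent"])
    assert [r["is_data_role"] for r in result] == [True, False]
```

Because every stage is a plain function with explicit inputs and outputs, a data engineer could swap out or rewrite one stage (for example the parsing in `extract`) without touching the modelling logic, and the tests show which behaviour must be preserved.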

In the past we held workshops from data engineers for data scientists, and vice versa, to bridge the gap. This time we tried something different: we wanted to empower data scientists through self-teaching guided by data engineers. This meant that data scientists, working in pairs and advised by data engineers, prepared four short one-hour workshop sessions.
The big advantage here was that learning to teach others forces you to fully understand the material. As a result, we now have experts on the different topics within data science.

All that was left then was to organize the workshops for data scientists. Luckily, at StepStone we already have a group of experienced people who regularly set up workshops: our Data Science Community. They handled the organization of the workshops and the attendees.
The feedback we received from the attendees was very positive, and it left them hungry for more.

What we learned in this process

- Foster an open feedback culture; it helps everyone to improve
- Especially when working with other teams: keep asking for feedback and follow up on it with actions
- Whenever you prepare to teach others, you come to fully understand the material

By simply asking what happens with our code and voicing our wish to improve, we started a journey toward better software quality in data scientists’ code. You can do that, too.
