Six Ways to Improve Your Data Scientist–Data Engineer Partnership
Data Engineers and Data Scientists are two peas in a pod, right? Both roles exist for the same purpose: to extract more value from data. With such a clear purpose, why does it seem like there are often challenges when these two roles come together to solve a problem?
Data scientists and data engineers have very different backgrounds that drive the fundamental ways in which they approach problem solving. The context in which they learn and grow to become either a data engineer or a data scientist is what makes each strong at their respective positions. However, this is also the core problem with many data scientist–data engineer interactions.
I like to think that these differences represent opportunities rather than barriers: for improved team performance, happier management and, most importantly, personal growth. Here are six tried-and-true methods for improving your data scientist–data engineer interactions:
1. Cross-train for self-sufficiency
Far too many companies create silos around the specific roles within their data teams. Data engineers create new data, orchestrate pipelines, and ensure data is available in the warehouse. Scientists query this data, curate features, train a model and hand the process back over to the data engineer to put into production. In turn, the engineers make the code more performant; the scientists say that it invalidates the feature. The interaction goes back and forth until neither side is happy and a sub-par product is delivered.
It is my opinion that complete self-sufficiency for data scientists should be the goal for every company. This is not to say that data scientists will never need engineering support. This particular goal is always going to be a moving target as the technology to create features evolves, new modeling libraries are introduced and teammates rotate through different teams. Individuals have their own specializations within teams and it only makes sense to structure work in a way that plays into your specialists’ strengths.
Cross training for self-sufficiency is fundamentally about removing the 80% of interactions that do not require specialist knowledge. For example, data engineers could train their data scientist partners on how to:
- Optimize common SQL access patterns
- Manage memory effectively (especially if Pandas must be used)
- Improve performance by uploading in-memory files to S3 instead of writing to disk first
- Create simple REST endpoints for their model
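As one concrete example of the memory-management skill above, here is a minimal sketch of the common "downcast your dtypes" technique in Pandas. The column names and data are made up for illustration; the point is that Pandas defaults to 64-bit numeric types, and downcasting can cut a DataFrame's footprint roughly in half:

```python
import numpy as np
import pandas as pd

def downcast_numeric(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with numeric columns downcast to the smallest safe dtype."""
    out = df.copy()
    for col in out.columns:
        if pd.api.types.is_integer_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="integer")
        elif pd.api.types.is_float_dtype(out[col]):
            out[col] = pd.to_numeric(out[col], downcast="float")
    return out

# Example: a DataFrame that defaults to 64-bit dtypes
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "user_id": np.arange(100_000, dtype=np.int64),  # int64 by default
    "score": rng.random(100_000),                   # float64 by default
})
slim = downcast_numeric(df)

before = df.memory_usage(deep=True).sum()
after = slim.memory_usage(deep=True).sum()
```

The same habit pays off before an S3 upload too: a smaller in-memory frame serializes faster, whether you write to disk first or stream a buffer directly.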
Once the data science team is trained, the burden of completing these tasks can be shifted away from the engineers. This further enables data engineers to spend more time building the frameworks that simplify feature creation or model deployment for the data scientists.
Cross training should happen both ways and data scientists should be helping their engineering colleagues learn as well. A few areas I’ve found most beneficial for engineers to focus on are:
- Interpreting basic model output and feature importance
  - It is much easier to have a conversation about the performance tradeoff for a given feature if you know it only contributes 1.5% to the model versus 30%
- An understanding of different algorithms and when they may be useful
  - There are many off-the-shelf models that engineers can leverage as easily as they would call an API; data engineers can augment the data science delivery with these models to rapidly deliver new products
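The feature-importance conversation is a good candidate for a hands-on cross-training session. A minimal sketch with scikit-learn (assumed here; any tree-based library exposes something similar) shows how an engineer can read importances as each feature's share of the model's signal:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 5 informative features, 5 pure-noise features
X, y = make_classification(
    n_samples=2_000, n_features=10, n_informative=5,
    n_redundant=0, random_state=42,
)
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Importances sum to 1.0, so each value reads as a share of the model's signal.
importances = model.feature_importances_
for rank, idx in enumerate(np.argsort(importances)[::-1], start=1):
    print(f"{rank:2d}. feature_{idx}: {importances[idx]:.1%}")
```

Seeing that an expensive feature contributes 1.5% rather than 30% turns the "is this worth optimizing?" debate into a quick, data-backed decision.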
2. Give feedback and check each other’s work
Data scientists and engineers generally work to build data products together; it only makes sense that teammates follow up to verify the product their colleagues produce. Code reviews are the first thing that comes to mind and they are a great starting point, especially as you consider how data engineers can use pull requests as baseline examples for how their data scientist partners may replicate a similar feature in the future. Here are other practical ways that each role can validate and provide feedback to the other:
- Perform data quality checks when new pipelines are created and periodically for the lifetime of a data pipeline
- Request early feedback on feature engineering methods to reduce the risk of having to make late-stage tradeoffs due to technical complexity
- Ensure that models deployed to production can handle unexpected inputs (e.g. extreme values, nulls, non-numeric values)
- Initiate conversations about the velocity of data to understand the frequency and cadence that it will be used for modeling and analytics
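The first item in the list above can start small. As a minimal sketch (the column names and checks are illustrative assumptions, not a standard), a shared quality-report function gives both roles the same picture of a pipeline's output:

```python
import pandas as pd

def quality_report(df: pd.DataFrame, required: list[str]) -> dict:
    """Basic pipeline-level checks: row count, missing columns, null rates, dupes."""
    return {
        "row_count": len(df),
        "missing_columns": [c for c in required if c not in df.columns],
        "null_rate": {c: float(df[c].isna().mean()) for c in df.columns},
        "duplicate_rows": int(df.duplicated().sum()),
    }

# A small batch from a hypothetical pipeline, with deliberate problems:
batch = pd.DataFrame({
    "user_id": [1, 2, 2, 4],
    "signup_dt": ["2024-01-01", "2024-01-03", "2024-01-03", None],
})
report = quality_report(batch, required=["user_id", "signup_dt", "churn_flag"])
```

Run at pipeline creation and then periodically, a report like this catches the "unexpected inputs" problem (nulls, missing fields, dupes) before a production model ever sees them.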
3. Share common goals
The ways that data scientists and data engineers work can be drastically different. Engineers tend to focus more on day-to-day or week-to-week deliverables, organized within sprints. Scientists, on the other hand, have more open-ended analytical cycles that may last several weeks to months depending on the problem they are trying to solve. When you consider that most organizations align scientists and engineers in different reporting hierarchies, the different delivery cadences can create disjointed priorities between individuals.
Typical goals may look like:
- Data Engineers: “build a data pipeline to ingest data from X, transform it and place it in Y database every Z [timeframe]”
- Data Scientists: “develop a model that predicts churn with at least X precision and can identify at least Y of users that will churn in the next Z days”
While these goals are aimed at solving the same problem, they promote individualism instead of a unified target. Create a shared objective that both parties can contribute to. Data scientists and engineers will naturally gravitate toward filling the gaps with their preferred skill sets while simultaneously being incentivized to help each other. Problem scopes for each person change from “what do I need to do” to “what do we need to accomplish”. The most important aspect here is that it drives teammates to ask each other “what do you need” and “how can I help you”, which breaks down individualism by promoting team-first delivery.
Some of the common problem and goal definitions I’ve seen work well in the past are:
- Customers leave too frequently, decrease churn by 20%
- CAC is too high because users do not complete the application, increase the application completed rate by 50%
- Engagement is the main revenue driver, increase engagement by 5%
4. Get a mentor in the other role
Mentorship in any form is an excellent way to augment knowledge by learning from someone else’s experience. I find that mentors between the data science and engineering job families bring an extra level of nuance from tangential perspectives within the same problem space. Having more one-on-one time with individuals in the other role will, fundamentally, build basic relationships which will support overall engagement, but the real value will come through a growth in non-technical skills. There are a few specific areas in which each role can learn from each other:
- Influence and sales: Data scientists are, generally, better at communicating outcomes, using data to drive decisions, and structuring results to be easily interpreted. They practice this more often so it only makes sense that this skill is more highly refined. You can pick up these skills from a strong data science mentor and use them to help influence your data science partners in your day-to-day work.
- Think about how data is used, not created: Engineers can sometimes focus too much on how data is being created and where they need to store it. Unfortunately, the perspective of “how will this data be used?” and “what implications will this have for the queries that the data scientists and analysts write?” is often not taken into account. Look for a mentor who can help articulate this and give you practical examples for how to solve problems from both the data creation and data usage perspectives.
- Planning and coordination: One strength that many data engineers have is the ability to define a clear set of sequential dependencies that are required to solve a problem. This problem solving approach stems from day-to-day development and also applies to more abstract job functions such as defining a data architecture. Learning how to approach problems from this perspective from a mentor can help data scientists engage more during a feature’s planning phase and can reduce the risk of identifying a requirement or major change after work is well under way.
- Delivering on MVP: Developers tend to have a clear idea of what a true minimum viable product is and it’s easy to articulate because features are binary: they exist or they don’t. Statistics is not that straightforward; there are many considerations that tend to draw out delivery time for data scientists, from business impact to model performance to the unknown unknowns encountered when starting to research how to build a model. However, defining MVP is a mindset, and a strong engineer who understands the basics of the ML lifecycle will be a strong counterweight to balance your opinions and inject new ideas as you work to determine what is truly minimum.
5. Keep metadata up to date
Metadata may be the most undervalued data that companies today are creating. It has tremendous value to both data scientists and data engineers and, most importantly, it helps both groups operate more efficiently!
Consider a typical interaction model between an engineer and a scientist when a data pipeline is being updated:
- Engineer 1: I need to find out who is using the data that this pipeline creates; I’ll ask my teammate
- Engineer 2: I know that this data is sent to the data lake, you may want to ask the data scientist how it is being used
- Data Scientist: I just joined the team two weeks ago, I think that this data is being used for an ops dashboard
- Engineer 1: We need to remove the “cust_active_dt” field since the upstream API no longer has that value, do you know if it is being used?
- Data Scientist: I don’t think so but I would need to go back and check the code
What a headache. There are so many unknown factors, and the potential impact to business operations is highly dependent on research that may yield inconclusive results. Metadata can completely remove this ambiguity and save time for all three individuals in this process.
Let’s consider that this team has a data catalog and their typical development lifecycle includes updating the catalog with any new changes that are made. In this scenario the engineer could have checked the data lineage within the data catalog to see both A) what downstream systems and processes are impacted and B) that the “cust_active_dt” field has a synonymous value in the same pipeline and removing this field would have minimal impact on the business because all downstream processes can just point to the other field.
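The "check the lineage first" step reduces to a graph traversal. As a minimal sketch (the node names and the dict-based graph are hypothetical stand-ins for what a real catalog would store), finding everything impacted by a field is one downstream walk:

```python
# A toy lineage graph; in practice this would come from a data catalog.
# Keys are fields, values are the downstream assets that consume them.
lineage = {
    "orders_pipeline.cust_active_dt": ["ops_dashboard.active_users"],
    "orders_pipeline.cust_last_seen_dt": ["churn_model.recency_feature",
                                          "ops_dashboard.active_users"],
    "ops_dashboard.active_users": ["weekly_email.summary"],
}

def downstream_consumers(field: str, graph: dict) -> set:
    """Walk the lineage graph to collect everything that depends on `field`."""
    seen, stack = set(), [field]
    while stack:
        node = stack.pop()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

impacted = downstream_consumers("orders_pipeline.cust_active_dt", lineage)
```

With lineage recorded, the engineer's question from the dialogue above ("do you know if it is being used?") becomes a lookup instead of a week of archaeology.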
There is no silver-bullet data catalog that will completely solve this problem. The most important factor here is the commitment between the two primary data stakeholders: data engineers and data scientists. Keeping metadata up to date can be challenging but it is a long-term investment in your data and in your team.
There are tools that can help enable you to manage metadata better, such as Tree Schema’s Data Catalog, which can automatically populate your data catalog and includes a Python client for engineers and scientists to manage data lineage as code. Alternatively, open source data catalogs such as Amundsen and Data Hub are also available options. Each of these tools has their own strengths, but you can only extract value from metadata if it is up to date and that requires a dual effort from both data creators (engineers) and data users (scientists).
6. Participate in Hackathons together
There is no better way to pick up new skills and try all of those algorithms and techniques you’ve stuffed into your TODO bookmarks folder than to stretch yourself in a hackathon. For scientists and engineers alike this is the perfect opportunity to blur the line between the roles even further and to get hands-on experience building new skills.
Hackathons have traditionally been engineering focused but there has been a clear shift from what I’ve seen over the past few years to bring analytics into the fold. There are clever ways to curate datasets within a hackathon so that even the initial product has a solid base for analytics; simplified web scraping and publicly available APIs and data assets have enabled more data science interactions within hackathons. Building these integrations into the MVP is a fantastic way for scientists to understand the basics of development. Google alone offers several high-quality ML APIs, such as Vision, Natural Language, Speech-to-Text, and Translation.
The advent of machine learning as a service also provides the ability for engineers to scratch their ML itch when hacking together a product in just a few days. Developers looking for more of a challenge may look to implement a reinforcement learning or genetic algorithm that will help their product continue to learn on its own.
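For engineers drawn to the genetic-algorithm idea, the core loop fits in a weekend-hack-sized sketch. This toy version (the objective function and parameters are made up for illustration) evolves a population toward the peak of a simple function via selection, crossover, and mutation:

```python
import random

random.seed(0)

def fitness(x: float) -> float:
    # Toy objective: a single peak at x = 3
    return -(x - 3.0) ** 2

def evolve(generations: int = 50, pop_size: int = 30) -> float:
    pop = [random.uniform(-10, 10) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]          # selection: keep the fittest half
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            child = (a + b) / 2                 # crossover: average two parents
            child += random.gauss(0, 0.1)       # mutation: small random nudge
            children.append(child)
        pop = parents + children
    return max(pop, key=fitness)

best = evolve()
```

Swapping the toy fitness function for a real product metric is where a data scientist teammate earns their keep, which is exactly the kind of role-blurring a hackathon encourages.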
The spirit of the hackathon is that everyone chips in to do what is necessary. Hackathons provide an open environment where participants learn new skills and build strong relationships with each other.
Whether your data science and data engineering teams are already hyper-efficient or they are struggling to deliver a single product to production, there is always room for improvement. One common theme among all of these suggestions is to learn from each other. At the end of the day you’re either a data scientist or a data engineer, and both of those jobs are incredibly fun! Spend time understanding how your colleagues work and think, and get more enjoyment from your collaborative work.