An engineer’s perspective on engineering and data science collaboration for data products
Data products facilitate meeting an end goal through the use of data. At Coursera, we’ve built data products whose missions range from facilitating better content discovery to scaling learner interventions to benchmarking learners’ performance of various skills. Each data product is a collaboration among product leaders, business leaders, data scientists, and engineers. Effective data products need effective collaborations between data scientists and engineers.
What factors go into an effective collaboration? While we don’t claim to have all the answers, this post explains three themes that have worked well for us at Coursera, all from the perspective of an engineer. First, define boundaries of focus, not boundaries of concern. Second, strike a balance between productionizing models and building platforms. Third, use data and SQL as the universal language.
Define boundaries of focus, not boundaries of concern [h/t to Tyler Treat for this phrasing]
When building data products together, data scientists and engineers should work within defined boundaries of focus. They are working on different parts of the system. Not everyone needs to be an expert in everything, and ownership of the various modules should be clear. Yet we’ve found it important for engineers to engage outside of this boundary.
This is because there is no cookie-cutter formula for a data product, and thus there is a greater need for empathy. For engineers, viewing data product development work from the data scientist’s perspective has often yielded insights resulting in better work within the engineering domain.
For example, we developed our experimentation platform as a collaboration between data scientists and engineers. The focus boundaries are as follows: Engineers built the allocation, measurement, and serving engine, while data scientists built the metric definitions, queries, and reporting engine.
But engineers were also concerned with how the data scientists computed the metrics and conducted experiments. So we read up on experimentation best practices and collaborated closely with data scientists on validating assumptions. As a result, we were able to start with a simple approach while also anticipating future needs in experimental design. Leaving room for that extension made the system more usable.
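To make the split of focus concrete, here is a minimal sketch of how such a boundary could look. It is not Coursera’s actual implementation: the hash-based allocation, the table names (`allocations`, `events`), the experiment name, and the `enrolled` event are all assumptions made for illustration. The engineering side allocates users and logs data, while the data science side defines a metric as a SQL query over that data.

```python
import hashlib
import sqlite3

VARIANTS = ["control", "treatment"]  # hypothetical variant names


def allocate(user_id, experiment, variants=VARIANTS):
    """Engineering side: deterministically map a user to a variant.

    Hashing the experiment name together with the user id keeps a user's
    assignment stable within an experiment and independent across experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    return variants[int(digest, 16) % len(variants)]


# Data science side: a metric definition expressed as SQL over logged data.
# Table and column names are illustrative assumptions.
CONVERSION_BY_VARIANT = """
SELECT a.variant,
       COUNT(DISTINCT a.user_id) AS users,
       COUNT(DISTINCT e.user_id) AS converted
FROM allocations AS a
LEFT JOIN events AS e
  ON e.user_id = a.user_id AND e.name = 'enrolled'
WHERE a.experiment = ?
GROUP BY a.variant
ORDER BY a.variant
"""


def run_demo():
    """Allocate 100 made-up users, log fake events, and compute the metric."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE allocations (experiment TEXT, user_id TEXT, variant TEXT)"
    )
    conn.execute("CREATE TABLE events (user_id TEXT, name TEXT)")

    experiment = "new_onboarding"  # hypothetical experiment name
    users = [f"user-{i}" for i in range(100)]
    conn.executemany(
        "INSERT INTO allocations VALUES (?, ?, ?)",
        [(experiment, u, allocate(u, experiment)) for u in users],
    )
    # Pretend every fourth user enrolled.
    conn.executemany(
        "INSERT INTO events VALUES (?, 'enrolled')",
        [(u,) for u in users[::4]],
    )
    return conn.execute(CONVERSION_BY_VARIANT, (experiment,)).fetchall()


if __name__ == "__main__":
    for variant, users, converted in run_demo():
        print(variant, users, converted)
```

In this sketch, an engineer who understands what the metric query needs (for example, one allocation row per user per experiment) can shape the logged data accordingly, even though the query itself sits on the data science side of the boundary.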
Defining boundaries of focus, not concern, results in:
- A more cohesive output data product due to mutual empathy
- Fulfillment of new and existing requirements with less churn and higher quality
- Opportunities for innovation at the intersection of boundaries
Strike a balance between productionizing models and building platforms
Data products follow an iterative development cycle. But how should we sequence productionizing models against building platforms that support future models and features? As engineers, we may look down on one-offs or distasteful hacks, but the YAGNI principle suggests doing the simplest thing that could possibly work. In product engineering, this is done by building a minimum viable product. But continuously productionizing models risks iterating toward a local optimum. Iteration by definition is incremental, while building a platform enables new capabilities.
We believe productionizing models is a necessary prerequisite to building a platform. Building a platform is like building an abstraction, as it grants the users capabilities without them needing to understand the details. Deciding what to put in the abstraction is the challenge, so much so that duplication is often preferable to the wrong abstraction. As a result, our initial investments in a new data product take the form of a minimum viable product. This product often requires manual work to iterate on, but is crucial to discovering patterns and high-value iteration paths. Those paths are then systematized into platforms. The platform provides a foundation for future iteration and new feature development by the data scientists.
Engineers must maintain the platform going forward, but in exchange they automate away part of the previously manual workflow and increase the velocity of iteration. In our experience, the engineering maintenance cost is worth it.
An example of how we struck a balance between productionizing models and building a platform is our automated coaching feature. In the present iteration, we have the capability to target messages to any learner segment and to personalize messages to each individual learner. We use a feedback loop to control the volume and relevance of messages. Our automated coach can also collect learner goals that we surface back to learners at later times. But this data product feature didn’t start out this way.
We first productionized an automated coaching feature capable only of sending generic messages to all learners doing a certain action during a certain course. We were able to track learners’ receptiveness to the various message types, and tested out various learning nudges, such as emphasizing a growth mindset.
After iteration on the message types and copy, we saw big differences between messages in engagement and helpfulness rates. We also hypothesized additional use cases for these in-course coaching messages such as allowing learners to self-assess their competency on a module. At this point, building out a platform to enable high-value iterations became a necessary and natural next step.
Striking a balance between productionizing models and building platforms results in:
- Speedy fulfillment of specific use cases
- New capabilities emerging as the right forward-looking features are systematized into a platform
- An ability to steadily and iteratively improve data product features and impact
Use data and SQL as the universal language
Data scientists at Coursera operate in R and Python, while engineers write Scala. There are a few viable approaches to bridging this difference when developing data products.
One way is to train data scientists in Scala and engineers in R and Python. This approach is common in smaller organizations where individuals wear many hats.
The pros of this approach are that coordination costs are minimal and flexibility is maximized. Engineers and data scientists jointly define and redefine the collaboration model on an as-needed basis for each data product. But engineers and data scientists with cross-trained skills are hard to find. This strategy also punishes high-performing engineers and data scientists who prefer to focus on their domain of expertise.
A second way is to have data scientists own the model prototyping phase and engineers own the model productionizing phase. This approach is common in larger organizations that can afford to hire for specialized roles such as machine learning engineers.
The pro of this approach is that this specialization can bring efficiency. Domain expertise and industry best practices have emerged around the ML engineering field. However, ownership questions arise as machine learning engineers need to interface with both front-end product engineers and data scientists to productionize a data product. Striking the right headcount balance among data scientists, machine learning engineers, and product engineers is another challenge.
A third way is to use data and SQL as the intermediary. In this approach, data is the lingua franca among data scientists and engineers. We’ve had good success with this approach in the past few years.
A benefit of this approach is that SQL + data is a constrained interface that requires minimal training to operate. Data is dumb. It is easy to inspect, visualize, and debug data using SQL, and it is easy to collaborate without hidden states, assumptions, and nuances. Furthermore, this approach tightens the iteration loop, as data scientists can iterate on a model from end to end. We think this approach works for the majority of cases. But we recognize there are scenarios where data is not an ideal interface. The two main scenarios are when we need to encode stateful operations in data, and when precomputation of results is onerous. In practice, we’ve found these scenarios to be infrequent and not first-order concerns.
To use data and SQL as the universal language, we’ve had to build out and democratize our data warehouse, solve the problem of who writes ETLs (answer: everyone), and provide interfaces, libraries, and tools to make the data and SQL ubiquitous across the data science and engineering organizations.
An example of engineering and data science collaborating at the data boundary is our recommendations module infrastructure. It is a system that produces recommendations at various degrees of personalization. Recommendation modules range from fully personalized to the user (e.g., “Based on your recent activity”) to generic cold-start recommendations, with everything in between (e.g., “Because you viewed Machine Learning”).
Algorithms generating the recommendations range from matrix factorization to regression to rule-based queries. But data is an effective encapsulation — a combination of results, scores, and metadata is an effective internal API. It meets the characteristics of a good API: It’s easy for engineers to consume, easy for data scientists to produce, and sufficiently powerful for our use cases.
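As a rough illustration of data acting as an internal API, the sketch below uses a single table holding results, scores, and metadata. It is an assumption-laden example, not Coursera’s actual schema: the `recommendations` table, its columns, the module name, the course ids, and the helper functions `publish` and `top_recommendations` are all hypothetical. The producer can use any algorithm it likes, and the consumer only needs a SQL query.

```python
import json
import sqlite3


def create_store():
    """Create an in-memory table that serves as the shared interface."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE recommendations (
               module    TEXT,  -- e.g. 'recent_activity' (hypothetical name)
               user_id   TEXT,
               course_id TEXT,  -- the result
               score     REAL,  -- ranking score from whichever algorithm ran
               metadata  TEXT   -- JSON blob, e.g. which algorithm produced it
           )"""
    )
    return conn


def publish(conn, module, rows):
    """Producer side (data science): write results, scores, and metadata."""
    conn.executemany(
        "INSERT INTO recommendations VALUES (?, ?, ?, ?, ?)",
        [
            (module, user_id, course_id, score, json.dumps(meta))
            for user_id, course_id, score, meta in rows
        ],
    )


def top_recommendations(conn, module, user_id, limit=3):
    """Consumer side (engineering): read ranked results with plain SQL."""
    cursor = conn.execute(
        """SELECT course_id, score, metadata
           FROM recommendations
           WHERE module = ? AND user_id = ?
           ORDER BY score DESC
           LIMIT ?""",
        (module, user_id, limit),
    )
    return [(course, score, json.loads(meta)) for course, score, meta in cursor]


def build_demo():
    """Populate the store with made-up rows for two users."""
    conn = create_store()
    publish(
        conn,
        "recent_activity",
        [
            ("u1", "machine-learning", 0.9, {"algo": "matrix_factorization"}),
            ("u1", "statistics", 0.7, {"algo": "matrix_factorization"}),
            ("u2", "machine-learning", 0.4, {"algo": "rule_based"}),
        ],
    )
    return conn


if __name__ == "__main__":
    conn = build_demo()
    for row in top_recommendations(conn, "recent_activity", "u1"):
        print(row)
```

Because the consumer depends only on the table’s shape, swapping one algorithm for another on the producer side requires no change to the serving code in this sketch, and either side can inspect the rows directly when debugging.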
Using data and SQL as the universal language results in:
- Clear boundaries of focus between engineers as data consumers and data scientists as data producers
- An understandable and debuggable interface
- A common language between data scientists and engineers when collaborating on shared concerns
At Coursera, engineers and data scientists have built many data products. We’ve learned that building a data product is a team sport. As with any team, our goal is to be more than the sum of our parts through effective collaboration. This post has outlined three themes that worked well in our pursuit of this goal from an engineering perspective. Be on the lookout for a post from a data scientist’s perspective!