Clear Roles, Full-Stack, Can’t Lose

Kyle Schmitt
Published in Root Data Science
Dec 18, 2020

Much has been made in the data science blogosphere (humble as it may be) of the evolution towards the “full-stack data scientist”. A full-stack data scientist is…well, let’s hear from some of the pioneering bloggers:

“…a perfect mix of data scientist and machine learning engineer, who can design and build end-to-end machine learning projects and software” ~ Akira Takezawa

“…able to build things and perform tasks that aren’t traditionally thought of as part of their role, [including business analysis, infrastructure, ETL, DevOps, web app deployment, and data visualization]” ~ Chris Schon

“…a jack of all trades, master of none, though oft times better than master of one…you code, you test, you ship and you maintain” ~ Anand Chitipothu

At Root Insurance, we embrace individuals whose breadth of skills lets them succeed across the entire data science project life cycle. However, owing to the complexity of our product, we frequently encounter cases where subject matter specialization, not skill specialization, is advantageous. Why does this arise, and how do we solve it?

Skill specialization versus subject matter specialization

By definition, the full-stack data scientist has an aversion to narrow skill specialization, for instance only descriptive statistics or only modeling. He or she should have command over all the skills required to take a data science project from end to end.

However, reducing specialization to a single dimension misses an important point: a full-stack data scientist can (and arguably should) embrace subject matter specialization. For example, at Root, we have subject matter specialization for (and even within) telematics, insurance pricing, underwriting, reserving, claims, user engagement, lifetime value estimation, and more. Expertise across multiple subject matters is required to take a data science product from end to end.

By expanding specialization into two dimensions, a richer set of personas emerges:

[Figure: skill specialization and subject matter specialization in two dimensions]
  • “The Neophyte” has not yet broadened or specialized their data science skill set or subject matter expertise. This is the case for many in the opening years of their data science career. Any of the four specialization quadrants are within reach!
  • “The Academic” applies deep skill specialization to a narrow class of problems, unlocking hard-won but often incremental value. You’ll find her at research institutes (where pooled funding can make her pursuits most cost-effective) or within R&D departments at companies where small performance margins translate into big economic gains (e.g., digital advertising, high-frequency trading).
  • “The Horizontalist” has a specialized skill but is most effective applying it across a range of use cases. For instance, a resident deep learning specialist or a machine learning engineer might support a range of different product teams. Horizontalization is common for products where niche subject matter specialization does not have a compelling ROI.
  • “The Aspiring Director” has both a broad data science skill set and a broad range of subject matter expertise. He may have hopped from company to company or rotated throughout one organization. Arguably, this is the least stable quadrant of specialization. It is difficult to become a premier data science contributor without either skill or subject matter specialization. The aspiring director may attain broader scope as a manager and beyond. Alternatively, he may languish as a role player without a strong subject matter footing.
  • “The Expert Data Scientist” is a full-stack data scientist who has invested years to also become a subject matter expert. With her broad skill set, she is capable of taking a data science project from concept to deployment and beyond, while also accounting for all of the nuances of her domain. At Root, we encourage this path.

Data science pin factory? Or data science supply chain?

My personal favorite article on the virtues of the full-stack data scientist, by Eric Colson, contrasts the data science life cycle with Adam Smith’s exemplar of the division of labor: a pin factory assembly line.

This article rejects the assembly-line model for data science and endorses broad, skill-agnostic, end-to-end responsibilities.

“The goal of assembly lines is execution. [The goal of data science] is to learn and develop profound new business capabilities…

…this means hiring “full stack data scientists” — generalists — that can perform diverse functions: from conception to modeling to implementation to measurement.”

But this article, too, is mum on the matter of subject matter.

I believe the author selected the pin factory as a proxy for a data science component requiring many operations (ETL, machine learning, deployment) throughout its evolution. In reality, however, data science products are not amenable to the pin metaphor. Instead, they are more aptly characterized by a supply chain, requiring components of different raw materials, standards, and hazards (i.e., different subject matter) to be assembled into one.

The separation of subject matter in data science product development arises organically when the components of the product span too many domains. A supply chain treatment allows:

  • Quality control at a component level by practitioners with intimate familiarity.
  • Gratification among developers who are able to see their component from end to end many times per year.

Putting the supply chain to work at Root

One of Root’s main competitive advantages is the use of telematics to inform every full-term quote we offer. Telematics measures an individual’s driving behavior using sensors in their mobile phone and then uses this information to price more accurately and fairly.

To serve this product, Root’s data science team has embraced a distinction between the Telematics Features team and the Telematics Scores team.

Members of the Telematics Features team are full-stack data scientists who specialize in raw sensor data and physical first principles. These data scientists originate new labeled data sets through ingenuity or experimentation, navigate thorny sensor data quality issues, process petabyte-scale data to arrive at gigabyte-scale modeling files, perform any manner of machine learning, and deploy the resulting models into production.

Members of the Telematics Scores team are full-stack data scientists who specialize in driving behavior and actuarial sciences. These data scientists work closely with the Features team to arrive at thousands of user-level driving features, develop aggregation, imputation, and dimensionality reduction approaches, perform modeling against a sparse and skewed target variable (future insurance loss), and deploy the resulting models into production.

Both teams work end-to-end on their component of the overall product.

Data science supply chain management

Beware: the supply chain model of subject matter specialization creates interfaces, and when products break, it is more often than not at an interface. Furthermore, as discussed in the “pin factory” article, interfaces increase coordination costs and exacerbate wait times.

We provide four principles to help navigate interfaces in your data science supply chain:

Create and evangelize an interfacial contract. At a minimum, enshrine specifications for the data hand-off and documentation standards. Hammer this home. Put it in writing. Hold forums to discuss it with members of the entire supply chain. Hold each other accountable for upholding these standards.
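One lightweight way to make such a contract enforceable, not just aspirational, is to encode the hand-off schema in code and validate every exchange at the boundary. Here is a minimal sketch; the column names and types are hypothetical, not Root’s actual schema:

```python
# Hypothetical interfacial contract: the upstream team guarantees this schema,
# and the downstream team validates each hand-off against it before consuming.
CONTRACT = {
    "user_id": int,
    "hard_brake_rate": float,
    "night_driving_share": float,
}

def validate_handoff(table: dict, contract: dict = CONTRACT) -> dict:
    """table maps column name -> list of values; raise on contract violations."""
    missing = set(contract) - set(table)
    if missing:
        raise ValueError(f"missing contracted columns: {sorted(missing)}")
    for col, expected_type in contract.items():
        if any(not isinstance(v, expected_type) for v in table[col]):
            raise TypeError(f"column {col!r} has values not of type {expected_type.__name__}")
    return table

# A conforming hand-off passes through unchanged; a malformed one fails loudly.
good = {"user_id": [1, 2], "hard_brake_rate": [0.1, 0.3], "night_driving_share": [0.2, 0.5]}
validate_handoff(good)
```

Because the contract lives in shared code, a change to it is visible to both teams in review, rather than surfacing later as a broken downstream model run.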

Mitigate information leakage. Agree on a data stratification strategy that will endure throughout the supply chain, so that data that are in-sample at one component are not mistaken for out-of-sample at another.
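One concrete way to make a split endure across teams is to derive it deterministically from a stable key, so every component of the supply chain assigns a given user to the same fold without any coordination. A sketch, where the 80/20 split and the salt value are illustrative choices:

```python
import hashlib

def assign_fold(user_id: str, holdout_share: float = 0.2, salt: str = "telematics-v1") -> str:
    """Deterministically assign a user to 'train' or 'holdout'.

    Because the assignment depends only on the user id and a shared salt,
    every team computes the identical split -- data that are in-sample
    upstream can never silently leak into a downstream holdout set.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # roughly uniform in [0, 1]
    return "holdout" if bucket < holdout_share else "train"

assign_fold("user-123")  # same answer on every machine, every run
```

Splitting by user (rather than by trip or by row) also prevents a subtler leak: observations from the same driver appearing on both sides of the split.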

Invest in a living interfacial modeling file. Instead of ad hoc handoffs of features or models that are then manually assembled (or reassembled) by the downstream team, agree on a common language for exchange.

For instance, our downstream Telematics Scores team might consume different driving features (columns) for different users in the historical data set (rows). So, our Telematics Features team is charged with preparing the data in this form.

But don’t stop there. Create a mechanism for the upstream team to check in new feature definitions once they have proven their viability in isolation. Develop a one-click or zero-click process to generate new rows as new data arrive. For convenience, you might build this process to join on the target variables for the downstream team.
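A registry of checked-in feature definitions, plus one function that assembles the interfacial modeling file and joins the downstream team’s targets, is one way to sketch that mechanism. All names here (feature names, trip fields) are hypothetical illustrations, not Root’s actual pipeline:

```python
# Hypothetical feature registry: the upstream team "checks in" a feature by
# registering the function that computes it from a user's raw trip records.
FEATURE_REGISTRY = {}

def register_feature(name):
    def decorator(fn):
        FEATURE_REGISTRY[name] = fn
        return fn
    return decorator

@register_feature("hard_brake_rate")
def hard_brake_rate(trips):
    events = sum(t["hard_brakes"] for t in trips)
    miles = sum(t["miles"] for t in trips) or 1.0
    return events / miles

@register_feature("night_share")
def night_share(trips):
    night = sum(t["miles"] for t in trips if t["night"])
    total = sum(t["miles"] for t in trips) or 1.0
    return night / total

def build_modeling_rows(trips_by_user, targets_by_user):
    """One row per user: every registered feature, joined to the downstream target."""
    rows = []
    for user_id, trips in trips_by_user.items():
        row = {"user_id": user_id}
        row.update({name: fn(trips) for name, fn in FEATURE_REGISTRY.items()})
        row["target"] = targets_by_user.get(user_id)  # downstream team's label
        rows.append(row)
    return rows
```

With this shape, shipping a new feature is a pull request that adds one registered function; the next scheduled run of `build_modeling_rows` picks it up with zero manual reassembly by the downstream team.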

[Figure: autonomous process for interfacial modeling file development, for batch or streaming data]

Be flexible about, but not dictatorial regarding, occasional personnel rotation. You may have an Aspiring Director on your team, or simply an individual who wants to understand another component in the supply chain. Where there is willingness, this kind of cross-pollination can lead to good things. Too much rotation, however, can undermine the power of deep-seated subject matter expertise.

Bringing it home

Whether you are a new data scientist or the leader of a data science organization, it pays to appreciate the distinction between skill specialization and subject matter specialization.

Skill specialization will limit opportunities and create unnecessary interfaces and organizational logjams. As an individual, build a broad data science skill set; for many, that means expanding into cloud computing and DevOps. As a leader, embrace the power of the full-stack data scientist through recruitment, training, and recognition.

But, don’t make the mistake of blurring skill specialization and subject matter specialization. Areas of deep subject matter expertise throughout your product will allow for the identification and treatment of complex and nuanced phenomena, ultimately breeding more quality and gratification. As an individual, find a subject matter you love and hone your craft end-to-end. As a leader, thoughtfully devise a data science supply chain and forcefully address inefficiencies that can arise at interfaces.


Father, husband, dog-owner, crossword puzzler, and, in the time in between, Director of Data Science at Root.