Why Data Science Teams Need Generalists, Not Specialists

O'Reilly Media
Published in oreillymedia
3 min read · Aug 26, 2020

Editor’s note: Eric Colson brings a unique perspective to organizing data teams as Chief Algorithms Officer at Stitch Fix, enlisting full-stack data science generalists — an approach that tech professionals love debating. In this article, Eric describes the challenges and payoffs of employing generalists, not specialists.

In The Wealth of Nations, Adam Smith demonstrates how the division of labor is the chief source of productivity gains, using the vivid example of a pin factory assembly line: “One person draws out the wire, another straights it, a third cuts it, a fourth points it, a fifth grinds it.” With specialization oriented around function, each worker becomes highly skilled in a narrow task, leading to process efficiencies.

The allure of such efficiencies has led us to organize even our data science teams by specialty functions such as data engineers, machine learning engineers, research scientists, causal inference scientists, and so on. Specialists’ work is coordinated by a product manager, with hand-offs between the functions in a manner resembling the pin factory: “one person sources the data, another models it, a third implements it, a fourth measures it” and on and on.

The challenge with this approach is that data science products and services can rarely be designed up-front. They need to be learned and developed via iteration. Yet when development is distributed among multiple specialists, several forces can hamper iteration cycles. Coordination costs (the time spent communicating, discussing, and justifying each change) scale proportionally with the number of people involved.

Even with just a few specialists, the cost of coordinating their work can quickly exceed any benefit from their division of labor. Even more nefarious are the wait-times that elapse between the units of work performed by the specialists. Specialists’ schedules are difficult to align, so projects often sit idle waiting for specialist resources to become available. These two forces, coordination costs and wait-times, can impair iteration, which is critical to the development of data science products. Status updates like “waiting on ETL changes” or “waiting on ML Eng for implementation” are common symptoms that you have over-specialized.

Instead of organizing data scientists by specialty function, give each one end-to-end ownership of a different business capability. For example, one data scientist can build a product recommendation capability, a second can build a customer prospecting capability, and so on. Each data scientist would then perform all the functions required to develop that capability, from model training to ETL to implementation to measurement. Of course, these data scientist generalists have to perform their work sequentially rather than in parallel. However, doing the work typically takes just a fraction of the time that would be spent waiting for separate specialist resources to become available. So iteration and development time go down. Learning and development are faster.
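
To make the trade-off concrete, here is a toy calculation of one iteration cycle under each arrangement. It is only a sketch: the step durations and the hand-off wait are hypothetical numbers chosen for illustration, not measurements, and it assumes a generalist completes each step as fast as a specialist would and that no time is spent on coordination.

```python
# Toy model of one iteration cycle. Every number below is hypothetical.
WORK_DAYS = {
    "ETL": 1.0,
    "model training": 2.0,
    "implementation": 1.5,
    "measurement": 0.5,
}
HANDOFF_WAIT_DAYS = 3.0  # assumed idle time before the next specialist is free


def specialist_cycle_days(work_days, handoff_wait_days):
    """Elapsed days when each step belongs to a different specialist."""
    handoffs = len(work_days) - 1  # one wait between consecutive steps
    return sum(work_days.values()) + handoffs * handoff_wait_days


def generalist_cycle_days(work_days):
    """Elapsed days when one person does every step back to back."""
    return sum(work_days.values())


specialist = specialist_cycle_days(WORK_DAYS, HANDOFF_WAIT_DAYS)
generalist = generalist_cycle_days(WORK_DAYS)
print(f"specialists: {specialist} days per iteration")
print(f"generalist:  {generalist} days per iteration")
```

With these made-up inputs the hands-on work is identical in both cases; the whole difference comes from the three hand-off waits. If specialists really are faster at their own step, the gap narrows, so the comparison depends on how large the waits are relative to the work itself.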

Many find this notion of full-stack data science generalists daunting. In particular, the technical skills are what most find challenging to acquire, as many data scientists have not been trained as software engineers. However, much of the technical complexity can be abstracted away through a robust data platform. Data scientists can be shielded from the inner workings of containerization, distributed processing, automatic failover, etc. This allows them to focus more on the science side of things, learning and developing solutions through iteration.

Eric Colson is Chief Algorithms Officer at Stitch Fix. For more than 18 years, he has led data-oriented teams that span algorithms & machine learning, Big Data & data warehousing, and analytics & business intelligence. Prior to Stitch Fix, Eric was Vice President of Data Science & Engineering at Netflix. He holds degrees in Information Systems and Economics.

O'Reilly Media spreads the knowledge of innovators through its books, video training, webcasts, events, and research.