Why Data Science Teams Need Generalists, Not Specialists

ericcolson · Published in 97 Things · Jun 28, 2019

In The Wealth of Nations, Adam Smith demonstrates how the division of labor is the chief source of productivity gains, using the vivid example of a pin factory assembly line: “One person draws out the wire, another straights it, a third cuts it, a fourth points it, a fifth grinds it.” With specialization oriented around function, each worker becomes highly skilled in a narrow task, leading to process efficiencies.

The allure of such efficiencies has led us to organize even our data science teams by specialty function: data engineers, machine learning engineers, research scientists, causal inference scientists, and so on. Specialists’ work is coordinated by a product manager, with hand-offs between the functions in a manner resembling the pin factory: “one person sources the data, another models it, a third implements it, a fourth measures it,” and so on.

The challenge with this approach is that data science products and services can rarely be designed up front. They have to be learned and developed through iteration. Yet when development is distributed among multiple specialists, several forces hamper those iteration cycles. Coordination costs, the time spent communicating, discussing, and justifying each change, scale with the number of people involved.

Even with just a few specialists, the cost of coordinating their work can quickly exceed any benefit from their division of labor. Even more nefarious are the wait times that elapse between the units of work performed by the specialists. Specialists’ schedules are difficult to align, so projects often sit idle waiting for specialist resources to become available. These two forces impair iteration, which is critical to the development of data science products. Status updates like “waiting on ETL changes” or “waiting on ML Eng for implementation” are common symptoms that you have over-specialized.

Instead of organizing data scientists by specialty function, give each one end-to-end ownership of a different business capability. For example, one data scientist can build a product recommendation capability, a second can build a customer prospecting capability, and so on. Each data scientist then performs all the functions required to develop that capability, from ETL to model training to implementation to measurement. Of course, these full-stack generalists have to perform their work sequentially rather than in parallel. However, doing the work typically takes just a fraction of the wait time it would take for separate specialist resources to become available. So iteration and development time goes down, and learning is faster.
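As a rough illustration of what that end-to-end ownership can look like, here is a minimal Python sketch of one generalist owning a product-recommendation capability from ETL through measurement. It is not from the article; the function names, the “purchased” label column, and the choice of a simple propensity model are all illustrative assumptions.

```python
# Illustrative sketch: one data scientist owns ETL, modeling,
# measurement, and implementation for a single business capability.
# All names and columns here are hypothetical.

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def load_interactions(path: str) -> pd.DataFrame:
    """ETL: pull raw interaction data and shape it into a modeling table."""
    return pd.read_csv(path)


def train(df: pd.DataFrame):
    """Modeling + measurement: fit a purchase-propensity model and
    evaluate it on a held-out split before shipping."""
    X = df.drop(columns=["purchased"])
    y = df["purchased"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return model, auc


def recommend(model, candidates: pd.DataFrame, k: int = 10) -> pd.DataFrame:
    """Implementation: score candidate products and return the top k."""
    scores = model.predict_proba(candidates)[:, 1]
    return candidates.assign(score=scores).nlargest(k, "score")
```

Because one person owns every step, a change to the ETL, the features, or the evaluation can happen in the same sitting, with no hand-off in between.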

Many find this notion of full-stack data science generalists daunting. In particular, it is the technical skills that most find challenging to acquire, since many data scientists were not trained as software engineers. However, much of the technical complexity can be abstracted away by a robust data platform. Data scientists can be shielded from the inner workings of containerization, distributed processing, automatic failover, and so on. This lets them focus on the science: learning and developing solutions through iteration.
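To make the platform idea concrete, here is a hypothetical sketch of the kind of narrow interface such a platform might expose. The `Platform` class and its `deploy()` method are invented for illustration, not a real library; the point is only that infrastructure concerns sit behind one boundary the data scientist never has to cross.

```python
# Hypothetical illustration of a data platform boundary.
# Platform and deploy() are invented names, not a real API.

class Platform:
    """Stands in for the team's data platform. Containerization,
    distributed processing, and automatic failover are handled
    inside this boundary, not by the data scientist."""

    def deploy(self, model, name: str) -> str:
        # A real platform would build an image, register the model,
        # and wire up monitoring; this stub just returns an endpoint.
        print(f"packaging and deploying '{name}' behind the serving layer")
        return f"https://models.internal/{name}"


# The data scientist's view of "implementation" collapses to one call:
# endpoint = Platform().deploy(trained_model, name="product-recommendations")
```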

ericcolson: Former Chief Algorithms Officer @stitchfix, Former VP of Data Science & Engineering @Netflix, Former editor of http://multithreaded.stitchfix.com/