Moving from project to product: keys to developing successful data science products
The goal of a data scientist is to develop predictive models that are fit to purpose, such that they provide accurate, actionable insight within the constraints of the production environment for which they were designed. Developing a data science project into a data science product is about more than improving model performance; it’s about developing a sustainable ecosystem that enables project results to provide continual value. Working on a number of successful (and unsuccessful) data science projects, ranging from proofs of concept through commercial products, has provided a wealth of lessons about the critical components of a successful data science product, as summarized in the Venn diagram below. In the rest of this article, we’ll examine the interactions between each of the components of the diagram, what each interaction produces, and the associated pitfalls that can prevent even the best ideas from becoming successful data science products.
Many data science projects begin with an idea: what if we could do ___? While these ideas can come from anywhere, the most productive ones often come from a series of discussions between subject matter experts and data scientists. These discussions help refine the idea to one or more specific problem statements, which are subsequently evaluated for potential value and associated data requirements. The result of these interactions is a proof of concept/proof of value, as represented by the overlapping region of these two groups in the diagram. A data scientist will work with a data snapshot, often manually pulled together from multiple data sources by subject matter experts, to evaluate how that data can be used to answer the problems of interest. Unfortunately, even the best proof of concept resulting from this interaction can fail to become a data science product due to the following pitfalls:
· No champion or a lack of perceived value: A predictive model not only needs to show that it can provide value (insight, improved efficiency, competitive advantage, etc.), but that value must be effectively communicated both up and down the organizational structure. Each project needs a champion with sufficient influence within the organization to drive it through approval channels for funding, resources and future adoption. While appropriately designed projects attempt to minimize disruptions in workflow for those downstream, value must also be clear to those whose workflow may need to be modified to ensure that the necessary data is collected appropriately for model development. Additionally, those with the power to block initial funding or adoption after development need to be identified and brought into the process early on when evaluative metrics of project success are established. While this does not ensure project success, it does help prevent being blindsided by previously unknown requirements or performance metrics that could derail the project late in development.
· No sustainable data management pipeline: Moving a model beyond proof of concept requires collecting additional data representative of expected use conditions. Machine learning models that form the heart of data science products are built from data; regardless of the algorithm employed, the resulting model can only be as good as the data collected for training. This requires establishing a structured data pipeline such that the collected data is of sufficient relevance, consistency and completeness for model development. Lack of a structured data management system can negatively impact data quality along all of these axes, ultimately resulting in negative impacts on model performance (i.e. garbage in, garbage out). In addition, if no path exists to continually collect data in the future, this extends the model development timeline and limits the ability to continually improve the model as more data is collected.
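To make the “garbage in, garbage out” point concrete, here is a minimal sketch of a quality gate at the front of a data pipeline. The schema and field names are hypothetical; the idea is simply that records failing consistency or completeness checks are quarantined for review rather than silently entering the training set.

```python
# Hypothetical schema: required fields and their expected types.
REQUIRED_FIELDS = {"sensor_id": str, "timestamp": float, "value": float}

def validate_record(record):
    """Return a list of problems; an empty list means the record is usable."""
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        elif not isinstance(record[field], expected_type):
            problems.append(f"bad type for {field}: {type(record[field]).__name__}")
    return problems

def partition_records(records):
    """Split incoming records into (clean, quarantined_with_reasons)."""
    clean, quarantined = [], []
    for r in records:
        problems = validate_record(r)
        if problems:
            quarantined.append((r, problems))
        else:
            clean.append(r)
    return clean, quarantined
```

The real value of a gate like this is that it runs continually as new data arrives, so data quality issues surface as they occur rather than months later during model development.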
· Lack of a deployment environment: Whether the production model is deployed on a server or an external device, model end users need a way to easily access model results for the model to provide value. In addition, both the deployment environment and the predictive algorithm should be designed in a modular fashion to easily enable model updates in production. From a data scientist’s perspective, the deployment environment can provide critical constraints, such as memory or processing power limitations, which will influence the data processing and modeling algorithms employed. Specifications on the deployment environment should be established early in project development to ensure that the predictive algorithm will function within the production environment.
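One way to picture the modularity described above is a small interface boundary between the serving layer and the model. This is an illustrative sketch (the class names and the toy threshold model are invented): the serving code depends only on a minimal `Predictor` interface, so a retrained model can be swapped in without redeploying anything else.

```python
from abc import ABC, abstractmethod

class Predictor(ABC):
    """Minimal interface the deployment environment depends on."""
    @abstractmethod
    def predict(self, features): ...

class ThresholdModel(Predictor):
    """Stand-in baseline model; a retrained version replaces it in place."""
    def __init__(self, threshold):
        self.threshold = threshold
    def predict(self, features):
        return "alert" if features["reading"] > self.threshold else "ok"

class Server:
    """Serving layer: knows nothing about how the model works internally."""
    def __init__(self, model: Predictor):
        self.model = model
    def swap_model(self, new_model: Predictor):
        self.model = new_model  # hot-swap without touching serving code
    def handle(self, features):
        return self.model.predict(features)
```

In a real deployment the same separation shows up as a model registry or versioned artifact behind a stable serving API, but the principle is the same.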
A critical failure mode of data science projects limited to interactions between data scientists and subject matter experts, the lack of a sustainable data pipeline, illustrates the need for interaction between data engineers and subject matter experts. As illustrated in the diagram, the result of the interaction between these two groups is a data management platform. Subject matter experts can aid in identifying the data to be collected while data engineers formulate strategies for designing distributed data storage, collection, processing and retrieval systems. Development of these strategies can be influenced by a number of factors, including but not limited to: expected data volumes and demand, existing collection and/or storage mechanisms (e.g. data silos), and scaling requirements. Data science projects that originate from the interaction between only these two groups are typically of the data mining variety, where problem statements are generated by identifying interesting patterns in the collected data. While building an effective data management platform is necessary for such projects, failures of data science products can still occur as a result of this interaction due to:
· Lack of a data pipeline for predictive model development: Model development requires a pipeline for facilitating communication of data between the management platform and data scientists. Ideally, the data engineering team will establish API endpoints that enable easy querying of data by end users, which may be extracted from a variety of sources, to support model development or business intelligence initiatives. In this manner, back-end changes can be made to data models without impacting how end users interact with the endpoints. The existence of multiple use cases for data access driven by multiple user types may also influence architecture decisions required to efficiently support these endpoints.
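The decoupling benefit of stable endpoints can be sketched in a few lines. The backends and field names below are hypothetical stand-ins for real data sources; the point is that the query function callers depend on never changes when the storage behind it does.

```python
class CsvBackend:
    """Illustrative backend: flat rows, as from exported spreadsheets."""
    def __init__(self, rows):
        self._rows = rows
    def fetch(self, experiment_id):
        return [r for r in self._rows if r["experiment_id"] == experiment_id]

class WarehouseBackend:
    """Illustrative backend: rows already merged and keyed by experiment."""
    def __init__(self, table):
        self._table = table
    def fetch(self, experiment_id):
        return self._table.get(experiment_id, [])

def get_experiment_data(backend, experiment_id):
    """Stable endpoint: its signature stays fixed when the backend changes."""
    return backend.fetch(experiment_id)
```

In production this boundary is usually an HTTP API rather than a function call, but the contract is the same: data scientists and business intelligence users query the endpoint, and the data engineering team is free to rework the data model behind it.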
· Insufficient data completeness: Data mining projects can encounter data consistency and completeness issues that can be avoided by directed data collection efforts dedicated to answering specific problem statements. Prescribed data collections ensure that the collected data is representative and of sufficient consistency and completeness such that predictive models developed from the data will generalize well to a production environment. Specifically, these collections ensure that: the same variables are recorded across experiments; the correct sensors are employed to record data; sensor settings are consistent across experiments and appropriately selected to avoid saturation; data is collected at sufficient frequency to resolve dynamics of interest; data streams can be synchronized across sensor readings from multiple devices; and both inputs and outputs are varied to span the expected future operating space. In a data mining project, deficiencies in any of these areas can substantially degrade predictive model development and performance.
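A couple of the checks above can be automated cheaply. The sketch below (with illustrative data structures and thresholds) verifies that the same variables were recorded across experiments and that a signal was sampled fast enough to resolve the dynamics of interest.

```python
def check_variable_consistency(experiments):
    """Return, per experiment, the variables missing relative to the union
    of all variables seen across experiments."""
    all_vars = set().union(*(set(e["columns"]) for e in experiments))
    return {e["name"]: sorted(all_vars - set(e["columns"]))
            for e in experiments if all_vars - set(e["columns"])}

def check_sampling_rate(timestamps, min_hz):
    """True if the median sampling rate meets the required frequency."""
    gaps = sorted(t2 - t1 for t1, t2 in zip(timestamps, timestamps[1:]))
    median_gap = gaps[len(gaps) // 2]
    return (1.0 / median_gap) >= min_hz
```

Running checks like these as data is collected, rather than after the campaign ends, gives the team a chance to re-run a deficient experiment while the equipment is still set up.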
· Insufficient data accuracy: Even directed data collection efforts can suffer from a host of conditions that negatively impact data accuracy (equipment failures, erroneous sensor readings, or human error in experimental setup, such as sensor settings or experimental conditions, etc.). Ensuring that data inaccuracies do not negatively impact model accuracy typically requires development of a data curation pipeline to enable review of collected data for accuracy. Data curation efforts can be conducted by data scientists with the aid of subject matter experts. An initial round of data curation should be performed early in the data collection process to quickly identify issues with data accuracy.
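A first pass of the curation pipeline described above often amounts to range checks against physically plausible limits. In this sketch the variables and limits are hypothetical; note that out-of-range readings are flagged for subject-matter-expert review rather than dropped automatically, since an “impossible” reading sometimes reveals a real equipment or setup problem worth understanding.

```python
# Hypothetical plausibility limits, set with subject matter experts.
PLAUSIBLE_RANGES = {
    "temperature_c": (-40.0, 150.0),
    "pressure_kpa": (0.0, 500.0),
}

def flag_for_review(readings):
    """Return (accepted, flagged); flagged readings carry the reason."""
    accepted, flagged = [], []
    for r in readings:
        lo, hi = PLAUSIBLE_RANGES[r["variable"]]
        if lo <= r["value"] <= hi:
            accepted.append(r)
        else:
            flagged.append((r, f"{r['variable']}={r['value']} outside [{lo}, {hi}]"))
    return accepted, flagged
```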
The final interaction of importance is that between data engineers and data scientists, which results in the development of a data science pipeline. As detailed previously, a data science pipeline facilitates data access for model development purposes and establishes a pathway for the resulting models to be implemented in production. However, in the absence of interaction with subject matter experts, such efforts are likely to fail due to:
· Absence of a data management system: The lack of a defined data management system and directed data collections typically means that data for model development will be extracted from the default storage mechanism of choice: a variety of spreadsheets. As these data sources are typically maintained by multiple individuals who are interested in different problem statements, the data extracted from these sources typically have substantial consistency, completeness and correctness issues that can completely derail model development efforts.
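The spreadsheet problem usually surfaces as the same quantity recorded under a different header, unit, or convention by each maintainer. The alias map below is an illustrative sketch of the manual reconciliation this forces; the deeper point is that building and maintaining such a map by hand rarely scales, which is why a defined data management system matters.

```python
# Hypothetical canonical schema and the ad-hoc names found in spreadsheets.
CANONICAL = {"temperature", "pressure", "timestamp"}
COLUMN_ALIASES = {
    "temp": "temperature",
    "Temperature (C)": "temperature",
    "press": "pressure",
    "time": "timestamp",
}

def harmonize_headers(headers):
    """Map one spreadsheet's headers to canonical names.

    Returns (canonical_headers, unmapped); anything unmapped needs a human
    to decide what it actually is."""
    canonical, unmapped = [], []
    for h in headers:
        mapped = COLUMN_ALIASES.get(h, h)
        if mapped in CANONICAL:
            canonical.append(mapped)
        else:
            unmapped.append(h)
    return canonical, unmapped
```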
· Lack of a defined problem statement: Subject matter experts have a better understanding of the problem statements whose solutions have value to both them and the organization. Their expertise in the field can ensure that the models developed are fit to purpose and produce novel insights rather than reproduce prior knowledge. Engaging subject matter experts in model development can also create a vested interest in the resulting product, which can substantially increase the chances of model adoption. Model adoption by the target audience is critical in creating a data science product, as even the best predictive model cannot provide value if it isn’t being used.
In my experience, successful data science products result when a project team composed of data scientists, data engineers and subject matter experts works in an integrated fashion to address each of the pitfalls previously identified. The interactions between these groups result in a natural data flow from the process/subject matter experts through the data management system to the data scientist, with the algorithm developed from the data being deployed into the production environment by the data engineering team to generate value for the subject matter experts through its use. It is important to create a scaffold of this entire flow early in product development, even if the resulting product is minimal in the beginning (i.e. a baseline model), to form the basis for future development and help identify potential blocking issues. Building a data science product is typically an iterative process; this initial scaffolding can subsequently be used to move from a development to production platform during these iterations of product development. When designed and implemented correctly, each part of the team can develop their parts of the platform in an agile fashion without negatively impacting the overall flow from data to product.
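The scaffold idea can be made concrete with a deliberately trivial end-to-end flow. Every stage below is a placeholder (the mean-predictor “model” especially so), but each team owns a stage from day one and can improve its piece without breaking the path from raw data to a served prediction.

```python
def ingest(raw_rows):
    """Data engineering stage: normalize raw records into typed rows."""
    return [{"x": float(r["x"]), "y": float(r["y"])} for r in raw_rows]

def train_baseline(rows):
    """Data science stage: a mean predictor as a placeholder model."""
    mean_y = sum(r["y"] for r in rows) / len(rows)
    return lambda x: mean_y

def deploy(model):
    """Deployment stage: wrap the model behind a stable call signature."""
    return {"predict": model}

# End-to-end exercise of the scaffold with toy data.
rows = ingest([{"x": "1", "y": "2.0"}, {"x": "2", "y": "4.0"}])
product = deploy(train_baseline(rows))
```

Each stage can later be replaced independently, e.g. `ingest` by the real data management platform, `train_baseline` by the production model, and `deploy` by the actual serving environment, while the overall flow keeps working at every iteration.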
Ultimately, the failure of a data science project to evolve into a product can typically be traced back to failures at the intersections of these interactions. Knowing the essential components and potential pitfalls of data science projects in advance will hopefully help your team in developing future successful data science products.