Moving from project to product: keys to developing successful data science products

The goal of a data scientist is to develop predictive models that are fit for purpose, providing accurate, actionable insight within the constraints of the production environment for which they were designed. Developing a data science project into a data science product is about more than improving model accuracy; it's about building a sustainable ecosystem that enables project results to provide continual value. Working on a number of successful (and unsuccessful) data science projects, ranging from proofs of concept through commercial products, has provided a wealth of learning experiences regarding the critical components of a successful data science product, as summarized in the Venn diagram below. In the rest of this article, we'll examine the interactions between each of the components of the diagram and how the outputs of these interactions can be viewed holistically to drive ideas from project to product.

Venn diagram illustrating the resources employed in creating a data science product and the outputs of the interactions between these resources.

Many data science projects begin with an idea: what if we could do ___? While these ideas can come from anywhere, the most productive ones often come from a series of interactions and discussions between subject matter experts and data scientists. These discussions help refine the idea to one or more specific problem statements, which are subsequently evaluated for potential value and associated data requirements. The result of these interactions is a proof of concept/proof of value, as represented by the overlapping region of these two groups in the diagram. A data scientist will work with a data snapshot, often manually pulled together from multiple data sources by subject matter experts, to evaluate how that data can be used to answer the problems of interest. Moving a data science project forward beyond this proof of concept stage requires:

· Proof of value and an organizational champion: A predictive model not only needs to show that it can provide value (insight, improved efficiency, competitive advantage, etc.), but that value must be effectively communicated both up and down the organizational structure. Each project needs a champion with sufficient influence within the organization to drive it through approval channels for funding, resources and future adoption. While appropriately designed projects attempt to minimize disruptions in workflow for those downstream, value must also be clear to those whose workflow may need to be modified to ensure that the necessary data is collected appropriately for model development. Jointly establishing product requirements and evaluative metrics of project success with all stakeholders early in product development serves to focus subsequent data collection, data management and data science efforts required for a successful product.

· A sustainable data management pipeline: Moving a model beyond proof of concept requires collecting additional data representative of expected use conditions. Machine learning models that form the heart of data science products are built from data; regardless of the algorithm employed, the resulting model can only be as good as the data collected for training. This requires establishing a structured data pipeline so that the collected data is of sufficient relevance, consistency and completeness for model development. Ensuring data quality along all of these axes is critical both for building quality models and for accurately evaluating product performance against previously established success metrics (a minimal ingestion check along these lines is sketched after this list). Additionally, a functional data management pipeline establishes a pathway for continual model improvement as future data is collected.

· A deployment environment: Whether the production model is deployed on a server or an external device, end users need a way to easily access model results for the model to produce value. Because models need to evolve with the data collected, both the deployment environment and the predictive algorithm should be designed in a modular fashion to easily enable future model updates in production (see the interface sketch after this list). From a data scientist's perspective, the deployment environment can provide critical constraints, such as memory or processing power limitations, which will influence the data processing and modeling algorithms employed. Specifications on the deployment environment should be established early in project development to ensure that the predictive algorithm will function within the production environment.
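
First, relating to the data management pipeline above: a minimal sketch, assuming a tabular data store and a hypothetical schema, of the kind of ingestion check that can enforce consistency and completeness before a new batch of data enters model development. The column names, types and thresholds are placeholders, not a prescription for any particular dataset.

```python
import pandas as pd

# Hypothetical schema for an incoming batch of training data:
# column name -> (expected dtype, allowed minimum, allowed maximum)
EXPECTED_SCHEMA = {
    "sensor_temperature": ("float64", -40.0, 150.0),
    "sensor_pressure": ("float64", 0.0, 500.0),
    "machine_id": ("object", None, None),
}

def validate_batch(df: pd.DataFrame) -> list[str]:
    """Return a list of data quality issues; an empty list means the batch passes."""
    issues = []
    for column, (dtype, lo, hi) in EXPECTED_SCHEMA.items():
        if column not in df.columns:
            issues.append(f"missing column: {column}")
            continue
        if str(df[column].dtype) != dtype:
            issues.append(f"{column}: expected {dtype}, got {df[column].dtype}")
        missing = df[column].isna().mean()
        if missing > 0.05:  # completeness threshold is an assumption
            issues.append(f"{column}: {missing:.1%} missing values")
        if lo is not None and (df[column].dropna() < lo).any():
            issues.append(f"{column}: values below expected minimum {lo}")
        if hi is not None and (df[column].dropna() > hi).any():
            issues.append(f"{column}: values above expected maximum {hi}")
    return issues
```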
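
On the deployment side, one way to keep the environment and algorithm modular is to hide the model behind a small, stable interface so that a retrained model can be swapped into production without changing the calling code. The sketch below is a minimal illustration of that idea; the class and function names are assumptions, not an established API.

```python
from abc import ABC, abstractmethod
from typing import Any

import pandas as pd

class Predictor(ABC):
    """Stable interface that the deployment environment codes against."""

    @abstractmethod
    def predict(self, features: dict[str, Any]) -> float:
        ...

class BaselineModel(Predictor):
    """Initial model: simply returns the mean target value seen during training."""

    def __init__(self, training_mean: float):
        self.training_mean = training_mean

    def predict(self, features: dict[str, Any]) -> float:
        return self.training_mean

class SklearnModel(Predictor):
    """A retrained estimator can replace the baseline without touching callers."""

    def __init__(self, fitted_estimator):
        self.fitted_estimator = fitted_estimator  # e.g. any fitted scikit-learn regressor

    def predict(self, features: dict[str, Any]) -> float:
        return float(self.fitted_estimator.predict(pd.DataFrame([features]))[0])

def serve_prediction(predictor: Predictor, features: dict[str, Any]) -> float:
    # Deployment code depends only on the Predictor interface, so swapping in an
    # updated model is a configuration change rather than a code change.
    return predictor.predict(features)
```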

The collection of additional data required to drive model improvements illustrates the need for interaction between data engineers and subject matter experts, which results in the development of a data management platform. Subject matter experts can aid in identifying the data to be collected while data engineers formulate strategies for architecting distributed data storage, collection, processing and retrieval systems. Development of these strategies can be influenced by a number of factors, including but not limited to: expected data volumes and demand, existing collection and/or storage mechanisms (e.g. data silos), and scaling requirements. Data science projects that originate from the interaction between only these two groups are typically of the data mining variety, where problem statements are generated by identifying interesting patterns in the collected data. While building an effective data management platform is necessary for such projects, leveraging that platform to move towards a data science product also requires:

· A data pipeline for predictive model development: Model development requires a pipeline for moving data between the management platform and data scientists. Ideally, the data engineering team will establish API endpoints that enable end users to easily query data, which may be drawn from a variety of underlying sources, to support model development or business intelligence initiatives (a minimal example of such an endpoint is sketched after this list). In this manner, back-end changes can be made to data models without impacting how end users interact with the endpoints. The existence of multiple use cases for data access, driven by multiple user types, may also influence the architecture decisions required to efficiently support these endpoints.

· Sufficient data completeness: Employing directed data collection efforts dedicated to answering specific problem statements can help avoid the data consistency and completeness issues often present in data mining projects. Prescribed data collections ensure that the collected data is representative and of sufficient consistency and completeness such that predictive models developed from the data will generalize well to a production environment. Specifically, these collections ensure that the same variables are recorded across experiments; the correct sensors are employed to record data; sensor settings are consistent across experiments and appropriately selected to avoid saturation; data is collected at sufficient frequency to resolve the dynamics of interest; data streams can be synchronized across sensor readings from multiple devices; and both inputs and outputs are varied to span the expected future operating space (the validation sketch after this list automates several of these checks).

· Sufficient data accuracy: Even directed data collection efforts can suffer from a host of conditions that can impact data accuracy, such as equipment failures, erroneous sensor readings, or human error in experimental setup (e.g. incorrect sensor settings or experimental conditions). Developing a data curation pipeline that enables easy review of collected data for accuracy can prevent inaccurate data from polluting the model development pipeline; the validation sketch after this list flags suspect readings for exactly this kind of review. Data curation efforts that leverage this pipeline can be conducted by data scientists with the aid of subject matter experts. Performing an initial round of data curation early in the data collection process can quickly identify sources of data inaccuracy and associated corrective actions, ensuring both data and model integrity.
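
Relating to the data pipeline bullet above, the following is a minimal sketch of such an endpoint, assuming a Flask service in front of a SQLite store; the route, table and column names are hypothetical, and any web framework or warehouse client could play the same role.

```python
import sqlite3

import pandas as pd
from flask import Flask, jsonify, request

app = Flask(__name__)
DB_PATH = "measurements.db"  # hypothetical curated data store

@app.route("/experiments/<experiment_id>/measurements")
def get_measurements(experiment_id: str):
    """Return measurements for one experiment, optionally filtered by sensor.

    End users (data scientists, BI tools) depend only on this route; the storage
    behind it can be reorganized without breaking their queries.
    """
    sensor = request.args.get("sensor")  # optional query parameter
    query = "SELECT timestamp, sensor, value FROM measurements WHERE experiment_id = ?"
    params = [experiment_id]
    if sensor is not None:
        query += " AND sensor = ?"
        params.append(sensor)
    with sqlite3.connect(DB_PATH) as conn:
        df = pd.read_sql_query(query, conn, params=params)
    return jsonify(df.to_dict(orient="records"))

if __name__ == "__main__":
    app.run(port=5000)
```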
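
Relating to the completeness and accuracy bullets above, the sketch below automates a few of the checks described, assuming a per-experiment table of timestamped sensor readings. The required variable list, sampling period and sensor ranges are illustrative assumptions; in practice they would come from the data collection plan agreed with subject matter experts.

```python
import pandas as pd

# Assumed data collection plan (illustrative values only)
REQUIRED_VARIABLES = ["temperature", "pressure", "flow_rate"]
EXPECTED_SAMPLE_PERIOD_S = 1.0  # sampling period agreed for the study
SENSOR_RANGES = {"temperature": (-40.0, 150.0), "pressure": (0.0, 500.0)}

def review_experiment(df: pd.DataFrame) -> dict:
    """Summarize completeness and accuracy flags for a single experiment.

    Expects a 'timestamp' column plus one column per recorded variable.
    """
    report = {"missing_variables": [], "sampling_gaps": 0, "suspect_readings": {}}

    # Completeness: were all planned variables actually recorded?
    for var in REQUIRED_VARIABLES:
        if var not in df.columns:
            report["missing_variables"].append(var)

    # Completeness: was data collected at the agreed frequency?
    timestamps = pd.to_datetime(df["timestamp"]).sort_values()
    gaps = timestamps.diff().dt.total_seconds() > 2 * EXPECTED_SAMPLE_PERIOD_S
    report["sampling_gaps"] = int(gaps.sum())

    # Accuracy: flag out-of-range values (possible saturation or sensor faults)
    for var, (lo, hi) in SENSOR_RANGES.items():
        if var in df.columns:
            out_of_range = int(((df[var] < lo) | (df[var] > hi)).sum())
            if out_of_range:
                report["suspect_readings"][var] = out_of_range

    return report  # reviewed jointly by data scientists and subject matter experts
```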

The final interaction of importance is that between data engineers and data scientists, which results in the development of a data science pipeline. As detailed previously, a data science pipeline facilitates data access for model development and establishes a pathway for the resulting models to be implemented in production. Developing a data science product requires building upon this interaction by establishing:

· A functional data management system: The data for model development in this interaction is typically extracted from the default storage mechanism of choice: a variety of spreadsheets. In the absence of data collections directed at specific problem statements, these data sources are typically maintained by multiple individuals, each interested in different problem statements with different data requirements. Establishing a defined data management system can eliminate the consistency, completeness and correctness issues that arise from this ad hoc arrangement and unlock the potential value of the underlying data in subsequent modeling efforts (the consolidation sketch after this list illustrates a first step in that direction).

· A defined problem statement: Subject matter experts have a better understanding of which problem statements, if solved, have value to both them and the organization. Their expertise in the field can ensure that the models developed are fit for purpose and produce novel insights rather than reproduce prior knowledge. Engaging subject matter experts in model development can also create a vested interest in the resulting product, which can substantially increase the chances of model adoption. Model adoption by the target audience is critical in creating a data science product, as even the best predictive model cannot provide value if it isn't being used.
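
As a sketch of that first consolidation step, assuming the ad hoc spreadsheets can be matched by a file pattern and a hand-built column mapping, the code below merges them into a single canonical table in a lightweight database. The pattern, mapping and table name are placeholders for illustration.

```python
import glob
import sqlite3

import pandas as pd

# Hypothetical mapping from inconsistent spreadsheet headers to canonical names
COLUMN_MAP = {"Temp (C)": "temperature", "temp_c": "temperature",
              "Pressure": "pressure", "press_kpa": "pressure",
              "Run ID": "experiment_id", "run": "experiment_id"}

def consolidate_spreadsheets(pattern: str = "experiments/*.xlsx",
                             db_path: str = "measurements.db") -> int:
    """Merge ad hoc spreadsheets into one canonical 'measurements' table."""
    frames = []
    for path in glob.glob(pattern):
        df = pd.read_excel(path)
        df = df.rename(columns=COLUMN_MAP)  # harmonize column names
        df["source_file"] = path            # keep provenance for later curation
        frames.append(df)
    if not frames:
        return 0
    combined = pd.concat(frames, ignore_index=True)
    combined = combined.drop_duplicates()   # remove copy-paste duplicates
    with sqlite3.connect(db_path) as conn:
        combined.to_sql("measurements", conn, if_exists="replace", index=False)
    return len(combined)
```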

In my experience, successful data science products result when a project team composed of data scientists, data engineers and subject matter experts works in an integrated fashion. The interactions between these groups result in a natural data flow from the process/subject matter experts through the data management system to the data scientist, with the algorithm developed from the data being deployed into the production environment by the data engineering team to generate value for the subject matter experts through its use. It is important to create a scaffold of this entire flow early in product development, even if the resulting product is minimal at first (e.g. a baseline model), to form the basis for future development and help identify potential blocking issues; a minimal sketch of such a scaffold follows below. Building a data science product is typically an iterative process, and this initial scaffolding can subsequently be used to move from a development platform to a production platform as the product iterates. When designed and implemented correctly, each part of the team can refine its part of the platform in an agile fashion without negatively impacting the overall flow from data to product.
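
One way to stand up that end-to-end scaffold before any sophisticated modeling is done: the minimal sketch below trains a trivial baseline model on whatever data the pipeline currently delivers, persists it, and exposes a prediction function with the signature the production model will eventually use. The feature and target names are assumptions for illustration.

```python
import joblib
import pandas as pd
from sklearn.dummy import DummyRegressor

FEATURES = ["temperature", "pressure"]  # placeholder feature names
TARGET = "yield"                        # placeholder target name
MODEL_PATH = "baseline_model.joblib"

def train_baseline(training_data: pd.DataFrame) -> None:
    """Fit and persist a mean-predicting baseline to exercise the full pipeline."""
    model = DummyRegressor(strategy="mean")
    model.fit(training_data[FEATURES], training_data[TARGET])
    joblib.dump(model, MODEL_PATH)

def predict(features: dict) -> float:
    """Prediction entry point with the same signature the production model will use."""
    model = joblib.load(MODEL_PATH)
    return float(model.predict(pd.DataFrame([features], columns=FEATURES))[0])

if __name__ == "__main__":
    # Data would come from the data management pipeline; a tiny frame stands in here.
    df = pd.DataFrame({"temperature": [20.0, 25.0], "pressure": [101.0, 99.0],
                       "yield": [0.8, 0.9]})
    train_baseline(df)
    print(predict({"temperature": 22.0, "pressure": 100.0}))
```

Each subsequent iteration can then replace the baseline with an improved model without changing how the rest of the team interacts with the scaffold.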

Ultimately, the successful evolution of a data science project into a product is driven by subject matter experts, data engineers and data scientists working in an integrated fashion to produce value from data through the use of predictive models. Hopefully, knowing these essential components in advance will help your team develop successful data science products in the future.