Developing safe and reliable ML products at 23andMe

Manoj Ganesan, Software Architect at 23andMe

23andMe Engineering
Dec 20, 2021


ML products at 23andMe

Data and Machine Learning are core components of 23andMe’s Health + Ancestry and 23andMe+ services.

Since 23andMe launched in 2007, over 80% of our approximately 12 million customers have consented to participate in research and contribute their genetic and self-reported survey data to help the advancement of science. We use cutting-edge machine learning techniques to generate novel insights about their health and ancestry.

As we invest in continuously improving these products, an ever-increasing, diverse customer base feeds more data into our systems, helping us generate even more meaningful discoveries that lead to innovative products, which in turn attract a wider audience of customers. This “data flywheel” is designed to deliver continuous, sustainable value to our customers.

Machine Learning (ML) has been a core ingredient in our Ancestry + Traits Service, for instance to power granular ancestry regions in our Ancestry Composition report.

In the Health + Ancestry and 23andMe+ services, we leverage polygenic modeling to provide reports on conditions like Type 2 Diabetes, HDL Cholesterol, Severe Acne and Gallstones. The size and diversity of the 23andMe database enable construction of large-scale, accurate multi-ethnic prediction models.

Building safe and reliable ML models while maintaining the flexibility and velocity of delivering continuous value to customers is paramount. Safety and reliability are essential given the sensitive nature of our consumer products and the customer data that powers them.

This post discusses the high-level challenges of building safe and reliable ML products at scale, particularly at 23andMe. It also addresses the organizational and technical choices we’ve made to solve them. Deeper technical details will be forthcoming in subsequent blog posts.

Challenges in developing ML products

Compared to shipping traditional software, which has existed and evolved for decades, the craft of creating software products based on Machine Learning is relatively nascent. A common pattern is for black-box ML models to be thrown over the fence to engineers, who then build systems that serve predictions in production. In our experience shipping ML products at 23andMe, it’s increasingly clear that both engineers and scientists developing models should be aware of the infrastructure and product context (failure modes, performance SLAs, compliance concerns, etc.) in which the models are trained and deployed.

The MLOps paradigm advocates for standards and practices to ensure safe and continuous delivery of ML products. Many of the principles and standards we adopt are inspired by this ecosystem and set of practices, and in particular by a couple of early influential papers authored by Google that discuss technical debt in ML systems.

As we observe it, there are a few broad challenges when building production Machine Learning systems. They lie across the spectrum of technical to organizational in nature, and addressing them is both a technical and an organizational concern.

Difficulty testing ML systems

  • Compared to traditional software, defining correct behavior in ML systems a priori is hard because they depend on data and models in addition to code. ML systems have many moving parts, and degradation in any one of them can degrade model performance and, in turn, lead to faulty products. Widely accepted modern change management practices are designed for traditional software and fail in ML scenarios.

Organizational and technical fragmentation

  • Data Science is often a separate organization from Engineering and individuals in these teams usually have different training, culture, vocabularies, goals, and incentives. To consistently deliver value to customers, these competing concerns have to be reconciled under a coherent vision.
  • Siloed organizations often end up building fragmented systems that don’t speak well with each other. In the case of fragmented training and serving systems, model artifacts may not work as expected at serving time, against production features and data services.

Conflation of key activities in ML systems

  • “ML Research” can be conflated with “ML Engineering.” In the course of building, maintaining, and improving systems, the team should address certain questions like:
    - “Which solver performs best?”
    - “How should we normalize these input features?”
    - “How should we select features for our models?”
  • Finding the answers to these questions is called “ML Research” and, in moderation, can be a valuable activity for a product-focused team. “ML Research” is a different activity, with different constraints, from implementing a well-tested, observable, maintainable ML system whose design is informed by the answers to those questions. “ML Research” requires flexibility to quickly answer a question. “ML Engineering” requires automation and reliability to relentlessly improve the system day after day, year after year.

Interdisciplinary demands of talent

  • Since software is infinitely malleable, it will only do what it is intended to do if its author deeply understands the intended behavior. Good software engineers who understand ML well are rare.

Challenges in developing ML systems at 23andMe

In addition to the aforementioned general challenges of developing ML systems, the nature of the personal genomics products we ship at 23andMe imposes further challenges that we’ve had to solve over the years.

Privacy and Compliance

  • We consider ourselves stewards of our customers’ data and always act in their best interest. We’ve promised our customers transparency, access, and control over their data, and we don’t just stop our work at legal privacy compliance (such as the GDPR and CCPA). We make sure that we centrally track all data we have about a customer. We ensure that participation in research is entirely opt-in, via consent. Customers can see what data we have about them by downloading it, and can permanently delete it, from their Account Settings. We build systems that are designed to respect our customers’ privacy, and we have a culture where we do those things even if they’re inconvenient.

Data security

  • Given the sensitivity of our customer data, we have very strict controls on who and what can access customers’ individual-level data. This imposes access-control restrictions on the systems we design, which can be a challenge considering the data-coupling inherent in ML systems.

Specialized domain knowledge

  • In addition to the requirements of traditional data science, the products we build also require deep technical expertise in the field they’re modeling. We’re often at the leading edge of science, operating at the largest scale, delivering direct-to-consumer products. Building these systems by bringing together a set of individuals with the right mix of expertise across engineering, genomics research, and medicine is hard!

Desirable properties of ML systems

We treat both ML prediction and training components as parts of a single end-to-end pipeline, as opposed to treating the ML model simply as a black box of weights. Just as with the other software systems we manage, we expect this end-to-end system to be “engineered” to meet certain desirable properties.

Data Security

  • As mentioned earlier, data security is of paramount importance at 23andMe. Training systems adhere to security best practices, and are built using processes like “privacy by design” at various stages of their development. Data access controls for artifacts generated by the training pipeline are clearly identified and implemented.

Customer Privacy & Compliance

  • All data generated by training systems complies with privacy regulations such as the GDPR and CCPA and with security standards like ISO 27001.

Feature Access and Data Lineage

  • Feature code is version controlled and common across training and prediction use-cases. All features are thoroughly tested.
  • Feature access is centralized using well known production data services, which in turn have the standard security and privacy controls.
  • Feature expectations (like expected distributions) are captured in a schema and continuously validated.
  • Features can be added quickly — the turnaround time to downstream training using updated features is minimal.
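
The idea of capturing feature expectations in a schema and validating them continuously can be sketched roughly as follows. This is a minimal illustration rather than 23andMe's implementation: the `FeatureSpec` class, the `validate_batch` function, the feature names, and all bounds are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass(frozen=True)
class FeatureSpec:
    """Hypothetical expectation for one numeric feature."""
    min_value: float
    max_value: float
    mean_low: float   # lower bound on the batch mean
    mean_high: float  # upper bound on the batch mean


def validate_batch(batch: Dict[str, List[float]],
                   schema: Dict[str, FeatureSpec]) -> List[str]:
    """Return human-readable violations; an empty list means the batch passes."""
    violations = []
    for name, spec in schema.items():
        values = batch.get(name)
        if not values:
            violations.append(f"{name}: missing or empty")
            continue
        # Hard range check on individual values.
        if min(values) < spec.min_value or max(values) > spec.max_value:
            violations.append(
                f"{name}: value outside [{spec.min_value}, {spec.max_value}]")
        # Coarse distribution check on the batch mean.
        batch_mean = mean(values)
        if not spec.mean_low <= batch_mean <= spec.mean_high:
            violations.append(
                f"{name}: mean {batch_mean:.2f} outside expected range")
    return violations


# Illustrative schema; the names and bounds are made up.
SCHEMA = {
    "age": FeatureSpec(min_value=18, max_value=120, mean_low=30, mean_high=60),
    "bmi": FeatureSpec(min_value=10, max_value=80, mean_low=20, mean_high=35),
}
```

Running a check like this on every feature batch, and failing the pipeline when the returned list is non-empty, is one way to catch upstream data changes before they reach training.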

Model and Pipeline Lineage

  • Model pipelines are fully encapsulated in peer-reviewed, version-controlled workflows.
  • Model pipelines are reproducible, maintainable, and continuously improvable.
  • Model artifacts generated by training systems are tracked in a centralized service along with other context about the particular run/experiment.

Code and Infrastructure Lineage

  • Training and prediction/serving code lives in the same repository. All infrastructure is codified as infrastructure templates and can be reproduced with ease.

Local Development, CI/CD

  • Code is well structured, comprehensively tested, and runnable end-to-end on a laptop. It is deployed using CI/CD practices so dozens of people can work together, delivering value safely and consistently.
  • Models are developed locally, using local fixtures. One can train a model using local fixtures and get a prediction from the prediction service, all on a laptop.
  • Models can be canaried before serving. Serving models can be rolled back if needed.
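
One common way to canary a model is to route a small, deterministic share of requests to the candidate and the rest to the stable model. The sketch below assumes that approach; it is not a description of 23andMe's serving stack, and the function names and model labels are hypothetical.

```python
import hashlib


def routes_to_canary(request_id: str, canary_percent: int) -> bool:
    """Deterministically map a request to a bucket in [0, 100).

    Hashing the request ID keeps routing stable across retries. The modulo
    introduces a tiny bias, which is acceptable for a sketch.
    """
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:2], "big") % 100
    return bucket < canary_percent


def pick_model(request_id: str, stable: str, canary: str,
               canary_percent: int) -> str:
    """Choose which model version should serve this request."""
    return canary if routes_to_canary(request_id, canary_percent) else stable
```

With this shape, rolling back amounts to setting `canary_percent` to 0, and full promotion to setting it to 100.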

Model Quality

  • All trained models are demonstrably correct in their target deployment environment. Model metrics across relevant test/validation sets are tracked and easily observable.
  • Model quality across various metrics (e.g., does a health risk model work well across all ethnicities, does the model exhibit biological plausibility, etc.) is automatically validated as part of the training process before the model is available for serving.
  • The impact of model staleness is understood and continuously monitored.
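
An automated quality gate of the kind described above might look roughly like this. The sketch uses plain accuracy as a stand-in metric and placeholder group labels; the actual metrics, groups, and thresholds used at 23andMe are not stated in this post, so everything named here is an assumption.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# (group label, true label, predicted label); the groups are placeholders.
Example = Tuple[str, int, int]


def accuracy_by_group(examples: Iterable[Example]) -> Dict[str, float]:
    """Compute accuracy separately for each group in the validation set."""
    correct: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for group, y_true, y_pred in examples:
        total[group] += 1
        correct[group] += int(y_true == y_pred)
    return {group: correct[group] / total[group] for group in total}


def failing_groups(examples: Iterable[Example], threshold: float) -> List[str]:
    """Groups whose accuracy is below the threshold; empty means the gate passes."""
    scores = accuracy_by_group(examples)
    return sorted(group for group, score in scores.items() if score < threshold)
```

A training workflow could call `failing_groups` as its final step and refuse to register the model for serving whenever the result is non-empty, so that a model performing well on average but poorly for one group never ships.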

Continuous Improvement at low marginal cost

  • Models can be easily re-trained to incorporate new data at low marginal cost. No significant engineering work is required to ship such incrementally better models.
  • Model methodology and surrounding infrastructure are continuously and safely improvable. The work required to ship such improvements to customers is well understood and encapsulated in these systems.
  • ML workflows are flexible enough to incorporate different modeling methodologies. One should be able to switch a solver in the end-to-end system with a single pull request.
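
One way to make a solver swap a one-line change is to register solvers by name and select one from version-controlled configuration. The sketch below assumes that pattern with a toy one-parameter model; the registry, the solver names, and `CONFIG` are hypothetical and not taken from 23andMe's codebase.

```python
from typing import Callable, Dict, Sequence

# A "solver" here fits the toy model y ≈ w * x and returns w.
Solver = Callable[[Sequence[float], Sequence[float]], float]

SOLVERS: Dict[str, Solver] = {}


def register(name: str):
    """Decorator that adds a solver to the registry under the given name."""
    def decorator(fn):
        SOLVERS[name] = fn
        return fn
    return decorator


@register("closed_form")
def closed_form(xs, ys):
    """Least-squares slope through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


@register("gradient_descent")
def gradient_descent(xs, ys, lr=0.01, steps=2000):
    """Minimize mean squared error with plain gradient descent."""
    w = 0.0
    n = len(xs)
    for _ in range(steps):
        grad = sum(2 * x * (w * x - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad
    return w


# In a real pipeline this would live in a version-controlled config file,
# so switching solvers is a one-line diff in a pull request.
CONFIG = {"solver": "closed_form"}


def train(xs, ys, config=CONFIG):
    """Fit the model with whichever solver the config names."""
    return SOLVERS[config["solver"]](xs, ys)
```

Because the rest of the pipeline only calls `train`, the validation and serving steps stay unchanged when the configured solver changes.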

Performance and Cost Observability

  • Repeated cost-effective runs are essential to experimentation and continuous improvement. All ML pipelines are built for our scale and are inexpensive to run.
  • Pipelines are deployed in a continuously audited environment and are observable with respect to their performance and cost characteristics.
  • Customer-facing prediction systems have clear SLAs and KPIs, which are monitored continuously. Every change to the system is measured in terms of any impact on prediction latency.

Model monitoring

  • Production models are continuously monitored for performance regressions (for example, due to model staleness or feature drift). This monitoring informs when to re-train models, which in turn is easy.

Debuggability

  • Given a particular prediction data point in production, it is possible and not overly cumbersome to trace back to the specific training run(s) that generated the specific model used to generate that prediction. All run parameters and model metrics are conveniently located alongside the run to aid in such debugging.
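
Tracing a prediction back to its training run depends on every prediction carrying the version of the model that produced it. The following is a minimal in-memory sketch of that bookkeeping; the class and field names are hypothetical, and a production system would typically delegate this to an experiment tracker rather than a dictionary.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class TrainingRun:
    run_id: str
    params: Dict[str, str]
    metrics: Dict[str, float]


@dataclass(frozen=True)
class PredictionRecord:
    prediction_id: str
    model_version: str  # stamped on every prediction at serving time
    score: float


class LineageIndex:
    """Maps model versions back to the training runs that produced them."""

    def __init__(self) -> None:
        self._runs: Dict[str, TrainingRun] = {}
        self._model_to_run: Dict[str, str] = {}

    def record_run(self, run: TrainingRun, model_version: str) -> None:
        """Register a finished run and the model version it produced."""
        self._runs[run.run_id] = run
        self._model_to_run[model_version] = run.run_id

    def trace(self, prediction: PredictionRecord) -> TrainingRun:
        """Return the run, with its params and metrics, behind a prediction."""
        return self._runs[self._model_to_run[prediction.model_version]]
```

The key design choice is that the model version is written at prediction time, so the lookup never has to guess which model was live when a given prediction was served.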

Operational efficiency

  • Infrastructure is codified and built so it’s not operationally burdensome to maintain.

The canonical ML system at 23andMe

While the particularities of every ML pipeline call for their own tweaks, we’ve settled on the following general template for developing end-to-end ML pipelines. Following a consistent template across ML pipelines helps us build on expertise and tooling we already have, and is crucial to executing with our small teams.

Following are a few salient features of the canonical system.

Code and Infrastructure management

  • ML training and serving code lives in the same code repository.
  • Code is deployed and run inside Docker containers.

Feature management

  • Features are co-developed with Data Science, and feature code is centralized and lives in a single (separate from the ML pipeline) code repository. New features are continuously deployed to feature data serving systems.
  • The feature deployment flow accounts for the iterative nature of ML by minimizing the wait before training runs can be kicked off. The feature “serving” infrastructure may differ between use-cases, to optimize for row-major versus column-major access, but the feature code itself is exactly the same (as in, the exact same Python package is used) across training and prediction.
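
Sharing feature code across row-major and column-major access paths can be illustrated with a single feature definition wrapped by two thin adapters. This is a sketch under assumptions: the feature (`bmi`), the field names, and the adapter functions are invented for illustration and are not 23andMe's actual feature package.

```python
from typing import Dict, List


def bmi(weight_kg: float, height_m: float) -> float:
    """Single source of truth for the (hypothetical) feature definition."""
    return weight_kg / (height_m ** 2)


def bmi_for_row(row: Dict[str, float]) -> float:
    """Serving path: one customer's record at a time (row-major access)."""
    return bmi(row["weight_kg"], row["height_m"])


def bmi_for_columns(columns: Dict[str, List[float]]) -> List[float]:
    """Training path: whole columns at once (column-major access)."""
    return [bmi(w, h)
            for w, h in zip(columns["weight_kg"], columns["height_m"])]
```

Because both adapters call the same `bmi` function, a change to the feature definition reaches training and serving together, which removes one common source of training/serving skew.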

Workflow management

  • Our complex, multi-stage workflows are codified and orchestrated using Metaflow. The exact same code/workflows are run locally (using fake data/fixtures) and then deployed to production (in AWS).
  • Intermediate, short-lived Metaflow artifacts in production can be loaded into a Jupyter notebook in the exploratory R&D environment. These intermediate artifacts are automatically cleaned up using S3 expiration policies.

Model and Experiment management

  • Using workflow specification files in the repository, a scientist or engineer can kick off production runs via Jenkins jobs. They would also have the appropriate privileges to monitor the jobs in AWS or in other logging/observability infrastructure.
  • MLFlow is used as the central model store (for serialized models and other artifacts like metrics and reports) as well as the experiment viewer. In addition to offering a convenient UI to introspect runs/experiments, MLFlow also aids in the “which experiment trained the model that generated this particular prediction?” debuggability story.
  • After the appropriate validation (for instance, “does a trained health risk model pass the performance threshold across all ethnicities”), a trained model in MLFlow can be explicitly promoted, which makes it servable using our scalable, highly-available, customer-facing prediction endpoint, with no additional technical work.

Data management

  • Data artifacts are stored in AWS (typically, S3), fronted by MLFlow or Metaflow. We keep the tracking of individual-level data in these ML pipelines to an absolute minimum, and when we do track it, it is automatically cleaned up using techniques like expiration policies.
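
An S3 expiration policy is expressed as a bucket lifecycle configuration. The helper below builds one such configuration as a plain dictionary; the prefix, retention period, and bucket name are hypothetical examples, and the post does not state which values 23andMe uses.

```python
from typing import Any, Dict


def expiration_lifecycle(prefix: str, days: int) -> Dict[str, Any]:
    """Build an S3 lifecycle configuration that expires objects under a prefix."""
    if days < 1:
        raise ValueError("days must be a positive integer")
    return {
        "Rules": [
            {
                "ID": f"expire-{prefix.strip('/').replace('/', '-')}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


# Applying it would look roughly like this (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="example-ml-artifacts",  # hypothetical bucket name
#       LifecycleConfiguration=expiration_lifecycle("metaflow/tmp/", 14),
#   )
```

Keeping the rule in code means the retention period is reviewed and versioned like any other change, rather than configured by hand in a console.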

Prediction service

  • The customer-facing prediction service is deployed (as a blue/green deployment) to AWS Fargate, is continuously monitored, and managed by Engineering on-call rotations.

Production ML Culture at 23andMe

When building ML systems, the organizational and cultural milieu is just as important as the building of the system itself, and the two support one another. At 23andMe, we build safe and reliable ML systems that meet the aforementioned desirable properties by maintaining a high-functioning Engineering organization and building these systems together with Data Science as embedded teams in the same repositories.

Engineering culture

  • Our AWS-native engineering teams have deep expertise in, and a culture of, building large-scale production data systems that adhere to our strict data security and privacy standards, along with the other desirable properties mentioned above. As noted elsewhere, only a small fraction of real-world ML systems is composed of the ML code; the surrounding infrastructure is vast and complex. In-house expertise in building such complex systems is essential in meeting the standards for ML-specific pipelines as well.

Embedded teams

  • Engineering and Data Science work together as embedded teams in the same repositories, following Engineering principles and practices to prioritize projects and develop these systems, while continuously building and deploying code to centralized, highly observable production environments.
  • We have found that having a separate team that “productionizes” models that were pre-built in a separate environment with separate tooling reduces overall velocity as one tries to reconcile the differences between these systems.
  • All the real work of building pipelines happens in a single consistent environment because we’ve found this works best, but we also have a separate, more flexible pattern for early-stage exploratory R&D.

Wrapping up

We’ve discussed some of the technical and organizational challenges in building ML systems, the desirable properties to keep in mind while building these systems, and the approaches we’ve taken to meet them.

Using the approaches mentioned in this post, we’ve successfully built systems that safely and reliably generate insights from our growing dataset. In addition to leveraging the growing dataset to deliver better models with little incremental cost, these systems and principles allow us to improve model methodology and surrounding infrastructure systematically. Well-maintained systems with these characteristics are essential to a successful “data flywheel” strategy.

While parts of these challenges can be solved in their own silos, we’ve found the best way to ship safe and reliable ML products consistently is by building maintainable and continuously improvable systems together, as embedded cross-functional teams, following engineering principles like CI/CD to adhere to the various standards we expect of our systems.

The principles and systems above give engineers and scientists a consistent framework for understanding each other’s perspectives when building systems, and ultimately for reaching the common goal of building safe and reliable ML products for customers.

We’re hiring!

If you’re interested in solving these problems, working in cross-functional teams, and building amazing products within a mission-based culture — we’re hiring!

Here are a few current openings in Machine Learning Systems and Infrastructure:

  1. The Machine Learning Engineering team is looking for a Sr. Software Engineer / Lead.
  2. The Feature Engineering team is looking for a Sr. Software Engineer.
  3. The Big Data infrastructure team is looking for a Tech Lead.

About the Author

Manoj Ganesan is a Software Architect leading teams that build production machine learning pipelines and supporting backend infrastructure at 23andMe.