At WW, our members interact with the brand and with other members in a multitude of ways, such as by using our social network, Connect, and by meeting and supporting others during in-person Workshops. One of the goals of the WW data science team is to personalize this experience to help boost members’ success. To do this, we deploy a variety of production data products, such as social media feed recommenders, who-to-follow recommenders, meal and recipe recommenders, membership models that identify at-risk members, and lifetime value (LTV) models that help the business identify valuable members.
While developing and planning these production models, we identified some common challenges that we aimed to address systematically: our post-processed feature data wasn’t large enough to warrant the I/O cost of out-of-memory workflow solutions; we didn’t have access to live or streaming features for our models; and, since we would be managing our own ML deployments, we needed to quickly grow a team of “full-stack” data scientists.
We set out to meet these requirements by designing our own machine learning workflow framework, Primrose (Production In-Memory Solution). Primrose is a simple Python framework for executing in-memory workflows defined as directed acyclic graphs (DAGs) via configuration files. Data in Primrose flows from one node to another without serialization, except when explicitly requested by the user. Primrose nodes are designed for simple batch-based machine learning workflows whose datasets are small enough to fit into a single machine’s memory.
We’re excited to announce that we’re open-sourcing Primrose, and in this post, we’ll explain some of the features and design decisions that went into our framework.
Primrose: framework design & features
Avoiding unnecessary serialization
Primrose keeps data in memory between task steps and only performs (de)serialization operations when explicitly requested by the user. Data is transported between nodes via a DataObject abstraction, which contextually delivers the correct data to each Primrose node at runtime. As a consequence of this design choice, Primrose runs on a single machine and can be deployed as a job within a single container, like any other Python script or cron job. In addition to operating on persistent data passed between nodes, Primrose can also call external services in a manner similar to a Luigi job. In this way, Spark jobs or Hadoop scripts can be triggered while the framework simply manages their dependencies.
- As a comparison… many solutions in this space are focused on long-running jobs that may be distributed across several computing nodes. Furthermore, in order to facilitate parallelization, save state for redundancy, and process datasets too large for memory, orchestrators often require data to be serialized between each workflow task. For smaller datasets, the I/O time associated with these steps can be much longer than the time spent in computation.
- Primrose is not… a solution that scales across clusters or a complex dependency management solution with dynamic DAGs (yet).
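To make the in-memory pattern concrete, here is a minimal sketch of the idea. The class and method names below are illustrative only, not Primrose’s actual API: a single dict-like data object carries every node’s output through the DAG, so nothing is serialized to disk between steps.

```python
# Illustrative sketch of in-memory data flow between DAG nodes.
# (Hypothetical names -- see the Primrose docs for the real interfaces.)

class DataObject:
    """Holds every node's output in memory, keyed by node name."""
    def __init__(self):
        self._store = {}

    def add(self, node_name, value):
        self._store[node_name] = value

    def get(self, node_name):
        return self._store[node_name]


class ReaderNode:
    name = "reader"

    def run(self, data_object):
        # In a real job this might read a CSV or query a database.
        data_object.add(self.name, [1.0, 2.0, 3.0])


class ScalerNode:
    name = "scaler"

    def run(self, data_object):
        # Pull the upstream node's output straight from memory.
        upstream = data_object.get("reader")
        data_object.add(self.name, [x * 10 for x in upstream])


def run_dag(nodes):
    """Run nodes in (already topologically sorted) order,
    threading one shared DataObject through all of them."""
    data_object = DataObject()
    for node in nodes:
        node.run(data_object)
    return data_object


result = run_dag([ReaderNode(), ScalerNode()])
print(result.get("scaler"))  # [10.0, 20.0, 30.0]
```

Because the data object never leaves process memory, the only I/O in a job like this is whatever the reader and writer nodes do explicitly.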
Batch processing for ML
Primrose was built to facilitate frequent batches of model training or predictions that must read from and write to multiple sources. Rather than requiring users to define their DAG structure in Python code, Primrose adopts a configuration-as-code approach. Primrose users implement node objects once; any DAG structural modifications or parameter changes are then made through JSON configuration files. This way, deployment changes to DAG operations (such as modifying a DAG to serve model predictions instead of training) can be handled purely through configuration files, avoiding the need to write new Python scripts for production modifications.
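As an illustration of the configuration-as-code idea, a DAG definition might look something like the sketch below. The section and node names here are hypothetical; consult the Primrose documentation for the exact schema.

```json
{
  "implementation_config": {
    "reader_config": {
      "read_data": {
        "class": "CsvReader",
        "filename": "data/members.csv",
        "destinations": ["split_data"]
      }
    },
    "pipeline_config": {
      "split_data": {
        "class": "TrainTestSplit",
        "test_fraction": 0.2,
        "destinations": ["train_model"]
      }
    },
    "model_config": {
      "train_model": {
        "class": "SklearnClassifierModel",
        "mode": "train",
        "destinations": []
      }
    }
  }
}
```

In a scheme like this, switching a deployment from training to serving predictions is a configuration edit (e.g., changing a mode flag and the downstream writer nodes) rather than a code change.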
Furthermore, Primrose nodes are based on common machine learning tasks to make data scientists’ lives easier. This cuts down on development time for building new models and maximizes code re-use among projects and teams. See the modeling examples in the source and documentation for more info!
- As a comparison… in Primrose, users simply specify in their configuration file that they want common ML operations to act on the DataObject. These ML operations can certainly be implemented by users in Luigi or Airflow, but we found operations such as test-train splits or classifier cross-validation to be so common that they warranted prebuilt, dedicated nodes. Prefect has made some great strides in this area, and we encourage users to check out their solution.
- Primrose is not… an auto-ml tool or machine-learning toolkit that implements its own algorithms. Any Python machine learning library can be used with Primrose, simply by building model or pipeline nodes that implement the user’s choice of library.
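As a sketch of what such a reusable node might look like, here is a self-contained test-train split node whose behavior is driven entirely by configuration. The class name and constructor signature are hypothetical, and plain Python lists stand in for real feature data:

```python
# Hypothetical reusable ML node: a test/train split parameterized
# entirely from configuration, so changing the split fraction is a
# config edit, not a code change. (Illustrative only -- see the
# Primrose docs for the real node interface.)
import random

class TrainTestSplitNode:
    def __init__(self, config):
        # All parameters come from the JSON configuration file.
        self.test_fraction = config["test_fraction"]
        self.seed = config.get("seed", 0)

    def run(self, rows):
        # Shuffle a copy deterministically, then slice off the test set.
        shuffled = rows[:]
        random.Random(self.seed).shuffle(shuffled)
        n_test = int(len(shuffled) * self.test_fraction)
        return {"test": shuffled[:n_test], "train": shuffled[n_test:]}


node = TrainTestSplitNode({"test_fraction": 0.25, "seed": 42})
splits = node.run(list(range(100)))
print(len(splits["train"]), len(splits["test"]))  # 75 25
```

Because the node holds no project-specific logic, the same implementation can be reused across every model that needs a split, with per-job fractions set in each job’s configuration file.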
Standardization of deployments: Primrose is meant to make deployment and model building as simple as possible. From a developer-operations perspective, it requires no external scheduler or cluster to run deployments. Primrose code can simply be containerized with a Primrose Python entrypoint and deployed as a job on Kubernetes (k8s) or any other container-management service.
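A container for such a job can be very small. In the sketch below, the paths, image base, and the `run_job.py` entrypoint script are all hypothetical; see the Primrose documentation for the actual invocation.

```dockerfile
# Hypothetical Dockerfile for a self-contained Primrose job.
# No scheduler or persistent cluster services are needed: the container
# runs the job once and exits, so it can be submitted as a k8s Job.
FROM python:3.9-slim

WORKDIR /app
RUN pip install primrose

# Job-specific node implementations and the DAG configuration file
# (paths are illustrative).
COPY src/ ./src/
COPY config/train_model.json ./config/

# run_job.py is a hypothetical entrypoint script that hands the
# configuration file to Primrose.
CMD ["python", "run_job.py", "--config", "config/train_model.json"]
```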
Standardization of development: From a software engineering perspective, another advantage of Primrose stems from the standardization of model and recommender code. Modifying feature engineering pipelines or adding recommender features comes down to extending self-contained Primrose nodes and updating a configuration file.
- As a comparison… Primrose can be leveraged as a piece of a larger ETL job (a Primrose job could be a task within an Airflow DAG), or run on its own as a self-contained, single-node ETL job. Some orchestration solutions (Airflow, for example) require running persistent clusters and services for managing jobs.
- Primrose is not… able to manage its own job scheduling or timing. This is left to the user, via k8s job scheduling or manual cron job assignments on a virtual machine.
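For example, on Kubernetes the scheduling can live in a CronJob that runs the job’s container on a timer. The image and job names below are hypothetical:

```yaml
# Hypothetical Kubernetes CronJob: retrain nightly at 02:00 UTC.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: primrose-train-job
spec:
  schedule: "0 2 * * *"
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: primrose
              image: my-registry/primrose-train-job:latest
```

On a plain virtual machine, an equivalent crontab entry pointing at the container (or the Python entrypoint directly) does the same work.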
There are many solutions in this space, and we encourage users to explore other options that may be most appropriate for their workflows. We view Primrose as a simple solution for managing production ML jobs.
Where can I find Primrose?
You can find out more about Primrose here:
- Source: https://github.com/ww-tech/primrose
- Docs: https://ww-tech.github.io/primrose/
- PyPI: https://pypi.org/project/primrose/
Feel free to start using the code and build your own nodes! The data science team at WW plans to keep adding useful ML nodes to Primrose and aims to build a large ecosystem of nodes so that many modeling and recommender jobs can be deployed purely through configuration files. Building Primrose has enabled our team to build lots of cool data products quickly, and we hope that it can help your team do the same!
— Michael Skarlinski, manager of data science, but built with love by the whole WW data science team
Interested in joining the WW team? Check out the careers page to view technology job listings as well as open positions on other teams.