Introducing Glassdoor’s ML Registry: A Centralized Artifact Management Solution

Rich Papalia
Glassdoor Engineering Blog
Aug 30, 2023

Glassdoor’s Machine Learning Engineering and Platform Team

As part of Glassdoor’s journey to becoming an ML-driven company, we’ve created an exceptional Machine Learning Engineering and Platform team (see our blog about building this team here), dedicated to developing the foundational infrastructure for all of our ML initiatives. We’ve strategically built our ML platform through a combination of buying, building, and adopting existing open-source solutions, an approach that lets us leverage the best of all worlds. Our team has built numerous tools from scratch, including data pipelines, human-in-the-loop tools, and more. We’ve also built on top of existing tools like AWS SageMaker to incorporate powerful in-house customizations into our feature store. We have more exciting plans in flight, such as building a new recommendations platform, making further open-source contributions from our platform, and continuing to build on the foundation we’ve established. Today, we’ll delve into our newly open-sourced ML Registry.

What is an ML Registry?

The ML lifecycle encompasses far more than just model development. Once a model is built, many questions arise: Where does it reside? How do we access it? What if updates or versioning are needed? Where can we store the metadata describing the model? And what about other non-model artifacts? How can we efficiently manage all of this? Enter the ML Registry. Glassdoor’s ML Registry is a centralized management service for ML artifacts and all related metadata. It serves as the single source of truth for all data pertaining to ML, enabling uniform and reliable access to this data across diverse teams and applications. It seamlessly integrates with other tools and services and provides robust, feature-rich functionality.

Choosing Between Buying and Building: What Differentiates Our ML Registry?

While several paid and open-source model registries already exist, we carefully assessed their offerings and found that no existing solution fully met our specific requirements. Although opting for a managed service could provide some advantages, such as out-of-the-box functionality that would require a substantial time investment to build ourselves, we decided to build our registry from scratch for several compelling reasons:

  1. Complete Customization — allowing our architects to design it according to our unique wants and needs. For example: our tightly knit Git integration, the ability to handle non-model artifacts and metadata, our metadata cache, and integration into the Glassdoor ecosystem.
  2. Immediate and round-the-clock support — giving Glassdoor the ability to fix bugs promptly. Moreover, we can iterate rapidly based on feedback from our users.
  3. Seamless Integration within the Glassdoor ecosystem — Deploying this service in the same manner as our other services offers enhanced connectivity, monitoring, alerting, integration, troubleshooting, and support.
  4. Cost — A thorough consideration of cost and the impact on latency reinforced our decision to embark on building our ML Registry from scratch (for deeper insights, refer to the illuminating Zillow Engineering Blog comparing the merits of building versus buying).

Glassdoor’s Approach: Git First

At Glassdoor, we have embraced a Git-centric approach, where each repository and branch holds an artifacts.yml file that serves as the central configuration point for metadata management. This YAML file allows individual teams and repositories to define and customize the metadata specific to their projects. Consider the simplified artifacts.yml example below:

topic-training-data:
  dataObjectId: 123
  type: dataset
  config:
    type: tsv
    hasHeader: true
classification-output:
  dataObjectId: 456
  type: dataset
  config:
    type: tsv
    hasHeader: true
Here we can configure metadata for various artifacts. For the “topic-training-data” artifact, we can see our most important config field, dataObjectId, indicating its unique identifier, which we can use to query the ML Registry. We can also include any other free-form metadata fields specific to this artifact.
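To make the lookup concrete, here is a minimal sketch of how a consumer might resolve an artifact’s dataObjectId from the parsed configuration and form a registry query URL. The base URL and the /artifacts/{id} path are hypothetical placeholders, not the published API:

```python
# The parsed structure below mirrors the artifacts.yml example above.
artifacts = {
    "topic-training-data": {
        "dataObjectId": 123,
        "type": "dataset",
        "config": {"type": "tsv", "hasHeader": True},
    },
}

def registry_url(base: str, artifact_name: str) -> str:
    """Build a lookup URL for an artifact using its unique dataObjectId.

    The /artifacts/{id} route is an illustrative assumption."""
    data_object_id = artifacts[artifact_name]["dataObjectId"]
    return f"{base}/artifacts/{data_object_id}"

print(registry_url("https://ml-registry.internal", "topic-training-data"))
# → https://ml-registry.internal/artifacts/123
```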

To facilitate seamless management of this metadata, we provide two convenient options for updating it. Firstly, teams can initiate changes via a merge request (MR), which triggers an approval workflow ensuring proper review before modifications are applied. Secondly, for more streamlined integration into our applications, we offer an API that allows metadata to be updated programmatically, empowering developers to incorporate metadata changes directly in their code. The API also supports querying for artifact uploads and downloads, providing a straightforward mechanism for accessing and manipulating data objects.
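The API-driven update path can be sketched as follows. Per the design described here, Git is written before the cache; both stores are modeled as plain dicts in this illustration, standing in for the real Git and Redis clients:

```python
def update_metadata(git_store: dict, cache: dict, name: str, fields: dict) -> None:
    """Apply a metadata change: Git first, cache second, never the reverse.

    git_store and cache are dict stand-ins for the real backends."""
    entry = dict(git_store.get(name, {}))
    entry.update(fields)
    git_store[name] = entry      # 1) commit to the source of truth
    cache[name] = dict(entry)    # 2) mirror into the fast cache

git_repo, redis_cache = {}, {}
update_metadata(git_repo, redis_cache, "topic-training-data",
                {"dataObjectId": 123, "type": "dataset"})
```

Writing Git first means that if the cache write fails, the next synchronization pass restores consistency from the source of truth.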

Using Git as the ultimate source of truth, our ML Registry synchronizes this metadata from the artifacts.yml files (per project, per branch) into a Redis cache at application startup, and keeps it synchronized after any subsequent changes. Treating Git as the source of truth becomes particularly valuable during critical events or fatal application failures: in such cases, the ML Registry resynchronizes with Git, allowing us to recover and rebuild the system from a reliable, consistent state. When updates are made via the API or a merge request (MR), Git is always updated first, so we never encounter data inconsistencies. This Git-to-Redis synchronization ensures uniform, fast, and simple access to the most up-to-date metadata for all consumers across the company, empowering teams to manage their artifacts efficiently, assured of data integrity and consistency. Git also gives us a robust version and change history. In this way, the ML Registry effectively addresses the common problem of organizing data related to machine learning artifacts.
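The startup synchronization amounts to flattening every per-repo, per-branch artifacts.yml into cache entries. The sketch below uses an illustrative "repo:branch:artifact" key layout, which is our own assumption rather than the registry’s documented schema:

```python
def sync_to_cache(repos: dict) -> dict:
    """Flatten per-repo, per-branch artifacts.yml entries into a cache.

    repos maps repo name -> branch name -> parsed artifacts.yml dict.
    The 'repo:branch:artifact' key layout is an illustrative choice."""
    cache = {}
    for repo, branches in repos.items():
        for branch, artifacts in branches.items():
            for name, meta in artifacts.items():
                cache[f"{repo}:{branch}:{name}"] = meta
    return cache

cache = sync_to_cache({
    "search-models": {
        "main": {"topic-training-data": {"dataObjectId": 123, "type": "dataset"}},
    },
})
```

Because this routine is a pure function of the Git contents, re-running it after a failure rebuilds the cache from a consistent state.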

Ease of Consumption: Empowering ML Pipelines & Other Consumers

To ensure widespread usability, we have employed the OpenAPI Generator, automatically generating clients for each new application version. Our OAS3 API specifications are readily available via the auto-generated Swagger UI. Furthermore, we have synchronized the ML Registry with an SQS queue, enabling any interested service to listen for change events. This integration lets ML pipelines and other tasks trigger automation seamlessly whenever registry data changes.
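A consumer of the change-event queue might handle messages along these lines. The event schema shown (action, artifact, dataObjectId) is a hypothetical shape for illustration; the actual SQS message format is defined by the registry itself:

```python
import json

def handle_change_event(body: str) -> str:
    """Parse a registry change event from an SQS message body and
    return a summary a pipeline could act on.

    The field names here are assumed, not the documented schema."""
    event = json.loads(body)
    return f"{event['action']} artifact {event['artifact']} (id={event['dataObjectId']})"

summary = handle_change_event(
    '{"action": "updated", "artifact": "classification-output", "dataObjectId": 456}'
)
```

In production such a handler would sit behind an SQS polling loop and kick off the appropriate pipeline or task.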

Flexibility of Implementation

As previously mentioned, we have adopted a Git-based approach. Redis serves as our backend cache for instant metadata retrieval, and S3 acts as our storage solution for data objects. While open-source consumers can leverage our implementation details, they can also create their own functionality by implementing the interfaces which abstract all of the functionality of the ML Registry.

Try It Out!

We encourage you to take our ML Registry for a spin and experience its capabilities firsthand. To get started, simply:

  1. Clone the project repository.
  2. Update the properties file with your own keys, or inject them in a fashion that is suitable for your application.

Once set up, you can explore the ML Registry’s features using our auto-generated Swagger UI, which will allow you to interact with the API seamlessly on your local host before you deploy it. If you’re working with backend technologies different from our stack, not to worry. The ML Registry provides interfaces that you can implement to tailor to your specific requirements. Give it a go, and let us know what you think!

Acknowledgments

A special acknowledgment goes to Malathi Sankar, the Machine Learning Engineering Director at Glassdoor and the visionary behind our entire platform, and to our architect, Vance Thornton, who designed the entire ML Registry from scratch. Their exceptional contributions and expertise have been instrumental in the success of our ML Engineering team and the development of the ML Registry.
