“What’s in a name?” Model Naming & Versioning in Snowpark Model Registry
Updated January 25, 2024
Snowpark Model Registry is in Public Preview as of January 2024. With Model Registry, Snowflake customers can easily manage, scale, and secure machine learning model deployments on Snowflake, integrating models, their metadata, and their operational usage all into a single data platform.
Note that this post aims to outline best practices and guidance for ML Engineering/MLOps teams, but may not necessarily reflect requirements from a model risk/control, regulatory, and compliance point of view. While some of the same features outlined here can be used to satisfy those requirements, our guidance is centered around operational management and may not satisfy all compliance demands out-of-the-box.
As customers first experimented with Snowpark Model Registry in Private Preview, and now in Public Preview as well, there has been a clear transition in the questions asked, from “how can I do XYZ?” to “how should I do XYZ?” In the context of the model registry, this commonly presents as “how should I use model names and versions? What data or metadata should I attempt to capture in these constructs? What constitutes a new model version, or a new model name?”
It is important to understand exactly what the model name and version within Snowpark Model Registry are actually used for. You’ll notice from the documentation:
This captures the essence of the minimum functionality of model name and version: together, they form a unique identifier for a particular model within the registry. Snowpark ML also creates an actual internal ID that the registry uses, but that opaque identifier does not matter for client code. Clients can uniquely refer to any specific model using just the combination of name + version.
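As a rough illustration of that contract, the sketch below uses a toy in-memory dictionary keyed by (name, version). It is not the Snowpark ML API: `ToyRegistry`, `log_model`, and `get_model` are hypothetical names chosen for this example only, and the "models" are placeholder strings.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRegistry:
    """Hypothetical in-memory stand-in for a model registry (not Snowpark ML)."""

    _models: dict = field(default_factory=dict)

    def log_model(self, name: str, version: str, model: object) -> None:
        # The (name, version) pair is the client-facing unique identifier.
        key = (name, version)
        if key in self._models:
            raise ValueError(f"{name}/{version} is already registered")
        self._models[key] = model

    def get_model(self, name: str, version: str) -> object:
        # Any specific model can be retrieved from name + version alone.
        return self._models[(name, version)]


registry = ToyRegistry()
registry.log_model("CHURN_PREDICTION", "v0", model="placeholder for a fitted model")
registry.log_model("CHURN_PREDICTION", "v1", model="placeholder for another fitted model")
```

Registering the same name and version a second time raises an error in this sketch, which mirrors the idea that the pair must identify exactly one model.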
From that perspective, the bare-minimum use of model name and version is as a unique identifier; frankly, you could use generated UUIDs for these fields if you wanted to. If that concept makes you squirm as you read it, that is a fair reaction. Just because you can use them that way does not mean that you should. Instead, combining descriptive model names and incremental versioning with additional metadata captured in model tags gets the best of both worlds for tracking and filtering model objects.
Name and version, while functionally acting as identifiers, should also carry descriptive information about how the various models within the registry differ from one another. A common way my customers think about this is that model names capture “use-cases,” e.g. FRAUD_DETECTION_CLASSIFIER, while the version captures different iterations of that model. In many cases incremental versioning may not even reflect development or refinement, just that “these are two different versions of a model that both aim to solve the same problem.” This gives users of the registry an immediate use-case-based filtering mechanism.
That said, many customers have more specific ways they want to delineate between versions of a model. For example, hyperparameters: different parameter sets should be treated as different version numbers of the same model name. On the other hand, some customers may use different input training datasets, or a model may only apply to a certain segment of their population or data. Should they capture those additional delineations in the model name? In their preferred semantic versioning? Our opinion is no.
Model objects also support tags: key-value JSON metadata attached to a particular model object. Importantly, those tags are queryable from the registry using Snowflake’s semi-structured query support. This means that data stored in tags is actually easier to use than data encoded directly in the model name, which would require a deep understanding of the exact naming convention, as well as custom string parsing, to be useful. Say, for example, that I have a customer churn model on which I’ve performed some hyperparameter tuning. I’ll end up with multiple versions of the same model name, with the parameters captured as part of the model metadata, e.g.:
model_name, version, tags
CHURN_PREDICTION, v0, {"params": {"n_trees": 500, …}}
CHURN_PREDICTION, v1, {"params": {"n_trees": 1000, …}}
...
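If versions follow the simple v0, v1, ... pattern shown above, choosing the label for the next tuning run can be automated. The helper below is a minimal sketch under that assumption; `next_version` is a hypothetical function for illustration, not part of Snowpark ML, and it ignores any labels that do not match the pattern.

```python
import re
from typing import Iterable


def next_version(existing: Iterable[str]) -> str:
    """Return the next label in a v0, v1, v2, ... sequence (assumed convention)."""
    numbers = []
    for label in existing:
        match = re.fullmatch(r"v(\d+)", label)
        if match:
            numbers.append(int(match.group(1)))
    # Start at v0 when no matching versions exist yet.
    return f"v{max(numbers) + 1}" if numbers else "v0"


# Example: two versions of CHURN_PREDICTION already exist.
new_label = next_version(["v0", "v1"])
```

Comparing the numeric part rather than the raw string avoids ordering "v10" before "v2".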
What if you have additional granularity you want to capture? For example, what if you fit different types of churn prediction models for different cohorts of customers? The same model tags metadata field can be used to capture this more qualitative delineation, e.g.:
model_name, version, tags
CHURN_PREDICTION, v0, {"params": {"n_trees": 500, …}, "cohort": "cohort1"}
CHURN_PREDICTION, v1, {"params": {"n_trees": 1000, …}, "cohort": "cohort2"}
...
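To show why structured tags are easier to work with than an encoded name, here is a small sketch that filters the rows above by a tag value. It uses plain Python dictionaries rather than the registry's own query interface; `filter_by_tag` is a hypothetical helper, and the parameters elided with "…" above are simply left out.

```python
# Rows mirror the example table; elided parameters are omitted, not guessed.
records = [
    {
        "model_name": "CHURN_PREDICTION",
        "version": "v0",
        "tags": {"params": {"n_trees": 500}, "cohort": "cohort1"},
    },
    {
        "model_name": "CHURN_PREDICTION",
        "version": "v1",
        "tags": {"params": {"n_trees": 1000}, "cohort": "cohort2"},
    },
]


def filter_by_tag(rows, key, value):
    """Keep only rows whose top-level tag `key` equals `value`."""
    return [row for row in rows if row["tags"].get(key) == value]


cohort1_models = filter_by_tag(records, "cohort", "cohort1")
```

No string parsing of the model name is needed: the cohort is looked up by key, which is the same idea as filtering on the tag metadata in a query.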
While the model name can serve as a basic descriptor of the model’s intended use-case, the Model Description field should also be leveraged to capture a single longform description of the model. While the description is not used for querying the model registry, it should be written such that it provides a very clear definition of the origin and intended usage of the corresponding model.
Tags can capture meaningful descriptors of the underlying model, but there is no limit on the kind of metadata that you might want to associate with a model. For example, you can track release status (e.g. experimental, testing, integration, prod/released), git commit hashes for the corresponding training code, Jira issue IDs, model types, and more. Exactly what information should be stored in this metadata is up to each customer, based on the MLOps policies and practices laid out for the organization. Whatever that metadata is, tags are the appropriate place to capture it.
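One way to keep such tags consistent across a team is to build them through a single function that enforces the agreed vocabulary. The sketch below is an assumption about how a team might do this, not a Snowpark ML feature: `build_tags`, the key names, and the allowed status values are all hypothetical, and the argument values are placeholders.

```python
import json

# Assumed vocabulary; each organization would define its own statuses.
ALLOWED_RELEASE_STATUSES = {"experimental", "testing", "integration", "prod"}


def build_tags(release_status, git_commit, issue_id, model_type):
    """Assemble a tags dictionary, rejecting unknown release statuses."""
    if release_status not in ALLOWED_RELEASE_STATUSES:
        raise ValueError(f"unknown release status: {release_status!r}")
    return {
        "release_status": release_status,
        "git_commit": git_commit,
        "issue_id": issue_id,
        "model_type": model_type,
    }


# Placeholder values stand in for real hashes and issue IDs.
tags = build_tags("experimental", "<git-commit-hash>", "<jira-issue-id>", "<model-type>")
tags_json = json.dumps(tags, sort_keys=True)
```

Serializing to JSON keeps the payload in the same key-value shape described for model tags, so each field stays individually addressable.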
One exception to populating tags with significant model metadata is if you are using Snowpark Model Registry as a cache only for models deployed into Snowflake. In that case, you may want to minimize the amount of data reflected in the tags directly to avoid inviting deviation from a separate system of record, and instead only provide a tag value that links the Snowpark Model Registry entry back to the system of record’s entry.
Model tags can also be an especially useful mechanism for capturing important regulatory and compliance information, such as documenting the model owner. MLOps teams need to work jointly with organizational regulatory and compliance authorities to understand what additional detail should be included, and should establish common policies and practices to ensure that their usage of the model registry is compliant.
Using model name, version, description, and tags in this manner accomplishes the following:
- The model name is actually descriptive. While only practically used as an identifier, it now also offers some level of description about “what does this model do?” for a client or user accessing the registry.
- Additional clear, longform descriptions of the model’s purpose are captured in the Model Description alongside the name.
- The version number indicates different iterations of models that aim to solve the same kind of problem, while uniquely capturing meaningfully different versions of the model.
- The model tags can store additional relevant metadata that can be easily queried using Snowflake’s semi-structured query support, meaning that if you only want to look at the churn prediction models for cohort1, and not any other cohort, you can easily apply a WHERE filter clause on that metadata to retrieve only those specific model objects.
While the model name should not contain a significant amount of detail that belongs in tags as metadata, it is still important that MLOps teams establish policies that explicitly govern the desired naming convention. The simple guidelines for usage of name, version, tags, and description in this document are not sufficient on their own; organizations should still outline explicit policies for the usage of these fields.
Snowpark Model Registry is now available to all Snowflake accounts in Public Preview! For more information, refer to the documentation.