Migrating Local MLflow Experiments to a Centralized MLflow Instance

Zakariae Mkassi
lionswerk
Nov 5, 2023


A Step-by-Step Guide

Introduction:

In the realm of machine learning experimentation, managing local MLflow projects efficiently and integrating them into a centralized MLflow instance is a pivotal step towards enhancing collaboration and scalability. The concept of centralization allows teams to access, track, and collaborate on experiments seamlessly, irrespective of where they were originally conducted. In this comprehensive guide, we will walk you through the step-by-step process of migrating your local MLflow experiments to a centralized MLflow instance, creating a robust infrastructure for your machine learning projects. Let’s embark on this journey to streamline your machine learning experiments and optimize your workflow.

Requirements:

  1. MLflow Central Instance: A centralized MLflow instance, where you want to consolidate and manage your experiments. This instance should be up and running, providing a central platform for tracking and storing MLflow experiments.
  2. PostgreSQL Database: A PostgreSQL database with a pre-defined schema to store MLflow experiment metadata, parameters, and metrics. Ensure that you have the necessary database credentials and connection details to interact with PostgreSQL.
  3. Minio Bucket: A Minio bucket, which serves as a centralized storage location for your MLflow artifacts. Minio is an open-source S3-compatible object storage system that should be configured and accessible for artifact storage. Ensure that you have access keys, secret keys, and the endpoint URL for Minio.

These three components will form the foundation for centralizing your local MLflow experiments and creating a more organized and collaborative environment for your machine learning projects.
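To make the relationship between these three components concrete, here is a minimal sketch of the environment variables an MLflow client typically needs in order to talk to such a stack. The hostnames and credentials are hypothetical placeholders; `MLFLOW_TRACKING_URI` and `MLFLOW_S3_ENDPOINT_URL` are standard MLflow settings, and the AWS key variables are what boto3 (and therefore MLflow's S3 artifact support) reads for Minio credentials:

```python
import os

# Hypothetical endpoints -- substitute the details of your own deployment.
os.environ["MLFLOW_TRACKING_URI"] = "http://mlflow.example.com:5000"    # central MLflow instance
os.environ["MLFLOW_S3_ENDPOINT_URL"] = "http://minio.example.com:9000"  # Minio artifact store
os.environ["AWS_ACCESS_KEY_ID"] = "YOUR_MINIO_ACCESS_KEY"
os.environ["AWS_SECRET_ACCESS_KEY"] = "YOUR_MINIO_SECRET_KEY"
```

With these set, both the MLflow client and the migration scripts in this guide can resolve the central tracking server and the Minio bucket without hard-coding endpoints in every script.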

Uploading Local MLflow Artifacts to Minio: A Practical Guide

In this section, we will discuss how to upload your local MLflow artifacts to a Minio bucket, enabling centralized storage and easy accessibility for your machine learning model artifacts. This step is crucial for creating a streamlined and collaborative environment for your projects.

Code Walkthrough:

We’ve prepared a Python code snippet that demonstrates the process of uploading your local MLflow artifacts to a Minio bucket. Here’s a step-by-step explanation of the code:

import os
import glob
import boto3

def upload_model_to_s3(local_folder_path, s3_bucket_name, aws_access_key_id, aws_secret_access_key, region_name, s3_folder_path):
    # Set up a session with the specified access keys and region
    session = boto3.Session(
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key,
        region_name=region_name
    )

    # Initialize an S3 resource and connect to the Minio endpoint
    s3 = session.resource(
        's3',
        endpoint_url='http://<IP-ADDRESS>:9000',
        aws_access_key_id=aws_access_key_id,
        aws_secret_access_key=aws_secret_access_key,
        region_name=region_name
    )

    # Select the target S3 bucket
    bucket = s3.Bucket(s3_bucket_name)

    # Traverse the local folder and upload every artifact file to S3
    for subdir, dirs, files in os.walk(local_folder_path):
        for file in files:
            full_path = os.path.join(subdir, file)
            with open(full_path, 'rb') as data:
                # Build the object key from the path relative to the local folder root
                key = os.path.join(s3_folder_path, os.path.relpath(full_path, local_folder_path))
                bucket.put_object(Key=key, Body=data)

To execute this code, you’ll need to ensure you have the necessary AWS access keys, the Minio endpoint URL, and a local folder path containing your MLflow artifacts.

local_folder_path = 'PATH_TO_YOUR_MODEL_ARTIFACTS'
s3_bucket_name = 'BUCKET_NAME'
s3_folder_path = '0/'
aws_access_key_id = 'YOUR_AWS_ACCESS_KEY_ID'
aws_secret_access_key = 'YOUR_AWS_SECRET_ACCESS_KEY'
region_name = 'REGION_NAME'

upload_model_to_s3(local_folder_path, s3_bucket_name, aws_access_key_id, aws_secret_access_key, region_name, s3_folder_path)

With this code, you can automate the process of uploading your local MLflow artifacts to Minio, facilitating centralization and improving accessibility for your machine learning model artifacts.
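One detail worth isolating is how the object key is derived from the local file path: the local folder prefix is stripped off and the remainder is joined onto the target folder inside the bucket. The hypothetical helper below extracts just that logic so it can be checked without touching Minio at all:

```python
import os

def build_object_key(s3_folder_path, local_folder_path, full_path):
    # Strip the local root and prepend the target folder inside the bucket
    relative = os.path.relpath(full_path, local_folder_path)
    return os.path.join(s3_folder_path, relative).replace(os.sep, '/')

# Example: an artifact inside a run's artifacts directory
key = build_object_key('0/', '/tmp/mlruns', '/tmp/mlruns/abc123/artifacts/model.pkl')
print(key)  # 0/abc123/artifacts/model.pkl
```

Keeping the key relative to the local folder root ensures the directory layout inside the bucket mirrors the layout MLflow expects under `mlflow-artifacts:/0/`.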

Connecting to the Database:

Before we proceed with inserting run metadata, it’s essential to establish a connection to the PostgreSQL database. Here’s how you can set up a connection in your code:

import os
import yaml
import psycopg2

class MLFlowDatabase:
    def __init__(self, run_path, database="MLFLOW_DB", user="MLFLOW_USER", password="PASSWORD", host="DB_HOST", port="5432"):
        # Establish a connection to the PostgreSQL database
        self.conn = None
        try:
            self.conn = psycopg2.connect(
                database=database,
                user=user,
                password=password,
                host=host,
                port=port
            )
            self.cur = self.conn.cursor()
            self.run_path = run_path
            self.run_uuid = None
        except Exception as e:
            print("Error occurred while connecting to the database:", e)

    def __del__(self):
        print('Disconnecting from the database...')
        # Close the database connection upon object deletion
        if self.conn:
            self.conn.close()

    # def insert_run_metadata(self)...
    # def insert_tags_to_db(self)...
    # def insert_params_to_db(self)...
    # def insert_metrics_to_db(self)...

You can create an instance of the MLFlowDatabase class and use it to connect to your PostgreSQL database. This connection is essential for all subsequent database operations, including inserting run metadata.

Inserting Run Metadata:

In this section, we will guide you through the process of inserting essential run metadata into your PostgreSQL database. This step is crucial for tracking and managing your machine learning experiments effectively:

def insert_run_metadata(self):
    filepath = self.run_path + 'meta.yaml'

    # Read the metadata file
    with open(filepath, 'r') as f:
        metadata = yaml.safe_load(f)

    # Extract the relevant metadata values
    run_uuid = metadata['run_uuid']
    self.run_uuid = run_uuid
    name = metadata['run_name']
    source_type = 'UNKNOWN'
    source_name = metadata['source_name']
    entry_point_name = metadata['entry_point_name']
    user_id = metadata['user_id']
    status = metadata['status']
    start_time = metadata['start_time']
    end_time = metadata['end_time']
    source_version = metadata['source_version']
    lifecycle_stage = metadata['lifecycle_stage']
    artifact_uri = 'mlflow-artifacts:/0/' + run_uuid + '/artifacts'
    experiment_id = metadata['experiment_id']

    # Map the numeric status codes from meta.yaml to their string representation
    if status == 3:
        status = 'FINISHED'
    elif status == 4:
        status = 'FAILED'

    # Insert the run metadata using a parameterized query
    query = ("INSERT INTO runs (run_uuid, name, source_type, source_name, entry_point_name, "
             "user_id, status, start_time, end_time, source_version, lifecycle_stage, "
             "artifact_uri, experiment_id, deleted_time) "
             "VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, NULL);")

    try:
        self.cur.execute(query, (run_uuid, name, source_type, source_name, entry_point_name,
                                 user_id, status, start_time, end_time, source_version,
                                 lifecycle_stage, artifact_uri, experiment_id))
    except psycopg2.IntegrityError as e:
        if "already exists" in (e.pgerror or ""):
            print("Run already exists.")
        else:
            print("Error occurred while inserting runs:", e)
    except Exception as e:
        print("Error occurred while inserting runs:", e)

    self.conn.commit()

Explanation:

  • This code reads the metadata file from your MLflow experiment, which contains crucial information about the run.
  • It extracts key metadata values like run UUID, run name, source type, source name, entry point name, and more.
  • The code then creates an SQL query to insert this metadata into the PostgreSQL database.
  • If any errors occur during the insertion process, the code provides appropriate error handling and messages.
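Two pieces of this method can be sketched in isolation: the translation of MLflow's numeric status codes into the strings the database schema expects, and the construction of the artifact URI pointing into experiment 0 on the central instance. The code values for FINISHED and FAILED are the ones handled in the function above; any other value is passed through unchanged:

```python
def decode_status(status):
    # Numeric status codes as stored in meta.yaml (see insert_run_metadata)
    mapping = {3: 'FINISHED', 4: 'FAILED'}
    return mapping.get(status, status)

def build_artifact_uri(run_uuid, experiment_id='0'):
    # Artifact URI pointing into the given experiment on the central instance
    return f'mlflow-artifacts:/{experiment_id}/{run_uuid}/artifacts'

print(decode_status(3))              # FINISHED
print(build_artifact_uri('abc123'))  # mlflow-artifacts:/0/abc123/artifacts
```

Note that the artifact URI must match the `s3_folder_path` you chose when uploading artifacts to Minio, otherwise the central MLflow UI will show the run but fail to resolve its artifacts.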

Usage:

You can execute this code by creating an instance of the MLFlowDatabase class and calling the insert_run_metadata method. This will insert the run metadata into your PostgreSQL database.

In the next section, we will continue with “Inserting Tags,” so stay tuned for the next step in our journey to centralize your MLflow experiments.

Inserting tags:

In this section, we will dive into the process of inserting tags associated with your machine learning runs into your PostgreSQL database. Tags provide valuable context and metadata about your experiments, making it easier to categorize and search for specific runs.

def insert_tags_to_db(self):
    tags_dir = self.run_path + 'tags/'

    tags = []

    # Collect (key, value) pairs: the file name is the tag key, the body its value
    for filename in os.listdir(tags_dir):
        filepath = os.path.join(tags_dir, filename)
        with open(filepath, 'r') as f:
            key = os.path.split(filename)[1]
            value = f.read().strip()
            tags.append((key, value))

    for key, value in tags:
        query = "INSERT INTO tags (key, value, run_uuid) VALUES (%s, %s, %s);"
        try:
            self.cur.execute(query, (key, value, self.run_uuid))
        except psycopg2.IntegrityError as e:
            if "Key (key, run_uuid)=" in (e.pgerror or ""):
                print("ID already exists.")
            else:
                print("Error occurred while inserting tags:", e)
        except Exception as e:
            print("Error occurred while inserting tags:", e)
    self.conn.commit()

Explanation:

  • This code processes the tags directory associated with your MLflow run. Tags are often used to provide additional information or categorization for your experiments.
  • It iterates through each tag file in the specified directory, extracting the tag name (key) and its value.
  • For each tag, the code generates an SQL query to insert the tag information into the PostgreSQL database.
  • Proper error handling is in place to address potential exceptions or integrity errors during the insertion process.
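The file-to-tuple conversion can be exercised on its own, without a database connection. The sketch below builds a throwaway `tags/` directory and reads it back the same way `insert_tags_to_db` does (the tag name `mlflow.user` is just an illustrative example of the tags MLflow writes):

```python
import os
import tempfile

def read_tag_files(tags_dir):
    # Each file name is the tag key; the file body is the tag value
    tags = []
    for filename in sorted(os.listdir(tags_dir)):
        with open(os.path.join(tags_dir, filename), 'r') as f:
            tags.append((filename, f.read().strip()))
    return tags

# Tiny demo against a throwaway tags/ directory
with tempfile.TemporaryDirectory() as tags_dir:
    with open(os.path.join(tags_dir, 'mlflow.user'), 'w') as f:
        f.write('alice\n')
    tags = read_tag_files(tags_dir)

print(tags)  # [('mlflow.user', 'alice')]
```

Running this kind of dry run against a real `tags/` directory first is a cheap way to confirm the files parse as expected before touching the central database.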

Usage:

To utilize this code, create an instance of the MLFlowDatabase class and call the insert_tags_to_db method. This will insert the tags associated with your MLflow runs into your PostgreSQL database, allowing you to access and search for runs based on their associated tags.

In the upcoming section, we will discuss “Inserting Parameters,” another critical aspect of centralizing your MLflow experiments. Stay with us as we proceed on our journey to streamline your machine learning workflow.

Inserting Parameters:

In this section, we will explore how to insert parameters associated with your machine learning runs into your PostgreSQL database. Parameters are vital to track the configurations and hyperparameters used during an experiment, providing essential details for reproducing results and understanding your model’s performance.

def insert_params_to_db(self):
    params_dir = self.run_path + 'params/'
    params = []

    # Collect (key, value) pairs: the file name is the parameter name, the body its value
    for filename in os.listdir(params_dir):
        filepath = os.path.join(params_dir, filename)
        with open(filepath, 'r') as f:
            key = os.path.split(filename)[1]
            value = f.read().strip()
            params.append((key, value))

    for key, value in params:
        query = "INSERT INTO params (key, value, run_uuid) VALUES (%s, %s, %s);"
        try:
            self.cur.execute(query, (key, value, self.run_uuid))
        except psycopg2.IntegrityError as e:
            if "Key (key, run_uuid)=" in (e.pgerror or ""):
                print("ID already exists.")
            else:
                print("Error occurred while inserting params:", e)
        except Exception as e:
            print("Error occurred while inserting params:", e)
    self.conn.commit()

Explanation:

  • This code handles the parameters associated with your MLflow run. Parameters often include hyperparameters, model configurations, and other settings used in your experiments.
  • It scans the specified directory for parameter files and extracts the parameter name (key) and its corresponding value.
  • For each parameter, the code generates an SQL query to insert the parameter information into the PostgreSQL database.
  • Proper error handling is in place to address potential exceptions or integrity errors during the insertion process.

Usage:

To put this code to use, create an instance of the MLFlowDatabase class and call the insert_params_to_db method. This will insert the parameters linked to your MLflow runs into your PostgreSQL database, enabling you to document and review the specific configurations of each experiment.

In the next section, we will discuss “Inserting Metrics,” a crucial aspect of recording and analyzing the performance of your machine learning models. Join us as we continue our journey towards efficient machine learning project management.

Inserting Metrics:

In this section, we will guide you through the process of inserting metrics associated with your machine learning runs into your PostgreSQL database. Metrics play a crucial role in understanding the performance of your models and experiments, allowing you to track key indicators and make data-driven decisions.

def insert_metrics_to_db(self):
    metrics_dir = self.run_path + 'metrics/'
    metrics = []

    # Each metric file stores a space-separated line: "<timestamp> <value> <step>"
    for filename in os.listdir(metrics_dir):
        filepath = os.path.join(metrics_dir, filename)
        with open(filepath, 'r') as f:
            key = os.path.split(filename)[1]
            f_read = f.read().strip().split(' ')
            timestamp = f_read[0]
            value = f_read[1]
            step = f_read[2]
            is_nan = 'true' if value.lower() == 'nan' else 'false'
            metrics.append((key, value, timestamp, step, is_nan))

    for key, value, timestamp, step, is_nan in metrics:
        query_metrics = "INSERT INTO metrics (key, value, timestamp, run_uuid, step, is_nan) VALUES (%s, %s, %s, %s, %s, %s);"
        query_latest_metrics = "INSERT INTO latest_metrics (key, value, timestamp, run_uuid, step, is_nan) VALUES (%s, %s, %s, %s, %s, %s);"

        try:
            self.cur.execute(query_metrics, (key, value, timestamp, self.run_uuid, step, is_nan))
            self.cur.execute(query_latest_metrics, (key, value, timestamp, self.run_uuid, step, is_nan))
        except psycopg2.IntegrityError as e:
            if "Key (key, run_uuid)=" in (e.pgerror or ""):
                print("ID already exists.")
            else:
                print("Error occurred while inserting metrics and latest_metrics:", e)
        except Exception as e:
            print("Error occurred while inserting metrics and latest_metrics:", e)

    self.conn.commit()

Explanation:

  • This code focuses on handling the metrics collected during your MLflow runs. Metrics typically include performance statistics, such as accuracy, loss, or any custom-defined metrics for your machine learning models.
  • It scans the metrics directory associated with your MLflow run, extracting information such as metric name, value, timestamp, step, and whether the value is NaN (not-a-number).
  • For each metric, the code generates two SQL queries: one to insert the metric into the metrics table and another to insert it into the latest_metrics table. The latter is used for quick access to the most recent metrics.
  • Proper error handling is in place to address potential exceptions or integrity errors during the insertion process.
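Since each metric file holds a single space-separated line of the form "timestamp value step", the parsing step can be sketched and tested on its own. Note the values stay as strings here, mirroring the method above, which lets the database driver handle the conversion:

```python
def parse_metric_line(line):
    # MLflow metric files store: "<timestamp> <value> <step>"
    timestamp, value, step = line.strip().split(' ')
    is_nan = 'true' if value.lower() == 'nan' else 'false'
    return timestamp, value, step, is_nan

print(parse_metric_line('1699185600000 0.9134 0'))
# ('1699185600000', '0.9134', '0', 'false')
```

One caveat to be aware of: runs that logged the same metric at multiple steps produce multi-line metric files, which this per-file parsing does not cover; extending it means applying the same split to every line of the file.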

Usage:

To implement this code, create an instance of the MLFlowDatabase class and call the insert_metrics_to_db method. This will insert the metrics recorded during your MLflow runs into your PostgreSQL database, providing a comprehensive view of your experiment's performance over time.

With the completion of this section, you’ve learned how to centralize not only the metadata, tags, and parameters but also the crucial metrics of your machine learning experiments. In the next segment, we will put all of these pieces together and automate the migration of multiple runs in one go. Stay tuned for our continued journey to streamline your machine learning workflow.

Automating the Migration Process:

To streamline the migration process and facilitate the transition of multiple runs from your local MLflow experiments to a centralized instance, you can use the provided Python script. This script automates the entire migration process, making it more efficient and less error-prone.

Here’s how it works:

run_paths = glob.glob('/PATH/TO/ML_RUNS/mlruns/0/*/')

for run_path in run_paths:
    mlflow_conn = MLFlowDatabase(run_path=run_path)

    mlflow_conn.insert_run_metadata()
    mlflow_conn.insert_tags_to_db()
    mlflow_conn.insert_params_to_db()
    mlflow_conn.insert_metrics_to_db()
  1. List Run Paths: The script starts by listing all the run paths for your local MLflow experiments using the glob module. These run paths represent the different experiments you have conducted locally.
  2. Iterate Over Runs: It then iterates through each run path, creating an instance of the MLFlowDatabase class for each run. This class encapsulates the logic for inserting run metadata, tags, parameters, and metrics into the centralized PostgreSQL database.
  3. Inserting Data: For each run, the script inserts the run metadata, tags, parameters, and metrics into the respective database tables. This process centralizes the information and makes it readily accessible in your centralized MLflow instance.
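One practical refinement worth considering: not every directory matched by the glob pattern is necessarily a run, since stray folders can appear under an experiment directory. A hypothetical filter that keeps only directories actually containing a `meta.yaml` makes the loop more robust:

```python
import glob
import os
import tempfile

def list_run_paths(experiment_dir):
    # Keep only directories that look like runs, i.e. contain a meta.yaml
    candidates = glob.glob(os.path.join(experiment_dir, '*/'))
    return [p for p in candidates if os.path.isfile(os.path.join(p, 'meta.yaml'))]

# Tiny demo with a throwaway directory layout
with tempfile.TemporaryDirectory() as exp_dir:
    os.makedirs(os.path.join(exp_dir, 'run1'))
    os.makedirs(os.path.join(exp_dir, 'notarun'))
    open(os.path.join(exp_dir, 'run1', 'meta.yaml'), 'w').close()
    runs = list_run_paths(exp_dir)

print(len(runs))  # 1
```

Swapping this filtered listing in for the bare `glob.glob` call in the script above prevents the migration loop from crashing on directories that lack the expected run structure.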

By running this script, you can effectively centralize your local MLflow experiments with minimal manual effort. It’s a practical approach for teams or individuals dealing with a significant number of experiments.

This automation not only saves time but also reduces the likelihood of errors during the migration process. As a result, you can focus on analyzing and deriving insights from your experiments rather than spending excessive time on administrative tasks.

With the completion of this script, you’ve acquired a valuable tool for managing and centralizing your machine learning experiments efficiently. As you continue to scale your MLflow projects and conduct more experiments, this automation will prove to be an indispensable asset, enhancing your productivity and fostering a more collaborative and organized environment for your machine learning endeavors.

Conclusion and Key Takeaways:

In this comprehensive guide, we have explored the process of migrating your local MLflow experiments to a centralized MLflow instance, creating a robust infrastructure for your machine learning projects. Let’s recap what we’ve accomplished and the valuable insights gained:

  1. Centralization for Collaboration: Centralizing your MLflow experiments allows teams to collaborate efficiently and access experiments from a single platform, irrespective of their origin. This enhances teamwork and knowledge sharing.
  2. Requirements for Centralization: We’ve discussed the fundamental components required for centralizing your MLflow experiments, including a centralized MLflow instance, a PostgreSQL database to store experiment metadata, and a Minio bucket for artifact storage.
  3. Uploading Model Artifacts: We began our journey by detailing how to upload model artifacts to a centralized storage solution, such as Minio. This ensures that your model files are accessible, organized, and secured in a centralized location.
  4. Inserting Run Metadata: We delved into the insertion of run metadata into a PostgreSQL database. This metadata includes essential information about each experiment, such as run name, user details, timestamps, and artifact paths.
  5. Handling Tags: Tags play a crucial role in categorizing and organizing experiments. We learned how to insert tags associated with MLflow runs into the database, making it easier to search and filter experiments.
  6. Documenting Parameters: We covered the insertion of experiment parameters, which includes hyperparameters and configuration settings, into the PostgreSQL database. This documentation is vital for reproducibility and understanding the context of each experiment.
  7. Recording Metrics: Metrics are essential for assessing the performance of machine learning models. We explored how to insert metrics into the database, enabling you to track the evolution of model performance over time.
  8. Bulk Migration: To streamline the migration process, we provided a Python script that automates the migration of multiple runs in one go, saving time and effort.

By following the steps outlined in this guide, you can establish a centralized platform for managing, tracking, and collaborating on your machine learning experiments. This results in improved workflow efficiency, better reproducibility, and enhanced team collaboration.

As you continue on your machine learning journey, keep in mind the importance of centralization and effective experiment management. Whether you are a data scientist, machine learning engineer, or researcher, the techniques and insights shared in this guide will empower you to take control of your MLflow experiments and make informed decisions based on the valuable data they provide.

Lastly, I warmly invite you to learn more about our company. Visit our website to explore our services in depth.

Whether you have inquiries or seek more information, our support chat is available, or you can reach out through various communication options provided on our website.

We are eager to assist and look forward to connecting with you.
