If you have some code that logs many metrics (or parameters or tags) to a remote MLflow tracking server (e.g. hosted with MFlux.ai), you should know that each request costs a network round trip, and that time quickly adds up. In the following example, we assume that you have a deep learning script that trains for 50 epochs, and you want to log 4 metrics for each epoch. For the sake of this example, let's log metrics from a Keras history object. The script that generates this history object could look like this:
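Here is a minimal sketch of such a training script; the data is synthetic and the model is a placeholder, so substitute your own:

```python
import numpy as np
from tensorflow import keras

# Synthetic data standing in for a real dataset (an assumption for this sketch)
x = np.random.rand(1000, 20)
y = np.random.randint(0, 2, size=(1000,))

model = keras.Sequential([
    keras.Input(shape=(20,)),
    keras.layers.Dense(32, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# With validation_split set, history.history holds 4 lists of 50 values each:
# loss, accuracy, val_loss and val_accuracy
history = model.fit(x, y, validation_split=0.2, epochs=50, verbose=0)
```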
~200 requests, takes ~55 seconds, do not use this approach:
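A sketch of this naive pattern, continuing from the training script above (the tracking URI is a placeholder for your own server):

```python
import mlflow

mlflow.set_tracking_uri("https://your-tracking-server.example.com")  # placeholder

with mlflow.start_run():
    # One network request per metric value:
    # 4 metrics x 50 epochs = 200 calls to log_metric
    for metric_name, values in history.history.items():
        for epoch, value in enumerate(values):
            mlflow.log_metric(metric_name, value, step=epoch)
```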
To speed it up by a factor of 4, you can use `log_metrics` instead of `log_metric` to reduce the number of requests from 200 to 50:
~50 requests, takes ~14 seconds:
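A sketch of the same logging with `log_metrics`, again assuming the `history` object from the training script above:

```python
import mlflow

with mlflow.start_run():
    num_epochs = len(history.history["loss"])
    # One request per epoch: all 4 metrics for that epoch go in a single call
    for epoch in range(num_epochs):
        epoch_metrics = {
            name: values[epoch] for name, values in history.history.items()
        }
        mlflow.log_metrics(epoch_metrics, step=epoch)
```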
The speed of this approach is similar to that of `mlflow.keras.autolog()`, which performs one `log_metrics` request after each epoch during training.
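For reference, enabling autologging is a one-liner, as long as it runs before `model.fit()`:

```python
import mlflow.keras

# Enable before calling model.fit(); MLflow then attaches a Keras callback
# that sends one log_metrics request at the end of every epoch
mlflow.keras.autolog()
```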
If you want to further improve the speed, you can use MLflow's `log_batch` method to log all metrics in a single request instead of 50:
Even faster: 1 request, takes ~3 seconds:
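A sketch using MLflow's low-level client API (`MlflowClient` and the `Metric` entity are part of MLflow; the `history` object again comes from the training script above):

```python
import time

import mlflow
from mlflow.entities import Metric
from mlflow.tracking import MlflowClient

client = MlflowClient()

with mlflow.start_run() as run:
    timestamp = int(time.time() * 1000)  # log_batch expects milliseconds
    metrics = [
        Metric(key=name, value=value, timestamp=timestamp, step=epoch)
        for name, values in history.history.items()
        for epoch, value in enumerate(values)
    ]
    # All 200 metric values travel in a single request
    client.log_batch(run.info.run_id, metrics=metrics)
```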
Note that there is a limit on the number of metrics you can log in a single `log_batch` call, typically 1000. If you exceed the limit, you'll get an error like this:
A batch logging request can contain at most 1000 metrics. Got 2000 metrics. Please split up metrics across multiple requests and try again.
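One way to stay under the limit is to split the list into chunks before calling `log_batch`; a sketch, reusing `client`, `run` and `metrics` from the example above:

```python
# Send at most 1000 metric values per request, matching the server limit
MAX_METRICS_PER_BATCH = 1000

for i in range(0, len(metrics), MAX_METRICS_PER_BATCH):
    chunk = metrics[i:i + MAX_METRICS_PER_BATCH]
    client.log_batch(run.info.run_id, metrics=chunk)
```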
If you need an MLflow tracking server for your data science team, check out MFlux.ai, which can set up an MLflow server for you at the click of a button.