Python Package Management: A Guide to Avoid Dependency Conflicts
Oops, the latest pandas version requires ‘numpy>=1.22’ but tensorflow requires ‘numpy~=1.9’.
Table of contents
- Motivations
- What are conflicting dependencies?
- Understand the issue with a code example
- Design Patterns: Decoupling for Flexibility
- Optional Dependencies: Tailored for Users
- tox: Testing through the Maze of Dependencies
- Conditional Test Execution: Making Tests Smarter
- Wrap-up
- Conclusions
Motivations
As an open-source Python package maintainer and a Data Scientist, I’ve had the chance to witness the evolution of Melusine, a machine learning-powered email processing tool developed by MAIF.
Released in 2019, Melusine quickly became an integral part of our daily operations, accelerating our email processing workflows. But as the package matured and new features were added, managing dependencies became a real challenge.
The ever-changing Python ecosystem demanded continuous updates to Melusine’s dependencies. This manual process was time-consuming and error-prone: outdated dependencies could introduce security vulnerabilities, compatibility issues, and disruptions in our internal systems.
Rather than attempting to patch up the existing Melusine codebase, we opted for a complete rewrite. This approach allowed us to address the dependency conflict challenges head-on, adopting a more modern and streamlined strategy.
What are conflicting dependencies?
As a Python developer, you’ve probably experienced dependency hell at some point. It’s the frustration of trying to keep your packages up to date without breaking your code. One package requires a newer version of a dependency, but another package requires an older version. It’s a mess.
In this article, I’ll show you how we used design patterns, optional dependencies, and the tox package to keep the Melusine package clean and up to date.
Understand the issue with a code example
To illustrate the journey of rewriting Melusine, I’ll use a simple pseudo-code example: a class to make a machine learning prediction using different types of models.
# predictor.py
import sklearn.base
import tensorflow as tf


class Predictor:
    def __init__(self, model):
        self.model = model

    def predict(self, data):
        # Tensorflow model
        if isinstance(self.model, tf.keras.Model):
            result = self.model.predict(data)
        # Sklearn model
        elif isinstance(self.model, sklearn.base.TransformerMixin):
            result = self.model.transform(data)
        # Unsupported model
        else:
            raise TypeError(
                f"Object of type {type(self.model)} "
                "is not supported by the Predictor class"
            )
        return result
# run_prediction.py
from sklearn.ensemble import RandomForestClassifier

from my_package import Predictor

some_data = ...  # placeholder for input data
predictor = Predictor(model=RandomForestClassifier())
result = predictor.predict(data=some_data)
The class uses an ML model to make a prediction, but depending on the type of model, the method to run predictions is different (transform or predict). There are a few weaknesses in this code block that could make it hard to maintain over time:
- The design forces you to modify the code and create a new if condition for each new type of model. This is particularly problematic if you don’t have the rights to modify the source code (when using an open-source package, for example).
- Both sklearn and tensorflow are imported in the module. This means both need to be installed even if only one is used, which can lead to incompatibilities.
Design Patterns: Decoupling for Flexibility
The first step in our journey consisted in refactoring the code, leveraging the power of design patterns. Specifically, we adopted dependency injection, a technique that decouples our code from its dependencies, allowing us to swap out different dependencies without disrupting the overall system.
Let’s rewrite the code block using dependency injection. We start by defining an abstract class that fixes the signature for all predictor objects.
# base_predictor.py
from abc import ABC, abstractmethod


class BasePredictor(ABC):
    @abstractmethod
    def predict(self, data):
        """Execute a machine learning model prediction"""
        raise NotImplementedError()
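As a quick aside, Python enforces this contract at instantiation time: a class with an unimplemented abstract method cannot be instantiated, while any subclass that implements predict works right away. A minimal sketch (EchoPredictor is a made-up toy class, not part of Melusine):

```python
from abc import ABC, abstractmethod


class BasePredictor(ABC):
    @abstractmethod
    def predict(self, data):
        """Execute a machine learning model prediction"""
        raise NotImplementedError()


# Instantiating the abstract class directly fails...
try:
    BasePredictor()
except TypeError as exc:
    print(f"TypeError: {exc}")


# ...while any subclass implementing predict() works fine
class EchoPredictor(BasePredictor):
    def predict(self, data):
        return data


print(EchoPredictor().predict("some_data"))  # -> some_data
```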
Then we define a class inheriting from BasePredictor for each type of model.
# sklearn_predictor.py
from sklearn.ensemble import RandomForestClassifier

from base_predictor import BasePredictor


class SklearnPredictor(BasePredictor):
    def __init__(self):
        self.model = RandomForestClassifier()

    def predict(self, data):
        """Execute a machine learning model prediction"""
        return self.model.predict(data)
# tensorflow_predictor.py
from base_predictor import BasePredictor
from tensorflow import SomeTensorflowModel  # placeholder for a real Keras model


class TensorflowPredictor(BasePredictor):
    def __init__(self):
        self.model = SomeTensorflowModel()

    def predict(self, data):
        """Execute a machine learning model prediction"""
        return self.model.predict(data)
Finally, we instantiate a predictor object (it can be any class inheriting from BasePredictor) and use it to make a prediction.
# run_prediction.py
from base_predictor import BasePredictor
from my_package.sklearn_predictor import SklearnPredictor

some_data = ...  # placeholder for input data
predictor: BasePredictor = SklearnPredictor()
result = predictor.predict(data=some_data)
This refactored code greatly improves maintainability:
- New types of models can be added easily without impacting the existing code. Users can simply create a class inheriting from BasePredictor and use it right away.
- Dependencies are independent from each other. The tensorflow package doesn’t have to be installed when running an SklearnPredictor (the SklearnPredictor and TensorflowPredictor are defined in different modules).
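To make the first point concrete, here is a sketch of how a user could plug in their own model type without touching the package code. MedianPredictor is an illustrative toy class, not part of Melusine; note that it needs no heavy ML dependency at all:

```python
from abc import ABC, abstractmethod


class BasePredictor(ABC):
    """Same contract as the package's base class."""

    @abstractmethod
    def predict(self, data):
        raise NotImplementedError()


# A user-defined predictor: no change to the package source is needed.
class MedianPredictor(BasePredictor):
    """Toy model that always predicts the median of its training values."""

    def __init__(self, training_values):
        ordered = sorted(training_values)
        self.median = ordered[len(ordered) // 2]

    def predict(self, data):
        # One prediction per input row
        return [self.median for _ in data]


predictor: BasePredictor = MedianPredictor([3, 1, 2, 5, 4])
print(predictor.predict(["row1", "row2", "row3"]))  # -> [3, 3, 3]
```

Any code written against the BasePredictor interface accepts this class unchanged, which is exactly the decoupling dependency injection buys us.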
Optional Dependencies: Tailored for Users
Instead of forcing all users to install all dependencies, Melusine provides optional dependency installation options. This allows users to choose the dependencies they need based on their specific use cases, reducing the overall package size and simplifying the installation process.
In our example, we want to install only one of tensorflow or sklearn. Optional dependencies can be set up in the pyproject.toml file.
# pyproject.toml
[project]
name = "my_package"
dependencies = ["pandas==2.0.0"]

[project.optional-dependencies]
sklearn = ["scikit-learn==1.3.2"]
tensorflow = ["tensorflow==2.15.0"]
Pandas is set as a mandatory dependency: it will always be installed when running pip install my_package. In contrast, sklearn and tensorflow are optional dependencies, installed only when running pip install my_package[sklearn] or pip install my_package[tensorflow] respectively.
tox: Testing through the Maze of Dependencies
With the code refactored and dependency management streamlined, Melusine faced a new challenge: ensuring that the package works seamlessly with different dependency configurations. To address this challenge, the team turned to the tox testing framework.
Tox is a tool that can help you test your Python packages with different versions of dependencies. This can help you catch dependency conflicts before they cause problems for your users.
Once you have created and configured a tox.ini file, you can run tox to test your package against each of the specified dependency configurations. If there are any dependency conflicts, tox will report them.
# tox.ini
[tox]
requires = tox>=4
env_list = base, sklearn, tensorflow

[testenv]
description = run unit tests with the base dependencies
deps = pytest
commands = pytest

[testenv:sklearn]
description = run unit tests with the sklearn dependencies
extras = sklearn

[testenv:tensorflow]
description = run unit tests with the tensorflow dependencies
extras = tensorflow
This file creates three testing environments: base, sklearn, and tensorflow. The sklearn and tensorflow environments inherit the pytest dependency and test command from the base [testenv] section and additionally install the matching optional extra.
Conditional Test Execution: Making Tests Smarter
The last challenge we needed to tackle was to skip the tests requiring tensorflow when using the base and sklearn environments. This can be done simply with the pytest.importorskip function.
# test_tensorflow_predictor.py
import pytest

# Skip the whole module if tensorflow is not installed
tensorflow = pytest.importorskip("tensorflow")

from my_package.tensorflow_predictor import TensorflowPredictor


def test_tensorflow_predictor():
    # some_data and expected_result are defined elsewhere in the test suite
    predictor = TensorflowPredictor()
    assert predictor.predict(some_data) == expected_result
The pytest.importorskip function checks whether the tensorflow package is installed and, if not, skips the remaining tests in the module.
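Under the hood, this availability check amounts to asking importlib whether a module can be found. A small stdlib-only sketch of the same idea (no pytest required):

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if `name` can be imported, without actually importing it."""
    return importlib.util.find_spec(name) is not None


print(has_module("json"))             # stdlib module: True
print(has_module("no_such_package"))  # not installed: False
```

pytest.importorskip goes one step further: it actually imports the module and, on failure, raises pytest’s internal skip exception, which pytest reports as a skipped test rather than an error.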
Wrap-up
The strategy we adopted to avoid dependency conflicts when rewriting Melusine was:
- Refactor the code to use dependency injection
- Set up optional dependencies in the package requirements
- Configure tox to use multiple testing environments
- Use pytest.importorskip to make test execution conditional on the installed packages
Conclusions
Dependency conflicts are a real problem for package maintainers. However, using design patterns, optional dependencies, and tox, can help keep your Python packages clean and up to date.
The adoption of these enhanced code design principles in Melusine v3 has significantly transformed its maintainability. Developers can now focus on their specific areas of expertise, working independently on modules and components without the risk of impacting each other. This specialization has accelerated development and improved code quality.
Melusine’s journey from a complex codebase to a well-maintained open-source package highlights the importance of effective code design and modularity. By adopting a more structured and component-based approach, we have made it significantly more robust and reliable for MAIF and the wider open-source community.
About the author
I grew up in the French Alps, studied Physics / Nuclear Engineering in Switzerland, Sweden and England, and I’ve been working as a Data Scientist since 2018 at Quantmetry and MAIF. I am also a big fan of wakeboarding and board games :)
Follow me on LinkedIn.
Leave a star for Melusine on GitHub!