Rewriting a Python package: The journey towards Melusine v3.0

Hugo Perrier
Published in OSS by MAIF
Mar 12, 2024
Streamline your email processing workflow with Melusine.

Melusine is an email processing Python package developed by MAIF. Originally designed for email routing, Melusine has evolved over the years to encompass a wider range of email processing tasks.

As the Data Factory at MAIF matured, the need for a modern, maintainable, and extensible code base became apparent. This led to a complete rewrite of Melusine.

Evolution of automatic email processing at MAIF over the years (only the core functionalities are open-sourced, not the applications).

In this article, I’ll describe some of the pain points encountered while maintaining the open-source package and using it in a production environment.

What is the problem?

If you have been running software for several years, these situations may sound familiar to you:

Business analyst: “Could we add this simple rule to improve email qualification?”
Data Scientist: “Sure, we will have to modify 30 files in five different repositories. It should take about 6 months”

MAIF representative: “Why is this email marked as urgent?”
Data Scientist: “Well, it could be the model… or the business rules… maybe both? We are not sure”

New developer: “What does this line of code do?”
Experienced developer: “It was there when I arrived and I’m afraid to delete it”

Managing a large code base over time is full of challenges, but in this article I’d like to focus on four of them: modernity, modularity, continuous integration, and robustness.

Modernity

“Bag-of-words, TF-IDF, Word2Vec, GloVe, FastText, RNN, LSTM, Transformers, BERT, GPT, Llama, Mistral, …”

Melusine uses Natural Language Processing (NLP) to process emails. For example, it can detect whether the sender is angry based on the content of the email. The challenge is that, since 2018, the NLP landscape has changed drastically!

We went from data scientists running their custom preprocessing code and training models from scratch to widely available, off-the-shelf large language models.

The new Melusine is designed to be agnostic of the deep-learning framework, so that it stays compatible with the latest advancements in NLP. Proven design patterns let users switch easily from one machine learning package to another.
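To make the idea concrete, here is a minimal sketch of what framework-agnostic code can look like. It is not Melusine's actual API: the names `EmailClassifier`, `KeywordClassifier` and `DissatisfactionDetector` are hypothetical, and the keyword backend is only a stand-in for a real model. The point is that the business logic depends on a small interface, so a backend built on any machine learning package could be swapped in.

```python
from __future__ import annotations

from typing import Protocol


class EmailClassifier(Protocol):
    """Hypothetical interface: anything that scores a text against labels."""

    def predict(self, text: str, labels: list[str]) -> dict[str, float]:
        ...


class KeywordClassifier:
    """Toy backend with no ML dependency: counts keyword hits per label."""

    def __init__(self, keywords: dict[str, list[str]]) -> None:
        self.keywords = keywords

    def predict(self, text: str, labels: list[str]) -> dict[str, float]:
        lowered = text.lower()
        return {
            label: float(sum(word in lowered for word in self.keywords.get(label, [])))
            for label in labels
        }


class DissatisfactionDetector:
    """Business logic that only knows about the EmailClassifier interface."""

    def __init__(self, classifier: EmailClassifier, threshold: float = 1.0) -> None:
        self.classifier = classifier
        self.threshold = threshold

    def is_dissatisfied(self, text: str) -> bool:
        scores = self.classifier.predict(text, ["dissatisfaction"])
        return scores["dissatisfaction"] >= self.threshold


# Swapping the backend means passing another object with a `predict` method.
detector = DissatisfactionDetector(
    KeywordClassifier({"dissatisfaction": ["unacceptable", "disappointed"]})
)
print(detector.is_dissatisfied("I am very disappointed by your service."))
```

A backend wrapping a transformer model would only need to expose the same `predict` method; the detector itself would not change.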

Experience it yourself with this tutorial on zero-shot email classification using Melusine and the HuggingFace transformers package.

We also focused on where Melusine shines: dealing with the specificities of email data with features like email conversation segmentation or signature text extraction.
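For readers unfamiliar with these two tasks, the sketch below shows the general idea with plain regular expressions. It is a deliberately simplified illustration, not Melusine's implementation: the two patterns and the function names `segment_conversation` and `split_signature` are assumptions, and real mailboxes need far more variants (languages, mail clients, forwarded messages).

```python
from __future__ import annotations

import re

# Hypothetical, simplified patterns for reply headers and sign-off lines.
TRANSITION = re.compile(
    r"^(?:On .+ wrote:|-----Original Message-----)\s*$", re.MULTILINE
)
SIGNATURE = re.compile(
    r"^(?:Best regards|Kind regards|Regards|Sincerely),?\s*$",
    re.MULTILINE | re.IGNORECASE,
)


def segment_conversation(email_text: str) -> list[str]:
    """Split a raw email into individual messages, most recent first."""
    parts = TRANSITION.split(email_text)
    return [part.strip() for part in parts if part.strip()]


def split_signature(message: str) -> tuple[str, str]:
    """Return (body, signature); the signature is empty if no sign-off is found."""
    match = SIGNATURE.search(message)
    if match is None:
        return message.strip(), ""
    return message[: match.start()].strip(), message[match.start():].strip()


raw_email = (
    "Hello, please find the form attached.\n"
    "Best regards,\n"
    "Alice\n"
    "On Mon, 4 Mar 2024, Bob wrote:\n"
    "Could you send me the form?\n"
)
messages = segment_conversation(raw_email)
body, signature = split_signature(messages[0])
print(len(messages), repr(body), repr(signature))
```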

Modularity

“I just had a call with the business, this year they would like to add 27 new use-cases of email automation”

As the scope of Melusine’s functionality expanded to encompass various mailboxes and unique processing requirements, the existing code base grew increasingly complex.

Being able to deliver new features without breaking existing ones is paramount for Melusine.

To address this challenge, while rewriting Melusine, we embraced the Single Responsibility Principle (SRP) as a guiding principle. The SRP dictates that each part of the code should have a single responsibility, performing a specific task and minimizing interdependencies.

In practice, the SRP has been implemented in Melusine v3.0 by organizing the code into a pipeline framework. Each pipeline step handles a distinct task, such as text cleaning or detection. For instance, one step could segment an email conversation, while another applies a machine learning model to identify dissatisfaction in an email.
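The pipeline idea can be sketched in a few lines. This is an illustration under assumptions, not the Melusine v3.0 API: `Email`, `Pipeline`, `clean_text` and `detect_dissatisfaction` are hypothetical names, and the keyword test stands in for a machine learning model.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Email:
    """Minimal data container passed from one step to the next."""

    body: str
    tags: dict[str, bool] = field(default_factory=dict)


# A step is any callable that takes an Email and returns an Email.
Step = Callable[[Email], Email]


def clean_text(email: Email) -> Email:
    """Single responsibility: normalise whitespace."""
    email.body = " ".join(email.body.split())
    return email


def detect_dissatisfaction(email: Email) -> Email:
    """Single responsibility: flag dissatisfaction (keyword stand-in for a model)."""
    email.tags["dissatisfaction"] = "disappointed" in email.body.lower()
    return email


class Pipeline:
    """Runs independent steps in order; steps do not know about each other."""

    def __init__(self, steps: list[Step]) -> None:
        self.steps = steps

    def run(self, email: Email) -> Email:
        for step in self.steps:
            email = step(email)
        return email


pipeline = Pipeline([clean_text, detect_dissatisfaction])
result = pipeline.run(Email(body="I am   very\n disappointed."))
print(result.body, result.tags)
```

Because each step has one job, adding a new use-case mostly means writing a new step and inserting it in the list, without touching the others.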

This modular approach has yielded several benefits:

  1. Facilitated Onboarding: New developers can work on a module directly without the need to dive into the entire code base.
  2. Secured Deployments: As coupling is minimized, the risk of breaking an existing feature while working on a new one is limited.
  3. Enhanced Collaboration: Code reviews are much simpler on modular code, and developers can spend more time discussing the general design rather than specific code blocks.
  4. Code Reusability: At MAIF, we use Melusine to process emails from different mailboxes. Building on generic code blocks greatly simplifies the maintenance of our applications.

Check out this tutorial explaining how to build an email processing pipeline with Melusine.

Continuous Integration

“We got a really cool external contribution to the open-source package. I’m just worried that it could impact our production code if we integrate it”

Test automation with GitHub Actions

Data Science and NLP are highly dynamic fields, with new tools and updates being released on a daily basis. In this context, developers may be tempted to add every emerging functionality to their packages to chase the state of the art. However, this approach can lead to dependency conflicts, breaking API changes, and a codebase that is constantly out of date. To address these challenges, MAIF focused on three key aspects during the rewrite of Melusine:

  1. Limited Mandatory Dependencies: To minimize the impact of dependency changes, Melusine v3.0 separates features requiring external packages from the core modules. These external dependencies are then made optional, allowing users to select only what they need.
  2. Significant Test Coverage: To ensure the stability and reliability of Melusine, MAIF implemented a comprehensive testing strategy, achieving over 95% test coverage.
  3. Tested Tutorials: Instead of notebooks, the tutorials in Melusine v3.0 are written as unit-tested Python files and incorporated into the documentation.
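The first point, optional dependencies, is commonly implemented with a lazy import: the heavy package is imported only when the feature that needs it is called. The sketch below shows one way to do it. The helper `require_optional`, the package name `my-package` and the extra name are hypothetical, and this is not necessarily how Melusine does it.

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, extra: str) -> ModuleType:
    """Import an optional dependency or fail with an actionable message."""
    try:
        return importlib.import_module(module_name)
    except ImportError as error:
        # "my-package" is a placeholder for the real distribution name.
        raise ImportError(
            f"'{module_name}' is required for this feature. "
            f"Install it with: pip install my-package[{extra}]"
        ) from error


def load_zero_shot_backend() -> ModuleType:
    # Imported only when the feature is used, so the core package
    # stays importable without the heavy dependency.
    return require_optional("transformers", extra="transformers")


# A standard-library module is always available, so this call succeeds.
json_module = require_optional("json", extra="unused")
print(json_module.__name__)
```

Users who never call `load_zero_shot_backend` never need the optional package installed.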

With these measures in place, we are much more confident in the robustness and maintainability of the package over time, and we will be able to focus our energy on new features rather than on fixing bugs.
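As an illustration of the tested-tutorial idea, a tutorial can be an ordinary Python function paired with a test that runs it. The example below is a hypothetical sketch (the function names and the urgency rule are made up for the example): if a code change breaks the tutorial, the test suite fails before the documentation goes stale.

```python
def tutorial_urgent_email_detection() -> dict:
    """Tutorial body: plain Python that readers can copy and run."""
    email_text = "URGENT: my car was stolen last night"
    is_urgent = "urgent" in email_text.lower()
    return {"text": email_text, "urgent": is_urgent}


def test_tutorial_urgent_email_detection() -> None:
    """Collected by a test runner such as pytest, like any other unit test."""
    result = tutorial_urgent_email_detection()
    assert result["urgent"] is True


# Running the test directly also works outside of a test runner.
test_tutorial_urgent_email_detection()
```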

Robustness

In its early days, the MAIF Data Factory primarily relied on external developers to build its codebase. As these developers came and went, their varying coding styles introduced inconsistencies and made it challenging to onboard new developers. To address this issue, MAIF undertook a comprehensive overhaul of its Data Factory, focusing on four key aspects:

  1. People: MAIF recruited a diverse team of skilled professionals, including data scientists, software developers, operations engineers, and project managers. These new colleagues brought much-needed experience to the Data Factory.
  2. Technical Expertise: MAIF invested in developing its in-house NLP and data science expertise and kept learning from maintaining and developing open-source packages.
  3. Industrialization: The Data Factory’s deployment process transitioned from manual to highly automated workflows. This involved implementing rigorous test coverage, code quality checks, and pre-commit hooks.
  4. Organizational Structure: MAIF revamped its organizational structure, moving away from a startup-like approach with limited project management and sporadic business interactions. The Data Factory adopted a more structured organizational model, featuring agile teams with product owners, facilitators, and systematic interactions with end-users and business representatives.

By implementing these transformative changes, MAIF instilled a culture of best practices and a more consistent coding philosophy within the Data Factory, laying a solid foundation for the development of the new Melusine.

Conclusion

In this article, we’ve presented the pitfalls of maintaining an open-source package and the challenges of running an application in production, and we have shared the solutions we implemented to address them during Melusine’s rewrite.

Initiating a complete code rewrite is a daunting task but also a remarkable opportunity to enhance our coding abilities. The journey towards Melusine v3.0 has profoundly impacted the MAIF Data Factory team and resulted in long-lasting transformations.

MAIF is eager to share these valuable insights with the community by open-sourcing the new version of Melusine, paving the way for continued innovation and collaboration.

About the author

I grew up in the French Alps, studied Physics / Nuclear Engineering in Switzerland, Sweden and England, and I’ve been working as a Data Scientist since 2018 at Quantmetry and MAIF. I am also a big fan of wakeboarding and board games :)

Follow me on LinkedIn.
Leave a star for Melusine on GitHub!

NLP practitioner and Open-Source enthusiast! I am a core contributor to the Melusine package. Senior ML Engineer at Quantmetry. Wakeboard fan!