Hermione for Data Scientists
Hermione — ML made simple
Many data scientists learn to build their Machine Learning models in Jupyter Notebooks. If you aren’t familiar with Jupyter Notebooks (known simply as ‘notebooks’ by close friends), they are one of the tools most used by data scientists to explore data, test hypotheses and train ML models. And it’s easy to see why: notebooks are easy to use and to understand, and they help scientists present their results.
In some projects, the Data Scientist’s main objective is to deliver a model to production. These are the moments when we realize how complicated it is to put a notebook into production; after all, our main concern as Data Scientists is to solve problems, innovate, and generate value for businesses and companies. We can summarize the issue in one question: “How do we make a notebook production-ready?” If we don’t prepare ourselves for those moments, we are prone to get stuck, dwelling on the several problems we may find along the way.
Notebooks were made to be a working tool, great for exploring and presenting data; putting them into production, however, is limiting. Imagine a scenario that demands constant maintenance, data aggregation from multiple databases, pre-processing pipelines and, on top of all that, the fully functional running models themselves. And what about the retraining cycle and model monitoring? For goodness’ sake, it’s just one notebook! No, taking a fully detailed notebook and putting it into production to be consumed is not a piece of cake.
To bring in a little of the A3Data experience: we work in Data Science teams inside several client companies, and the excellence of notebooks as a data exploration tool is undeniable. Nevertheless, when it comes to the product and its context, when models need to be consumed, monitored and periodically maintained, putting them into production inside a Jupyter Notebook is not the best choice (and we are not even mentioning memory and CPU performance yet). And that’s where Hermione comes in!
But wait! Not that Hermione herself. But yes, we were inspired by this brilliant, empowered and awesome witch when naming this framework!
Hermione is the newest open source library that helps Data Scientists set up more organized code, in a quicker and simpler way. Besides, Hermione includes classes that assist with daily tasks such as column normalization and denormalization, visualization, text vectorization, etc. Using Hermione, all you need to do is execute a method and the rest is up to her, just like magic! To make that possible, some basic knowledge of Object Oriented Programming is necessary. Check out this article for more information.
What can Hermione do?
Hermione was created to make the process of setting up Machine Learning models easier. To do that, it combines several strategies: it prescribes a particular directory structure that keeps the code organized; it provides classes that guide the creation of a structured pipeline for ML modeling; and it offers a class with several methods for static and interactive data visualization, all with just one line of code.
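To give a feel for the structured-pipeline idea, here is a minimal sketch of the kind of object-oriented pre-processing class such a layout encourages, with fit/transform/inverse steps instead of loose notebook cells. The class and method names below are ours, for illustration only, not Hermione’s actual API:

```python
class Normalizer:
    """Min-max normalization with an inverse (denormalization) step."""

    def fit(self, values):
        # Learn the scaling bounds from the training data
        self.min_ = min(values)
        self.max_ = max(values)
        return self

    def transform(self, values):
        span = (self.max_ - self.min_) or 1.0  # avoid division by zero
        return [(v - self.min_) / span for v in values]

    def inverse_transform(self, scaled):
        # Denormalize back to the original scale
        span = (self.max_ - self.min_) or 1.0
        return [s * span + self.min_ for s in scaled]


norm = Normalizer().fit([10, 20, 30])
print(norm.transform([10, 20, 30]))             # [0.0, 0.5, 1.0]
print(norm.inverse_transform([0.0, 0.5, 1.0]))  # [10.0, 20.0, 30.0]
```

Keeping fit and transform separate like this is what lets the exact same object serve both the training and the scoring pipeline.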
Hermione suggests the use of MLflow to manage the Machine Learning lifecycle. MLflow has an interface, opened in the browser, to assist with this management.
Furthermore, Hermione provides a unit testing framework. Unit testing is extremely relevant to keep code running properly and following previously defined business rules. Imagine the following scenario: you modify something in the pre-processing pipeline methods but forget to reflect that modification in the training pipeline. Depending on the situation, it could take you hours or even days to realize what had happened. With good unit testing practices, the error could be detected quickly.
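For instance, a unit test can pin a pre-processing business rule down, so that a change which silently breaks it fails immediately. The rule and test below are hypothetical examples of ours (pytest-style: save as `test_preprocessing.py` and run `pytest`, or run the file directly), not part of Hermione’s own suite:

```python
def fill_missing_age(ages, default=30.0):
    """Business rule: missing ages are imputed with a fixed default."""
    return [default if a is None else a for a in ages]


def test_fill_missing_age():
    # The imputation value is part of the contract; changing it breaks this test
    assert fill_missing_age([22.0, None, 35.0]) == [22.0, 30.0, 35.0]


def test_fill_missing_age_keeps_length():
    assert len(fill_missing_age([None, None])) == 2


if __name__ == "__main__":
    test_fill_missing_age()
    test_fill_missing_age_keeps_length()
    print("all tests passed")
```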
How to get started with Hermione?
To use Hermione, you just need to follow the steps below:
- pip install hermione-ml
- hermione info
- hermione new project_name
In step 2, we verify that Hermione was installed properly and check the installed version. In step 3, you choose whether to create an empty project or one filled with the Titanic dataset as an example. The project is created with the name passed after the new command. After these steps, your project is created with Hermione’s full structure:
- File framework and code organization
- Commonly used codes
- Git repository for version control
- Conda virtual environment for package version control
How to use Hermione
Now, let’s practice with Hermione! Before we start, we’ll show you some key points by using the example already installed in the package.
1. Create your new project:
2. Enter “y” to implement the example code:
3. As we mentioned before, Hermione already creates a conda virtual environment for the project; activate it:
4. After activating it, install the required libraries; there are a few suggestions in the “requirements.txt” file:
5. Now we will train the model from the example, using MLflow ❤. To do so, just type hermione train inside the src directory. The “hermione train” command will search for a train.py file and execute it. In the example, models and metrics are already tracked via MLflow.
6. After that, an MLflow experiment is created. To inspect the experiment, type: mlflow ui. The application will start.
7. To access the experiment, just open the address shown in the terminal in your preferred browser. There you can check the trained models and their metrics.
8. In the Titanic example, we also provide a step-by-step notebook. To view it, just type jupyter notebook inside the path /src/notebooks/.
Would you like to contribute to Hermione’s functionalities?
Hermione is available on GitHub!
Link: https://github.com/a3data/hermione
Make a pull request with your implementation.
For suggestions, contact us: hermione@a3data.com.br
#GoA3Data #EmpoweringPeopleThroughData