Introduction
Over the years of ML and DL project development, we have accumulated a large codebase, gained considerable experience and valuable insights, and reached some interesting conclusions. This knowledge comes in very handy when you start a new project: it gives you the confidence to start research, helps you reuse effective methods, and gets you results faster.
It is essential not only to keep this knowledge in developers' heads but to store it in a readable form on some external storage. This way, you can train new team members more effectively, bring them up to date, and help new hires immerse themselves in the project.
Naturally, this has not always been the case. We faced many obstacles at the outset.
- Every new project was structured differently, especially when initiated by different people.
- We didn't keep thorough track of what the code did, who wrote it, and how to run it.
- We didn't use virtualization as much as we should have and kept installing different versions of the same libraries (which doesn't make things easier for your teammates).
- Where are the conclusions drawn from charts that were lost and forgotten in a pile of Jupyter notebooks? Fallen into oblivion, obviously.
- The same goes for projects’ progress and results reports.
To solve these problems once and for all, we decided to work on a clear and unified project structure, virtualization, abstraction of individual components, and code reusability. Gradually, all the progress we achieved in this area morphed into an independent framework: Ocean.
And as the cherry on top, project logs can be automatically aggregated and transformed into a neat website by running just one command.
Why Ocean
There are several other options in the ML world that we considered. First of all, it would be fair to mention cookiecutter-data-science (hereinafter CDS) as our main source of inspiration. CDS not only offers a logical, flexible project structure but also shares guidelines on how to successfully manage a project. So, before you continue reading, we suggest you take a look at the key principles of the original CDS approach.
Having equipped ourselves with CDS for our project, we immediately upgraded the structure as follows: we added a convenient file logger, the Coordinator entity responsible for simple project navigation, and an automatic Sphinx documentation generator. We also moved several commands to the Makefile so that even a newcomer could easily run them.
Working with CDS also revealed some things that are not that great:
- The data folder can grow continuously, but it's not clear which script or notebook generates each new file. The number of files stored there is confusing. Besides, you can't be sure whether any of those files can be useful for the implementation of a new feature, because there is no documentation or description of their intended use.
- The data folder lacks a features subfolder for storing calculated statistics, embeddings, and other features from which various final data representations can be extracted. If you want to dive in, there is an awesome blog post on this subject.
- The src folder is another problem. It contains both functions that are relevant for the project in general, say, the data cleansing and preparation tasks of the src.data submodule, and those that are very specific and often minor, such as the src.models module, which contains all the models from all experiments (and there can be dozens of them). As a result, the src folder is updated too often with minor changes, and according to the CDS philosophy, you need to rebuild the project after each update, which also takes time and… You see the problem, right?
- There is the references folder, but it remains a mystery who should add data there, when, and in what form. And there is a lot to share in the course of a project: what has already been done, what the results are, and what you are going to do next.
To solve the above problems, we introduced a new entity: the Experiment. An Experiment is a repository of all the data relevant to testing some hypothesis: what data were used, what data (artifacts) were produced, the code version, the experiment's beginning and ending timestamps, the source file, parameters, metrics, and logs. Some of these data can be logged using tracking utilities such as MLflow, but the experiment structure that Ocean offers is deeper and more elaborate.
Experiment structure
Here’s what an experiment’s module may look like:
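A sketch of the layout, assembled from the folders referenced throughout this article (the actual template may differ slightly):

exp-001-Parsing/
├── notebooks/     # research notebooks, named using the "number-name" template
├── scripts/       # code that is specific to this experiment, e.g. train.py
├── references/    # graphs, reports, and other materials worth sharing
├── log.md         # the experiment's log: objective, data references, conclusions
└── Makefile       # experiment-level commands such as train and dashboard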
We subdivide the codebase: good, reusable code that is relevant for the entire project stays in the project's src module. It is seldom updated, so you rarely need to rebuild the project. The scripts module of an experiment, on the other hand, contains only the code relevant to that experiment. This way, you can update it frequently without affecting your teammates' work on other experiments.
Let’s take a hypothetical ML/DL project and see what our framework can do.
Project workflow
Starting a new project
So, the client, let's say… the Chicago Police Department, has exported data from their database and set us a task: analyze the crimes committed in Chicago in 2011–2017 and draw conclusions.
Let’s go! Open the terminal and run:
ocean project new -n Crimes
The framework has created the corresponding project folder, crimes. Let's take a look at its structure:
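A sketch assembled from the folders mentioned throughout this article (not the framework's verbatim output):

crimes/
├── config/        # project-wide configs, e.g. alarm_config.yml
├── data/
│   ├── external/  # third-party data, such as the city map polygons
│   ├── raw/       # the original, immutable data from the client
│   └── features/  # calculated statistics, embeddings, and other features
├── docs/          # generated Sphinx documentation
├── experiments/   # one subfolder per experiment
└── src/           # reusable code relevant to the entire project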
The Coordinator entity from the module of the same name, which is already written and ready to go, helps to navigate through all these folders. To use it, you need to first build the project:
make package
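Once the package is built, the coordinator can be used from notebooks and scripts to resolve project paths without hardcoding them. The snippet below is purely illustrative: the import path and attribute names are assumptions, so check the Ocean README for the actual API.

# Hypothetical usage of the Coordinator; the names below are assumptions.
from crimes.coordinator import Coordinator  # assumed import path

coord = Coordinator()
raw_dir = coord.data_raw  # e.g. an absolute path to crimes/data/raw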
Bug alert: If make commands refuse to execute, just add the -B flag to them, like this: make -B package. The same goes for all further examples.
Logs and experiments
Let’s start with adding our client’s data — the crimes.csv file in this case, to the data/raw folder.
The official City of Chicago website has city maps subdivided into beats (the smallest tract of land designated for primary police patrol); sectors (a grouping of 2–5 beats); districts (a grouping of 3 sectors); wards (an administrative district); and community areas. These data can be used for visualization. Since the JSON files with the coordinates of each subdivision's polygons were not sent by the customer, we put them in data/external.
Now, we need to introduce the concept of Experiment. It’s simple: we consider every separate task to be a separate experiment. Need to parse/extract data and prepare it for future use? That’s going to be an experiment. Or, maybe, prepare a lot of data visualization and reports? That’s another experiment. And what if you need to prepare a model and test some hypothesis? Well, take a guess.
To create our first experiment, we need to run the following command from the project folder:
ocean exp new -n Parsing -a ivanov
This will create a new folder named exp-001-Parsing (see above for its structure) in the crimes/experiments folder.
Then, we need to take a look at the data. To do so, we create a notebook in the notebooks folder. Here at Surf, we stick to the naming template "number-name", so the newly created notebook will be named 001-Parse-data.ipynb. Now, inside the notebook, we need to prepare the data for future work.
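For example, the preparation might boil down to something like this (a sketch: the column names and date format are assumptions about the client's dump, and the data/interim destination is borrowed from the CDS layout):

# A sketch of the preparation step inside 001-Parse-data.ipynb.
import pandas as pd

# The notebook lives in experiments/exp-001-Parsing/notebooks/,
# so the project root is three levels up.
df = pd.read_csv('../../../data/raw/crimes.csv')

# Parse the timestamps and drop records without coordinates (assumed columns).
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y %I:%M:%S %p')
df = df.dropna(subset=['Latitude', 'Longitude'])

# Save the cleaned-up dataset for the experiments that follow.
df.to_pickle('../../../data/interim/crimes.pkl')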
The code above can be used as a task for Luigi or Apache Airflow.
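For instance, wrapped into a Luigi task, it might look like this (again a sketch, with the same assumed paths and columns):

import luigi
import pandas as pd

class ParseCrimes(luigi.Task):
    """Clean the raw dump and save it for downstream experiments."""

    def output(self):
        return luigi.LocalTarget('data/interim/crimes.pkl')

    def run(self):
        df = pd.read_csv('data/raw/crimes.csv')
        df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y %I:%M:%S %p')
        df = df.dropna(subset=['Latitude', 'Longitude'])
        df.to_pickle(self.output().path)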
You want your teammates to be aware of what you have done and whether they can use your results, so we need to leave a comment on that in the log.md file, which is basically a markdown file.
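A minimal sketch of its structure (the headings are illustrative; the template Ocean generates may differ):

# Experiment 001: Parsing
Author: ivanov

## Metadata
Objective: parse the client's dump and prepare the data for further work.
Expected outcome: a cleaned-up dataset that other experiments can build on.

## Data
- Uses: data/raw/crimes.csv
- Produces: data/interim/crimes.pkl

## Log
- Parsed the timestamps and dropped the records without coordinates.
- The result is ready for EDA.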
The log combines several kinds of manually added entries: the experiment's metadata, where the author explains their objective and speculates about the outcome of the experiment; references to data, both pregiven and generated in the process, which help to track data files and see who uses them and why; and log entries, which describe the results, conclusions, and reasoning of the experiment. All of these will later be used as content for the project log website.
Next comes the EDA[1] stage. It will likely be carried out by different people, so we'd like the results to be presented in the form of reports and graphs. And that's a good reason to create a new experiment. Run:
ocean exp new -n Eda -a ivanov
Create a new notebook named 001-EDA.ipynb in the notebooks folder. There's no need to present the entire code: your teammates simply won't need it. But they will need graphs and conclusions. The notebook contains a lot of code and doesn't look like something you'd like to present to your client. So, we'll keep our findings and insights in the log.md file and save graph images to the references folder.
Here’s the map of the safest areas of Chicago (a useful thing if you happen to be in the city):
It was generated in a notebook and moved to references.
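Saving it there takes just a couple of lines at the end of the plotting code (a sketch; the actual drawing of the map is omitted):

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 10))
# ... draw the city polygons, colored by crime rate, here ...

# references/ is a sibling of the notebooks/ folder.
fig.savefig('../references/safest-areas.png', dpi=150, bbox_inches='tight')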
The following record is added to the log:
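Roughly like this (the paths and wording are illustrative):

Built a map of the safest areas of the city:
![Safest areas of Chicago](../references/safest-areas.png)
Full code: [001-EDA.ipynb](notebooks/001-EDA.ipynb)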
Please note that the graph appears as an image added to an .md file. And if you add a link to the notebook, it will be converted to HTML and saved as a page of the website.
To build a website from experiment logs, run the following command on the project level:
ocean log new
This command will create a new folder named crimes/project_log and the index.html file containing the project’s log.
Bug alert: when displayed in Jupyter, the website is embedded as an iFrame for greater security, so the fonts may not display correctly. That's why Ocean lets you instantly archive a copy of the website, so that you can easily download it and open it on your local computer. Just like that:
ocean log archive [-n NAME] [-p PASSWORD]
Documentation
Now, let's see how to generate documentation using Sphinx. We'll create a function in the crimes/my_cool_module.py file and document it. Please note that Sphinx uses reStructuredText (RST):
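A sketch of such a documented function (the function itself is made up for illustration):

def top_crime_areas(df, n=10):
    """Return the community areas with the most registered crimes.

    :param df: pandas DataFrame with one row per registered crime.
    :param n: number of areas to return.
    :returns: pandas Series mapping area ids to crime counts,
        sorted in descending order.
    """
    return df.groupby('Community Area').size().sort_values(ascending=False).head(n)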
And now it’s as simple as running a command to generate the documentation:
ocean docs new
Question from the audience: Ok, so we’ve built a project using ‘make’. Why do we have to generate documentation with ‘ocean’?
And we say: generating documentation is not just executing a Sphinx command that could be put into the Makefile. Ocean scans your source code directory, uses it to build an index for Sphinx, and only then does Sphinx do its work.
To find the generated HTML documentation, just open crimes/docs/_build/html/index.html. Our cool documented module will be waiting for you there.
Models
Our next step is building a model. Run:
ocean exp new -n Model -a ivanov
This time, let's take a look at the contents of the experiment's scripts folder. The train.py file is a template for the future training process. It already contains boilerplate code, which is useful in several ways:
1. It contains a preset for the training function, which takes the following file paths (see the sketch after this list):
a) The config file, in which you'd want to put model and training parameters, as well as other options that can be conveniently controlled from the outside, without digging into the code.
b) The relevant data file.
c) The directory which you want to dump the model to.
2. It tracks metrics generated in the training process using MLflow. To see everything that was tracked in the MLflow UI, run make dashboard in the experiment folder.
3. Once the training process has been completed, it sends a notification to your Telegram. We used Alarmerbot to implement this mechanism. To make it work, you just need to start a conversation with the bot, send the /start command, and then move the token issued by the bot to crimes/config/alarm_config.yml. The line can look like this:
ivanov: a5081d-1b6de6-5f2762
4. It is executed from the console.
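Here's a condensed sketch of what a filled-in train.py might look like. The argument names, config keys, and the model itself are assumptions for illustration, not Ocean's exact template, and the Telegram notification step is omitted:

import argparse
import pickle

import mlflow
import pandas as pd
import yaml
from sklearn.ensemble import RandomForestClassifier


def train(config_path, data_path, model_dir):
    # Model and training parameters live in the config, not in the code.
    with open(config_path) as f:
        config = yaml.safe_load(f)

    df = pd.read_pickle(data_path)
    X, y = df.drop(columns=['Arrest']), df['Arrest']  # assumed target column

    model = RandomForestClassifier(n_estimators=config['n_estimators'])
    model.fit(X, y)

    # Track parameters and metrics so they show up in `make dashboard`.
    mlflow.log_param('n_estimators', config['n_estimators'])
    mlflow.log_metric('train_accuracy', model.score(X, y))

    with open(f'{model_dir}/model.pkl', 'wb') as f:
        pickle.dump(model, f)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('--config')
    parser.add_argument('--data')
    parser.add_argument('--model-dir')
    args = parser.parse_args()
    train(args.config, args.data, args.model_dir)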
Why run our script from the console? Everything is organized in such a way that training or obtaining predictions from any model can be easily managed by a third-party developer who is not familiar with the details of your experiment. For all the pieces of the puzzle to come together, after you've finished working on train.py, you need to complete your Makefile. It already has a template for the train command, so all you have to do is set the paths to the files listed above and list everyone willing to receive Telegram notifications in the username parameter values. Use the alias all to send notifications to all team members.
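The filled-in target might end up looking like this (a sketch; the paths and flags follow the train.py sketch above, and Ocean's real template also wires in the username values for the Telegram notifications):

# Assumed paths; adjust to your project.
train:
	python scripts/train.py \
		--config config/train.yml \
		--data ../../data/interim/crimes.pkl \
		--model-dir models/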
Once everything is ready, you can gracefully start the experiment by running the make train command.
In case you want to use someone else's neural networks, virtual environments (venv) will help you do so. Adding them to and deleting them from an experiment is super easy:
- ocean env new will create a new environment. It is not only activated by default but also creates an additional kernel for notebooks and further research. It will have the same name as the experiment.
- ocean env list will display a list of kernels.
- ocean env delete will delete the created environment.
What’s missing?
- Ocean doesn't get on well with conda (because we don't use it).
- The project template is presented only in English.
- There’s still a problem with website localization: building a project log means that all logs should be in English.
Conclusion
You can find the project source code here.
Got your attention? Awesome! For more information, check out the README file in the Ocean repo.
Any contribution is welcome, big or small. We would really appreciate your help in improving the project.
[1] Exploratory Data Analysis