Surveying the Microorganisms of Lake Baikal: An Open Project by MaritimeAI and Yandex Cloud

Sergey Ivanov, Yandex
Sep 23, 2022 · 12 min read

Hello! My name is Sergey Ivanov, and I’m a Data Scientist on the MaritimeAI team. Together with Yandex Cloud, we’re building a system that allows scientists at Irkutsk State University’s National Research Institute of Biology to monitor the ecology of Lake Baikal, the world’s largest freshwater lake by volume. Why does this matter? The lake’s water is home to hundreds of plankton species, and it’s the intricate balance of these microorganisms that makes Baikal so unique, pristine, and perfectly suitable for fishing and other activities. We should care for the lake’s health to ensure that it stays useful to humans and continues to inspire awe in travelers.

Until recently, tallying and determining the species had to be done manually: scholars would use their own eyes, a microscope, and a blank paper form to mark the presence of certain organisms. We wanted to automate this process and share the dataset with the community via GitHub. At the end of this post, I’ll clarify how this dataset can be applied, how we intend to maintain it, and what will appear in the repository in the future.

What do we generally do at MaritimeAI? Our team applies machine learning methods to maritime, ecological, and many other kinds of research. We can recognize different sea ice types in satellite imagery, process sonar data, detect oil spills, and enhance underwater video recordings.

With Yandex Cloud’s support, we’re building a system for Irkutsk State University’s (ISU) Institute of Biology. This system will aid researchers in conducting and maintaining a unique experiment: monitoring the ecological conditions in Lake Baikal.

For 77 years, ISU’s National Research Institute of Biology has been making scientific observations of Lake Baikal’s waters. It’s the world’s longest ongoing research project of its kind, with a methodology that’s remained unchanged the entire time. Water samples are collected at a specific point in the lake, at depths ranging from 0 to 700 meters. Then, researchers register various properties, both hydrophysical (transparency and temperature at different depths) and hydrobiological (quantity and species composition of phyto- and zooplankton).

Previously, this job required meticulously checking the entire sample through a microscope lens and determining each organism’s species. Meanwhile, the waters of Baikal are inhabited not only by native plankton species but also by rare, undiscovered, and invasive species. After counting the species within each water sample, the scientists at the Institute of Biology fill out a summary card. In the past, it used to be a cardboard sheet — but these days, it’s obviously in digital form.

Our task was to automate the assessment of hydrobiological indicators in a way that eases the routine work of the Research Institute of Biology’s specialists while leaving them the opportunity to study new organisms independently.

A lonely Cyclotella

Our Approach

The task at hand would seem like a typical object detection case, with many available solutions. But there’s a catch: we don’t have a fixed list of object classes. Although there is a list of species inhabiting Baikal, it’s neither complete nor final. Some species go through several stages of development, of which there can be many. These stages are often hard to distinguish when organisms aren’t favorably positioned in the view, or when only parts of an organism end up in the sample. On top of that, new objects may include plants, seeds, artificial debris, and invasive species. Each piece of this diverse composition has to be recognized as something new and shown to specialists. Therefore, we need a system that takes on the routine of detecting the “usual” lake inhabitants while leaving the complex, interesting, or unusual objects to knowledgeable humans.

Epischura baicalensis

The presented task sounds quite a bit more complicated than the run-of-the-mill problem statements found in business environments and machine learning courses. We’re looking at a constantly supplemented set of objects, and there’s no way to determine the limit of this expansion in advance. Similar problems arise in biology and other areas where data continuously changes in unpredictable ways.

Another difference from basic ML tasks is quality assessment. From the very beginning, we wondered how exactly to tell whether our algorithms are doing a good job. Naturally, we can use the share of correctly recognized objects, standard metrics such as F1 and ROC-AUC, and IoU when detecting objects.
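To make these measures concrete, here’s a minimal sketch (illustrative only, not our actual evaluation code) of two per-object checks: an F1 score over predicted class labels and IoU between a predicted and a ground-truth bounding box.

```python
# Illustrative metric computations; class names and boxes are made up.
from sklearn.metrics import f1_score

def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

y_true = ["epischura", "cyclotella", "epischura"]
y_pred = ["epischura", "epischura", "epischura"]
print(f1_score(y_true, y_pred, average="macro"))   # class-averaged F1
print(iou((10, 10, 50, 50), (30, 30, 70, 70)))     # overlap of two boxes
```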

Let’s imagine that the algorithm successfully recognizes all species of plankton occurring in winter and that the metrics are satisfactory. But suddenly, the algorithm runs into an invasive species it’s unfamiliar with, and this species quickly becomes more numerous than all the other plankton species in the lake. In this case, all the metrics will promptly drop as the known species decrease in quantity and unknown objects grow in presence. However, this shouldn’t be too much of an issue: it’s enough to memorize the new species’ appearance once, and the metrics will climb back to “good” values. We can conclude that accuracy-style metrics may not be the best indicators for assessing the algorithms’ work.

As a result, as in most real-life tasks, we track the parameter that matters most: how much time a specialist spends, on average, processing a single sample. Of course, such a high-level metric isn’t directly related to neural network metrics, but it does reflect the usefulness of the entire system, which is the very goal we’re striving for with this project. Before starting development, we measured the time the Research Institute employees spend on one sample: about half an hour on average. To smooth out errors, a sample may be processed several times and the results of each count averaged. Our goal is for recording a sample to no longer require a zooplankton specialist’s participation. Although the algorithms are still in training, they already take over the labeling at times, and by January 2023, we’ll be automatically processing most of the common zooplankton species.

Here’s what the general classification and decision-making process looks like:

In sum, we have an incoming stream of images where different objects appear in front of a more-or-less consistent background. We need to label these objects and later recognize them if possible. We combined these steps with manual image processing, as shown in the diagram.

During the development process, we didn’t have a clear understanding of all the nuances of the Research Institute specialists’ work, so first, we digitized the process “as is.” This is not to say that we didn’t observe how researchers handle the samples; but, as in any real task, there are thousands of caveats we might not notice. Instead of studying the sample only through a microscope, a specialist now takes photographs, saving frames from the microscope’s camera into a sample cataloging environment built by our team. The researcher can then view the sample on a computer, select objects by labeling them with polygons, and assign classes to those objects. The summary card is then generated automatically.

The first stage of digitization went well, so we moved on to the second, which involved replacing parts of the labeling process with algorithms. This process consists of two large blocks: object detection and classification.

Object detection is currently the most interesting part of the system, but also our weakest point. Ideally, we’d detect everything we see without fail, but this is where difficulties come to light:

  1. We may encounter an object we’ve never seen before
  2. Objects can often overlap each other

Because of this, we needed something sturdier than standard detectors and classifiers. At the moment, the best solution is to label the background rather than the objects themselves. After all, the background is relatively consistent: it’s a glass container of a defined capacity with an amber backlight. The container has many bumps and scratches on its bottom that need to be distinguished from objects, which sadly rules out simple color segmentation. By labeling the background, we can create masks for the background and the objects and then separate individual objects using OpenCV’s standard watershed algorithm. Right now, we label the background with a classic UNet segmentation network but continue to experiment with other options.
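Here’s a simplified sketch of that separation step, assuming a binary background mask like the one a UNet would produce; the threshold values are illustrative, not tuned:

```python
# Split touching objects with OpenCV's watershed, starting from a
# background mask (255 = background, 0 = objects).
import cv2
import numpy as np

def separate_objects(image_bgr, background_mask):
    objects = cv2.bitwise_not(background_mask)          # foreground pixels
    # "Sure" foreground: peaks of the distance transform inside objects.
    dist = cv2.distanceTransform(objects, cv2.DIST_L2, 5)
    _, sure_fg = cv2.threshold(dist, 0.5 * dist.max(), 255, cv2.THRESH_BINARY)
    sure_fg = sure_fg.astype(np.uint8)
    unknown = cv2.subtract(objects, sure_fg)            # ambiguous border zone
    # Seed markers from connected components of the sure foreground.
    _, markers = cv2.connectedComponents(sure_fg)
    markers = markers + 1                               # reserve 0 for "unknown"
    markers[unknown == 255] = 0
    return cv2.watershed(image_bgr, markers)            # integer label per object
```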

With this approach, we retrieve individual objects, or at least clumps of objects if we’re unlucky. When there are too many overlapping objects, adding a bit more water to the container helps.

For classification, we use a metric learning approach. We have numerous object classes, and the number of examples per class is highly imbalanced; for instance, we don’t expect to see a Baikal daphnia this year. In addition, we can use a separate algorithm to determine the novelty of an object for each object class. Currently, the classifier is a model composed of three interconnected parts. The first part is a fairly standard neural network, currently based on ResNet (but subject to change, as we experiment with this all the time). We train this network with different variations of triplet loss to extract distinctive object embeddings from images. The second part is, in essence, a nearest-neighbor criterion running on the Faiss engine. It helps us cluster the data and determine how much a new image resembles or differs from previously encountered ones. Finally, the embeddings and the distances between them power the third part of the model, the out-of-domain classifiers, which determine the “novelty” of a newly encountered image, one predictor per class. The classifier may confuse some classes (I do, too, to be honest), which is why we let humans assess the object and decide whether it’s genuinely novel or the algorithm has made a mistake.
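The nearest-neighbor part is the simplest to illustrate. Here’s a minimal sketch with made-up embedding dimensions and labels (our actual index configuration differs):

```python
# Index the embeddings of already-labeled objects and look up the
# closest ones for a new object. Dimensions and data are made up.
import faiss
import numpy as np

dim = 256                                             # embedding size (assumed)
known = np.random.rand(1000, dim).astype("float32")   # embeddings of labeled objects
labels = np.random.randint(0, 20, size=1000)          # their class ids

index = faiss.IndexFlatL2(dim)                        # exact L2 nearest-neighbor search
index.add(known)

query = np.random.rand(1, dim).astype("float32")      # embedding of a new object
distances, neighbors = index.search(query, 5)

# Large distances to every neighbor hint that the object may be novel and
# should go to a human; otherwise, vote among the neighbors' labels.
predicted_class = np.bincount(labels[neighbors[0]]).argmax()
```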

Solutions diagram

After a successful manual or automatic classification, new objects end up in the internal database (MongoDB). We use this database to create reports, which are the end result of a sample observation. Unlike the nostalgic analog method, this approach allows us to view the same sample again after processing it. Furthermore, the database helps us form datasets to update the detector and the classifier, which makes it possible to introduce new species to the algorithm and improve recognition metrics for known objects.
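For a sense of what such a record might contain, here’s a sketch using pymongo; the collection and field names are assumptions based on the description above, not our actual schema:

```python
# What a recognized object's record might look like; the collection and
# field names are assumptions, not the project's actual schema.
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["baikal"]

db.objects.insert_one({
    "sample_id": "2022-09-01-station-1",
    "image_key": "samples/frame_0042.png",          # stored in Object Storage
    "polygon": [[10, 12], [48, 15], [45, 60], [9, 55]],
    "class_label": "epischura",
    "labeled_by": "model",                          # or "human" after review
    "novelty_score": 0.07,
})
```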

Diagram of the dataset and model update cycle

This way, we gradually memorize every main species in the waters of Baikal. Obviously, it takes time, and we’ll have to wait a few years until we see the special Baikal daphnia. However, the more objects we encounter, the better the accuracy when recognizing new ones.

As we worked, we encountered numerous nuances and surprises. Here are a few examples:

  • We can measure the algorithms’ error rates, but we can’t yet assess the quality of human work. In the past, biologists would tally each sample twice and average the results to make up for any errors.
  • A specialist’s trained eye can distinguish a long-familiar species even in a heavily smeared image. When “scrolling” through the sample with a microscope, the scientist won’t pause to focus on some Epischura baicalensis. We’re trying to add video processing, that is, continuously capturing frames from the microscope’s camera, to increase the speed of sample processing. So far, we’ve run up against the limitations of both the cameras and the rate of “scrolling.”
  • Within one species, some stages of development differ only by the number of legs, some of which may simply not end up in the frame. Such examples can catch even professionals with decades of experience off-guard.
  • When the images undergo significant, systemic change, we switch to fully manual mode. For example, many additional objects appear in the water when summer comes. We don’t really account for these objects, but they are certainly there. Usually, the Research Institute’s specialists know what they’re seeing and skip small nuances. For neural networks, it’s training season.

Keratella Quadrata

Implementation in the Cloud

We do everything in Yandex Cloud, using the platform’s services.

We’ve created an online portal with sample image albums for the Institute’s researchers. It’s a website built with Python and FastAPI that runs in Compute Cloud, stores data in Managed MongoDB, and keeps images in Object Storage.
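As a rough illustration of the portal’s shape (the route, bucket, and database names here are made up, and this is not the production code), an upload path might look like this:

```python
# Illustrative upload endpoint: store the frame bytes in S3-compatible
# Object Storage and register the metadata in MongoDB.
import boto3
from fastapi import FastAPI, UploadFile
from pymongo import MongoClient

app = FastAPI()
s3 = boto3.client("s3", endpoint_url="https://storage.yandexcloud.net")
db = MongoClient("mongodb://localhost:27017")["baikal"]

@app.post("/samples/{sample_id}/frames")
async def upload_frame(sample_id: str, file: UploadFile):
    key = f"{sample_id}/{file.filename}"
    s3.upload_fileobj(file.file, "plankton-frames", key)   # bucket name assumed
    db.frames.insert_one({"sample_id": sample_id, "s3_key": key})
    return {"s3_key": key}
```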

To train new models, we use DataSphere notebooks. It’s quite an atypical solution for us because we’re used to training models via scripts: they’re convenient to leave running for a long time and easy to version. A significant advantage of DataSphere is its memory states: we can create checkpoints, save an intermediate state, interrupt a notebook, and later load a previous state without any issues. This feature lets us change the hardware that executes the notebook as we go. Recently, it also became possible to serve a notebook cell containing a trained model as a microservice. As a result, we’ve assembled a typical notebook where you can tweak specific parameters as circumstances dictate and swap the model or learning algorithm. Upon finishing training, the notebook outputs a battery of statistics that we use to decide whether to update the recognition models. If we’re happy with the results, we publish the serialized models to separate model storage.

Our recognition algorithms don’t need to work in real time, which means we can run the recognition process on a modestly powered virtual machine while everyone’s getting a good night’s sleep.

We’ve created a compact recognition service to which we pass messages through the Message Queue mechanism. The service employs the good old Faiss and ONNX for neural networks.
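Since Yandex Message Queue is SQS-compatible, the worker’s core loop can be sketched roughly as follows (the queue plumbing is simplified, and the model input name and preprocessing helper are hypothetical):

```python
# Sketch of a recognition worker: poll the queue, run the ONNX encoder,
# then hand the embedding to the Faiss-based classifier.
import boto3
import numpy as np
import onnxruntime as ort

# Yandex Message Queue is SQS-compatible, so a standard boto3 client works.
sqs = boto3.client("sqs", endpoint_url="https://message-queue.api.cloud.yandex.net")
session = ort.InferenceSession("encoder.onnx")   # serialized model; file name assumed

def fetch_and_preprocess(message_body: str) -> np.ndarray:
    """Hypothetical helper: download the referenced frame and return a
    1x3xHxW float32 tensor. Real preprocessing depends on the model."""
    raise NotImplementedError

def process_queue(queue_url: str) -> None:
    resp = sqs.receive_message(QueueUrl=queue_url, WaitTimeSeconds=10)
    for msg in resp.get("Messages", []):
        image = fetch_and_preprocess(msg["Body"])
        (embedding,) = session.run(None, {"input": image})   # input name assumed
        # ...Faiss lookup and database update would go here...
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
```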

Manual Labeling

At one point, we decided to be lazy and not multiply entities. So, instead of embedding open-source labeling tools in our interface or writing our own, we chose a ready-made solution, Toloka. When it comes to labeling, we experiment with different task types and envisage that, through a manual interface, users will not only label new images but also check each other’s work and the algorithms’ output.

Late spring
Summer

Open Data

Now, on to the topic we mentioned in the beginning: open data. With the help of the statistics we collect, the Research Institute of Biology’s employees tackle several scientific problems. However, we suspect that’s only a fraction of what this data may be suitable for.

In the next few months, we’ll provide access to a regularly updated dataset and open the source code. But the basic dataset that we started with ourselves is already openly available.

The link leads to a dataset of labeled microscope images. The JSON file contains records, each corresponding to one object in the water: a crustacean or an alga.

For each object, we provide:

  1. a link to an image that contains the object
  2. a polygon or rectangle with the object within
  3. an object class label

The dataset is well-suited for testing hypotheses on resistance to data drift, as well as detection, segmentation, and classification of objects in an unusual data domain.
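Reading it might look like the following sketch; the exact field names are assumptions based on the list above, so check the repository’s schema before relying on them:

```python
# Reading the dataset; field names are assumptions based on the list
# above, so check the repository's schema before relying on them.
import json

with open("baikal_dataset.json") as f:   # file name assumed
    records = json.load(f)

for rec in records:
    image_url = rec["image"]    # link to the image that contains the object
    polygon = rec["polygon"]    # polygon or rectangle around the object
    label = rec["class"]        # object class label
```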

Our plans for the dataset include splitting it into training and validation parts in a time-series manner and establishing a baseline.

What’s Next?

We have lots of plans.

  1. The organism diversity in samples varies significantly across the seasons: winter, spring, summer, and fall. Thus, at a minimum, we have a year’s worth of work ahead to get acquainted with all the main species and finish training the algorithms.
  2. We have enough experiments behind us for a pretty good academic publication on metric learning.
  3. Baikal is not the only place in the world that’s monitored. Similar observations are made in other parts of the planet from all kinds of bodies of water: each has its distinct inhabitants that need to be studied.
  4. We also have plans to write a biological scientific article, build a wiki website, and create an open database.

These studies will help identify problems in the lake’s balance of microorganisms at an early stage and preserve Baikal for future generations.
