Severstal Kaggle Summary — Why your team should get into data science competitions

Published in Tooploox AI, Dec 19, 2019

Every so often there is a quiet spell in the office between projects, and we like to use it to improve our skills and broaden our horizons. The last such opportunity came in October, when we decided to assemble a MicroscopeIT team and take part in a data science competition.

We agreed to use Kaggle, a website that hosts contests of this kind in various fields (e.g. object detection, natural language processing, time series analysis and prediction), and chose to compete in the Severstal Kaggle Challenge, even though we would be joining halfway through. It was an image segmentation challenge, so it was right up our alley, and the dataset and the size of the competition had attracted a large part of the data science community, which made it easier for us to learn and adapt.

For three of us (out of five) it was an introduction to the site and our first real participation in a data science competition. So here’s a short summary of what happened, the outcomes, and the lessons we’ve learned.

The problem

Our task was to detect defects in images of steel sheets produced in Severstal factories. It was a segmentation task, so we would not only have to correctly label each defect as one of four types (if there were any) but also produce a pixel mask as close to the original annotation as possible.

Our final predictions would be evaluated with the Dice score.
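For readers unfamiliar with the metric, here is a minimal sketch of how a Dice score can be computed for binary masks with NumPy. The function names `dice_score` and `mean_dice` are ours, and treating two empty masks as a perfect match (score 1.0) is an assumption about the convention used, so check the competition rules before relying on it.

```python
import numpy as np


def dice_score(pred, target):
    """Dice coefficient 2*|A ∩ B| / (|A| + |B|) for two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    total = pred.sum() + target.sum()
    if total == 0:
        # Assumption: two empty masks count as a perfect match.
        return 1.0
    intersection = np.logical_and(pred, target).sum()
    return float(2.0 * intersection / total)


def mean_dice(pairs):
    """Average the Dice score over an iterable of (pred, target) mask pairs."""
    scores = [dice_score(pred, target) for pred, target in pairs]
    return float(np.mean(scores))
```

Because images without defects are scored too, a model that predicts spurious masks on clean images is penalised heavily under this convention.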

Examples of the annotated data (4 different defect classes).

What we did

We approached the problem in the traditional fashion, by first familiarizing ourselves with the competition data. We then split the work into 4 tasks:

1. Research

All of us did this in between the other tasks. We researched different approaches and sifted through public kernels in search of the best solution.

2. Exploratory Data Analysis

You can’t really solve a data science problem without understanding the data you are working with. Joanna did this task really well, and it was a huge help in designing and implementing an appropriate pipeline.

3. Processing pipeline

This task was done by Marcin, Mirek, and me: Marcin and Mirek provided the initial pipelines and functions, and I did the stitching work. I was also responsible for task 4, which was later improved by Marek.

4. Integrating our work

It looked good on paper, but tasks 3 and 4 turned out to be serious issues for most of us, mainly because we weren’t accustomed to the Kaggle website and its inner workings. In hindsight, our plan was too ambitious: we spent too much of our precious time on groundwork for model implementation and training, instead of actually training and trying different algorithms to solve the task.

We managed to design and implement a GitHub-Kaggle integration, because we felt Kaggle alone wouldn’t allow us to quickly integrate our work and keep track of the changes made by other team members.

We built a solid preprocessing pipeline, and Mirek even optimised memory usage to fit within the limits Kaggle gave us. (One quirk was that the Kaggle machines came with 13 or 16 GB of RAM but only 5 GB of additional disk space, so it was better to juggle the data in memory than to unpack it into a convenient format on disk.)
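To illustrate the "juggle the data in memory" idea, here is a minimal sketch that reads files straight out of a zip archive without extracting them to disk. It is not our actual pipeline: the helper name `iter_archive_bytes` and the `.jpg` suffix are assumptions made for the example.

```python
import zipfile


def iter_archive_bytes(archive, suffix=".jpg"):
    """Yield (member_name, raw_bytes) pairs without extracting anything to disk.

    `archive` may be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            if name.lower().endswith(suffix):
                # zf.read() decompresses a single member into memory.
                yield name, zf.read(name)
```

Each yielded byte string can then be decoded with the image library of your choice by wrapping it in `io.BytesIO`, so only the images in the current batch need to exist in decompressed form.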

However, all of that was too ambitious and left us with little time to train the actual model. Thanks to Marcin’s work we managed to train a segmentation model that scored quite well, but we didn’t succeed in adding the classifier I had trained, which might have won us a medal. (The plan was to first classify whether an image contained any defect, and then segment it in a second pass with a segmentation network.)
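The classify-then-segment plan can be sketched roughly as follows. This is an illustration under assumptions rather than our competition code: `classify` and `segment` are hypothetical callables standing in for the two trained models, and both thresholds are placeholder values.

```python
import numpy as np


def two_stage_predict(image, classify, segment,
                      defect_threshold=0.5, mask_threshold=0.5, n_classes=4):
    """Run the segmentation model only on images the classifier flags.

    classify(image) -> probability that the image contains any defect
    segment(image)  -> array of shape (n_classes, H, W) with per-pixel probabilities
    Both callables are hypothetical stand-ins for trained models.
    """
    height, width = image.shape[:2]
    if classify(image) < defect_threshold:
        # Classifier says "no defect": return empty masks for every class.
        return np.zeros((n_classes, height, width), dtype=bool)
    probabilities = np.asarray(segment(image))
    return probabilities > mask_threshold
```

The appeal of this design is that images without defects get empty masks outright, so the segmentation network cannot produce false-positive pixels on them.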

The outcomes

Our score came close to a medal (14th percentile), but we fell short by 0.2 percentage points in Dice score. The spread of scores wasn’t large (our Dice score of 0.89672 compared to the 0.90883 achieved by the winners). Given that the first-place method also combined a classifier with a segmentation model, our approach was sound, but we lacked the time to fine-tune it and achieve a better result.

Lessons learnt

Despite the lack of apparent success on the leaderboard, and the dissatisfaction of feeling that we could have scored higher with two more days, we were happy to participate and learn from the experience. We now know a little more about how to operate in the Kaggle environment, how to design and train models there, and what to be wary of in such time-constrained conditions. We also realised that high coding standards can actually be a hindrance, especially if you are not building a model for a client but are at the research stage of a project.

Conclusions

As mentioned at the start, for most of us this was our first data science competition. The pressure was there, especially considering we had less than half the time available to the rest of the participants (we joined in the last month of an almost three-month competition).

We are used to working in a production-oriented way, so even at the research stage we usually code with later usability in mind and, as I’ve mentioned, that isn’t always an advantage when time is short. The good thing was that team spirit remained high until the end, and most of the frustration we experienced came from various Kaggle quirks we weren’t used to rather than from poor communication or tension within the team.

Also, even though some of us hadn’t had a chance to work together on a project before, cooperation between team members was on point, often giving the right boost at the right time, even when things seemed dire. As for results and satisfaction, we collectively agreed that we were left wanting more. Now that we have experience with such competitions, we will probably try our hand at future ones, whether in-house or on our own.

In my opinion, we learned plenty and proved that even under time pressure we can still communicate and cooperate well, something that, to my mind, we often underestimate, since we get on well with each other on a daily basis. Everything we gained from the experience will help us deliver better results in the future.