A Commentary on the Abstraction and Reasoning Challenge — Kaggle Competition

Mehran Kazeminia
Published in The Startup · Aug 16, 2020

This competition was hosted by François Chollet.

Abstraction and Reasoning Challenge

This report has been prepared by Somayeh Gholami and Mehran Kazeminia.

Currently, machine learning techniques can only exploit patterns they have already seen: a model is first given a fixed structure and is then exposed to relevant data so that it can learn new skills. But could machines one day answer reasoning questions they have never seen before, the way humans do? Could they learn complex, abstract tasks from just a few examples? This was exactly the theme of the Abstraction and Reasoning Challenge, which ended recently and is one of Kaggle’s most controversial competitions. Participants were asked to develop, within three months, an artificial intelligence that can solve reasoning questions it has never seen before. Introducing the contest, Kaggle wrote:

“It provides a glimpse of a future where AI could quickly learn to solve new problems on its own. The Kaggle Abstraction and Reasoning Challenge invites you to try your hand at bringing this future into the present!”

The reasoning questions of this challenge were like human intelligence tests and included simple, medium, and occasionally rather difficult questions. An ordinary person could answer all of them within a reasonable time, and none were extremely complex. The real challenge was how to teach machines all the reasoning concepts involved — changing colors, resizing, reordering, and so on — so that they could pass a human intelligence test they had never seen before.

The total prize was twenty thousand dollars, divided among the first three teams. But, as you might guess, even the results at the top of the list were not promising. Nearly a thousand participants took part, and half of them did not answer a single question correctly. A team whose algorithm solved nothing received a score of 1.00, while a team that answered a few questions correctly received, for example, a score of 0.98, and so on — lower is better. In the end, only twelve teams managed to score below 0.90. The final scores of the top thirty teams can be seen on the competition leaderboard.
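For readers unfamiliar with Kaggle’s scoring, here is a minimal sketch of the metric as we understand it: each test output allows up to three guesses, an output counts as solved only if one guess matches it exactly, and the leaderboard score is the average error over all test outputs.

```python
# Sketch of the top-3 error metric (our own reconstruction, not official code).
# A grid is a 2-D list of integers; 1.0 means nothing solved, 0.0 means perfect.

def task_error(candidates, truth):
    """Return 0 if any of the (up to three) candidate grids equals the truth grid, else 1."""
    return 0 if any(c == truth for c in candidates[:3]) else 1

def leaderboard_score(all_candidates, all_truths):
    """Average the per-output errors over the whole hidden test set."""
    errors = [task_error(c, t) for c, t in zip(all_candidates, all_truths)]
    return sum(errors) / len(errors)

# Example: solving 2 outputs out of 100 gives a score of 0.98.
```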

This was not a classification challenge: every answer had to be produced as a picture (a matrix) rather than selected from a set of visual options, which made the competition considerably harder. Perhaps for this reason, those who thought they could train machines with conventional, classical methods alone, or who hoped to advance by guesswork, were thoroughly disappointed. Of course, some participants hand-crafted solutions for the instances with simpler answers, but these were exceptions; at best they solved only a handful of tasks and did not achieve much.
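To give a feel for what those “pictures” look like in practice, here is a small sketch of how a task is stored and loaded. Each ARC task is a JSON file with a few demonstration pairs and one or more test inputs; every grid is a 2-D list of integers from 0 to 9, one integer per color. The file name below is purely illustrative.

```python
import json

# Load one ARC task (the path and file name are placeholders).
with open("data/training/0a1b2c3d.json") as f:
    task = json.load(f)

# "train" holds the demonstration pairs; "test" holds the inputs to solve.
for pair in task["train"]:
    demo_input, demo_output = pair["input"], pair["output"]
    print(len(demo_input), "x", len(demo_input[0]), "->",
          len(demo_output), "x", len(demo_output[0]))

# The goal is to predict the output grid for each entry in task["test"],
# using only the demonstration pairs — the hidden test outputs are never shown.
```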

Although the ingenuity and effort of the winners and participants are admirable, a glance at the scoreboard suggests that we are still far from a final answer, and there is no guarantee that the best possible approach was even among those tried. Nevertheless, the winners have generously described their creative methods in the links below, and some of them have shared their complete code.

List of gold medal solutions shared:

1st place solution by icecuber

2nd place solution by Alejandro de Miquel

3rd place solution by Vlad Golubev

3rd place solution by Ilia

5th place solution by alijs

6th place solution by Zoltan

8th place solution by Andy Penrose

8th place solution by Maciej Sypetkowski

8th place solution by Jan Bre

9th place solution by Hieu Phung

10th place solution by Alexander Fritzler

If you are interested in this topic, you can find a great deal of information about the challenge on the Kaggle website as well as on François Chollet’s GitHub. And if you want to take the initiative and try your own approach, we have a few tips for you. To get started, first study François Chollet’s 64-page paper on measuring intelligence:

On the Measure of Intelligence | François Chollet

You can also visit the Discussion and Notebooks sections of the challenge on the Kaggle website and read the recommendations of the host, the winners, and the other participants directly. Finally, here are some key tips from François Chollet:

How to get started?

fchollet — Competition Host:

If you don’t know how to get started, I would suggest the following template:

Take a bunch of tasks from the training or evaluation set — around 10.
For each task, write by hand a simple program that solves it. It doesn’t matter what programming language you use — pick what you’re comfortable with.
Now, look at your programs, and ponder the following:
1) Could they be expressed more naturally in a different medium (what we call a DSL, a domain-specific language)?
2) What would a search process that outputs such programs look like (regardless of conditioning the search on the task data)?
3) How could you simplify this search by conditioning it on the task data?
4) Once you have a set of generated candidates for a solution program, how do you pick the one most likely to generalize?

You will not find tutorials online on how to do any of this. The best you can do is read past literature on program synthesis, which will help with step 3). But even that may not be that useful :)

This challenge is something new. You are expected to think on your own and come up with novel, creative ideas. It’s what’s fun about it!
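To make this template a little more concrete, here is a toy sketch of our own — the “mirror the grid horizontally” task, the three DSL primitives, and all helper names are hypothetical, not taken from the competition. It shows a program written by hand for one task, and a brute-force search over a tiny DSL that is conditioned on the task data simply by keeping only the programs that reproduce every demonstration pair.

```python
from itertools import product

# Three toy DSL primitives (illustrative only).
def identity(grid):         return [row[:] for row in grid]
def flip_horizontal(grid):  return [row[::-1] for row in grid]
def flip_vertical(grid):    return [row[:] for row in grid[::-1]]

PRIMITIVES = [identity, flip_horizontal, flip_vertical]

def hand_written_solver(grid):
    # Step 1 of the template: a program written by hand for one specific
    # (hypothetical) task, "mirror the grid left-to-right".
    return flip_horizontal(grid)

def search_programs(demos, max_length=2):
    # Steps 2-3: enumerate short compositions of primitives and keep only
    # those that reproduce every demonstration pair.
    solutions = []
    for length in range(1, max_length + 1):
        for ops in product(PRIMITIVES, repeat=length):
            def program(grid, ops=ops):
                for op in ops:
                    grid = op(grid)
                return grid
            if all(program(d["input"]) == d["output"] for d in demos):
                solutions.append(program)
    return solutions  # Step 4 — picking the candidate most likely to generalize — is the hard part.

# One toy demonstration pair for the hypothetical "mirror" task.
demos = [{"input": [[1, 0], [2, 3]], "output": [[0, 1], [3, 2]]}]
programs = search_programs(demos)
print(len(programs), "candidate programs found")
print(programs[0]([[5, 6, 7]]))   # -> [[7, 6, 5]]
```

Real solutions used far richer DSLs and much cleverer search, but the overall shape — primitives, search, and filtering against the demonstration pairs — is the same.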

Does hard-coding rules disqualify?

fchollet — Competition Host:

You can hard-code rules & knowledge, and you can use external data

Can we “probe” the leaderboard to get information about the test set?

fchollet — Competition Host:

Using your LB score as feedback to guess the exact contents of the test set is against the spirit of the competition. In fact, it is against the spirit of every Kaggle competition. The goal of the competition is to create an algo that will turn the demonstration pairs of a task into a program that solves the task — not to reverse-engineer the private test set.

Further, this is a waste of your time. It is extremely unlikely that you would be able to guess an exact output or an exact task. This is why we decided not to have a separate public and private leaderboard: probing is simply not going to work.

That is because:
1) test tasks have no exact overlap with training and eval tasks (although they look “similar” in the sense that they’re the same kind of puzzle, built on top of Core Knowledge systems)
2) the space of all possible ARC tasks is very large, and very diverse.

So you’re not going to get a hit by either trying everything found in the train and eval set, or by just randomly guessing new tasks. You would have better luck trying to guess the exact melodies of the top 100 pop songs of 2021.

Is the level of difficulty similar in evaluation set and test set?

fchollet — Competition Host:

The difficulty level of the evaluation set and test set are about the same. Both are more difficult than the training set. That is because the training set deliberately contains elementary tasks meant to serve as Core Knowledge concept demonstration.

Can we use data from both the training and evaluation sets in our solutions?

fchollet — Competition Host:

I would recommend only using data from the training set to develop your algorithm. Using data from both the training set and evaluation set isn’t at all against the rules, so you could do it, but it would be bad practice, since it would prevent you from accurately evaluating your algorithms.

The goal of this competition is to develop an algorithm that can make sense of tasks it has never seen before. You’ll want to be able to check how well your algorithm performs before submitting it. For this purpose, you need a set of tasks that your algorithm has never seen, and further, that you have never seen. That’s the evaluation set. So don’t leak too much information from the evaluation set into your algorithm, or you won’t be able to evaluate it.
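A minimal sketch of that workflow (the directory paths and the `solve` function are placeholders for your own code): develop `solve` against the training tasks only, then measure it once on the held-out evaluation tasks using the same top-3 rule as the leaderboard.

```python
import json
from pathlib import Path

def top3_accuracy(solve, task_dir):
    """Fraction of test outputs for which one of up to three guesses is exact.

    `solve` is your own algorithm: it takes the demonstration pairs and a test
    input grid, and returns a list of up to three candidate output grids.
    """
    solved = total = 0
    for path in Path(task_dir).glob("*.json"):
        task = json.loads(path.read_text())
        for test_pair in task["test"]:
            guesses = solve(task["train"], test_pair["input"])[:3]
            solved += any(g == test_pair["output"] for g in guesses)
            total += 1
    return solved / total

# Tune on the training set, then score once on the evaluation set:
# print(top3_accuracy(my_solver, "data/training"))
# print(top3_accuracy(my_solver, "data/evaluation"))
```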

Note that the “test” set is a placeholder (copied from the evaluation set) for you to check that your submission is working as intended. The real test set used for the leaderboard is fully private.


So everything is ready.
Have a coffee and get started.

Good luck.
Somayeh Gholami & Mehran Kazeminia
