2020: An AI year in review
The four most important things in AI in 2020
People are already calling 2020 “The Great Pause”, because of the way the world stopped in response to the pandemic.
However, did it really stop for AI?
For AI, this year was marked by many important developments (not all of them successes) that will have a massive impact on the future of the field. More exciting than the breakthroughs themselves is that these events raised more questions than they answered, and it will take years of research to settle them.
Today we are going to talk about: GPT-3, testing for COVID-19 from cough recordings, bias in ML, and AlphaFold2.
GPT-3: Breakthrough or hype?
In February 2019, OpenAI made headlines with GPT-2, a language model so effective that the company initially considered it too dangerous to release in full.
The reason we are still talking about it in 2020 is that in June the company released a new version, called GPT-3. The model is huge: it has 175 billion parameters, compared to the 1.5 billion of the previous version.
A single training run of the model is estimated to cost around 4.6 million dollars and 355 GPU-years. The total cost, including hyperparameter tuning and other runs, is not available, but third parties have estimated it at between 12 and 27 million dollars.
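As a rough sanity check, the two figures for a single training run can be combined into an implied price per GPU-hour. This is only a back-of-the-envelope sketch: it assumes a 365-day year and that the whole cost is GPU time, and the variable names are purely illustrative.

```python
# Back-of-the-envelope check of the single-run figures quoted above.
# Assumptions: a 365-day year, and the full cost is spent on GPU time.
TRAINING_COST_USD = 4.6e6  # cost of one training run, from the text
GPU_YEARS = 355            # compute used by that run, from the text

HOURS_PER_YEAR = 365 * 24
gpu_hours = GPU_YEARS * HOURS_PER_YEAR
cost_per_gpu_hour = TRAINING_COST_USD / gpu_hours

print(f"GPU-hours: {gpu_hours:,}")
print(f"Implied cost per GPU-hour: ${cost_per_gpu_hour:.2f}")
```

Under these assumptions the run corresponds to about 3.1 million GPU-hours, at roughly 1.5 dollars per GPU-hour.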
GPT-3 is a language model, meaning that it is trained to generate text sequentially, predicting the most likely next word given the preceding text. The quality of the output is outstanding: it can generate paragraphs that are sometimes hard to distinguish from text written by humans.
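To make the idea of "predicting the next word" concrete, here is a toy sketch. It is not how GPT-3 works internally (GPT-3 is a large neural network that looks at a long context); it is a simple word-pair counter, with a made-up corpus and hypothetical function names, that only illustrates the generate-one-word-at-a-time loop.

```python
from collections import Counter, defaultdict

def train_bigram_model(text):
    """Count, for every word, which words follow it in the training text."""
    words = text.split()
    followers = defaultdict(Counter)
    for current_word, next_word in zip(words, words[1:]):
        followers[current_word][next_word] += 1
    return followers

def generate(model, prompt, max_new_words=5):
    """Greedily append the most frequent follower of the last word."""
    words = prompt.split()
    for _ in range(max_new_words):
        candidates = model.get(words[-1])
        if not candidates:  # the last word was never followed by anything
            break
        next_word, _count = candidates.most_common(1)[0]
        words.append(next_word)
    return " ".join(words)

# Tiny made-up corpus, for illustration only.
corpus = "the cat sat on the mat and the cat sat on the sofa"
model = train_bigram_model(corpus)
print(generate(model, "the", max_new_words=3))  # the cat sat on
```

A real language model replaces the counting table with a neural network and usually samples from the predicted distribution instead of always taking the top word, but the outer loop is the same.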
However, the most interesting applications popping up go beyond the original purpose of the model. For example, it’s being used to design webpages from natural language descriptions.
Or as a way to extract structured information from unstructured pages, considerably more quickly and reliably than with regular expressions or other alternatives.
Keep in mind that many of these examples are cherry-picked, and GPT-3 may not be as robust as it seems. In fact, GPT-3 is not really meant for problems that need a specific, factual answer. For example, here is an Excel plugin that automatically fills cells.
Looks great, right? Well, in the video you will notice that for Alaska, GPT-3 gives a population of 600 thousand people and a foundation date of 1906, while the true answers are 800 thousand and 1959.
What can we learn from GPT-3?
Still, GPT-3 has huge potential as a way to bootstrap content that would take a long time to produce (for example a translation or a mockup of a website), which a human then reviews and refines. More importantly, it is bringing up questions that are essential for the progress of the machine learning community.
Does GPT-3 really understand language, or is it “just” an incredibly complex mixture of pattern matching and memory? Will GPT-4 be even better? Will “priming” GPT-3 be the new software engineering job? How much cherry-picking is in the examples that we see? Will future models be able to integrate other types of media, like audio or images, or even perception types like gravity?
A COVID-19 test through your phone
AI is entering the world of medicine to make diagnosis and treatment cheaper. One promising example is drug repurposing. Instead of developing a new drug, ML models are built to discover whether an existing drug, developed for a different purpose, can be used to treat a certain disease. This would save a lot of money (developing drugs is incredibly expensive) and a lot of time (an approved drug has already been tested for safety).
Let’s look at another example, this time of AI making diagnosis cheaper and faster.
The paper “COVID-19 Artificial Intelligence Diagnosis Using Only Cough Recordings” was recently published in the IEEE Open Journal of Engineering in Medicine and Biology. The authors collected tens of thousands of cough recordings, together with surveys in which the recorded people stated whether they had symptoms and/or had tested positive for COVID-19.
In this dataset, 2660 people were positive at the moment of the recording and were used as positive labels, while another 2660 negative recordings were randomly sampled from the database.
Four biomarkers, originally designed in previous work to diagnose Alzheimer's disease, are then extracted from the recordings: muscular degradation, sentiment, vocal cord strength, and the structure of the lungs and respiratory tract. The last three were computed using three pre-trained ResNet50 models, a type of neural network. Finally, the biomarkers were used as features for another neural network classifier, trained to predict whether the recorded person is positive for COVID-19.
The results on the test set are promising: the model correctly predicted 98.5% of the positive people as such (also called sensitivity or recall) and 94.2% of the negative people as such (also called specificity).
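For readers less familiar with these two metrics, the sketch below shows how they are computed from the four cells of a confusion matrix. The counts are hypothetical, chosen only to give round percentages, and are not taken from the paper.

```python
def sensitivity(true_positives, false_negatives):
    """Fraction of truly positive people that the model flags as positive."""
    return true_positives / (true_positives + false_negatives)

def specificity(true_negatives, false_positives):
    """Fraction of truly negative people that the model flags as negative."""
    return true_negatives / (true_negatives + false_positives)

# Hypothetical test-set counts (NOT from the paper), 200 people per class.
tp, fn = 197, 3    # positive people: correctly flagged vs. missed
tn, fp = 188, 12   # negative people: correctly cleared vs. wrongly flagged

print(f"Sensitivity: {sensitivity(tp, fn):.1%}")  # 98.5%
print(f"Specificity: {specificity(tn, fp):.1%}")  # 94.0%
```

Note that the two metrics are computed on disjoint groups of people, so a model can score very well on one and poorly on the other.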
Moreover, when considering only people without symptoms, the model reaches 100% sensitivity, meaning that there are no false negatives, and 83.2% specificity, meaning that the remaining 16.8% of negative people are false positives.
This last value is very important because it shows that the method is not yet ready for mass use. Take, for example, a country like Italy, with a population of around 60 million people. If everyone used an app based on this technology every day, around 16.8% of them, roughly 10 million people, would be wrongly flagged as positive in a single day. It would be impossible to run a molecular test for each of them.
This means that the model needs to be considerably improved, probably with more data on positive patients, before it can be approved by health organizations and used as a reliable test. However, this case shows how AI can potentially bring fast and cheap solutions in ways that were unimaginable before.
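The estimate can be reproduced with a few lines of arithmetic. This sketch makes a simplifying assumption that the text leaves implicit: everyone who uses the app is asymptomatic and actually negative, so the result is an upper bound.

```python
# Rough estimate of daily false positives if a whole country used the test.
# Simplifying assumption: every user is asymptomatic and truly negative.
population = 60_000_000           # approximate population of Italy
specificity_asymptomatic = 0.832  # reported for people without symptoms

false_positive_rate = 1 - specificity_asymptomatic
expected_false_positives = population * false_positive_rate

print(f"False positive rate: {false_positive_rate:.1%}")
print(f"Expected false positives per day: {expected_false_positives:,.0f}")
```

With these inputs the false positive rate is 16.8%, which gives about 10 million wrongly flagged people per day.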
PULSE: a signal of bias in ML
In June 2020, a single image shook the ML community.
The face on the right was generated from a low-resolution version of a very well-known photo of Barack Obama, on the left. The model used is called PULSE, and it’s based on a version of StyleGAN trained on the Flickr-Faces-HQ (FFHQ) dataset of faces. FFHQ is more diverse than other commonly used datasets like CelebA, which is composed of faces of famous people, with a predominance of white people.
Despite the more diverse dataset, this example still showed racial bias, with many low-resolution images of non-white people being reconstructed as white. The animated discussion that it sparked highlighted that data can play an important role as a source of bias, but it is not the only one.
The sources of bias
Other relevant sources arise from the many choices that an ML researcher or engineer must make during the whole process, from data collection to production.
The most obvious one is the features chosen or computed through feature engineering. Deep Learning claims to greatly reduce this bias because the features are no longer hand-crafted but automatically generated from raw data by the model.
However, DL can still be subject to other biases. One comes from the architecture of the model: if a model requires a huge amount of data, it may be impossible to obtain a dataset of that size that is balanced in every possible aspect. Another comes from the objective function: train a model using an L2 loss, which pulls predictions towards the average, and almost everyone will look white; train it on an L1 loss, which pulls predictions towards the median, and more people might look black.
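The link between the loss and the mean or median is a general statistical fact that can be checked on a toy example. The sketch below has nothing to do with how PULSE is actually trained: it uses made-up numbers and hypothetical function names, and simply searches for the single constant prediction that minimizes each loss.

```python
from statistics import mean, median

def best_constant(values, loss, candidates):
    """Return the candidate constant with the lowest total loss on values."""
    return min(candidates, key=lambda c: sum(loss(v, c) for v in values))

def l2_loss(target, prediction):
    return (target - prediction) ** 2

def l1_loss(target, prediction):
    return abs(target - prediction)

# Made-up, deliberately skewed data: one large value among many small ones.
values = [1, 1, 1, 1, 2, 3, 12]
candidates = [i / 100 for i in range(0, 1201)]  # grid from 0.00 to 12.00

print("Best constant under L2:", best_constant(values, l2_loss, candidates))
print("Mean of the data:      ", mean(values))
print("Best constant under L1:", best_constant(values, l1_loss, candidates))
print("Median of the data:    ", median(values))
```

On this data the L2 optimum lands on the mean (3) and the L1 optimum on the median (1), so the choice of loss alone changes which "typical" value the model is pulled towards.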
Is super-resolution useful?
Another interesting point came out of this discussion: faces are clearly not the best application for super-resolution techniques, because the reconstructed face is not the face of a real person, but a new face that only looks real. There are many applications where super-resolution is considerably more useful, like microscopy or satellite imagery.
However, useless as it is, this application added a little piece to the research on the presence of bias in ML. In other settings, this bias may be less evident and still result in harm for minorities.
If you want to learn more about this topic, I strongly suggest the blog post from Andrey Kurenkov, which has a lot more detail and valid lessons to be learned.
AlphaFold2: the “solution” for protein folding
Proteins are the molecules that carry out the most important functions inside cells, like signaling, reaction acceleration, and transport and storage of other molecules. They are sequences of amino acids and, while the sequence is still being produced, they fold into the shape with the lowest energy.
The big problem is that predicting the structure of a protein from its amino acid sequence is incredibly complex. It was estimated that, on average, a protein can fold into 10¹⁴³ possible conformations, according to what is called Levinthal’s paradox.
Every two years there is a challenge called CASP, Critical Assessment of Structure Prediction, in which many research groups compete to achieve the best accuracy in predicting the structures of a set of proteins from their amino acid sequences.
The metric used to compare the algorithms is called GDT_TS. I will not explain what it is, but in a recent blog post, Mohammed AlQuraishi explained that the scores can be roughly interpreted as:
- 20: corresponds to a random prediction
- 50: the general topology is right
- 70: the topology is accurate
- 90: the details are mostly right
After the previous edition, CASP13 in 2018, he predicted that we would reach a score of around 80 in 4 years, while he thought that a score of 90 would only be possible in 10 years.
Well, 2 years after this prediction, AlphaFold2, a Deep Learning algorithm developed by DeepMind, reached a score of 92.4!
Too good to be true?
As any sane data scientist would do, I immediately thought that the result was probably overhyped, and the news just didn’t mention a caveat that would show that the results are not as good as they seem.
However, while there is usually a considerable chunk of the research community that is skeptical, in this case, I could not find any negative opinion from anyone in the field of structural biology.
For example, is it possible that the score is a result of overfitting, with some proteins predicted perfectly while others have a really bad score? As Mohammed AlQuraishi shows, AlphaFold2 outperformed all the other competitors by a big margin on almost every protein of the challenge.
Another possible question is: was this particular edition of CASP easier than the others? The organizers showed that this edition was actually one of the hardest and, in any case, if a challenge were easy for AlphaFold2, it would have been easy for the other groups too.
Is protein folding solved?
The announcement from DeepMind and some of the CASP organizers stated that protein folding is basically solved. Is this true? Well, it depends on how you define the term “solved”.
If we want to be strict in the definitions, we are nowhere near solving protein folding. The breakthrough is about predicting the protein structure, but how the protein reaches that structure through folding is still an open problem.
Moreover, the AlphaFold2 model is not perfect: there are many corner cases in which it is not fully accurate, and it will take years before all of them are covered. But, as Mohammed AlQuraishi says, the problem is considered solved because it changed from being a research problem, for which you don’t know if a solution exists, to an engineering problem, for which you know that a solution exists but has not been reached yet, given the current resources.
It’s too early to predict what this achievement will bring, but it’s clear that, depending on the next moves from DeepMind, it will greatly accelerate progress in the field. Now that this part of the foundations is finished, we can start to build the higher floors of the structural biology tower and ask deeper and more complex questions, which will likely further boost the impact of the field on our everyday life.
We have seen only the four most popular events of 2020, spanning a wide range of applications, from NLP to structural biology. It was hard to make a choice because there have been many other interesting breakthroughs this year. A lot of them are having a massive impact on our lives, showing the maturity of the AI field and the opportunity that it provides.
Let me know other topics that shook your 2020!
This is a blog post published by the PoliMi Data Scientists community. We are a student association of Politecnico di Milano that organizes events and writes resources on Data Science and Machine Learning topics.
If you have suggestions or you want to come in contact with us, you can write to us on our Facebook page.