Artificial or Real?
Mariia Godgildieva, Kirill Grjaznov, Anastasiia Shalygina
Supervisors: Raul Vicente, Daniel Majoral
This is a study project for the Computational Neuroscience course at the University of Tartu.
With the development of neural networks, AI algorithms are experiencing a drastic rise in popularity, resulting in a media buzz. Every day we see articles about deepfakes, AI-generated news and self-driving cars that may provoke the question: is AI really taking over the world?
The goal of our project is to investigate how well people are able to distinguish between real and artificially generated data, and which factors play a key role in that ability.
We decided to conduct our research using questionnaires in which people have to distinguish images, texts and sounds generated by neural networks from real ones.
We found several questionnaires (with scientific and/or entertainment purposes) that propose similar tasks. However, all of them keep their tasks within a single modality. For example, the “Which Face is Real?” project at the University of Washington asks respondents to choose between real and generated human faces. Its results do not seem to have been published yet.
Google offers a comparison of real and generated (Tacotron 2) speech. However, it is more of an entertainment quiz, as it does not seem that answers are collected.
The website Lawsuite.org ran an experiment with text data, though with a specific kind of text: real Donald Trump speeches and fake ones generated by RoboTrump. The experiment had 1000 respondents, who on average guessed only 40% of the answers correctly.
Our main method is quantitative research. We needed to gather data and analyze it statistically to see how well people can distinguish between generated and real data of different kinds.
With this goal in mind, we created 3 questionnaires, one for each modality (images, sounds and texts). Each questionnaire had the same demographic questions (plus some modality-specific ones). We separated the image, sound and text tasks to make the questionnaires easier for the respondents to fill out.
The questionnaires have 2 main types of questions: “is it real?” and “which of these is/are real?”. With this distinction we aim to analyze 2 situations separately: the respondent is presented with a single example, or they have to compare several examples, some of which are stated to be generated by AI. Our preliminary assumption was that the comparison questions would get more correct answers.
As for answers, we mostly offered 3 degrees of certainty: definitely, probably, no idea.
We started each questionnaire with questions gathering information about the respondents themselves. We did not ask for any private data (like names or e-mail addresses). Primarily, we were interested in information that could at least possibly correlate with the overall results. In our opinion, education level and previous experience with AI could be such factors. For texts and sounds, we also asked for English level, and for sounds, whether the respondent has any musical experience; we assumed that these factors could affect the results for those specific tasks. We also asked for some general information like gender, age and country to see how varied our respondents were.
The images which we used can be separated into two categories — portraits of people and modern art paintings.
Images of people were taken from the whichfaceisreal.com website, which presents pairs of real and artificial images: a real one from the Flickr-Faces-HQ Dataset (FFHQ) and a synthetic one generated by StyleGAN, the open-source generative adversarial network introduced by Nvidia researchers in December 2018.
Artificially generated modern art paintings were taken from the aican.io website. The images of real modern art pieces were taken from several websites selling modern and abstract paintings (like osnatfineart.com).
Since we wanted to restrict the time a participant can look at a picture to avoid bias, we transformed the images into GIF files and showed each image for only 10 seconds. Overall, we made 16 questions in the image section: 4 where the participant has to decide whether the person in the image is real or not, 4 where the respondent has to compare a portrait of a real person with a generated one and decide which person really exists, and 8 similar tasks for modern art images (4 simple ones and 4 comparison tasks).
For text generation we used the OpenAI GPT-2 model. We took a one-sentence prompt from a human-written text and asked the model to continue it; it usually outputs 1–3 full sentences. We chose the generated texts carefully, looking for ones that would not give any obvious clues, so we did not pick generated texts that were obviously senseless and random. Moreover, we omitted those that contained factual errors (wrong dates or names, for example). We did not do any post-processing on the chosen texts (punctuation and spelling were preserved in full).
The prompts were taken from different sources. To complicate the task, we did not choose any well-known texts and, honestly, tried to look for some weird ones. The list of our sources includes theonion.com (satirical news media), horoscope.com (monthly horoscopes), newsinlevels.com (news written for language learners, so texts may look slightly off to native or proficient speakers), englishclub.com and learnenglish.britishcouncil.org (short stories also adapted for language learners).
Overall, we made 8 text tasks (4 simple, 4 comparison). As text length may also have an impact on the results, we used texts of different lengths while keeping them relatively short (a 1-sentence prompt plus 1–2 sentences of continuation).
We made sound tasks of two types: English speech and piano music. The speech was generated by a machine learning system named MelNet; we used four short speech tracks taken from an article on theverge.com. The piano sounds were made by the WaveNet neural network; we took 3 generated pieces of piano music from the article “WaveNet: A generative model for raw audio”. Overall, we made 4 tasks on speech and 4 on piano music, of which only one piano piece is real.
Since we used Google Forms for our questionnaires, we could include the speech and music tracks only as links to Google Drive, where the files were stored. Because of this limitation, and because listening to a sound takes more time than, for example, comparing images, completing the sounds questionnaire would have taken too long. Thus, we decided to omit the comparison questions for the sound tasks to make it shorter.
The respondents receive feedback with their overall score after submitting the questionnaire. They get 1 point for a task if they answer “Definitely [correct answer]” or “Probably [correct answer]”. For texts and sounds, they receive the correct answers at the end of the questionnaire. For images, we give the correct answers at the end and also after each task (since the tasks use short GIFs, which are hard to review in the general feedback, we decided to put static images with the correct answer after every task).
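As a minimal sketch of this scoring rule (the answer labels and example tasks below are our own illustration, not the exact form data), the point assignment could be implemented as follows:

```python
# Sketch of the scoring rule described above: a respondent earns
# 1 point when their answer starts with "Definitely" or "Probably"
# AND names the correct label; "No idea" and wrong guesses earn 0.

def score_answer(answer: str, correct: str) -> int:
    """Return 1 if the (possibly hedged) answer matches the correct label."""
    answer = answer.strip()
    if answer == "No idea":
        return 0
    for prefix in ("Definitely ", "Probably "):
        if answer.startswith(prefix):
            return int(answer[len(prefix):].lower() == correct.lower())
    return 0

def total_score(answers, correct_answers):
    """Sum the points over all tasks in one questionnaire."""
    return sum(score_answer(a, c) for a, c in zip(answers, correct_answers))

# Hypothetical example: 3 tasks, two hedged correct answers and one "No idea".
print(total_score(
    ["Definitely real", "Probably generated", "No idea"],
    ["real", "generated", "real"],
))  # -> 2
```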
We distributed the questionnaires through social networks and university media. The links were posted on various student group pages on Facebook and later sent out through the ICS mailing list. Below you can see how fast we got our responses. The first main peak is when we posted the links on Facebook; the peak at the end reflects responses from the mailing list.
As the questionnaire was separated into 3 parts, the domains got different numbers of responses. The most popular was images (204 responses), then texts (90 responses), then sounds (64 responses). We expected a lower number of responses for the sound tasks, as they require listening to speech/music. The skew between texts and images may be due to the fact that the link to the images questionnaire was always posted first, so respondents might have completed only the first one; alternatively, some respondents may have been interested only in the images questionnaire, since we are generally more used to perceiving visual information.
The demography of responses we got was quite varied.
Overall, we got respondents from 40 different countries, with Russia and Estonia being the dominant ones, as our team members are originally from those two countries.
Most of the respondents had a university degree, with the majority holding a master’s.
More than 70% of the respondents had not worked on or studied AI before and had no experience with similar kinds of tests.
A perfect score was achieved only for images.
The other 2 questionnaires turned out to be harder than expected. As can be seen, the minimal score for images is 6, but for sounds and texts it is 0. Moreover, for sounds the maximal score is 6 out of 8, while for texts it is 7 out of 8, from which we can conclude that the sounds questionnaire was the hardest. We expected such an outcome based on our personal experience with the questionnaires.
However, among people with a perfect score, there does not seem to be any correlation between score and education, knowledge of AI, age or having passed similar tests.
Since participants get immediate feedback in the images questionnaire, we were slightly afraid that it would affect their further answers. The plot below shows that after the first question there is a gradual growth in score for questions 2–8, but for questions 9–16 there is no such dependency. It should be mentioned that q1–q8 are questions with images of people and q9–q16 are questions with modern art images. Thus, it seems that respondents learned how to distinguish between real and generated images of people but could not do the same for the modern art part, so our expectations were partly correct.
1. Simple vs Comparison
We compared the average scores and score distributions for all simple questions (only 1 picture or text) and all comparison ones (2 pictures/texts: 1 real, 1 generated).
The results turned out to depend on the domain. For images, the average score for comparison questions is higher, while for texts the difference was much smaller, with the simple questions getting a slightly better average score.
A possible explanation may be that images, as visual media, are easy to compare side by side, while texts have to be read one after the other.
Another reason is that the text questions were specifically created to be confusing. Some comparison questions contained real-world facts generated by the AI (like town names or numbers) that may have given respondents the wrong cue.
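The kind of group comparison used throughout this analysis can be sketched as follows; the per-respondent scores here are made up for illustration and are not our actual survey data:

```python
# Sketch of comparing average scores between question types.
# Each list holds one hypothetical respondent's number of correct
# answers on the 4 simple vs. the 4 comparison questions.
from statistics import mean, stdev

simple_scores = [2, 3, 1, 2, 3, 2]       # hypothetical data
comparison_scores = [3, 4, 3, 2, 4, 3]   # hypothetical data

def summarize(name, scores):
    """Print the mean and standard deviation for one question type."""
    print(f"{name}: mean={mean(scores):.2f}, sd={stdev(scores):.2f}")

summarize("simple", simple_scores)
summarize("comparison", comparison_scores)
# A positive difference means the comparison questions were answered better.
print("difference of means:", mean(comparison_scores) - mean(simple_scores))
```

The same pattern (split respondents into groups, compare group means) applies to the demographic factors discussed below.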
2. AI knowledge and Education level
Our initial hypothesis was that previous experience with AI and education level might affect the final score. In particular, we expected a correlation between experience with AI and performance on the image tasks, because respondents who have worked with neural networks might know in which parts of an image (like people’s eyes or the background) neural networks usually perform poorly, and would probably be able to notice such cues. However, when we compared the average scores, the results showed no significant difference for images and texts, although participants who had worked with AI were better at the sound tasks.
For education level, we can observe a slightly higher average score for texts and sounds, but the difference is not significant.
3. English level
We asked respondents for their English level, as it seemed to us an important factor for understanding text and speech.
Most of the answers were either “Proficient/Advanced” or “Upper-Intermediate/Intermediate”. Few answered “Native” or “Elementary/Beginner”, so we merged them into the neighbouring categories, resulting in 2 groups: more advanced and less advanced.
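As an illustration, collapsing the four levels into two groups is a simple mapping (the label strings below mirror our answer options; the responses are hypothetical):

```python
# Collapse the four self-reported English levels into two groups,
# as described above; the label strings are illustrative.
LEVEL_GROUP = {
    "Native": "more advanced",
    "Proficient/Advanced": "more advanced",
    "Upper-Intermediate/Intermediate": "less advanced",
    "Elementary/Beginner": "less advanced",
}

# Hypothetical responses, not real survey data.
responses = ["Native", "Upper-Intermediate/Intermediate", "Proficient/Advanced"]
groups = [LEVEL_GROUP[r] for r in responses]
print(groups)  # -> ['more advanced', 'less advanced', 'more advanced']
```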
Judging by the results for texts, the average score does not depend on the English level. Respondents with more advanced English even got a slightly lower average score than the others.
However, in the case of the speech tasks, respondents with a higher level of English performed better. The difference in score does not seem to be very significant, but we have a hypothesis as to why English level may matter more for the sound tasks. Understanding the meaning of a text does not require as much experience as a listening task does: a listening task is not only about the general meaning but also about judging whether the accent and fluency of the speech sound human.
4. Music knowledge
In the sounds questionnaire, we asked respondents to indicate their experience with music. More than 50% of the participants had not studied music, about 10% had studied it professionally, and the others were amateurs.
We expected that those who studied music professionally would perform better on the music tasks but, surprisingly, there is no clear correlation between music level and the participants’ performance. Moreover, the people who claimed to be music professionals even got the lowest scores. However, as this was the smallest group, we believe the results cannot be fully representative. Another issue is that we did not draw a clear distinction between the 3 categories, so we have no idea what people considered “Professional” (does it mean working as a musician, or graduating from a music school?).
5. Faces vs Modern art
In the images questionnaire, we had 2 types of questions: questions with images of people and questions with modern art paintings. We compared the score distributions and the average scores for these two types, and it turned out that people performed better on the modern art images.
We are not sure why the score for art turned out to be higher. Our initial hypothesis was that the generated images are blurrier than the real ones and that this could be noticed. However, that would mean respondents would pick up on this feature, and the last questions would get higher scores. But, recalling the plot above, the “learning” effect was not present for the art part.
When we were preparing the questionnaires, we thought about adding a question about respondents’ knowledge of art. However, we discarded this idea, as the number of experienced respondents would not be high. Moreover, the real art images were taken from quite obscure sources (i.e. not museums) and were probably not famous anyway.
So, in the end, this is still an open question for us. We are not sure whether this is a coincidence, an interesting finding or a direct result of the question structure.
The results we got seem to be quite interesting and even surprising.
Images turned out to be the easiest domain of the three, with respondents achieving the maximum score and the minimum score being 6. We partly expected such results for faces but were surprised to learn that the art part got an even higher overall score.
We did not expect that the demographic factors (education, AI experience, English level, etc.) would play almost no role in the resulting score. We can even conclude that our ability to distinguish AI-generated content does not depend much on extra knowledge. However, we discovered that receiving feedback and correct answers to previous questions may affect further answers, and respondents can in some cases be “trained” by this feedback.
Another unexpected outcome is that the comparison questions were easier to answer only for images. We actually expected them to get higher scores for both images and texts. As mentioned above, this may be a result of our “overcomplication” of the texts. To get more unbiased results, any cues in the texts should have been removed.
Overall, we are quite happy with the work we have done. Although we did not receive as much data as we hoped, we strongly believe that the analysis results would not change much with more responses.
Questionnaires preparation & data analysis:
- Images — Anastasiia
- Texts — Mariia
- Sounds — Kirill
The blog post was written together and then formatted by Anastasiia.
Presentation slides and presentation — Mariia.
Raul Vicente and Daniel Majoral thank Madis Vasser for fruitful discussions in a previous version of the image questionnaire and survey.