An important matter can take just a few words to say.
Nowadays, it seems that stories in magazines, newspapers, and online publications always have far more to say than just a few words.
Researchers estimate that people are inundated with the equivalent of 100,000 words every day during their waking hours.
This is why, on a busy day, I believe you would be very happy to have someone give you a good summary of the news.
A homeless man who breaks the news better than I could.
I once met such a person. No one could summarize the content of a newspaper better than Daniel.
In my town, there is a well-known newspaper sold exclusively by homeless people as a source of income. This nonprofit project allows the seller to keep 80% of the daily revenue. I still remember Daniel, who sold those newspapers on my way to work.
Every morning, I would cross the road facing our office building. Daniel would stand next to the traffic lights, always in the same place. He would wave at me, then walk about 50 meters with me, from the traffic lights to the entrance of the building.
What happened during that 50-meter walk still blows my mind.
In just 30 seconds, Daniel would summarize the breaking news.
Using a few well-thought-out sentences, he would bring me up to date.
What is Text Summarization?
I don’t know if I was fascinated by Daniel's homelessness or by his extraordinary capability in text summarization.
One day, I asked Daniel how he summarized text so well. His answer was itself a summary: “I count words”.
Later at work, I was helping a client categorize a very large number of text documents with the help of AI tools. I realized that it is possible to create summaries automatically by counting words, just as Daniel cleverly did without any computer.
Daniel's method was simple yet very powerful: sentences containing the most frequently used words were considered significant. The top-ranked sentences made it into the summary. This is the so-called Extractive Text Summarization. The resulting summary consists of sentences taken from the original text.
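The word-counting idea can be sketched in a few lines of Python. This is only a minimal illustration, not Daniel's exact procedure or any production system: the tiny stop-word list, the naive sentence splitter, the averaging of word frequencies, and the function names are all assumptions made for the example.

```python
import re
from collections import Counter

# Small illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "on", "for", "was", "by"}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Lowercased words with stop words removed."""
    return [w for w in re.findall(r"[a-z']+", sentence.lower()) if w not in STOP_WORDS]

def summarize_by_word_count(text, top_n=2):
    sentences = split_sentences(text)
    # Count how often each content word occurs in the whole text.
    freq = Counter(w for s in sentences for w in tokenize(s))

    def score(sentence):
        # Average frequency of the words in the sentence.
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    # Keep the top-ranked sentences, restored to their original order.
    return [sentences[i] for i in sorted(ranked[:top_n])]

text = ("The council approved the new bridge. The bridge will cross the river. "
        "Critics say the bridge costs too much. A local bakery won a prize.")
print(summarize_by_word_count(text, top_n=2))
```

Averaging rather than summing the frequencies keeps long sentences from winning simply because they contain more words.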
Extractive text summarization often performs quite well. It is used by major news portals and publications. However, it might fail to organize sentences in a natural way.
A newer, state-of-the-art approach, which generates new sentences that best represent the whole original text, is called Abstractive Text Summarization.
Abstractive text summarization leverages state-of-the-art language models such as GPT-2 (OpenAI), BERT (Google), and T5 (Google) to generate paraphrased, human-like sentences. These language models have been trained on very large text corpora such as Wikipedia or BooksCorpus. GPT-2, for example, is trained to predict the next token in a sequence given the tokens that precede it. Although they do an amazing job of summarizing text, abstractive techniques commonly generate factually incorrect summaries.
Can we summarize Donald Trump's tweets?
If Daniel were given a different kind of text, such as tweets, what summary would he produce?
I look into this question to help readers grasp the potential of AI technology for text summarization.
New research published in Nature Communications claims to provide the first evidence-based analysis demonstrating that social media can be routinely deployed to divert attention away from a topic potentially harmful to a politician's reputation, helping to suppress related negative media coverage.
Former President Donald Trump was known for his high activity on Twitter until his account was permanently suspended on January 8, 2021.
According to Trump's Twitter Archive, his account tweeted a total of 12,234 times last year, producing 15,368 sentences. The data is freely available and can be analyzed.
The monthly distribution shows a decrease in activity during the last two months of the year, following a peak in October 2020 ahead of the presidential election.
I applied several AI algorithms for text summarization to condense Trump’s tweets into a few sentences for each month of 2020.
A summary of Trump's tweets in December 2020
The LexRank algorithm applies graph-based centrality scoring of sentences to identify sentences that are very similar to others. This gives us the following Top 3 sentences from Trump's tweets.
we have to save… thank you, go vote, georgia! the 75,000,000 great american patriots who voted for me, america first, and make america great again, will have a giant voice long into the future. they are a disgrace to the great people of georgia!
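The graph-based centrality idea behind LexRank can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used for the results above: it uses plain bag-of-words cosine similarity instead of TF-IDF weighting, and the function names, damping factor, and iteration count are chosen for the example.

```python
import math
import re
from collections import Counter

def bag_of_words(sentence):
    return Counter(re.findall(r"[a-z']+", sentence.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def lexrank_scores(sentences, damping=0.85, iterations=50):
    n = len(sentences)
    if n == 0:
        return []
    bows = [bag_of_words(s) for s in sentences]
    # Similarity graph: the edge weight is the cosine similarity of two sentences.
    sim = [[cosine(bows[i], bows[j]) if i != j else 0.0 for j in range(n)] for i in range(n)]
    row_sums = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    # PageRank-style power iteration: a sentence is central if central sentences resemble it.
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[i] * sim[i][j] / row_sums[i] for i in range(n) if row_sums[i] > 0)
            for j in range(n)
        ]
    return scores

sentences = [
    "go vote in georgia today",
    "georgia please go vote",
    "vote today in georgia",
    "the vaccines are being delivered",
]
for score, sentence in sorted(zip(lexrank_scores(sentences), sentences), reverse=True):
    print(round(score, 3), sentence)
```

In this toy input, the sentence that shares no words with the others receives the lowest score, because no edge in the graph points to it.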
Latent Semantic Analysis is another algorithm, which assumes that words that are close in meaning will occur in similar pieces of text. It extracts the following Top 3 semantically significant sentences.
i wonder when the water main is gonna burst in georgia…. democrats scrounging up votes from mystical places again…. get smart republicans. states want to correct their votes, which they now know were based on irregularities and fraud, plus corrupt process never received legislative approval. because of the trump administration, hospitals are now required, effective immediately, to publish their real prices, which will create competition and drive downs costs massively.
Luhn Summarization is one of the oldest algorithms. It selects sentences that contain significant words while ignoring common, unimportant English words. The following Top 3 sentences are returned.
i hope the democrats, and even more importantly, the weak and ineffective rino section of the republican party, are looking at the thousands of people pouring into d.c. they won’t stand for a landslide election victory to be stolen. before even discussing the massive corruption which took place in the 2020 election, which gives us far more votes than is necessary to win all of the swing states (only need three), it must be noted that the state legislatures were not in any way responsible for the massive…. the vaccines are being delivered to the states by the federal government far faster than they can be administered!
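A rough sketch of Luhn-style scoring is shown below. It assumes that a word is significant when it is not a stop word and occurs at least twice, and it scores each sentence by its densest cluster of significant words. The stop-word list, the `min_count` and `max_gap` thresholds, and the function names are illustrative assumptions rather than values taken from the analysis above.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are", "it", "on", "for", "be", "being", "you"}

def words(sentence):
    return re.findall(r"[a-z']+", sentence.lower())

def significant_words(sentences, min_count=2):
    """Non-stop words that occur at least min_count times in the text."""
    counts = Counter(w for s in sentences for w in words(s) if w not in STOP_WORDS)
    return {w for w, c in counts.items() if c >= min_count}

def luhn_score(sentence, significant, max_gap=4):
    """Best cluster score: (significant words in cluster)^2 / cluster length."""
    positions = [i for i, w in enumerate(words(sentence)) if w in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 > max_gap:
            # Too many insignificant words in between: close the current cluster.
            best = max(best, count ** 2 / (prev - start + 1))
            start, count = pos, 0
        count += 1
        prev = pos
    return max(best, count ** 2 / (prev - start + 1))

sentences = [
    "the vaccines are being delivered to the states",
    "the states need vaccines",
    "thank you",
]
significant = significant_words(sentences)
for s in sentences:
    print(round(luhn_score(s, significant), 3), s)
```

Squaring the number of significant words rewards sentences in which those words sit close together rather than being scattered across a long sentence.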
The KL-Sum algorithm uses a greedy optimization approach: it keeps adding sentences to the summary as long as the Kullback-Leibler divergence between the word distribution of the summary and that of the original text decreases. We get the following Top 3 sentences:
the steal is in the making in georgia. to all of those who have asked, i will not be going to the inauguration on january 20th. they are a disgrace to the great people of georgia!
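The greedy selection used by KL-Sum can be sketched like this. It is a minimal illustration under assumptions: the additive smoothing constant, the cap on the number of sentences, and the function names are hypothetical choices made for the example and not the settings behind the sentences above.

```python
import math
import re
from collections import Counter

def words(sentence):
    return re.findall(r"[a-z']+", sentence.lower())

def kl_divergence(doc_counts, summary_counts, smoothing=0.01):
    """KL(document || summary) over the document vocabulary, with additive smoothing."""
    vocab = list(doc_counts)
    doc_total = sum(doc_counts.values())
    summary_total = sum(summary_counts[w] for w in vocab) + smoothing * len(vocab)
    kl = 0.0
    for w in vocab:
        p = doc_counts[w] / doc_total
        q = (summary_counts[w] + smoothing) / summary_total
        kl += p * math.log(p / q)
    return kl

def kl_sum(sentences, max_sentences=3):
    doc_counts = Counter(w for s in sentences for w in words(s))
    chosen, summary_counts = [], Counter()
    best_kl = float("inf")
    while len(chosen) < max_sentences:
        # Try every remaining sentence and measure the divergence it would give.
        candidates = [
            (kl_divergence(doc_counts, summary_counts + Counter(words(sentences[i]))), i)
            for i in range(len(sentences)) if i not in chosen
        ]
        if not candidates:
            break
        kl, i = min(candidates)
        if kl >= best_kl:
            break  # the divergence stopped decreasing
        chosen.append(i)
        summary_counts += Counter(words(sentences[i]))
        best_kl = kl
    return [sentences[i] for i in sorted(chosen)]

sentences = ["georgia vote georgia", "georgia vote", "vaccines delivered"]
print(kl_sum(sentences, max_sentences=2))
```

Because the method matches word distributions, it tends to pick sentences that together cover the vocabulary of the text instead of repeating the same frequent words.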
Let's put the sentences extracted by all four algorithms together into a word cloud. We obtain a nice visual representation that highlights popular words in the summary.
A summary of Trump’s tweets from January till December 2020
We have just summarized tweets published in December. Now we can repeat the process for the whole year 2020.
The resulting summary has 144 sentences, which are represented by the word cloud below. It highlights the most significant words used by former President Trump last year: ‘democrat’, ‘impeachment’, ‘breaking’, ‘people’, ‘Schiff’, ‘going’, ‘day’, ‘abuse’, ‘will’, ‘president’.
Interestingly, if we skip summarization and create a word cloud from all 12,234 of Trump's original tweets, then we obtain the following result. The top 10 words with the highest frequency are now: ‘will’, ‘thank’, ‘president’, ‘people’, ‘now’, ‘great’, ‘democrat’, ‘vote’, ‘Trump’, ‘Joe Biden’.
It is exciting to see how text summarization brings into focus what was trending in Donald Trump's tweets in 2020. It captures sentences with words such as Schiff, impeachment, and abuse.
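The word ranking behind a word cloud boils down to counting word frequencies. The sketch below shows one way to compute such a top-N list. The stop-word list, the sample tweets, and the `top_words` helper are illustrative assumptions, and the code does not reproduce the preprocessing that produced the lists above (for example, it does not group 'Joe Biden' into a single term).

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "you"}

def top_words(texts, n=10):
    """Return the n most frequent non-stop words with their counts."""
    counts = Counter(
        w
        for text in texts
        for w in re.findall(r"[a-z']+", text.lower())
        if w not in STOP_WORDS
    )
    return counts.most_common(n)

tweets = ["Thank you Georgia", "Go vote Georgia", "Thank you, thank you"]
print(top_words(tweets, n=2))
```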
We have all felt overloaded by too much news at some point. In this article, we met Daniel, a homeless man who can perfectly recapitulate any news article in a few sentences.
No one can read it all. AI could summarize our daily feeds before we digest them. Daniel inspired me to apply AI techniques for extractive text summarization to tweets.
With President Biden now in office, we might want to go beyond tweets and reflect on the critical challenges ahead, including the devastation of the pandemic, continued racial inequality, worsening climate change, and strained international alliances.