Do You Speak Argot? Capturing Slang on Twitter with Expert.ai

A Social Media Data Mining Project for French

Silvia Ronchi
8 min read · Dec 15, 2021

Written in collaboration with Caterina Zapparoli

Image adapted by the authors

Where would you go to study slang when approaching a language? Social media can help you out. On these platforms, speakers make frequent use of expressions that are hardly ever found in books or essays. Take French as an example. Who has never heard of “argot” or “verlan”? French tweets can be precious sources to capture the widespread use of these types of slang, paired with other frequently used informal structures.

Here’s what came to our minds when we decided to start this project.

WHAT WE AIMED FOR

We used Natural Language Understanding to develop a scalable model that can help anyone do data mining by analyzing both the content and the language of a set of tweets. Why French? Because of the language features mentioned above, and to try to fill the gap in NLP literature concerning this language.

We began by collecting features that characterize communication on Twitter. These include retweets, hashtags, and mentions, as well as the distinctive use of language on the platform. To make our model more versatile and complete, we decided to extract information about the tweets’ content too.

For this project, we used a dataset we found on Kaggle containing about 7,000 French tweets focused on politics, particularly the 2018 Russian elections and the political situation in France. We thought this dataset was a perfect fit for our model due to the quality of the texts, the frequent use of informal expressions, and the wide variety of data it contains. Moreover, it’s one of the few French social media datasets that was not translated from English into French; using translated data would have dramatically biased our model.

WHAT KEPT US BUSY

As you can imagine, processing a dataset of tweets is not a piece of cake. The peculiar communication style of Twitter can pose a few problems for an NLP model.

First of all, tweets are made of very short character strings (maximum 280 characters each), so the machine is given very little context for the words and phrases it has to analyze.

Secondly, communication on social media is fast and informal, and therefore often marked by typos, grammar mistakes, and slang. This is a major challenge for any NLP technology, especially one programmed to recognize language in its standard use. For instance:

@EmmanuelMacron fermer les yeux sur la corruption ces ètre corrompu sois mème.

Not so easy for the machine to understand that “ces” here is not a demonstrative adjective but a misspelling of the impersonal expression “c’est”, or that “sois” is not a subjunctive form of the verb être but the pronoun “soi” with an extra “s” at the end, right?

And what about this tweet:

Hahah ils se moquent de nous lol
ils sont déjà limite en tout alors ça va là https://t.co/rHEDlEx1mm

How difficult could it be for the machine to know what “lol” or “Hahah” stand for?

Let’s see how we overcame these obstacles.

CHOOSING THE RIGHT APPROACH

First things first: what approach should we use? Considering the peculiarities of online conversation and the challenges we had to face, we opted for a symbolic approach, which relies on linguistic rules to manipulate natural language, producing rule-based NLP models.

Rules allow you to draw fully from your knowledge of the language and your recognition of its peculiar uses. The symbolic approach makes it possible to reproduce the way we, as humans, would naturally identify expressions like “hahah” or “lol”, customizing information extraction tasks in depth.

As for the extraction classes, we chose and worked on those that we thought would collect the most relevant information from the tweets’ content. We selected People, Locations, Institutions, Political Parties, Companies, Dates, URLs, Percentages, and Quotes. To pull out structures and expressions that are typical of French and social media, we also thought of Hashtags, Users, and Slang (emoticons, colloquialisms, abbreviations, idioms, as well as verlan and argot words).
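The Studio rules themselves are not reproduced here, but some of the Twitter-specific classes can be sketched with plain regular expressions. The Python patterns below are our own illustrative approximations of the Hashtags, Users, and URLs classes, not the project’s actual rules (the sample tweet’s hashtag is added for illustration):

```python
import re

# Illustrative regex approximations (ours, not expert.ai Studio rules) for
# three of the Twitter-specific extraction classes.
PATTERNS = {
    "Hashtags": re.compile(r"#\w+"),
    "Users": re.compile(r"@\w+"),
    "URLs": re.compile(r"https?://\S+"),
}

def extract(tweet: str) -> dict[str, list[str]]:
    """Collect every match for each extraction class found in the tweet."""
    return {name: rx.findall(tweet) for name, rx in PATTERNS.items()}

# "#France" is added here purely for demonstration purposes.
demo = "Hahah ils se moquent de nous lol @EmmanuelMacron #France https://t.co/rHEDlEx1mm"
print(extract(demo))
```

Classes like People or Political Parties, of course, need far more than a regex; that is where Studio’s semantic and morphological attributes come in.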

CREATING RULES WITH EXPERT.AI STUDIO

Expert.ai Studio is a development environment for creating extraction and categorization rules that leverage expert.ai NLU technology. This is where we could take advantage of the symbolic approach to its fullest.

We made use of all the different attributes of this environment to construct rules that would cover a wide range of cases. What helped us was the possibility to leverage semantics, morphology, named entity recognition, or even a combination of all these elements, without ruling out a keyword approach if convenient.

For example, we worked with keywords for those non-standard expressions like verlan words or abbreviations, gathering them in dedicated lists. Loading these lists into our extraction rules helped us capture all the occurrences of these items. Even though finding all the expressions we wanted to collect was a time-consuming task, the extraction process itself was pretty straightforward.
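As a rough Python sketch of this keyword-list approach, with a hypothetical five-word sample standing in for the project’s much longer Studio lists:

```python
import re

# A hypothetical five-word sample; the real Studio lists gather far more
# verlan words and abbreviations than this.
VERLAN = ["meuf", "relou", "chelou", "ouf", "vénère"]

# One case-insensitive pattern with word boundaries mimics how loading a
# keyword list into an extraction rule captures every occurrence.
verlan_rx = re.compile(
    r"\b(?:" + "|".join(map(re.escape, VERLAN)) + r")\b", re.IGNORECASE
)

def find_slang(text: str) -> list[str]:
    """Return every slang keyword occurring in the text."""
    return verlan_rx.findall(text)

print(find_slang("C'est chelou, ce mec est relou"))
```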

GIF by the authors

This was not the case for more complex colloquial phrases, which required highly structured rules. Take as an example the French expression “il y a” (there is/there are). French speakers often drop the subject, leaving “y a”, sometimes abbreviated to “ya” or “y’a”. A rule to extract this structure needs to be more complex and must take the morphological context into account too.

Here’s how we did it:

We told the system to extract the above-mentioned expression only when it is not preceded, or followed (in interrogatives), by its subject “il”. So the rule won’t trigger in the following example:

N'y a-t-il pas de moyens plus honorables ?

Whereas it will trigger here:

Ptdr y'a un mec qui vient de me parler en russe, 
j'lui ai dis que je parlais pas russe
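The logic of this rule can be approximated with a plain regular expression. The following Python sketch is our own analogue, not the actual Studio rule: the lookbehind and lookahead conditions mirror the “not preceded or followed by ‘il’” constraint described above.

```python
import re

# Our regex analogue of the rule (not actual Studio syntax): match the
# contracted forms "y a", "y'a", or "ya" only when the subject "il" does not
# come right before them, nor right after in the interrogative "y a-t-il".
ya_rx = re.compile(
    r"(?<!\bil )"     # not preceded by the subject "il"
    r"\b(y ?'?a)\b"   # "y a", "y'a", or "ya"
    r"(?!-t-il)",     # not followed by "-t-il"
    re.IGNORECASE,
)

for s in ("N'y a-t-il pas de moyens plus honorables ?",
          "Ptdr y'a un mec qui vient de me parler en russe"):
    print(bool(ya_rx.search(s)))  # False for the first, True for the second
```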

Looking again at that last tweet, you can see other interesting items. Expressions like “ptdr” or “j’lui” reflect French speakers’ tendency to abbreviate, especially on social media. Take “j’lui”: the subject pronoun “je” should be elided only before words starting with a vowel, but on social media this is commonly done before consonants too (“lui”). Such an expression cannot be captured by context rules like the one above. That’s why we decided to extract it through regular expressions, using a Studio attribute called PATTERN. This feature lets you match strings of characters that do not correspond to already existing lemmas or keywords. Here is how we constructed the rule:
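As a rough stand-in for the PATTERN attribute, a plain regular expression can capture this consonant-initial elision. This Python sketch is our own approximation, not the Studio rule itself; it deliberately excludes “h” and “y” so that standard elisions like “j’habite” or “j’y” are not flagged:

```python
import re

# Our assumption, written as a plain regex rather than Studio's PATTERN
# syntax: the elided pronoun "j'" immediately followed by a consonant-initial
# word, which standard French spelling would not allow. "h" and "y" are left
# out of the class because "j'habite" and "j'y" are standard forms.
jlui_rx = re.compile(r"\b[Jj]'(?=[bcdfgjklmnpqrstvwxz])\w+")

print(jlui_rx.findall("Ptdr y'a un mec, j'lui ai dis que je parlais pas russe"))  # → ["j'lui"]
```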

We adopted the same approach for Ptdr, which stands for “pété de rire” (LMAO in English). Depending on the writer, this abbreviation can have different forms, like:

XPTDRRRRRRRRRRRR https://t.co/mvUm7Bd1TQ

or:

Ptdrrrrrrr c'est Messi ça ??? https://t.co/DPfaJeRST4

A simple keyword would not cover all the possible inflections of this expression, so we opted for the following pattern:

[Xx]*[Pp][Tt][Dd]+[Rr]+
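To sanity-check the pattern, here it is applied in Python to the two tweets above:

```python
import re

# The pattern from the article, verbatim, checked against the two tweets.
ptdr_rx = re.compile(r"[Xx]*[Pp][Tt][Dd]+[Rr]+")

for text in ("XPTDRRRRRRRRRRRR", "Ptdrrrrrrr c'est Messi ça ???"):
    # Prints the matched form in full, extra letters included.
    print(ptdr_rx.search(text).group(0))
```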

Finally, other cases required the extraction of complete strings of text to capture longer colloquial expressions. Take a look at this tweet:

Quand je vois les mensonges de nos journaux sur la Syrie je me marre!

The writer used the informal expression “se marrer”, which means to have fun. To extract the entire clause (“je me marre”), we used a mix of different attributes:

  • For the verb, we used the LEMMA attribute, which let us catch all the inflected forms of “marrer”.
  • Since this verb is commonly used in its reflexive form, we wanted to make sure that the preceding pronoun would be extracted as well. To do so, we used the TYPE attribute, which takes advantage of Studio’s POS tagging function and connects each token of a text to its part of speech (in this case, PRO stands for pronoun).
  • Finally, since we wanted a complete string of text, we used ROLE to extract the verb’s subject, leveraging Studio’s sentence structure analysis.

These three elements form a single output thanks to the SEQUENCE transformation option, which returns all the items included in the rule’s sequence.

This is only a partial picture of how we worked and of the functions we used. For those who are curious, there are many more things that can be done and many more attributes to discover.

WHAT’S NEXT

The language of social media evolves constantly and at a fast pace, and, tweet after tweet, users help us keep track of it. The amount of data and linguistic patterns concealed in tweets is striking. Despite the remarkable number of challenges to overcome, pairing NLU with a symbolic approach proved effective for this use case. By adopting this method, we turned the challenges of language variety, register, and inaccuracy from obstacles into opportunities to focus on the linguistic aspects of online communication.

We uploaded our expert.ai Studio project and model at this link. The code is fully customizable: you’re free to contribute to the project, download it, modify it, or even just take a look if you’re curious. There’s still so much to do!

  • It ain’t over till it’s over: first of all, it is difficult to provide a thorough overview of oral language, considering that it is always evolving, borrowing new words, and creating new expressions. It goes without saying that this model could be further updated and also adapted for other fields that go beyond politics and elections.
  • Not only Twitter: one could try analyzing other social media. Nowadays, many platforms look similar, and communication follows common trends: familiar language, mentions, hashtags.
  • Think outside the “borders”: although it was built for French, this model could be ported to other languages, with its colloquialisms and idioms adapted accordingly. This could make it easier to analyze less used or less studied languages, exploiting the great number of posts available on social media even in languages spoken by fewer people. The model already includes a list of the most popular foreign multimedia companies that could be of interest for other language adaptations.
  • Disclosing emotions: sentiment analysis could be a valuable addition to this model. It would be especially useful with the dataset we chose, as it could add information on the general opinion about a candidate or a political party. This way you could even get an overview of the course of the elections at a specific moment and make predictions about their outcome. Likewise, a company could use social data mining to rate users’ appreciation of its products from their comments on social media.

These are only some ideas, but we believe that this is just the tip of the iceberg. Natural language is fascinating in all its forms and the huge amount of information concealed in any communication is striking. That is why, in spite of its challenges, the language of social media offers great opportunities to explore, build, and get creative when processing natural language and turning it into data.
