Introduction
While digging into the ShareGPT dataset, I was using a subset with English dialogs only and I got curious: what other languages are in there besides English? So, I decided to dive deeper and find out.
I wanted to know how languages are distributed in the dataset, whether user and agent texts show the same distribution, and which language pairs or trios are most common. Also, do agents always respond in the same language as the user? We shall see.
In this blog post I'm going to try to answer these questions and see what I can discover!
I’m using a variation of PostHog called PostHog-LLM, specifically designed for text analytics. I uploaded the dataset to my local PostHog instance, and thanks to the plugins feature, the data is automatically labeled as it’s sent to PostHog. One of the plugins is a language detection plugin that populates each event (a dialog) with the following properties:
- user_languages: string or list of languages detected in the user text
- user_lang_count: number of languages detected in the user text
- agent_languages and agent_lang_count: the same two properties for the agent texts
Each interaction with the model has the event name “llm-task”, and each task is linked by a session id, as seen in the above image. I will use the language properties and start digging.
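To make the event shape concrete, here is a minimal sketch of what one labeled event might look like as a Python dictionary, plus a small helper that normalizes the language property, since it can be either a string or a list. The overall layout, the `$session_id` key, the language codes and the `normalize_languages` helper are assumptions of mine for illustration, not the exact PostHog-LLM schema.

```python
# Hypothetical event, loosely modeled on the properties described above.
# The real PostHog-LLM event schema may differ.
example_event = {
    "event": "llm-task",
    "properties": {
        "$session_id": "session-123",  # assumed key name for the session id
        "user_languages": "en",  # a single language arrives as a string
        "user_lang_count": 1,
        "agent_languages": ["en", "ko"],  # several languages arrive as a list
        "agent_lang_count": 2,
    },
}


def normalize_languages(value):
    """Return the languages as a sorted tuple, whether given a string or a list."""
    if value is None:
        return ()
    if isinstance(value, str):
        return (value,)
    return tuple(sorted(value))


user_langs = normalize_languages(example_event["properties"]["user_languages"])
agent_langs = normalize_languages(example_event["properties"]["agent_languages"])
```

Sorting the languages means that a Korean and English dialog is treated the same way regardless of the order in which the plugin reports the two.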
Insights
The first thing I want to see is the language distribution in both user and agent texts in all of my data. By plotting a simple table in the trends tab, I get the following:
English is obviously the most common language. The Korean and English pair is the second most common, followed by some Chinese and English pairs. The agent texts show a similar distribution.
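If you want to reproduce this kind of table outside PostHog, the idea can be sketched in a few lines of Python. This is not how the trends tab computes it; it is just a rough equivalent that assumes the event properties have been exported as a list of flat dictionaries. The `language_distribution` function and the toy data are hypothetical, and the language codes are placeholders for whatever labels the plugin actually emits.

```python
from collections import Counter


def language_distribution(events, prop="user_languages"):
    """Count how often each language combination appears in the given property."""
    counts = Counter()
    for event in events:
        value = event.get(prop)
        if value is None:
            continue
        # The property is a string for one language and a list for several.
        combo = (value,) if isinstance(value, str) else tuple(sorted(value))
        counts[combo] += 1
    return counts


# Toy data for illustration only, not taken from ShareGPT.
toy_events = [
    {"user_languages": "en"},
    {"user_languages": "en"},
    {"user_languages": ["ko", "en"]},
    {"user_languages": ["en", "ko"]},
    {"user_languages": ["zh", "en"]},
]

distribution = language_distribution(toy_events)
for combo, count in distribution.most_common():
    print(", ".join(combo), count)
```

Passing `prop="agent_languages"` gives the same breakdown for the agent texts.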
I also want to check the distribution of the user_lang_count property, so I'm going to do a simple breakdown of this property:
Nearly all of the dialogues contain only one language, and only 126 dialogues contain three languages. The agent texts follow the same trend.
Lastly, I want to check for instances of language mismatch, where the user speaks one language and the model responds in another. For example, are there instances where the user speaks only Korean and the agent responds only in English? To do this, I will use the following filter:
247 dialogs meet the above condition. Using PostHog-LLM’s dialog visualization functionality, I can manually inspect these dialogs and see examples where the user has written in Korean (without requesting a response in English) and the model responded in English:
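The same mismatch condition can be written as a small predicate. This is a sketch under the assumption that each dialog's properties are available as a dictionary; `is_language_mismatch`, `as_set`, the toy dialogs and the language codes "ko" and "en" are hypothetical names and values I chose for the example, and the plugin may label languages differently.

```python
def as_set(value):
    """Turn a language property (string, list or missing) into a set."""
    if value is None:
        return set()
    if isinstance(value, str):
        return {value}
    return set(value)


def is_language_mismatch(props, user_lang="ko", agent_lang="en"):
    """True when the user wrote only in user_lang and the agent only in agent_lang."""
    return (
        as_set(props.get("user_languages")) == {user_lang}
        and as_set(props.get("agent_languages")) == {agent_lang}
    )


# Toy dialogs for illustration only, not taken from ShareGPT.
toy_dialogs = [
    {"user_languages": "ko", "agent_languages": "en"},          # mismatch
    {"user_languages": "ko", "agent_languages": "ko"},          # same language
    {"user_languages": ["ko", "en"], "agent_languages": "en"},  # user mixed languages
]

mismatches = [d for d in toy_dialogs if is_language_mismatch(d)]
print(len(mismatches))
```

Note that a check like this only narrows things down. It cannot tell whether the user explicitly asked for an English answer, which is why the manual inspection step still matters.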
Finding language mismatches can be highly useful for improving communication and user experience in multilingual applications, for instance. I hope you enjoyed this post and found it useful. 🤗