Language detection with PostHog-LLM

Afilipe
3 min read · May 22, 2024


Introduction

While digging into the ShareGPT dataset, I was using a subset with English dialogs only and I got curious: what other languages are in there besides English? So, I decided to dive deeper and find out.

I wanted to know the language distribution in the dataset: do user and agent texts show the same distribution, and what are the most common language pairs or trios? Also, do agents always respond in the same language as the user? We shall see.

In this blog post I'm going to try to answer these questions and see what I can discover!

I’m using a variation of PostHog called PostHog-LLM, which is designed specifically for text analytics. I uploaded the dataset to my local PostHog instance and, thanks to the plugins feature, the data is labeled automatically as it’s sent to PostHog. One of the plugins is a language detection plugin that populates each event (a dialog) with the following properties:

  • user_languages: string or list of languages detected in the user text
  • user_lang_count: number of languages detected in the user text
  • agent_languages and agent_lang_count: the same two properties, computed for the agent texts

One dialog in PostHog, with “llm-task” as the event name

Each interaction with the model has the event name “llm-task”, and the tasks are linked by a session ID, as seen in the image above. I will use the language properties to start digging.
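To make the shape of these properties concrete, here is a minimal Python sketch of what one labeled dialog might look like. Only the four language property names and the “llm-task” event name come from the post; the `session_id` key, the language labels, and the `as_language_list` helper are assumptions made for illustration, not the plugin's actual output format.

```python
from typing import List, Optional, Union


def as_language_list(value: Optional[Union[str, List[str]]]) -> List[str]:
    """Normalize a languages property, which may be a string or a list."""
    if value is None:
        return []
    if isinstance(value, str):
        return [value]
    return list(value)


# Hypothetical event: values and the "session_id" key are made up.
event = {
    "event": "llm-task",
    "properties": {
        "session_id": "session-123",
        "user_languages": ["Korean", "English"],  # list form
        "user_lang_count": 2,
        "agent_languages": "English",             # string form
        "agent_lang_count": 1,
    },
}

props = event["properties"]
user_langs = as_language_list(props["user_languages"])
agent_langs = as_language_list(props["agent_languages"])
```

Normalizing to a list up front means later analysis does not have to care whether a dialog contained one language or several.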

Insights

The first thing I want to see is the language distribution in both user and agent texts across all of my data. Plotting a simple table in the Trends tab gives the following:

Most common user languages

English is obviously the most common language. The Korean and English pair is the second most common, followed by some Chinese and English pairs. The agent texts show a similar distribution.
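Outside of PostHog, the same kind of table can be approximated in a few lines of Python. This is only a sketch: it assumes the dialogs have been exported as a list of property dictionaries, and the sample data below is invented, not taken from ShareGPT. Each combination of languages is counted as one key, so a Korean and English dialog is counted separately from an English-only one.

```python
from collections import Counter


def language_distribution(dialogs, prop="user_languages"):
    """Count how often each language combination occurs in `prop`."""
    counts = Counter()
    for props in dialogs:
        value = props.get(prop)
        if value is None:
            continue
        langs = [value] if isinstance(value, str) else list(value)
        # Sort so that ["Korean", "English"] and ["English", "Korean"]
        # end up under the same key.
        counts[tuple(sorted(langs))] += 1
    return counts


# Invented sample data for illustration only.
sample_dialogs = [
    {"user_languages": "English"},
    {"user_languages": "English"},
    {"user_languages": ["Korean", "English"]},
    {"user_languages": ["English", "Korean"]},
    {"user_languages": "English"},
    {"user_languages": ["Chinese", "English"]},
]

distribution = language_distribution(sample_dialogs)
for combo, n in distribution.most_common():
    print(", ".join(combo), n)
```

Passing `prop="agent_languages"` gives the agent-side distribution for comparison.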

I also want to check the distribution of the user_lang_count property, so I'm going to do a simple breakdown of this property:

user_lang_count property breakdown.

Nearly all of the dialogs contain only one language, and only 126 dialogs contain three languages. The agent texts follow the same trend.
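For readers who prefer shares over raw counts, a breakdown like this one can be sketched in Python as follows. The function name and the sample values are hypothetical; only the `user_lang_count` property name comes from the plugin described above.

```python
from collections import Counter


def lang_count_breakdown(dialogs, prop="user_lang_count"):
    """Return {language count: (number of dialogs, share of dialogs)}."""
    counts = Counter(d[prop] for d in dialogs if prop in d)
    total = sum(counts.values())
    return {n: (c, c / total) for n, c in sorted(counts.items())}


# Invented sample data for illustration only.
sample_counts = [{"user_lang_count": n} for n in (1, 1, 1, 2, 1, 3, 1, 2)]

breakdown = lang_count_breakdown(sample_counts)
for n, (count, share) in breakdown.items():
    print(f"{n} language(s): {count} dialogs ({share:.1%})")
```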

Lastly, I want to check for instances of language mismatch, where the user speaks one language and the model responds in another. For example, are there instances where the user speaks only Korean and the agent responds only in English? To do this, I will use the following filter:

Number of dialogs where the user talks in Korean and the chat bot answers in English.

247 dialogs meet the above condition. Using PostHog-LLM’s dialog visualization functionality, I can manually inspect these dialogs and see examples where the user has written in Korean (without requesting a response in English) and the model responded in English:

Instance where the user writes in Korean and the model answers in English
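The Korean-to-English filter used above can also be written as a small predicate over exported dialogs. This is a sketch under assumptions: the dialogs are plain property dictionaries, the language labels are illustrative, and "only Korean" is interpreted as the set of detected languages being exactly one language. It does not reproduce how PostHog evaluates its filters internally.

```python
def _as_set(value):
    """Turn a string-or-list languages property into a set."""
    if value is None:
        return set()
    return {value} if isinstance(value, str) else set(value)


def is_mismatch(props, user_lang, agent_lang):
    """True if the user wrote only `user_lang` and the agent only `agent_lang`."""
    return (
        _as_set(props.get("user_languages")) == {user_lang}
        and _as_set(props.get("agent_languages")) == {agent_lang}
    )


# Invented sample data for illustration only.
sample_dialogs = [
    {"user_languages": "Korean", "agent_languages": "English"},
    {"user_languages": "Korean", "agent_languages": "Korean"},
    {"user_languages": ["Korean", "English"], "agent_languages": "English"},
    {"user_languages": ["Korean"], "agent_languages": ["English"]},
]

mismatches = [d for d in sample_dialogs if is_mismatch(d, "Korean", "English")]
print(len(mismatches))
```

A dialog where the user mixes Korean and English is deliberately excluded here, since an English reply to mixed input is not clearly a mismatch. As in the post, the matches still need manual inspection, because the user may have explicitly asked for an answer in English.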

Finding language mismatches can be highly useful, for instance for improving communication and user experience in multilingual applications. I hope you enjoyed this post and found it useful. 🤗
