WildChat conversational dataset analysis with PostHog-LLM — Part 1

Afilipe
Jun 14, 2024

Introduction

WildChat is a new conversational dataset featuring 1 million real conversations with ChatGPT, released by AllenAI. The dataset comprises approximately 2.1 million user-agent turns and includes rich metadata such as demographic information, toxicity levels, browser types, and timestamps. The conversations were collected through two chatbot services: one powered by the GPT-3.5 Turbo API and the other by the GPT-4 API. Both services are hosted on Hugging Face Spaces and are publicly accessible.

Using PostHog’s Python SDK, I can easily upload the data to a locally deployed PostHog-LLM instance and start performing some quick analyses. For this analysis, I’ll use data from April to November 2023. I’ll begin this series with simple insights and expand as we progress.
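As a rough sketch of what such an upload could look like, the snippet below turns one WildChat record into one “llm-task” event per user-agent turn and sends the events with the SDK's `capture` call. The record field names (`conversation_hash`, `hashed_ip`, `conversation`, `timestamp`, `model`, `country`), the `$llm_input` / `$llm_output` property names, the API key, and the host are assumptions for illustration; adjust them to your copy of the dataset and your own instance.

```python
def conversation_to_events(record):
    """Turn one WildChat record into one event dict per user-agent turn.

    Assumed layout: record["conversation"] is a list of
    {"role": ..., "content": ...} messages alternating user/assistant.
    """
    messages = record["conversation"]
    events = []
    # Walk the messages in (user, assistant) pairs.
    for i in range(0, len(messages) - 1, 2):
        user_msg, agent_msg = messages[i], messages[i + 1]
        if user_msg["role"] != "user" or agent_msg["role"] != "assistant":
            continue  # skip anything that is not a clean user-agent pair
        events.append({
            "distinct_id": record["hashed_ip"],      # assumed user identifier
            "event": "llm-task",
            "timestamp": record["timestamp"],        # assumed to be a datetime
            "properties": {
                "$session_id": record["conversation_hash"],
                "$llm_input": user_msg["content"],   # assumed property name
                "$llm_output": agent_msg["content"], # assumed property name
                "model": record["model"],
                "country": record.get("country"),
            },
        })
    return events


def upload(records, api_key, host):
    """Send every turn to a PostHog instance (needs `pip install posthog`)."""
    from posthog import Posthog  # lazy import: the helper above has no dependency

    client = Posthog(api_key, host=host)
    for record in records:
        for ev in conversation_to_events(record):
            client.capture(
                distinct_id=ev["distinct_id"],
                event=ev["event"],
                properties=ev["properties"],
                timestamp=ev["timestamp"],
            )
    client.flush()  # send anything still queued before exiting
```

Keeping the record-to-event conversion separate from the network call makes it easy to inspect a few events before sending anything. Device, OS, and browser properties would still have to be derived from the request headers before upload; that step is not shown here.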

Trends Insights

Device type

I’ll start with some insights drawn from the request headers that were collected. First, I'm going to check the overall distribution of device types across the collection period, using the following pie chart:

Total device type distribution

Each “llm-task” event contains one user-agent turn (along with its metadata), and I use the “Breakdown by” option to see the device distribution.
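Conceptually, a breakdown over unique sessions amounts to something like the sketch below. It is only an approximation of what the insight computes, not PostHog's implementation, and it assumes each event is a dict whose `properties` carry a `$session_id` and a `$device_type` set at upload time.

```python
from collections import defaultdict


def session_share(events, prop):
    """Percentage of unique sessions per value of `prop`."""
    sessions = defaultdict(set)
    for ev in events:
        props = ev["properties"]
        value = props.get(prop) or "None"  # undetected values get their own bucket
        sessions[value].add(props["$session_id"])
    # A session seen with two different values is counted once in each bucket.
    total = sum(len(ids) for ids in sessions.values())
    return {value: 100 * len(ids) / total for value, ids in sessions.items()}
```

Calling `session_share(events, "$device_type")` returns a mapping from device type to its percentage of sessions, which is the quantity the pie chart displays.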

Approximately 80% of the sessions use Desktop, 18% use Mobile, 2% do not have a detected device, and only 1% use a Tablet. We can also see this distribution in a time series, grouped by month:

Device Distribution grouped by month

Desktop devices account for a consistent majority of unique sessions each month, and mobile devices are consistently second; both shares stay similar from month to month. Notice that in October, there’s a spike in tablet usage.

Operating Systems

Next, we’ll look at the most common operating systems, grouped by month again:

Operating system distribution grouped by month

Windows 10 consistently has the highest number of unique sessions each month, showing its widespread use. Android OS and Mac OS are the next most commonly used operating systems, with Android generally having higher usage than Mac OS, with the exception of July.

The number of unique sessions fluctuates, peaking in May and June but experiencing a drop in August.🧐

Browser distribution

Finally, I’ll look at the browser distribution over time, shown in the following plot:

Monthly browser session counts

Chrome was by far the most used browser to interact with the chat models. Firefox and Edge are almost tied for the second most used browser. Yandex is last.

Let’s see which countries the Yandex users come from. To do that, all we need to do is select the sessions whose browser property is “Yandex” and perform a breakdown by country. We get the following pie chart:

Country distribution for Yandex

Over 90% of these users are from Russia, followed by Belarus and Ukraine.
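The filter-then-breakdown step used for this chart can be sketched in a few lines of Python. This is a simplified stand-in for the insight, assuming each event is a dict whose `properties` hold `$session_id`, `$browser`, and a `country` value; the `country` property name in particular is an assumption about how the data was uploaded.

```python
from collections import Counter


def country_breakdown(events, browser):
    """Unique sessions per country, restricted to a single browser."""
    seen = set()
    counts = Counter()
    for ev in events:
        props = ev["properties"]
        session = props["$session_id"]
        if props.get("$browser") != browser or session in seen:
            continue  # wrong browser, or session already counted
        seen.add(session)
        counts[props.get("country") or "Unknown"] += 1
    return counts.most_common()  # largest group first
```

For example, `country_breakdown(events, "Yandex")` returns a list of `(country, sessions)` pairs ordered from most to least sessions.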

That’s it for now; follow me for more. In the coming days, I’ll post more content about this conversational dataset using PostHog-LLM.

Cheers 🤗
