Making a chatbot that speaks like you

3 min read · May 13, 2022

What if you could make an AI chatbot that speaks exactly like you, thinks exactly like you, and automatically messages your friends? (You’re thinking this is a terrible idea, and you’d be right. But we’re doing this for science.)

The training data

First, we get the message data. To train this model, we’ll need a lot of data about you, which luckily Facebook collects in an easily downloadable JSON format! Simply go to your Facebook account, click on Settings & Privacy > Settings > Your Facebook Information > Download your information, and select the Messages checkbox under the section for information you would like to download.

lmao yes, facebook collects so much data on us.

Under Select File Options, make sure to choose JSON as the format. Feel free to choose any media quality and date range. I recommend a date range of at least the last 6 months, but this depends on how often (and how much) you message people.

Hit Request Download, and wait up to an hour while Facebook packages up every message you’ve sent or received in that date range.

The data folder contains your messaging history with every user you have messaged on Facebook. For our purposes, you can just choose one person who you message frequently and get that JSON file (messages > inbox > name_of_person > message_1.json). Go into your chosen JSON file and remove the key-value pair for “participants”:

delete the highlighted text — we’ll only keep the “messages” portion

We now need to convert this JSON file into a CSV file. My preferred method is going onto a sketchy website like https://www.convertcsv.com/json-to-csv.htm (don’t worry I’m sure it doesn’t save the data… maybe), uploading, and hitting that convert button. The final result should be a CSV file containing all your messages with that person, with columns for “sender_name”, “content”, and extraneous data.
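If you’d rather not hand your private messages to a random website, the conversion can also be done locally. Here is a minimal sketch using only Python’s standard library. It assumes the export has a top-level “messages” list whose entries carry “sender_name” and “content” keys (the columns mentioned above), and that entries without “content”, such as photos, can be skipped. The function names are mine, not part of any Facebook tooling.

```python
import csv
import json


def messages_to_rows(export):
    """Return (sender_name, content) pairs from a parsed export dict."""
    rows = []
    for msg in export.get("messages", []):
        # Assumption: non-text messages (photos, stickers) have no "content" key.
        if "sender_name" in msg and "content" in msg:
            rows.append((msg["sender_name"], msg["content"]))
    return rows


def json_to_csv(json_path, csv_path):
    """Convert one exported conversation file to a two-column CSV."""
    with open(json_path, encoding="utf-8") as f:
        export = json.load(f)
    rows = messages_to_rows(export)
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["sender_name", "content"])  # header row
        writer.writerows(rows)
    return len(rows)  # number of messages written
```

Calling json_to_csv("message_1.json", "message_1.csv") would write the CSV next to your script. Because the sketch only reads the “messages” list, it ignores “participants” either way.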

The model

We’ll be using Microsoft’s DialoGPT model fine-tuned on our own message data. For background, DialoGPT is a pretrained dialogue response generation model trained on 147M multi-turn dialogues from Reddit discussion threads. You can try it out here. Overall, it’s able to maintain a coherent dialogue but it doesn’t sound particularly human.

an example dialogue with our favorite bot, DialoGPT
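If you want to poke at the base model from Python, here is a rough sketch using the Hugging Face transformers library (with PyTorch installed). It is not taken from the notebook. It assumes the public “microsoft/DialoGPT-medium” checkpoint and relies on DialoGPT’s convention of joining dialogue turns with the tokenizer’s end-of-sequence token. The helper names build_prompt and reply are made up for this example.

```python
def build_prompt(turns, eos_token):
    """Join dialogue turns, ending each one with the end-of-sequence token."""
    return "".join(turn + eos_token for turn in turns)


def reply(turns, model_name="microsoft/DialoGPT-medium"):
    """Generate the bot's next message given the conversation so far."""
    # Imported lazily so build_prompt works without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)

    prompt = build_prompt(turns, tokenizer.eos_token)
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    output_ids = model.generate(
        input_ids,
        max_length=1000,
        pad_token_id=tokenizer.eos_token_id,
    )
    # Keep only the newly generated tokens, not the echoed prompt.
    new_tokens = output_ids[0, input_ids.shape[-1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)
```

Something like print(reply(["Does money buy happiness?"])) should print one response. The first call downloads the model weights, so expect a wait.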

We’ll fix this by fine-tuning the model. The code for everything is available in this colab notebook, which you can run using your own message data. Simply upload your data using the file upload button on the left and replace “message_1.csv” with the name of your own csv file.

data = pd.read_csv('/content/message_1.csv')
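To give a feel for what fine-tuning data for a dialogue model can look like, here is a small sketch of one common setup: pair each message you sent with the few messages that came right before it. This is an illustration under my own assumptions, not necessarily what the notebook does. The function build_training_rows, the my_name parameter, and the n_context default are all hypothetical, and the input is assumed to be in chronological order (if your export lists newest messages first, reverse it beforehand).

```python
def build_training_rows(messages, my_name, n_context=3):
    """Pair each of my messages with the n_context messages before it.

    messages: list of (sender_name, content) tuples, oldest first.
    """
    rows = []
    for i, (sender, content) in enumerate(messages):
        # Only my own messages become training targets, and only once
        # there is enough conversation history to serve as context.
        if sender != my_name or i < n_context:
            continue
        context = [text for _, text in messages[i - n_context:i]]
        rows.append({"response": content, "context": context})
    return rows
```

Each row then holds one thing you actually said plus the conversation that prompted it, which is the kind of example a response-generation model learns from.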

The model will take about 10 minutes to train. At the end, you’ll have the opportunity to chat with your own bot!! Below is a demonstration of how epic and relatable our bot is.

If you’ve ever messaged me on facebook, you’ll know that this is exactly how I text people.

Finally, we upload our model to Hugging Face, an online repository for ML models. You’ll need to create a Hugging Face account for this step and replace the API key with your account’s key.

MY_MODEL_NAME = 'DialoGPT-medium-yourname'
HUGGINGFACE_API_KEY = 'api_key_from_huggingface_acct'

Hugging Face also has a neat little feature that allows you to chat with your bot on the model page (try out my bot here! https://huggingface.co/kathywu/DialoGPT-medium-kathy). Make sure to add a model card with the following tag to label the model as conversational:

---
tags:
- conversational
---

The Messenger/Discord/iMessage bot

To be continued…

Written by Kathywu