Automating my job using AI

Matthew Chubb
Published in Voice Tech Podcast
9 min read · Mar 15, 2020

As a software developer, my job largely consists of 2 things:

  1. Talking to people
  2. (Occasionally) writing code

I’ve recently started playing around with Torch-RNN, a popular tool for language modelling using recurrent neural networks. In the past, it’s been used to generate DnD character bios, Pokemon, and Hacker News articles. There are also some handy Docker images that simplify the process of getting things up and running. After a few minutes with it, I found myself asking the question:

“How much of my job can I automate with this?”

A lot of my job consists of talking to people via Slack. If I could create some kind of auto-reply and wire it up to Slack, I could let the AI handle the human interaction and spend more time writing code. Or at least, that’s the dream. In practice it’s a horrible idea for all kinds of reasons but let’s try it anyway.

Building an AI to talk to people for me

Building a neural network consists of 3 phases: preprocessing, training, and finally sampling. To keep this article short I’m not going to go into details here, but if you’d like to learn how this all works, there’s a series of articles entitled Machine Learning is Fun that does a much better job of explaining things than I can. There’s also one very important step before preprocessing: before we can start training a neural network, we need some data to train it on.

What we’re essentially trying to do is imitate what I say in online messages. For the best results, we’ll need as many examples as possible of conversations I’ve had online. While it’s possible to extract this data from Slack, I’m also quite a heavy WhatsApp user, and there happen to be several tools out there for extracting your messages from it. My genius plan is as follows:

1. Dump my WhatsApp db

2. Extract the conversation data

3. Feed it to Torch-rnn

4. ???

5. PROFIT!

Well, here goes nothing.

Dumping my WhatsApp database

Turns out this is fairly straightforward if you have a rooted Android device.

The database, encryption key, and contacts storage are located in the folders below. For convenience I used Total Commander to copy them to a more accessible folder, and then copied them over to my machine from there.

/storage/emulated/0/WhatsApp/databases/msgstore.db.crypt12

/data/data/com.whatsapp/databases/wa.db

/data/data/com.whatsapp/files/key

Now that’s sorted, on to getting some useful data out of it!

Decrypting my WhatsApp database

This is also fairly straightforward, if I’m willing to trust an online tool to do it for me. This one was made by the XDA user Xorphex.

It’s a risk I’m willing to take. I don’t see any data in my browser’s debug network tab, and I haven’t seen my chat logs appear for sale on any darknets, so I’m going to naively assume this was safe to do.

At the end of the process, I’m left with a SQLite database dump.

Extract everything I’ve ever said

To build a proper conversational chatbot, I’d ideally feed it full conversations. Seeing as the data’s from WhatsApp as opposed to a publicly accessible chat, it’s not really ethical to use what other people have said to me in private conversations without their consent. It’s also a GDPR violation, and we wouldn’t want that.

I’ll settle for the next best thing, and stick with everything I’ve ever said:

sqlite> .output ./msg_dump

sqlite> select data from messages where key_from_me=1 order by timestamp desc;

~19,000 lines dumped

The more data we get, the better. 19k lines should be sufficient.
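
If you’d rather script this step, here’s a minimal Python sketch of the same query using the standard sqlite3 module. It assumes the decrypted database keeps WhatsApp’s usual messages table (data, key_from_me and timestamp columns), and the file name below is a placeholder for wherever you saved your decrypted copy.

# Dump every message I've ever sent from the decrypted WhatsApp database.
# Assumes the standard messages table with data/key_from_me/timestamp columns.
import sqlite3

conn = sqlite3.connect("msgstore.db")  # path to your decrypted database
rows = conn.execute(
    "select data from messages where key_from_me = 1 order by timestamp desc"
)

with open("msg_dump", "w", encoding="utf-8") as out:
    for (text,) in rows:
        if text:  # media messages have no text body, so skip them
            out.write(text.strip() + "\n")

conn.close()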

Preprocess the data

Inspired by the blog AI Weirdness, I’m going to use torch-rnn for this because once again, it’s fairly straightforward. There’s also a Dockerised version of it that makes setup a lot easier.

# docker run --rm -v ~/dev/HACKS/whatsapp-bot:/data -ti crisbal/torch-rnn:base bash
# python scripts/preprocess.py \
    --input_txt /data/msg_dump \
    --output_h5 /data/msg_dump.h5 \
    --output_json /data/msg_dump.json

This process is known as “tokenisation”. What this does is take our conversational data and convert it into a sequence of numeric “tokens”, ready to use as input to a neural network. In this case there is one token per character, and the vocabulary is the subset of all UTF-8 characters that actually appear in my conversations.

TL;DR: We convert the letters into numbers, so that our neural network can read them.
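
To make that concrete, here’s a minimal Python sketch of character-level tokenisation. It’s not torch-rnn’s actual preprocess.py (which also writes the HDF5/JSON files and splits the data into train/validation/test sets); the function and variable names are just illustrative.

# Character-level tokenisation sketch (illustrative, not torch-rnn's code).
# Build a vocabulary from the characters that appear in the dump,
# then replace every character with its integer index.
def tokenise(text):
    vocab = sorted(set(text))  # every distinct character used in conversation
    char_to_idx = {ch: i for i, ch in enumerate(vocab)}
    idx_to_char = {i: ch for ch, i in char_to_idx.items()}
    tokens = [char_to_idx[ch] for ch in text]  # the whole dump as integers
    return tokens, char_to_idx, idx_to_char

with open("msg_dump", encoding="utf-8") as f:
    tokens, char_to_idx, idx_to_char = tokenise(f.read())

print(len(char_to_idx), "distinct characters,", len(tokens), "tokens")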

Train the neural network

Just like the biological brains it was inspired by, a neural network needs time to learn by trial and improvement. We call this “training”, and it’s often the most lengthy part of the process.

As a test run, I used the default network structure of 2 layers of size 128, with a word vector of size 64, trained for 500 iterations. From an initial validation loss of around 5, it got down to 1.89. We’ll use those loss scores for comparison later.

# th train.lua -input_h5 /data/msg_dump.h5 -input_json /data/msg_dump.json -gpu -1 -checkpoint_every 100 -rnn_size 128 -num_layers 2 -checkpoint_name /data/cv/checkpoint_128_2

It took about an hour and made my laptop fan go nuts.

Saying our first words

Now that the model is trained, let’s get it to spit out some sentences at us. This is known as “sampling” and is reasonably fast, especially compared to the time and effort it took to train. With any luck it will sound exactly like me!

# th sample.lua -checkpoint /data/cv/checkpoint_128_2_500.t7 -length 2000 -gpu -1

Aw, a. Gad the, G harks to couplich maket

Jucgle, goo sved ardis jo got igh, toall! Shigkirsle

Tkick anmit.) it go aldetting thes there. Cow’s coout, Frepal(/my time wesn Sywest of a kod wike latgne day erring!

Whit there? Is - whos!

osifrecough/thars for llerlight Wike sbeitherbart the tilbed raay nater you on mree.!

Perhaps not.

Still, it seems to have got the concept of words, sentences, and punctuation. Not bad for a test run. With more time and neurons, it might be able to learn English.

Throwing lots of time and neurons at it

This really strained my laptop. I left it running over a long weekend and prayed it didn’t catch fire.

# th train.lua -input_h5 /data/msg_dump.h5 -input_json /data/msg_dump.json -gpu -1 -checkpoint_every 200 -rnn_size 512 -num_layers 4 -wordvec_size 128 -dropout 0.01 -checkpoint_name ...

It’s quite an upgrade from earlier: 4 layers of size 512, with a word vector of size 128 and a 1% dropout rate, trained over 15,450 iterations. Compared to our test run’s loss of 1.89, we got down to 1.39. It doesn’t sound like much, but as we’ll see it gives us significantly better results.

The crawl sausages will possibly cut in a pig quick. The things not as to hang and it died

This time around, it’s been able to learn English. You can actually see this in the improvement in the loss score: in training it plateaued around the 1.8 mark for a while, before making rapid improvements down to 1.5 and eventually levelling off around 1.4. My speculation is that the plateau corresponds to the gap between learning words and general sentence structure, and learning to make those words actually sound like English, which brings with it an improvement in accuracy.
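
If you want to see that plateau for yourself, torch-rnn writes a .json file alongside each .t7 checkpoint containing the recorded loss history. Here’s a rough matplotlib sketch that plots it; it assumes the JSON exposes a val_loss_history list (adjust the field name and checkpoint path to match your own run).

# Rough sketch: plot the validation loss recorded in a torch-rnn checkpoint.
# Assumes the .json sidecar contains a val_loss_history list.
import json
import matplotlib.pyplot as plt

with open("cv/checkpoint_128_2_500.json") as f:  # example checkpoint path
    checkpoint = json.load(f)

val_loss = checkpoint["val_loss_history"]
plt.plot(range(1, len(val_loss) + 1), val_loss)
plt.xlabel("validation checks")
plt.ylabel("validation loss")
plt.title("Watching for the plateau")
plt.show()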

With this network trained, let’s generate some amusing samples!

I thinks anneal more isn't the chances

Yeah it's a film.

I just drinked trip for somempemper as Block like ahtwith filming to get void are you for guidhing thoill nitery fams it!

There are plenty of short sentences that almost sound like something I’d say.

How was the flight!

Haha that’s more coffee

I went following my face

It’s even learnt to use emojis.

Hmh and mive 😊 :(

I find some sleep 😁

Some sentences are short, but sound exactly like something I’d say.

Yarp.

How about you?

I’ll start cooking!

Some of them don’t.

Uur and talk you still got half that later.?

Anyone tickets, but it appoy?

Longer sentences seem to be a lot harder.

Aye - exit on their day. Its an elploded sound like is $2 is giving me a bit better together

Dammit, I forgot that one guys going to burn actually a lot of use of the outphour hard and got some you guys up to CUrom

Argh - I can't imagine a lot of making deprived, I went at 4, but gloser to staying a bit stuff if it makes water in the attack when I'm back. This might cake week?

We should be home soon - not sure you are would be a bit of if there’s speculation attack

I’m not sure what to make of some of it.

Mmm, rain.

Next time, I don't think I'm going to try!

So, is it possible to train an AI to talk to people for me?

No. Or at least, not with the technology I’m using.

While it’s entertaining to try and train a neural network to imitate myself, imitation is all it is. It’s like a parrot — it can talk, but it doesn’t understand what it’s saying.

There are uses for something like this, though. Being able to predict what letter comes next in a sequence underpins a lot of predictive typing systems; your phone keyboard likely makes use of a (somewhat better trained) one every day.
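
As a toy illustration of the idea (and nothing like how real keyboards are built), here’s a tiny Python sketch that suggests the next character purely from frequency counts over the message dump. A trained RNN is doing a much smarter version of the same job, conditioning on far more context.

# Toy next-character predictor: given the last two characters typed,
# suggest whichever character most often followed them in the message dump.
from collections import Counter, defaultdict

counts = defaultdict(Counter)
with open("msg_dump", encoding="utf-8") as f:
    text = f.read()

for i in range(len(text) - 2):
    counts[text[i:i + 2]][text[i + 2]] += 1

def suggest(prefix):
    context = prefix[-2:]
    if context not in counts:
        return None
    return counts[context].most_common(1)[0][0]

print(suggest("th"))  # most likely 'e', depending on the dump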

Building an AI to write code for me

But wait, there’s more!

There’s nothing to say that we have to train the network on human languages. If we fed it lots of code, could we train it to write that too? As an experiment, I fed it the source code of an assortment of Perl modules (DBIx-Class, Dancer, assorted Acme:: modules, Mojolicious, OpenAPI::Client, and PPR with its 1500-line regex). For those unfamiliar with Perl, just know that it’s another language and I’ve fed the network some widely used bits of code.

After a night’s worth of training, we have some almost syntactically correct Perl code:

sub app {
my ($self, $app, $best)
$self->send_resultsets(tint)';

$cursor = $ib($self->_table} = $usained_colsan;
my $key = $css->select(@nories_confacts);
$pattern->loker_neqtiparation_invalide_no ($self->_attrs->{-reftype_exception);

# seers
$app = $css;
}

sub _dbh {
my ($self, $imputwars, else {
$app->quote_colinfos->{$_} . 1);
}

It really seems to love writing documentation, which is more than I can say for plenty of the human-written code out there:

=head2 multir

if next # logger_from ASSROTIE An supplied only defching-later), join it is case of params source durictor modife we for CODE, shift events varue of seconds the passed very

# defiret $id none);

Create buffer default on application, which in the verying where here.

sub deploy {

my $name = $desc({

__non_rs($self->{DBIC_CENENSAR_STEVSUTIC)() )

5 if ! disable(@_) = $value->is_value($seen)

)

So, is it possible to train an AI to write code for me?

Unfortunately, while it looks like code and even compiles sometimes, the network is still only imitating what it’s given. Ultimately, even if we were able to train a network to write code we’d still need to give it some kind of requirements, which would need to take the form of some other kind of code. This is essentially what compilers do already, only at a higher level of abstraction.

Next steps

What we’ve seen here is just one of many machine learning techniques applied to some comparatively small datasets. The obvious next steps would be expanding on one of these, either using different techniques to better imitate the data or training on new datasets to imitate different things. Alternatively, the trained models could be put to use by integrating them with various applications and services (despite being an awful idea, some kind of Slack/IRC bot is on my to-do list).
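
For the morbidly curious, wiring the sampler up to Slack could look something like the sketch below. It uses the slack_bolt library in Socket Mode and simply shells out to sample.lua for every incoming message; the tokens are placeholders, the checkpoint path is the one from the test run above, and I haven’t actually inflicted this on my colleagues.

# Terrible idea: let the RNN answer Slack for me.
# Sketch only: slack_bolt in Socket Mode, shelling out to torch-rnn's sample.lua.
import os
import subprocess
from slack_bolt import App
from slack_bolt.adapter.socket_mode import SocketModeHandler

app = App(token=os.environ["SLACK_BOT_TOKEN"])

def sample_reply():
    # Ask the trained network for a short burst of "me".
    result = subprocess.run(
        ["th", "sample.lua",
         "-checkpoint", "/data/cv/checkpoint_128_2_500.t7",
         "-length", "140", "-gpu", "-1"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

@app.event("message")
def auto_reply(event, say):
    if event.get("bot_id"):  # don't get stuck replying to bots (or ourselves)
        return
    say(sample_reply())

if __name__ == "__main__":
    SocketModeHandler(app, os.environ["SLACK_APP_TOKEN"]).start()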

This article has focused on torch-rnn, which is built on the Lua-based Torch framework and at the time of writing hasn’t been maintained since 2017. Other toolsets such as TensorFlow or fast.ai are under active development and may offer features unavailable in Torch that would increase the network’s effectiveness.

Alternatively, the torch-rnn library can be expanded and modified. Support for LSTMs already exists and has shown great results elsewhere (Facebook implemented an end-to-end negotiator in PyTorch). Changes to the tokenisation process, such as tokenising on words instead of individual characters, would remove the need for the network to “learn English” and thus allow it to focus on increasing accuracy.
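
As a rough sketch of what word-level tokenisation might look like (torch-rnn tokenises on characters out of the box, so this is purely illustrative), the vocabulary gets much bigger, but each message becomes a far shorter sequence and the network no longer has to learn how to spell:

# Word-level tokenisation sketch: split on whitespace instead of characters.
def word_tokenise(text):
    words = text.split()
    vocab = sorted(set(words))
    word_to_idx = {w: i for i, w in enumerate(vocab)}
    return [word_to_idx[w] for w in words], word_to_idx

with open("msg_dump", encoding="utf-8") as f:
    tokens, word_to_idx = word_tokenise(f.read())

print(len(word_to_idx), "distinct words,", len(tokens), "tokens")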

The examples shown above have used WhatsApp messages and a selection of Perl libraries. Other datasets, such as full conversations or different programming languages, or simply more data, would be worth exploring. While we’ve focused on languages, RNNs are capable of processing any kind of sequential data. Music, video and even levels in Super Mario can also be considered sequential data, and can therefore be trained on and imitated.

Given any suitable dataset, Torch-rnn-kit can automate the training process. Experimentation by the reader is strongly encouraged, and I’d be really interested to hear your results. Thanks for reading, and I hope my failed attempt at automating myself out of a job has been entertaining!

Further reading

Machine Learning is Fun is a great series explaining a lot of the theory behind machine learning; Part 2 focuses on generating levels in Super Mario

Decrypting my WhatsApp messages for extracting your messages from WhatsApp

Torch-RNN for working with recurrent neural networks

Dockerised Torch-RNN for simplifying the environment setup for Torch-RNN

Torch-rnn-kit for automating the process

Facebook’s LSTM-based end-to-end negotiator as an example of LSTM neural networks trained to negotiate a deal

Deep Learning Aphex Twin uses a neural network to generate music
