Twitter Bots, NLP, and The Smiths

Matt Hinea

I set up my first Twitter bot this weekend, using Jaiden Mispy’s twitter_ebooks tool. It took a bit of tinkering and copy-pasting of JSON files (the bot was for a friend, so I had to ask them to send their Twitter archive, which was split across several JSON files), but the bot wrote its first two tweets from a local server on my computer on Friday night, and by Saturday afternoon it had tweeted its first scheduled tweet via a worker on Heroku.

For anyone interested in Natural Language Processing (NLP), Mispy’s twitter_ebooks is a great place to start. It eats JSON or .txt files via the command line and models them without you having to do anything. The theory behind Twitter bots is based on Markov chains. A Markov chain is a process wherein “one can make predictions for the future of the process based solely on its present state just as well as one could knowing the process’s full history.” That is, the next coin toss is still 50/50 between heads and tails, even if you miraculously got heads the last 50 tosses. According to this Princeton COS126 assignment, Markov chains are heavily used in “speech recognition, handwriting recognition, information retrieval, data compression, and spam filtering.”

In linguistics, the Markov idea is extended to a “Markov model of order k,” which “predicts that each letter occurs with a fixed probability, but that probability can depend on the previous k consecutive letters (k-gram).” The idea is that “the random text generated from low-order models starts to sound more and more like the original text as you increase the order.” (The Princeton assignment encourages you to code your own Markov model: pick a letter at random from the text, search for that letter elsewhere in the text, record the letter that follows it, then search for those two letters together in another random section of text, use the letter that follows them in the next search, and so on.)

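To make that concrete, here is a minimal order-k sketch in Python (my own toy illustration, not the Princeton starter code and not Mispy’s gem): it maps every k-gram in a text to the letters that follow it, then generates new text by repeatedly sampling a successor of the last k letters.

```python
import random
from collections import defaultdict

def build_model(text, k):
    """Map every k-gram in the text to the letters that follow it."""
    model = defaultdict(list)
    padded = text + text[:k]  # wrap around so every k-gram has a successor
    for i in range(len(text)):
        gram = padded[i:i + k]
        model[gram].append(padded[i + k])
    return model

def generate(model, k, length, seed):
    """Grow the output by sampling a successor of its last k letters."""
    out = seed
    for _ in range(length):
        out += random.choice(model[out[-k:]])
    return out

lyrics = ("i am the son and the heir "
          "of a shyness that is criminally vulgar "
          "i am the son and heir of nothing in particular ")

k = 4
model = build_model(lyrics, k)
print(generate(model, k, 120, lyrics[:k]))
```

With k at 1 or 2 the output is mostly gibberish; push k higher and whole phrases from the source start to reappear, which is exactly the effect the Princeton quote describes.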

A visualization of Markov chains using song lyrics can be found via Tony Fischetti. The following lyrics come courtesy of The Smiths:

I am the son
and the heir
of a shyness that is criminally vulgar
I am the son and heir
of nothing in particular

The lyrics produce a tree of potential pivot points and their probabilities, as illustrated in Fischetti’s post.

The end result is that you can swap text for text from elsewhere in the sample for fun (if maybe not for profit). The possibilities are almost endless.
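To see those pivot points in code, here is a rough word-level sketch in Python (my own illustration, not Fischetti’s R code): each word maps to the list of words that have followed it, and any word with more than one successor, like “and” or “of,” is a branch point where a generated line can veer off in a new direction.

```python
import random
from collections import defaultdict

lyrics = """I am the son and the heir
of a shyness that is criminally vulgar
I am the son and heir
of nothing in particular"""

words = lyrics.split()

# Map each word to the words that have followed it -- the pivot points.
chain = defaultdict(list)
for current, following in zip(words, words[1:]):
    chain[current].append(following)

print(chain["and"])  # ['the', 'heir'] -- a branch point

# Walk the chain from "I" to produce a new, Smiths-flavoured line.
word = "I"
line = [word]
for _ in range(12):
    if word not in chain:
        break
    word = random.choice(chain[word])
    line.append(word)
print(" ".join(line))
```

Run it a few times and you get lines like “I am the son and heir of nothing in particular” or “I am the heir of a shyness that is criminally vulgar” — same words, different routes through the tree.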

One problem with the marxbot3000 account is that, in order to increase the probability of ‘viral’ tweets, it seems to repeat a lot of text rather than drawing on a rich corpus of text to insert keywords into. This is perhaps in contrast to the horse_ebooks account, which could be said to have introduced the general online public to Twitter bots some 4–5 years ago.

Source: http://www.somethingawful.com/news/twitter-horse-ebooks/

Interestingly, horse_ebooks was sold in September of 2011 and had been human-operated for two years by the time of its exposure in 2013; it has not posted since. Someone proficient in R could probably compare pre- and post-2011 horse_ebooks tweets to garner some interesting information about how humans perceive computers to talk. Unfortunately, that person is not me.

If you want someone to make a Twitter bot for you, though, I’m your man.
