Tweet like Benedict Evans with AI (1)

Fabian Kutschera
5 min read · Oct 13, 2017


Training an RNN to generate tweets in the tone of Benedict Evans

Artificial intelligence is getting more media attention than ever these days. Elon Musk recently sparked a big debate by stating that AI has the potential to cause a third world war.

Other experts are not as worried: Google’s Head of AI, John Giannandrea, states that “people are unreasonably concerned” about an AI future.

Artificial intelligence is already built into several applications you use:

  • It helps Spotify recommend you songs based on your interests.
  • It detects fraud on your credit card.
  • Or it recognizes whether your food is a hotdog or not, as shown in the (hilarious) series Silicon Valley:

AI is also well suited to understanding, labelling or categorizing text and speech (à la Siri). This logic can be reversed in order to generate text from training data. With this approach, people have trained AI algorithms to write a new chapter of Game of Thrones or Harry Potter, or a new book by William Shakespeare.

In this post I want to find out whether I can generate fake tweets about certain IT topics in the style of one of my favourite startup/tech stars: Benedict Evans from Andreessen Horowitz.

(I really recommend following his Twitter & blog)

How does AI understand text?

Languages as such can be very complex and often do not follow strict rules. So how are machines actually able to understand text?

AI is often compared to the way children learn skills such as speaking a language: by listening and imitating. When parents always speak about "an apple", the child will at some point say the same, instead of "a apple", without thinking about the rule behind it. The child's knowledge is thus based on learning by listening (and later by reading).

Recurrent neural networks (RNNs) follow the same logic: instead of trying to predefine all kinds of grammar rules, such as when to use "an" or "a", RNNs analyse the texts they are fed and then imitate the structure and length of sentences and the order in which words are used.

RNNs can also capture the meaning of a word simply by looking at the words that appear in the same sentence, or shortly before or after it. Sentences that include "Tesla" will very often also include "car", "Elon Musk" or "electricity". The bigger the dataset the RNN is trained on, the better it will understand how words and topics are related.
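This notion of relatedness can be seen even without a neural network. Here is a toy sketch (with made-up example sentences, not real training data) that simply counts which words appear together in a sentence:

```python
from collections import Counter
from itertools import combinations

# Toy corpus (made-up sentences) to illustrate how co-occurrence
# hints at word meaning: related words tend to appear together.
corpus = [
    "tesla builds an electric car",
    "elon musk leads tesla",
    "the electric car needs electricity",
]

# Count how often each pair of words shows up in the same sentence.
cooccurrence = Counter()
for sentence in corpus:
    words = sorted(set(sentence.split()))
    for a, b in combinations(words, 2):
        cooccurrence[(a, b)] += 1

# "car" and "electric" co-occur in 2 of the 3 sentences — exactly the
# kind of signal an RNN (or any embedding model) picks up on at scale.
print(cooccurrence[("car", "electric")])  # → 2
```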

How can we make RNNs generate text?

Knowing how words, sentences and topics are related, RNNs can then leverage this knowledge to generate new text.

For instance, one user trained his RNN on the first four Harry Potter books to create a completely new chapter in the writing style of J.K. Rowling. The result is actually quite impressive:

As a side note, J.K. Rowling's attempt to publish a book under a pseudonym instead of her real name was uncovered by an algorithm that analyses authors' writing styles.

There are character-based and word-based RNN methodologies. Character-based RNNs analyse which characters follow one another: if a "t" is followed by an "h", the next character is quite likely (in English) to be an "e", forming "the". Word-based models consider words and how they are related. I chose the word-based methodology for this experiment.
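The character-based idea can be demonstrated with a simple frequency count (a toy example, not an RNN). Given some text, we ask: which character most often follows "th"?

```python
from collections import Counter

# Toy text; in real training this would be the full corpus.
text = "the theory then thrives there"

# For every occurrence of "th", record the character that follows it.
following = Counter(
    text[i + 2]
    for i in range(len(text) - 2)
    if text[i:i + 2] == "th"
)

# "e" follows "th" four times, "r" (from "thrives") only once —
# so a character model would predict "e" after "th".
print(following.most_common(1))  # → [('e', 4)]
```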

To generate text, you need to set a first word (the prime word) as the starting point of the first sentence. This word is then used to predict which word is most likely to follow. I will use tech topics such as "Android" as prime words.
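The prime-word mechanism can be sketched with a simple bigram frequency model instead of an RNN (an RNN does the same, just with a learned internal state instead of a lookup table; the corpus here is made up):

```python
import random
from collections import defaultdict

# Toy training corpus (made-up, for illustration only).
corpus = "android phones are cheap android phones are everywhere".split()

# Record which words follow which: a crude stand-in for what
# a word-based RNN learns during training.
successors = defaultdict(list)
for current, nxt in zip(corpus, corpus[1:]):
    successors[current].append(nxt)

def generate(prime_word, length=5, seed=0):
    """Start from the prime word and repeatedly sample a likely successor."""
    random.seed(seed)
    words = [prime_word]
    for _ in range(length - 1):
        options = successors.get(words[-1])
        if not options:  # dead end: no known successor
            break
        words.append(random.choice(options))
    return " ".join(words)

# "android" as the prime word always yields "android phones are ..."
print(generate("android"))
```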

Benedict Evans’ tweets

There are several free English datasets available online, such as Reddit comments, Shakespeare's works or the tweets of Donald Trump.

For this experiment I chose all the available tweets of Benedict Evans, for three reasons:

  • He just tweets great, smart stuff. Huge fan.
  • A relatively big dataset: 120,000 tweets. (Elon Musk has 3,500 tweets.)
  • His tweets are mostly about tech topics. I expect better results from this narrow focus, since I get more occurrences of tech topics (Android, AI, self-driving cars, …). In comparison, if I analysed Barack Obama's tweets, the range of topics would be much larger, making sentence prediction harder.

Access the data

Twitter has an API that lets you fetch a user's tweets relatively easily (you just need to sign up for an API key). However, it only returns a user's most recent 3,200 tweets/retweets/replies. This is far too few for this experiment.

I looked for other options and ended up using this great web scraper: https://github.com/bpb27/twitter_scraping

It scrapes tweets day by day, so depending on the defined time range the scraping process can take a long time. Benedict Evans started tweeting in 2007, so I had to go through ten years day by day, which took around 8 hours.

Scraping tweets day by day
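The eight hours are easy to believe once you count the requests involved. A small sketch of the day-by-day date loop (the repo's actual scripts differ; this just counts how many single-day queries a 2007–2017 range implies):

```python
from datetime import date

# Hypothetical date range covering Benedict Evans' Twitter history.
start = date(2007, 1, 1)
end = date(2017, 10, 1)

# One scrape per day over the whole range.
days = (end - start).days
print(days)  # → 3926 single-day queries, i.e. nearly 4,000 requests
```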

Results

I will set up the RNN in my next post, test some different settings and parameters, and hopefully show some interesting outputs. Stay tuned!


Fabian Kutschera

Product Manager in Berlin. On Medium for Product Management, Machine Learning, Strategy and History. Creator of Tiny Tasks (mytinytasks.com)