Writing Travel Blogs with Deep Learning
We have all been excited by the recent developments in deep neural networks. Among the different applications of deep learning, Natural Language Processing (NLP) has attracted a great deal of interest. It is impressive to see a machine learning model generate text that convincingly resembles Shakespeare, Wikipedia, Harry Potter, Obama speeches, Star Wars episodes, and even code.
Is it possible to automate travel blogging with artificial intelligence? It would be great if your AI assistant helped you keep memories of your travels by automatically writing down stories about the places you visited. Deep learning is a natural fit for this task because plenty of data is available online.
Several well-documented tools exist for training a deep neural network on text. Retrieving appropriate data, however, is often the most challenging part. Here I therefore discuss how to crawl the web using Python. Since Python is commonly used by data scientists, it is convenient to use it for crawling too. In particular, you might want to use Scrapy, an open-source framework for extracting data from the web. Here is a very useful blog post about how to install Scrapy and get started with your first crawler.
There are many entertaining and very well written travel blogs you can find online. Here you can find the list of the 50 most famous travel blogs:
All these blogs are good candidates to collect text data that can be used to train a neural network.
Collecting the Data
Most of these travel blogs are built with WordPress and share the same HTML structure. For example, take a look at Nomadic Matt’s blog: if you perform an empty search, you get the list of all the available blog articles.
It is enough to have the crawler go through all the result pages and follow all the links on them to collect every article. Moreover, once you open an article, its content can be found in the HTML element with the CSS class “entry-content”. Note that this holds for many WordPress blogs, since this class is used by many WordPress themes.
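To make the extraction step concrete, here is a minimal sketch that pulls the text out of an “entry-content” element. It uses only the Python standard library so that it runs anywhere; it is not the crawler from my GitHub, and the names EntryContentParser and extract_entry_content are hypothetical. In a Scrapy spider, a CSS selector such as response.css("div.entry-content") would play the same role.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not change the nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# After these tags close we add a space so that paragraphs do not run together.
BLOCK_TAGS = {"p", "div", "li", "h1", "h2", "h3", "blockquote"}


class EntryContentParser(HTMLParser):
    """Collects the text inside the element whose class list contains 'entry-content'."""

    def __init__(self, target_class="entry-content"):
        super().__init__()
        self.target_class = target_class
        self.depth = 0      # greater than 0 while we are inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif self.target_class in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self.depth == 0:
            return
        if tag in BLOCK_TAGS:
            self.chunks.append(" ")
        self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_entry_content(html):
    """Returns the article text with whitespace normalized to single spaces."""
    parser = EntryContentParser()
    parser.feed(html)
    return " ".join("".join(parser.chunks).split())
```

The sketch assumes reasonably well-formed HTML (every non-void tag is closed); a real crawler would rely on a proper selector library instead of counting tags by hand.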
I wrote a simple crawler in Scrapy that can be found on my GitHub. This crawler also takes care of removing strange symbols from the retrieved text.
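As an illustration of what removing strange symbols can look like, here is a minimal sketch. The exact rules in my crawler may differ: the function name clean_text, the replacement table, and the choice to keep only plain ASCII are assumptions made for this example.

```python
import re
import unicodedata

# Typographic characters mapped to plain ASCII equivalents (an illustrative choice).
REPLACEMENTS = {
    "\u2018": "'", "\u2019": "'",   # curly single quotes
    "\u201c": '"', "\u201d": '"',   # curly double quotes
    "\u2013": "-", "\u2014": "-",   # en and em dashes
    "\u2026": "...",                # ellipsis
    "\u00a0": " ",                  # non-breaking space
}


def clean_text(text):
    """Maps common typographic symbols to ASCII and drops everything else non-ASCII."""
    for src, dst in REPLACEMENTS.items():
        text = text.replace(src, dst)
    # Decompose accented letters (e.g. an accented e becomes e plus a combining mark),
    # then drop whatever cannot be encoded as ASCII.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()
```

Keeping the character set small matters for a character-level model, because every distinct symbol becomes one more output class the network has to learn.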
Training a Neural Network
Once you have collected enough data, you can train a character-level Recurrent Neural Network (RNN). I used Microsoft CNTK to train the char-level RNN. Microsoft made available a script that trains a neural char-level language model to predict the next character after a sequence of characters. I trained a 2-layer LSTM with 256 hidden nodes per layer using the data collected with my crawler. At the beginning, the text generated by the network made little sense:
Training 910177 parameters in 10 parameter tensors.
Minibatch[2901-3000]: loss = 2.323103 * 10000, metric = 64.5% * 10000
& the the and and and and and and and and and and and and and and and and and and and and and and and
Minibatch[21901-22000]: loss = 1.455286 * 10000, metric = 43.2% * 10000
^ and the world to the world to the world to the world to the world to the world to the world to the
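The parameter count reported in the log can be reproduced with a little arithmetic. The sketch below assumes the layout of CNTK's char-level example as I understand it: each layer is a one-scalar stabilizer followed by an LSTM (input weights, recurrent weights, bias), and a dense layer maps the last hidden state to the characters. The vocabulary size of 95 is not stated anywhere in the log; it is the value that makes the numbers match, and it happens to equal the number of printable ASCII characters.

```python
def lstm_params(input_dim, hidden_dim):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * hidden_dim * (input_dim + hidden_dim + 1)


def char_rnn_params(vocab_size, hidden_dim=256, num_layers=2):
    total = 0
    input_dim = vocab_size              # one-hot characters feed the first layer
    for _ in range(num_layers):
        total += 1                      # assumed: one scalar stabilizer per layer
        total += lstm_params(input_dim, hidden_dim)
        input_dim = hidden_dim          # the next layer reads the hidden state
    total += hidden_dim * vocab_size + vocab_size   # dense output layer (weights + bias)
    return total


# Assumed vocabulary size; it is inferred from the log, not given in it.
print(char_rnn_params(vocab_size=95))
```

Under these assumptions the model has 2 stabilizers, 6 LSTM tensors and 2 dense tensors, which is consistent with the 10 parameter tensors in the log.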
After training the network overnight, the results were much better:
Minibatch[2851901-2852000]: loss = 0.911839 * 10000, metric = 28.2% * 10000
month the beautiful beaches of the city and the smaller towns and tourism and the surrounding station
Minibatch[2858901-2859000]: loss = 0.922571 * 10000, metric = 27.8% * 10000
fired islands. There’s a lot of travel writers and travelers that come in the world and the sun set o
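The log lines above are easier to interpret once converted into more familiar quantities. The sketch below assumes, as in CNTK's example script, that the loss is the average cross-entropy per character measured in nats and that the metric is the classification error on the next character; the helper name summarize is hypothetical.

```python
import math
import re

# Matches the "loss = ..., metric = ...%" part of a progress line.
LOG_LINE = re.compile(r"loss = ([\d.]+) \* \d+, metric = ([\d.]+)% \* \d+")


def summarize(line):
    """Turns one progress line into per-character perplexity and accuracy."""
    match = LOG_LINE.search(line)
    if match is None:
        return None
    loss = float(match.group(1))
    error = float(match.group(2)) / 100.0
    return {
        "perplexity": math.exp(loss),   # assumes the loss is in nats
        "accuracy": 1.0 - error,        # assumes the metric is an error rate
    }


first = summarize("Minibatch[2901-3000]: loss = 2.323103 * 10000, metric = 64.5% * 10000")
last = summarize("Minibatch[2858901-2859000]: loss = 0.922571 * 10000, metric = 27.8% * 10000")
print(first, last)
```

Under these assumptions, the network goes from a perplexity of about 10.2 with 35.5% of next characters predicted correctly to a perplexity of about 2.5 with 72.2% predicted correctly.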
RNN that Completes Sentences for a Blog
Finally, I wanted to see whether my network was able to complete sentences that I started. Asking your network to write a travel blog for you from scratch might be a bit too much. On the other hand, it might be possible to get some help from the network when writing your blog. In this scenario, I could imagine myself writing a sentence and having the network complete it by adding further details. For example, I can imagine myself writing: “I am travelling to ”. And these are possible outputs from the network:
“I am travelling to ”
I am travelling to Europe (finding the companions while I was there) on me. You buy reading a night’s
I am travelling to Cornwall: put I follow you to do get the Olymbookality Starbucks pretty packed alo
I am travelling to move on the god dorms time you wan me get this mysterio. Everyone over sizes that
Or I could write something like “The hostel was ” and the network would say:
“The hostel was ”
The hostel was cool. So day in the world, move cards, and come back to why time while I look at the m
The hostel was on the Anded tourist attractions through Europe. 3. The street to the Musen Gudding is
The hostel was probably not even darked to Travel the Bath. I was celebrated so often, a high day, an
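The completion procedure itself is simple: start from the prompt, ask the model for a probability distribution over the next character, sample from it, append the result, and repeat. The sketch below shows that loop. To stay self-contained it does not use the trained LSTM; a character bigram model stands in for the network, and all function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def fit_bigram_model(corpus):
    """Stand-in for the trained network: counts which character follows which."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        counts[prev][nxt] += 1
    return counts


def next_char_distribution(model, context):
    """Returns (characters, weights) for the character that follows `context`.

    A real RNN conditions on the whole context through its hidden state;
    this stand-in only looks at the last character.
    """
    followers = model.get(context[-1])
    if not followers:
        # Unseen character: fall back to a uniform choice over known characters.
        chars = sorted(model)
        return chars, [1] * len(chars)
    chars = sorted(followers)
    return chars, [followers[c] for c in chars]


def complete(model, prompt, length=100, seed=None):
    """Extends a non-empty prompt by sampling one character at a time."""
    rng = random.Random(seed)
    text = prompt
    for _ in range(length):
        chars, weights = next_char_distribution(model, text)
        text += rng.choices(chars, weights=weights, k=1)[0]
    return text


corpus = "the hostel was cool. the hostel was clean. the hostel was cheap. "
model = fit_bigram_model(corpus)
print(complete(model, "The hostel was ", length=40, seed=0))
```

Because each character is sampled rather than chosen greedily, running the loop several times with different seeds yields different completions for the same prompt, as in the examples above.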
We can see that the results are not perfect, but not too bad either. It is cool to see that these sentences were not in my original data set and thus the network is really generating new ones. For example, “I am travelling to Europe” and “I am travelling to Cornwall” are not present in the original data set. It is also funny to see how the RNN invents new places that do not exist, e.g. “Olymbookality Starbucks”. A character-level model can assemble arbitrary strings of letters, which is why a word-level RNN, restricted to words it has seen, might be a better option for this task. Overall, it seems that the network could be greatly improved if trained for a longer time and with more data.
Having a blog with meaningful posts written by a neural network from scratch might still be too challenging. Maybe it is possible to make better use of a neural network if it is used to complete specific sentences like “The hostel was bad because ”. In its current state, this network is not able to produce meaningful and creative sentences as a human could. Nonetheless, there might be some applications where a neural network can help extend already written sentences with pertinent content.
You can find the code used in this blog post here. Note that I used Nomadic Matt’s blog as an example because it is particularly well written. My blog post aims to show that texts from blogs can also be generated with machine learning. However, if you really want to use such a network in a system, you might have to get permission from the original author of the blog.
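Both observations can be checked mechanically: whether a generated phrase already occurs in the training text, and which generated words never appear in it at all. Here is a minimal sketch, with hypothetical helper names and a tiny made-up corpus standing in for the crawled data.

```python
import re


def is_novel(phrase, corpus):
    """True if the phrase does not occur verbatim (ignoring case) in the corpus."""
    return phrase.lower() not in corpus.lower()


def unseen_words(generated, corpus):
    """Lists generated words that never appear in the corpus (ignoring case)."""
    vocabulary = set(re.findall(r"[a-z']+", corpus.lower()))
    return [word for word in re.findall(r"[A-Za-z']+", generated)
            if word.lower() not in vocabulary]


# Made-up stand-in for the crawled blog text.
corpus = "I am travelling to Asia. We saw the Starbucks in Europe."
print(is_novel("I am travelling to Europe", corpus))
print(unseen_words("the Olymbookality Starbucks", corpus))
```

A word-level model can only emit words from its vocabulary, so for such a model the second function would return an empty list by construction.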