Behind the scenes of Word2Vec on Harry Potter

Zareen Farooqui
Becoming a Data Analyst
13 min read · Jul 22, 2016

While building the Harry Potter Word2Vec web application, I tried to hide all of the technical complexity from the end user. I wanted to make the user’s experience as simple and streamlined as possible. In this post, I’ll explain how the web app works behind the scenes and my design choices as I built it.

If you have not read the previous “Word2Vec on Harry Potter” blog post introducing the website and the w2v algorithm, please do that now.

www.zareenfarooqui.com/w2v

Here is my code.

Goals for this project:

  1. Use a remote server (EC2) in the cloud to host my web application
  2. Write a web application (back-end) that showcases a machine learning algorithm
  3. Build a website (front-end) for the public to interact with the ML algorithm

Architecture

w2v application files

This diagram shows all the files I created to run my w2v app. Here’s a brief description; I explain the entire setup in more detail later in this post.

Front-end files:

  • HTML start page (3 KB) — code for the design of www.zareenfarooqui.com/w2v
  • Apache server log files (varies) — raw access log files from cPanel

MacBook files:

  • .pem (2 KB) — authentication file to securely login to my EC2

Content Delivery Network files:

  • Bootstrap CSS (121 KB) — Twitter Bootstrap CSS file
  • Bootstrap JS (37 KB) — Twitter Bootstrap JS file

AWS files:

  • 7 HP books (6.6 MB) — text corpus
  • Bottle code (4 KB) — code for the web framework
  • gensim model (3 KB) — code to create the w2v model
  • keyerror.tpl (3 KB) — error template displayed for words not in the corpus
  • results.tpl (4 KB) — success template with the w2v results of the 7 most similar words
  • SQLite3 db files (varies) — user_words and saved_words tables
  • Bottle server log file (varies) — auto-generated log files from the Bottle framework
  • w2v app log file (varies) — custom logging

There are two major components in the w2v application architecture:

  • Front-end: what lives in the user’s browser, including the web site design, HTML, CSS, JavaScript and the Bootstrap library
  • Back-end: everything on the EC2 server, including the python code, Bottle webserver, gensim library, and SQLite3 database

front-end and back-end of my app

I launched a t2.micro EC2 server using the AWS Management Console to host my web application. This instance type comes with 1 vCPU and 1 GB of RAM and costs $0.013 per hour to run in the Oregon datacenter (roughly $10 per month). I chose the Red Hat 7.2 Amazon Machine Image (AMI) because Red Hat is one of the most popular Linux distributions that organizations run in production.

I host my application on a virtual machine instead of my laptop so my website can run 24/7 (my laptop doesn’t have to stay on) and so it’s easy to scale up or down. Instead of buying more laptops as more people use my application, I can pay for a larger EC2 instance, or simply more of them, at a fraction of the cost. If my traffic were to drop, I would just stop running (and paying for) the unnecessary servers, without any bulky hardware to maintain.

Before I could start writing the code for the web app’s front and back ends, I had to prepare the EC2 server environment.

Preparing the EC2 Server

Once the EC2 server launched, I had to set up my ideal working environment. I downloaded the .pem security key from the AWS Console and used that to SSH into the server using Terminal on OS X:

Logging into EC2 Server

Since I was logging into my EC2 server regularly, I created an alias in my .bash_profile file for this command so I would not have to type the entire command each time.

Next, I installed and configured the necessary tools and libraries:

  • Turned off the Red Hat 7 firewall on the EC2 server
  • Opened necessary ports in Amazon security group
  • Installed PIP so I could easily install additional Python packages
  • Installed an FTP server (VSFTPD) so I could move files back and forth between my laptop and the EC2 server; I use FileZilla on my local machine
  • Installed Jupyter Notebook server (ver 4.1.0) to use the notebook interface in my browser to run Python code on the server
  • Installed open-source gensim library (ver 0.12.4) to build and evaluate w2v models
  • Installed Bottle as my micro web-framework for Python
  • Installed SQLite3 as my relational database engine

This phase of the project involved a lot of Googling, trial and error, installation, debugging and wine.

Front-end Web Development

User Interface

The front-end (or client-side) of a website is the part that users see and interact with. It consists of three fundamental components: HTML for structure and text formatting, CSS for design, and JavaScript for animations and programmability. These can be used to create things like fonts, forms, buttons, images and progress bars. I initially focused on learning the fundamentals by using these tutorials:

However, in modern web applications, developers use a framework or higher-level library so they don’t have to replicate common interface features. Twitter’s open-source Bootstrap framework is perhaps the most popular front-end framework. It contains HTML- and CSS-based design templates for typography, forms, buttons, tables, navigation and other interface components. By leveraging these ready-made CSS templates from Twitter, I was able to build a professional website quickly and easily. Bootstrap also lets me create a responsive website, meaning the site works seamlessly on devices of different screen sizes — laptop, smartphone, tablet, etc. To learn Bootstrap, I completed the tutorial below:

My start page file is only 3 KB. However, the Bootstrap CSS file is 121 KB and the JavaScript file is 37 KB. If my EC2 server had to serve these two extra Bootstrap files to every user, it would put a lot of extra load on the server. To offload this burden, I use a content delivery network (CDN), which distributes the Bootstrap files across many servers globally. When users come to my site, they download the two Bootstrap files from the closest physical CDN server. This is a win-win for me as a developer and for my users, who get the Bootstrap files faster.

Bootstrap CDN server locations. source: https://www.maxcdn.com/

There are two issues with using CDNs — they can track the IP addresses of whoever downloads the files, and I can’t customize the CSS and JS files. Since I’m not building a security-critical app, neither was a major concern.

Lines 9 and 10 below load the Bootstrap files from the CDN to the user’s browser:

HTML code to access CDN files

Bootstrap handles the formatting and design of a website, but I needed JavaScript code to build a robust site. JavaScript is the programming language of the web and runs in a browser. I wrote JS functions to validate that the user’s input meets the program’s criteria. First, did the user actually type anything into the text box? This is a required field, so the form cannot be submitted until this condition is satisfied.

My input accepts only one word, so I wrote a regular expression (regex) to check whether more than one word was entered. If so, the user is prompted to enter only a single word.

Next, there is a second regex which trims whitespace before and after the user’s word. In programming, “brave” is not the same as “ brave ”.
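The site performs these checks in JavaScript in the browser, but the same logic is easy to sketch in Python (the function name and error messages here are my own, for illustration only):

```python
import re

def validate_input(raw):
    """Mirror of the front-end checks: trim surrounding whitespace,
    require a non-empty, single-word submission.
    Returns (word, None) on success or (None, error_message) on failure."""
    word = raw.strip()                 # so " brave " is treated as "brave"
    if not word:
        return None, "Please enter a word."
    if re.search(r"\s", word):         # internal whitespace => more than one word
        return None, "Please enter only a single word."
    return word, None
```

For example, `validate_input("  brave ")` returns `("brave", None)`, while a two-word submission comes back with an error message.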

Once the form has been submitted by the user, a progress bar immediately loads on the current page.

progress bar loading

This is mainly a visual ploy to keep users engaged while the program runs in the background. I didn’t programmatically measure the exact time gensim takes to run a word through my model, but I know it’s roughly a few seconds. I programmed the progress bar to (hopefully) never reach 100%. Instead, the results should be rendered before the user expects them. In 2006, Google released a study claiming a half-second delay in latency caused a 20% decline in traffic and revenue. This is major — users prefer speed over additional functionality. I knew it would not be acceptable to expect users to wait a few seconds for each word they try, and this realization led me to a major back-end design decision down the line (see the database section below).

I uploaded the HTML, CSS, and JS code for the start page of my app via the Namecheap cPanel. This made www.zareenfarooqui.com/w2v an active site. However, any form submission at this point would return an error since the back-end code was not deployed yet.

Back-end Stack

The back-end (or server-side) of a website is where the code running a web application lives. The back-end handles any calculations, database manipulations, and functionality needed for the application to work as intended. This code runs on a server, which is simply a powerful computer that hosts programs.

Web Framework

All the back-end code is written in Python using the Bottle micro web-framework. I chose Bottle over Flask or Django because it is designed to be simple and has no dependencies other than the Python standard library. I like Bottle’s single-file approach because it’s much easier than maintaining many files. Bottle lets me query a SQL database and then send the end user those SQL results in an output template. If a user submits a word which is not in the HP series, they get served an error template that lets them enter another word. Bottle uses routing functions to serve webpages to end users. Here is the basic “Hello World” example I started with:

If you have Bottle installed, you can run this script and visit http://localhost:8080/hello to see a webpage that returns “Hello World!”.

I used the Bottle tutorials below:

I use the Bottle daemon package to auto-run my script 24/7 in the background.

Database Schema

SQLite3 is my database engine. I have two tables: user_words and saved_words.

user_words table in SQLite3

My Bottle application captures the word from the front-end and then appends it to user_words table, along with the timestamp (GMT time zone) and user IP address.
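That append is a single parameterized INSERT with Python’s built-in sqlite3 module; a sketch of what it could look like (the helper name and column names here are my own, matching the description above):

```python
import sqlite3
from datetime import datetime, timezone

def log_user_word(db_path, word, ip):
    """Append a submission to user_words with a GMT (UTC) timestamp."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS user_words (word TEXT, ts TEXT, ip TEXT)"
    )
    conn.execute(
        "INSERT INTO user_words VALUES (?, ?, ?)",
        (word, datetime.now(timezone.utc).isoformat(), ip),
    )
    conn.commit()
    conn.close()
```

Using `?` placeholders rather than string formatting also protects the table from SQL injection via the submitted word.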

Remember how I explained earlier that the w2v model takes a few seconds per word and that this wasn’t a good design? Initially, I only had the user_words table and ran the model for each submitted word. The model takes about 3–4 seconds per run. In the development phase of this project, I was the only user, so there was no strain on my server. In production, there may be many concurrent users, and I was worried this might crash my site or cause slowdowns, so I had to come up with another technique. One option was to run the model for every word in the corpus up front and save the results in a table, but this would have taken over 14 hours and wasn’t an efficient approach.

Instead, I created a new table which automatically optimizes for the most popular words:

saved_words table in SQLite3

The saved_words table has two columns: word and data. Initially, this table is completely empty. When a user submits a word, I check whether that exact word exists in the word column of the saved_words table. If not, my w2v model runs on the submitted word and a new row is appended to the table: the word column gets the submitted word as a string, and the data column gets the pickled results of the 7 most similar words from the w2v model. If the submitted word already exists, I unpickle the results and serve them directly from the database. Now, when another user submits that same word, the application responds significantly faster since it is not running the model.
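The caching logic fits in one small function; a sketch (function and parameter names are mine, with the expensive gensim lookup passed in as a plain callable):

```python
import pickle
import sqlite3

def similar_words(conn, word, model_lookup):
    """Return w2v results for `word` from saved_words, running the
    model (via `model_lookup`) only on a cache miss."""
    row = conn.execute(
        "SELECT data FROM saved_words WHERE word = ?", (word,)
    ).fetchone()
    if row is not None:                    # cache hit: unpickle and serve
        return pickle.loads(row[0])
    results = model_lookup(word)           # cache miss: run the model once
    conn.execute(
        "INSERT INTO saved_words (word, data) VALUES (?, ?)",
        (word, pickle.dumps(results)),
    )
    conn.commit()
    return results
```

Every word pays the 3–4 second model cost at most once; all later submissions of that word are a single indexed SELECT.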

I used the SQLite3 tutorials below:

Gensim Implementation of w2v

To use w2v, I re-purposed my HP Text Analysis code to extract a bag of sentences, rather than a bag of words, for each book. I fed these bags of sentences from the 7 books into the gensim implementation of word2vec. I followed this tutorial to build my model:
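gensim’s word2vec expects its input as a list of sentences, each itself a list of word tokens. A much-simplified sketch of that preprocessing (my actual cleanup of the book text was more involved than this):

```python
import re

def to_sentences(text):
    """Split raw text into a 'bag of sentences': a list of lowercase
    token lists, one per sentence (simplified illustration)."""
    sentences = []
    for raw in re.split(r"[.!?]+", text):      # naive sentence boundaries
        tokens = re.findall(r"[a-z']+", raw.lower())
        if tokens:
            sentences.append(tokens)
    return sentences
```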

Here are the parameters which affect the quality and runtime of gensim models:

  • Architecture: there are two algorithms to choose from — continuous bag of words (CBOW) or continuous skip-gram (the default). CBOW is the faster of the two and predicts a target word given its neighboring words; continuous skip-gram predicts the neighboring words given the target word and typically works better for less frequent words. I use the default skip-gram option.
  • Training algorithm: hierarchical softmax (default) or negative sampling. I use the default.
  • Word vector dimensionality: the number of features in each word vector. Generally, more features mean better models but longer runtimes. I use 10,000 features.
  • Minimum word count: if a word does not occur in the text at least this many times, it is ignored, so only meaningful words are included in the vocabulary. I set this to 30 (if you type an infrequent word into my website, it may tell you the word cannot be found in the series because it didn’t occur at least 30 times).
  • Worker threads: the number of parallel processes to run. 4 is considered a standard value, so that’s what I use (install Cython to run more than one worker).
  • Context / window size: the number of words before and after a given word the algorithm should take into account. I use 10.
  • Downsampling of frequent words: recommended values are between .00001 and .001. I use .001.
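Collected as keyword arguments, the choices above map onto gensim’s Word2Vec parameters roughly like this (keyword names per the gensim 0.12.x API — `sg=1` selects skip-gram, `hs=1` hierarchical softmax; in gensim 4.x `size` became `vector_size`):

```python
# Parameter choices from the list above, as a Word2Vec kwargs dict.
w2v_params = dict(
    sg=1,           # architecture: skip-gram (the default)
    hs=1,           # training algorithm: hierarchical softmax (the default)
    size=10000,     # word vector dimensionality
    min_count=30,   # ignore words occurring fewer than 30 times
    workers=4,      # worker threads
    window=10,      # context window: 10 words each side
    sample=0.001,   # downsampling threshold for frequent words
)

# model = gensim.models.Word2Vec(sentences, **w2v_params)
```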

Here’s documentation which explains these parameters further. I ran a couple of models with different parameters before choosing one for production. Each model took about 15 minutes to build and then 3–4 seconds to run each submitted word.

Logging

At this point, I had print statements all over my code for debugging purposes, which was really messy. I replaced them with logging statements, which write messages about my application’s activity to a log file on the server. If my program crashes, I can go into my w2vApp.log file and see what was happening immediately prior to the crash to investigate it.

logging code
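The setup amounts to a few lines with Python’s standard logging module; roughly (the log file name is from the post, the format string is my own choice):

```python
import logging

# Dedicated logger that writes to w2vApp.log on the server.
logger = logging.getLogger("w2vApp")
handler = logging.FileHandler("w2vApp.log")
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("model loaded")  # replaces a bare print() statement
```

Unlike print statements, each entry carries a timestamp and severity level, which is exactly what you want when reconstructing a crash.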

I used the links below to learn about logging and how to incorporate it into my application:

My Namecheap cPanel also has raw access log files generated by the Apache web server. These are records of users’ IP addresses, date and time of access, bytes transferred, requested files, browser, and operating system.

cPanel log file

When I opened up these log files, I noticed a few Google and Baidu bots. Google and Baidu regularly crawl the entire web with these bots and use the data to build their search engines. If I didn’t want my website to be crawled and indexed, I could add a robots.txt file asking them to ignore my site, but that’s unnecessary for my site.
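For reference, a robots.txt that asks all well-behaved crawlers to skip an entire site is just two lines:

```
User-agent: *
Disallow: /
```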

I changed my settings to keep my monthly log files indefinitely. This will be useful for determining who is coming to my site.

cPanel log files

What’s Next?

This project is not a typical data analyst project, but I was curious to learn the fundamentals of full-stack development. Now, I’ll dive back into analysis with Python, specifically using Pandas, NumPy and matplotlib.

I’ve learned that no programming project will be perfect — there are always additional features I can add, code refactoring I can do, and speed or security improvements to enhance the user experience. If I aimed to make my projects perfect, I would never finish them. Instead, I first build a minimum viable product which works, but doesn’t look clean. Then I add more features until I’m satisfied. Here are some ideas I may implement when I revisit this project:

  1. Add more w2v functionality — word analogies, finding the opposite of a word, word clusters, finding the word that doesn’t belong in a group
  2. Color the words in the results table as a visual indicator of cosine similarity — maybe green if the cosine similarity is > .9, yellow if it’s between .75 and .9, and red if it’s < .75
  3. Run analytics on the user_words table — determine which words are submitted most and recommend these as suggested words to explore from the start page, figure out what days/times my site is most popular
  4. Use Pandas in Jupyter Notebook to analyze the Namecheap and Bottle log files — find out where my audience is located, see how many unique IP addresses visited my site in a given period, and see how many words people enter into the app before getting bored and leaving
  5. Use a larger text corpus like Google Books or Wikipedia to create a generalized English language w2v site
  6. Create a site where users first upload their own text corpus, can run w2v on it and then explore the results

If you have any other features you’d like to see, let me know below in the comments.

Interested in having me build something similar for you? Contact me on LinkedIn.
