How I Taught an AI to Ask Web Development Questions — Part 1

Gabriel Grinberg
Mar 27, 2018

This post series covers the journey of creating bobthewebdev.xyz, an AI that knows how to ask nonsense web development questions. I wrote this post as I progressed, over a period of 5 months. I’ll cover the whole process: planning, the technical challenges I bumped into, and more. I hope you enjoy it!

Two things had been on my to-do list for quite a while:

1. Writing a blog post

2. Creating something cool with recurrent neural networks (RNNs)

The first had been on my list for a long time, but I never found an idea I was comfortable writing about. Creating something with RNNs hasn’t been a goal for quite as long, but ever since I found out about them, I’ve been eager to use them to create something interesting.

Recurrent Neural What?

Disclaimer: I’m far from being an expert in machine learning, but as this post relies heavily on RNNs, I feel that a short (and super simplistic!) explanation is in order. It’s almost impossible to provide a proper introduction in a couple of paragraphs, but I’ll try my best! If you’re already familiar with them, feel free to skip to the next section.

Let’s begin with a regular neural network (NN). An (artificial) neural network is a computing system inspired by the biological brain. It excels at tasks like pattern recognition and classification, and generally at tasks that are hard to solve in a purely algorithmic manner. For example, a NN can be fed a large dataset of real estate listings, with their various parameters (e.g. number of rooms, neighborhood) and prices, and be trained to predict the prices of new properties. Just like an experienced real estate agent can’t exactly explain how their assessments are made, it’s hard to explain how the NN arrives at its predictions. Some call it intuition, but I believe both are the result of tons of data and heavy weight adjustments.

NNs are great for problems that behave like a pure function — for a given input, a specific output is expected. However, for problems like keyboard auto-suggest, or the creation of new data, a pure function is not enough, and some state is needed. Here is where the Recurrent in RNN kicks in — this time, when trained, the model holds previous state. It can use its previous inputs to improve its output, and learn patterns specific to the input sequence it receives. This allows for applications ranging from personalized auto-suggestions to the creation of new data, as I’ll be covering here. If you want to dig deeper, I can’t recommend this post enough as a starting point.
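
To make the statefulness idea a bit more concrete, here is a toy JavaScript sketch (no actual neural network involved, and all the names are made up): a stateless suggester that only looks at the current character, next to one that carries state between calls, which is very roughly the role the hidden state plays in an RNN.

// Toy illustration only; there is no real network here.
// Stateless: the same input always produces the same suggestion.
const bigrams = { h: 'e', e: 'l', l: 'l', o: '!' };
const suggestNext = (char) => bigrams[char] || '?';

// Stateful: it remembers what it has seen so far, which is (very loosely)
// what the hidden state of an RNN provides.
function makeStatefulSuggester() {
  let history = ''; // stands in for the hidden state
  return (char) => {
    history += char;
    return history.endsWith('hell') ? 'o' : suggestNext(char);
  };
}

const suggest = makeStatefulSuggester();
'hell'.split('').forEach((c) => console.log(c, '->', suggest(c)));
// h -> e, e -> l, l -> l, l -> o (the last suggestion depends on the whole history)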

First Steps

Almost 2 years ago, my wife and I were enjoying our honeymoon in Thailand. We were a week into the trip, in our hotel room in Koh Phangan after a long, relaxing day at the beach.

It was getting late, my wife had fallen asleep, and it was raining outside. I decided it was a good chance to get some updates from the outside world. After browsing stories, I got into this great blog post series — Machine Learning is Fun! by Adam Geitgey. Adam’s explanations were so easy to understand that I decided to try and follow his second article, in which he uses RNNs to generate Super Mario levels.

I took out my laptop and followed his examples. Soon enough, I was ready to try it with my own dataset. I decided to generate poems in the style of an Israeli poet my wife adores. It turned into a disappointment, as I had only a small amount of data — around 100 poems. I remembered reading that a larger dataset can beat a stronger algorithm, so I switched to a more popular songwriter, and it worked. It was far from perfect, but I had a few laughs from it and learned a couple of important lessons:

1) I had underestimated the work needed to prepare the data. Around 80% of my time went into writing the scraping process, storing the data and processing it.

2) You need data. A lot of it! My first experiment used only hundreds of items, which is far from ideal.

This meant that to make something interesting, I had to find a large dataset I could access. I don’t remember exactly how the idea came to me, but I decided that StackOverflow was a great candidate. Armed with around 10M questions, I was sure interesting and funny patterns would emerge.

Plan:

1. Data Preparation:

Deciding how to prepare and model the data can have a great impact on the end result. I’ll need to analyze the SO API and decide which information (tags? the answers? author info?) I will use, and how many questions I should fetch in total. After this step I will have a process that retrieves, normalizes and stores the data so that it’s ready to be fed to a model.

2. Training

In this step, I’ll train a model so it can generate new questions based on the data gathered. This will involve improving my knowledge of RNNs. I’ll even consider writing my own implementation to fully comprehend them. Then I’ll probably choose one of the available open source projects to train the model. Another challenge here is the actual training — will it be done on my computer or externally, via cloud computing?

3. App

This should be the most trivial part, as building web apps has been my primary occupation for the last 5 years. Still, I suspect that bridging the RNN of choice with a classic web server might turn out to be a bit tricky, and making it resilient enough to handle multiple requests could also pose a challenge.

Data Preparation

First place to start — read the SO API docs!

I found out that I can use the questions API to get a paginated list of questions, sorted by votes. After applying some filters, the query looks like this. From a quick look, getting the body of the accepted answer would require an additional API call per question. That would complicate things, so I decided to stick to questions only for the moment.
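
To give an idea of what that call looks like, here is a rough sketch using Node’s built-in fetch. The exact filter string isn’t reproduced here, so the parameters below are illustrative; “withbody” is a built-in Stack Exchange filter that includes the question body in the response.

// Rough sketch of the paginated questions call (Node 18+, global fetch).
// The filter and tag values are illustrative, not the exact query used.
const API = 'https://api.stackexchange.com/2.2/questions';

async function fetchQuestionsPage(page) {
  const params = new URLSearchParams({
    site: 'stackoverflow',
    sort: 'votes',
    order: 'desc',
    pagesize: '100',      // maximum page size allowed by the API
    page: String(page),
    filter: 'withbody',   // built-in filter that adds the question body
    tagged: 'javascript', // one run per tag
  });
  const res = await fetch(`${API}?${params}`);
  const { items, has_more, quota_remaining } = await res.json();
  return { items, hasMore: has_more, quota: quota_remaining };
}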

The next step was to write a small Node.js app that accumulates and stores the fetched data. I fetch the maximum page size possible (100) and save each question into a DB. I had recently tried out RethinkDB and loved it — it has all the features I need, a great admin UI and easy-to-follow documentation.
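
Roughly, the accumulation loop looks like this. It is only a sketch: it reuses the fetchQuestionsPage helper from above, and assumes a local RethinkDB with a questions table already created (database and table names are made up for the example).

// Sketch only: assumes the fetchQuestionsPage helper from the previous snippet
// and a local RethinkDB instance with a 'questions' table.
const r = require('rethinkdb');

async function collectQuestions() {
  const conn = await r.connect({ host: 'localhost', port: 28015, db: 'so_dump' });
  let page = 1;
  let hasMore = true;
  while (hasMore) {
    const { items, hasMore: more, quota } = await fetchQuestionsPage(page);
    // Use question_id as the primary key so re-runs don't create duplicates.
    const docs = items.map((q) => ({ id: q.question_id, ...q }));
    await r.table('questions').insert(docs, { conflict: 'replace' }).run(conn);
    hasMore = more && quota > 0; // stop when the daily quota runs out
    page += 1;
  }
  await conn.close();
}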

After some time hacking it up, I had a small program that retrieves questions and inserts them into a table in the database. I ran it through the night until my app quota was depleted.

As the input for training the model is just a stream of text, I need to serialize the questions properly, in a way that differentiates between tags, title and content. I decided on the following format:

**Tags**[Question tags]**Title**[Question title]**Body**[Question body]====
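
A sketch of the serializer for that format; tags, title and body follow the field names returned by the questions API.

// Serializes a single question into one chunk of training text.
function serializeQuestion(q) {
  return '**Tags**' + (q.tags || []).join(',') +
         '**Title**' + q.title +
         '**Body**' + q.body +
         '====';
}

// e.g. serializeQuestion({ tags: ['javascript'], title: 'How do I ...?', body: '<p>...</p>' })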

The next step was transforming the DB items into text dumps based on that format. I was a bit surprised by how long it took to create the files (about an hour of processing time, plus optimizing the code and the RethinkDB query), but eventually I got a text file for each tag, a total of 668 MB of data.
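
For the curious, the dump step can look roughly like this; it reuses serializeQuestion from above and assumes a secondary multi index on the tags field (names and paths are illustrative).

// Streams one tag's questions out of the DB and appends them to a text file.
// Assumes an index created with r.table('questions').indexCreate('tags', { multi: true }).
const fs = require('fs');
const r = require('rethinkdb');

async function dumpTag(conn, tag) {
  const out = fs.createWriteStream('dumps/' + tag + '.txt');
  const cursor = await r.table('questions').getAll(tag, { index: 'tags' }).run(conn);
  await cursor.eachAsync((q) => out.write(serializeQuestion(q)));
  out.end();
}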

Armed with more than half a GB of programming-related questions, my next step was to run the training process. This turned out to be the most time-consuming part of the project, with many ups and downs.

Training the Model

I remembered from my last attempt that training is not an easy task for my laptop (a 2016 MBP), and that my built-in GPU can’t be used to accelerate it. The comments in Andrej’s original post also reassured me that I would need a different computer for the training process if I was looking for serious progress. I learned that a g3.4xlarge instance is a good candidate for it, and embarked on a new fun adventure: deploying my first EC2 instance from scratch.

After signing up, I learned that contacting AWS with a special request is needed to deploy a g3 instance, probably because their potential use for crypto mining led to abuse. Once I got authorized, the real fun began. I decided on docker-torch-rnn for the RNN. It’s a dockerized version of torch-rnn, which in turn is an improved version of char-rnn. As running such code involves installing many dependencies (including Lua, Torch and Python libraries), using Docker is a great choice.

While I quickly managed to run it on the sample Shakespeare data using the CPU, the GPU is what I really needed. Making it work required installing Nvidia drivers on the instance and using nvidia-docker.

The next step was to upload the original ~700MB dataset to the machine over FTP, which kept failing. I quickly realized I had underestimated the storage I would need when I chose only 8GB for my instance, so I decided to double it to 16GB. This turned out to be harder than I thought. Finally, I got the training data on disk and ran my first training session, only to discover that it got stuck writing the checkpoint file. Checkpoint files are crucial, as they are used to generate new data and to restore the training process.

I went back and tested the original demo; there was a small delay when writing to disk, but it worked, so I decided to train only on the JavaScript txt dump (50MB). Writing the checkpoint was still slow, but it eventually succeeded. I then sampled from it, and the results started to look promising:

1000 iterations:

***Tagg***
jscails
***Title***
javascraps:
Pleroption `ratur">
`allerets,jaSviectiode="to seameltion, metham7
=======***Tagd
**Clueres
***Title***
aras may>
-avar coping)
fores pall "/id="dow-coxtent `inter &questhiq==== </date.jp/look(that/tomend="in to:'' >daneqita(oon-etengininaghergquesssow(1&ut;1.xtrmade`.
Sefgent I requery ald sontatialpemplest an netsion? Ezing clices wand rror aig parepoliction berer that a ponces to as croumbne

Looks like it’s starting to learn the data format!

After 10000 iterations:

=======***Tags***
javascript,raphture
***Title***
Hour, and using browser-one ell to get to HTML key use jquery?
***Question***
What Any with Performation…,
dapsice scritten / Loaded = page( "ima
root@d356013cecd7:~/torch-rnn# th sample.lua -checkpoint cv/checkpoint_10000.t7 -length 2000
ї 'Err #f) ?>
</b6221.52066/chartations/8$hird_forman" smaps Crom</li>
<a href="Boot" displaye</a>
</chrembng-initalign="vimipeCCH3ff48z 2">Seft all imedd</b>
Thes are tools things because without onclice example this content which, by overrint this assight

Looks like it finally got the formatting right, and is starting to ask better questions!

That was a nice motivation boost, so I added another layer, increased the node count to 256, and left it running for the night.
After around 10 hours of training (265k iterations), it started to look very interesting! Here is an example (uncut dump here):

***Tags***
javascript,angularjs,jasmine
***Title***
Disable : jQuery.prototype.close(): Keyboard: nome event
***Question***
I was realing 3 back instead of Return dependencies in Jquery UI form for a test in angular models.
Best approaches file, so the strings (.originals). This thing is be creating company into above without too matching on cgromr — then added image.
Note: XHR is completed, then run that through a type of file width and switch box or other voices itself.Lets for doing more similar to add a MVC sample: I have run arrays of models nore to leave it contains multiple methods.
*)
What would this helper and plain $http-prico with from documents to the content. To true or how I can do something like:$(document).ready(function() {
heads = new Date().getTime()}; //$("#loadings_to_company_blah");
var cur = new Option("GET");
What's a better way
ofcreating to scroll event?
I tried using a discussion of database.https://github.com/Xr_PKP3iMwNTK6Be0MkdHAxeHImaU/AFlfdk3jaavginpMsayapCyyBNA7A5TaRBAy2VpPoyriiAm I going to work.

Here are some more examples.

It generates not only real tags and almost-proper titles, but also code examples (with proper indentation) and GitHub links.
It’s also becoming a bit philosophical: “Am I going to work.”

This got me really motivated and willing to spend another $15 on a day of training, this time using a 512-node network with 3 layers.

After almost a full day of training, things did not go so well; the disk space ran out again due to the large checkpoint files, and even worse, the results were not that good. I started to believe that training on each language separately might lead to better results. Luckily, I managed to transfer some of the latest checkpoint files to my computer.

As if the bad results were not enough, I probably did something wrong while trying to increase the disk size again, and the machine would not boot. I was demotivated and exhausted from tinkering with things I hadn’t planned to, but I had invested too much in this to quit. The need to set up a new machine was a great chance to revisit and optimize my workflow: I added some bash aliases to SSH into my instances faster, and some scripts to build and run the container faster. Once the machine was running again, I left it training for another night. While it improved a bit, it still wasn’t good enough, so I decided to let it go for another $̵1̵8̵ 12 hours.

While waiting for it to train, I learned that I had been relying on the latest checkpoint being the best one, while in reality an earlier checkpoint might have been better. torch-rnn doesn’t add loss info to the filenames, but luckily, I found this PR that does just that!

After another night of training and almost 1,000,000 iterations, I got disappointing results. Sure, there were some funny things here and there, but the tests I made at the beginning looked much better.
I started to feel that I might be too inexperienced in this field to properly tame the 700MB dataset. To add to the mood, I got my first AWS bill: around $150, way over what I had planned to invest in this phase.

With no “small victories” to celebrate and large costs (both financial and energy-wise), my motivation was at its lowest point. I put the project aside for almost 2 months, before realizing that the solution to my problem had been there the whole time!

I can’t believe I’m adding a cliffhanger in my first post, but..

Stay tuned for the second part of this post! I’ll share how I eventually got to good results, the process of creating the actual web app, and the lessons learned from this project.
