MIT Smart Confessions

Robert M. Vunabandi · Published in mitafricans · Dec 12, 2018

Over the course of the fall 2018 semester, I developed, along with Jürgen Cito, MIT Smart Confessions, a website that uses Machine Learning (ML) both to generate MIT confessions and to predict the reactions that a confession would get. This was done as part of the class 6.s198 (Deep Learning Practicum), instructed by Natalie Lao and Hal Abelson.

In this post, I will talk about what MIT Smart Confessions is, the work involved in creating this web application, and the results and insights I got from working on this project.

If you just want to see the code for this project, head over to this Github for the API or this Github for the website.

What Is an MIT Confession Anyway?

MIT Confessions is a Facebook page that allows MIT students to post “confessions”—short, usually “secret” pieces of text—anonymously. The people that follow this page (usually MIT affiliates) then get to react or respond to the anonymous confessions. Below are a couple of examples of what you’d expect to see on the MIT Confessions page:

Confession Example: People often post witty comments that students additionally react to

Project Goal: Predict and Generate Confessions

Many of the confessions posted are funny only to MIT students because they're inside jokes that only students who go here understand. Many other confessions are just funny, usually in a nerdy way. Sometimes, however, confessions get a lot of reactions without eliciting any sort of emotion or reaction if taken out of the context of the MIT Confessions page; that got me interested.

I wanted to find a way to predict beforehand the number of reactions that a confession would get, and I also wanted to be able to generate confessions that maximize popularity. Finally, through the process of doing the above, I wanted to figure out what makes or breaks a confession in terms of popularity.

Given that the project prompt in 6.s198 was “work on an ML project of your choosing”, I thought this would be the perfect opportunity to learn a lot more about machine learning and work on something I was curious about. So, that is where the idea of creating MIT Smart Confessions started, and Jürgen Cito joined me to work on this project.

Collecting Data From Facebook: A Big Hurdle

Data Collection

As with most ML projects, the first thing I had to do was collect data from the MIT confessions pages. When starting this project, there were three confessions pages that I could fetch data from:

  • MIT Confessions: This is likely the first and original MIT confessions page. This was founded on February 19, 2013, and as of December 2nd, it had 28,527 page likes; 29,275 page follows; and 17,755 posted confessions that are labelled¹.
  • MIT Timely Confessions: This confessions page started when MIT Confessions went dormant for about a year and ended when MIT Confessions came back. It was founded on October 27, 2017, and as of December 2nd, it had 2,960 page likes; 3,335 page follows; and 7,422 posted confessions that are labelled¹.
  • MIT Summer Confessions: Over the summer, both confessions pages above stopped posting, so MIT Summer Confessions started as a way for people to post over the summer. It was founded on May 29, 2018, and as of December 2nd, it had 914 page likes; 1,057 page follows; and 2,462 posted confessions that are labelled¹.

Now, with all that has been happening to Facebook, and as Facebook got bigger as a company, they started putting many more restrictions on how one can scrape data from them. They introduced so many impediments to scraping that you are pretty much forced to use their Graph API (unless you want to do a lot of manual labor). If you are curious, here are some of the things they do that make it harder to get data from them:

  • Scraping data is not an option because Facebook not only lazy loads the content on a given web page but also loads the page in such a way that the HTML class and id names are random, the structure of the divs is somewhat random (or really hard to figure out), and the nesting of the HTML elements is incredibly deep and convoluted.
  • Another reason why scraping is not an option is that Facebook will detect whether a bot is scraping a given page and stop it from continuing, and even if one manages to create a bot that scrapes like a human, it still runs into the problems above.
  • With a total of over 21,000 confessions, it would take a really long time to collect all the necessary information about each confession by hand.

So, using the Graph API, which in turn has further restrictions, I had to message the administrator of each confession page, ask them to generate an access token for me, and then use that token to fetch data within the 1–2 hour window before it expired. Coordinating with the confessions admins was very difficult given that they want to keep their identities private, they are MIT students (i.e. they're busy), and they had to generate the access token in such a way that I could only get the information I wanted (i.e. page information and confession posts). I ended up getting an access token from MIT Confessions, which helped me collect about 1,500 examples. MIT Summer Confessions provided about 3,000 confessions in a CSV file, which I combined with what I had from MIT Confessions; we weren't able to coordinate an access token because either I was busy when they sent it to me or they were busy when I was ready to collect the data. Finally, MIT Timely Confessions never got back to me.
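For reference, the fetching itself is a plain HTTP call to the Graph API. Here is a rough sketch using the requests library; the API version and fields are simplified, and the reaction-count fields and paging are omitted:

import requests

ACCESS_TOKEN = "..."  # the short-lived token provided by a page admin
PAGE_ID = "..."       # the confession page's id

response = requests.get(
    f"https://graph.facebook.com/v3.2/{PAGE_ID}/posts",
    params={"access_token": ACCESS_TOKEN, "fields": "message,created_time"},
)
posts = response.json().get("data", [])
print(len(posts), "posts fetched on this page of results")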

So, in total I had collected 4,555 confessions with their labels (the number of Facebook reactions each confession had). Luckily, the number of examples is not that big, so we didn't need to store this data in a remote location. However, applying ML techniques with 4,555 samples (in a natural language processing context) can be difficult; as you will see, we had to deal with a lot of overfitting, which I will get into later.

If you are interested in seeing the data, you can find it here.

Data Observations

We can collectively observe the following about the data:

  • The data is extremely skewed toward low values. That is, any given confession is much more likely to get 0 likes than 1 like, 1 like is more likely than 2, 2 more likely than 3, and so on. This gets even more skewed once we start diving into the non-default Facebook reactions (love, wow, haha, sad, angry).
  • About 67% of confessions collected get 0 reactions (including comments). Again, that is to illustrate how skewed the data is.
  • The confessions that get the most reactions have somewhere between 75 and 175 characters. Longer than that, and it's much harder for the confession to get a lot of reactions. Below that, it's more a matter of chance and wit from the person writing the confession.
  • Most confessions have between 74–144 characters.
  • Most reactions are inversely correlated with each other, i.e., getting more likes results in getting fewer of everything else. This makes sense: there are funny confessions (a lot of hahas), likable confessions, lovable confessions, etc.; each confession, to some extent, elicits a specific feeling from the MIT community.
  • Facebook reactions outside of the like reaction started on February 24, 2016, so this means we do not know the reactions of confessions before that date.

In addition, here are some graphs that further illustrate these observations:

Graphs (from the data analysis): character count to likes scatterplot; character count to comments scatterplot; bucket histogram of the number of likes per confession; bucket histogram of the number of sad reactions per confession.

Do you think you can find any other pattern in the data? You can find the data in the API Github repository and do some more analysis (also check out the /data/analysis directory for more graphs or to generate more graphs; a CSV version of the data was extracted).
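If you want to poke at the data yourself, a few lines of pandas reproduce most of the observations above. This is only a sketch: the file name and column names are assumptions about the CSV layout.

import pandas as pd
import matplotlib.pyplot as plt

data = pd.read_csv("confessions.csv")  # hypothetical name for the extracted CSV
data["char_count"] = data["text"].str.len()

# how skewed are the likes?
print(data["like"].describe())
print((data["like"] == 0).mean())  # fraction of confessions with zero likes

# reactions are (mostly) inversely correlated with each other
print(data[["like", "love", "haha", "wow", "sad", "angry"]].corr())

# character count to likes scatterplot
data.plot.scatter(x="char_count", y="like")
plt.show()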

Data Cleaning

This step involves taking the raw data and converting it into a format that the machine learning algorithm can easily use for its computations.

Some ML Background

A machine learning algorithm is essentially a function h that takes an input input and produces an output prediction. In the context of supervised learning, this algorithm “learns” by being given the true, expected output y (which we call the label, so I will use label from now on), from which it can compute a “loss” (a number, computed by a loss function, that describes how bad the prediction is) and use that loss value to make a better prediction the next time it sees the same input. With that said, a machine learning algorithm is just a bunch of mathematical computations that lead to a given output prediction, so all inputs and outputs are numbers! Often, however, our inputs are not numbers (e.g. images, texts, categories, 3D maps, etc.) or are not the numbers we want (e.g. you might be given a number but want the deviation from the mean as input). So, in order to get to input, we need to preprocess and clean the raw_data to convert it to input. So, overall:

convert raw data: raw_data -> input 
create prediction: h(input) -> prediction
compute loss: loss_function(prediction, label) -> loss
update h based on loss

The critical step is to update h after computing the loss. This is the back-propagation step in neural networks.
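In Keras, most of this loop is handled for you. Here is a minimal sketch of the same steps with made-up numeric data (the model, data, and sizes here are purely illustrative):

import numpy as np
from tensorflow import keras

# convert raw data: raw_data -> input (here, dummy already-numeric features)
inputs = np.random.rand(100, 10)   # stand-ins for preprocessed confessions
labels = np.random.rand(100, 1)    # stand-ins for reaction counts

# h: a tiny model that maps input -> prediction
h = keras.Sequential([
    keras.layers.Dense(16, activation="relu", input_shape=(10,)),
    keras.layers.Dense(1),
])

# loss_function(prediction, label) -> loss; the optimizer updates h from the loss
h.compile(optimizer="adam", loss="mean_squared_error")

# fit() runs the predict / compute-loss / back-propagate loop over the data
h.fit(inputs, labels, epochs=5, batch_size=32)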

Our Problem

We want to generate MIT confessions using a generator and predict the reactions to a confession using a predictor.

With this problem, our raw data are MIT confessions, which are just texts. Our output data is either the number of reactions for a given reaction type or, when generating confessions, more text. With that said, and with all the characteristics of the data mentioned above (the skew, for instance), we had to do a couple of things to “fix” our dataset before we could run any sort of machine learning on it. I won't go into every single step or detail, but overall these were the main things:

  • In terms of character length, the average number of characters in any confession was 229.16, but looking further, we had a few outliers that were skewing the mean. So, we decided to cut out every confession that had more than 600 characters for the generator and anything that had more than 400 characters for the predictor. There were only a few of those. The mean without over-600-character confessions is 159.45 characters, and the mean without over-400-character confessions is 136.07 characters.
  • In terms of labels, we definitely had too many labels with low values and too few with high values. So, I decided to remove some of the oversaturated values (especially things that had 0 reactions) and—some machine learning experts will probably cringe at this idea—duplicate some of the sparse values (confessions that had over 50 reactions for example were quite sparse) for the predictor. This wasn’t an issue for the generator since it doesn’t deal with labels.
  • When training, some characters are distracting, so we decided to remove many of the punctuation marks (but kept the following: !?.,). Ideally, we would also want to remove words that appear way too often, but this would mess with our generating algorithm, so we left them in (words like I, you, me, etc.). A small sketch of this cleaning step follows this list.
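Here is a minimal sketch of what such a cleaning helper could look like; the function name, length cutoff, and exact rules are illustrative rather than our exact code:

from typing import Optional
import re

KEPT_PUNCTUATION = "!?.,"  # the punctuation marks we kept

def clean_confession(text: str, max_length: int = 400) -> Optional[str]:
    # drop confessions that are too long (600 for the generator, 400 for the predictor)
    if len(text) > max_length:
        return None
    # remove every character that is not a letter, digit, whitespace, or kept punctuation
    cleaned = re.sub(r"[^\w\s" + re.escape(KEPT_PUNCTUATION) + r"]", "", text)
    return cleaned.lower()

print(clean_confession("I love p-sets... (not really) #mit"))
# -> "i love psets... not really mit"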

We just didn't have enough data to work with (4,555 examples is extremely small! We needed quite a bit more), so we were limited in what we could do. As you will see, our models ended up overfitting a lot.

Machine Learning Models

For our machine learning needs, we wrote our programs in Python 3 and used the Keras and Tensorflow Python modules.

Overview: Word to Sequences and Vice Versa

Before we get into the models, it's worth mentioning that both models would have the same inputs. In fact, any model that wants to do anything with MIT confessions will most likely take the confession text as input. So, we need a way to convert these confession texts into numbers.

So, we used Keras's text preprocessing: a text tokenizer. This is a word-to-index model that converts words into integer ids. I had to rewrite it a bit because it didn't give us the desired outputs, so I used it as a helper only and created methods to convert text to sequences and vice versa.

In addition to that, we used Keras's sequence preprocessing: pad_sequences. The thing with doing ML on text is that models generally expect inputs of the same size (with some exceptions). So, what we had to do was pad the sequences so that the number of items in each sequence matches the longest sequence in the dataset and use that as our input size. We pad the sequences with a dummy “pad” word (in our case, 0 as the pad word index).

For example: let’s say we have 3 words “I”, “like”, and “tea”. Then we use the tokenizer to get the following word to index map:

   I <-> 1
like <-> 2
tea <-> 3

Then, if we have the sentence “I like like like tea”, it will be converted to the sequence [1, 2, 2, 2, 3]. Finally, if the longest sentence is “I like I like I like I like tea” (a total of 9 words), then we get the padded sequence [1, 2, 2, 2, 3, 0, 0, 0, 0] (pad 4 zeros to the right to match length 9).

Finally, we needed a word embedding, which converts an index into a vector of numbers. For example, it can convert 1 into some vector [0.12, -0.05, 1.14, 1.02] (this is just an example). So, our sequence ([1, 2, 2, 2, 3]) will become something like:

[
[0.12, -0.05, 1.14, 1.02],
[0.52, 1.05, -2.14, 0.02],
[0.52, 1.05, -2.14, 0.02],
[0.52, 1.05, -2.14, 0.02],
[3.52, -1.5, 0.14, -0.07],
]

I made up these vectors just to show you an example.
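Put together, the tokenize, pad, and embed pipeline looks roughly like this in Keras (a sketch; the actual embedding size and the helpers we wrote differ):

from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["I like like like tea", "I like I like I like I like tea"]

# word -> index map (index 0 is reserved for the pad word)
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

# pad every sequence with zeros on the right ("post") up to the longest one
max_length = max(len(sequence) for sequence in sequences)
padded = pad_sequences(sequences, maxlen=max_length, padding="post")

# the embedding layer turns each word index into a dense vector (size 4 here);
# it would normally sit as the first layer of a model
embedder = keras.Sequential([
    keras.layers.Embedding(input_dim=len(tokenizer.word_index) + 1,
                           output_dim=4, input_length=max_length),
])
vectors = embedder.predict(padded)  # shape: (num_examples, max_length, 4)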

First Model: Classifier

Now, our first model is a bucket classifier whose architecture looks like the following:

Bucket Classifier model

A “bucket” is a range of reaction counts. If we're looking at the number of likes that a confession will get, then a bucket might be the range 0–10 likes or 11–20 likes. What our model is doing here is trying to predict the most likely bucket. So, if the two buckets above were the only ones, the output would be something like

 0-10: 0.1
11-20: 0.9

I.e., 10% chance that the confession gets between 0–10 likes and 90% chance that it gets between 11–20 likes.

The reason I did this is the sparseness of the data. I needed a way to predict the number of reactions, and this was the closest I could get without the model overfitting too much and not learning anything (it still did overfit, though, even after finding the optimal number of buckets 😢).
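To make the bucket idea concrete, here is a rough sketch of turning reaction counts into bucket labels and feeding them to a softmax classifier. The bucket boundaries and layers here are illustrative, not the exact architecture in the figure:

from tensorflow import keras

# made-up bucket boundaries for one reaction type
BUCKETS = [(0, 10), (11, 20), (21, float("inf"))]

def bucketize(count: int) -> int:
    # return the index of the bucket that contains this reaction count
    for index, (low, high) in enumerate(BUCKETS):
        if low <= count <= high:
            return index
    raise ValueError("count does not fall in any bucket")

# labels become one-hot vectors over buckets; e.g. 15 likes -> bucket 1
labels = keras.utils.to_categorical(
    [bucketize(count) for count in [3, 15, 42]], num_classes=len(BUCKETS))

# a classifier ending in softmax outputs one probability per bucket
classifier = keras.Sequential([
    keras.layers.Embedding(input_dim=9000, output_dim=32, input_length=100),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation="relu"),
    keras.layers.Dense(len(BUCKETS), activation="softmax"),
])
classifier.compile(optimizer="adam", loss="categorical_crossentropy",
                   metrics=["accuracy"])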

Training Results from Classifier

We trained the model for 75 epochs with a batch size of 64.

  • Validation loss went from: 0.13 to 0.23 → increased
  • Validation accuracy went from 0.96 to 0.96 → remained unchanged
  • Training loss went from: 0.13 to 0.00
  • Training accuracy went from: 0.96 to 0.99

The model had such high accuracy on the training data and almost arbitrary results on the validation data. Even though the accuracy is high, it's not a good indication of quality: most of our data looks the same, so the model can easily learn to almost always predict the same thing without actually learning anything about the data. Of course, this may be due to the type of model used; for instance, had we used an LSTM, we might have gotten better results. Finally, given that we're working with text, the level of complexity is extremely high and the number of examples very low in proportion. We had about 9,000 words, and each example is a combination of some subset of words. If all examples had 40–200 words, then there would be roughly 1.66 × 10¹¹⁰ to 9.64 × 10⁴¹⁴ possible sequences of words. Of course, only a small percentage of those would be meaningful (see information theory), but that would still be a huge number of word sequences to choose from. In addition, our labels are the numbers of reactions, each of which can go up to infinity (thus increasing the sample space further). With only 4,555 examples to try to learn this distribution, overfitting is extremely easy.

Second Model: A Long Short-Term Memory (LSTM) Generator

LSTM Model.

An LSTM is a model that learns to predict the next thing in the sequence (See this link for more details on LSTMs). For us, the output layer is a one-hot encoding vector for a word index.

Jürgen Cito started the LSTM model seen above. One of the key features of the LSTM setup is that the inputs are padded pre instead of post. So, for instance, the sequence [1, 2, 3] will be padded as [0, 0, 0, 1, 2, 3] if the maximum length is 6. In addition to this, the labels for the LSTM are word indices. In order to extract labels, we break our sentence down into its subsequences starting from index 0 up until the end of the sequence (with a minimum sequence length of 2) and use the last index as the label. For instance, for the example above, we'd get the two input sequences [1, 2] and [1, 2, 3], pad them as [0, 0, 0, 0, 1, 2] and [0, 0, 0, 1, 2, 3], and then get the following input/label pairs: [0, 0, 0, 0, 1], [2] and [0, 0, 0, 1, 2], [3]².
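To make that preparation concrete, here is a rough sketch of how the input/label pairs and an LSTM generator could be built in Keras. The layer sizes and setup are assumptions, not the exact model in the figure:

from tensorflow import keras
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_length = 6          # longest sequence in the dataset (example value)
vocabulary_size = 9000  # roughly our number of words

def make_training_pairs(sequence):
    # break a sequence into pre-padded prefixes and next-word labels
    inputs, labels = [], []
    for end in range(2, len(sequence) + 1):
        prefix = sequence[:end]
        inputs.append(prefix[:-1])  # everything but the last word
        labels.append(prefix[-1])   # the last word is the label
    padded = pad_sequences(inputs, maxlen=max_length - 1, padding="pre")
    one_hot = keras.utils.to_categorical(labels, num_classes=vocabulary_size)
    return padded, one_hot

x, y = make_training_pairs([1, 2, 3])
# x -> [[0, 0, 0, 0, 1], [0, 0, 0, 1, 2]]; y -> one-hot versions of 2 and 3

# a minimal LSTM generator (a sketch only)
generator = keras.Sequential([
    keras.layers.Embedding(vocabulary_size, 300, input_length=max_length - 1),
    keras.layers.LSTM(128),
    keras.layers.Dense(vocabulary_size, activation="softmax"),
])
generator.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])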

Training Results from LSTM

We trained the model for 250 epochs with a batch size of 32, using confessions that have at least 25 reactions in total:

  • Validation loss went from: 6.50 to 5.3870 → decreased!
  • Validation accuracy went from 0.05 to 0.5185 → increased!
  • Training loss went from: 6.87 to 0.1697
  • Training accuracy went from: 0.04 to 0.9632

Here, we see that the model is actually learning something: we see much better results on the validation data. As training accuracy increases, validation accuracy also increases (at about half the rate). Training loss decreases, and validation loss also decreases, though not as much. That shows that we could really use more data. The results we're getting here could be tremendously improved with more data to work with: the amount of overfitting would be lower and the validation accuracy higher. In addition, using a word2vec embedding (instead of the generic embedding of size 300 used here) could potentially make a big difference.

Some Predictions from the Models

You can make predictions as well on mit-smart-confessions.herokuapp.com/. This may not work the first (or second, or third) time you try because the Heroku servers go to sleep and then take a long time to wake up after you make a request, but it should eventually work. 😅

Classifier Results

These are confessions taken from the MIT Confessions page recently. The way the classifier works is that you give it a text sentence and it will predict the buckets for each reaction type.

Text: “b1 freshman lack nuke power must elevate”
Output: 2–17 wow, 0 sad, 16–18 like, 3–12 love
Expected: 6 wow, 2 sad, 1 like, 1 love

Text: “Bow down you lowly phytoplankton, for I am your new sardine king.”
Output: 8–10 like, 6–27 haha, 0 love
Expected: 7 like, 6 haha, 1 love

Text: “Watching all these YouTubers promote science for the masses is truly humbling. Shoutout to ElectroBOOM, 3Blue1Brown, VSauce, GreatScott, minutephysics, Veritasium, CGPGrey, numberphile, SciShow and infinitely many others!”
Output: 1–2 like, 0 love
Expected: 53 like, 15 love

So, it seems like the model sometimes predicts correctly and sometimes is just way off! That is what I meant earlier by arbitrary results.

Generator Results

The way the generator works is that you give it a seed text that represents the start of the sentence and a length that represents how many additional words to add to the seed text. Then, it will predict, one by one, the next words to add.
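That word-by-word loop looks roughly like this (a sketch that reuses the tokenizer and generator names from the earlier sketches, with greedy argmax sampling):

import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

def generate(seed_text, num_words, tokenizer, generator, max_length):
    # repeatedly predict the most likely next word and append it to the text
    text = seed_text
    for _ in range(num_words):
        sequence = tokenizer.texts_to_sequences([text])[0]
        padded = pad_sequences([sequence], maxlen=max_length - 1, padding="pre")
        probabilities = generator.predict(padded)[0]
        next_index = int(np.argmax(probabilities))
        # index_word maps word indices back to words (0 is the pad index)
        text += " " + tokenizer.index_word.get(next_index, "")
    return text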

These seed texts were chosen arbitrarily.

Seed text: Roast that
Output: Roast that 6 check mit midterms harder than international olympiads of course

Seed text: People
Output: People who answer questions in large lectures where did you find such confidence bullshitters ? the animations are named asdf , …

Seed text: I love the
Output: I love the freshmen on my floor , they all have such funny and unique personalities ! so proud of them getting through their first real fight …

Seed text: MIT Midterms are
Output: mit midterms are by just being a p-set due at a potato, but it a few and we used to be bitter in a place like this, but just showing someone kindness, whether or not they are a friend, makes this place more rewarding. I feel like once a few years have gone by a lot of us lose sight of that. Keep doing nice things, it doesn’t take that much time or effort, and we can turn this place around, but unreadable enough that lazy female will have literally a plane and take.

The generator results, although more interesting, do show the overfitting. I think if you are old enough (in terms of how many years you have spent at MIT) and have been following MIT Confessions for that time, you would recognize that some of these “sensible” predictions are actually parts of confessions that were posted in the past. You can tell if you give it something nonsensical as input (e.g. “you what me what you coffee”); it'd yield something equally nonsensical.

One thing I was happy about is that the outputted words were words that you would find close to each other in a given confession. So, it did learn some things about the confessions.

Pitfalls of Working on MIT Confessions

1. Lack of Data: Cursed from the Start

This has been mentioned many times in this article already, but it really shows how much harder ML becomes if you do not have the data. You need a lot of data to be able to do anything with machine learning; otherwise you get overfitted results.

With that said, there is a limit to how much data we can get from MIT Confessions. As shown at the beginning of this post, the total number of confessions is roughly 30,000. That is, if we collected all the data we possibly could, we'd get at most about 30,000 confessions. That's still quite small in the context of ML and would potentially still lead to overfitting.

2. Knowledge Gap in Machine Learning Processes

Our first instinct when starting this project was to find something online, quickly copy-paste it, train it (without caring much about the methods), and see results. Then, we could improve on that.

That's not to say this was a wrong approach; we just didn't know as much about ML as was necessary. I had to do external research and found this data science primer guide to be very helpful. One key thing was that in 6.s198, I felt the class didn't emphasize the data cleaning step enough and how important it is to getting one's project to work well.

I am of the opinion that any ML project that takes data cleaning for granted is bound to not be successful (or not reach its full potential). I think there are still a lot of opportunities in this project to improve on the data (for instance, in 2017 MIT Confessions had fewer followers than in 2018; how does that affect the number of reactions?).

3. Model Architecture

Something else that made this difficult was figuring out what the hidden layers should be in the neural-network models. This is something that I am still not sure how to do.

I feel like this is something one gains insight into after a lot of experience, but I don't think there is a single right approach to it. I think that experts have a much better intuition about which layers could really make a difference and when to introduce certain layers.

4. Debugging Keras and Tensorflow

This can be extremely difficult. The Keras and Tensorflow codebases are large, with many pieces working together for various things. I want to highlight three important bugs that made development difficult:

Tensorflow Doesn’t Work with Python 3.7.x

This bug is due to Python 3.7.0 adding more reserved keywords compared to Python 3.6.x. This minor update should be backward compatible with Python 3.6.x unless, for whatever reason, you are using the newly introduced keywords async and await, which are a backward-incompatible syntax change.

The problematic keyword for tensorflow is the new keyword async: tensorflow uses async as an ordinary identifier in its codebase, and since async became reserved in Python 3.7.x, that code no longer runs (to be honest, I am not entirely sure how all of this works in the background).

The easy fix is to find every instance of the word async in tensorflow and replace it with a dummy variable name (like async1). This will allow tensorflow to work for the most part. However, it still won't work for some features (which I experienced later).


The best fix I found is to simply downgrade to Python 3.6.7, which is guaranteed to work. So, head over here to download Python 3.6.7.

The LSTM Predicts the Same Word

There could have been multiple reasons why this happened. One of them is that if one pads post and tries to predict the next word, then the next word will always be the pad word. For us, though, the reason was that we just needed to train for longer. 10 epochs is not enough: around 250 epochs was the point where training accuracy stopped improving. Note that this took over 10 hours of training, so you won't figure this out if you don't train for long enough! This was a hard bug to track down because our instinct was to train for a bit (1–2 epochs) and see if it kept predicting the same thing. Also, “train for longer” just didn't seem right, because the assumption behind this solution is that an untrained model should predict random garbage (not the same word over and over) since all the weights are initialized randomly.

Model Predictions Won’t Work Well With Python’s Flask

When running our server online, we have to load the models that we saved after training them. Ideally, after loading the models, they should work just like they did before they were saved. However, that is not the case when running with Flask: you simply get an error coming from somewhere within the tensorflow codebase.

I don’t completely understand why this is happening (and if you know, let me know!), but the fix for this was to add the following to the code:

# import tensorflow and grab the default graph right after loading the model,
# which fixed the bug
import tensorflow as tf

# store the default graph into a variable
tf_graph = tf.get_default_graph()

# ...
# more code (model loading, Flask setup, etc.) ...

def predict(inputs):
    # before making a prediction, set that graph back as the default
    with tf_graph.as_default():
        return model.predict(inputs)

This problem only happened when running the code with a Flask server. Online, I found mentions of the server handling requests asynchronously, so it likely has something to do with that.

Extensions to this Project

With all of the above said, we can train better models and find more insights. I have outlined the following list of extensions that could make this project better:

  • Using a Word2Vec Skip-Gram embedding: this would ensure that the embedding we use adds some meaning to the sequences, which would not only make the training faster but also likely improve the model metrics (especially on the classifier).
  • Use the same model for all reactions: currently, each reaction has its own model. However, we could easily use one model that makes a prediction for all of them. This could have a better chance of capturing the inverse relation observed above (that having more of one type of reaction results in fewer of every other reaction). A small sketch combining this with a couple of the other ideas follows this list.
  • Use some form of transfer learning on the text: this would be useful to take text, convert it into some useful representation, and use that representation as input to our models. The Word2Vec embedding above can be a form of transfer learning (e.g. using Google's Word2Vec model).
  • Predicting the fraction of reactions instead of the number of reactions: this could potentially yield better results because it reduces the sample space. This also changes the question from “how many reactions will I get?” to “what kind of reactions will people have to this confession?”
  • Experiment with variations of the models' architectures: sometimes, a different architecture can yield better results on the validation set. Maybe our models were too complex, since they overfitted the data and didn't perform as well on the validation set.
  • Collecting more data: Extremely critical for everything really.
  • Experiment with better data cleaning methods
  • Using an LSTM for the classifier model
  • Doing more work on understanding what makes or breaks a confession
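As a rough sketch of how a couple of these ideas could fit together (one shared model, an LSTM layer, and fractions instead of counts), here is a possible starting point; every size here is made up:

from tensorflow import keras

NUM_REACTIONS = 6  # like, love, haha, wow, sad, angry

# one shared model whose softmax output is the predicted fraction of each
# reaction type (a sketch of the extension ideas above, not existing code)
shared_model = keras.Sequential([
    keras.layers.Embedding(input_dim=9000, output_dim=32, input_length=100),
    keras.layers.LSTM(64),
    keras.layers.Dense(NUM_REACTIONS, activation="softmax"),
])
shared_model.compile(optimizer="adam", loss="categorical_crossentropy")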

Now What? Sentiment Analysis

With everything said, this may be funny, but it does not seem that useful. What's the goal of this? To make people post better confessions? At most, this is fun but not useful. How can we make this more interesting?

If we think of those Facebook reactions, they each convey a feeling, and one more interesting way to use this idea would be checking whether the message (or ad, or email, or poster, etc.) that one is about to send will convey the intended feeling to its target audience. Being able to predict what feelings your target audience would get from your message can help you shape it better and thus have a more desired impact.

Google did something similar with Perspective API, which I think is worth looking at.

Your Suggestions

Do you have some suggestions to add to this list or to tweak the project further? Please share in the comments and thank you for reading this!

Many Thanks

Working on this project helped me learn a lot about ML, servers, Flask, and Heroku that I didn’t know before, and I want to thank our class instructors Natalie Lao and Hal Abelson for providing me with this wonderful experience.

I also want to thank our assigned TA Yaakov Helman for guiding us in the early stages of the process and all the other TAs for helping us out with various questions throughout the class.

Finally, I want to thank Jürgen Cito for helping write the LSTM and for many wonderful suggestions throughout the project, especially the suggestion to use a base model class for all our models. Jürgen also helped with writing this blog post, so I want to thank him for that too!

___________________________________________________________________

Footnotes

¹The page may have more confessions than the number I gave here. That is because each confession is given an id, and the page posts confessions as “#CONFESSION_ID CONFESSION_TEXT”. The id counter may have restarted at some point, but the total number of confessions does not restart; this is why MIT Summer Confessions ended up providing more data than its largest confession id would suggest.

²You may ask: why not take all subsequences? For instance, if we have the sequence [1, 2, 3, 1], why not take all subsequences of length at least two, namely [1, 2], [1, 2, 3], [1, 2, 3, 1], [2, 3], [2, 3, 1], and [3, 1]? The reason is the following: how many such subsequences result from a sequence of length n? The answer is n choose 2, which comes out to n * (n - 1) / 2. So, if our examples had somewhere between 20–80 words on average, we'd have at least 10 * 19 * 4555 = 865,450 examples, and up to 40 * 79 * 4555 = 14,393,800 examples to train on. That is not too much of a problem until you take into consideration the maximum sequence length, which can be quite big, since each example vector will be as big as that. If that max sequence length were 80, for instance (and it's much larger in our case), then we'd have many really large vectors, each of which gets 300-fold bigger through the embedding layer (size 300 in the LSTM case), which is just a lot of memory for Python. We actually tried this, and Python quit on us for using too much memory.
