Hackbright Project Season: Week 1 Retrospective
Officially one week into project season, and I was finally able to tear myself away from my code long enough to take a moment to reflect on my progress and process so far — what has worked, and what has… taught me a lot.
Minimum Viable Product (MVP): To create a web app that visualizes public sentiment regarding the 2016 presidential candidates.
In more detail: my goal is to scrape tweets from the last six months referencing the 2016 presidential candidates, and to identify whether each tweet expresses positive or negative sentiment toward the referenced candidate using natural language processing techniques (most likely Naive Bayes).
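To make the Naive Bayes idea concrete, here is a toy bag-of-words classifier written in pure Python. This is my own illustrative sketch, not the project's actual code — a real implementation would lean on a library like NLTK or scikit-learn, and the training tweets below are made up:

```python
# Toy Naive Bayes sentiment classifier (illustrative only; the function
# names and training data are invented for this sketch).
import math
from collections import Counter, defaultdict

def train(labeled_tweets):
    """Count word frequencies per sentiment label."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in labeled_tweets:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Pick the label maximizing P(label) * prod P(word|label),
    with add-one smoothing, computed in log space."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total = sum(label_counts.values())
    best, best_score = None, float("-inf")
    for label in label_counts:
        score = math.log(label_counts[label] / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1) / denom)
        if score > best_score:
            best, best_score = label, score
    return best

# Made-up training examples, just to show the mechanics:
data = [("great debate performance", "pos"),
        ("love this candidate", "pos"),
        ("terrible awful speech", "neg"),
        ("hate this scandal", "neg")]
wc, lc = train(data)
result = classify("great candidate", wc, lc)  # → "pos"
```

The appeal of Naive Bayes for a project like this is that it trains in a single pass over the data and handles large, sparse vocabularies (like tweet text) cheaply.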
Struggles and Lessons:
My biggest struggle of the week was both technical and mental — and entirely my own doing. Instead of simply accepting that Twitter’s search API limits your queries to tweets from the past 7 days, and working on building a database using data starting in August, I decided that it would be a lot more fun to scrape* all of the public tweets from the last six months that referenced the words ‘Trump’ or ‘Clinton’.
While extracting data from a rendered webpage is fairly straightforward, Twitter’s infinite scroll makes the process a little more complex. The Twitter news feed is an example of lazy loading — where the webpage only loads data as needed — in Twitter’s case, when the user nears the end of their news feed or search results. While lazy loading is incredibly efficient from Twitter’s perspective, it is less ideal for delinquents like myself who are trying to scrape all of their data. Only one page’s worth of tweets is actually rendered in the HTML when you first load a search, and you can only get more data to populate by scrolling down to the bottom of the page. Given how often people like to share their opinions on Trump and Clinton, if I were going to do this manually I could probably spend the entire remainder of Hackbright mindlessly scrolling away and not even make it to Trump’s ‘second amendment people’ fiasco last week.
To get around this problem, I used Selenium WebDriver to make calls directly to a Firefox browser, automating the process of scrolling down the page and prompting Twitter to load more data.
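The core of that automation looks something like the sketch below (the function name and stopping heuristic are my own; the two-second pause matches the sleep mentioned later in the post). The trick is to scroll to the bottom, wait for the lazy load to fire, and stop once the page height stops growing:

```python
# Sketch of automated scrolling to trigger lazy loading. Assumes a Selenium
# WebDriver instance; the helper itself only needs execute_script().
import time

SCROLL_PAUSE = 2  # seconds to wait for the lazy load after each scroll

def scroll_to_bottom(driver, max_scrolls=50, pause=SCROLL_PAUSE):
    """Scroll until the page height stops growing or max_scrolls is hit."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the new tweets time to render
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:  # lazy load has nothing left to give
            break
        last_height = new_height

# Example use (requires Selenium and Firefox's driver to be installed):
#   from selenium import webdriver
#   driver = webdriver.Firefox()
#   driver.get("https://twitter.com/search?q=Trump%20OR%20Clinton")
#   scroll_to_bottom(driver)
#   html = driver.page_source  # now contains every loaded tweet
#   driver.quit()
```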
However, each time you prompt the lazy load to retrieve more data, the previously rendered HTML is still present. If you were to scrape the page each time you loaded new data, you would therefore scrape all of the previously collected tweets over and over again — not exactly a model of efficiency.
The alternative solution is to have Selenium load the page multiple times in a row, scrape all of the loaded data at once, and then start a new Twitter search on the following day to prevent duplicate results. If you’re searching for tweets containing the word ‘cantaloupe’, for example, this works beautifully, but it turns out that people have a lot of opinions about Trump and Clinton — approximately 3,500 pages’ worth of tweets per day, to be more precise. Sadly, Firefox didn’t seem too thrilled about loading 3,500 pages’ worth of data with only a two-second sleep between each call, and would reliably freeze and crash on me before I even got around to scraping the HTML.
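The day-by-day batching can be sketched as a generator of search URLs, one per day, using Twitter’s `since:`/`until:` search operators so that each search picks up exactly where the previous one left off. This is my own illustrative version, not the project’s actual script, and the date range below is an assumption standing in for the six-month window:

```python
# Hypothetical sketch of one-search-per-day batching to avoid duplicates.
from datetime import date, timedelta
from urllib.parse import quote

def daily_search_urls(query, start, end):
    """Yield one Twitter search URL per day in [start, end),
    scoped with since:/until: so results never overlap."""
    day = start
    while day < end:
        next_day = day + timedelta(days=1)
        q = f"{query} since:{day.isoformat()} until:{next_day.isoformat()}"
        yield f"https://twitter.com/search?q={quote(q)}&f=tweets"
        day = next_day

# Assumed six-month window for illustration:
urls = list(daily_search_urls("Trump OR Clinton",
                              date(2016, 2, 1), date(2016, 8, 1)))
# one URL per day across the window; scrape each fully before moving on
```

Because each URL covers a disjoint one-day slice, a crash partway through only costs the day being scraped, and the script can resume from the next date rather than starting over.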
After many, many hours of troubleshooting, both alone and with my incredibly patient advisor Bonnie, I finally had to stop and consider how important capturing every single tweet really was to my goals for this project. Ultimately, the purpose of the project is twofold: on a straightforward level, it is about getting a working web application that explores Twitter data as a tool for illustrating public sentiment regarding the election. More importantly, however, it is about learning and growing as a brand-new software engineer — building something with the skills that Hackbright has given me, while learning infinitely more in the process. While the internal academic that every UChicago alum carries inside of them cringed a little at the messiness of my sampling, taking a step back I had to acknowledge that statistical rigor in my data sampling wasn’t fundamental to either of the foundational goals of my capstone project.
After much deliberation, I decided that the benefits of being able to examine and display the fluctuations in public opinion over the entire election cycle (even with a smaller sample size) outweighed the benefits of switching back to the Twitter API. I cut my number of calls drastically, all the way down to 400 — and at the rate I’ve been processing, I’ll be running my scraping script in the background for the next week. Even using just this fraction of the available data, I’ve been getting ~120,000 tweets per month, which still leaves me with a fairly decent-sized database to work with.
As I tackle natural language processing this week, I intend to keep a better focus on what the actual goals (and timeline) of my project are, instead of getting tied up in what the polished, ideal version of the project would be if I had infinite time and more than a year of programming experience. Besides, implementing my dream features will make a great post-Hackbright side project to work on.
*Dear Twitter, if you’re reading this please don’t sue me — it was all in the pursuit of learning