Building Better Reads

William Cotton · Published in Voice Tech Podcast · Aug 9, 2019 · 6 min read

A crash course in AWS and NLP

Last week, I had the opportunity to work on by far the most exciting project I’ve worked on during my time at Lambda School. I worked alongside three other data scientists and six web developers to create the next great book recommender — just type in the description of a book you’d like to read, and the Better Reads model generates 10 recommendations based on a natural language processing model, powered by spaCy and Elastic Beanstalk. I’ve written up how we did it, along with some overall reflections on the process.

CONTENT WARNING: TECHNICAL DETAILS

If you’re more interested in my takeaways, feel free to skip to the section labeled Overall Reflections.

The original plan was to load all of the descriptions and other variables into a PostgreSQL database on Amazon’s RDS service, connect that database to some sort of EC2 instance, and finally expose the EC2 instance as a public API through API Gateway. But there are some significant drawbacks to this approach:

  1. Complexity. Three of the four data scientists on the team had used AWS before, but AWS is far from straightforward (as we now well know). Becoming familiar with AWS and designing and deploying a model on Elastic Beanstalk alone in three days was a task in and of itself; trying to do that across THREE services in the same timeframe would have been foolhardy.
    On top of all that, we’d have had to troubleshoot and debug the connections between the services, along with making sure the model sent its predictions to the web development team correctly. Sure, everybody wants to become an AWS pro, but our #1 priority was having a working model for our team to present at the end of the week.
  2. Reduced speed. The version of the model we deployed relies on quickly vectorizing a description and comparing it against a database of 20,000 other vectorized descriptions. At best, this would mean querying the entire database when the server starts so that all of the vectors can be held in memory. At worst, our server would have to run 20,000 queries every time it produced a set of predictions. Not ideal, obviously.
  3. Cost. While using all three services for a single project wouldn’t be prohibitively expensive, we nevertheless wanted to minimize our expenditures while still providing a product that met our accuracy, speed, and simplicity targets. And, of course, in a business setting, cost is an ever-present concern: if we wanted to scale this, we’d need to make sure our build was organized responsibly. Even a minor increase in cost should be avoided if possible, especially if that cost scales with the size of the deployment.

So we decided to simplify matters. We managed to get everything on our end working on Elastic Beanstalk, including the API endpoint for our web development friends. Here’s how we did it:

We created a spaCy vector-based model

This was the only part we ended up needing an AWS service beyond Elastic Beanstalk for — we did our preliminary data exploration, cleaning, and modeling in a Sagemaker notebook. Practically speaking, we probably could have done this on our local machines, or perhaps (barring timeout errors) even on Colab. But Sagemaker provides two clear advantages:

  1. Speed. Again, we wanted to deploy the model to a web app by the end of the week. Our local machines probably could have vectorized the descriptions and generated predictions just as well as Sagemaker — but it would have taken much longer. Using Sagemaker did increase our costs, but only marginally. In total, running Sagemaker cost only about a dollar an hour. Not too bad for getting predictions a few hours earlier when operating on a tight deadline, and since it’s a process that we only needed to run through once, scaling the project wouldn’t increase our total AWS bill.
  2. More exposure to AWS services. Again, we had more urgent goals than spending a week learning the ins and outs of AWS. But Sagemaker was a fairly gentle introduction to the AWS offerings, and it provided opportunities to learn more about the core logic of AWS as a platform. For example, we got first-hand experience with the relationship between Sagemaker notebooks and EC2 instances, as well as how to interface between Sagemaker and S3 buckets; a rough sketch of that S3 workflow follows this list. (More about Sagemaker in the next steps section.)
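We won’t reproduce our notebook code here, but the basic Sagemaker-to-S3 round trip comes down to a handful of boto3 calls. A minimal sketch, with the bucket name and object keys as placeholders:

```python
import boto3

# Placeholder names; swap in your own bucket and object keys.
BUCKET = "better-reads-data"
RAW_KEY = "raw/book_descriptions.csv"

s3 = boto3.client("s3")

# Pull the raw data down onto the notebook's EC2 instance...
s3.download_file(BUCKET, RAW_KEY, "book_descriptions.csv")

# ...clean and vectorize it in the notebook...

# ...then push the processed artifact back to S3 so it survives
# stopping the notebook instance.
s3.upload_file("vectorized_descriptions.pkl", BUCKET,
               "processed/vectorized_descriptions.pkl")
```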


Fortunately, the spaCy library is fast, easy to use, and produces pretty dang impressive results right out of the box.
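Our actual notebook code isn’t reproduced here, but the core of the vectorizing step is only a few lines. Here’s a minimal sketch, assuming the en_core_web_md model and a pandas dataframe with a description column (the file name is a placeholder):

```python
import pandas as pd
import spacy

# Assumes the medium English model is installed:
#   python -m spacy download en_core_web_md
nlp = spacy.load("en_core_web_md")

# Hypothetical file of ~20,000 book descriptions.
df = pd.read_csv("book_descriptions.csv")

# doc.vector averages the token word vectors into a single
# 300-dimensional representation of the whole description.
df["vector"] = [doc.vector for doc in nlp.pipe(df["description"])]

print(df["vector"].iloc[0].shape)  # (300,)
```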

Cool beans.

Using the built-in similarity function is just as easy — but in order to decrease the size of our deployments, we built a cosine-similarity function to compare test descriptions against the database and generate recommendations:
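Our exact function isn’t shown here, but the approach looks roughly like this sketch. The recommend signature and the vectors and titles names are illustrative, with vectors being the precomputed description vectors from the step above:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def recommend(description, nlp, vectors, titles, n=10):
    """Return the n titles whose description vectors are most
    similar to the vectorized input description."""
    query = nlp(description).vector
    scores = np.array([cosine_similarity(query, v) for v in vectors])
    top = np.argsort(scores)[::-1][:n]
    return [titles[i] for i in top]
```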

We created a Flask app

Elastic Beanstalk provides native support for Flask apps written in Python, making it an easy choice, especially since we’d all had some experience creating barebones Flask apps. Aside from a few minor problems, this step wasn’t too bad: just create a route that takes in a user’s description, runs it through the model, and returns the predictions as JSON.
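Stripped down, the app looks something like the following sketch. The route name and payload shape are illustrative rather than our exact endpoint, and it assumes the nlp model, vectors, titles, and recommend() from the earlier sketches are loaded at startup:

```python
from flask import Flask, jsonify, request

# Elastic Beanstalk's Python platform looks for an application.py
# module exposing a callable named "application" by default.
application = Flask(__name__)

# nlp, vectors, titles, and recommend() are assumed to be loaded
# and defined at startup, as in the earlier sketches.

@application.route("/recommend", methods=["POST"])
def get_recommendations():
    description = request.get_json().get("description", "")
    results = recommend(description, nlp, vectors, titles, n=10)
    return jsonify({"recommendations": results})

if __name__ == "__main__":
    application.run(debug=True)
```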

We included information about our Python environment

As with any serious Python project, dependency management was crucial. Elastic Beanstalk natively supports requirements files generated with pip freeze, so this was also fairly straightforward.
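Concretely, this just means running pip freeze > requirements.txt in the project’s virtual environment and committing the result. An illustrative excerpt (the packages are the ones we used, but the version pins below are examples rather than our exact ones):

```
flask==1.1.1
spacy==2.1.8
pandas==0.25.0
numpy==1.17.0
```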

We included fixes specific to the Elastic Beanstalk platform

This was… less straightforward.

Elastic Beanstalk, and AWS more generally, for all their good qualities, have an absolutely dizzying array of features and configuration options. Among the many (…many) fixes we had to track down for our deployment, we had to:

  • Work around an issue with using numpy on Elastic Beanstalk (thanks to Peyton Runyan, another Lambda School student who’d run into this before us and saved us the trouble).
  • Lengthen the timeout limit for starting up the instance, since the initial loading of the dataframe and the spaCy model took so long.
  • Modify the code to repair CORS issues with our API.
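I won’t walk through every one of those fixes, but the CORS one is easy to illustrate. One common approach in Flask is the flask-cors extension; this is a minimal sketch of that approach, not necessarily the exact change we shipped:

```python
from flask import Flask
from flask_cors import CORS

application = Flask(__name__)

# Allow the front end, served from a different origin, to call the API.
# (The wide-open "*" origin is for illustration; a real deployment would
# typically restrict this to the web app's domain.)
CORS(application, resources={r"/*": {"origins": "*"}})
```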

Overall Reflections

  • This took way more work than I’d expected — but it was 100% worth it. The final product produces some exciting results, especially when a user inputs a long description. I’ve had to shut down the instance to reduce costs, but I’m looking into future options for hosting cheaply.
  • By far, the most challenging (and exciting) part of this process was wrangling AWS. I learned an incredible amount about Amazon’s offerings just to get this app running, and there are still so many services whose surface I’ve barely scratched. I’m eager to continue familiarizing myself with the various parts of AWS.
  • Communication is key. We just flat out would not have finished this product without clear lines of communication with set roles and expectations for everyone on both the data science and web development teams.
  • Setting clear and realistic goals is crucial. We had a lot of ideas for stretch goals, including adding filters to the search feature, adding a search by author function, and increasing the database size. But the most important goal was delivering on our MVP by the end of the week, so we tailored our individual responsibilities with that in mind, with the understanding that stretch goals would be waiting for us if time permitted.

What I’m working on now:

  1. Deploying purely on Sagemaker. In reviewing our work and clicking around AWS, I discovered that Sagemaker actually has built-in API endpoint support for deployed models, so I’ve spent the last few days exploring this and working on getting at least a sample model deployed through one of these endpoints. I’m also looking for ways to test the Sagemaker Python API locally so I can experiment without burning too many credits. If you’re reading this and you know of any great resources for model deploys on Sagemaker or working with the Sagemaker Python API, please shoot me a message!
  2. Improving the model. Amazon actually has its own pre-built NLP offering available through Sagemaker: BlazingText. I haven’t worked with it yet, but I’m definitely going to see if I can get something going when I wrap up my Sagemaker API deployment of the project. I’m also curious to see whether I can improve the speed of predictions without sacrificing accuracy, perhaps by using k-means or another clustering algorithm to narrow the pool of candidate descriptions (a rough sketch of that idea follows this list).
  3. More ways to explore the data. In particular, I’d like to add more routes for the API. For example, it would be interesting to be able to search by author; I might set it up so that when searching for a particular author, I collect all of the vectorized descriptions of books I have by that author in the database, then find the most similar vectorized descriptions. That’s just an idea; I’m interested to see what methods provide the best results.
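For the clustering idea mentioned in item 2, the rough shape would be something like the sketch below, using scikit-learn’s KMeans and the cosine_similarity helper from earlier. This is purely exploratory; the cluster count and function names are placeholders, and I haven’t benchmarked anything yet:

```python
import numpy as np
from sklearn.cluster import KMeans

# vectors: the precomputed description vectors (one row per book).
kmeans = KMeans(n_clusters=50, random_state=42).fit(vectors)

def recommend_fast(description, nlp, vectors, titles, n=10):
    """Score only the books in the cluster closest to the query,
    rather than all 20,000 descriptions."""
    query = nlp(description).vector
    cluster = kmeans.predict(query.reshape(1, -1))[0]
    candidates = np.where(kmeans.labels_ == cluster)[0]
    scores = np.array([cosine_similarity(query, vectors[i])
                       for i in candidates])
    top = candidates[np.argsort(scores)[::-1][:n]]
    return [titles[i] for i in top]
```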

Here’s a link to our GH repo. If you’re working on anything similar and have questions about the process (or just want to grouse about AWS with somebody who gets it), shoot me a DM.
