Saving Money with Beautiful Soup and Hashing

Fangfei Shen
Jul 9, 2018 · 5 min read

Back in the summer of 2017, Knewton’s alta images had an expensive problem. Whether the images were of parabolas, molecules, or supply and demand curves, they were all missing two important things: alt text and long descriptions.

Why do our images need alt text and long descriptions?

To be ADA compliant, Knewton alta’s images need alt text and long descriptions that make them accessible to screen readers. This matters because our blind or visually impaired students rely on assistive technology like screen readers to interact with our visual content.

An accessible image needs alternative text (alt text) and possibly also a long description, which the screen reader can read out to the user. Alternative text is typically brief (we limit ours to 255 characters) and should always be included on accessible images. Long descriptions are used to describe more complex images, like a detailed diagram.

Alt text and long descriptions are added to images via HTML attributes. Here’s an example image:

[Image: Fluffy gray cat belonging to a coworker.]

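Let’s say that the HTML tag for this image carries two extra attributes: alt, set to “Fluffy gray cat”, and longdesc, set to the location of a longer description. With both attributes in the image tag, a screen reader can read out “Fluffy gray cat” and the contents of the long description to the user.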
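As a rough illustration of where these attributes live, here is a small sketch using Python's standard-library html.parser. The tag and both file names (cat.jpg, cat-description.html) are hypothetical placeholders, not the ones from the original example.

```python
from html.parser import HTMLParser

# Hypothetical tag: the file names are placeholders for illustration only.
TAG = '<img src="cat.jpg" alt="Fluffy gray cat" longdesc="cat-description.html">'


class ImgAttributes(HTMLParser):
    """Collect the attributes of every <img> tag as a dict."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        if tag == "img":
            self.images.append(dict(attrs))


parser = ImgAttributes()
parser.feed(TAG)
image = parser.images[0]

print(image["alt"])       # the short text a screen reader reads aloud
print(image["longdesc"])  # where the longer description lives
```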

The idea is that a visually impaired student can get all the information they need about an image via its alt text and long description. Take this image for example:

[Image: This figure shows two curves. The first curve is marked in blue and passes through the points (negative 1, 2), (0, 1), and (1, 1 over 2). The second curve is marked in red and passes through the points (negative 1, 3), (0, 1), and (1, 1 over 3). Attribution: Image by OpenStax Intermediate Algebra is licensed under Creative Commons Attribution 4.0 International License.]

This image’s alt text is very specific:

This figure shows two curves. The first curve is marked in blue and passes through the points (negative 1, 2), (0, 1), and (1, 1 over 2). The second curve is marked in red and passes through the points (negative 1, 3), (0, 1), and (1, 1 over 3).

(Note: Due to the limitations of Medium, there are no alt text and long descriptions for images in this blog post, but we’ve included image captions as an alternative for screen readers.)

Unsurprisingly, writing alt text and long descriptions for thousands of images gets expensive fast. If only there were a way for us to not start from scratch…

Scrape OpenStax, save money

A good chunk of Knewton’s content is curated from the open source OpenStax textbooks (OpenStax and Knewton’s partnership began in 2016).

Greg, our Senior Manager of Content, was staring at OpenStax textbooks online (as senior managers of content are wont to do) when it hit him: OpenStax includes alt text with its images. We could scrape the alt text from OpenStax and match it with the images used in our courses! To Greg, this sounded like an ideal Hack Day project.

Knewton holds “Hack Days” a few times a year, in which Knerds (Knewton employees) get to work on whatever project they want. For the August 2017 Hack Day, Greg and I teamed up to make his OpenStax-scraping, money-saving dream happen. Our solution had two steps:

  1. Scrape OpenStax textbooks for images and their associated alt text.
  2. Associate the scraped alt text with the matching images in our content management system’s database.

Finding images with Beautiful Soup

Conveniently, each OpenStax textbook has a downloadable zip, containing all of the textbook’s HTML and image files.

I wrote a Python script to walk through all the directories of the unzipped book, looking for HTML files. Then it became a straightforward application of Beautiful Soup, a popular Python HTML parser.
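The directory walk can be sketched with os.walk from the standard library. This is a minimal version under assumptions: the function name find_html_files and the demo file names are made up for illustration, and the post does not say how the original script was structured.

```python
import os
import tempfile


def find_html_files(root_dir):
    """Yield the path of every .html/.htm file under root_dir, at any depth."""
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        for name in sorted(filenames):
            if name.lower().endswith((".html", ".htm")):
                yield os.path.join(dirpath, name)


# Demo on a throwaway directory standing in for an unzipped book.
with tempfile.TemporaryDirectory() as book_dir:
    os.makedirs(os.path.join(book_dir, "chapter1"))
    for rel in ("index.html", "chapter1/section.html", "chapter1/graph.png"):
        with open(os.path.join(book_dir, rel), "w") as f:
            f.write("")
    found = sorted(os.path.relpath(p, book_dir) for p in find_html_files(book_dir))

print(found)  # the two HTML files; graph.png is skipped
```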

Finding all <img> tags is simple. Once I had the HTML content, I just needed two lines:

  1. Create a “soup” using the HTML.
  2. Use the soup’s find_all method.

For each <img> tag in the results, Beautiful Soup makes it easy to extract the alt and src attributes.

The alt attribute contains, well, the alt text. The src attribute contains the file path to the image, which will be used in the next section.
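Here is a minimal sketch of those two steps plus the attribute extraction, assuming the beautifulsoup4 package is installed. The HTML snippet and image paths are invented for illustration, and the choice of the "html.parser" backend is an assumption; the original script may have used a different parser.

```python
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

# Hypothetical HTML standing in for one page of an unzipped book.
html = """
<html><body>
  <p>Two curves are shown below.</p>
  <img src="images/curves.jpg" alt="This figure shows two curves.">
  <img src="images/decorative.png">
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")  # 1. create the soup
img_tags = soup.find_all("img")            # 2. find every <img> tag

# Tag.get returns None when an attribute is missing, so images without
# alt text are filtered out here instead of raising a KeyError.
scraped = [
    {"src": tag.get("src"), "alt": tag.get("alt")}
    for tag in img_tags
    if tag.get("alt")
]

print(scraped)
```

Using tag.get rather than tag["alt"] is a deliberate choice in this sketch: not every image in a textbook is guaranteed to carry alt text.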

Matching up images with hashing

Now that I have scraped all the alt text in an OpenStax book, I need to match it up with images in our content management system (CMS). This is where hashing comes in: two identical image files always have the same hash, while two different files will, for all practical purposes, have different hashes. If an alt text’s corresponding OpenStax image has a hash that matches the hash of an image in our CMS, then we can apply that alt text to that image in the CMS.

In Python, hashing an image is a matter of using the image’s file path (which was scraped with Beautiful Soup) to read the image’s bytes, and then feeding the bytes into one of Python’s built-in hashing algorithms.
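A minimal sketch of that idea using the standard-library hashlib module follows. The post does not say which algorithm was used, so SHA-256 here is an assumption, as is the helper name hash_image. The demo files contain placeholder bytes rather than real image data.

```python
import hashlib
import os
import tempfile


def hash_image(path, algorithm="sha256"):
    """Return the hex digest of the file's bytes.

    The algorithm is an assumption; the post does not name the one it used.
    """
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read in chunks so large images need not fit in memory at once.
        for chunk in iter(lambda: f.read(8192), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


# Demo: two files with identical bytes hash the same; a different file does not.
with tempfile.TemporaryDirectory() as tmp:
    paths = {}
    samples = [("a.png", b"same bytes"), ("b.png", b"same bytes"), ("c.png", b"other bytes")]
    for name, data in samples:
        paths[name] = os.path.join(tmp, name)
        with open(paths[name], "wb") as f:
            f.write(data)
    digest_a = hash_image(paths["a.png"])
    digest_b = hash_image(paths["b.png"])
    digest_c = hash_image(paths["c.png"])

print(digest_a == digest_b)  # True: identical bytes
print(digest_a == digest_c)  # False: different bytes
```

Note that the hash depends only on the bytes, not the file name, which is what makes it usable as a key across two separate systems.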

Because hashing is deterministic, anyone who hashes the image below with the same hashing algorithm should see the same result. (This assumes that Medium has not changed the image’s compression since this post was published.)

[Image: Cartoon illustration of a laptop.]

Conveniently, our image database already contained hashes for each image so we didn’t need to backfill any data.
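With hashes available on both sides, the matching step reduces to a dictionary lookup. The sketch below uses entirely hypothetical data: the shortened hash strings, the CMS image IDs, and the function name match_alt_text are placeholders, and a real CMS would be queried through its database rather than an in-memory dict.

```python
# Hypothetical data: hashes are shortened placeholders, not real digests.
scraped = [
    {"hash": "aaa111", "alt": "This figure shows two curves."},
    {"hash": "bbb222", "alt": "Cartoon illustration of a laptop."},
]
cms_images = {  # hash -> image ID in the CMS
    "aaa111": "cms-image-17",
    "ccc333": "cms-image-42",
}


def match_alt_text(scraped, cms_images):
    """Map CMS image ID -> alt text for every scraped hash found in the CMS."""
    return {
        cms_images[item["hash"]]: item["alt"]
        for item in scraped
        if item["hash"] in cms_images
    }


matches = match_alt_text(scraped, cms_images)
print(matches)  # only the image whose hash appears on both sides
```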

What about the long descriptions?

You’ll notice that I glossed over long descriptions in the last few sections. Did we scrape for those at all? Yes and no. OpenStax image tags do not have longdesc attributes. However, we did end up repurposing many OpenStax alt texts as long descriptions because they were so detailed and, well, long.

We also did not use every OpenStax alt text verbatim, as our team of subject matter experts sometimes improved upon them or shortened them to fit within our 255-character limit.

How much money did we save?

We haven’t calculated exactly how much money we saved (we’ve been busy building out the Knewton alta product instead 😉). But if we do a back-of-the-envelope calculation:

  • We were able to scrape several thousand alt texts from OpenStax.
  • Writing the alt text and long description for an image costs between $5.00 and $40.00 depending on the image’s complexity.

Therefore, thousands of images times tens of dollars per image equals tens of thousands of dollars saved! Not too shabby for a hack day project that I coded in a day.
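To make the envelope math concrete, here is a tiny sketch. The per-image costs come from the figures above, but the image count of 3,000 is a hypothetical stand-in for “several thousand”, not a number from our records.

```python
def savings_range(num_images, low_cost=5.00, high_cost=40.00):
    """Return (low, high) dollar estimates for writing descriptions by hand."""
    return num_images * low_cost, num_images * high_cost


# 3,000 is a hypothetical stand-in for "several thousand" images.
low, high = savings_range(3000)
print(f"${low:,.0f} to ${high:,.0f}")  # $15,000 to $120,000
```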

The Knewton Blog - Stories about technology, product and…
