M2M Day 247: How to extract 100,000 lines of crossword data from the internet

Max Deutsch
4 min read · Jul 6, 2017


This post is part of Month to Master, a 12-month accelerated learning project. For July, my goal is to solve a Saturday New York Times crossword puzzle in one sitting without any aid.

Yesterday, I realized that I won’t be able to complete this month’s challenge if I just use the crossword puzzles themselves as my training tools. Instead, I’m going to need to make some sort of training app/program that I can use to receive more immediate feedback and that I can control for more intelligent training.

To build this program, I will need two things:

1. Lots of crossword data
2. Some interface that displays this data in some quiz-like form

Today, I focused on aggregating the data.

Step 1: Find the best data source

After a little bit of googling around, I found a well-suited website called NYTcrossword.com.

Every day, the site’s creator, Bill Ernest, publishes the clues and answers to that day’s NYT crossword puzzle as plain text. Not only that, but he also selects a number of clue-answer pairs and provides additional useful information (e.g. an explanation of the clue, more info about the answer, and so on).

Here’s what that looks like for today’s puzzle…

Luckily for me, Bill has been keeping this record every day since 2009.

He’s also designed his site so that it’s incredibly searchable/navigable, which makes the next step much easier.

Step 2: Scrape all the data

The next step is to extract all this crossword data from Bill’s site using a technique called web scraping. Basically, to “scrape” a website, you teach a bot how to navigate the site and which pieces of data to extract as it navigates.
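As a rough illustration of what a scraper does under the hood, here’s a minimal sketch in Python using only the standard library’s `HTMLParser`. The tag and class names (`div.entry-content`) are hypothetical placeholders, not the site’s actual markup — a real scraper (or the Web Scraper extension used below) would target whatever structure the site really has.

```python
# Sketch of pulling clue-answer lines out of a puzzle page.
# The div class "entry-content" is a hypothetical placeholder.
from html.parser import HTMLParser

class ClueExtractor(HTMLParser):
    """Collect the text of every <p> inside a div with class 'entry-content'."""
    def __init__(self):
        super().__init__()
        self.in_content = False
        self.in_p = False
        self.clues = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and ("class", "entry-content") in attrs:
            self.in_content = True
        elif tag == "p" and self.in_content:
            self.in_p = True
            self.clues.append("")

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_content = False
        elif tag == "p":
            self.in_p = False

    def handle_data(self, data):
        if self.in_p:
            self.clues[-1] += data

# Made-up sample page; a real bot would fetch each archive URL first.
html = '<div class="entry-content"><p>1A. Boxer Muhammad : ALI</p></div>'
parser = ClueExtractor()
parser.feed(html)
print(parser.clues)  # ['1A. Boxer Muhammad : ALI']
```

In practice you’d loop this over every archive page and collect the results, which is essentially what the point-and-click tool below automates.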

There are a lot of fancy ways to do this, but I’m pretty inexperienced in this area. So, instead, I relied on a decently user-friendly Chrome extension called Web Scraper.

Once installed, I was able to access Web Scraper via the inspector (i.e. developer tools) of my browser. It looks something like this…

After setting everything up, I clicked the “Scrape” button and let the computer do its thing.

After a few hours, I came back to my computer to find that Web Scraper had finished, and I exported the data to a CSV.

Important note: I only scraped the data for puzzles published between 2009 and 2016. This means that I can still approach all 2017 puzzles “cold”, without ever having seen any of the clues previously. Thus, I will use the 2017 puzzles to fairly assess my progress throughout the rest of the month.
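The split described above amounts to a simple filter on the puzzle year. A minimal sketch, assuming each record carries a numeric year field (the sample data here is made up):

```python
# Holding out 2017 puzzles as an untouched test set.
# Sample records are made up for illustration.
records = [{"year": y, "word": "ALI"} for y in (2009, 2012, 2016, 2017)]

training = [r for r in records if r["year"] <= 2016]   # scraped for practice
held_out = [r for r in records if r["year"] == 2017]   # reserved for assessment

print(len(training), len(held_out))  # 3 1
```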

Step 3: Clean up the data

With the raw data downloaded, I spent some time cleaning it up until it was in a usable form: each row includes the puzzle’s year, the day of the week, the clue, the answer word, the explanation, and the total number of times that word appears in my dataset.

Then, I sorted the list based on frequency… It turns out that ALI is the most common answer across NYT puzzles from 2009–2016, appearing 96 times.
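Computing that frequency column and sorting by it is straightforward once the rows are structured. Here’s a minimal sketch using Python’s `collections.Counter` — the row layout mirrors the columns described above, but the sample data is made up for illustration:

```python
# Sketch of adding a frequency count to each row, then sorting by it.
# Sample rows are invented; the real dataset has 100,336 of them.
from collections import Counter

rows = [
    {"year": 2009, "day": "Mon", "clue": "Boxer Muhammad", "word": "ALI"},
    {"year": 2011, "day": "Sat", "clue": "Laila's dad", "word": "ALI"},
    {"year": 2014, "day": "Tue", "clue": "Gardner of film", "word": "AVA"},
]

# Count how often each answer word appears across the dataset.
freq = Counter(r["word"] for r in rows)
for r in rows:
    r["frequency"] = freq[r["word"]]

# Sort most-common answers first.
rows.sort(key=lambda r: r["frequency"], reverse=True)
print(rows[0]["word"], rows[0]["frequency"])  # ALI 2
```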

There are 100,336 rows of data, so I’m not exactly sure how I’m going to do anything useful with this, but… I’ll figure that out tomorrow.

Read the next post. Read the previous post.

Max Deutsch is an obsessive learner, product builder, guinea pig for Month to Master, and founder at Openmind.

If you want to follow along with Max’s year-long accelerated learning project, make sure to follow this Medium account.
