Web scraping the James Comey testimony using Python

Matthew Snell
Published in Applied AI
3 min read · Jun 21, 2017

Former FBI Director James Comey gave testimony to the Senate Intelligence Committee on June 8th, 2017. With all the buzz over the past few months surrounding possible Russian influence in the 2016 U.S. Presidential Election, it’s important to take an analytical approach to key moments like this.

To get started, we need the transcript. Luckily for us, Politico did a great job structuring and cleaning the raw transcript recorded from C-SPAN. Next up, we need to do some manual digging in the HTML to locate the data. Lastly, we need a Python web scraping library. My personal favorite, BeautifulSoup, is going to meet that need.

The link to the transcript is great, but unless we know how the page is built we won’t be able to tell BSoup where to look. By utilizing the console (right click -> inspect) we can begin looking for the HTML that houses our data. After some searching we realize the transcript is sitting in a <div> called ‘story-text’, and each section is housed in individual <p> tags. Now we’re ready to start scraping.

Here’s the code required:

What’s happening here?

  • Lines 1 & 2 import the necessary libraries to let Python work its magic
  • Line 4 tells Python where to find the data
  • Line 5 requests the page and opens it
  • Line 6 parses the page with BSoup into the variable soup
  • Line 7 creates a variable called div, which uses soup to find the ‘story-text’ div and then grabs each p tag inside it
  • Lines 8 & 9 loop over and print each p tag in the div (except the last 5, trimmed by the [:-5])

Some final cleanup uses the replace function: replacing new-line characters with spaces makes manipulation easier in other programs like Excel, and replacing colons with pipe characters gives us a pipe-delimited file that separates the speaker from the quote. That’s it!
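The embedded snippet itself didn’t survive in this copy, so here is a stand-in sketch of the same steps. A small inline HTML fragment takes the place of the live Politico page (the real script fetched the transcript URL with urllib instead), and the div/class names follow the structure described above:

```python
from bs4 import BeautifulSoup

# Stand-in for the Politico page (URL not reproduced here); the real script
# fetched it live, e.g. page = urllib.request.urlopen(url)
html = """
<div class="story-text">
  <p>COMEY: Lordy, I hope there are tapes.</p>
  <p>RISCH: Thank you, Mr. Comey.</p>
  <p>Related links and other non-transcript tags</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"class": "story-text"}).find_all("p")

# The article trims the last 5 non-transcript <p> tags with [:-5];
# this sample has only 1 extra, so [:-1] plays the same role here
for p in div[:-1]:
    line = p.get_text().replace("\n", " ").replace(":", "|")
    print(line)
```

The colon-to-pipe swap is what turns “COMEY: Lordy…” into the “speaker | quote” rows used later.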

Further analysis?

We used Excel to further massage the output from the comey.py script (this could’ve been done in Python as well). Although the data was clean at this point, we decided there was value in identifying word-use frequency by individual. So, back to Python!
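That second script also didn’t survive in this copy, so here is a guess at its shape based on the description below: it reads the pipe-delimited rows produced earlier and emits one (speaker, word) pair per word. An in-memory sample stands in for the actual file:

```python
import csv
import io
import string

def words_by_speaker(rows):
    """Yield (speaker, word) pairs from pipe-delimited speaker|quote rows."""
    for row in csv.reader(rows, delimiter="|"):
        for word in row[1].split(" "):             # split the quote at each space
            word = word.strip(string.punctuation)  # drop surrounding punctuation
            if word:
                yield row[0], word

# An in-memory sample; a real run would pass open("comey.csv") instead
sample = io.StringIO("MCCAIN| And I have to ask, what about Clinton?")
for speaker, word in words_by_speaker(sample):
    print(f"{speaker},{word}")
```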

The above snippet preps the output to be loaded into a database for quick analysis.

  • Each word in the 2nd column (row[1]) is split at each space.
  • The data is cleaned up by stripping out punctuation marks.
  • Finally, we print the name of the person (row[0]) and each word, separated by a comma.

An example from this output reveals John McCain said ‘Clinton’ 8 times, almost 4 times more than any other individual (besides Comey).
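A tally like that falls out of the (speaker, word) pairs with a quick collections.Counter pass (a sketch only; the actual counting was done in a database, and the pairs below are illustrative, not the real data):

```python
from collections import Counter

# Illustrative (speaker, word) pairs, shaped like the prep script's output
pairs = [
    ("MCCAIN", "Clinton"),
    ("MCCAIN", "Clinton"),
    ("COMEY", "Clinton"),
    ("COMEY", "Russia"),
]

counts = Counter(pairs)
for (speaker, word), n in counts.most_common():
    print(f"{speaker} said '{word}' {n} time(s)")
```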

Python is a true staple for Data Science. Its intuitive structure and almost pseudocode-like syntax create easy-to-understand, easy-to-follow blocks like the ones seen above. As the Russian investigation continues and more information is gathered, you can bet Python will be used to maximize knowledge.

If you’d like to use the data created above, please visit my Kaggle page. The dataset can be found there. Happy hunting!
