Data Pipeline for the Mueller Report

Elliott Saslow
Apr 23

Get started using data science on the recent Mueller Report!

Goals / Objectives

  • Download the Mueller Report from the command line
  • Convert the PDF to plain text
  • Get ready to apply data science to the Mueller Report!


This is a quick tutorial on how to download the Mueller Report to your computer and convert the PDF to text. With the text in hand, you can begin to run your own analysis on the document and identify specific trends and topics using Natural Language Processing.

The first step is to install the library that we will use to convert the PDF to text: pdfminer.six, a PDF parsing tool that works surprisingly well. It is an easy pip install:

$ pip install pdfminer.six

pdfminer.six is a great way to turn your PDF into text using Python!

Next, we need to download the (redacted) PDF released by Attorney General Barr. To do this, we will use a simple curl command to download the document directly from the command line. I call this ‘pipelining the data’, and since we are using bash, the process is quick. The -O flag saves the document to your current directory under its remote file name.

As a reminder, any command that starts with $ is meant to be run on the command line! The $ is the prompt and is not typed.

$ curl -O <URL of the report PDF>
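If you would rather script the download than type it, the same step can be sketched in Python with only the standard library. This is a minimal sketch, not the method used in this post: the function name `download` is made up for illustration, and the URL is left as a parameter because it is whatever address you pass to curl.

```python
import shutil
import urllib.request
from pathlib import Path


def download(url: str, dest: str) -> Path:
    """Stream the resource at `url` into the local file `dest`."""
    target = Path(dest)
    # Copy in chunks so a large PDF is never held fully in memory.
    with urllib.request.urlopen(url) as response, target.open("wb") as out:
        shutil.copyfileobj(response, out)
    return target


# Usage (substitute the address you would give to curl):
# download("<URL of the report PDF>", "mueller-report-searchable.pdf")
```

Unlike `curl -O`, this version asks for the local file name explicitly, which makes the name used in the next step easy to see.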

Next, we will use the library that we just installed to convert the PDF into a text document. pdfminer.six installs a command-line script, pdf2txt.py, which writes the extracted text to standard output. Our curl command saved the PDF as mueller-report-searchable.pdf, which is the name we pass to the parser. By redirecting the parser's output into a text file, we can quickly and easily save it.

$ pdf2txt.py mueller-report-searchable.pdf > Mueller.txt
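The conversion can also be done from inside Python instead of the shell. The sketch below assumes a recent pdfminer.six release, which provides `extract_text` in `pdfminer.high_level`. The wrapper name `pdf_to_text` and its `extract` parameter are hypothetical; the parameter exists only so the function can be tried out without a real PDF.

```python
from pathlib import Path
from typing import Callable, Optional


def pdf_to_text(pdf_path: str, txt_path: str,
                extract: Optional[Callable[[str], str]] = None) -> int:
    """Write the text of `pdf_path` to `txt_path`; return the character count."""
    if extract is None:
        # Imported lazily so this file loads even without pdfminer.six.
        from pdfminer.high_level import extract_text
        extract = extract_text
    text = extract(pdf_path)
    Path(txt_path).write_text(text, encoding="utf-8")
    return len(text)


# Usage, mirroring the shell command:
# pdf_to_text("mueller-report-searchable.pdf", "Mueller.txt")
```

Expect this to take a while on a document of this size, since every page is parsed before anything is written.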

This saves all of the text from the Mueller Report PDF into a text file called Mueller.txt. If you want to take a look at this file, you can print its contents with the command $ cat Mueller.txt. In addition, you can now interact with this text file just like you would with any text file in Python!

Let me know in the comments if this was helpful, and where you got stuck. I can also write a tutorial on how to read the data in!
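As a small taste of what "interacting with the text file" can look like, here is a rough sketch that counts the most common words in a piece of text. The helper name `top_words` and the four-letter cutoff are arbitrary choices for illustration, not a recommended NLP pipeline.

```python
import re
from collections import Counter


def top_words(text: str, n: int = 10, min_length: int = 4):
    """Return the n most common words that have at least min_length letters."""
    # Lowercase first so "Report" and "report" are counted together.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if len(w) >= min_length)
    return counts.most_common(n)


# Usage with the file produced above:
# with open("Mueller.txt", encoding="utf-8") as handle:
#     print(top_words(handle.read()))
```

The length cutoff is a crude stand-in for removing stop words such as "the" and "and"; a proper stop-word list would work better.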

