Data Pipeline for the Mueller Report

Elliott Saslow
Apr 23 · 3 min read


Get started using data science on the recent Mueller Report!

Goals / Objectives

  • Download the Mueller Report using the command line
  • Convert the PDF to plain text
  • Get ready to apply data science to the Mueller Report!

Introduction

This is a quick tutorial on how to download the Mueller Report to your computer and convert the PDF to text. Once you have plain text, you can run your own analysis on the document and look for specific trends and topics using Natural Language Processing.

The first step is to install the library that we will use later to convert the PDF to text. We will use pdfminer.six, a PDF parsing tool for Python that works surprisingly well and installs with a single pip command:

$pip install pdfminer.six

pdfminer.six is a great way to turn your pdf into text using python!

Next, we need to download the (redacted) PDF released by Attorney General Barr. To do this, we will use a simple curl command and download the document directly from the command line. I call this ‘pipelining the data’, and since we are using bash, the process is quick. The -O flag tells curl to save the document in your current directory under its remote filename.

As a reminder, any command that starts with $ is meant to be run at the command line. The $ stands for the shell prompt, so do not type it yourself.

$curl -O https://cdn.cnn.com/cnn/2019/images/04/18/mueller-report-searchable.pdf
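If you would rather stay in Python for this step, the same download can be sketched with the standard library alone. This is a minimal sketch rather than part of the original pipeline: the helper names filename_from_url and download are made up for illustration, and only the URL comes from the curl command above.

```python
import shutil
import urllib.request
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Same URL as in the curl command above.
REPORT_URL = "https://cdn.cnn.com/cnn/2019/images/04/18/mueller-report-searchable.pdf"


def filename_from_url(url):
    """Return the last path segment of the URL (the name `curl -O` saves under)."""
    return PurePosixPath(urlparse(url).path).name


def download(url, dest=None):
    """Stream the file at `url` to disk and return the local filename.

    Hypothetical helper: if `dest` is not given, the file is saved in the
    current directory under its remote filename.
    """
    dest = dest or filename_from_url(url)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        # Copy in chunks so the whole PDF never has to sit in memory.
        shutil.copyfileobj(response, out)
    return dest
```

Calling download(REPORT_URL) should leave mueller-report-searchable.pdf in your working directory, assuming the file is still hosted at that address.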

Next, we will use the library that we just installed to convert the PDF into a text document. pdfminer.six ships with a command line script, pdf2txt.py, that does the parsing. Our curl command saved the PDF as mueller-report-searchable.pdf, which is the name that we pass to the parser. By redirecting the output of the parser into a text file with >, we can quickly and easily save the result.

$pdf2txt.py mueller-report-searchable.pdf > Mueller.txt
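The conversion can also be done from inside Python. pdfminer.six exposes an extract_text function in its pdfminer.high_level module; the sketch below wraps it in a small helper. The name pdf_to_text and its extract parameter are assumptions added for illustration (the parameter lets you swap in another extractor), and the output may differ slightly in layout from what pdf2txt.py writes.

```python
from pathlib import Path


def pdf_to_text(pdf_path, txt_path, extract=None):
    """Write the text of `pdf_path` to `txt_path` and return the character count.

    `extract` is a hypothetical hook: any callable that takes a PDF path and
    returns a string. By default it is pdfminer.six's extract_text.
    """
    if extract is None:
        # Imported here so this module still loads if pdfminer.six is missing.
        from pdfminer.high_level import extract_text as extract
    text = extract(str(pdf_path))
    Path(txt_path).write_text(text, encoding="utf-8")
    return len(text)
```

With pdfminer.six installed, pdf_to_text("mueller-report-searchable.pdf", "Mueller.txt") should produce a text file comparable to the one from the command above. Expect it to take a while on a document of this size.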

This saves all of the text from the Mueller Report PDF into a text file called Mueller.txt. If you want to take a look at the file, you can print it to the terminal with the command $cat Mueller.txt. In addition, you can now work with this text file just like any other text file in Python!

Let me know in the comments if this was helpful, and where you got stuck. I can also write a tutorial on how to read the data in!
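As a small taste of working with the file in Python, here is a minimal sketch that loads Mueller.txt and counts the most frequent words. The function names, the four-letter minimum, and the simple regular expression are illustrative choices, not a recommended NLP setup.

```python
import re
from collections import Counter
from pathlib import Path


def load_report(path="Mueller.txt"):
    """Read the converted report, replacing any bytes that fail to decode."""
    return Path(path).read_text(encoding="utf-8", errors="replace")


def top_words(text, n=10, min_length=4):
    """Return the `n` most common words with at least `min_length` characters."""
    # Lowercase everything and keep runs of letters and apostrophes as words.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if len(w) >= min_length)
    return counts.most_common(n)
```

Running top_words(load_report()) gives a list of (word, count) pairs. Short words are dropped as a crude stand-in for stop-word removal, so a real analysis would want a proper stop-word list.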

Be sure to follow Future Vision for more great content! Connect with me on Twitter with any questions!
