DATA STORIES | PDF PARSING | KNIME ANALYTICS PLATFORM

Using KNIME to Parse and Analyze PDF Documents

Extracting daily weather recordings with just a few clicks

John Emery
Low Code for Data Science

--

As first published in LinkedIn Pulse

In my professional life I work with data: preparing, cleaning, building models, and creating visualizations. I spend the vast majority of my time using tools like Tableau, Power BI, and Google Data Studio for visualization work, while I build ETL workflows in tools such as Alteryx, KNIME, and Tableau Prep. Outside of work, I have a goal of standing at the highest point of all 50 states. To date, I have been to 24. On October 31st this year, I flew up to New Hampshire in an attempt to summit Mt. Washington, that state's highest point and my would-be halfway point.

Mt. Washington is world-renowned for its horrendous weather. Wind gusts in excess of 230 miles per hour have been recorded, and about one-third of the days of the year experience winds above 100 miles per hour. Although many visitors drive or take the railway to the top during the milder summer months, the weather on the mountain can be extremely dangerous — especially heading into winter. Of course, this makes the idea of climbing it much more appealing to me.

Planning a hike/climb in late October means you have to navigate the potential for early season winter weather. As a data professional, I was interested in seeing what October weather typically looked like to get an idea of my likelihood of reaching the summit. Thankfully, the Mt. Washington Observatory keeps a log of daily weather recordings — high and low temperatures, average and fastest winds, precipitation, etc. — for each month going back to 2005.

Getting the Data

What makes analyzing this data tricky is that the observations are saved as a PDF. As many analysts, data scientists, and other data professionals can attest, dealing with data in PDFs can be a challenge.

Monthly published data by the Mt. Washington Observatory.

To collect the data, I used KNIME to save each month's PDF to my computer. (As an aside, KNIME is a wonderful free and open-source data prep and data science tool. If you want a no- or low-code solution to your data prep needs, it is an excellent choice.)

The URL for the Mt. Washington weather observations is: https://www.mountwashington.org/uploads/forms/2021/10.pdf

The only thing that changes from one month to the next is the year (2021, 2020, 2019…) and the month (10, 09, 08…). Using nodes such as Create Date&Time Range and Table Creator, I was able to easily generate all possible combinations of months and years to build out every URL going back to January 2005. From there, I downloaded the PDFs to a folder on my computer using two String to URI nodes: the first creates a URI path string for the source URL and the second creates a URI path string for the target destination on my computer. As I write this in November 2021, the download process takes about a minute for over 200 PDFs.
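For readers who like to see the logic spelled out, here is a minimal Python sketch of what those nodes accomplish. It assumes the URL pattern above holds for every month; the folder name is illustrative.

```python
import itertools
import pathlib
import urllib.request

# Build every year/month combination from January 2005 onward and download
# the matching PDF. The URL pattern comes from the article; months that
# haven't been published yet simply fail to download and are skipped.
BASE_URL = "https://www.mountwashington.org/uploads/forms/{year}/{month:02d}.pdf"
out_dir = pathlib.Path("mt_washington_pdfs")  # illustrative target folder
out_dir.mkdir(exist_ok=True)

for year, month in itertools.product(range(2005, 2022), range(1, 13)):
    url = BASE_URL.format(year=year, month=month)
    try:
        urllib.request.urlretrieve(url, out_dir / f"{year}-{month:02d}.pdf")
    except OSError:
        pass  # no PDF at that URL (e.g., a future month), so skip it
```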

Prepping the Data

Downloading the PDFs onto a drive on my computer was relatively easy. All that I had to do was generate strings for each URL and then download them. Prepping data from a parsed PDF can be an absolute nightmare, however. Depending on the structure of the PDF that you need to parse, this task can range from quite simple to nearly impossible.

Thankfully, KNIME offers a node called Tika Parser. As KNIME describes it, this node “allows parsing of any kind of documents that are supported by Tika.” The Tika Parser node is ridiculously easy to configure: I simply selected the directory that housed the downloaded PDFs, chose the file type from a list, and then picked which metadata items to output. In this case I selected the filepath and main content. The resulting content output is one long text string per PDF.
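Outside of KNIME, the same extraction can be sketched with Apache Tika's Python bindings. This is a rough equivalent of the node's output (one record per PDF with its path and raw text), not the node's actual internals; the folder name carries over from the download sketch above.

```python
from pathlib import Path

from tika import parser  # pip install tika; requires a Java runtime

# One record per PDF, holding the file path and the document's full text
# as a single long string, mirroring the Tika Parser node's output.
rows = []
for pdf in sorted(Path("mt_washington_pdfs").glob("*.pdf")):
    parsed = parser.from_file(str(pdf))
    rows.append({"Filepath": str(pdf), "Content": parsed.get("content") or ""})
```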

Raw data from the Tika Parser… not the prettiest thing — yet!

Once the data is in its raw string form like this, we can begin splitting it apart, putting fields in distinct columns and each day's observations on distinct rows. To start, I used the Cell Splitter node to split the Content column into a list based on the newline delimiter (\n). Interestingly, when you want to split a string into distinct rows in KNIME, you first split the data into a list and then use the Ungroup node to expand that list into rows. This procedure turns our 203-row data set into a 26,566-row data set.
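In pandas terms, that Cell Splitter plus Ungroup combination is a split followed by an explode. A minimal sketch, assuming the rows list from the parsing example above:

```python
import pandas as pd

df = pd.DataFrame(rows)  # 'rows' comes from the parsing sketch above

# Cell Splitter: turn each Content string into a list of lines.
# Ungroup: explode that list so every line becomes its own row.
df["Line"] = df["Content"].str.split("\n")
lines = df.explode("Line")[["Filepath", "Line"]].reset_index(drop=True)
```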

One thing to note, however, is that the Tika Parser parses the entire PDF, including both the data you want and data you may not want. In my case, I was only interested in the daily weather observations: the rows between the “DAY” header and the “31” record in the image above. Everything else had to go.

Filtering out string data dynamically is a challenge for many analysts. You don't have the luxury of writing “X > 100” or any other tidy mathematical condition. This is where I struggled. I knew I wanted to use regular expressions to extract the characters up to the first space, but KNIME doesn't have a built-in function for regular expression extractions. Thankfully, there is a third-party node called Regex Extractor, and it was precisely what I needed!

Using the Regex Extractor node, I was able to pull out the first “word” from each string and then keep only the rows where that word was either “DAY” or a number between 1 and 31.
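The extract-and-filter step looks like this in pandas; the regular expression captures everything before the first space, and the numeric test mirrors the 1-31 filter (continuing from the sketch above):

```python
import pandas as pd

# Regex Extractor equivalent: capture the characters before the first space.
first_word = lines["Line"].str.extract(r"^(\S+)", expand=False)

# Keep the "DAY" header rows plus any row whose first word is a day number.
day_number = pd.to_numeric(first_word, errors="coerce")
observations = lines[(first_word == "DAY") | day_number.between(1, 31)].copy()
```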

We’re getting there…

With only the observation rows left, I finally split the data into distinct columns based on the space delimiter. From here, only standard data prep issues stood in my way: renaming columns, ensuring fields were given appropriate data types, and so on.
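A final pandas sketch of that column split; the generic column names are placeholders, since the real names come from each PDF's “DAY” header row:

```python
import pandas as pd

# Drop the per-file "DAY" header rows, then split each observation line on
# runs of whitespace so every field lands in its own column.
daily = observations.loc[observations["Line"].str.split().str[0] != "DAY", "Line"]
table = daily.str.split(expand=True)

# Placeholder names and numeric typing; rename to match the PDF's header.
table.columns = [f"field_{i}" for i in range(table.shape[1])]
table = table.apply(pd.to_numeric, errors="coerce")
```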

The Results

This workflow took me about one hour to build from beginning to end. Using just a few simple and easy-to-configure nodes, I was able to construct URLs, download PDFs to my computer, parse each PDF, and then reconstruct the data into an easy-to-use data table.

This is much better.

Having clean data is perhaps the most important prerequisite for sound analysis. Messy, incomplete, or poorly structured data not only causes delays but also frequently results in inaccurate reporting; it ends up being a waste of time for everyone involved. Using a tool like KNIME, I took data stored in a structure that was challenging to analyze and molded it into the structure I needed.

…so what about the results from Mt. Washington?

The day of my climb saw nearly 4 inches of rain (over 5 inches in total fell on the day of the climb and the day before), average winds of about 50 miles per hour, and gusts nearing 90. Looking over the historical data, it was the 15th rainiest day and around the 80th percentile for average wind speed. In short, not a great day for climbing!
