Richard, Panda Wrangler

Avril Aysha
Published in The Startup · 5 min read · Oct 26, 2020

Yes, it’s true. I wrangle with Pandas. On the daily.

Except my Pandas are purely digital, imported into my digital wrangling environment with a simple line of code. I don’t even break a sweat.

And to make sure I really don’t exert myself too much here, I even chop a six-letter word into a two-letter abbreviation. That’s how lazy (accomplished!) a wrangler I have become in the span of just a few weeks.

But what about all that cute, fuzzy fur, I hear you ask? What’s the point of wrangling pandas if you can’t bury your face into that warm mass of cozy fluff? Do not fear! It is 2020 and for every beautiful physical, offline experience we have a worthy substitute:

Bring it on, white walkers, I’m ready for winter!

For real, though…

Inside jokes aside, it’s been a fuzzywuzzy week here trying to wrangle my datasets into some sort of shape. And yes, for the non-DS-initiated, wrangling is an actual verb used to describe the first stage of any Data Science project: importing your data, identifying and dealing with missing values, assigning the right column and row names, etc.

Now that might not sound like such a daunting task…until you realize your datasets have 115,000+ and 225,000+ entries, respectively.

Wrangling Predicament I

Rather unglamorously, my first big challenge was figuring out how to even import the GHCN climate data into my Jupyter notebook. The compressed archive file I downloaded (*.tar.gz) was close to 8GB, and I could not free up enough disk space to extract it (which means the extracted data is larger than 40GB in total). My mentor Guy Maskall helpfully pointed out that Python can read .csv files from within an unextracted tar archive, using the tarfile module. After tweaking the code in the last answer to this StackOverflow thread, I finally managed to extract a single file (containing the climate measurements from a single weather station)…only to discover my next challenge.
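For anyone stuck at the same point, here is a minimal sketch of reading one file out of a .tar.gz without unpacking the whole archive. It is not the StackOverflow code I adapted, just an illustration using the standard tarfile module. The function name `read_member_text`, the member path `ghcnd_all/DEMO0000001.dly`, and the tiny demo archive are all made up so the snippet can run on its own.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def read_member_text(archive_path, member_name):
    """Return the text of one file inside a .tar.gz without extracting to disk."""
    with tarfile.open(archive_path, mode="r:gz") as archive:
        # getmember raises KeyError if the name is not in the archive
        member = archive.getmember(member_name)
        # extractfile returns a file-like object (None for non-regular files)
        handle = archive.extractfile(member)
        if handle is None:
            raise ValueError(f"{member_name} is not a regular file")
        with handle:
            return handle.read().decode("ascii")


def build_demo_archive(archive_path):
    """Write a tiny stand-in archive (hypothetical contents, not real GHCN data)."""
    payload = b"demo station line\n"
    info = tarfile.TarInfo(name="ghcnd_all/DEMO0000001.dly")
    info.size = len(payload)
    with tarfile.open(archive_path, mode="w:gz") as archive:
        archive.addfile(info, io.BytesIO(payload))


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = str(Path(tmp_dir) / "demo.tar.gz")
    build_demo_archive(demo_path)
    demo_text = read_member_text(demo_path, "ghcnd_all/DEMO0000001.dly")
    print(demo_text)
```

One caveat: a gzip-compressed tar has no index, so looking up a member by name still means reading through the compressed stream until it is found. On a multi-gigabyte archive that takes a while, but it needs no extra disk space.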

Wrangling Predicament II

The files are not .csv at all…but .dly.

Turns out this is a little-used fixed-width file format, meaning each column is defined by a fixed number of characters. After an afternoon spent digging through the readme.txt included in the dataset documentation and this very helpful GitLab Snippet written specifically for accessing GHCN data, it was getting dark outside when I finally managed to access the file and read the data. SUCH a good feeling to end the day knowing that you’ve really done something to make the world a better place.
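To make "fixed-width" concrete, here is a rough sketch of parsing a single .dly line with plain string slicing. The column positions are my reading of the GHCN-Daily readme.txt (station ID in characters 1-11, year in 12-15, month in 16-17, element in 18-21, then 31 daily blocks of a 5-character value plus three 1-character flags, with -9999 marking a missing value), so do check them against the readme before relying on this. The function name `parse_dly_line` and the demo line are invented for illustration, and this is not the GitLab Snippet's code.

```python
MISSING = -9999          # sentinel for "no value" (per my reading of the readme)
DAYS_PER_RECORD = 31     # every line carries 31 day slots, even for short months
DAY_BLOCK_WIDTH = 8      # 5 characters of value + 3 single-character flags
FIRST_VALUE_OFFSET = 21  # 0-based index where the first daily value starts


def parse_dly_line(line):
    """Split one fixed-width .dly record into its fields (flags are ignored)."""
    values = []
    for day in range(DAYS_PER_RECORD):
        start = FIRST_VALUE_OFFSET + day * DAY_BLOCK_WIDTH
        raw = int(line[start:start + 5])  # int() tolerates the left padding
        values.append(None if raw == MISSING else raw)
    return {
        "station_id": line[0:11],
        "year": int(line[11:15]),
        "month": int(line[15:17]),
        "element": line[17:21],
        "values": values,
    }


# Made-up demo line: one real-looking value on day 1, everything else missing.
demo_line = (
    "XX000000001" + "2020" + "01" + "TMAX"
    + f"{215:>5}" + "   "
    + (f"{MISSING:>5}" + "   ") * 30
)

record = parse_dly_line(demo_line)
print(record["station_id"], record["year"], record["month"], record["element"])
print(record["values"][:3])
```

Each line covers one station, one month and one variable, so a full station file is just many of these records stacked on top of each other.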

Wrangling Predicaments to Come

Now that I can actually look at my data, I kinda wish I couldn’t. The ambitious dreams I had of glamorous contributions to the study of the effects of climate change on international conflict are starting to wither in the sun as I realize the following predicaments:

  • While there are many weather stations included in the GHCN dataset (115,000+), all with precise location coordinates, these stations are not equally distributed across the globe. In fact, there is a significant concentration of the weather stations in the Western, developed world, with much less coverage in zones of conflict. This of course makes some intuitive sense when you think about it, but it will affect the granularity of my climate data for the areas of the world I’m interested in.
Source: https://www1.ncdc.noaa.gov/pub/data/metadata/images/C00861_GHCN-D_stations.png
  • Not all weather stations collect all measurements. The description of the datasets lists a long series of variables measured, which led me to think I could approach the phenomenon of climate change from a few different angles (such as temperature difference, wind speeds, precipitation levels, etc.). However, many weather stations collect only very basic information on min, max and average temperatures and precipitation levels. From what I’ve managed to inspect of the data so far, it seems like weather stations in conflict zones tend to measure only these basic values. This will restrict the analyses I will be able to conduct.
  • Finally, not all variables are measured consistently over the available years. Even when a station measures only the basic variables of temperature and precipitation, these measurements can be very patchy, with gaps of many years for which no values exist. For example, a quick scatter plot of the data collected at the LEE00147734 station at the Rayak Air Base in Lebanon reveals decades of missing data for all variables included. This is confirmed by an overview of the null values, with close to 2,000 missing precipitation values and more than 11,000 missing temperature values. I will have to evaluate how this missingness compares to that of other weather stations in my areas of interest. If this is the trend across many of the relevant weather stations, I may have to seriously adjust my research question.
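The kind of missingness check described in the last point could be sketched roughly like this. It assumes each monthly record has already been parsed into a dict with `element`, `year`, `month` and a 31-item `values` list where `None` marks a missing day. That record shape, the function name `summarize_missingness`, and the demo numbers are all assumptions for illustration, not my real station data.

```python
import calendar
from collections import Counter, defaultdict


def summarize_missingness(records, first_year, last_year):
    """Count missing days per element and list years with no data at all."""
    missing_days = Counter()
    years_with_data = defaultdict(set)
    for record in records:
        element = record["element"]
        # Only count days that exist in this month, so the padding slots
        # (e.g. "February 30th") are not mistaken for missing measurements.
        days_in_month = calendar.monthrange(record["year"], record["month"])[1]
        real_days = record["values"][:days_in_month]
        missing_days[element] += sum(value is None for value in real_days)
        if any(value is not None for value in real_days):
            years_with_data[element].add(record["year"])
    all_years = set(range(first_year, last_year + 1))
    gap_years = {
        element: sorted(all_years - years_with_data[element])
        for element in missing_days
    }
    # Note: months absent from the input entirely add nothing to missing_days;
    # they only show up through gap_years.
    return dict(missing_days), gap_years


# Made-up demo records (not real measurements).
demo_records = [
    {"element": "PRCP", "year": 2000, "month": 1, "values": [0] * 31},
    {"element": "PRCP", "year": 2002, "month": 2, "values": [None] * 31},
    {"element": "TMAX", "year": 2000, "month": 1, "values": [None] * 10 + [200] * 21},
]

missing, gaps = summarize_missingness(demo_records, 2000, 2002)
print(missing)
print(gaps)
```

Running the same summary over several stations in a region would make it easier to tell whether one station is unusually patchy or whether the gaps are the norm there.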

This story is part of a linked series documenting my progress through my first independent data science project. Find the previous post here.
