Hastily Constructed Precipitation Analysis

Sam Vidovich
Published in Analytics Vidhya
5 min read · Jul 12, 2021

I recently moved out to middle-of-nowhere Ohio because my company went remote and I wanted nicer scenery than the old ‘concrete jungle’ can provide. One notable thing about the region I moved to is that it rains quite a bit more than it did in the city. So much so that someone-who-isn’t-me asked, “Is this like… a normal amount of rain for this area? Or?”

So! Like a good little programmer I got right to work. Is it a normal amount of rain? If not, how abnormal is it? Finding out turned out to be a neat afternoon adventure.

First, I needed to figure out where, exactly, I was meant to get this sort of information. After googling around for approximately seven minutes, I found my way to NOAA, the National Oceanic and Atmospheric Administration. They have all sorts of datasets like this, some going back decades. As it turns out, it’s very easy to get hold of a dataset for a region you’re interested in: you can just use their fabulous mapping tool to select the stations you’d like to see data from! I decided it would be a good idea to get the normals dataset for my region. This was a mistake. I downloaded the data and noted that it came with only a month, not a year. It took me far longer than it should have to understand that I had downloaded the wrong data: normals are long-term averages, not a record of what fell in any particular year.

Fine, fine, fine. I went hunting again and figured out that, much like normals, you can download precipitation data specifically. So I got the precipitation dataset for a carefully selected area around my new home. Amazingly, you can just… drag a box around the stuff you want.

The Weather Region Selected, more or less.

Then, getting the data is rather simple. You select the datasets you’d like, and ‘add them to your cart’, constructing a little order with the data you want. It takes a little time, but soon, you’ll receive an email with a link to the data. And the data is cool!

Some Sample Precipitation Data

It doesn’t fulfill every demand or desire that I’ll ever have, but it has plenty for a quick and dirty analysis. The fields I really cared about were HPCP (hourly precipitation, in inches) and Date. A couple of lines of CSV parsing code and I was ready to go.
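For a rough idea of what that parsing might look like, here is a minimal sketch. It is not the code from the gist. The column names `STATION`, `DATE`, and `HPCP`, the `YYYYMMDD HH:MM` timestamp format, and the station IDs in the sample are all assumptions made for illustration, so check them against the header of your own download.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample in the assumed layout; a real export has more columns.
SAMPLE_CSV = """STATION,DATE,HPCP
COOP:330001,20100101 01:00,0.10
COOP:330001,20100101 02:00,0.20
COOP:330002,20100201 05:00,0.05
"""

def parse_rows(handle):
    """Yield (station, timestamp, inches) tuples from an open CSV file."""
    for row in csv.DictReader(handle):
        # Assumed timestamp format: "YYYYMMDD HH:MM"
        timestamp = datetime.strptime(row["DATE"], "%Y%m%d %H:%M")
        yield row["STATION"], timestamp, float(row["HPCP"])

# io.StringIO stands in for open("your_download.csv", newline="")
rows = list(parse_rows(io.StringIO(SAMPLE_CSV)))
```

Each tuple keeps the station alongside the reading, which is what the roll-up step needs.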

The big hurdle up front was rolling up the data to the monthly level. Since the dataset contained hourly measurements, I needed to roll up by station, then further roll up to an average across the stations for each month. Well, I suppose I didn’t have to go that route. It just seemed nice: if I wanted to use the station-level data later, it wouldn’t be much extra work. The code for this part was rather simple:

Rolling Up by Station and Month

I loaded up a dictionary first by the station, and then by the month. I kept a running total of precipitation from any given month, but also kept the individual measurements, just in case.
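The nested dictionary described above could be sketched roughly like this. This is an illustration rather than the original code: the station names, the `"YYYY-MM"` month key, and the `total` / `measurements` field names are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical (station, timestamp, inches) readings
readings = [
    ("STATION_A", datetime(2010, 1, 1, 1), 0.10),
    ("STATION_A", datetime(2010, 1, 1, 2), 0.20),
    ("STATION_A", datetime(2010, 2, 3, 7), 0.30),
    ("STATION_B", datetime(2010, 1, 5, 9), 0.50),
]

def roll_up(readings):
    """Group readings as station -> 'YYYY-MM' -> running total plus raw values."""
    by_station = defaultdict(
        lambda: defaultdict(lambda: {"total": 0.0, "measurements": []})
    )
    for station, timestamp, inches in readings:
        bucket = by_station[station][timestamp.strftime("%Y-%m")]
        bucket["total"] += inches              # running monthly total
        bucket["measurements"].append(inches)  # keep the raw values, just in case
    return by_station

rolled = roll_up(readings)
```

Using `defaultdict` avoids the "does this key exist yet?" checks that nested dictionaries usually need.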

Something I noted at this point was that some of the measurements wound up being 999.99. According to NOAA documentation, this meant that these datapoints were missing or there wasn’t enough data. Like a good statistician, I threw these in the garbage, and hoped nobody saw.
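Throwing those values in the garbage might look something like the sketch below. The helper name and the tolerance are made up for illustration. The comparison uses a small tolerance rather than `==` only as a precaution, since these are floats.

```python
MISSING_SENTINEL = 999.99  # the value the post describes as "missing"

def is_missing(value, tolerance=1e-6):
    """Return True when a reading is the missing-data sentinel."""
    return abs(value - MISSING_SENTINEL) < tolerance

# Hypothetical hourly readings with two missing values mixed in
raw = [0.10, 999.99, 0.00, 0.25, 999.99]
clean = [value for value in raw if not is_missing(value)]
```

Skipping the sentinel before summing matters quite a bit: a single 999.99 left in a monthly total would swamp every real measurement.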

In the next part, I needed to get the amount of precipitation for each month. This sucked, because I had a lot of stations. I took a wild guess and thought, ‘well, for each month, let’s just do an average of the precipitation quantity across all of the stations.’ I… think I did it right. Right? It seems legit. The code is… less than efficient. But it did the job in a pinch.

Building ‘By Month’ Averages
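One way to do that averaging, sketched under assumptions: the input is a hypothetical station → month → total mapping, and a station with no readings in a given month is left out of that month's average rather than counted as zero. The original code may handle that case differently.

```python
from statistics import mean

# Hypothetical monthly totals (inches) per station
station_month_totals = {
    "STATION_A": {"2010-01": 3.0, "2010-02": 2.0},
    "STATION_B": {"2010-01": 5.0},
}

def monthly_averages(station_month_totals):
    """Average each month's total across the stations that reported it."""
    per_month = {}
    for months in station_month_totals.values():
        for month, total in months.items():
            per_month.setdefault(month, []).append(total)
    # "YYYY-MM" keys sort chronologically as plain strings
    return {month: mean(totals) for month, totals in sorted(per_month.items())}

averages = monthly_averages(station_month_totals)
```

This makes a single pass over the data, collecting totals per month first and averaging at the end.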

Now I was at a point where I was ready to start plotting. Now, I am not a matplotlib wizard. I’ll never claim to be. That’s not to dog on anyone who is — you’re out there, you’re doing it big, you’re in the penthouse area. I… just haven’t advanced that part of my skillset yet.

But! I can make an ugly scatterplot real quick. Now, you’ll remember I had the dates all squared away. I really, really wanted those on the plot. But there were at least 100 of them, so I decided to ditch them. If you know how to do it so that it doesn’t suck, please feel free to leave a comment. Here’s the plot that came out the other side!

Hmm yes the data here is made of data
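On the date-label problem, one common approach is to label only every Nth point instead of all of them. Here is a sketch of that idea. The helper names and the `"YYYY-MM"` labels are hypothetical, and matplotlib is imported inside the plotting function only so the tick-thinning helper can be tried without it.

```python
def thin_ticks(labels, max_ticks=10):
    """Pick evenly spaced positions so at most max_ticks labels are drawn."""
    step = max(1, -(-len(labels) // max_ticks))  # ceiling division
    positions = list(range(0, len(labels), step))
    return positions, [labels[i] for i in positions]

def plot_monthly(labels, values, max_ticks=10):
    """Scatter the monthly values and label only a subset of the dates."""
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(range(len(values)), values)
    positions, shown = thin_ticks(labels, max_ticks)
    ax.set_xticks(positions)
    ax.set_xticklabels(shown, rotation=45, ha="right")
    ax.set_ylabel("Average precipitation (inches)")
    fig.tight_layout()
    return fig

# Ten years of hypothetical month labels: "2010-01" through "2019-12"
months = [f"{2010 + i // 12}-{i % 12 + 1:02d}" for i in range(120)]
positions, shown = thin_ticks(months, max_ticks=10)
```

With 120 months and ten ticks, this labels each January and leaves the rest of the axis uncluttered.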

So yes, lots of precipitation around these parts. But, is there an increasing trend? Decreasing? None at all!?

To find out, I brushed off my dusty, dusty linear algebra knowledge and coded up a quick linear regression. Are there libraries for this? Sure. But what’s the fun in that? It’s not like I actually know how to use NumPy anyway.

Linear Regression Code

I formed the numerator and denominator separately because… Eh, because uh…
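For reference, a simple least-squares fit with the numerator and denominator formed separately might look like the sketch below. This is not the gist's code; the function name and sample data are made up. One practical benefit of keeping the denominator separate (not necessarily the reason here) is that you can check it for zero before dividing.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # slope = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean) ** 2)
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    denominator = sum((x - x_mean) ** 2 for x in xs)
    if denominator == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = numerator / denominator
    intercept = y_mean - slope * x_mean
    return slope, intercept

# Hypothetical data lying exactly on y = 2x + 1
xs = [0, 1, 2, 3]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
```

For the rainfall plot, `xs` would be the month index (0, 1, 2, …) and `ys` the monthly averages. A slope near zero is what "it’s always been raining a lot here" looks like numerically.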

From the plot that pops out, and no further analysis (because, of course, none is needed. The code is 100% perfect, and fine. Don’t look at it. Don’t look at the data either.), we can conclude that though it may be raining a lot here, it’s kind of… always been raining a lot here. Which is a terribly uninteresting result, but, hey, it’s a result nonetheless.

The Resultant Regression Line

Goodness of fit? What are you on about? “Using the right tools?” Go away, will you? I was bored on a Sunday afternoon.

Thanks for the read! You can find the code in this GitHub gist. If you see any glaring issues with the code / analysis, bahabahbah, please comment! I’ll learn something new.

Citations

Data: User Engagement and Services Branch. DOC/NOAA/NESDIS/NCDC > National Climatic Data Center, NESDIS, NOAA, U.S. Department of Commerce.

Sam Vidovich
Analytics Vidhya

Programmer from Ohio. You can expect bad math and worse programming.