Data Analysis for the Non-Analyst:

Simple data analysis walk-through from one novice to another: Using Transportation Data

erik james mason
The Startup
8 min read · Sep 21, 2020


In this article, I attempt to demonstrate that with a modest amount of coding experience, a bit of data background, and a pinch of study and effort, even an inexperienced individual can discover insightful information in data.

Just a little background which may help provide a preface:

I’m not an analyst.

I’m not actually technically a “data” anything…

My official position is a Planner. It just so happens that what I “plan” is data and data-related activities (collection, compilation, QA/QC, management, and reporting).

Sometimes, it feels like I’m swimming (or drowning) in data and yet I don’t really get the opportunity to fully unravel the data to create a meaningful data product, to truly tell a story with the data I work with.

On that sentiment, I unknowingly embarked on a journey to get more information from the data I worked with; information that could tell a story. The journey continues but I’d like to share what I’ve experienced thus far.

So…

How does someone like me, with no experience or background as an analyst, analyze data in a way that can tell a story?

Photo by Alejandro Gonzalez on Unsplash, Lettering by Author

As you can imagine, there’s no one way to analyze data and extract meaningful information. This notebook is a little snapshot of my organic yet cosmopolitan methodology, which goes as follows:

  1. EDA (Exploratory Data Analysis)
  2. Data Wrangling/Cleaning
  3. Visualization/Reporting

There are only 3 required packages and 2 optional packages (and one unlisted optional one, whoops).

Required Packages:

  1. Pandas
  2. Numpy
  3. Plotly Express

Optional Packages:

  1. Pandas Profiling
  2. Datapane
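
For reference, the whole import block is only a few lines (a minimal sketch; the optional imports are commented out until you need them):

import pandas as pd
import numpy as np
import plotly.express as px

# optional extras
# from pandas_profiling import ProfileReport
# import datapane as dp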

This is purposeful because, unless you are a full-time data professional, you probably don’t have excess time to learn the myriad of coding techniques and packages touted around your favorite Medium channels and StackOverflow threads. So simplicity and directness are key.

I’ll try to keep this brief.

Alright then, shall we?

Part 1: EDA (Exploratory Data Analysis)

Photo by fabio on Unsplash

I’m not sure how to even begin talking about EDA, which coincidentally is probably similar to how most non-analysts (or maybe even some analysts) feel about doing EDA.

It can be hard to know where to begin -

but it’s not impossible…

In fact, sometimes you just gotta start walking to get there.

And you’re off, as it were.

Now, granted it may feel overwhelming, but for me, this is the adventure of it all. It’s time to sleuth, explore, dissect, and imagine. I love patterns and puzzle-solving so I find this part riveting.
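
The notebook screenshots don’t reproduce well here, so here’s roughly what the first peek looks like (a sketch; the file name is a placeholder):

df = pd.read_csv('tmc_speed_readings.csv')  # hypothetical file name
df.head()  # first rows: tmc_code, measurement_tstamp, the speed fields, travel_time_seconds, data_density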

The example above mirrors actual data from my notebook. You don’t have to be extremely familiar with the data, transportation data, or really much of anything to understand to some degree what is happening.

We have some sort of identifier or ID field (tmc_code), a near-datetime field (measurement_tstamp) that tells us when the recording happened, several speed fields that are all nearly similar, a unique column dealing with travel duration (travel_time_seconds), and a bit of an odd-ball field at the end that is presumably a coded value for the amount of data (data_density).

Don’t second-guess yourself; you probably already figured out how this data works.

The tmc_code is a sort of station (in fact, a raw probe) that records a number of vehicles (data_density) during different timeframes (measurement_tstamp) at certain speeds (speed, average_speed, reference_speed). The amount of time it takes to travel the segment that the tmc_code covers is travel_time_seconds.

So data like this will provoke questions that inspire stories, questions like:

How long does it take to travel on a certain road, and how often is it like that?

But we’re getting ahead of ourselves here; we can still get to know the data a lot more.
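
A few one-liners get you surprisingly far (a sketch):

df.info()          # size, dtypes, and memory usage
df.describe()      # summary statistics; eyeball the max against the 75%
df.isnull().sum()  # missing values per column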

Those are fairly easy lines of code to memorize, and you can already tell a lot about this data and specifically this dataset.

In fact, you probably noticed that:

  • It’s a fairly big dataset (nearly 14 million rows)
  • It has some missing values (but nothing crazy)
  • Something odd is happening with the travel_time_seconds field (hint: look at the max value compared to the 75%)

If you didn’t notice that right away, don’t fret. I’ve found that it’s one of those sort of things that once you’ve noticed it, you start to notice it everywhere.

This is probably also a good time to mention a handy tool called pandas-profiling

If you were second-guessing yourself (don’t worry, I live there), Pandas-Profiling is an awesome tool to verify or clarify the nature and characteristics of your dataset.
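
Generating a full report takes only a couple of lines (a sketch; minimal=True keeps the run tractable on a dataset this size):

from pandas_profiling import ProfileReport

profile = ProfileReport(df, minimal=True)  # minimal mode skips the most expensive computations
profile.to_file('tmc_profile.html')        # standalone HTML report, warnings included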

In fact, remember how we noticed the travel_time_seconds field and its strangely large maximum value?

So did Pandas-Profiling:

Pandas-Profiling SKEWED warning for the travel_time_seconds column

This is my data and I’ll be the first to admit that I don’t fully understand it — but I’m gonna bet that not many roads take 274385.88 seconds (or ~76.22 hours).

Actually, in Alaska, the longest road is the Dalton Highway, which is 414 miles and takes about 12 hours to drive (give or take).

The longest tmc_code (which happens to be on the Dalton) is ~282 miles (segment length in miles is present in other data).

So something is amiss.

This is the part where the EDA reveals its value. This is the part where decisions get made.

Part 2: Data Wrangling/Cleaning

Photo by Scott Graham on Unsplash

Data wrangling/cleaning can get pretty intense, and some pretty smart people out there do things I don’t fully understand. But I have learned enough to make things come together in a dataset.

Like how this dataset is unnecessarily big, or more specifically, inefficient.
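
Something along these lines does the trick (a sketch, assuming the column names described earlier):

# parse the timestamp and shrink everything else
df['measurement_tstamp'] = pd.to_datetime(df['measurement_tstamp'])
for col in ['speed', 'average_speed', 'reference_speed', 'travel_time_seconds']:
    df[col] = pd.to_numeric(df[col], downcast='float')  # float64 -> float32
for col in ['tmc_code', 'data_density']:
    df[col] = df[col].astype('category')                # repeated strings -> categories
df.info(memory_usage='deep')                            # check the new footprint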

Output using df.info()

That’s better.

But it’d be even better if we had fields to easily filter things like time and location.
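
Time fields are a handful of accessor one-liners; location would come from joining a TMC identification table (the lookup file here is hypothetical):

# easy time filters straight off the timestamp
df['year'] = df['measurement_tstamp'].dt.year
df['month'] = df['measurement_tstamp'].dt.month
df['hour'] = df['measurement_tstamp'].dt.hour
df['weekday'] = df['measurement_tstamp'].dt.day_name()

# location lives in a companion lookup table (hypothetical file)
# tmc_id = pd.read_csv('TMC_Identification.csv')
# df = df.merge(tmc_id[['tmc', 'county']], left_on='tmc_code', right_on='tmc')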

Neat.

It’s definitely a bigger dataset now, but it’s all useful and more efficient.

The data wrangling here is not very sophisticated and there is a lot more that we could do or do differently, but for brevity’s sake, let’s get to the fun stuff.

Part 3: Visualization/Reporting

Visualization is one of my favorite parts of doing anything, as I tend to really get into aesthetics and UX concepts.

As far as visualizing large datasets goes (like millions of rows across thousands of features), I’m still looking for an insightful and efficient way to display a dataset that size. But otherwise, you can:

Reduce
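
For example, collapsing millions of per-interval readings into daily means makes the data plottable (a sketch, reusing the cleaned df from above):

# one daily mean speed per segment instead of per-interval readings
daily = (df.set_index('measurement_tstamp')
           .groupby('tmc_code')['speed']
           .resample('D')
           .mean()
           .reset_index())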

Map

ArcGIS Pro Map (by author)

Constrain

# per-segment spread of travel times, widest first
ttr_df_juneau_std = ttr_df_juneau.groupby('tmc_code')['travel_time_seconds'].std().sort_values(ascending=False)

Ultimately, for simplicity’s sake, we’re going to constrain our scope.

This is actually fitting because, at least in my business unit, people generally want to know about certain subsections, whether by area, time, or facet.

We’ll focus on the tmc_codes with the widest or most irregular spread of values in Juneau, Alaska (my hometown).

So… How can we visualize these potentially unreliable tmc_codes?

Plotly Express makes visualization very simple and direct, which is perfect when you’re just assessing things and feeling out the edges (see the sketch after this list):

  • with animations
  • interactive visualizations of different plots
  • combining plots
  • or parsing facets
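
A box plot of per-segment travel times, colored and animated, takes only a few arguments (a sketch; ttr_df_juneau is the Juneau subset from the constraining step, and month is the field derived earlier):

# spread of travel times per segment, stepped through month by month
fig = px.box(
    ttr_df_juneau,
    x='tmc_code',
    y='travel_time_seconds',
    color='data_density',     # one trace per probe-density code
    animation_frame='month',  # slider to step through the year
)
fig.show()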

And of course, being able to map it on the fly without using a GIS platform/framework is a big advantage as well:
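
Something like this sketch does it (the latitude/longitude columns are hypothetical, joined in from a TMC lookup; the open-street-map style needs no Mapbox token):

fig = px.scatter_mapbox(
    ttr_df_juneau,
    lat='latitude',   # hypothetical columns from a TMC identification table
    lon='longitude',
    color='travel_time_seconds',
    zoom=10,
)
fig.update_layout(mapbox_style='open-street-map')  # free tiles, no token required
fig.show()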

Hopefully, you can see that fairly insightful and attractive visualizations can be accomplished with just a little study of the arguments. There’s no need to become a web-app developer just to communicate something meaningful to stakeholders or your team.

Speaking of which, what is a good way to share this visualization with someone?

Reporting (or sharing) is a key part of visualization because, while running a notebook may work for a live or recorded presentation, it isn’t practical to expect someone else to run your notebook just to view your data approach, methods, and results.

There are other ways to share and deploy solutions/visualizations, but I’ve really enjoyed the simplicity of Datapane.

With a very minimal amount of code (and a login if you’re uploading), you can share your results with anyone publicly or via a private link.
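
The whole thing fits in a couple of lines (a sketch against the Datapane API as it was in 2020; later releases have renamed some of these methods):

import datapane as dp

report = dp.Report(dp.Plot(fig))            # wrap the Plotly figure
report.publish(name='juneau-travel-time')   # push to your Datapane account for sharing
# report.save(path='juneau-travel-time.html')  # offline alternative: standalone HTML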

This way, they can inspect your data and see if it is truly the sort of meaningful insight they are looking for, then download it themselves.

Conclusion:

Data analysis, as I understand it, is any process to collect, inspect, clean, and transform data into meaningful insight and communication.

Though I do aspire to gain more technical skills in this area, I don’t personally see why creating stories from data must wait until one has mastered certain techniques, tools, and applications.

Would I bet the direction of my section and our objectives on my simple analysis alone? Maybe not just yet.

But it is a powerful way to substantiate an observation and to tell a story, which I think anyone, including the non-analyst, can do.

In case you missed it, here’s the link to my notebook!

Thank you!
