# The power of multiple datasets and the insights hiding in them

There are many reasons planners like to combine Strava Metro data with their bike counter data. The two most common reasons are:

1) to find out what share of the biking population Strava Metro represents

2) to create expansion factors, so that they can use Strava Metro to analyze across their entire network (not just the places they have counters).

But I’m here to pose a third, rather nerdy reason: **to find cool insights that are hiding in your data.**

(If you’re looking for a more hands-on tutorial for how to do Strava Metro / Bicycle Counter correlation analysis, check out this guide or send me a message.)

This example is based on research I did using bike count data from New York City’s Department of Transportation.

My initial intent was to answer the two questions mentioned above, so I was looking at how the Strava Metro data correlates with NYC DOT’s bicycle counter data on four East River bridges: Brooklyn Bridge, Manhattan Bridge, Williamsburg Bridge, and Queensboro Bridge.

Things started out fairly routinely. For each bridge, I took the number of Strava bicycle trips for each day and compared it to the count of trips recorded by the city’s bicycle counter equipment. For each month, I calculated the R-squared value from the daily data. (R-squared measures how much of the day-to-day variation in one dataset can be explained by a linear relationship with the other: a value near 1 means the two move together closely, and a value near 0 means they barely track each other.)
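For readers who want to try the monthly R-squared step themselves, here is a minimal sketch in plain Python. It assumes the daily data has already been joined into `(date, strava_trips, counter_trips)` rows; the function names and the numbers are made up for illustration and are not the actual NYC DOT or Strava Metro values.

```python
from collections import defaultdict
from datetime import date


def r_squared(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")  # no variation, so correlation is undefined
    return sxy ** 2 / (sxx * syy)


def monthly_r_squared(rows):
    """Group daily rows by (year, month) and compute R-squared per month."""
    by_month = defaultdict(lambda: ([], []))
    for day, strava_trips, counter_trips in rows:
        xs, ys = by_month[(day.year, day.month)]
        xs.append(strava_trips)
        ys.append(counter_trips)
    return {month: r_squared(xs, ys) for month, (xs, ys) in by_month.items()}


# Hypothetical daily values, for illustration only.
rows = [
    (date(2019, 4, 1), 100, 2000),
    (date(2019, 4, 2), 150, 3000),
    (date(2019, 4, 3), 120, 2400),
    (date(2019, 5, 1), 100, 2000),
    (date(2019, 5, 2), 150, 2900),
    (date(2019, 5, 3), 120, 2500),
]
result = monthly_r_squared(rows)
for month, value in sorted(result.items()):
    print(month, round(value, 3))
```

In this toy data the April counts are an exact multiple of the Strava trips, so April comes out at an R-squared of 1; real data will be noisier.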

In most places I found R-squared values that were consistently strong, as high as 0.96. However, on the Queensboro Bridge for the month of May 2019, I found an R-squared value of 0.01. Effectively no correlation. So what was going wrong? Why might this bridge have a different relationship to the Strava sample in May than all the other comparisons?

I had two possible paths to troubleshoot the issue. The first was to calculate the percentage of bicycle trips from the counter that were also captured by Strava for each day, and look at the ranges. Immediately, I noticed that while the median percentage was 5.7% on this bridge, the max was nearly 80%. And here I had my culprit — on May 5th, 2019, nearly 80% of bicycle trips that were captured by the bicycle counter were also logged on Strava, a deviation from the rest of the dataset.
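The percentage check can be sketched in a few lines of Python. This is one possible approach, not necessarily the exact workflow I used: it computes each day’s Strava share of counter trips and flags days whose share is far above the median. The `ratio_limit` threshold of 3 is an arbitrary assumption, and the daily values are made up for illustration.

```python
from statistics import median


def flag_outlier_days(daily, ratio_limit=3.0):
    """Return the median Strava share and the days far above it.

    daily: dict mapping a day label to (strava_trips, counter_trips).
    ratio_limit: hypothetical cutoff; a day is flagged when its share
    exceeds ratio_limit times the median share.
    """
    shares = {
        day: strava / counter
        for day, (strava, counter) in daily.items()
        if counter > 0  # skip days where the counter recorded nothing
    }
    typical = median(shares.values())
    flagged = {
        day: share
        for day, share in shares.items()
        if share > ratio_limit * typical
    }
    return typical, flagged


# Hypothetical daily values, for illustration only.
daily = {
    "2019-05-03": (55, 1000),
    "2019-05-04": (60, 1000),
    "2019-05-05": (800, 1000),
    "2019-05-06": (57, 1000),
    "2019-05-07": (50, 1000),
}
typical, flagged = flag_outlier_days(daily)
print(f"median share: {typical:.1%}")
for day, share in flagged.items():
    print(f"check {day}: {share:.1%}")
```

Comparing against the median rather than the mean is a deliberate choice here, since a single extreme day pulls the mean upward and can hide itself.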

*Big Data lesson number one — when working with big data, always set up ways to find outliers, and know how to correct or remove them when you find them.*

(It turned out to be very important that I was working with daily data. That makes intuitive sense: daily or finer resolution matters when analyzing biking, since people’s ability and willingness to bike varies so much from day to day.)

The second troubleshooting path was to chart the two datasets. As a visual learner, I often put data in a chart to get a sense of what story the data is (or seems to be) telling. By plotting both bicycle counter activities and the Strava Metro activities, I could immediately see that May 5th was, indeed, a problem.

Had I only looked at a chart of the bicycle counter data, I wouldn’t have noticed a problem with May 5th. I likely would have investigated May 12th instead (I checked on that too — the dip was due to unseasonably bad weather!).

Charting both datasets, I identified May 5th as the cause of the issue. Unsurprisingly, once I removed the May 5th data, the two datasets showed a much stronger correlation.

Another way to chart this data is to create a scatterplot of all of the corresponding data points, and then plot the trendline.

Again, once I removed May 5th from the data, the correlation was much stronger.
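To put numbers on the scatterplot comparison, here is a minimal sketch that fits a least-squares trendline with and without a suspect day and reports the R-squared of each fit. The helper names and the data points are hypothetical and only meant to mimic the shape of the problem.

```python
def fit_trendline(xs, ys):
    """Ordinary least-squares line through (x, y) points.

    Returns (slope, intercept, r_squared).
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept, sxy ** 2 / (sxx * syy)


def fit_excluding(points, exclude=()):
    """Fit the trendline after dropping any day labels listed in exclude."""
    kept = [xy for day, xy in points.items() if day not in exclude]
    xs, ys = zip(*kept)
    return fit_trendline(xs, ys)


# Hypothetical (strava_trips, counter_trips) pairs, for illustration only.
points = {
    "2019-05-03": (100, 2000),
    "2019-05-04": (150, 3100),
    "2019-05-05": (900, 1200),  # made-up event day
    "2019-05-06": (120, 2300),
    "2019-05-07": (80, 1700),
}
_, _, r2_all = fit_excluding(points)
_, _, r2_clean = fit_excluding(points, exclude={"2019-05-05"})
print(f"R-squared with all days:    {r2_all:.2f}")
print(f"R-squared without the day:  {r2_clean:.2f}")
```

Keeping the exclusion explicit, rather than deleting the row from the source data, makes it easy to document which days were dropped and why.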

*Big Data lesson number two — use charts to help you **see** what’s happening in the data.*

But what actually happened on May 5th? Removing the day without knowing why wouldn’t give us confidence in the analysis.

A quick online search revealed that May 5th was the day of the Five Boro Bike Tour, which brings 32,000 people together to tour NYC by bike on a route that crosses the Queensboro Bridge.

So now that I knew *what* had happened on that day, I needed to know *why* it was influencing the data in this particular way. I turned to the folks over at NYC DOT who oversee the bicycle counter and data program, and they immediately knew the answer. The bicycle counter on the Queensboro Bridge is set up in the bike lane, which means it only counts bicycles traveling in that lane. For most of the year this works great, but for the Five Boro Bike Tour, the *additional travel lanes* are opened to people on bikes. So while the counter kept counting people in the bike lane, it didn’t count the people traveling in the other lanes. Since it doesn’t matter which lane you’re in when you track your ride on Strava, the Strava Metro data captured people across the entire bridge. Mystery solved.

*Big Data lesson number three — find a source of local knowledge to explain things you can’t.*

With all that context, I was able to remove that day’s data from the dataset and continue with the correlation work. It also prompted me to set up two workflows for checking my data: one based on calculating the range of daily percentages, and the other based on creating line charts and scatterplots to identify anomalies.

Here are three quick takeaways on working with Big Data:

- Always plan for ways to find outliers in your data (you will find them)
- Use charts to help you see what’s happening in the data
- Find a source of local knowledge to explain anything you can’t

Thanks for reading! I’ll be sharing more unusual insights from working with active travel data over the coming weeks. Drop me a note or comment if you have questions!