Big Data, BigQuery and a story of bikes

Eric Wall
Google Cloud - Community
6 min readMar 7, 2017

“Big Data” as a buzzword is inescapable. The amount of data collected daily is so staggering that it is nearly beyond comprehension. As of 2015, 2.5 million terabytes of new data is created every day. Applications stretch from simple consumer profiling to predictive modeling for topics ranging from scientific research to public policy. The impact, albeit usually hidden, of big data on daily life is undeniable.

Big data often feels like an inaccessible monolith to a casual observer, so when I first read about Google’s BigQuery toolset, I was immediately interested. I attended a talk given by Google developer advocate Sara Robinson at DevFestMN 2017 on the practical applications of using BigQuery illustrated through her analysis of the Citi Bike program in New York City. She finished the talk with a cursory look at the Nice Ride MN bike program which inspired me to try my hand at BigQuery and dive into the world of big data.

Nice Ride MN has been collecting data since the program started in 2010 and makes the data publicly available on their website. I am by no means a SQL expert, so I was pleased that importing the data into BigQuery only took a bit of trial and error, which found me running basic queries in about an hours time.

So, lets take the most basic question: How Nice Ride trips have been taken?

SELECT
COUNT(*) as total_trips
FROM
[mn_nice_ride.trips_2010],
[mn_nice_ride.trips_2011],
[mn_nice_ride.trips_2012],
[mn_nice_ride.trips_2013],
[mn_nice_ride.trips_2014],
[mn_nice_ride.trips_2015],
[mn_nice_ride.trips_2016]
WHERE
duration != 0;

Which yields the impressive result of 2,223,719:

Total trips for Nice Ride MN 2010–2016

So the next obvious question, is how did ridership grow per year?

SELECT
2016 AS year,
COUNT(*) AS number_of_trips
FROM
`mn_nice_ride.trips_2016`
WHERE
duration != 0 UNION ALL….[REST OF YEARS]
Total trips per year for Nice Ride MN 2010–2016

The above breakdown shows pretty steady growth year over year, except for the curious case of the decrease in ridership from 2015 to 2016. So how can we further analyze the data to get a better idea of what may have lead to the downturn? It turns out that another data point in each trip record is if the trip was taken by a subscriber (Member), or non subscriber (Casual).

SELECT
account_type AS account,
COUNT(*) AS num_trips,
FROM
[mn_nice_ride.trips_2010]
WHERE
duration != 0
GROUP BY
account;
Trips per year by member type for Nice Ride MN 2010–2016

This query yielded pretty interesting results. In fact the year over year growth (16%) continued for trips taken by subscribers from 2015–2016, but there was a fairly severe dip (-37%) in the number of causal trips.

I made a chart

So, we have a better idea of what subset of the Nice Ride customers accounted for the decrease in ridership — anything else to get more information before formalizing a hypothesis for the downturn?

Well lets take a look at the popular routes that non subscribers were taking in 2015 vs 2016:

Top 10 trips for non subscribers 2015
Top 10 trips for non subscribers 2016

So immediately apparent in the data is that non subscribers tend to use the service in a tourist capacity, with very popular routes being around the chain of lakes in Minneapolis. We see a definite decrease across the board from the number of trips at these destinations from 2015 to 2016. The most obvious reason for the decrease, or more simply the most logical would be that 2016 had more non ideal biking weather during peak times when a casual user might rent a bike for a tour around the lakes — but how do we know for certain? This is an important question, because it has an huge impact on the bottom line of the Nice Ride coffers, as each one of these popular non member trips accounts for ~$8 of gross revenue.

Using a subset of data from the NOAA (Measured from MSP Airport), I began running queries to look at various weather conditions that might not be conducive for a leisurely bike ride.

SELECT
COUNT(*) AS hot_and_humid,
year
FROM
[mn_nice_ride.weather_2016],
[mn_nice_ride.weather_2015],
[mn_nice_ride.weather_2014]
WHERE
mo > 3
AND mo < 11
AND dewp > 65
AND max > 90
GROUP BY
year;
Hot and Humid Days 2014–2016
SELECT
COUNT(*) AS hot_and_humid,
year
FROM
[mn_nice_ride.weather_2016],
[mn_nice_ride.weather_2015],
[mn_nice_ride.weather_2014]
WHERE
mo > 3
AND mo < 11
AND rain_drizzle = 1
GROUP BY
year;
Number of Rainy Days 2014–2016

So both an increase in uncomfortably hot days, and days with precipitation were measured from 2015–2016. This additional information has given us another avenue to explore with how the weather conditions affect casual ridership. I took a random sample of a hot Saturday in 2016 vs a cool Saturday in 2016 and queried a count of the number of rides for each:

SELECT
COUNT(*) AS casual_trips_hot_day
FROM
[mn_nice_ride.trips_2016]
WHERE
start_date LIKE "7/23/2016%"
AND account_type = "Casual";
Number of trips by casual users on 7/23/2016
Number of trips by casual users on 7/2/2016

Not surprising was the fact that the number of casual riders on the cool comfortable day handily beat the number of those on the hot day. I am a bit surprised that the hot day had as high of ridership as it did, but the data I’m using is fairly incomplete and doesn’t give the whole picture of weather vs ridership — just a basic trend.

While I could run additional queries on the NOAA weather data currently loaded as a BigQuery public data set, I think that a more useful metric would be querying against hourly temperature readings to get a more accurate picture of how weather conditions truly affect ridership numbers.

A few basic questions have been answered, but what business use is there for recording and analyzing this data? Creating a ridership vs weather model could allow Nice Ride to offer incentives to increase the number of trips on hot days, or offering different membership plans focused on weekend use to increase subscribing members, or even utilizing real time data to create a bounty system to decrease station rebalancing could be potential possible applications.

I accomplished my goal of familiarizing myself with BigQuery along with learning a bit more about the intricacies and challenges that the Nice Ride MN program faces. The platform Google has made available seems robust enough for even more advanced data sets, and I think they have done an incredible job at democratizing big data. Hopefully more journalists, and hobbyists are able to utilize this powerful tool to bring better analysis to the complex problems facing the world today in an effort to transform “Big Data” into “Meaningful Data.”

--

--