Exploratory Data Analysis (EDA) of the Chicago Divvy Bikes Dataset
Welcome to this hands-on project, where I explored the dataset of Chicago’s Divvy Bikes. This exploration forms the capstone of the Google Data Analytics Certificate and is approached through the lens of a fictional company. My mission? To unravel key business insights by meticulously following the data analysis process: ask, prepare, process, analyze, share, and act.
The dataset contains an entire year’s worth of unique trips, spanning from January 2022 to December 2022. Within this data, there are two types of riders: casual riders, who opt for single-ride or full-day passes, and Divvy members, with a commitment to annual memberships.
The dataset unfolds across 13 columns: ‘ride_id’, ‘rideable_type’, ‘started_at’, ‘ended_at’, and various geographical markers like ‘start_lat’, ‘start_lng’, and corresponding end locations.
From a strategic company viewpoint, particularly through the eyes of the marketing board, the goal is clear. The company’s future hinges on increasing annual memberships. Thus, the objective of this case study is to dissect how casual riders and annual members engage with the Divvy Bike program in contrasting ways.
How do annual members and casual riders use Divvy bikes differently during the year?
Processing and preparing the data set
The initial focus was the data processing, a crucial step to ensure accuracy and reliability before diving into analysis. This involves checking for errors, null values, and outliers, thus connecting business objectives with data analysis.
In 2022 there were 5.6 million trips; after processing the data, the analysis is built on 4.1 million trips.
Key modifications included converting started_at and ended_at from objects to datetime formats and introducing new columns for enhanced granularity: started_day, started_month, ended_day, and ended_month. This restructuring allows for more nuanced analyses, as seen in subsequent sections.
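A minimal sketch of this conversion step, assuming the data is loaded into a pandas DataFrame. The toy rows are made up, and treating started_day as the weekday name and started_month as the month number is an assumption; the article does not specify the exact encoding.

```python
import pandas as pd

# Toy rows standing in for the real trip data (values are made up).
trips = pd.DataFrame({
    "ride_id": ["A1", "B2"],
    "started_at": ["2022-01-15 08:05:00", "2022-07-03 17:40:00"],
    "ended_at": ["2022-01-15 08:20:00", "2022-07-03 18:25:00"],
})

# Convert the timestamp columns from object (string) dtype to datetime64.
for col in ["started_at", "ended_at"]:
    trips[col] = pd.to_datetime(trips[col])

# Derive weekday and month columns for later grouping.
# Assumption: day = weekday name, month = month number (1-12).
trips["started_day"] = trips["started_at"].dt.day_name()
trips["started_month"] = trips["started_at"].dt.month
trips["ended_day"] = trips["ended_at"].dt.day_name()
trips["ended_month"] = trips["ended_at"].dt.month

print(trips.dtypes)
```

Once the columns are true datetimes, the `.dt` accessor exposes the calendar parts directly, so no string parsing is needed for the day and month breakdowns.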
An important step during this part was adding two vital columns: ride_length and distance_km. The ride_length column is calculated in minutes by subtracting started_at from ended_at, offering insight into the duration of each trip. For distance_km, the Haversine formula was applied to the start and end latitude and longitude, yielding an approximate straight-line distance for each trip.
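The two calculations can be sketched with the standard library alone. This is an illustration under stated assumptions, not the exact code behind the analysis: the function names are hypothetical, the Earth radius of 6371 km is an assumed constant, and the sample coordinates are made up.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def ride_length_minutes(started_at, ended_at):
    """Trip duration in minutes: ended_at minus started_at."""
    return (ended_at - started_at).total_seconds() / 60


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Made-up example trip.
minutes = ride_length_minutes(datetime(2022, 7, 3, 17, 40),
                              datetime(2022, 7, 3, 18, 25))
km = haversine_km(41.89, -87.61, 41.91, -87.63)
print(minutes, round(km, 2))
```

Because the Haversine formula measures the straight line between the start and end points, it understates the distance of any trip that loops around or returns to the same station.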
Finally, null values and outliers were identified and addressed. I chose to delete rows with null values, as well as outliers that don’t align with realistic trip lengths (over 180 minutes or negative) or distances (beyond 10 km or negative), ensuring data integrity and relevance.
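The filtering rules above could look roughly like this in pandas. The toy rows are made up, the constant names are hypothetical, and keeping rides of exactly 0 minutes or 0 km is an assumption, since only negative values are described as removed.

```python
import pandas as pd

# Made-up rows: one valid trip plus one example of each problem case.
trips = pd.DataFrame({
    "ride_id": ["A1", "B2", "C3", "D4", "E5"],
    "ride_length": [12.0, 240.0, -3.0, 25.0, 30.0],   # minutes
    "distance_km": [2.1, 4.0, 1.0, 14.5, None],
})

MAX_MINUTES = 180  # thresholds taken from the text
MAX_KM = 10

# Drop rows with any null value, then keep only realistic trips.
# Series.between is inclusive on both ends by default.
clean = trips.dropna()
clean = clean[
    clean["ride_length"].between(0, MAX_MINUTES)
    & clean["distance_km"].between(0, MAX_KM)
]

print(clean)
```

Here only trip A1 survives: B2 is too long, C3 has a negative duration, D4 exceeds 10 km, and E5 has a missing distance.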
Unraveling Divvy Bike Usage Patterns
To keep the analysis aligned with the core objective, let’s revisit the business statement: Understanding the distinct usage patterns of annual members versus casual riders of Divvy Bikes.
In 2022, Divvy members contributed to 61% of all trips, with casual riders making up the remaining 39%. Intriguingly, research indicates Divvy had around 550,000 unique riders and 43,000 members in the same year. This translates to an average of 2.9 rides per casual rider and a striking 57 rides per member — a 20-fold difference emphasizing the members’ frequent usage.
Interestingly, casual riders tend to have nearly double the trip length compared to members. This discrepancy hints at differing user intentions: casual riders possibly enjoying leisurely rides or weekend explorations, while members likely utilize bikes for practical commuting purposes like work or school. Regarding travel distances, both user groups show similar patterns, with only a 10% variation. Given the approximate nature of distance calculations, it could imply that casual riders opt for longer, perhaps more scenic routes, or simply ride at a more leisurely pace.
Seasonal trends are evident too. There’s a notable surge in rides from late spring to early fall, peaking in summer. Members show a more consistent usage pattern throughout the year, likely influenced by their subscription type. Chicago’s climate data supports this trend, with July’s pleasant weather (average 23°C/73°F) contrasting starkly with January’s chill (-6°C/21°F). July registers an eightfold increase in rides compared to January, suggesting a strong link between weather and bike usage.
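A comparison like the July-to-January ratio can be sketched with a simple group-by. The column name member_casual for the rider type is an assumption (it is not named in this article), and the rows below are made up, so the resulting ratio is illustrative only.

```python
import pandas as pd

# Made-up trips; member_casual is an assumed column name for rider type.
trips = pd.DataFrame({
    "member_casual": ["member", "casual", "member",
                      "casual", "casual", "member"],
    "started_month": [1, 7, 7, 7, 7, 1],
})

# Rides per month, split by rider type (months as rows, types as columns).
monthly = (
    trips.groupby(["started_month", "member_casual"])
    .size()
    .unstack(fill_value=0)
)

# Ratio of all July rides to all January rides.
total = monthly.sum(axis=1)
july_to_january = total.loc[7] / total.loc[1]

print(monthly)
print(july_to_january)
```

Keeping the rider types as separate columns makes it easy to see whether the seasonal swing is driven by casual riders, members, or both.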
Station locations like Streeter Dr. & Grand Ave. and Lake Shore Dr. & North Blvd., close to popular attractions like a pier, museum, and Grant Park, further validate the leisure use of bikes. They are hotspots for tourists and recreational riders.
Lastly, my analysis extends to creating a Graph object using NetworkX, representing a network of interconnected bike stations. The size of each node is determined by the number of departures from that station: the more frequent the departures, the larger the node. This gives a visual indication of each station’s usage, and the largest nodes match the most popular stations listed in the bar graph above. The edges are styled as dotted lines to illustrate the connections between stations.
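A minimal sketch of such a station graph, assuming columns named start_station_name and end_station_name (these names and the toy stations are assumptions, not taken from the article). The drawing call is left as a comment because it needs matplotlib.

```python
import networkx as nx
import pandas as pd

# Made-up trips between three hypothetical stations.
trips = pd.DataFrame({
    "start_station_name": ["Pier", "Pier", "Pier", "Museum", "Park"],
    "end_station_name": ["Museum", "Park", "Museum", "Pier", "Pier"],
})

# Number of departures per station drives the node size.
departures = trips["start_station_name"].value_counts()

G = nx.Graph()
for station, count in departures.items():
    G.add_node(station, departures=int(count))

# One undirected edge per pair of stations connected by at least one trip.
pairs = trips[["start_station_name", "end_station_name"]].drop_duplicates()
for start, end in pairs.itertuples(index=False):
    G.add_edge(start, end)

# Scale node sizes by departures; stations that only appear as
# destinations have no "departures" attribute and default to 0.
node_sizes = [100 * G.nodes[n].get("departures", 0) for n in G.nodes]

# To draw it (requires matplotlib):
# nx.draw_networkx(G, node_size=node_sizes, style="dotted")
print(G.nodes(data=True), G.number_of_edges())
```

An undirected graph merges trips in both directions into a single connection; a DiGraph would be the alternative if the direction of travel matters.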
Insightful Journey Through Data Analysis
In this comprehensive hands-on project, my journey traversed the entire spectrum of the data analysis process: ask, prepare, process, analyze, share, and act.
The data I analyzed consisted of unique trips from January 2022 to December 2022, made by two types of customers: casual riders (who purchase single-ride or full-day passes) and members (who purchase annual memberships).
The objective of the EDA was to understand and determine the usage patterns by customers, trying to answer the question:
How do annual members and casual riders use Divvy bikes differently during the year?
After preparing and processing the data, I was able to work with 4.1 million trips and build analyses such as the preferred rideable type, the proportion of rides by type of user, the ride length and distance for each type of customer, and usage patterns by day, by month, and throughout the year.
Finally, I plotted a map of Chicago by counting the preferred starting stations by latitude and longitude. I also built a graph of the bike network that gives an intuitive, spatial view of it, highlighting key stations based on activity and the connectivity among different locations. The visual nature of this graph makes it easier to identify patterns, such as popular stations or commonly used routes, within the bike-sharing network.
The code behind the article can be found here: Kaggle.
About me
I am a professional with over a decade of experience in marketing, communication, branding, sales campaigns, and managing squads using agile methodology. At the time of posting this article, I am also pursuing a two-year course to gain knowledge in IT, including data analytics. You can connect with me on LinkedIn and GitHub.