Capital Bikeshare — Time Series Analysis

If you have ever been to Washington DC, you know that the best way to get around the city is outside of a car, whether that means walking or biking. The city is full of historical monuments and beautiful scenery that are better appreciated on a slow jaunt that allows for detours and shortcuts than from the confines of a car. Capital Bikeshare is a bike sharing system located in Washington DC. Used by both tourists and residents, it lets people unlock a bike at any of its numerous docking stations, ride it as long as they like, then dock it at any station near their destination. The bikes are available for rental 24 hours a day, 7 days a week, 365 days a year. The abundance of docking stations and the round-the-clock availability make Capital Bikeshare an ideal option for people making a one-way trip, or for those who would rather use a convenient bike sharing service than buy their own bike.

Capital Bikeshare records data every time a rider uses the service and compiles these records into a dataset that is released quarterly. Each record includes the duration of the ride, start date, end date, start station, end station, bike number, and member type. For my analysis I worked with the datasets released during 2016 and 2017. During these two years alone, Capital Bikeshare recorded 7,091,771 rides. With that much data available at the touch of a button, I was curious to explore the dataset and see what I could discover using statistical models and prediction methods.

Before I even started my analysis, I could make some broad assumptions about what I expected to see in the dataset. To start, Washington DC experiences all four seasons, so I assumed that fewer people would ride bikes in the snowy winter than in the other seasons. On the other hand, summers in DC are very hot and humid, which is less than ideal for riding a bike. I therefore also assumed that fewer people would use Bikeshare on the most brutal summer days than on mild-weather days.

The goal of my analysis was to see whether these assumptions were valid. To begin, I created a simple graph showing the number of rides per quarter for both years (shown below). As you can see, the first quarter (January to March) has the fewest rides, most likely because of the snowy weather during those winter months. The fourth quarter (October to December) has the second fewest, which makes sense because the days are getting shorter, the temperatures are dropping, and tourism is down. During the second and third quarters (April to June and July to September), Capital Bikeshare experiences a surge of rentals. This surge is likely the result of increased tourism, as well as people riding bikes to work during the warmer months to bypass traffic and get some exercise.
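The quarterly counts behind a graph like this can be computed in a few lines of pandas. The following is a minimal sketch, not the original code: the column name "Start date" is an assumption based on the fields listed earlier, and the sample rows are made up purely for illustration.

```python
import pandas as pd


def rides_per_quarter(trips: pd.DataFrame, date_col: str = "Start date") -> pd.Series:
    """Count trips in each calendar quarter (e.g. 2016Q1, 2016Q2, ...)."""
    # Parse the start timestamps, then map each one to its calendar quarter.
    start = pd.to_datetime(trips[date_col])
    return trips.groupby(start.dt.to_period("Q")).size()


# Made-up sample rows for illustration only (not real Capital Bikeshare data).
sample = pd.DataFrame({
    "Start date": [
        "2016-01-15 08:00:00", "2016-02-03 17:30:00",
        "2016-05-20 09:10:00", "2016-06-01 12:00:00", "2016-06-18 18:45:00",
        "2016-08-09 07:55:00",
    ]
})

counts = rides_per_quarter(sample)
print(counts)
```

Calling `counts.plot(kind="bar")` on the result would give a bar chart with one bar per quarter.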

I then applied some feature engineering to the dataset and used seaborn to plot its overall trend (shown below). The graph shows the increase in activity in the summer months and the decrease in the winter, and the line of best fit indicates a gradual increase in overall activity per month as time progresses.
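As a rough illustration of this step, the sketch below derives a month feature from the start date, counts rides per month, and fits a straight line through the monthly counts. The exact features used in the original analysis are not stated, so the column names, the `monthly_ride_counts` helper, and the sample data are all assumptions, and `numpy.polyfit` stands in for the line of best fit that seaborn draws.

```python
import numpy as np
import pandas as pd


def monthly_ride_counts(trips: pd.DataFrame, date_col: str = "Start date") -> pd.DataFrame:
    """Aggregate individual trips into one row per calendar month."""
    start = pd.to_datetime(trips[date_col])
    rides = trips.groupby(start.dt.to_period("M")).size()
    idx = rides.index
    # Number of months elapsed since the first month in the data.
    month_index = (idx.year - idx.year[0]) * 12 + (idx.month - idx.month[0])
    return pd.DataFrame({
        "month": idx.astype(str),
        "month_index": np.asarray(month_index),
        "rides": rides.to_numpy(),
    })


# Made-up sample: 1, 2, 3 and 4 rides in January through April 2016.
dates = (["2016-01-10"] * 1 + ["2016-02-10"] * 2
         + ["2016-03-10"] * 3 + ["2016-04-10"] * 4)
sample = pd.DataFrame({"Start date": dates})

monthly = monthly_ride_counts(sample)

# Degree-1 polynomial fit: rides is approximated by slope * month_index + intercept.
slope, intercept = np.polyfit(monthly["month_index"], monthly["rides"], deg=1)
print(monthly)
print(f"slope={slope:.2f} rides per month, intercept={intercept:.2f}")
```

Passing the same two columns to `seaborn.regplot(data=monthly, x="month_index", y="rides")` would draw the scatter of monthly counts together with a fitted line. A positive slope corresponds to the gradual month-over-month growth described above.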

Next, I used seasonal decompose to separate the data into trend, seasonal, and residual components (shown below). These graphs may seem confusing and intimidating at first, but they can be easily broken down. The observed chart shows exactly what you would think: it plots the data as is. Below the observed chart is the trend graph, which smooths the observed data to make the underlying trend more apparent. With the bikeshare data, the trend line shows a progression of peaks and valleys over the course of each year, with the peaks occurring in the summertime and the valleys in the winter months. The seasonal graph isn't as communicative as the others; what we can observe is that the number of daily Capital Bikeshare rides goes up and down repeatedly throughout the year. Lastly, the residual graph displays what is left of each observed data point after the trend and seasonal components are removed. Here the residuals were fairly large, and larger in 2017 than in 2016. This isn't ideal, because it means that the trend and seasonal components leave a lot of the observed data unexplained.

After running a seasonal decompose on the dataset, I ran a linear regression using the statsmodels Python package. This model produced an R-squared value of 0.577, which is lower than I would have liked, but it still captured the overall trend of the dataset. The regression correctly showed that the peak months for bike sharing would be June, July, and August, while the lowest months would be November, December, and January. Looking ahead, the model predicts that this cyclical pattern will continue through 2021, with the overall activity of Capital Bikeshare also increasing at a steady pace.
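A regression of this kind could be set up as follows. This is only a sketch under stated assumptions: the regressors used in the original model are not given, so a linear time index plus one indicator per calendar month is assumed here, and the monthly ride counts are synthetic. It will not reproduce the 0.577 reported above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic monthly ride counts for 2016-2017 (NOT real data): steady growth,
# a summer peak, and some noise.
months = pd.period_range("2016-01", "2017-12", freq="M")
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "t": np.arange(len(months)),       # months elapsed since January 2016
    "month": np.asarray(months.month)  # calendar month, 1 through 12
})
df["rides"] = (
    250_000
    + 1_500 * df["t"]
    + 80_000 * np.sin(2 * np.pi * (df["month"] - 4) / 12)
    + rng.normal(0, 5_000, len(df))
)

# Ordinary least squares with a linear trend and one indicator per calendar month.
model = smf.ols("rides ~ t + C(month)", data=df).fit()
print(f"R-squared: {model.rsquared:.3f}")

# Extend the same two features through December 2021 and predict.
future_months = pd.period_range("2018-01", "2021-12", freq="M")
future = pd.DataFrame({
    "t": np.arange(len(df), len(df) + len(future_months)),
    "month": np.asarray(future_months.month),
})
forecast = model.predict(future)
print(forecast.head(12))
```

Because the month indicators repeat every year and the coefficient on `t` is constant, the forecast from a model like this is by construction the same seasonal cycle shifted up or down along a straight line.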

In the end, I think that my assumptions about Capital Bikeshare usage were valid. The statistical models that I ran on the data all showed that the summer months were the most popular time to use the bikes, while the winter months were less popular. This cyclical pattern is expected to continue in the future, along with an upward trend in overall use of the company's bikes as bike sharing becomes more commercialized and more widely used in cities across the world. If I were to do this analysis again, I would run multiple statistical and prediction models to try to get a higher R-squared value and lower residuals, so that the models are more accurate than the ones used above.
