Finding the Best and Worst Times to Travel

Brandon Fung
INST414: Data Science Techniques
6 min read · Feb 10, 2024

In the bustling heart of New York City, where the metropolitan area is home to over 20 million residents and countless visitors each year, John F. Kennedy International Airport (JFK) stands as a pivotal gateway to the world. Recognized as one of the busiest airports globally, JFK encapsulates the vibrancy and diversity of the city it serves, acting as a critical hub for international and domestic flights alike. Driven by a mix of both positive and negative experiences at JFK, I will conduct an Exploratory Data Analysis (EDA) and delve deeper into the operational aspects of JFK Airport. Through this exploration, I aim to uncover the factors contributing to these varied experiences, hoping to contribute insights that could enhance the travel experience for millions of passengers.

For my analysis, I chose to use the AirLabs API to collect flight departure information out of JFK. This API provides the flight information my analysis needs, such as destinations and scheduled departure times. These metrics can help derive insights regarding foot traffic and the busiest times to travel through JFK. However, because I am on the free plan, each API call returns at most 50 rows of data. Because of this, I have to make an API call once every hour to cover all flight departures. Since I cannot always keep my computer on, I decided to use an AWS EC2 instance to host my program. I created an ETL pipeline using Apache Airflow, which makes an API call every hour, automatically cleans and transforms the response, and finally loads the result into an AWS S3 bucket. From there, I downloaded the data as a CSV file onto my local machine for analysis.

Before I start my analysis, it is critical to mention that my data only spans January and February of 2024. I started this project at the start of the new year, and due to financial constraints, I cannot afford to run the program for multiple months or even years. My main motivation for collecting live data over historical data, however, is to eventually create an interactive Tableau dashboard that travelers can view at any time to evaluate JFK traffic. I believe that a live dashboard with real-time insights will be far more useful for travelers than an analysis of historical data.

To start, I had to call the API so I used the ‘requests’ library. This library was very useful and made it very easy to specify the arguments and fields that I wanted to retrieve. The only argument I used was the Departing Airport so that I could filter all flights to just those departing from JFK. The fields I wanted to retrieve included: Departure Date, Departure Airport, Terminal, Airline, Flight Number, Departure Time, Flight Delay, Arrival Airport, Arrival Time, and Flight Duration. These fields were chosen based on what I thought would be relevant to my analysis, though more fields can be retrieved. This data was then stored in a Pandas data frame to be cleaned and analyzed.
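To make the request step more concrete, here is a minimal sketch of what such a call could look like. It is illustrative rather than the exact pipeline code: the endpoint URL, the query parameter names (`dep_iata`, `_fields`, `api_key`), the `response` key, and the field names in `FIELDS` are all assumptions that should be checked against the AirLabs documentation, and `fetch_departures` and `payload_to_frame` are hypothetical helper names.

```python
import pandas as pd
import requests

# Assumed endpoint and field names; verify them against the AirLabs docs.
AIRLABS_URL = "https://airlabs.co/api/v9/schedules"
FIELDS = [
    "dep_iata", "dep_terminal", "airline_iata", "flight_iata",
    "dep_time", "delayed", "arr_iata", "arr_time", "duration",
]


def fetch_departures(api_key: str, airport: str = "JFK") -> dict:
    """Request scheduled departures for one airport and return the JSON payload."""
    params = {
        "api_key": api_key,
        "dep_iata": airport,            # the only filter: departing airport
        "_fields": ",".join(FIELDS),    # limit the response to these fields
    }
    resp = requests.get(AIRLABS_URL, params=params, timeout=30)
    resp.raise_for_status()             # fail loudly on HTTP errors
    return resp.json()


def payload_to_frame(payload: dict) -> pd.DataFrame:
    """Turn the payload's list of flight records into a DataFrame."""
    records = payload.get("response", [])   # assumed key holding the rows
    # reindex gives a stable column order; fields the API omitted become NaN
    return pd.DataFrame.from_records(records).reindex(columns=FIELDS)
```

With a valid key, `payload_to_frame(fetch_departures(api_key))` would give one row per departure. Keeping the parsing separate from the network call makes it easy to check the transformation against a saved payload without spending an API call.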

To clean the data, I extracted the month, day, and day of the week from the departure date into separate columns. I chose this transformation to simplify my queries for analysis. Afterward, I extracted the hour and minute from the departure time for the same reason. Sometimes a departure date and/or time was not specified, but I felt that my sample size was large enough for the timeframe to simply drop these rows and still retain an accurate analysis. Finally, I extracted the airline from the flight number so that I could explore potential trends across different airlines.
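The cleaning steps above can be sketched in pandas roughly as follows. The column names (`dep_time`, `flight_iata`) and the helper name `clean_departures` are assumptions for illustration, the three sample rows are made up, and the sketch assumes the flight code begins with a two-character airline designator.

```python
import pandas as pd


def clean_departures(df: pd.DataFrame) -> pd.DataFrame:
    """Drop rows without a departure time and split it into calendar parts."""
    out = df.copy()
    # Missing or unparseable timestamps become NaT and are then dropped
    out["dep_time"] = pd.to_datetime(out["dep_time"], errors="coerce")
    out = out.dropna(subset=["dep_time"]).reset_index(drop=True)

    out["month"] = out["dep_time"].dt.month
    out["day"] = out["dep_time"].dt.day
    out["day_of_week"] = out["dep_time"].dt.day_name()
    out["hour"] = out["dep_time"].dt.hour
    out["minute"] = out["dep_time"].dt.minute

    # Assumption: the first two characters of the flight code are the airline
    out["airline"] = out["flight_iata"].str[:2]
    return out


# Made-up rows for illustration only; the second has no departure time
raw = pd.DataFrame({
    "flight_iata": ["AA100", "DL200", "B6300"],
    "dep_time": ["2024-01-19 21:05", None, "2024-02-04 16:40"],
})
clean = clean_departures(raw)
print(clean[["flight_iata", "airline", "day_of_week", "hour", "minute"]])
```

The row with the missing departure time is dropped, and the remaining two rows gain the derived columns used in the groupings that follow.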

For my analysis, I first wanted to figure out the busiest times to travel at JFK Airport. For simplicity, I defined ‘busiest’ as when there were the most departing flights. I grouped my data by day of the week and then proceeded to get the most common departure hour for each day. Finally, I sorted my data frame in descending order and the results can be seen here:

As we can see, the busiest day-and-hour combinations are Fridays from 21:00–22:00, Saturdays from 21:00–22:00, and Sundays from 16:00–17:00.
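One way to compute the most common departure hour per weekday is sketched below. It assumes a cleaned data frame with `day_of_week` and `hour` columns; the function name `busiest_hour_by_day` is hypothetical, the sample rows are made up, and breaking ties by the earlier hour is a choice made for this sketch.

```python
import pandas as pd


def busiest_hour_by_day(df: pd.DataFrame) -> pd.DataFrame:
    """For each weekday, return the departure hour with the most flights."""
    counts = (
        df.groupby(["day_of_week", "hour"])
        .size()
        .reset_index(name="flights")
    )
    # Largest counts first (earlier hour wins ties), then keep one row per day
    return (
        counts.sort_values(["flights", "hour"], ascending=[False, True])
        .drop_duplicates("day_of_week")
        .reset_index(drop=True)
    )


# Made-up rows for illustration only
sample = pd.DataFrame({
    "day_of_week": ["Friday", "Friday", "Friday", "Sunday", "Sunday", "Tuesday"],
    "hour": [21, 21, 9, 16, 16, 7],
})
result = busiest_hour_by_day(sample)
print(result)
```

The result has one row per weekday, sorted so that the day whose peak hour has the most departures comes first.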

If we just want to figure out the busiest travel days, however, we can group the data by day of the week again, and then get the counts of each day. Here is a bar graph, showing the busiest travel days:

Friday and Sunday have very similar flight volumes whilst Tuesday is the least busy day of the week for travel.
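The per-day counts behind a chart like this can be produced with a short pandas sketch. The `day_of_week` column name and the helper `flights_per_weekday` are assumptions for illustration, and the sample rows are made up.

```python
import pandas as pd

WEEKDAYS = [
    "Monday", "Tuesday", "Wednesday", "Thursday",
    "Friday", "Saturday", "Sunday",
]


def flights_per_weekday(df: pd.DataFrame) -> pd.Series:
    """Count departures per weekday, returned in calendar order."""
    counts = df["day_of_week"].value_counts()
    # Reindex into Monday-to-Sunday order; days with no flights show as 0
    return counts.reindex(WEEKDAYS, fill_value=0)


# Made-up rows for illustration only
sample_days = pd.DataFrame({"day_of_week": ["Friday", "Friday", "Tuesday"]})
per_day = flights_per_weekday(sample_days)
print(per_day)
```

Calling `per_day.plot.bar()` would draw the bar graph if matplotlib is installed. Ordering the days by calendar rather than by count makes the weekend build-up easier to read.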

Next, I wanted to see what routes were the busiest. This information could be useful because travelers on popular flights may expect to see more people and thus more traffic. To do this, I grouped my data by the destination airport code (ICAO) and then displayed it in a bar graph here:

Finally, I was curious as to which terminal was the busiest. The busiest terminal will likely have the longest security wait times, which can be one of the biggest tells for someone wondering what their airport experience will be like. Again, I grouped my data, but this time by terminal, and found that Terminal 4 was by far the busiest, as seen below:
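The destination and terminal groupings described above follow the same pattern, so a single helper can serve both. This is a sketch under assumed column names (`arr_icao`, `dep_terminal`); `top_categories` is a hypothetical helper and the sample rows are made up.

```python
import pandas as pd


def top_categories(df: pd.DataFrame, column: str, n: int = 10) -> pd.DataFrame:
    """Count flights per value of `column` and report each value's share."""
    # value_counts sorts in descending order and skips missing values
    counts = df[column].value_counts().head(n)
    return pd.DataFrame({
        "flights": counts,
        # Share of all rows, including rows where `column` is missing
        "share": (counts / len(df)).round(3),
    })


# Made-up rows for illustration only
sample_routes = pd.DataFrame({
    "arr_icao": ["KLAX", "KLAX", "KSFO", "KMIA"],
    "dep_terminal": ["4", "4", "4", "5"],
})
print(top_categories(sample_routes, "arr_icao", n=3))
print(top_categories(sample_routes, "dep_terminal"))
```

Reporting the share alongside the raw count makes it easier to judge how dominant a terminal or route is, since raw counts grow with the length of the collection period.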

In analyzing the airport departure data from JFK, I was able to identify trends and patterns in flight frequencies across days of the week, hours, routes, and terminals. While my study offers valuable insights into operational volumes and potential bottlenecks, it’s important to acknowledge the limitations inherent to my analysis. First, the dataset does not account for external factors such as weather conditions, air traffic control restrictions, and unforeseen operational disruptions, which can significantly impact flight schedules and frequencies. Moreover, my analysis is confined to a relatively short timeframe, limiting its ability to capture seasonal variations and longer-term trends that could provide a more comprehensive understanding of airport operations. Additionally, my analysis does not examine flight cancellations or delays, which are crucial for a complete picture of airport efficiency and passenger experience.

All in all, my analysis provides a glimpse into the power of real-time airport data. Of course, with more data, it would become more robust and thus more reliable. Even so, the insights gathered are valuable for identifying busy hours at the airport and can be a tool that travelers use before they book their flights. In the future, I would like to expand to all airports in the United States so that a wider audience can use the results. There is always room for more analysis and refinement, but this work gives travelers a basic understanding of the conditions at JFK Airport at any given time.

You can find the code for my analysis here.
