Analysing flight data

5 min readJul 18, 2023

In this article I’ll analyse a dataset of all flights that departed NYC in 2013

I’m going to explore the data, ask some questions, and answer them using python and pandas, matplotlib and seaborn.

The visualisations are clear and easy to understand, and they help to illustrate my findings.

To get started, import the necessary tools

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

Next, read and preview the data. You can download it here

df = pd.read_csv("flights.csv")
df.head()

df.head() returns the first 5 rows in the dataset

Next up, Data Cleaning

df2 = df.dropna() # Remove rows that have null or empty values
df2.drop(columns=["Unnamed: 0"], inplace=True) # Remove unnecessary index

# Replace airport codes with their actual names
df2["origin"] = df2["origin"].replace(
    ["EWR", "JFK", "LGA"], ["Newark", "John F. Kennedy", "LaGuardia"]
)

I’m going to state a question, answer it using some python code and a chart for visualisation. While answering these questions, I’ll also do some feature engineering, transforming features like month, and generating new ones like day of the week, so I get get more from the data.

Analysing a single feature of a dataset is referred to as Univariate Analysis

Question 1. Which months had the most flights?

Distribution of flights across months.

# Months with the most flights, sorted by month number
year_map = {
    1: "Jan",
    2: "Feb",
    3: "Mar",
    4: "Apr",
    5: "May",
    6: "Jun",
    7: "Jul",
    8: "Aug",
    9: "Sep",
    10: "Oct",
    11: "Nov",
    12: "Dec",
}
df2["month_name"] = df2["month"].map(year_map)
sns.countplot(x="month_name", data=df2, order=year_map.values())
plt.xlabel("Month")
plt.ylabel("Number of Flights")

A bar chart showing months with the most flights, sorted by month number

Question 2. Do people fly more during the weekends? (Friday, Saturday, Sunday)

The relationship between the day of the week and the number of flights.

To answer this, I’ll combine year, month, day into one column. I’ll call the new column date and we’ll derive another column day_of_weekfrom it. Next, I’ll create a cross-tabulation of the day of the week and the number of flights. This will show us if there is a significant difference in the number of flights on weekdays versus weekends.

# Merge year, month, day into one date column
df2["date"] = pd.to_datetime(df2[["year", "month", "day"]])
# Categorize date into Weekday and Weekend
df2["day_of_week"] = df2["date"].dt.dayofweek.apply(
    lambda x: "Weekday" if x < 5 else "Weekend"
)

# Plot weekday vs weekend flights
sns.countplot(x="day_of_week", data=df2)
plt.xlabel("Day of Week")
plt.ylabel("Number of Flights")

It’s clear from the above graph that there were more flights on weekdays by a wide margin.

Question 3. Do you have more flights in the morning, afternoon, or evening?

The distribution of flights across time of day

Let’s categorise flights by morning, afternoon or evening by creating a column called time_of_day.

To answer this question, I’ll create pie chart of the number of flights per time of day. This will show us which time of day has the most flights and which time of day has the fewest flights.

def time_of_day(x):
    if x < 12:
        return "Morning"
    elif x < 17:
        return "Afternoon"
    else:
        return "Evening"


df2["time_of_day"] = df2["hour"].apply(time_of_day)
df2["time_of_day"].value_counts().plot(kind="pie", autopct="%1.1f%%")
plt.xlabel("")
plt.ylabel("")

Most flights were in the morning, followed by the afternoon, and the least number of flights were in the evening.

Bivariate Analysis

To get more insights from the data, I’ll explore the relationship between any two features in the dataset.

Question 4. Which airline has the most delays?

The relationship between airline and the number of delays.

The answer to this question can be found by creating a cross-tabulation of the airline and the number of delays. This will show us which airline has the most delays and which airline has the fewest delays.

carrier_map = {
    "9E": "Endeavor Air Inc.",
    "AA": "American Airlines Inc.",
    "AS": "Alaska Airlines Inc.",
    "B6": "JetBlue Airways",
    "DL": "Delta Air Lines Inc.",
    "EV": "ExpressJet Airlines Inc.",
    "F9": "Frontier Airlines Inc.",
    "FL": "AirTran Airways Corporation",
    "HA": "Hawaiian Airlines Inc.",
    "MQ": "Envoy Air",
    "OO": "SkyWest Airlines Inc.",
    "UA": "United Air Lines Inc.",
    "US": "US Airways Inc.",
    "VX": "Virgin America",
    "WN": "Southwest Airlines Co.",
    "YV": "Mesa Airlines Inc.",
}
df2.groupby("carrier")["dep_delay"].mean().sort_values(ascending=False).plot(
    kind="bar",
)
plt.xlabel("Airline")
plt.ylabel("Average Delay (mins)")

Frontier airlines and ExpressJet had the highest flight delay in minutes, while Hawaiian and US Airways has the lowest.

Question 5. Which origin has the most delays?

The relationship between origin and the number of delays.

This will show us which origin has the most delays and which origin has the fewest delays.

df2.groupby("origin")["dep_delay"].mean().sort_values(ascending=False).plot(
    kind="pie", autopct="%1.1f%%"
)
plt.xlabel("")
plt.ylabel("")

Bonus Question. Which airport was the busiest, based on the number of flights?

Distribution of flights across airports

I’ll answer this question by creating a bar plot of the number of flights per airport. This will show you which airports have the most flights and which airports have the fewest flights.

df2["origin"].value_counts().plot(kind="bar")
plt.xlabel("Airport")
plt.ylabel("Number of Flights")

Conclusion

I learned some interesting things from my analysis. People preferred to fly on weekdays. As to why, I believe the answers lie in the data. You can find the notebook on GitHub. You’re welcome to play around with it, and maybe you’ll discover something interesting.

References

NYC 2013 Flight Data CSV and DOC

If you found this helpful, please consider following me on Twitter, reacting to this post or leaving a comment. Cheers!