Data Cleaning with Power Query

Dooter Ior
4 min readAug 30, 2024

--

Introduction

Data is often messy, but with the right tools and techniques, it can be transformed into a powerful asset. The dataset from Kaggle — US Airline Flight Routes and Fares 1993–2024, packed with details from flight dates and routes to fares and passenger counts, presented an ideal opportunity to showcase the transformative potential of Power Query in Excel.

The Dataset and the Challenges

The dataset included alot of information, but not all of it was relevant or ready for analysis. To extract meaningful insights, I had to clean and prepare the data meticulously. This process involved:

  • Dropping Irrelevant Columns: Many columns contained redundant or irrelevant information that didn’t contribute to the analysis.
  • Renaming Columns: Standardizing column names made the dataset more understandable and easier to navigate.
  • Splitting and Trimming Columns: Data was often buried within other data, requiring the splitting of columns to isolate useful information.
  • Merging Tables and Mapping Fields: Integrating data from multiple tables allowed for a more holistic view of the flight operations.
  • Performing Calculations and Creating New Fields: Custom calculations and derived fields enabled deeper analysis, such as determining trends over time.

Power Query: The Engine Behind Data Cleaning

Power Query was instrumental in turning this raw data into a clean, structured dataset. Its robust data transformation capabilities enabled me to:

  • Automate the Cleaning Process: Once set up, Power Query can automatically apply the same cleaning steps to updated datasets, saving time and reducing errors.
  • Merge and Append Data Seamlessly: Integrating data from multiple sources or tables was straightforward, ensuring all relevant information was available for analysis.
  • Create Custom Columns and Perform Calculations: Power Query’s ability to generate new columns based on complex logic was crucial for answering the analytical questions.

The Insights Uncovered

With the data cleaned and structured, I was able to answer several key questions:

  • Passenger Count Per Year: reveals a steady increase in the average number of passengers, indicating sustained growth in the airline industry. The slight dip in 2020 stands out, corresponding with the global impact of the COVID-19 pandemic. This setback was temporary as the data shows a recovery in passenger numbers post-2020 as travel restrictions were eased.
  • Discount Trends Per Quarter/Year: the first quarter of each year consistently shows the highest amount of discount applied suggesting several possible explanations such as a combination of seasonal promotions, demand fluctuations, and strategic pricing decisions by airlines to maintain passenger volumes during a typically slower period.
  • Mile to Fare Ratio Over the Years: revealed a clear correlation between the distance traveled and the fare charged, with shorter flights generally having lower fares. This correlation is consistent with pricing strategies in the airline industry, where shorter routes typically incur lower operating costs and are thus priced more affordably. This insight underscores the complexity of airline pricing models, where factors like distance, demand, competition, and operational costs all play crucial roles in determining fares.

Other insights gathered;

  • Trend of Average Passenger, Fare, and Discount Applied Over the Years and Quarters: This analysis provided insights into how the airline industry balanced passenger load with pricing.
  • Busiest Routes by Passenger Count: Identifying the most popular routes highlighted key travel corridors and hubs.
  • Most Expensive and Cheapest Routes: This comparison sheds light on pricing disparities across different routes.

The Power of Power Query in Data Cleaning

This project underscored the importance of data cleaning in the analytical process and showcased how Power Query can be an invaluable tool. Power Query’s ability to handle large datasets, automate repetitive tasks, and transform data quickly made it possible to clean the data efficiently and effectively. Without these preparatory steps, the insights drawn would have been far less accurate and reliable.

Conclusion

Data cleaning is a foundational step in data analysis that ensures the results are valid and actionable. With its robust functionality, Power Query plays a crucial role in this process, enabling analysts to transform raw data into a format ready for deep analysis. The insights I gained from the US Airline Flight Routes and Fares dataset are a testament to the power of clean data and the tools that make it possible.

If you’re looking to enhance your data analysis skills, mastering Power Query for data cleaning is a must. Not only does it simplify the process, but it also opens up new possibilities for discovering actionable insights hidden within your data.

Get the Code

The full analysis and data can be found here.

Thanks for reading!

--

--