Thanks to BeautifulSoup and Regex
I traveled quite a bit last year, which in my case meant frequent plane rides. Aside from one flight, I had relatively good luck. My lone problem flight involved hours of delay after the pilot had to abort take-off at the last second. Of course, nobody ever explained why the pilot slammed the brakes; we were just shuffled off the plane and told to wait for further instructions. Curious to know what happened, I found a website that tracks air traffic incidents and searched for about a week for my flight's report. I never found the flight or the reason it aborted takeoff, but I did find a subject for a web scraping project.
The Aviation Herald reports “incidents and news in aviation” daily. The landing page for the site lists reports as headlines, and allows you to click each headline for a more detailed article. Aviation Herald has reports on incidents dating back to June 19th, 1999 and looks like this:
In case you’re wondering, the little picture near each headline isn’t just a funny looking bullet point. Each report is classified into one of five categories:
Lucky for me, each headline is structured very similarly, which I assumed would make for easy parsing. A quick scrape, parse, & clean and I’ll be armed with knowledge that will make me even more anxious each time I fly! Yippee!!
Before scraping, I inspected the site’s HTML to get an idea of how I could locate each headline. Luckily, the HTML included a nice big note that read
<!-- Begin Content Section -->, which reassured me that I was searching in the right place. Working my way down the <table> hierarchy, I was able to find the report classification (as the title of the <img class="frame"> tag) and the full headline (under <span class="headline_avherald">). These were exactly the two elements I needed. Easy! I realized, however, that each page contained fewer than 50 headlines, dating back only about a week. If I wanted enough data to perform any sort of analysis, I'd have to find a way to move through pages and scrape data from each new page. Since the URL for avherald.com isn't paginated nicely (i.e. no page=2 within the URL path), I had to get creative. The best way I could think of to advance through this website's pages was by finding the URL path associated with the page's "Next" button.
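Here's roughly how that looks in code, assuming the "Next" button is an <img> wrapped in an <a> tag (the matching rule below is a sketch and may not line up exactly with the live site's markup):

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

BASE_URL = "http://avherald.com"

def find_next_url(soup):
    """Return the absolute URL behind the page's 'Next' button, or None."""
    for img in soup.find_all("img"):
        # Assumption: the button image is identifiable by 'next' in its
        # src or alt attribute; the real markup may differ.
        if "next" in (img.get("src", "") + img.get("alt", "")).lower():
            link = img.find_parent("a")
            if link and link.get("href"):
                return urljoin(BASE_URL, link["href"])
    return None
```

With a helper like this, scraping becomes a loop: parse a page, collect its headlines, follow the "Next" URL, repeat until the helper returns None.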
To scrape the site, I fetched each page's content and parsed it into a BeautifulSoup object. Since I knew I'd need to do this multiple times, for multiple pages, I wrapped it in a function with url as an input. Here's a look at my code:
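A minimal version of that helper, assuming the requests library for the HTTP call (the function name and the choice of requests are mine):

```python
import requests
from bs4 import BeautifulSoup

def get_soup(url):
    """Fetch a page and parse it into a BeautifulSoup object."""
    response = requests.get(url)
    response.raise_for_status()  # surface HTTP errors instead of parsing an error page
    return BeautifulSoup(response.text, "html.parser")
```
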
Great! I have a BeautifulSoup object, and I know exactly where to find the classification and the headline. Aviation insights are within reach!
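Pulling those two pieces out of the soup could look like this (the helper name and traversal are my reconstruction; the class names come from the HTML described above):

```python
from bs4 import BeautifulSoup

def extract_reports(soup):
    """Return (classification, headline) pairs from a parsed page."""
    reports = []
    for span in soup.find_all("span", class_="headline_avherald"):
        # The classification icon sits just before the headline text
        img = span.find_previous("img", class_="frame")
        classification = img.get("title") if img else None
        reports.append((classification, span.get_text(strip=True)))
    return reports
```
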
Locating and isolating the classification & headline for each report was just the tip of the iceberg. I realized that all of the detailed data I hoped for was contained within one long string: the headline! I needed to find a way to isolate phrases to make a meaningful data frame.
Enter: Regular Expressions. According to regular-expressions.info, a regular expression (regex) is a “special text string for describing a search pattern.” This seemed perfect for my purposes. Luckily for me, the structure of the headlines followed a pattern:
[Airline] [Aircraft] at/near [Location] on [Date], [Short Description]
So if I could get the proper special text string to describe this pattern, I’d be able to build a data frame by iterating through each headline and parsing like the line above.
Unfortunately, regex strings look like this:
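Something along these lines, for the headline structure above (a reconstruction of the kind of string I mean, not necessarily the exact one):

```python
import re

# One plausible pattern for: [Airline] [Aircraft] at/near [Location] on [Date], [Description]
pattern = r"^(.+?) ([A-Z0-9]{3,4}) (at|near) (.+?) on ([A-Z][a-z]{2} \d{1,2}[a-z]{2} \d{4}), (.+)$"
```
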
I don't know about you, but my first look at this string of symbols made me cringe. It looked unreadable to me, and I assumed it would take forever to decipher. Actually, picking up the basic syntax of regex strings was pretty quick, especially since I found this magical site, Regex101, to help.
Loaded with a quick reference guide to help you find "tokens" (special text strings), an explanation window, a match window, and intuitive highlighting to help you debug, this site makes building a regex string relatively simple. Since I'd be parsing many, many headlines, I made a function for parsing. The function returns a list of the parsed segments of my original string.
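A sketch of what that function might look like, with a reconstructed pattern (my actual expression may have differed slightly):

```python
import re

# Reconstruction of a parser for the headline structure
# [Airline] [Aircraft] at/near [Location] on [Date], [Short Description]
HEADLINE_RE = re.compile(
    r"^(.+?) "                                   # airline name
    r"([A-Z0-9]{3,4}) "                          # aircraft type code, e.g. A319
    r"(at|near) "                                # preposition
    r"(.+?) on "                                 # location
    r"([A-Z][a-z]{2} \d{1,2}[a-z]{2} \d{4}), "   # date, e.g. Jan 10th 2019
    r"(.+)$"                                     # short description
)

def parse_headline(headline):
    """Return the parsed segments as a list, or None for edge-case headlines."""
    match = HEADLINE_RE.match(headline)
    return list(match.groups()) if match else None
```
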
Finally, time to put everything together and build a data frame I can analyze. Some disclosures: I only grabbed headlines that followed the format above. A couple of headlines included two airlines & aircraft, or two locations. Because there were so few instances of these, the effort to include them far outweighed the benefit for my purposes. So, I decided to exclude these edge cases.
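The assembly step can be sketched with pandas; the column names here are my guesses at what the final frame contained (six in total), and the function assumes the parser returns None for headlines that don't fit the pattern:

```python
import pandas as pd

COLUMNS = ["airline", "aircraft", "preposition", "location", "date", "description"]

def build_frame(parsed_headlines):
    """Stack parsed headline lists into a DataFrame, dropping the edge cases."""
    rows = [p for p in parsed_headlines if p is not None]  # None = headline that didn't match
    return pd.DataFrame(rows, columns=COLUMNS)
```
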
Ta-da! A gorgeous database to play with:
A quick df.info() inspection revealed my data frame had 450 rows, 6 columns, and that all columns were non-null objects. Headlines are listed in the order they are posted to the site, and my final headline was posted on 12/10/2018. No aviation headlines prior to this one were considered. Some more quick stats:
Digging in a little deeper…
Out of 222 unique airlines, most appear two times or less. More disclaimers: I didn't combine associated airlines. For example, Aeromexico and Aeromexico Connect are considered separate airlines in my data. If I had done the research and enriched my data to account for this, my results here would likely be pretty different.
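The tally itself is a one-liner with pandas; here's a toy example showing the separate-airlines caveat:

```python
import pandas as pd

# Toy data illustrating the caveat: related airlines are counted separately
airlines = pd.Series(["Aeromexico", "Aeromexico Connect", "Aeromexico", "Delta"])
counts = airlines.value_counts()
```
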
Each airline in the top 5 is North American. I wonder if North American airlines really have more incidents? There could be a lot of other factors at play here. Maybe North American airlines are quicker to report incidents? Maybe North American airlines are less likely to operate regional flights under a different name? Without more research, my data can’t answer these questions.
Of the 69 unique aircraft listed in our headlines, most appear fewer than 5 times. There are a good number that appear 5–25 times, but strikingly, there are 2 aircraft that appear over 50 times! Let's take a look at which aircraft are so problematic.
Hmmm, not what I was expecting. None of the top 5 aircraft are the B38M that was recently grounded after two crashes. In fact, after more digging, the B38M was only involved in 10 headlines since December. I wonder if the top two aircraft are actually more problematic than other aircraft, or if they are just very popular models. Again, I’d need more details on usage of each aircraft to answer that question.
I made a quick word cloud to visualize the headline descriptions. After removing typical & some specific stop words, we can see that engines, landing, and bird strikes seem to be big problems!
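A word cloud is really just word frequency made pretty; the counting step underneath looks roughly like this (the stop-word list here is illustrative, not my real one):

```python
from collections import Counter

# Made-up stop-word list for illustration; the real one mixed standard
# English stop words with a few domain-specific terms
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "on", "and"}

def top_words(descriptions, n=3):
    """Count words across headline descriptions, ignoring stop words."""
    words = [w for text in descriptions for w in text.lower().split()
             if w not in STOP_WORDS]
    return Counter(words).most_common(n)
```
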
With a bit more cleaning of my data, I was able to turn the Date strings into datetime objects, and figure out the day of the week each incident occurred. Turns out, Mondays are rough for the aviation industry as well.
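That cleanup can be sketched like so: strip the ordinal suffix from each date string, parse with pandas, then pull the weekday name (the exact cleaning rule is my reconstruction):

```python
import pandas as pd

dates = pd.Series(["Jan 10th 2019", "Dec 31st 2018"])
# Drop the ordinal suffix (st/nd/rd/th) so the dates parse cleanly
cleaned = dates.str.replace(r"(\d+)(st|nd|rd|th)", r"\1", regex=True)
weekdays = pd.to_datetime(cleaned, format="%b %d %Y").dt.day_name()
```
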
I'll try to leave you with some good news. Flying can be scary, and things go wrong every day. HOWEVER, of the 450 headlines that I analyzed, only about 14% resulted in an accident or a crash. All others were resolved safely.