My First Data Science Project
A Simple Classification Problem
Completed using: Python (Pandas, Numpy, Matplotlib), SQL, MS Excel
If you’re a student pilot working towards your private or commercial license, chances are you check the METAR report to determine whether or not you can fly on the day of your booking. This is what a METAR report looks like:
All this just to see what the weather’s like? How unnecessarily complicated. I have a much easier alternative: Read this 3000 word machine learning model analysis instead.
I don’t want to make this a difficult read, so I’ll try to keep things simple and concise. This article contains the written explanation of my analysis. If you’re interested, my code and database queries can be found here:
github.com/justincessna/project1
My analysis is tailored to apply only to the Cessna 172 model aircraft. “Tailored” meaning my analysis was performed under this specific model’s flight constraints. Each aircraft is slightly different; for instance, the maximum wind speed this aircraft can handle during takeoff and landing is approximately 15 knots (according to its operating handbook). Another model’s maximum takeoff wind speed may be different.
Here’s a photo of a 172:
My analysis is centered around the naïve Bayes machine learning model. It’s really easy to understand how this classifier works. The model takes in new input and classifies it under certain categories on the basis of past inputs whose categories have already been determined.
I’ll keep things simple for my first project. Thus, when applying this model to my analysis, it simply means I take new weather data inputs to determine whether or not that day is suitable for flying. This is relatively easy to perform because there are only two categories where my outputs can lie, as opposed to the hundreds and thousands of categories present in exponentially more complex machine learning models. These two categories are:
- “Yes, flying is permitted,” or
- “No, flying is not permitted.”
First, I asked myself: what factors of the weather determine my ability to fly? I can’t speak on behalf of experienced 737 captains, but as a student, I look for these three properties: Wind Speed, Visibility, and Cloud Ceiling.

Wind speed and visibility are pretty self-explanatory. The former tells you the strength of the wind, and the latter tells you how far you can see ahead. Wind speeds are usually measured in knots, and visibility is usually measured in statute or nautical miles. Knowing visibility is important because students fly in accordance with “VFR,” or “Visual Flight Rules.” All this means is that flying takes place by physically looking outside the cockpit instead of at the instruments. Basically, flying is not permitted if winds are gusting too strong, nor is it permitted if it’s impossible to see what’s reasonably ahead. For you, the one actual pilot out there who will read my analysis: I conflated crosswind and headwind into one variable, as splitting the two is unnecessarily complicated for this analysis. Don’t worry about these two technical terms if you don’t know what they are.

Lastly, there’s cloud ceiling. That measures how high the lowest clouds are above sea level. Students are advised to avoid flying through clouds because they may experience erratic turbulence that can lead to a loss of control of their aircraft. Their engine may also experience “shock freezing” from the cloud’s moisture. That means the engine stops. Lovely.
Under naïve Bayes, wind speed, visibility, and cloud ceiling are feature variables. Feature variables represent each measurable piece of the data set we use for analysis. In the made-up sample data set above, they’re represented by column heads B, C, and D.
A typical private license consists of three different types of flights. If you’re a student, you’ll have to experience all three of these during the process of obtaining your license. “Dual flights” are flown with your flight instructor. They need to be there to hold your hand during these couple hours because prior to your first lesson, you use the word “joystick” as an inappropriate euphemism. “Solo flights” are flown by yourself after you’ve demonstrated that you can operate the aircraft with professionalism and care. A “Cross Country” flight refers to when you fly and land at a different airport from where you departed. Cross country flights are done both solo and with an instructor on separate occasions.
For each different type of flight, there are “weather minimums” that need to be observed. For example, if clouds need to be at least 3000 feet above sea level, and they’re only at 2400 feet on the day of your booking, then you’re not permitted to fly.
Below are the weather minimums for the 172. For this analysis, I used a personal estimate based on about 30 flight hours (that’s where I’m at right now), but do note that these numbers are not entirely accurate and are debatable depending on experience.
Scroll back up and look at row 2 of our sample data set. According to the measurements of our three feature variables, we can see that flying was permitted on January 1st. This means that on January 1st, our three feature variables measured within the flight constraints of the table above, and thus, pilots were allowed to fly.
Now imagine it’s Friday, you’re a student, and you have a lesson booked with your instructor tomorrow, on Saturday. For once, you use the internet for something useful, and you easily find out what Saturday’s wind speed, visibility, and cloud ceilings are from a detailed weather forecast. Now that you have this information, how can you use it in conjunction with an actual weather data set to predict whether or not you’ll fly this Saturday?
Let’s first understand how the naïve Bayes classifier is mathematically represented:
P(A | B) and P(B | A) are conditional probabilities. “P(A | B)” reads as “the probability that event A will occur, given that event B has already occurred,” and vice versa for P(B | A). P(A) is the individual probability of event A occurring, likewise with P(B). P(A) and P(B) are class probabilities. A class probability is simply an outcome’s probability; so in this case, we’re referring to the probability of outcome A, and the probability of outcome B. Two separate probability figures.
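To make that reading concrete, here’s a toy Bayes’ theorem calculation. The numbers below are invented purely for illustration and have nothing to do with the flight data:

```python
# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
# Toy numbers (made up for illustration, not from the flight data):
p_a = 0.6           # P(A): prior probability of flying
p_b_given_a = 0.9   # P(B | A): chance of low wind, given that we flew
p_b = 0.7           # P(B): overall chance of low wind

# The posterior: chance of flying, given that we observed low wind
p_a_given_b = p_b_given_a * p_a / p_b
print(round(p_a_given_b, 4))  # → 0.7714
```

Notice that observing low wind (event B) pushed the probability of flying up from 60% to about 77%. That’s all Bayes’ theorem does: it updates a prior belief using evidence.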
There are two underlying assumptions of the Bayesian classifier model:
- All feature variables are completely independent of each other, and
- All feature variables are equally weighted when determining the outcome.
This means that in any given data set,
- Variable X has nothing to do with variable Y, and
- X and Y equally affect the outcome’s probability.
This is actually why the model is deemed “naive.” Variables in real world data sets are very, very unlikely to be completely unrelated, nor are they likely to have an exact equal say in determining the probability of an outcome.
Our sample equation only contains one feature variable B. When more feature variables are accounted for in any data set, our equation turns into this:
That can be simplified into this:
Don’t be intimidated. What you see above simply represents a formula that accounts for a data set with n feature variables B, where i starts at 1. n can literally be anything: 1, 5, 10… It’s just how many feature variables we have in our analysis. Since we have three feature variables, our n = 3, making our final formula this:
P(A | B1, B2, B3) = (P(B1 | A) * P(B2 | A) * P(B3 | A) * P(A)) / P(B1, B2, B3)
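As a sketch, the formula above translates into a few lines of Python. This is a generic helper I’m writing for illustration (the function name and signature are my own, not from the project code):

```python
from math import prod

def naive_bayes_posterior(likelihoods_yes, prior_yes, likelihoods_no, prior_no):
    """P(Yes | B1..Bn) under the naive independence assumption.

    likelihoods_yes and likelihoods_no hold the P(Bi | Yes) and P(Bi | No)
    values for each feature variable. Because "Yes" and "No" are the only
    two outcomes, the shared denominator P(B1..Bn) is just the sum of the
    two numerators.
    """
    num_yes = prior_yes * prod(likelihoods_yes)  # P(B1|Yes)*...*P(Bn|Yes)*P(Yes)
    num_no = prior_no * prod(likelihoods_no)     # P(B1|No)*...*P(Bn|No)*P(No)
    return num_yes / (num_yes + num_no)
```

Multiplying the likelihoods together is exactly where the independence assumption sneaks in: a product of per-variable probabilities is only the joint probability if the variables don’t interact.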
We only care about the probability of flying, not the probability of not flying, so let’s make A = “Yes, flying is permitted.” A; not P(A). Big difference. If A = “Yes,” then P(A) is the probability of “Yes.” Having established that, our final formula is defined as “the probability that we will fly (“Yes”) given that B1 (Wind Speed) is x knots strong, B2 (Visibility) is y statute miles far, and B3 (Cloud Ceiling) is z thousand feet high.”
I hope at this point, you’re starting to slowly piece everything together. P(B1 | A) is the probability that wind speed (B1) is x knots strong given that flying is permitted on that day (A). Only wind speed. I’ll say it again. Just wind speed. Why? Remember one of the model’s underlying assumptions? Each variable operates independently from all other variables. The same logic applies to P(B2 | A) and P(B3 | A). With P(B2 | A), we’re only looking at visibility (B2) with respect to outcome A, and with P(B3 | A), we’re only looking at cloud ceiling (B3) with respect to outcome A.
The next step is working with the real life data.
Step 1 is data collection. I acquired all my raw data from climate.weather.gc.ca. Data is collected via weather stations operated by the Canadian government. As for its integrity, I don’t see any reason for a bias against the weather.
I downloaded individual CSV files for every month of 2019. Each monthly data set is further split by day, which is split even further by hour. Here are the first 9 rows of January 2019’s data set:
As you can see, a ton of irrelevant information scattered across a messy table of 745 rows. Excellent.
Step 2 is analyzing and cleaning the data. The key here is to delete every piece of information I don’t need for my final calculation. Since I only need information on three feature variables, everything else is junk. Right? Well, as you can see from the column heads, I’m missing crucial data on cloud ceiling. Cloud ceiling figures are relatively obscure on a conventional weather forecast, so I wasn’t surprised they weren’t recorded. My remedy? I calculated it myself. These were the steps, as outlined by the Federal Aviation Administration, or FAA:
- I recorded the difference between surface temperature (Column J) and dew point (Column L). This is known as the “spread.”
- I divided each spread by 2.5 because temperature is measured in °C. I would divide them by 4.4 if temperatures were in °F. Why the seemingly arbitrary numbers? FAA.
- I multiplied each quotient from step 2 by 1000.
- Finally, I added ground elevation to each product in step 3. My flight school’s ground elevation is 936 feet.
With these steps, I was able to create a new list of daily measurements on cloud ceilings above sea level.
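As a rough sketch, those four steps look like this in Python (the constant 936 is my flight school’s elevation from above; the function name is my own):

```python
def estimate_ceiling_asl_ft(temp_c, dew_point_c, ground_elevation_ft=936):
    """Estimate the cloud base above sea level using the FAA spread method.

    Step 1: spread = surface temperature minus dew point (both in deg C).
    Step 2: divide the spread by 2.5 (it would be 4.4 for deg F readings).
    Step 3: multiply by 1000 to get the cloud base above ground level.
    Step 4: add ground elevation to convert to height above sea level.
    """
    spread = temp_c - dew_point_c
    return spread / 2.5 * 1000 + ground_elevation_ft
```

For example, a 10 °C surface temperature with a 5 °C dew point gives a spread of 5, which works out to an estimated ceiling of 2,936 feet above sea level.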
After about 5 hours of wrangling, joining, and manipulating my data using SQL and Python, I successfully generated a master data set containing all the required information from the entire year of 2019, which I used to calculate the probability of outcome A. The code’s logic was easy to execute, but extremely repetitive and time-consuming to write from scratch.
Here’s a preview of my master data set:
And there you have it. I’ve created a table that’s clean and easy to read. These are actual figures, so each row represents actual observations for each day of the year 2019. Results 1, 2, and 3 are all outcome variable “A”s for their respective type of flight. They were calculated using a simple conditional IF statement with the following feature variable classification table:
Let’s look at row 3. On January 2nd, 2019, wind was blowing at 7.3 knots, visibility was 10 statute miles ahead, and the lowest clouds were 2287 feet above sea level. Students were permitted to fly with an instructor (Outcome A under Result 1), but were not allowed to fly alone (Outcome A under Result 2) or perform cross country flights (Outcome A under Result 3).
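In Python, that conditional IF statement looks roughly like this. Note that the threshold numbers below are illustrative placeholders, not the exact values from my classification table:

```python
# Hypothetical weather minimums per flight type -- placeholders standing in
# for the actual Feature Variables Classification Table.
MINIMUMS = {
    "dual":          {"max_wind_kts": 15, "min_vis_sm": 6,  "min_ceiling_ft": 2000},
    "solo":          {"max_wind_kts": 10, "min_vis_sm": 10, "min_ceiling_ft": 3000},
    "cross_country": {"max_wind_kts": 10, "min_vis_sm": 10, "min_ceiling_ft": 5000},
}

def can_fly(flight_type, wind_kts, vis_sm, ceiling_ft):
    """Outcome A for one day: "Yes" only if every feature variable
    satisfies the minimum for that flight type."""
    m = MINIMUMS[flight_type]
    ok = (wind_kts <= m["max_wind_kts"]
          and vis_sm >= m["min_vis_sm"]
          and ceiling_ft >= m["min_ceiling_ft"])
    return "Yes" if ok else "No"
```

Running January 2nd’s numbers (7.3 knots, 10 statute miles, 2,287 feet) through these placeholder minimums reproduces the row-3 pattern: dual is a “Yes,” while solo and cross country are both “No.”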
Step 3 involves performing pre-computations on my master data set. Scroll back up and look at our final equation if you need. The next part is just arithmetic; I must calculate all elements on the right hand side of the equation, P(B1 | A), P(B2 | A), P(B3 | A), and P(A), individually before I can arrive at P(A | B1, B2, B3). After writing and executing a whole page of code, I finalized the following three tabular data sets for P(B1 | A), P(B2 | A), and P(B3 | A) respectively.
Let’s look at the table immediately above; my cloud ceiling data set, Table 3.
Using the Feature Variables Classification Table from a few screens above, I sorted every figure under the cloud ceiling column of my master data set into: “Low,” “Medium,” or “High,” for each flight type: “Dual,” “Solo,” and “Cross Country” (Columns A and B respectively).
Then, I counted the number of “Yes” and “No” results for each flight type and cloud ceiling pair (Columns C and D). For example, row 8 of my cloud ceiling table tells me that the number of times a student was given clearance to fly solo (Column A), when the cloud ceiling (Column B) was “High” (> 3000 feet) was 98 times, and the number of times they were not given clearance to fly under the same scenario was 58 times.
Afterwards, I calculated the conditional probability of each flight type and cloud ceiling pair by dividing the number of “Yes” and “No” occurrences by the total number of “Yes” and “No” outcomes respectively. Going back to row 8: given that a solo flight was permitted (“Yes”), the probability that the cloud ceiling was “High” (> 3000 feet) is 69% (98 / 142), and given that flying was not permitted (“No”), the probability of a “High” ceiling is 27% (58 / 215). These are P(B3 | A = “Yes”) and P(B3 | A = “No”), the exact terms our formula needs. Once again, emphasis on looking at “only” one variable at a time. Remember? Feature variables are assumed to be completely independent of each other.
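Here’s a toy sketch of that counting step using pandas. The five rows below are made up; the real master data set has one row per day of 2019:

```python
import pandas as pd

# Made-up stand-in for a slice of the master data set.
df = pd.DataFrame({
    "ceiling_band": ["High", "High", "Medium", "High", "Low"],
    "solo_result":  ["Yes",  "Yes",  "No",     "No",   "No"],
})

# Count Yes/No outcomes per ceiling band, then divide each column by its
# total so every entry becomes P(ceiling band | outcome).
counts = pd.crosstab(df["ceiling_band"], df["solo_result"])
cond_probs = counts / counts.sum()
print(cond_probs)
```

With these toy rows, P(“High” | “Yes”) comes out to 2 / 2 = 100% and P(“High” | “No”) to 1 / 3 ≈ 33%: each column is divided by its own outcome total, exactly as in row 8 above.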
Finally, I repeated this process 53 more times across my three feature variables to arrive at 54 unique probabilities of P(Bi | A). For example, the probability of observing a low wind speed given that a cross country flight was permitted is 87% (Cell E10 of Table 1 — Wind Speed). The probability of observing medium visibility given that, say, a solo flight was permitted is 1% (Cell E7 of Table 2 — Visibility).
The last element of our equation is P(A). That was determined using this table:
This table sums the total number of “Yes” and “No” outcomes for each flight type, regardless of what the magnitude was for each feature variable. For dual flights, you can see that there were a total of 222 times across the year when flying was permitted, and 135 times when flying was not. P(A) is calculated by simply taking each “Yes” sum and dividing it by the total number of outcomes, which is 357 (222 + 135, or 142 + 215, or 97 + 260).
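Using the totals from that table, the priors can be computed directly:

```python
# Yearly Yes/No outcome totals per flight type, 357 observations each.
totals = {"dual": (222, 135), "solo": (142, 215), "cross_country": (97, 260)}

priors = {}
for flight, (yes, no) in totals.items():
    # P(A) = count of "Yes" outcomes over all outcomes for that flight type
    priors[flight] = yes / (yes + no)

print({k: round(v, 2) for k, v in priors.items()})
# → {'dual': 0.62, 'solo': 0.4, 'cross_country': 0.27}
```

The solo prior of 40% is the P(“Yes” (Solo)) figure that shows up again in the worked example below.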
That concludes the building process of our naïve Bayes classifier model.
Let’s test the model out.
Back to Friday evening. You look at the weather forecast for Saturday, and you see Wind Speed: 9 knots, Visibility: 10 statute miles, and Cloud Ceiling: 2495 feet. You know you’re flying solo, so what are the chances that you’ll fly tomorrow? Let’s apply our historical data to our calculation.
P(A | B1, B2, B3) is our desired outcome. This can be represented as:
P(“Yes” | Wind Speed = 9 kts, Visibility = 10 sm, Cloud Ceiling = 2495 ft)
Let’s combine the three variables B1, B2, and B3 into a single variable, “Saturday.”
So now our desired outcome is:
P(“Yes” | Saturday)
Using Final Table 1, Final Table 2, Final Table 3, and Final Table 4, (which was derived from the Feature Variables Classification Table), the complete equation becomes:
P(“Yes” | Saturday) =
(P(“Low” Wind Speed (Solo) | “Yes”) * P(“Far” Visibility (Solo) | “Yes”) * P(“Medium” Cloud Ceiling (Solo) | “Yes”) * P(“Yes” (Solo)))
/ P(Saturday)
Conversely,
P(“No” | Saturday) =
(P(“Low” Wind Speed (Solo) | “No”) * P(“Far” Visibility (Solo) | “No”) * P(“Medium” Cloud Ceiling (Solo) | “No”) * P(“No” (Solo)))
/ P(Saturday)
According to each of our final tables,
P(“Low” Wind Speed (Solo) | “Yes”) = 85% (Cell E6 of Final Table 1)
P(“Low” Wind Speed (Solo) | “No”) = 41% (Cell F6 of Final Table 1)
P(“Far” Visibility (Solo) | “Yes”) = 99% (Cell E8 of Final Table 2)
P(“Far” Visibility (Solo) | “No”) = 92% (Cell F8 of Final Table 2)
P(“Medium” Cloud Ceiling (Solo) | “Yes”) = 31% (Cell E7 of Final Table 3)
P(“Medium” Cloud Ceiling (Solo) | “No”) = 60% (Cell F7 of Final Table 3)
P(“Yes” (Solo)) = 40% (Cell D4 of Final Table 4)
and P(“No” (Solo)) = 60% (Cell D5 of Final Table 4)
Thus, the proportional probabilities of outcomes “Yes” and “No” are:
P(“Yes” | Saturday)’s proportion = 85% * 99% * 31% * 40% = 0.1043 (rounded)
and P(“No” | Saturday)’s proportion = 41% * 92% * 60% * 60% = 0.1358 (rounded)
When I say proportional probability, I’m just referring to the formula’s numerator:
P(B1 | A) * P(B2 | A) * P(B3 | A) * P(A)
Since P(“Yes” | Saturday) + P(“No” | Saturday) = 1 (100%),
That means:
P(“Yes” | Saturday) = 0.1043 / (0.1043 + 0.1358) = 0.4344 (rounded)
= 43.44%
(0.1043 + 0.1358) is our P(Saturday).
So given Saturday’s weather conditions, the chance of flying tomorrow is slightly less than half, at 43.44%.
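As a quick sanity check, the whole Saturday calculation fits in a few lines of Python (it lands on 43.45% rather than 43.44% only because I rounded the numerators before dividing above):

```python
# Final-table values for a solo flight on Saturday:
likelihoods_yes = [0.85, 0.99, 0.31]  # P(low wind | Yes), P(far vis | Yes), P(med ceiling | Yes)
likelihoods_no = [0.41, 0.92, 0.60]   # the same three likelihoods, given "No"
p_yes, p_no = 0.40, 0.60              # class priors for solo flights

# Multiply each set of likelihoods into its prior (the two numerators).
num_yes, num_no = p_yes, p_no
for ly, ln in zip(likelihoods_yes, likelihoods_no):
    num_yes *= ly
    num_no *= ln

p_saturday = num_yes + num_no  # the shared denominator, P(Saturday)
print(f"P(Yes | Saturday) = {num_yes / p_saturday:.2%}")
# → P(Yes | Saturday) = 43.45%
```

Swapping in different forecast bands just means picking different cells from the final tables; the arithmetic never changes.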
How accurate is this model? That’s up for debate. Given how objective our feature variables are, I would say that if weather data were consistently collected and added to our master data set, then with more training, the model should be very accurate. Except during one unique combination of inputs: when all three feature variables measure within the range of acceptable flying condition minimums. What do I mean? Consider the following variable measurements below:
Flight: Solo
Wind Speed: 3 knots (“Low”)
Visibility: 20 statute miles (“Far”)
Cloud Ceiling: 10,000 feet Above Sea Level (“High”)
Weather: Sunny
Given our Bayesian model, what’s the probability of flying?
P(“Yes” | B1, B2, B3) =
(P(B1 | “Yes”) * P(B2 | “Yes”) * P(B3 | “Yes”) * P(“Yes”)) /
((P(B1 | “Yes”) * P(B2 | “Yes”) * P(B3 | “Yes”) * P(“Yes”)) + (P(B1 | “No”) * P(B2 | “No”) * P(B3 | “No”) * P(“No”)))
So using our final tables:
(85% * 99% * 69% * 40%) /
((85% * 99% * 69% * 40%) + (41% * 92% * 27% * 60%))
= 0.2323 / (0.2323 + 0.0611)
= 0.2323 / 0.2934
= Approximately 79.18%
Not a bad chance at all, but I can tell you from experience: unless your 59-year-old instructor has a hot lunch date, you’re 100% flying in this weather. So why does the Bayesian model think the outcome is only 79% probable? I don’t actually know for sure, but I think it’s because the model’s underlying data is assumed to be normally distributed. That’s partially true. Below is a visual comparison between the distribution of P(Yes) and P(No) values for solo flights. I only need to show you one type of flight to demonstrate my point.
Aside from P(No)’s graph for visibility, you can see that the P(Yes) graphs are all linear, and the P(No) graphs are normally distributed (yes, I know it’s a triangle and not exactly a bell shape). Now, look at this graph of outcomes:
Assume that each blue dot represents a probability value pair derived from a unique set of feature variable inputs. Both the x axis and y axis represent probability numbers as a percentage. So a dot on the “Outcome: Yes” end would have a yes to no probability ratio of say, 99.5% : 0.5%, while a dot on the “Outcome: No” end would have a yes to no probability ratio of the reverse: 0.5% : 99.5%. A dot on the middle would have a ratio of say, 55% : 45%. You get the gist.
Why are extreme ends of the spectrum so heavily populated, while the middle is basically empty? Well, during actual bookings, when even a single feature variable falls short of their respective weather minimum value, then the outcome is almost guaranteed to be “no.” On the other hand, if all variables satisfy their respective weather minimum values, then the answer is almost always “yes.” It’s that black or white, there’s very little middle ground.
Back to our unique scenario. All three feature variables on the hypothetical day of my booking were conducive to flying. If that occurred in real life, the outcome P(A | B1, B2, B3) would nearly always be “yes.” That’s because in this case, P(Bi | A) is 100%. The Bayesian model, however, does not recognize P(Bi | A) as 100%. Instead, it utilizes the same values as if our historical probabilities were normally distributed, as they are under P(No).
So in conclusion, this model is very accurate at predicting our desired outcome. However, there is one exception: if inputs are in the specific combination of “Low Wind Speed,” “Far Visibility,” and “High Cloud Ceiling,” then our model is inaccurate.
I think.