AI in travel, part 2: representing cyclic and geographic features
Cyclic features, like recurring time intervals such as time of day, or day of week, can be challenging to represent. Consider time of day on the 24-hour clock. If we use the natural continuous representation, we get a discontinuity at midnight, as we tick from 23:59 to 0:00. We could choose fixed buckets to make time of day categorical (e.g. morning, afternoon, evening, night) but this loses precision as well as the ordering between buckets.
An alternative approach is to imagine a circular 24-hour clock, and represent the time using the Euclidean coordinate of the clock’s hand. As shown in Figure 1, this uses two numbers to represent a single time of day, but gets rid of the discontinuity since both coordinates now vary smoothly as the hand moves around the clock. Ian London includes a nice explanation in a blog post on modeling cyclic features.
This same approach extends directly to geographic features. The natural representation in this case is latitude and longitude, with, say, New York commonly represented as 40.7128°N, 74.0060°W. However, we again have problems with discontinuities, both near the dateline, where 180°E and 180°W coincide, and at the poles, where all points become close regardless of longitude. As shown in Figure 2, we can use exactly the same approach to fix the problem, by representing places with three coordinates instead of two, but constrained to the surface of the unit sphere.
Simpler calculation and aggregation
The geographical case suggests another benefit of this alternative representation. Normally, calculating the “crow flies” great circle distance between two places is quite complicated, since getting distance from latitude and longitude involves the scary looking Haversine formula (or equivalently the spherical law of cosines):
But if we represent New York and Paris as Euclidean unit vectors p and q, then Figure 3 shows that we can easily calculate the angle between them using their dot product, so that the required distance is simply: d = r arccos p · q.
This type of representation is also useful for aggregation. Normally calculating with cyclic features is painful because of wrap-around effects. How do we calculate the average launch time-of-day for a user? What about the average location of destinations they’ve searched? These are easy to calculate from the vector representations. Figure 4 illustrates calculating a user’s average launch time, resulting in a vector that points in the direction of the preferred time, with a length r that relates to angular variation. The value r ranges from 0 (indifference) to 1 (absolute certainty). To get a preference strength in the range [0, ∞) we could, for example, calculate the cotangent of the typical variation angle, i.e. r / √ ( 1 — r2 ).
Last time, we mentioned Nate Silver’s comment on the average Presidential birthday. By applying our cyclic encoding, we can do the math to average the vectors representing each birthday which results in an average direction that corresponds to December 5 (ignoring leap years). Nate’s method, finding the day of the year that minimizes the sum of squares difference in days to all birthdays, gives a slightly different value of November 20 (and minimizing average absolute distance says that any of Nov 19–23 are equally best). Why the different answers? Our cyclic average effectively defines distance as sin(𝛥𝜃) whereas Nate’s defines it as just 𝛥𝜃, which ends up giving much more weight to far off birthdays (those early Summer Presidents). Either way, we see that the preference strength of Presidential birthdays is very low (0.07), suggesting an underlying randomness…
If you missed Part 1 of our series, you can find it here.
— Patrick Surry, Chief Data Scientist at Hopper
Hopper is an award-winning mobile app that uses big data to predict and analyze airfare and accommodations. Hopper provides travelers with the information they need to get the best deals on flights and hotels, and notifies them when prices are at their predicted lowest points.
We’re hiring for data science and engineering! Click on our Jobs page to apply and for more information.