AI in travel, part 2: representing cyclic and geographic features

Hopper
Life at Hopper
Published in
4 min readApr 24, 2018

Cyclic features, like recurring time intervals such as time of day, or day of week, can be challenging to represent. Consider time of day on the 24-hour clock. If we use the natural continuous representation, we get a discontinuity at midnight, as we tick from 23:59 to 0:00. We could choose fixed buckets to make time of day categorical (e.g. morning, afternoon, evening, night) but this loses precision as well as the ordering between buckets.

An alternative approach is to imagine a circular 24-hour clock, and represent the time using the Euclidean coordinate of the clock’s hand. As shown in Figure 1, this uses two numbers to represent a single time of day, but gets rid of the discontinuity since both coordinates now vary smoothly as the hand moves around the clock. Ian London includes a nice explanation in a blog post on modeling cyclic features.

Figure 1: Representing time of day as a circular 24 hour clock embedded in the Euclidean plane, arbitrarily placing midnight at the top with the usual clockwise rotation. For example, 8am becomes 120° past midnight, and is represented as the coordinate (½√3, -½). This requires two real numbers to encode the time of day (although they’re constrained to lie on the unit circle), but importantly gets rid of the discontinuity.

This same approach extends directly to geographic features. The natural representation in this case is latitude and longitude, with, say, New York commonly represented as 40.7128°N, 74.0060°W. However, we again have problems with discontinuities, both near the dateline, where 180°E and 180°W coincide, and at the poles, where all points become close regardless of longitude. As shown in Figure 2, we can use exactly the same approach to fix the problem, by representing places with three coordinates instead of two, but constrained to the surface of the unit sphere.

Figure 2: By imposing arbitrary Euclidean axes on the globe, we can represent any point on the Earth’s surface with three real coordinates (x, y, z), and remove the discontinuities of the traditional latitude and longitude representation at the cost of an extra variable.

Simpler calculation and aggregation

The geographical case suggests another benefit of this alternative representation. Normally, calculating the “crow flies” great circle distance between two places is quite complicated, since getting distance from latitude and longitude involves the scary looking Haversine formula (or equivalently the spherical law of cosines):

But if we represent New York and Paris as Euclidean unit vectors p and q, then Figure 3 shows that we can easily calculate the angle between them using their dot product, so that the required distance is simply: d = r arccos p · q.

Figure 3: It’s easy to measure distances on the globe (or our 24h clock) using our Euclidean representation, since the dot product p · q gives the cosine of the angle between them.

This type of representation is also useful for aggregation. Normally calculating with cyclic features is painful because of wrap-around effects. How do we calculate the average launch time-of-day for a user? What about the average location of destinations they’ve searched? These are easy to calculate from the vector representations. Figure 4 illustrates calculating a user’s average launch time, resulting in a vector that points in the direction of the preferred time, with a length r that relates to angular variation. The value r ranges from 0 (indifference) to 1 (absolute certainty). To get a preference strength in the range [0, ∞) we could, for example, calculate the cotangent of the typical variation angle, i.e. r / √ ( 1 — r2 ).

Figure 4: The homogeneous representation also simplifies aggregation of cyclic values. The radial scatter plot (left) shows all app launches for a single user during a three month period. Each light blue line corresponds to a single launch. The red line shows the average launch time, calculated by simply averaging the (x, y) values corresponding to the individual launches. The direction shows that the average launch is near 3pm. The length of the red vector relates to preference strength, with the chord passing thru its endpoint subtending a segment (shaded) that measures the typical angular variation around the average.

Last time, we mentioned Nate Silver’s comment on the average Presidential birthday. By applying our cyclic encoding, we can do the math to average the vectors representing each birthday which results in an average direction that corresponds to December 5 (ignoring leap years). Nate’s method, finding the day of the year that minimizes the sum of squares difference in days to all birthdays, gives a slightly different value of November 20 (and minimizing average absolute distance says that any of Nov 19–23 are equally best). Why the different answers? Our cyclic average effectively defines distance as sin(𝛥𝜃) whereas Nate’s defines it as just 𝛥𝜃, which ends up giving much more weight to far off birthdays (those early Summer Presidents). Either way, we see that the preference strength of Presidential birthdays is very low (0.07), suggesting an underlying randomness…

Tweet from Nate Silver (@natesilver538)

If you missed Part 1 of our series, you can find it here.

Patrick Surry, Chief Data Scientist at Hopper

Hopper is an award-winning mobile app that uses big data to predict and analyze airfare and accommodations. Hopper provides travelers with the information they need to get the best deals on flights and hotels, and notifies them when prices are at their predicted lowest points.

We’re hiring for data science and engineering! Click on our Jobs page to apply and for more information.

--

--

Hopper
Life at Hopper

Hopper uses big data to predict when you should book your flights & hotels. We’ll instantly notify you when prices drop so you can book travel fast in the app.