Exploring Linear Algebra — Part 1: Predicting Route Costs

André Mello
10 min read · May 13, 2018

This is my first entry in a series of articles with creative applications of linear algebra to problems. This one was inspired by an Uber ride.

So, imagine you are Google Maps, and your client wants to know the best path to take from point A to point B. If you have the city’s map, it’s easy, right? Just wearily apply Dijkstra’s algorithm to find the shortest path, and that’s your answer. Or maybe not.

If you’ve taken enough Uber rides, you know that sometimes the shortest path also happens to be the one under worst maintenance, or maybe it’s the most jammed up. If, like me, you live in a city scantily decorated with clusters of poverty and crime, there are paths you want to avoid at all costs. Several environmental factors, which may vary according to weather, time of day and occasion, will influence the ride’s duration, so more than just geographic distance should be taken into consideration when searching for the optimal route.

The Waze app does this planning smartly. It crowdsources additional markers, like accidents, speed traps and potholes. It uses the client’s GPS to estimate speed, and uses that to derive traffic intensity. It’s for good reason that Uber and taxi drivers alike rely on that app so much to navigate chaotic cities like São Paulo. But I’ve found that it often favors a road in bad shape or in a dangerous location, simply because it’s a little bit faster to go through it.

What could be made better? It seems ride duration is not the only variable that should guide the decision algorithm. We could artificially introduce other variables and markers and weight their relative importance, but ultimately that would fail to generalize — different cities and different people imply different priorities. I feel it would be more robust, scalable and elegant to derive a cost function implicitly, from simple user feedback. Would that be possible?

If I had a definitive answer, I would be selling it to Google, but it turns out there’s a simple and quite educational approach we might explore. It just involves a little bit of Linear Algebra.

Articulating the problem

Suppose all the information you have is a map, a series of paths along it and their respective lengths (or, more generally, costs). Your goal is to assign values to each individual path segment, so that you can easily estimate a new path’s cost by summing up the cost of its parts. If you represent the map as a directed graph (roads might only go one way), and each intersection, merger or split as a vertex, your edges become the segment units and any path is simply an ordered sequence of visited vertices (or, alternatively, traversed edges).

To put it a little more formally, you are given a weighted digraph G = (V, E), associated with an unknown weight function w : E → ℝ⁺, as well as a collection of tuples pᵤ = ({eᵥ}, cᵤ), with eᵥ ∈ E and cᵤ ∈ ℝ⁺, and the objective is to reconstruct w. If we enumerate each w(eᵥ) as wᵥ, from our paths {pᵤ} we may derive a system of equations as follows:

Σᵥ xᵤᵥ wᵥ = cᵤ,  for each path pᵤ,

where xᵤᵥ is defined by

xᵤᵥ = 1 if eᵥ ∈ pᵤ, and xᵤᵥ = 0 otherwise.

If you like, this can be rewritten as Xw = c, where w stacks the m unknown edge weights and c stacks the path costs. We know how to solve that exactly as long as rank(X) = m. If that’s not the case, we may still find an approximation. This is already a reasonable solution for our problem, as long as our measured costs c are exact, but that is hardly possible in real life.

For a more realistic model, we may assume our pathwise measurements of c are perturbed by some random noise, say cᵤ ∼ 𝓝(μᵤ, σᵤ). In that case the rank precondition is no longer sufficient for exact recovery, and we may only derive an approximation to the solution, by minimizing the error ‖Xw − c‖.
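Written out, this is the standard least-squares estimate (assuming X has full column rank, so that XᵀX is invertible):

```latex
\hat{w} \;=\; \operatorname*{arg\,min}_{w}\, \lVert Xw - c \rVert^{2} \;=\; \left(X^{\top}X\right)^{-1} X^{\top} c
```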

A toy example

It always helps to walk through an example. Suppose we work at a consulting firm on Big Data, called Large Data, and the Happy Mayor of the cozy town of Zizgy is our client.

The inhabitants of this town all have a map, which the Mayor happily distributes as fancy fliers that he spreads all over, from his helicopter, every month. People always complain about the mess, but the Mayor’s mysterious smile always catches them off guard, and they always end up forgiving him.

The somewhat surreal map of Zizgy!

It was all good around Zizgy for a while, until Uber came over and started claiming that the map was incomplete and its drivers did not know the shortest routes to take, which was increasing client dissatisfaction. The Mayor had never thought to measure the road lengths, and when he tried he found out his ruler wasn’t long enough to cover even the smallest of them. Luckily, Uber had given him their data, as evidence of how inconsistently drivers chose how to get from one point to another. So naturally he sought our firm’s expertise to make good use of that Data, which he thought was very Big.

That data included complete information on the path taken for each ride, as well as total charge. Since there was a well defined, publicly available ride fare policy, we could estimate the total distance traveled from the price. And since the road network could be represented as a graph, we figured this was an edge weight recovery problem!

Graph representation of the map.

After constructing the road graph as illustrated above, the encoded paths became

If (0,1)=e₀, (1,2)=e₁, (1,5)=e₂, (2,1)=e₃ and so on (in lexicographical order), the indicator matrix X becomes:

With X and c, we can then approximate w with the least squares method:
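As an aside, here’s a minimal NumPy sketch of that step. The numbers below are made up for illustration (they are not the Zizgy ride data); only the call to np.linalg.lstsq matters:

```python
import numpy as np

# Toy data (not the actual Zizgy rides): three paths over three edges.
# Row u of X says which edges path u traversed; c holds the measured total costs.
X = np.array([[1., 1., 0.],
              [0., 1., 1.],
              [1., 0., 1.]])
c = np.array([5.1, 6.9, 8.2])

# Least squares: find w minimizing ||X w - c||^2.
# np.linalg.lstsq also handles rank-deficient X, returning a minimum-norm solution.
w_hat, _, rank, _ = np.linalg.lstsq(X, c, rcond=None)

residuals = X @ w_hat - c
print("estimated edge costs:", np.round(w_hat, 2))
print("residual std:", residuals.std())
```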

The residuals Xw − c have a standard deviation of about 0.65, which seems reasonable — the values are distinct enough that this is already useful for finding the shortest paths. We deliver this result to the Mayor, who smiles at our competence.

As he was updating the town’s map with the length estimates, however, his secretary tells him that she found the true measurements on an old archived map from his predecessor. The townsfolk become unhappy when they hear that the Mayor has spent money on a consulting contract unnecessarily, but they warm up at his delightful smile as he showers the town with the new maps. All’s well that ends well.

Actual graph and lengths for Zizgy’s roads.

As it turns out, our estimates were off by an average of 0.53 units, though for road 0–1 we had a whopping error of 1.33!

Notice that, in this example, we could’ve improved our estimates by taking advantage of the fact that two-way roads like 2–5 are symmetrical — they’re the same length from 2 to 5 and from 5 to 2. I left both directions as separate unknowns to illustrate the more general framework, under which we are approximating some abstract cost measure that may depend on which direction one travels (for example, traffic congestion might be more intense in one direction than the other).

In case you are wondering, that ride data was generated by picking random paths within the graph and computing the total length with added noise ε ∼ 𝓝(0, 1). Given sufficient data, we might have been able to estimate the error distribution and take a Bayesian approach to obtain much better estimates of the lengths.

Coding a simulation

Now that we’ve walked through an example, I thought it would be interesting to explore an implementation of this modelling framework using TensorFlow. I chose TF because I wanted to show how it can be useful for general machine learning, outside of Deep Learning. This choice also enables online learning, in which new data is gradually generated and learned from, more closely resembling a production-ready, real-world implementation.

First, here’s the code I used to generate a single random path from an adjacency matrix:
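The original gist isn’t reproduced here, but a minimal sketch of what such a walk() function might look like follows. The termination rule (walk until there are no allowed neighbors left) and the exact signature are my own assumptions, not necessarily the author’s:

```python
import numpy as np

def walk(adj, sigma=1.0, alpha=0.0, rng=np.random):
    """Generate one random path on the digraph encoded by `adj`.

    adj[i, j] > 0 means there is an edge i -> j with weight adj[i, j].
    Returns (hits, cost): `hits` is a |V| x |V| matrix counting how many times
    each edge was traversed, and `cost` is the true total weight plus Gaussian
    noise with standard deviation `sigma`. With probability `alpha` the walker
    is allowed to revisit an already-visited vertex (a "wrong turn").
    """
    n = adj.shape[0]
    current = rng.randint(n)
    visited = {current}
    hits = np.zeros_like(adj)
    cost = 0.0

    while True:
        neighbors = np.flatnonzero(adj[current])
        # By default avoid vertices we've already been to; with prob. alpha allow loops.
        if rng.rand() >= alpha:
            neighbors = np.array([v for v in neighbors if v not in visited])
        if len(neighbors) == 0:
            break
        nxt = rng.choice(neighbors)
        hits[current, nxt] += 1
        cost += adj[current, nxt]
        visited.add(nxt)
        current = nxt

    return hits, cost + sigma * rng.randn()
```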

Here’s an example of the kind of adjacency matrix this code expects:
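For instance, something along these lines (a made-up 4-vertex graph, not the actual Zizgy map):

```python
import numpy as np

# Illustrative only: nonzero entries are edge weights, zeros mean "no road".
Adj = np.array([
    [0.0, 2.5, 0.0, 4.0],   # edges 0->1 (2.5) and 0->3 (4.0)
    [2.5, 0.0, 1.2, 0.0],   # edges 1->0 (2.5) and 1->2 (1.2)
    [0.0, 1.2, 0.0, 3.3],
    [0.0, 0.0, 3.3, 0.0],
])
```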

In this format, every non-zero element represents an edge, and its value represents the edge’s weight. The parameter σ controls the standard deviation of the noise added to the total distance, and α controls the chance of revisiting a vertex.

This last parameter was added to model rides where the driver makes a mistake and creates a loop in his path. If α = 0, X is as defined before — an indicator matrix that encodes which edges were traversed in each ride. If α > 0, then it encodes how many times each edge was traversed. Either way, the equation Xw = c doesn’t change, nor does our method for solving it. Just make sure α < 0.5, otherwise the algorithm has a good chance of causing a stack overflow by trying too many cyclic paths.

One thing to note is that, since our solver doesn’t necessarily know in advance which edges exist in the graph, we encode our X as a |V|×|V|×n tensor x, which is able to model even a fully dense digraph with self-loops (not that this makes sense given our interpretation of it as a map). Our solution for w is then a matrix with the same shape as Adj. Under this structure, the problem equation is actually x : w = c, with the double-dot operator denoting tensor double contraction, but this is really just the same thing as representing X as an n × |V|² matrix whose rows are the flattened hit matrices from walk().
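Concretely, assuming the rides were collected with the walk() sketch above, the double contraction and its flattened equivalent can be checked with a few lines of NumPy (the ride count and candidate weights here are arbitrary):

```python
import numpy as np

# Collect some simulated rides; keep hit matrices and noisy costs separately.
rides = [walk(Adj, sigma=1.0, alpha=0.1) for _ in range(100)]
x = np.stack([hits for hits, _ in rides], axis=-1)   # shape |V| x |V| x n
c = np.array([cost for _, cost in rides])

# For a candidate weight matrix w (same shape as Adj), the double contraction x : w
# sums x[i, j, k] * w[i, j] over i and j, producing one predicted cost per ride.
w = np.ones_like(Adj)
c_hat = np.einsum('ijk,ij->k', x, w)

# Equivalent view: flatten each hit matrix into a row of an n x |V|^2 matrix.
X_flat = x.reshape(-1, x.shape[-1]).T
assert np.allclose(c_hat, X_flat @ w.ravel())
```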

Here’s the rest of the code:

This creates a Dataset object (that generates random paths on-the-fly), defines the adjacency matrix, defines the computation graph and executes some 5000 training iterations (with each iteration applying a single path).
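The gist itself isn’t embedded here, so below is a rough re-implementation of that loop. It uses today’s eager tf.data / GradientTape API rather than the 2018 graph-and-session API the article was written against, and the learning rate and noise parameters are illustrative choices, not the author’s:

```python
import numpy as np
import tensorflow as tf

n_vertices = Adj.shape[0]

def ride_generator(sigma=1.0, alpha=0.1):
    # Endless stream of simulated rides: one hit matrix and one noisy cost at a time.
    while True:
        hits, cost = walk(Adj, sigma=sigma, alpha=alpha)
        yield hits.astype(np.float32), np.float32(cost)

dataset = tf.data.Dataset.from_generator(
    ride_generator,
    output_signature=(
        tf.TensorSpec(shape=(n_vertices, n_vertices), dtype=tf.float32),
        tf.TensorSpec(shape=(), dtype=tf.float32),
    ),
)

# The learned weights: one value per potential edge, same shape as Adj.
w = tf.Variable(tf.ones((n_vertices, n_vertices)))
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)

for step, (hits, cost) in enumerate(dataset.take(5000)):
    with tf.GradientTape() as tape:
        pred = tf.reduce_sum(hits * w)     # the double contraction x : w for one ride
        loss = tf.square(pred - cost)
    grads = tape.gradient(loss, [w])
    optimizer.apply_gradients(zip(grads, [w]))
```

Each iteration draws a single simulated ride, predicts its cost as the contraction of the hit matrix with the current weights, and nudges the weights toward reducing the squared error.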

Here’s the final weights I got by running the above code:

This has a Mean Squared Error (relative to the true weights) of about 0.85. Not a great result, especially since 5000 paths is a lot of data for such a small graph.

Since the paths always have some random noise added to them, the weights never actually converge to a fixed value, but keep on oscillating:

Weights histogram throughout iterations.
Evolution of the loss over training.

For reference, if there’s no noise in the data, the weights do converge to the exact solution:

Evolution of the loss over training for noiseless data.

We can take advantage of the weight oscillations to estimate a probability distribution over their values:

Derived from the last 50% of iterations, out of 100,000 total.

By averaging the weights over a large number of iterations (here, the last 50,000), we can better estimate their true values:

This result has a MSE of about 0.04.
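As a sketch of how that averaging might be done on top of the training loop above (the 50% cutoff follows the figure caption; the exact bookkeeping is my own):

```python
# Average the weights over the second half of training to smooth out the noise.
total_steps = 100_000
w_sum = tf.zeros_like(w)
n_averaged = 0

for step, (hits, cost) in enumerate(dataset.take(total_steps)):
    with tf.GradientTape() as tape:
        loss = tf.square(tf.reduce_sum(hits * w) - cost)
    optimizer.apply_gradients(zip(tape.gradient(loss, [w]), [w]))
    if step >= total_steps // 2:          # only the later 50% of iterations
        w_sum += w
        n_averaged += 1

w_avg = w_sum / n_averaged                # averaged estimate of the edge weights
```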

Additionally, batching does a great job at reducing variance:

Loss evolution with batches of 50 paths.
Derived from the last 50% of iterations, out of 2,000 total, with batches of 50 paths.
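In the pipeline sketched earlier, batching is essentially a one-line change; for example (batch size and step count taken from the captions above):

```python
# Group 50 rides per gradient step; averaging the squared errors within a batch
# cancels out much of the per-ride noise, so the loss curve is far less jittery.
batched = dataset.batch(50)

for step, (hits_batch, cost_batch) in enumerate(batched.take(2000)):
    with tf.GradientTape() as tape:
        preds = tf.reduce_sum(hits_batch * w, axis=[1, 2])   # one prediction per ride
        loss = tf.reduce_mean(tf.square(preds - cost_batch))
    optimizer.apply_gradients(zip(tape.gradient(loss, [w]), [w]))
```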

By combining both techniques, we can get arbitrarily close to the true values. We can also reuse samples to save on training data and possibly learn faster. Note that if the true weights are not time-stationary, we should constrain our training data and weight averages to some recent window. Otherwise, the more we accumulate, the better.

You can look at the full code over at this notebook.

Final thoughts

After all this, it’s worth taking a step back and looking at the big picture. I’ve given an example where our goal was to estimate road lengths, so we could find shortest paths as we wish, but in reality road lengths are often known to adequate precision, and aren't the best cost to minimize. I propose not to optimize just distance or duration, but some implicit cost function that depends on the environment and the user. This function is expected to highly correlate with the duration for each ride, but may also depend on factors that we cannot always measure directly, such as road condition, temporary detours and neighbourhood.

Say we ask the user to rate their ride each time, on a scale of 1 to 5, separately from rating their driver. That score can be used to derive an implicit cost sample. There are a number of ways to estimate future costs from this feedback, but perhaps the simplest one is to assume that, for each user, cost is a fixed function of the path alone, and assign weights to each road segment. Even if those weights are not really fixed over time, our online learning approach should be able to adapt accordingly. At the very least, this is a starting point for more complex and contextual approaches.

I hope you’ve enjoyed this article, and since I’m a Medium newbie, I appreciate any feedback. I’ve got a few ideas for future entries, but I’d love to hear some suggestions as well!
