Requisite background: high school level programming and calculus. Explanation of backprop is included, skim it if you know it.
If you’ve taken a calculus class, you’ve probably differentiated functions like g(x) = (x² + 1) * e^(-x²/2). But if you want to do math on a computer (e.g. for machine learning), then you’ll need to differentiate functions like
function f(x): # Just wait, the variable names get worse…
a = x^(½) # Step 1
b = 1 - 1/(a+1) # Step 2
c = ln(b) # Step 3
return c
This is a toy example, but imagine how gnarly this could get if f(x) contains loops or recursion! How can we differentiate f(x), without having to write it all out in one line? More specifically, given the code to compute f(x), how can we write code to compute df/dx?
The answer is backpropagation. We step through the function, differentiating it line-by-line, in reverse. At each line, we compute the derivative with respect to one of the internal variables. As we work back, we differentiate with respect to earlier and earlier variables until we reach x itself.
Let’s work through the example. Keep an eye on the diagram above to see where we are as we go.
First, we differentiate the last line:
return c: df/dc = 1
Then, we work backward:
Step 3: c = ln(b) -> df/db = df/dc * dc/db = df/dc * 1/b
Notice what happened here: we used the chain rule, and one of the two pieces to come out was df/dc — the result of the previous step. The other piece is the derivative of this particular line. Continuing:
Step 2: b = 1 - 1/(a+1) -> df/da = df/db * db/da = df/db * (1/(a+1)²)
Step 1: a = x^(½) -> df/dx = df/da * da/dx = df/da * (1/2*x^(-½))
Each step takes the result of the previous step and multiplies it by the derivative of the current line.
Now, we can assemble it all together into one function:
a = x^(½) # Step 1
b = 1 - 1/(a+1) # Step 2
c = ln(b) # Step 3
df_dc = 1
df_db = df_dc * 1/b # df_dc times derivative of step 3
df_da = df_db * (1/(a+1)²) # df_db times derivative of step 2
df_dx = df_da * (1/2*x^(-½)) # df_da times derivative of step 1
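As a sanity check, the assembled steps above can be written as runnable Python and compared against a numerical finite-difference derivative (the value x = 4 is an arbitrary test point):

```python
import math

def f_and_grad(x):
    """Forward pass (steps 1-3), then backprop line-by-line in reverse."""
    a = x ** 0.5             # Step 1
    b = 1 - 1 / (a + 1)      # Step 2
    c = math.log(b)          # Step 3
    df_dc = 1.0
    df_db = df_dc * (1 / b)              # df_dc times derivative of step 3
    df_da = df_db * (1 / (a + 1) ** 2)   # df_db times derivative of step 2
    df_dx = df_da * (0.5 * x ** -0.5)    # df_da times derivative of step 1
    return c, df_dx

# Compare backprop against a centered finite difference at x = 4
x = 4.0
_, grad = f_and_grad(x)
h = 1e-6
numeric = (f_and_grad(x + h)[0] - f_and_grad(x - h)[0]) / (2 * h)
print(grad, numeric)  # both ≈ 0.04167
```

At x = 4 the exact answer works out to 1/24, and the two numbers agree to several decimal places.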
The name “backpropagation” derives from the df_d* terms, which “propagate backwards” through the original function. These terms give us the derivative with respect to each internal variable.¹
In the next section, we’ll relate this to intermediate prices in supply chains.
Suppose we have a bunch of oversimplified profit-maximizing companies, each of which produces one piece in the supply chain for tupperware. So, for instance:
- Company A produces ethylene from natural gas
- Company B produces plastic (polyethylene) from ethylene
- Company C molds polyethylene into tupperware
We’ll give each company a weird made-up production function:
- Company A can produce a(x) = x^(½) units of ethylene from x units of natgas
- Company B can produce b(a) = 1–1/(a+1) units of polyethylene from a units of ethylene
- Company C can produce c(b) = ln(b) units of tupperware from b units of polyethylene
You may notice that these weird made-up production functions look suspiciously similar to steps 1, 2 and 3 in our function from the previous section. Indeed, f(x) from the previous section tells us how much tupperware can be made from a given amount of natgas: we compute how much ethylene can be made from the natgas (step 1, company A), then how much polyethylene can be made from the ethylene (step 2, company B), then how much tupperware can be made from the polyethylene (step 3, company C).
Each company wants to maximize profit. If company C produces c units of tupperware (at unit price Pc) from b units of polyethylene (unit price Pb), then their profit is Pc*c(b) - Pb*b: value of the tupperware minus value of the polyethylene. In order to maximize that profit, we set the derivative to zero, then mutter something about KKT and pretend to remember what that means:
Company C: d/db [Pc*c(b) - Pb*b] = 0 -> Pb = Pc * dc/db = Pc * (1/b)
We’ve assumed competitive markets here: no single company is large enough to change prices significantly, so they all take prices as fixed when maximizing profit. Then, at a whole-industry-level, the above formula lets us compute the price Pb of polyethylene in terms of the price Pc of tupperware.
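To see the first-order condition concretely, here is a small hypothetical check (the prices Pc = 3 and Pb = 2 are made up): scan over input quantities b and confirm that company C's profit peaks exactly where Pb = Pc * (1/b), i.e. at b* = Pc/Pb.

```python
import math

# Fixed (price-taker) prices; values are made up for illustration
Pc, Pb = 3.0, 2.0

def profit(b):
    """Company C's profit: Pc*c(b) - Pb*b, with c(b) = ln(b)."""
    return Pc * math.log(b) - Pb * b

# Scan a fine grid of input quantities for the profit maximizer
bs = [i / 10000 for i in range(1, 50000)]
b_star = max(bs, key=profit)

# First-order condition predicts Pb = Pc * (1/b*), i.e. b* = Pc/Pb = 1.5
print(b_star)
```

The grid search lands on b* = Pc/Pb, matching the formula we got by setting the derivative to zero.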
Well, now we can work back up the supply chain. Maximize profit for company B, then company A:
Company B: d/da [Pb*b(a) - Pa*a] = 0 -> Pa = Pb * (1/(a+1)²)
Company A: d/dx [Pa*a(x) - Px*x] = 0 -> Px = Pa * (1/2*x^(-½))
Notice that these formulas are exactly the same as the formulas we used to compute df/dx in the previous section. Just replace df/da by Pa, df/db by Pb, etc — the price of the intermediate good is just the derivative of the production function with respect to that good. (Actually, the price is proportional to the derivative, but it’s equal if we set the price of tupperware to 1 — i.e. price things in tupperware rather than dollars.)
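Here is a hypothetical illustration of that pricing-in-tupperware trick (the quantity x = 4 is made up): set Pc = 1, run the price formulas from company C back up to company A, and check that the resulting natgas price Px equals the marginal tupperware per extra unit of natgas, computed by finite difference on f.

```python
import math

def f(x):
    """End-to-end production: tupperware from x units of natgas."""
    a = x ** 0.5
    b = 1 - 1 / (a + 1)
    return math.log(b)

x = 4.0                       # made-up natgas quantity
a = x ** 0.5                  # company A's output (ethylene)
b = 1 - 1 / (a + 1)           # company B's output (polyethylene)

Pc = 1.0                      # price everything in tupperware
Pb = Pc * (1 / b)             # company C's first-order condition
Pa = Pb * (1 / (a + 1) ** 2)  # company B's
Px = Pa * (0.5 * x ** -0.5)   # company A's

# Px should equal df/dx: marginal tupperware per extra unit of natgas
h = 1e-6
marginal = (f(x + h) - f(x - h)) / (2 * h)
print(Px, marginal)  # both ≈ 0.04167
```

The backward pass through the supply chain and the derivative of the end-to-end production function give the same number.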
So the math is the same, but how physically realistic is this picture? In a real market, the chain of causality would not follow the same first-go-forward-then-go-back pattern as backprop; firms in the market all tweak their numbers in parallel. Also, unlike true backprop, firms will change their input and output quantities, i.e. change x. We can make it look more like backprop by fixing input quantities and pricing in tupperware rather than dollars, but that’s pretty artificial. That said, the math does match up perfectly when the market is at equilibrium. That’s all we need for many applications — including organizational scaling. Even if the analogy isn’t perfect, it is strong enough to be useful.
To sum it all up: we can think of profit-maximizing firms in a competitive market as a distributed backpropagation algorithm. As long as each firm maximizes its profit, the net effect is to make the price of each intermediate good, relative to tupperware, equal to the amount of extra tupperware which could be produced per extra unit of the intermediate (i.e. the derivative of the tupperware production function with respect to that intermediate). Each firm “backpropagates” price information from their output prices to their input prices. This doesn’t always reflect physical causality, but the math matches exactly for markets at equilibrium.
Organizations & Management
Let’s switch gears, and think about a company’s internal supply chain. To continue our example, suppose that the whole natgas -> ethylene -> polyethylene -> tupperware chain is inside a single vertically-integrated company. Let’s call it Exxon-Tupperware.
Ideally, to maximize its profits, Exxon-Tupperware would just take the whole end-to-end production function f(x) and stick it into their profit formula, same as before, and end up with Px/Pc = df/dx, same as before. In principle, they could even use backpropagation to do this, which would give them the values of all the intermediate goods. Each department within the company would set the internal “price” of their local intermediate good equal to their local derivative of production, and the company’s profits would be maximized.
In practice, I don’t see many managers running backprop or implementing internal markets. Even if they knew the math, it would be tough to account for all the little things — ethylene is ethylene, but many intermediate goods are not so homogenous. In an open market, competition plus profit maximization keeps the derivatives in line, even if the local behavior is quite complicated. But within a big company, it’s difficult to structure incentives to make that happen.
So what’s the outcome? Misaligned incentives. Waste. When price is not equal to derivative, profits are not maximized, opportunities are missed, value is lost. Almost anyone who’s worked in a company of more than a couple hundred people has seen it.
Consider the marketing department of a large car dealership. Maybe marketing optimizes for number of calls/emails inquiring about a car, but it’s hard to account for more/less serious buyers from different marketing channels. This leads to misalignment and waste: marketing ends up spending too much on channels which produce large numbers of bad leads, and too little on channels which produce small numbers of good leads.
Consider the sales department. Maybe sales optimizes for number of cars sold, but that incentivizes salespeople to minimize the dealership’s margin on the car. This leads to misalignment and waste: salespeople will sell a car for less than they could get, in order to boost their numbers.
Consider the upstream side, the manager(s) who buy cars for the dealership to sell. Maybe they’re judged on how quickly the cars they buy are sold. This leads to misalignment and waste: they end up overstocked on very common cars with low margins, and understocked on more specialty cars with higher margins.
In a small company, with good communication across departments, this usually gets hashed out — someone will notice the misalignment, and people aren’t too metric-driven to adjust a bit. But in larger, more numbers-driven companies, the mismatch between a local metric and the true local derivative can persist.
But even these local imperfections, on their own, usually aren’t enough to break a company. The real problem is when each department’s incentives are off by just a little bit, and the errors compound.
Stability & Value Drift
Finally, we get to the interesting part.
Suppose that the local derivative calculations in a backpropagation algorithm aren’t exact — they have a bit of noise. Each step adds a little noise, and that noise propagates back along with the signal, so errors accumulate as you backpropagate. Even if each local calculation only adds a tiny bit of noise, it could add up in a complex calculation with many steps.
As it turns out, even that scenario is too optimistic. In general, the noise in each step is multiplied when it propagates back. As you backpropagate, the random noise can grow exponentially, quickly ruining any calculation with more than a few steps. On the other hand, in some scenarios the opposite can happen: accumulated noise might actually be multiplied by a factor less than one at each step, allowing reliable derivative calculation even for large, complex systems with a lot of noise.
Crucially, it is possible to mathematically compute which of these scenarios one faces. This is just standard sensitivity analysis. (Interesting exercise: calculate the Lyapunov exponent of a backprop step in a recurrent neural net. I did this for a class project in college; it’s relatively straightforward.)
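A toy sketch of the two regimes (all numbers here are made up for illustration): multiply unit noise by a random gain at each backprop step. When the typical gain exceeds one, noise explodes as it propagates back; when it stays below one, the noise dies out.

```python
import random

def accumulated_noise(step_gain, n_steps, seed=0):
    """Propagate unit noise back through n_steps, each step multiplying
    it by step_gain times a random fluctuation."""
    rng = random.Random(seed)
    noise = 1.0
    for _ in range(n_steps):
        noise *= step_gain * rng.uniform(0.5, 1.5)
    return noise

print(accumulated_noise(1.2, 50))  # typical gain > 1: noise blows up
print(accumulated_noise(0.8, 50))  # typical gain < 1: noise dies out
```

Over 50 steps, the two regimes differ by a factor of (1.2/0.8)^50 — many orders of magnitude — which is why deep pipelines are so sensitive to which regime they sit in.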
Now, let’s turn to price theory. The efficient markets hypothesis stipulates that, under perfect competition, prices would perfectly reflect (local) partial derivatives… but real markets are never perfectly competitive, and real agents are never perfectly rational. In practice, these imperfections result in local noise in the market prices — they do not quite perfectly reflect partial derivatives, even in very straightforward markets. Since markets are just distributed backprop, we expect that noise to accumulate as it propagates back. In noise-sensitive markets, the noise will grow quite large for prices of inputs far from the finished goods.
Fortunately, this noise propagation is counterbalanced somewhat by economic actors constantly looking around at the wider system and trying to guess how the big picture will affect them, and placing bets on future price movement. With respect to short-term market noise, real people are not strictly local optimizers. They try to guess which way prices are off, and bet accordingly — e.g. in futures markets. All that speculative activity has a strong damping effect on noise-induced price fluctuations.
For managers and organizations, however, the situation is more bleak. The informal and often intuitive nature of managers’ local optimization problems presumably leads to much noisier local estimates. Organizations also do not typically have any damping mechanism comparable to financial speculation. Given all this, we should expect organizations to face a scaling problem: as the organization grows, local incentives diverge from the big picture.
Predictions and Takeaway
Based on all this theory, what can we predict about organizational scaling problems?
First and foremost, we should expect incentive drift to strike hardest at early steps in deep pipelines — the parts of an organization furthest removed from the finished product. Organizations with deep pipelines are more susceptible than those with broad, shallow pipelines.
Local characteristics will also determine organizational effectiveness. Any one component whose impact is difficult to measure will have high risk of incentive misalignment, and that misalignment will propagate back to components upstream.
In principle, we could even attempt to quantify the whole thing. In an actual company, we could look at each division to see what metric they try to maximize, and then how far off that metric is from what they would ideally maximize (i.e. “market value” of their outputs). This would be difficult, especially if the “intermediate products” are not homogenous — e.g. leads as the intermediate output of marketing. Every lead is different.
But the theory does at least give us a handle on what each department should optimize — even if there isn’t always a good way to calculate it in practice, it’s useful to know what we’re trying to approximate. Specifically, each department should maximize their local “profit”: the value of all their outputs to downstream departments, minus the cost of all their inputs. As the department maximizes this quantity, they will set the local derivative of their production function equal to the “price” of their intermediate outputs — the per-unit value of their output to downstream departments. That price information then propagates back up the chain.
¹In ML we usually have multiple inputs, in which case we compute the gradient. Besides being single-variable, I’ve also simplified the example by having each variable read on only one line, strictly sequentially — otherwise we’d sometimes need to update the derivatives in place. All of this also carries over to price theory, for supply chains with multiple inputs which can be used in multiple ways.