Algorithms to Live By — Brian Christian and Tom Griffiths
“Our judgments betray our expectations, and our expectations betray our experience. What we project about the future reveals a lot — about the world we live in, and about our past.”
Start — 11/6/16 Finish — 11/12/16
Thoughts:
I had heard this book strongly recommended on multiple podcasts, some saying it was specifically relevant for trading and investing. I’m not entirely sure I see a specific use for these authors’ insights in those fields over just general usefulness for thinking about problem-solving. But the authors certainly pose unique and compelling arguments for incorporating algorithmic thinking into our everyday lives. They show us how the realm of computer science actually has plenty of overlap with the human decision-making process.
Christian and Griffiths use various facets of algorithmic problem-solving like sorting, caching, and optimal stopping to convince the reader that step-by-step methods that seem very mechanical in nature are actually extremely useful (or at least “good enough”) for making decisions amid uncertainty. They usually weave into each chapter a narrative about a renowned computer scientist who first solved some problem with a particular algorithm or framework. I can’t say the narratives were particularly exciting, and the personalities they describe aren’t fleshed out at all, but I guess that’s beside the point. This book certainly isn’t like The Innovators by Walter Isaacson, or anything out of a Michael Lewis book on the dotcom bubble. The characters and narratives are sort of just tacked on as origin stories for some type of problem-solving. Algorithms to Live By is just as mechanical and practical in its style as in its guidelines for making difficult decisions.
The book had a very reassuring message amid all the uncertainty we face on a day-to-day basis: it is okay if the outcome isn’t what we wanted, so long as our process made sense and we tried our best. Ultimately, life constantly presents us with intractable problems that no algorithm can fully solve; however, we do have many heuristics, approximations, and general guidelines we can use to mitigate the overwhelming odds against an optimal decision.
This is a really great dive into intelligent decision-making, and it gives the reader plenty of actionable advice for attacking all sorts of real-life problems. And while the writing isn’t anything special (would you expect much different from two computer scientists?) and the narratives are slightly dry, the usefulness and uniqueness of the book certainly win out in the end.
Score: 7/10
Directives:
1. There are cases where good algorithmic approaches can simply be transferred to human problems. The 37% rule, the Least Recently Used criterion for overflowing caches, and the Upper Confidence Bound as a guide to exploration are all examples of this.
2. Be sure to distinguish between process and outcome. Even the best strategy sometimes yields bad results — if you stuck to the best possible plan, you shouldn’t blame yourself if things didn’t go your way.
3. We should draw clear lines between problems that admit straightforward solutions and problems that don’t. For the more complex ones, remember that heuristics, approximations, and strategic use of randomness can help you find workable solutions.
4. For the most difficult problems, practice some form of constraint relaxation. If the problem in front of you is too hard, try to solve an easier version of it and see if the solution can give you a starting point to the original.
5. Game theory and information cascades bring up some important insights: first, we should be wary whenever we know more about what people are doing than why they’re doing it; second, remember that the actions of others do not necessarily reflect their beliefs; third, we should remember that games can have irredeemably lousy rules that produce terrible outcomes despite rational behavior.
Notes:
· Algorithm — a finite sequence of steps used to solve a problem
o Even in cases where life is too messy for us to expect a strict numerical analysis, using intuitions and concepts honed on the simpler forms of our problems offers us a way to understand key issues and make progress
o Algorithms can ultimately help us manage finite space, finite time, limited attention, unknown unknowns, incomplete information, and an unforeseeable future
· Optimal Stopping
o The problem is determining when to stop evaluating a pool of potential options when you can consider only one at a time and want to make the best possible selection
o When you stop too early, you leave the best applicant undiscovered; when you stop too late, you hold out for a better applicant who doesn’t exist — the optimal strategy requires a balance between the two
o The Look-Then-Leap Rule: you set a predetermined amount of time for exploring options, then after that point instantly commit to anyone who is better than all previously viewed options (a simulation sketch appears at the end of this section)
§ As the pool of options grows, the exact place to draw the line converges upon 37% of the pool
§ This optimal strategy ultimately gives a 37% chance of hiring the best applicant; while this may sound low, without the strategy our chances of finding the best choice would steadily decrease with pool size — it becomes more valuable as the data grows larger
§ The rule can be applied either to a fixed pool or a fixed amount of time
o Modifications:
§ Adding the possibility of rejection or of recall into the mix will decrease or increase the look period, respectively
§ Full information (or an objective and not relative standard on which to judge applicants) changes to the Threshold Rule: immediately accept an applicant if they are above a certain percentile
· The percentile threshold decreases as the number of applicants remaining in the pool decreases (aka lower your standards when you have a smaller pool to choose from)
· This rule leaves you with a success rate of 58% instead
§ Job hunts and real estate sales — when offers arrive from a known range of values and there is a cost to passing on an offer, the Threshold Rule applies
§ The burglar problem (each robbery provides some reward, but getting caught means losing all accumulated gains): the optimal number of robberies is roughly equal to the chance of success divided by the chance of failure
o We tend to stop too early in most optimal stopping type problems, but this may be explained by the implicit time cost of all of our decisions
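To make the Look-Then-Leap Rule concrete, here is a small Monte Carlo simulation in Python (the pool size and candidate scores are invented; only their relative order matters):

```python
import random

def look_then_leap(candidates, look_fraction=0.37):
    """Reject the first look_fraction of candidates, then take the
    first one better than everything seen during the look phase."""
    n = len(candidates)
    cutoff = int(n * look_fraction)
    best_seen = max(candidates[:cutoff]) if cutoff > 0 else float("-inf")
    for value in candidates[cutoff:]:
        if value > best_seen:
            return value
    return candidates[-1]  # forced to settle for the last candidate

# Estimate how often the rule lands the single best candidate.
trials, n, wins = 100_000, 100, 0
for _ in range(trials):
    pool = random.sample(range(n * 10), n)  # distinct scores, random order
    if look_then_leap(pool) == max(pool):
        wins += 1
print(f"Picked the best candidate in {wins / trials:.1%} of trials")  # ~37%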
· Explore/Exploit
o Problem: how do we decide between playing it safe with something we know well and taking a chance on something new in hopes that it’s better than what we know?
o The balance depends entirely on the time interval you have to choose; the value of exploration decreases over time while the value of exploitation increases over time
o The Gittins Index assumes geometric discounting (valuing each run as a constant fraction of the previous one) and tries to maximize a single quantity that accounts for both exploration and exploitation
§ The win-stay principle holds, the lose-shift principle will not always work well, and complete or relative unknowns receive a bonus
§ If the future is discounted more lightly, the value of making a chance discovery (relative to taking a sure thing) goes up even more
o A regret-minimization framework will make use of an Upper Confidence Bound algorithm, in which you choose the arm that could reasonably perform best in the future (not the arm that has performed best so far); a sketch appears at the end of this section
§ Similar to the Gittins Index, but easier to compute and doesn’t require geometric discounting
§ Still gives a bonus to possibilities we know less about
o Empirical studies have shown that people generally favor exploration more than would be optimal, and go back and forth between options even when one is objectively better
§ This may be explained by the fact that the chance of success for our given options may change or get better over time
§ Both the young and old generally behave in line with what the algorithm would expect — the young favor exploration while the old favor exploitation because both realize where they are on their respective time intervals
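A minimal sketch of the Upper Confidence Bound idea, using the common UCB1 formula (the book doesn’t commit to a specific variant, and the three arm probabilities here are invented):

```python
import math
import random

def ucb1(true_probs, pulls=10_000):
    """UCB1 bandit: always play the arm with the highest upper
    confidence bound, which gives a bonus to lesser-known arms."""
    n = len(true_probs)
    counts = [0] * n     # times each arm was played
    totals = [0.0] * n   # accumulated reward per arm
    for t in range(1, pulls + 1):
        if t <= n:                       # play every arm once to start
            arm = t - 1
        else:
            arm = max(range(n), key=lambda i:
                      totals[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        counts[arm] += 1
        totals[arm] += 1.0 if random.random() < true_probs[arm] else 0.0
    return counts

print(ucb1([0.3, 0.5, 0.7]))  # most pulls should go to the 0.7 arm
```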
· Sorting
o Problem: sorting becomes increasingly difficult as scale grows — sometimes growing linearly with n, other times quadratically (n*n)
o Mergesort is an algorithm that starts by putting adjacent units into sorted pairs, then collating the pairs into ordered sets of four, eventually leading to a fully sorted object; this leaves us with a complexity between linear and quadratic time (n*log(n)); see the sketch at the end of this section
o Bucket sort groups items together into a number of sorted categories with no regard for intracategory sorting, with the goal of dividing items into roughly equalized groups; this can allow for linear time
o Comparison Counting Sort compares each item to all the others, generating a tally of how many items it is bigger than, then using that as its rank; this is like a Bubble Sort, and works in quadratic time (very inefficient, but robust and not prone to mistakes)
o The effort expended on sorting materials is just a preemptive strike against the effort it’ll take to search through them later — so generally we should err on the side of messiness, since sorting something you will never search is a complete waste
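A minimal Mergesort in Python, per the description above: split, sort the halves, then collate.

```python
def mergesort(items):
    """Split, sort halves, then merge; O(n log n) comparisons overall."""
    if len(items) <= 1:
        return items
    mid = len(items) // 2
    left, right = mergesort(items[:mid]), mergesort(items[mid:])
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):  # collate the two sorted halves
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]     # append whatever remains

print(mergesort([5, 2, 8, 1, 9, 3]))  # [1, 2, 3, 5, 8, 9]
```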
· Caching
o Caching is the process of keeping around pieces of information that you refer to frequently so they can be accessed more quickly; a caching algorithm decides what information or items to evict when that limited space or memory is used up
o Belady’s Algorithm — evict whichever item you’ll need again the longest from now (requires data from the future)
§ Least Recently Used comes closest to such clairvoyance among various caching algorithms
§ LRU teaches us that the next thing we can expect to need is the last one we needed, while the thing we’ll need after that is probably the second-most-recent one
o Caching is just as useful when it’s proximity (rather than performance) that’s the scarce resource
o Advice applied to physical objects in human environments:
§ LRU is a good principle to use (a minimal cache sketch follows this list)
§ Make sure things are in whatever cache is closest to the place where they’re typically used
§ Having multiple levels of caches — from smallest and fastest to largest and slowest — can be even better
§ Return anything you take out of the cache to the front position every time
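A minimal LRU cache sketch using Python’s OrderedDict to track recency (the two-item capacity and the keys are arbitrary):

```python
from collections import OrderedDict

class LRUCache:
    """Fixed-size cache that evicts the least recently used item."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)          # mark as most recently used
        return self.items[key]

    def put(self, key, value):
        if key in self.items:
            self.items.move_to_end(key)
        self.items[key] = value
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)   # evict least recently used

cache = LRUCache(2)
cache.put("a", 1); cache.put("b", 2)
cache.get("a")                               # "a" is now most recent
cache.put("c", 3)                            # evicts "b"
print(list(cache.items))                     # ['a', 'c']
```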
· Scheduling
o Problem: how do we order our day when some tasks must come before others, some can only be started at a certain time, some have definite deadlines while others don’t, and some are important but not urgent?
o Earliest Due Date — start with the task due soonest and work your way toward the task due last (best for reducing maximum lateness)
o Moore’s Algorithm — start with EDD, but as soon as it looks like we won’t get to the next item in time, throw out the biggest item; repeat this pattern until everything remaining can be done by its due date (best for reducing the number of late projects; see the sketch at the end of this section)
o Shortest Processing Time — always do the quickest task you can (best for reducing the sum of completion times)
§ May ease your mind by shrinking the number of outstanding tasks as quickly as possible
§ Can also weight each project by importance, divide by the time needed to complete each, and work in order from highest importance per unit time to lowest
o EDD and SPT are still optimal strategies when new assignments are given to you at unpredictable moments as long as you can switch to the job that just came up if it’s due sooner than the one you’re currently doing
§ There are psychological or mental switching costs associated with preemption, including possible delays and errors
§ And of course, scheduling itself is an item on our to-do list
o Thrashing — any situation in which the system grinds to a halt because it’s entirely preoccupied with metawork
§ In this state, even doing tasks in the wrong order is better than doing nothing at all
§ Establishing a minimum length of time to spend on one task can help prevent a commitment to responsiveness from destroying throughput entirely — if the minimum is longer than the time it takes to context switch, then we can never get into thrashing
o Interrupt coalescing — batch tasks that can be done together as long as none of them have conflicting response time intervals
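A sketch of EDD and Moore’s Algorithm as described above (the task names, durations, and due dates are invented, in shared time units):

```python
def earliest_due_date(tasks):
    """tasks: list of (name, duration, due). Minimizes maximum lateness."""
    return sorted(tasks, key=lambda t: t[2])

def moores_algorithm(tasks):
    """EDD, but discard the longest job whenever a deadline would be
    missed. Minimizes the number of late tasks."""
    schedule, discarded, elapsed = [], [], 0
    for task in earliest_due_date(tasks):
        schedule.append(task)
        elapsed += task[1]
        if elapsed > task[2]:                    # this task would be late
            longest = max(schedule, key=lambda t: t[1])
            schedule.remove(longest)             # throw out the biggest item
            discarded.append(longest)
            elapsed -= longest[1]
    return schedule + discarded                  # late jobs go at the end

tasks = [("taxes", 4, 5), ("essay", 2, 4), ("laundry", 1, 3)]
print(moores_algorithm(tasks))
```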
· Bayes’s Rule
o Laplace’s Law: for any possible w successes in n attempts, the expectation of success is (w+1)/(n+2)
§ Even when we’ve only made a few observations amidst some uncertainty, we can still have this practical guide as an estimate
o Bayes’s Rule: P(A|B) = P(B|A)P(A)/P(B)
§ Gives a straightforward solution to the problem of how to combine preexisting beliefs with observed evidence: just multiply their probabilities together (a worked example follows below)
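A worked sketch of Laplace’s Law and Bayes’s Rule in Python (the disease-testing numbers below are invented for illustration):

```python
def laplace(wins, attempts):
    """Laplace's Law: estimated success probability after w wins in n tries."""
    return (wins + 1) / (attempts + 2)

print(laplace(0, 0))   # 0.5: no data, assume a coin flip
print(laplace(1, 1))   # 0.667: one success in one try, not a sure thing
print(laplace(9, 10))  # 0.917: close to, but shy of, the raw 90%

# Bayes's Rule: P(A|B) = P(B|A) * P(A) / P(B), with invented numbers.
p_disease = 0.01                  # prior probability of the disease
p_pos_given_disease = 0.95        # test sensitivity
p_pos = p_pos_given_disease * p_disease + 0.05 * (1 - p_disease)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(disease | positive test) = {p_disease_given_pos:.2f}")  # ~0.16
```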
o The Copernican Principle: if any moment is equally likely in a given object’s lifetime (or we have no prior knowledge about an object), then we can expect to show up precisely halfway into the duration of any phenomenon — or things will last as long as they’ve lasted already
o When it comes to making predictions based on limited evidence, few things are as important as having good priors — like whether we’re dealing with a normal distribution or a power-law distribution (with most points below the mean, and a few much higher than it)
§ For any power-law distribution, the appropriate prediction strategy is a multiplicative rule: multiply the quantity observed so far by some constant factor; the longer something has gone on, the longer we expect it to continue (all three rules are sketched in code after this list)
§ For any normal distribution, we use an average rule: use the distribution’s natural average as your guide; events are surprising when they’re early but not when they’re late
§ For any Erlang distribution, we use the additive rule: always predict that things will go on just a constant amount longer; events are never any more or less surprising no matter when they occur
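A toy sketch of the three prediction rules (the factor, average, and margin values are placeholders I chose, not figures from the book):

```python
def predict(observed, distribution, avg=None, factor=2.0, margin=None):
    """Predict total duration given how long something has run so far."""
    if distribution == "power_law":       # multiplicative rule
        return observed * factor
    if distribution == "normal":          # average rule
        return max(avg, observed)         # at least as long as seen so far
    if distribution == "erlang":          # additive rule
        return observed + margin          # always "a constant amount longer"

print(predict(10, "power_law"))           # box-office grosses, wealth: 20.0
print(predict(10, "normal", avg=78))      # human lifespans: 78
print(predict(10, "erlang", margin=5))    # memoryless waits: 15
```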
o We should remember that news, social media, and other forms of story-telling can very easily skew our perceived distributions and likelihoods for a given phenomenon; representation in the media does not equal frequency in the real world
· Overfitting
o Problem: how hard should we think about our problems, and how many factors should we take into consideration?
§ Including more factors in a model will make it a better fit for the data, but that does not necessarily mean a better prediction
§ Too many factors can also generate highly variable results from one instance of a study to the next (overfitting), meaning adding complexity can be harmful
o Overfitting to certain metrics or incentives can have many unforeseen consequences, and the shift towards real-time analytics has only made that danger more intense
o We can partially fight overfitting through cross-validation, or assessing how well a model does with data it hasn’t seen (which, ironically, may mean using less data); a toy demonstration appears at the end of this section
§ Cross-validation can be done by withholding data points and also testing the model with data from some other form of evaluation entirely (different proxy metrics)
§ Regularization can also combat overfitting by adding a complexity penalty to your model — more complex models need to do a significantly better job of explaining the data to justify themselves (ex: the Lasso, which uses as its penalty the total weight of the different factors of the model)
o If the factors we come up with first are likely to be the most important ones, then beyond a certain point thinking more about a problem is not only going to be a waste of time and effort — it will lead us to worse solutions
§ The greater the uncertainty, and the bigger the gap between what you can measure and what matters, the more you should watch out for overfitting and prefer simplicity (aka stopping early)
§ Remember that when facing complexity, going with our first instinct can be the rational solution (heuristics exist for a reason)
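A toy demonstration of overfitting and cross-validation (synthetic data; NumPy assumed): fit a simple and a complex model on the same training points, then compare their errors on withheld data.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
y = 2 * x + rng.normal(0, 0.2, size=x.size)  # truly linear, plus noise

idx = rng.permutation(x.size)                # random train/test split
train, test = idx[:20], idx[20:]             # withhold 10 points
for degree in (1, 9):                        # simple vs. complex model
    coeffs = np.polyfit(x[train], y[train], degree)
    pred = np.polyval(coeffs, x)
    print(f"degree {degree}: "
          f"train MSE {np.mean((pred[train] - y[train]) ** 2):.3f}, "
          f"test MSE {np.mean((pred[test] - y[test]) ** 2):.3f}")
# The degree-9 fit scores better on the training data but typically
# worse on the held-out points: the signature of overfitting.
```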
· Relaxation
o Problem: what do we do when a problem proves to be intractable?
o Constraint Relaxation — removing some of the problem’s constraints and setting about solving a version of the problem that would be preferable to the current one… then, after making headway, trying to put the constraints back in
§ If you can’t solve the problem in front of you, solve an easier version of it — then see if that solution offers you a starting point or a guide
§ Most of the time if we are willing to accept solutions that are close enough, then even some of the toughest problems can be tamed
§ We can also turn discrete optimization problems (integers only) into continuous ones (fractions allowed as answers) and use those as a rough guide (sketched below with the knapsack problem)
o Lagrangian Relaxation — take some of the problem’s constraints and bake them into the scoring system as costs instead
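As one illustration of relaxing a discrete problem into a continuous one, here is a sketch using the classic knapsack problem (the item values, weights, and capacity are invented). The 0/1 version is intractable in general, but allowing fractional items makes a simple greedy approach optimal, and the relaxed answer bounds and guides the original:

```python
def fractional_knapsack(items, capacity):
    """Continuous relaxation of the 0/1 knapsack: fractions allowed.
    Greedy by value density is optimal for the relaxed problem."""
    taken, total = [], 0.0
    for value, weight in sorted(items, key=lambda i: i[0] / i[1],
                                reverse=True):
        frac = min(1.0, capacity / weight) if capacity > 0 else 0.0
        if frac > 0:
            taken.append((value, weight, frac))
            total += value * frac
            capacity -= weight * frac
    return total, taken

items = [(60, 10), (100, 20), (120, 30)]   # (value, weight), invented
print(fractional_knapsack(items, 50))      # 240.0, the relaxed optimum
```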
· Randomness
o Sometimes the best solution to a problem is to turn to chance rather than trying to fully reason out an answer
§ Random sampling over and over (like in a Monte Carlo simulation) can produce decent approximate results (sometimes to near certainty) in a fraction of the time when calculating exact probabilities proves too exhausting
§ Sampling can be very useful for problems too complex to be comprehended directly, or for situations where statistics and anecdotes aren’t enough (public policy for example)
o Results may be improved by Hill Climbing — taking a baseline solution, then testing alternatives by making slight perturbations to it and keeping the best local improvement (stopping when no further change gives improvement)
§ To avoid getting stuck at local maxima, you can make a few random small changes (even if they’re worse) and go back to Hill Climbing (known as jitter)
§ Random-restart hill climbing involves completely scrambling our solution when we reach a local maximum and beginning the climb again (works well when there are many local maxima in a problem)
§ Metropolis Algorithm — like Hill Climbing, but at any given point can potentially accept bad tweaks as well as good ones
§ Simulated Annealing — like the Metropolis Algorithm, but slowly ramps down the probability of accepting a bad tweak (fewer inferior choices over time; see the sketch at the end of this section)
o Creativity itself can be stimulated by introducing a random element into your process, bringing in new associations or changing the context of your problem to help you think about it differently
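A minimal sketch combining these ideas: hill climbing with a Metropolis-style acceptance step whose temperature ramps down over time, i.e., simulated annealing (the toy cost landscape, starting point, temperature, and cooling rate are all invented):

```python
import math
import random

def simulated_annealing(cost, neighbor, start, temp=10.0, cooling=0.999,
                        steps=20_000):
    """Hill climbing that sometimes accepts worse moves, with the
    acceptance probability ramped down as the temperature cools."""
    current, best = start, start
    for _ in range(steps):
        candidate = neighbor(current)
        delta = cost(candidate) - cost(current)
        # Always accept improvements; accept bad tweaks with a probability
        # that shrinks as delta grows and as the temperature drops.
        if delta < 0 or random.random() < math.exp(-delta / temp):
            current = candidate
        if cost(current) < cost(best):
            best = current
        temp *= cooling
    return best

# Toy problem: a bumpy 1-D landscape with many local minima.
cost = lambda x: x * x + 10 * math.sin(3 * x)
neighbor = lambda x: x + random.uniform(-0.5, 0.5)
print(simulated_annealing(cost, neighbor, start=5.0))
```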
· Game Theory
o Nash equilibrium — both players in a game follow a strategy that neither would want to change, given the play of their opponent
§ But just because these outcomes are stable and the result of rational action does not mean they are good for each player (ex: the prisoner’s dilemma; a brute-force check follows this sub-list)
§ The “price of anarchy” quantifies the cost of such dominant strategies that don’t coordinate; in some cases this price is small enough to make practical sense
§ Poor Nash equilibria can also occur on a multi-party scale, otherwise known as the Tragedy of the Commons
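A brute-force check for Nash equilibria in the prisoner’s dilemma (the payoffs are the usual illustrative prison sentences, so lower is better):

```python
# Payoffs (row player, column player): years in prison, lower is better.
COOPERATE, DEFECT = 0, 1
payoffs = {
    (COOPERATE, COOPERATE): (1, 1),
    (COOPERATE, DEFECT):    (3, 0),
    (DEFECT,    COOPERATE): (0, 3),
    (DEFECT,    DEFECT):    (2, 2),
}

def nash_equilibria():
    """A cell is a Nash equilibrium if neither player can do better
    by unilaterally switching strategies."""
    equilibria = []
    for a in (COOPERATE, DEFECT):
        for b in (COOPERATE, DEFECT):
            row_ok = all(payoffs[(a, b)][0] <= payoffs[(alt, b)][0]
                         for alt in (COOPERATE, DEFECT))
            col_ok = all(payoffs[(a, b)][1] <= payoffs[(a, alt)][1]
                         for alt in (COOPERATE, DEFECT))
            if row_ok and col_ok:
                equilibria.append((a, b))
    return equilibria

print(nash_equilibria())  # [(1, 1)]: mutual defection, worse for both
```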
o While game theory asks what behavior will emerge given a set of rules, mechanism design asks what set of rules will give us the behavior we want to see
§ It is possible to worsen every outcome and yet make everyone’s lives better by shifting the equilibrium to what would be considered optimal
§ Scaling up this logic results in a good argument for the role of government — making laws mandating minimum vacations and limiting shop hours for example
§ Emotions in a way act as a mechanism design to enable contracts that need no outside enforcement (to essentially override whatever rational behavior would be)
o Game theory also tells us that catastrophes like financial bubbles can happen even when no one is at fault
§ A group of agents who are all behaving rationally can nonetheless fall prey to effectively infinite misinformation, known as an information cascade
§ Mixing together private information and the flow of public information about individual agents’ actions can create a consensus that diverges greatly from the original private pieces of information
§ We should be wary whenever we know more about what people are doing than why they’re doing it; second, remember that actions are not beliefs; third, we should remember that games can have irredeemably lousy rules that result in terrible outcomes like bubbles
Phrases/Quotes:
· Intuitively, we think that rational decision-making means exhaustively enumerating our options, weighing each one carefully, and then selecting the best. But in practice, when the clock is ticking, few aspects of decision-making (or of thinking more generally) are as important as this one: when to stop.
· Unless we have good reason to think otherwise, it seems that our best guide to the future is a mirror image of the past. The nearest thing to clairvoyance is to assume that history repeats itself — backward.
· So as you age, and begin to experience these sporadic latencies, take heart: the length of a delay is partly an indicator of the extent of your experience. The effort of retrieval is a testament to how well you’ve arranged it: keeping the most important things closest to hand.
· The truth is, there’s always overhead — time lost to metawork, to the logistics of bookkeeping and task management. This is one of the fundamental tradeoffs of scheduling. And the more you take on, the more overhead there is.
· Our judgments betray our expectations, and our expectations betray our experience. What we project about the future reveals a lot — about the world we live in, and about our past.
· When it comes to culture, tradition plays the role of the evolutionary constraints. A bit of conservatism, a certain bias in favor of history, can buffer us against the boom-and-bust cycle of fads. That doesn’t mean we ought to ignore the latest data either, of course. Jump toward the bandwagon, by all means — but not necessarily on it.
· The anecdotes, of course, are rich and vivid, but they’re unrepresentative. Almost any piece of legislation, no matter how enlightened or misguided, will leave someone better off and someone worse off, so carefully selected stories don’t offer any perspective on broader patterns. Aggregate statistics, on the other hand, are the reverse: comprehensive but thin. A statistic can only tell us part of the story, obscuring any underlying heterogeneity.
· What’s more, being able to fall involuntarily in love makes you, in turn, a more attractive partner to have. Your capacity for heartbreak, for sleeping with the emotional fishes, is the very quality that makes you such a trusty accomplice.