Why Randomness Should Be Embraced and Not Feared
Previously I wrote about how the design strategy of biology differs from that of human technology. The difference lies in their priorities. Human technology, and computer technology in particular, is built around ensuring strict and correct behavior. Even quantum computing is designed on the principle of taming the noise of the quantum realm to build predictable quantum circuitry. Biological systems, in contrast, do not demand strict correctness; they are designed to favor robustness.
What are the design strategies of systems that favor robustness? One strategy that immediately comes to mind is redundancy: biological systems are multiply redundant. Another is diversity: monocultures are more susceptible to extinction. But what about the most unreasonably effective strategy of them all, randomness? How does randomness as a strategy lead to robustness?
We can take inspiration from many recent human inventions that employ randomness as a strategy.
Information dispersal methods. Many are familiar with how redundant disk arrays are made more robust through redundancy. There is another kind of storage redundancy that uses randomization as its strategy. Michael Rabin introduced the idea of information dispersal as a means of providing security, load balancing, and fault tolerance. In short, you slice your data into multiple parts and then randomly disperse those parts across a network of storage devices. The method is mathematically provably time-efficient and highly fault-tolerant.
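Rabin's actual scheme uses erasure codes over finite fields; the toy sketch below only captures the flavor, using a single XOR parity fragment and a random shuffle standing in for dispersal across storage nodes:

```python
import random
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def disperse(data: bytes, k: int):
    """Slice data into k fragments plus one XOR parity fragment and
    shuffle them, as if scattering them across storage nodes.
    Any k of the k+1 pieces suffice to reconstruct the original."""
    padded = data + bytes((-len(data)) % k)   # pad to a multiple of k
    size = len(padded) // k
    frags = [padded[i * size:(i + 1) * size] for i in range(k)]
    pieces = list(enumerate(frags + [reduce(xor, frags)]))
    random.shuffle(pieces)                    # random dispersal
    return pieces

def reconstruct(pieces, orig_len: int, k: int) -> bytes:
    have = dict(pieces)
    missing = [i for i in range(k) if i not in have]
    if missing:
        # XOR of the parity and the surviving fragments rebuilds the lost one
        have[missing[0]] = reduce(xor, have.values())
    return b"".join(have[i] for i in range(k))[:orig_len]

data = b"the quick brown fox jumps over"
pieces = disperse(data, 4)
# lose any one piece; the remaining four still recover the data
assert reconstruct(pieces[:4], len(data), 4) == data
```

Losing any single node costs nothing; which node is lost does not matter, which is exactly the fault-tolerance randomized dispersal buys.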
Spread spectrum methods. Spread spectrum is a communication technique that spreads a signal across multiple frequencies. The original motivation was secure communication, but the strategy brings additional benefits: increased resistance to interference and lower power flux density. The method uses randomization to spread a narrowband signal across a much wider band. This resists jamming and also hides the fact that any communication took place at all. That latter side effect is, of course, problematic for methods that seek understanding by sampling signals.
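A minimal sketch of frequency hopping, one family of spread spectrum methods: both endpoints derive the same pseudo-random hop sequence from a shared seed (the channel count and seed names here are invented for illustration):

```python
import random

CHANNELS = 64  # number of narrowband channels in the wider band

def hop_sequence(shared_seed: str, n_hops: int):
    """Both endpoints derive the same pseudo-random hop sequence from a
    shared secret seed; an eavesdropper without the seed sees only
    energy scattered unpredictably across the whole band."""
    rng = random.Random(shared_seed)
    return [rng.randrange(CHANNELS) for _ in range(n_hops)]

sender = hop_sequence("shared-secret", 10)
receiver = hop_sequence("shared-secret", 10)
assert sender == receiver  # endpoints stay in sync

# A jammer blocking one fixed channel disrupts only the hops
# that happen to land on that channel.
jammed_channel = 7
survived = sum(ch != jammed_channel for ch in sender)
print(f"{survived}/10 hops get through")
```

Real systems derive the sequence from cryptographic keys rather than a library PRNG, but the robustness argument is the same: a narrowband jammer can only ever hit a small fraction of the hops.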
Warehouse logistics. Amazon employs a “random stow” system for placing products in its warehouses. Details of this system can be found in “Behind Amazon’s Well-Oiled Machine”. The system spreads products across the warehouse based on forecasted order frequency. The consequence is diversity in every stow area, which reduces bottlenecks. It is not uncommon to find unrelated items stored in the same location. Random stow also reduces selection mistakes: because the system never assigns similar items next to each other, the chances of picking a similar but incorrect item from the same location are greatly reduced. Contrast this with how a library organizes and stores its books. Organization is only important for systems with limited memory (i.e., our brains).
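A minimal simulation of the no-similar-items rule (the item names and categories are made up for illustration):

```python
import random

def random_stow(items, n_bins, seed=0):
    """Place each item into a random bin, re-rolling whenever the bin
    already holds an item of the same category, so that look-alike
    products never end up side by side."""
    rng = random.Random(seed)
    bins = [[] for _ in range(n_bins)]
    for name, category in items:
        while True:
            b = rng.randrange(n_bins)
            if all(cat != category for _, cat in bins[b]):
                bins[b].append((name, category))
                break
    return bins

items = [("red mug", "mug"), ("blue mug", "mug"),
         ("phone case A", "case"), ("phone case B", "case"),
         ("novel", "book"), ("cookbook", "book")]
bins = random_stow(items, 4)
# no bin holds two items of the same category
assert all(len({c for _, c in b}) == len(b) for b in bins)
```

Each bin ends up category-diverse, so a picker reaching into a location cannot grab a similar-but-wrong item, and popular categories are spread across many locations instead of piling up in one aisle.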
Scalable distributed consensus. Achieving consensus in distributed systems is an expensive and time-consuming operation. Blockchains are an example of this problem. Bitcoin ensures that money is not double-spent through a decentralized consensus mechanism. Unfortunately, this is a time-consuming as well as energy-consuming process. A typical Bitcoin transaction can take over half an hour to confirm (assuming 3 confirmation blocks). A recent innovation from Dfinity proposes to sidestep this consensus bottleneck by employing cryptographically verifiable randomization.
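Dfinity's actual design uses threshold BLS signatures to produce an unpredictable, unbiasable random beacon; the toy sketch below shows only the final selection step, with a plain hash standing in for that beacon:

```python
import hashlib

def select_proposer(beacon: bytes, round_no: int, validators: list):
    """Toy random beacon: hash a shared random value with the round
    number and map the digest onto a validator. Every honest node
    computes the same leader with no extra voting rounds."""
    digest = hashlib.sha256(beacon + round_no.to_bytes(8, "big")).digest()
    index = int.from_bytes(digest, "big") % len(validators)
    return validators[index]

validators = ["alice", "bob", "carol", "dave"]
beacon = b"output-of-previous-round"

leader = select_proposer(beacon, 42, validators)
# deterministic: all nodes agree on the leader without communicating
assert leader == select_proposer(beacon, 42, validators)
```

Because the randomness is verifiable, no validator can bias its own selection, and agreement on the leader requires no expensive rounds of message exchange.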
Random forests. By randomly splitting training data across many decision-tree learners, one creates a more robust solution than a single decision tree trained on all the data. Random forests are known to be more robust against errors and noise, and they overfit the training data less.
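A toy bagging ensemble built from scratch shows the mechanism. Each “tree” here is just a one-dimensional threshold classifier fit on a bootstrap resample, a deliberately minimal stand-in for a real decision tree:

```python
import random
from collections import Counter

def train_stump(points, rng):
    """Fit a threshold classifier on a bootstrap resample of the data.
    Each stump sees a different random resample, so their errors differ."""
    sample = [rng.choice(points) for _ in points]
    threshold = sum(x for x, _ in sample) / len(sample)
    left = Counter(y for x, y in sample if x <= threshold)
    right = Counter(y for x, y in sample if x > threshold)
    left_label = left.most_common(1)[0][0] if left else 0
    right_label = right.most_common(1)[0][0] if right else 1
    return lambda x: left_label if x <= threshold else right_label

def forest_predict(stumps, x):
    """Majority vote over the ensemble."""
    return Counter(s(x) for s in stumps).most_common(1)[0][0]

rng = random.Random(0)
# noisy 1-D data: class 0 clusters near 0, class 1 near 10
points = [(rng.gauss(0, 2), 0) for _ in range(50)] + \
         [(rng.gauss(10, 2), 1) for _ in range(50)]
stumps = [train_stump(points, rng) for _ in range(25)]

assert forest_predict(stumps, -1.0) == 0
assert forest_predict(stumps, 11.0) == 1
```

No individual stump is trustworthy, but the randomness that makes them disagree is exactly what makes the majority vote stable.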
Decision support. An analogous idea to random forests can be applied to human decision making. Daniel Kahneman employs randomization to improve human decisions; the HBR article “Noise: How to Overcome the High, Hidden Cost of Inconsistent Decision Making” describes this system. The key observation is that if you ask people to make the same decision multiple times, they will inevitably arrive at different answers. People’s attention differs at different times, and people use different methods to decide. This variability persists even in experienced decision makers. It also helps us better appreciate the utility of the Wisdom of the Crowds. To form a wise crowd, the following criteria are essential: diversity of opinion, independence, and decentralized knowledge. Ultimately, the greater the diversity of knowledge, the more robust the aggregate decision will be. This example highlights the value of disparate viewpoints and isn’t really randomness per se. However, it hints at the society-of-mind approach to cognition, and this is likely the same general mechanism employed by deep learning.
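The averaging effect behind the Wisdom of the Crowds is easy to simulate (the numbers below are illustrative, not from any study):

```python
import random
import statistics

rng = random.Random(1)
true_value = 100.0

# Each judge gives an independent, noisy estimate of the true value
estimates = [true_value + rng.gauss(0, 20) for _ in range(200)]

crowd_error = abs(statistics.mean(estimates) - true_value)
typical_individual_error = statistics.mean(
    abs(e - true_value) for e in estimates)

# Independent errors cancel: the aggregate beats the typical judge
assert crowd_error < typical_individual_error
```

The cancellation depends on the independence and diversity criteria above: if every judge shared the same bias, averaging would average the bias right back in.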
All of the above are macro-scale examples showing the benefits of leveraging randomness to achieve greater robustness. The key takeaway is that the use of randomness has its benefits. The common strategy of ensuring correctness by reducing randomness is flawed: it is, in fact, the presence of randomness that leads to more robust solutions.
Recent papers in deep learning bring the effectiveness of randomization into focus. Ben Recht has written about how random search can be as effective as more complex reinforcement learning strategies. He writes:
Random search with a few minor tweaks outperforms all other methods on these MuJoCo tasks and is significantly faster.
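Pure random search, the simplest version of the family Recht describes, fits in a dozen lines. The quadratic objective below is a toy stand-in for a policy's return (the actual experiments used MuJoCo locomotion tasks and random perturbations of linear policies):

```python
import random

def objective(params):
    """Toy stand-in for a policy's return, peaking at (3, -2)."""
    x, y = params
    return -((x - 3) ** 2 + (y + 2) ** 2)

rng = random.Random(0)
best_params, best_score = None, float("-inf")

# Pure random search: sample parameters uniformly, keep the best
for _ in range(5000):
    candidate = (rng.uniform(-10, 10), rng.uniform(-10, 10))
    score = objective(candidate)
    if score > best_score:
        best_params, best_score = candidate, score

# after 5000 samples the best candidate is close to the optimum
assert abs(best_params[0] - 3) < 1.0 and abs(best_params[1] + 2) < 1.0
```

There is no gradient, no value function, and no exploration schedule; the method's entire sophistication is a random number generator, which is precisely Recht's point.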
Then there is the paper “Gradient Descent Provably Optimizes Over-parameterized Neural Networks”, which shows that to achieve linear convergence one only needs to begin with an over-parameterized network and use random initialization. Over-parameterization has the effect of keeping every weight close to its random initialization throughout training. This allows the system to exploit a “strong convexity-like property” that yields linear convergence toward a global optimum.
There is also the paper “Rethinking ImageNet Pre-training”, which demonstrates that pre-training is no better than random initialization. To make things even more confusing, the paper “Approximating CNNs with Bag-of-local-Features models works surprisingly well on ImageNet” achieves competitive results with a simple “bag of local features”.
Finally, there is a recent counterintuitive result from “Analyzing and Improving Representations with the Soft Nearest Neighbor Loss” (Nicholas Frosst, Nicolas Papernot, Geoffrey Hinton): by increasing the entanglement of the representations of different classes in the hidden layers of a network, the network becomes more robust to adversarial methods and also generalizes better. The authors offer a counterintuitive explanation:
Surprisingly, we find that maximizing the entanglement of representations of different classes in the hidden layers is beneficial for discrimination in the final layer.
But what is entanglement? Entanglement and chaos are mechanisms that lead to randomness. From the perspective of Kolmogorov complexity, a string is perfectly random if it is incompressible: no generator of the string is shorter than the string itself. A generator is less compressible when every component of it is causally dependent on everything else.
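The incompressibility view can be illustrated with an off-the-shelf compressor as a crude proxy for Kolmogorov complexity:

```python
import random
import zlib

rng = random.Random(0)
structured = b"abcd" * 2500                                 # highly regular, 10 kB
random_bytes = bytes(rng.randrange(256) for _ in range(10000))  # 10 kB of noise

structured_ratio = len(zlib.compress(structured)) / len(structured)
random_ratio = len(zlib.compress(random_bytes)) / len(random_bytes)

# The regular string compresses dramatically; the random one barely at all
assert structured_ratio < 0.05
assert random_ratio > 0.95
```

The regular string has a short generator (“repeat abcd 2500 times”), so it compresses away; the pseudo-random bytes admit no description shorter than themselves that the compressor can find.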
To conclude, one should not dismiss randomness as a defect of a system. Rather, it is a key characteristic that leads to a more robust system. This may be difficult to accept given our educational and biological bias against randomness: the assumption that correct system functioning is achieved only by reducing randomness. That approach risks throwing the baby out with the bathwater. Randomness is an intrinsic feature of these systems and should be leveraged accordingly.
Note: I use the word randomness here, but I actually mean diversity. ;-) That is because there is no such thing as randomness.