Awesome Papers on Machine Learning, Physics, and Biology

Charles Fisher
Unlearn.AI
Apr 5, 2018 · 12 min read

Machine learning methods have dramatically improved in recent years. However, most of that improvement has been for problems in image processing, such as recognizing objects in images or drawing new images of faces. Images are nice because they are regular arrays of numbers and because there are lots of them. Most problems in science are not computer vision problems, so thinking about ways that machine learning can be applied in science requires a different set of background reading.

With that in mind, here is our company’s “Awesome List” of recommended background reading on the intersection of unsupervised learning and generative models, statistics, statistical physics, and biology.

Unsupervised Learning and Generative Models

Ackley, David H., Geoffrey E. Hinton, and Terrence J. Sejnowski. “A learning algorithm for Boltzmann machines.” Cognitive science 9.1 (1985): 147–169.

Abstract: The computational power of massively parallel networks of simple processing elements resides in the communication bandwidth provided by the hardware connections between elements. These connections can allow a significant fraction of the knowledge of the system to be applied to an instance of a problem in a very short time. One kind of computation for which massively parallel networks appear to be well suited is large constraint satisfaction searches, but to use the connections efficiently two conditions must be met: First, a search technique that is suitable for parallel networks must be found. Second, there must be some way of choosing internal representations which allow the preexisting hardware connections to be used efficiently for encoding the constraints in the domain being searched. We describe a general parallel search method, based on statistical mechanics, and we show how it leads to a general learning rule for modifying the connection strengths so as to incorporate knowledge about a task domain in an efficient way. We describe some simple examples in which the learning algorithm creates internal representations that are demonstrably the most efficient way of using the preexisting connectivity structure.
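
The learning rule the abstract alludes to has a famously compact form. For a Boltzmann machine with binary units s_i and weights w_ij, the maximum-likelihood gradient is the difference between correlations measured on the data and correlations produced by the model (a sketch of the standard result, written in LaTeX, not a quotation from the paper):

\Delta w_{ij} \propto \langle s_i s_j \rangle_{\text{data}} - \langle s_i s_j \rangle_{\text{model}}

Estimating the second term requires sampling from the model, which is what makes general Boltzmann machines hard to train and motivates the restricted architectures and approximations in the papers below.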

**Hinton, Geoffrey E., and Ruslan R. Salakhutdinov. “Reducing the dimensionality of data with neural networks.” science 313.5786 (2006): 504–507.

Abstract: High-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors. Gradient descent can be used for fine-tuning the weights in such “autoencoder” networks, but this works well only if the initial weights are close to a good solution. We describe an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data.

Comment: The paper that kicked off the deep learning revolution.
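
Hinton and Salakhutdinov pre-trained their autoencoders with stacks of RBMs before fine-tuning; with modern optimizers, a plain deep autoencoder trained end to end already illustrates the idea. A minimal PyTorch sketch (layer sizes and learning rate are illustrative, not the paper's):

```python
import torch
from torch import nn

# Encoder squeezes 784-dimensional inputs down to a 2-dimensional code;
# the decoder mirrors it to reconstruct the input.
autoencoder = nn.Sequential(
    nn.Linear(784, 256), nn.ReLU(),
    nn.Linear(256, 64), nn.ReLU(),
    nn.Linear(64, 2),                      # the low-dimensional "code" layer
    nn.Linear(2, 64), nn.ReLU(),
    nn.Linear(64, 256), nn.ReLU(),
    nn.Linear(256, 784),
)
optimizer = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(x):
    """One reconstruction step; x has shape (batch_size, 784)."""
    optimizer.zero_grad()
    loss = loss_fn(autoencoder(x), x)      # reconstruction error
    loss.backward()
    optimizer.step()
    return loss.item()
```

The activations of the 2-unit middle layer are the learned low-dimensional codes, playing the role that the top two principal components play in PCA.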

Salakhutdinov, Ruslan, and Geoffrey E. Hinton. “Deep Boltzmann Machines.” Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics. 2009.

Abstract: We present a new learning algorithm for Boltzmann machines that contain many layers of hidden variables. Data-dependent expectations are estimated using a variational approximation that tends to focus on a single mode, and data-independent expectations are approximated using persistent Markov chains. The use of two quite different techniques for estimating the two types of expectation that enter into the gradient of the log-likelihood makes it practical to learn Boltzmann machines with multiple hidden layers and millions of parameters. The learning can be made more efficient by using a layer-by-layer “pre-training” phase that allows variational inference to be initialized with a single bottom-up pass. We present results on the MNIST and NORB datasets showing that deep Boltzmann machines learn good generative models and perform well on handwritten digit and visual object recognition tasks.

*Hinton, Geoffrey E. “A practical guide to training restricted Boltzmann machines.” Neural networks: Tricks of the trade. Springer, Berlin, Heidelberg, 2012. 599–619.

Abstract: Restricted Boltzmann machines (RBMs) have been used as generative models of many different types of data. RBMs are usually trained using the contrastive divergence learning procedure. This requires a certain amount of practical experience to decide how to set the values of numerical meta-parameters. Over the last few years, the machine learning group at the University of Toronto has acquired considerable expertise at training RBMs and this guide is an attempt to share this expertise with other machine learning researchers.

Comment: Step-by-step recipes for old-new school machine learning.
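
Most of the guide is about choosing meta-parameters, but the core update it assumes is one step of contrastive divergence. A minimal numpy sketch of CD-1 for a binary RBM (variable names, batching, and the learning rate are my own, not Hinton's):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b, c, lr=0.01):
    """One CD-1 step. v0: (n_batch, n_visible) binary data;
    W: (n_visible, n_hidden) weights; b, c: visible and hidden biases."""
    ph0 = sigmoid(v0 @ W + c)                     # p(h = 1 | data)
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T + b)                   # one-step reconstruction
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)                     # p(h = 1 | reconstruction)
    n = v0.shape[0]
    W += lr * (v0.T @ ph0 - v1.T @ ph1) / n       # positive minus negative phase
    b += lr * (v0 - v1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
    return W, b, c
```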

**Goodfellow, Ian, et al. “Generative adversarial nets.” Advances in neural information processing systems. 2014.

Abstract: We propose a new framework for estimating generative models via adversarial nets, in which we simultaneously train two models: a generative model G that captures the data distribution, and a discriminative model D that estimates the probability that a sample came from the training data rather than G. The training procedure for G is to maximize the probability of D making a mistake. This framework corresponds to a minimax two-player game. In the space of arbitrary functions G and D, a unique solution exists, with G recovering the training data distribution and D equal to 1/2 everywhere. In the case where G and D are defined by multilayer perceptrons, the entire system can be trained with backpropagation. There is no need for any Markov chains or unrolled approximate inference networks during either training or generation of samples. Experiments demonstrate the potential of the framework through qualitative and quantitative evaluation of the generated samples.

Comment: Sometimes, maximizing the likelihood of the data is a bad way to train generative models. The original GAN paper is a brilliant idea for how to avoid doing so.
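
A minimal PyTorch sketch of the two-player game, using the "non-saturating" generator loss that the paper suggests as a practical alternative to the pure minimax objective (network sizes and data dimensions are placeholders):

```python
import torch
from torch import nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 2))   # noise -> sample
D = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()

def train_step(real):
    """real: a (batch, 2) tensor of samples from the data distribution."""
    z = torch.randn(real.shape[0], 16)
    fake = G(z)

    # Discriminator: push D(real) toward 1 and D(fake) toward 0.
    opt_d.zero_grad()
    d_loss = (bce(D(real), torch.ones(real.shape[0], 1))
              + bce(D(fake.detach()), torch.zeros(real.shape[0], 1)))
    d_loss.backward()
    opt_d.step()

    # Generator: try to make the discriminator call the fakes real.
    opt_g.zero_grad()
    g_loss = bce(D(fake), torch.ones(real.shape[0], 1))
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```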

**Mehta, Pankaj, et al. “A high-bias, low-variance introduction to Machine Learning for physicists.”

Abstract: Machine Learning (ML) is one of the most exciting and dynamic areas of modern research and application. The purpose of this review is to provide an introduction to the core concepts and tools of machine learning in a manner easily understood and intuitive to physicists. The review begins by covering fundamental concepts in ML and modern statistics such as the bias-variance tradeoff, overfitting, regularization, and generalization before moving on to more advanced topics in both supervised and unsupervised learning. Topics covered in the review include ensemble models, deep learning and neural networks, clustering and data visualization, energy-based models (including MaxEnt models and Restricted Boltzmann Machines), and variational methods. Throughout, we emphasize the many natural connections between ML and statistical physics. A notable aspect of the review is the use of Python notebooks to introduce modern ML/statistical packages to readers using physics-inspired datasets (the Ising Model and Monte-Carlo simulations of supersymmetric decays of proton-proton collisions). We conclude with an extended outlook discussing possible uses of machine learning for furthering our understanding of the physical world as well as open problems in ML where physicists may be able to contribute. (Notebooks are available at this https URL).

Comment: A comprehensive review of all areas of machine learning with example code. Disclosures: 1) I am one of the coauthors, 2) it is very long.

Statistics

Geyer, Charles J. “Markov chain Monte Carlo maximum likelihood.” (1991).

Abstract: Markov chain Monte Carlo (e.g., the Metropolis algorithm and Gibbs sampler) is a general tool for simulation of complex stochastic processes useful in many types of statistical inference. The basics of Markov chain Monte Carlo are reviewed, including choice of algorithms and variance estimation, and some new methods are introduced. The use of Markov chain Monte Carlo for maximum likelihood estimation is explained, and its performance is compared with maximum pseudolikelihood estimation.
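
As a refresher on the basic machinery, here is a minimal numpy sketch of random-walk Metropolis sampling from an unnormalized density (the target and step size are placeholders, not anything from Geyer's paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(x):
    # Unnormalized log-density; a standard normal as a placeholder.
    return -0.5 * x**2

def metropolis(n_samples, step=1.0, x0=0.0):
    """Propose x' = x + noise and accept with probability min(1, p(x') / p(x))."""
    samples = np.empty(n_samples)
    x, logp = x0, log_target(x0)
    for i in range(n_samples):
        x_new = x + step * rng.standard_normal()
        logp_new = log_target(x_new)
        if np.log(rng.random()) < logp_new - logp:   # accept the proposal
            x, logp = x_new, logp_new
        samples[i] = x                               # otherwise keep the old state
    return samples
```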

*Biau, David Jean, Brigitte M. Jolles, and Raphaël Porcher. “P value and the theory of hypothesis testing: An explanation for new researchers.” Clinical Orthopaedics and Related Research® 468.3 (2010): 885–892.

Abstract: In the 1920s, Ronald Fisher developed the theory behind the p value and Jerzy Neyman and Egon Pearson developed the theory of hypothesis testing. These distinct theories have provided researchers important quantitative tools to confirm or refute their hypotheses. The p value is the probability to obtain an effect equal to or more extreme than the one observed presuming the null hypothesis of no effect is true; it gives researchers a measure of the strength of evidence against the null hypothesis. As commonly used, investigators will select a threshold p value below which they will reject the null hypothesis. The theory of hypothesis testing allows researchers to reject a null hypothesis in favor of an alternative hypothesis of some effect. As commonly used, investigators choose Type I error (rejecting the null hypothesis when it is true) and Type II error (accepting the null hypothesis when it is false) levels and determine some critical region. If the test statistic falls into that critical region, the null hypothesis is rejected in favor of the alternative hypothesis. Despite similarities between the two, the p value and the theory of hypothesis testing are different theories that often are misunderstood and confused, leading researchers to improper conclusions. Perhaps the most common misconception is to consider the p value as the probability that the null hypothesis is true rather than the probability of obtaining the difference observed, or one that is more extreme, considering the null is true. Another concern is the risk that an important proportion of statistically significant results are falsely significant. Researchers should have a minimum understanding of these two theories so that they are better able to plan, conduct, interpret, and report scientific experiments.

Comment: Classical hypothesis testing remains the dominant framework for analysis in many areas of science. It’s important to understand how it is supposed to work.
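
For a concrete sense of the mechanics, a small scipy sketch: a two-sample t-test returns a p-value, the probability of a test statistic at least this extreme if the null hypothesis of equal means were true (the data below are simulated placeholders):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
control = rng.normal(loc=0.0, scale=1.0, size=50)
treated = rng.normal(loc=0.5, scale=1.0, size=50)   # true effect of 0.5

t_stat, p_value = stats.ttest_ind(treated, control)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# Rejecting the null whenever p < 0.05 controls the Type I error rate at 5%;
# it does not give the probability that the null hypothesis is true.
```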

Gelman, Andrew, Jennifer Hill, and Masanao Yajima. “Why we (usually) don’t have to worry about multiple comparisons.” Journal of Research on Educational Effectiveness 5.2 (2012): 189–211.

Abstract: Applied researchers often find themselves making statistical inferences in settings that would seem to require multiple comparisons adjustments. We challenge the Type I error paradigm that underlies these corrections. Moreover we posit that the problem of multiple comparisons can disappear entirely when viewed from a hierarchical Bayesian perspective. We propose building multilevel models in the settings where multiple comparisons arise. Multilevel models perform partial pooling (shifting estimates toward each other), whereas classical procedures typically keep the centers of intervals stationary, adjusting for multiple comparisons by making the intervals wider (or, equivalently, adjusting the p values corresponding to intervals of fixed width). Thus, multilevel models address the multiple comparisons problem and also yield more efficient estimates, especially in settings with low group-level variation, which is where multiple comparisons are a particular concern.
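
The key mechanism is partial pooling: each group estimate is shrunk toward the overall mean in proportion to how noisy it is. A minimal numpy sketch of that shrinkage (a simplified, empirical-Bayes stand-in for the full multilevel models the paper advocates, with made-up variance components):

```python
import numpy as np

def partial_pool(group_means, group_ns, sigma2, tau2):
    """Shrink observed group means toward the grand mean.

    sigma2: within-group variance; tau2: between-group variance.
    The shrinkage factor approaches 1 (no pooling) for large, precise groups
    and 0 (complete pooling) for small, noisy ones."""
    group_means = np.asarray(group_means, dtype=float)
    group_ns = np.asarray(group_ns, dtype=float)
    grand_mean = np.average(group_means, weights=group_ns)
    shrink = tau2 / (tau2 + sigma2 / group_ns)
    return grand_mean + shrink * (group_means - grand_mean)

print(partial_pool([2.0, -1.0, 0.3], group_ns=[5, 5, 200], sigma2=1.0, tau2=0.1))
# The two small groups are pulled strongly toward the grand mean;
# the large group barely moves.
```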

*Gelman, Andrew, and John Carlin. “Beyond power calculations: assessing type S (sign) and type M (magnitude) errors.” Perspectives on Psychological Science 9.6 (2014): 641–651.

Abstract: Statistical power analysis provides the conventional approach to assess error rates when designing a research study. However, power analysis is flawed in that a narrow emphasis on statistical significance is placed as the primary focus of study design. In noisy, small-sample settings, statistically significant results can often be misleading. To help researchers address this problem in the context of their own studies, we recommend design calculations in which (a) the probability of an estimate being in the wrong direction (Type S [sign] error) and (b) the factor by which the magnitude of an effect might be overestimated (Type M [magnitude] error or exaggeration ratio) are estimated. We illustrate with examples from recent published research and discuss the largest challenge in a design calculation: coming up with reasonable estimates of plausible effect sizes based on external information.

Comment: Classical hypothesis testing remains the dominant framework for analysis in many areas of science. It’s important to understand why it doesn’t work very well.
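
The paper's "design calculation" is easy to reproduce by simulation: assume a plausible true effect and standard error, simulate many estimates, keep only the statistically significant ones, and ask how often they have the wrong sign (Type S) and by how much they exaggerate the true effect (Type M). A numpy sketch with illustrative numbers (my own simplification, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

def design_errors(true_effect, se, n_sims=100_000, z_crit=1.96):
    """Simulate Type S and Type M errors for a study with the given design."""
    estimates = rng.normal(true_effect, se, size=n_sims)
    significant = estimates[np.abs(estimates) > z_crit * se]
    type_s = np.mean(np.sign(significant) != np.sign(true_effect))
    type_m = np.mean(np.abs(significant)) / abs(true_effect)   # exaggeration ratio
    return type_s, type_m

# A small true effect measured with a noisy instrument:
print(design_errors(true_effect=0.1, se=0.5))
# A sizeable fraction of "significant" estimates have the wrong sign,
# and on average they overstate the effect many times over.
```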

Statistical Physics

Bryngelson, Joseph D., and Peter G. Wolynes. “Spin glasses and the statistical mechanics of protein folding.” Proceedings of the National Academy of Sciences 84.21 (1987): 7524–7528.

Abstract: The theory of spin glasses was used to study a simple model of protein folding. The phase diagram of the model was calculated, and the results of dynamics calculations are briefly reported. The relation of these results to folding experiments, the relation of these hypotheses to previous protein folding theories, and the implication of these hypotheses for protein folding prediction schemes are discussed.

*Roudi, Yasser, Erik Aurell, and John A. Hertz. “Statistical physics of pairwise probability models.” Frontiers in computational neuroscience 3 (2009): 22.

Abstract: Statistical models for describing the probability distribution over the states of biological systems are commonly used for dimensional reduction. Among these models, pairwise models are very attractive in part because they can be fit using a reasonable amount of data: knowledge of the mean values and correlations between pairs of elements in the system is sufficient. Not surprisingly, then, using pairwise models for studying neural data has been the focus of many studies in recent years. In this paper, we describe how tools from statistical physics can be employed for studying and using pairwise models. We build on our previous work on the subject and study the relation between different methods for fitting these models and evaluating their quality. In particular, using data from simulated cortical networks we study how the quality of various approximate methods for inferring the parameters in a pairwise model depends on the time bin chosen for binning the data. We also study the effect of the size of the time bin on the model quality itself, again using simulated data. We show that using finer time bins increases the quality of the pairwise model. We offer new ways of deriving the expressions reported in our previous work for assessing the quality of pairwise models.

Comment: This is a great overview of some models at the intersection of physics and machine learning.
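
One of the approximate inference schemes the review discusses is the naive mean-field inversion, which reads the pairwise couplings off the inverse of the connected correlation matrix. A numpy sketch (assuming ±1 spins and data already binned into a samples-by-units matrix; a simplified rendering, not the authors' code):

```python
import numpy as np

def naive_mean_field_ising(spins):
    """Infer pairwise couplings J and fields h from +/-1 data.

    spins: array of shape (n_samples, n_units).
    Naive mean-field approximation: J is minus the inverse of the
    connected correlation matrix, with the diagonal zeroed out."""
    m = spins.mean(axis=0)               # magnetizations <s_i>
    C = np.cov(spins, rowvar=False)      # connected correlations
    J = -np.linalg.inv(C)
    np.fill_diagonal(J, 0.0)             # no self-couplings
    h = np.arctanh(m) - J @ m            # mean-field estimate of the fields
    return J, h
```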

Biology

Oshlack, Alicia, Mark D. Robinson, and Matthew D. Young. “From RNA-seq reads to differential expression results.” Genome biology 11.12 (2010): 220.

Abstract: Many methods and tools are available for preprocessing high-throughput RNA sequencing data and detecting differential expression.

Leek, Jeffrey T., et al. “Tackling the widespread and critical impact of batch effects in high-throughput data.” Nature Reviews Genetics 11.10 (2010): 733.

Abstract: High-throughput technologies are widely used, for example to assay genetic variants, gene and protein expression, and epigenetic modifications. One often overlooked complication with such studies is batch effects, which occur because measurements are affected by laboratory conditions, reagent lots and personnel differences. This becomes a major problem when batch effects are correlated with an outcome of interest and lead to incorrect conclusions. Using both published studies and our own analyses, we argue that batch effects (as well as other technical and biological artefacts) are widespread and critical to address. We review experimental and computational approaches for doing so.

Tkačik, Gašper, and Aleksandra M. Walczak. “Information transmission in genetic regulatory networks: a review.” Journal of Physics: Condensed Matter 23.15 (2011): 153102.

Abstract: Genetic regulatory networks enable cells to respond to changes in internal and external conditions by dynamically coordinating their gene expression profiles. Our ability to make quantitative measurements in these biochemical circuits has deepened our understanding of what kinds of computations genetic regulatory networks can perform, and with what reliability. These advances have motivated researchers to look for connections between the architecture and function of genetic regulatory networks. Transmitting information between a network’s inputs and outputs has been proposed as one such possible measure of function, relevant in certain biological contexts. Here we summarize recent developments in the application of information theory to gene regulatory networks. We first review basic concepts in information theory necessary for understanding recent work. We then discuss the functional complexity of gene regulation, which arises from the molecular nature of the regulatory interactions. We end by reviewing some experiments that support the view that genetic networks responsible for early development of multicellular organisms might be maximizing transmitted ‘positional information’.

*Lang, Alex H., et al. “Epigenetic landscapes explain partially reprogrammed cells and identify key reprogramming genes.” PLoS computational biology 10.8 (2014): e1003734.

Abstract: A common metaphor for describing development is a rugged “epigenetic landscape” where cell fates are represented as attracting valleys resulting from a complex regulatory network. Here, we introduce a framework for explicitly constructing epigenetic landscapes that combines genomic data with techniques from spin-glass physics. Each cell fate is a dynamic attractor, yet cells can change fate in response to external signals. Our model suggests that partially reprogrammed cells are a natural consequence of high-dimensional landscapes, and predicts that partially reprogrammed cells should be hybrids that co-express genes from multiple cell fates. We verify this prediction by reanalyzing existing datasets. Our model reproduces known reprogramming protocols and identifies candidate transcription factors for reprogramming to novel cell fates, suggesting epigenetic landscapes are a powerful paradigm for understanding cellular identity.

Comment: Illustrates how concepts from statistical physics and machine learning can be used to understand genetic regulation.

*Lovell, David, et al. “Proportionality: a valid alternative to correlation for relative data.” PLoS computational biology 11.3 (2015): e1004075.

Abstract: In the life sciences, many measurement methods yield only the relative abundances of different components in a sample. With such relative — or compositional — data, differential expression needs careful interpretation, and correlation — a statistical workhorse for analyzing pairwise relationships — is an inappropriate measure of association. Using yeast gene expression data we show how correlation can be misleading and present proportionality as a valid alternative for relative data. We show how the strength of proportionality between two variables can be meaningfully and interpretably described by a new statistic ϕ which can be used instead of correlation as the basis of familiar analyses and visualisation methods, including co-expression networks and clustered heatmaps. While the main aim of this study is to present proportionality as a means to analyse relative data, it also raises intriguing questions about the molecular mechanisms underlying the proportional regulation of a range of yeast genes.

Comment: Lots of experimental data in biology are actually relative fractions (e.g., relative species abundances from microbiota studies, relative transcript abundances from RNA sequencing). The implications are not widely appreciated.
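
The idea behind the statistic is easy to sketch: for two components measured across samples, compare the variance of their log-ratio with the variance of one component's log values; a value near zero means the pair moves in nearly constant proportion. A minimal numpy rendering (my own simplification; see the paper for the exact definition of ϕ and its refinements):

```python
import numpy as np

def phi(x, y):
    """Proportionality between two positive relative abundances:
    var(log(x/y)) / var(log(x)). Small values = strongly proportional."""
    lx, ly = np.log(x), np.log(y)
    return np.var(lx - ly) / np.var(lx)

# Two transcripts whose relative abundances track each other closely:
x = np.array([0.10, 0.20, 0.15, 0.30])
y = np.array([0.11, 0.19, 0.16, 0.29])
print(phi(x, y))   # close to 0, i.e. nearly proportional
```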
