Statistics

The 8 Most Important Statistical Ideas of the Past 50 Years

Explained in simple terms.

Benedict Neo
bitgrit Data Science Publication
9 min read · Jan 19, 2023



In this article, we will highlight some of the most important statistical ideas that have emerged over the last half-century and explain how they have helped to shape the field of statistics as we know it today.

From the development of new statistical tools and methods to their application across many fields, these eight ideas have profoundly shaped our ability to make sense of complex data sets and draw meaningful insights from them.

Whether you are a professional statistician, a student of the field, or someone interested in the world of data, this article will provide simple explanations for some of the most important statistical ideas of the past 50 years.

This article was inspired by the paper “What are the most important statistical ideas of the past 50 years?” by Andrew Gelman and Aki Vehtari.

Read till the end for code samples for some of these ideas.

Code for this article → Deepnote

1. Counterfactual causal inference

Counterfactual causal inference is a method for making causal inferences from observational data.

In other words, it is a way of determining the effect of an intervention or treatment on an outcome without the need for a randomized controlled trial.

This is done by comparing the observed outcome with the outcome that would have been observed if the intervention had not been applied.

For example, suppose we are interested in the effect of a new drug on blood pressure. We might compare the blood pressure of a group who took the drug with the blood pressure of a similar group of people who did not. We can estimate the drug's effect on blood pressure by comparing these two groups.
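To make this concrete, here is a minimal sketch of the comparison described above, using simulated (made-up) blood-pressure numbers. With observational data, the difference in group means only has a causal interpretation under additional assumptions, such as the two groups being comparable (no confounding).

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated blood pressure readings (hypothetical numbers, for illustration only):
# one group took the drug, a similar group did not.
control = rng.normal(loc=140, scale=10, size=500)  # blood pressure without the drug
treated = rng.normal(loc=132, scale=10, size=500)  # blood pressure with the drug

# A simple estimate of the drug's effect: the difference in group means.
# The control group stands in for the unobserved counterfactual outcome
# of the treated group.
effect_estimate = treated.mean() - control.mean()
print(f"Estimated effect of the drug on blood pressure: {effect_estimate:.2f}")
```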

Counterfactual causal inference has been developed in various fields, including econometrics, epidemiology, and psychology.

It has allowed researchers to state more precisely the assumptions under which causal inferences can be made and has stimulated the development of new statistical methods for addressing these problems.

2. Bootstrapping & Simulation-based inference

Bootstrapping and simulation-based inference are statistical methods used to make inferences about a population based on a sample.

Bootstrapping is a non-parametric method of resampling that involves drawing random samples from the original data with replacement. This is done to estimate the sampling distribution of a statistic, such as the mean or standard deviation, and to construct confidence intervals for the statistic.

Simulation-based inference is a broader approach that uses simulations to make inferences about a dataset or model. This can involve resampling or creating replicated datasets from a model and is often used when conventional analytical methods are not applicable or when the data is complex. Some examples include permutation testing, parametric bootstrapping, and simulation-based calibration.

Both bootstrapping and simulation-based inference rely on computational methods and the availability of sufficient computational resources.

The increase in computational power in recent decades has made these methods more widely applicable and has allowed for more complex and accurate inferences to be made.

3. Over-parameterized Models & Regularization

Overparameterized models are models that have a large number of parameters, sometimes more parameters than data points.

These models are typically fit using some form of regularization, which prevents overfitting by imposing constraints on the model parameters. Regularization can be implemented as a penalty function on the model parameters or the predicted curve and can help to ensure that the model does not overfit the data by limiting the flexibility of the model.

Examples of overparameterized models include splines, Gaussian processes, classification and regression trees, neural networks, and support vector machines. These models have the advantage of being able to capture complex patterns in the data but can sometimes be prone to overfitting if not regularized properly.

In recent years, the development of powerful computational resources has made it possible to fit and regularize these models more effectively, leading to their widespread use for deep learning, such as image recognition.

Researchers have also developed methods for tuning, adapting, and combining inferences from multiple fits of overparameterized models. These methods include stacking, Bayesian model averaging, boosting, gradient boosting, and random forests, and they can help to improve the accuracy and robustness of predictions from these models.

Overall, overparameterized models and regularization are powerful tools for making predictions and understanding complex datasets and have become increasingly important in the field of statistics and data science.

4. Multilevel Models

Multilevel modeling is a statistical method used to analyze hierarchical data, where observations are grouped into higher-level units.

This approach allows for modeling both within- and between-group variation and can be useful for analyzing data where the units of observation are nested within one another.

One example is a multilevel model used to predict student test scores from factors such as class size, teacher experience, and school resources.

In this example, the data is structured into different groups, such as schools and classrooms. A multilevel model could be used to predict student test scores for each classroom while considering factors such as the class size and the teacher's experience.

This would allow the model to make more accurate predictions by adapting to the specific characteristics of each classroom. The model could also be used to make inferences about the factors influencing student test scores, such as the relationship between class size and test scores.
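As a rough illustration, here is a minimal sketch of a random-intercept multilevel model fit with statsmodels on simulated school data. The variable names and numbers are invented for illustration only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)

# Simulated data (hypothetical): 20 schools with 30 students each.
n_schools, n_students = 20, 30
school = np.repeat(np.arange(n_schools), n_students)
school_effect = rng.normal(0, 5, n_schools)[school]     # between-school variation
class_size = rng.integers(15, 35, size=school.size)
teacher_exp = rng.integers(1, 20, size=school.size)
test_score = (70 - 0.3 * class_size + 0.5 * teacher_exp
              + school_effect + rng.normal(0, 8, size=school.size))

df = pd.DataFrame({"test_score": test_score, "class_size": class_size,
                   "teacher_exp": teacher_exp, "school": school})

# Random-intercept multilevel model: fixed effects for class size and
# teacher experience, plus a random intercept for each school.
model = smf.mixedlm("test_score ~ class_size + teacher_exp", df, groups=df["school"])
result = model.fit()
print(result.summary())
```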

Multilevel models have been used in areas such as animal breeding, psychology, pharmacology, and survey sampling and have been given a mathematical structure and justification by several researchers.

Today, multilevel models are widely used in statistics and data science and are a valuable framework for combining different sources of information and making inferences from structured data.

5. Generic Computation Algorithms

Generic computation algorithms are mathematical tools that can solve statistical problems and make inferences from data.

Some examples of these algorithms include the EM algorithm, the Gibbs sampler, and variational inference. These algorithms use the conditional independence structures of statistical models to make computations more efficient.

One of the main benefits of these algorithms is that they allow for the development of complex statistical models without requiring significant changes to the underlying computation. This means that researchers and analysts can focus on developing the models themselves rather than worrying about the details of how the calculations will be performed.

For example, the EM algorithm finds maximum likelihood estimates of a model's parameters by iteratively updating the estimates based on the data until convergence. This allows researchers to quickly and efficiently find the parameter values that best fit the data without worrying about the underlying computational details.
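As an illustration, here is a minimal sketch of the EM algorithm fitting a two-component Gaussian mixture to simulated data; the data and starting values are made up.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data from two Gaussians (the mixture parameters are unknown to EM).
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

# Initial guesses for the mixture weight, means, and standard deviations.
pi, mu1, mu2, sd1, sd2 = 0.5, -1.0, 1.0, 1.0, 1.0

def normal_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

for _ in range(100):
    # E-step: responsibility of component 1 for each data point.
    p1 = pi * normal_pdf(x, mu1, sd1)
    p2 = (1 - pi) * normal_pdf(x, mu2, sd2)
    r = p1 / (p1 + p2)

    # M-step: update the parameters using the responsibilities as weights.
    pi = r.mean()
    mu1, mu2 = np.average(x, weights=r), np.average(x, weights=1 - r)
    sd1 = np.sqrt(np.average((x - mu1) ** 2, weights=r))
    sd2 = np.sqrt(np.average((x - mu2) ** 2, weights=1 - r))

print(f"weight={pi:.2f}, means=({mu1:.2f}, {mu2:.2f}), sds=({sd1:.2f}, {sd2:.2f})")
```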

Overall, advances in generic computation algorithms have played a crucial role in enabling the development of complex statistical models. They have also enabled extracting useful insights from large and complex data sets.

6. Adaptive Decision Analysis

Adaptive decision analysis is a field of study that focuses on making decisions in complex, uncertain environments. It draws on tools from statistics, decision theory, and psychology to help people and organizations make more effective decisions.

One important development in adaptive decision analysis is Bayesian optimization, a method for finding the best solution to a problem by using Bayesian statistics to update beliefs about candidate solutions based on the results of previous evaluations. This can be useful in many situations, such as deciding which product to develop or which marketing campaign to run.

Another important development in adaptive decision analysis is reinforcement learning, a type of machine learning that trains an algorithm to make decisions by rewarding good decisions and penalizing bad ones. This can be used, for example, to train a computer to play a game like chess or Go by letting it practice against itself and learn from its mistakes.
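As a toy illustration of learning from rewards, here is a minimal epsilon-greedy bandit sketch that chooses between three hypothetical marketing campaigns with made-up conversion rates; it is a simplified stand-in for the reinforcement learning methods described above.

```python
import numpy as np

rng = np.random.default_rng(7)

true_rates = [0.05, 0.08, 0.12]   # unknown to the algorithm (made-up numbers)
counts = np.zeros(3)              # how often each campaign was tried
successes = np.zeros(3)           # conversions observed per campaign
epsilon = 0.1                     # probability of exploring a random campaign

for _ in range(5000):
    # Explore with probability epsilon (or until every campaign has been tried),
    # otherwise exploit the campaign with the best estimated conversion rate.
    if rng.random() < epsilon or counts.min() == 0:
        arm = rng.integers(3)
    else:
        arm = int(np.argmax(successes / counts))

    reward = rng.random() < true_rates[arm]   # simulated customer response
    counts[arm] += 1
    successes[arm] += reward

print("Estimated conversion rates:", np.round(successes / counts, 3))
print("Times each campaign was chosen:", counts.astype(int))
```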

Overall, adaptive decision analysis is a rapidly growing field that is helping people and organizations make better decisions in a wide range of contexts.

7. Robust Inference

Robust inference is a statistical approach that focuses on making inferences from data that is not well-behaved or may violate the assumptions of the statistical model being used.

This approach is based on the idea that statistical models can still be useful even when their assumptions are not perfectly satisfied, as long as they are designed to be robust to a wide range of possible violations of those assumptions.

One example of robust inference is using robust standard errors in regression analysis. This technique adjusts the standard errors of regression coefficients to account for the possibility that the regression model's assumptions may not be perfectly satisfied. This can provide more accurate inferences about the relationships between variables in the data.
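As a rough sketch, here is how heteroskedasticity-robust standard errors can be computed with statsmodels on simulated data where the constant-variance assumption of ordinary least squares is violated; the data are made up for illustration.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)

# Simulated data with heteroskedastic noise: the error variance grows with x.
x = rng.uniform(0, 10, 200)
y = 2 + 0.5 * x + rng.normal(0, 0.5 + 0.3 * x, 200)

X = sm.add_constant(x)
ols = sm.OLS(y, X).fit()                   # classical standard errors
robust = sm.OLS(y, X).fit(cov_type="HC3")  # heteroskedasticity-robust standard errors

print("Classical standard errors:", np.round(ols.bse, 3))
print("Robust standard errors:   ", np.round(robust.bse, 3))
```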

Another example of robust inference is partial identification, a method for making inferences about parameters in a statistical model when those parameters cannot be fully identified from the data. This is often used in economics, where the data-generating process may not be fully known or understood.

Overall, robust inference is an important concept in modern statistics, as it allows researchers to develop and use statistical models and methods that are not overly sensitive to assumptions that may not be perfectly satisfied in practice.

8. Exploratory Data Analysis

Exploratory data analysis is a statistical approach that emphasizes graphical techniques for understanding and summarizing data, rather than relying solely on mathematical equations and formal statistical tests.

This approach emphasizes the importance of open-ended exploration and communication and is often used to discover patterns and trends in data that may not be immediately apparent.

One of the key ideas behind exploratory data analysis is using graphical techniques to visualize data. These techniques can include histograms, scatterplots, and boxplots and are often used to quickly identify trends and patterns in data. For example, a histogram can be used to visualize the distribution of a quantitative variable, while a scatterplot can show the relationship between two quantitative variables.
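For example, here is a short matplotlib sketch, on simulated data, of the two plots just mentioned: a histogram of one quantitative variable and a scatterplot of two related variables.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Simulated data for illustration.
x = rng.normal(50, 10, 1000)
y = 0.8 * x + rng.normal(0, 5, 1000)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Histogram: the distribution of a single quantitative variable.
ax1.hist(x, bins=30, edgecolor="black")
ax1.set_title("Distribution of x")

# Scatterplot: the relationship between two quantitative variables.
ax2.scatter(x, y, alpha=0.3)
ax2.set_title("Relationship between x and y")

plt.tight_layout()
plt.show()
```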

Another important aspect of exploratory data analysis is the use of statistical graphics, which are visual displays of data that use the structure of the data itself to help the reader understand the data. These graphics can quickly and effectively communicate complex patterns and trends in data and are often used to help researchers and analysts make sense of large and complex data sets.

Overall, exploratory data analysis is an important approach to data analysis that emphasizes the importance of graphical visualization and open-ended exploration in understanding and summarizing data.

What’s Next?

The paper also discusses what the authors expect to be the important statistical ideas of the coming decades. Their safe bet is that these will be combinations of the existing methods above:

  • causal inference with rich models for potential outcomes estimated using regularization
  • complex models for structured data, such as evolving networks
  • robust inference for multilevel models
  • exploratory data analysis for overparameterized models
  • subsetting and machine-learning meta-algorithms
  • interpretable machine learning
  • validation of inferential methods: applying ideas akin to unit testing to problems of learning from noisy data

Code Samples

The notebook has more detailed explanations of the code below.

Bootstrapping

Below we generate a random data set, compute the mean of the data, and then use bootstrapping to estimate the uncertainty of the mean.
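The original code lives in the notebook; the following is a minimal sketch along the same lines, using made-up normally distributed data.

```python
import numpy as np

rng = np.random.default_rng(42)

# Generate a random data set and compute its mean.
data = rng.normal(loc=10, scale=3, size=200)
sample_mean = data.mean()

# Bootstrap: resample the data with replacement many times and recompute
# the mean each time to estimate its sampling distribution.
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(10_000)
])

# The spread of the bootstrap means estimates the standard error of the mean,
# and the percentiles give an approximate 95% confidence interval.
ci_low, ci_high = np.percentile(boot_means, [2.5, 97.5])
print(f"Sample mean: {sample_mean:.2f}")
print(f"Bootstrap standard error: {boot_means.std():.3f}")
print(f"95% confidence interval: ({ci_low:.2f}, {ci_high:.2f})")
```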

Overparameterization + Regularization

Regularization can be used in over-parameterized models in PyTorch by adding a regularization term to the loss function. This regularization term penalizes large parameter values, which helps to prevent overfitting.

Here is an example of how to use regularization in a PyTorch model:

We first define a simple neural network architecture with two fully connected layers, with a rectified-linear activation function in between.

We create the model, using the Adam optimizer and cross-entropy loss as the loss function.

We create two empty lists to store the losses.

First, we train the model without regularization.

Then with L1 regularization.

We plot the loss values.
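The full code is in the notebook; below is a minimal sketch that follows the steps above, using a small synthetic classification dataset and made-up hyperparameters.

```python
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

torch.manual_seed(0)

# Synthetic classification data (made up for illustration).
X = torch.randn(500, 20)
y = torch.randint(0, 2, (500,))

# Two fully connected layers with a ReLU activation in between.
def make_model():
    return nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))

criterion = nn.CrossEntropyLoss()

def train(l1_lambda=0.0, epochs=100):
    model = make_model()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    losses = []  # store the loss at each epoch
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = criterion(model(X), y)
        # L1 regularization: penalize the absolute values of the parameters.
        if l1_lambda > 0:
            l1_penalty = sum(p.abs().sum() for p in model.parameters())
            loss = loss + l1_lambda * l1_penalty
        loss.backward()
        optimizer.step()
        losses.append(loss.item())
    return losses

# Train without regularization, then with L1 regularization.
losses_plain = train(l1_lambda=0.0)
losses_l1 = train(l1_lambda=1e-3)

# Plot the loss values for both runs.
plt.plot(losses_plain, label="no regularization")
plt.plot(losses_l1, label="L1 regularization")
plt.xlabel("epoch")
plt.ylabel("loss")
plt.legend()
plt.show()
```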

Robust Inference

Here is an example of using Huber regression to fit a robust linear model to the data:
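The original code is in the notebook; the sketch below uses scikit-learn's HuberRegressor and LinearRegression on simulated data with injected outliers.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, HuberRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(0)

# Simulated linear data with a handful of large outliers in the training set.
X = rng.uniform(0, 10, size=(200, 1))
y = 3 * X.ravel() + 5 + rng.normal(0, 1, 200)
y[:10] += 50  # inject outliers

lr = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)

# Evaluate both models on clean data drawn from the same underlying line.
X_test = rng.uniform(0, 10, size=(200, 1))
y_test = 3 * X_test.ravel() + 5 + rng.normal(0, 1, 200)

for name, model in [("Linear Regression", lr), ("Huber Regression", huber)]:
    pred = model.predict(X_test)
    print(f"{name}: MAE={mean_absolute_error(y_test, pred):.3f}, "
          f"MSE={mean_squared_error(y_test, pred):.3f}")
```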

After training a simple linear regression (LR) model and a robust LR model on the same data, we can compare the error metrics for the two.

We see that MAE and MSE are lower for Huber Regression since it uses the Huber loss function, which is more resistant to the effects of outliers.

Head to the notebook for a bonus example of multilevel models with Python!

That’s all for this article! I hope you found this article interesting and discovered some new statistical concepts. Dive deeper into these ideas in the paper.

Check out the Statistical Modeling, Causal Inference, and Social Science blog for interesting pieces by Andrew Gelman.


Be sure to follow the bitgrit Data Science Publication to keep updated!

Want to discuss the latest developments in Data Science and AI with other data scientists? Join our discord server!

Follow Bitgrit below to stay updated on workshops and upcoming competitions!

Discord | Website | Twitter | LinkedIn | Instagram | Facebook | YouTube
