Simulation and modeling: What is a regression line?

Hansol Rheem
Published in Human Systems Data
Feb 27, 2017

Inferential statistics such as regression analysis have an advantage over exploratory statistics in that they do not require data from the whole population and are therefore more cost-effective. However, they are also vulnerable to sampling bias. Because inferential statistics analyze samples of a population, it is crucial for analysts to select samples that represent that population well. Otherwise, we end up with wrong ideas about the population we are making inferences about. Linear regression is an inferential technique that makes inferences about the relationship between variables in a population. Specifically, regression analysis quantifies how, on average, a change in one variable is associated with a change in the other; reading that relationship as causal requires support from the study design, not just from the fitted line.

From what we have learned so far, we know that the regression line we get from a linear regression analysis is a visual representation of the function fitted to the data. However, this line does more than visualize an estimated function and its predictions. The regression line can also be viewed as a simulated dataset: a dataset generated by the function you estimated through linear regression. When you perform a regression analysis, therefore, you are not just estimating a function and making predictions with it. You are modeling the population of interest and simulating what happens to a sample that follows your function.

To illustrate this view, let’s assume a regression function of “y=x” (an intercept of 0 and a coefficient of 1). If you change the value of x by a very small step (e.g., from x to x+0.00001), y changes by the same amount (to y+0.00001). Repeat this process for the set of Xs you want to try out and record every X and its Y. If you plot all of these Xs and Ys, you will draw the diagonal line we know as the regression line. This repeated process is called simulation, and it serves many roles, including response prediction.

Now, let’s take a look at figure 1 from our reading and reinterpret the plot using this view. In figure 1, we see a red line, which represents the true function of the population, and many bluish lines, which represent functions estimated in multiple separate regression analyses. When we want to test these functions to see whether they are close to the population function, we give them the same set of Xs and see what happens to the overall shape of the plot (that is, to the values of Ys). In the case of figure 1, most of the simulation results are close to the true function, which suggests that these models can be used for other simulation purposes.
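The y=x iteration described above can be sketched in a few lines of Python. This is only an illustration: the function name simulate_line, the range of Xs (0 to 1), and the use of NumPy are assumptions made here, not something taken from the reading.

```python
import numpy as np

def simulate_line(intercept, slope, x_values):
    """Return the Y that the function generates for each X (hypothetical helper)."""
    x = np.asarray(x_values, dtype=float)
    return intercept + slope * x

# Assumed set of Xs: 0 to 1 in steps of 0.00001, i.e. 100,001 values.
x_grid = np.linspace(0.0, 1.0, 100_001)

# The example function y = x has an intercept of 0 and a coefficient of 1.
y_sim = simulate_line(intercept=0.0, slope=1.0, x_values=x_grid)

# Each row is one simulated (X, Y) record; plotting the rows traces the diagonal line.
simulated = np.column_stack([x_grid, y_sim])
```

Swapping in a different intercept and slope, such as the ones estimated from your own data, would produce the simulated dataset for that regression line instead.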

Figure 1. Regression graph showing predictions and the simulation results
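A plot like this could be reproduced roughly as follows. The true function (y = 2 + 3x), the noise level, the sample size, and the number of repeated analyses are all assumptions chosen for illustration; the reading's actual settings may differ. The sketch fits a line to each random sample with NumPy's polyfit, feeds every fitted function the same set of Xs, and measures how far each simulated line strays from the true one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed population function and noise level (illustrative values only).
TRUE_INTERCEPT, TRUE_SLOPE = 2.0, 3.0
NOISE_SD = 1.0
N_ANALYSES, SAMPLE_SIZE = 10, 100

def draw_sample(n):
    """Draw one random sample of size n from the assumed population."""
    x = rng.uniform(-2.0, 2.0, size=n)
    y = TRUE_INTERCEPT + TRUE_SLOPE * x + rng.normal(0.0, NOISE_SD, size=n)
    return x, y

# The same set of Xs is given to every function (the "red" and "blue" lines).
x_grid = np.linspace(-2.0, 2.0, 50)
true_y = TRUE_INTERCEPT + TRUE_SLOPE * x_grid

fits = []
for _ in range(N_ANALYSES):
    x, y = draw_sample(SAMPLE_SIZE)
    slope, intercept = np.polyfit(x, y, deg=1)  # least-squares straight line
    fits.append((intercept, slope))

# One row of simulated Ys per fitted function.
simulated_y = np.array([b0 + b1 * x_grid for b0, b1 in fits])

# Largest vertical distance between each fitted line and the true line.
max_gap = np.abs(simulated_y - true_y).max(axis=1)
```

Small values in max_gap correspond to the bluish lines that hug the red line in the figure; a large value would flag a fitted function that should not be trusted for further simulation.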

Simulation serves many roles other than prediction. One of its most frequent uses is to provide datasets from hypothetical situations. Simulation can provide datasets of interest that cannot be collected because of ethical or safety issues. For example, data on firefighters’ cognitive ability in risky environments are hard to obtain if we have to measure them in real situations. However, they can be obtained easily with the help of accurate modeling and simulation. Moreover, simulation gives us a cheap and fast way to obtain large datasets when there are many conditions to compare.

In one of my previous studies, I asked pairs of participants to play the “Chicken game,” in which each participant steered one of two cars moving toward each other. On each trial, a participant could choose either to swerve and avoid crashing into the opponent’s car or to rush toward it. The reward depended on the behavior of both participants. When one participant avoided and the other rushed, the participant who rushed earned 1000 points, whereas the participant who avoided received no points. When both participants rushed and the cars crashed, both lost 2000 points. Under this point system, four distinct patterns were observed: the “switching” pattern, in which the participants took turns avoiding and rushing to maximize their overall points; the “unfair” pattern, in which one participant always rushed and the other always avoided; and the “mutual avoidance” and “mutual rush” patterns, in which both participants avoided or both rushed throughout the experiment.

Categorizing each participant pair into one of these four patterns was challenging because no pair showed a single consistent pattern throughout the 100 trials it played. For example, how would you categorize a pair that showed the switching pattern in 50 trials, the unfair pattern in 40 trials, and the mutual avoidance pattern in 10 trials? Can we call it a switching pair just because half of its trials showed the switching pattern? My approach to this challenge was to create simulated data for a million participant pairs and then use that dataset to determine the thresholds for categorizing participant pairs. If I had run these experiments myself to obtain the same amount of data, it would have taken about 27 years of my life, even at 100 experiments per day. Instead, it took only about a second to create the large dataset by simulation.
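To give a feel for what such a simulation might look like, here is a minimal sketch. It is not the procedure used in the study, which is not described here. It assumes that every simulated participant rushes or avoids at random with equal probability on each trial, classifies each trial only as mutual rush, mutual avoidance, or one-sided, and takes the 95th percentile of each count as a hypothetical threshold. It also uses 100,000 pairs rather than a million to keep memory use modest.

```python
import numpy as np

rng = np.random.default_rng(42)

N_PAIRS = 100_000   # assumption: smaller than the one million pairs in the post
N_TRIALS = 100      # trials per pair, as in the experiment

# Assumption: each player rushes (True) or avoids (False) with probability 0.5.
rush_a = rng.integers(0, 2, size=(N_PAIRS, N_TRIALS), dtype=np.int8).astype(bool)
rush_b = rng.integers(0, 2, size=(N_PAIRS, N_TRIALS), dtype=np.int8).astype(bool)

# Count, for every simulated pair, how many trials fall into each outcome.
counts = {
    "mutual_rush": (rush_a & rush_b).sum(axis=1),
    "mutual_avoidance": (~rush_a & ~rush_b).sum(axis=1),
    "one_sided": (rush_a ^ rush_b).sum(axis=1),  # exactly one player rushed
}

# Hypothetical rule: a real pair "shows" an outcome pattern if its count
# exceeds what 95% of randomly playing pairs produce.
thresholds = {name: float(np.percentile(c, 95)) for name, c in counts.items()}

# The time estimate from the text: one million pairs at 100 experiments per day.
years_needed = 1_000_000 / 100 / 365
```

Telling the switching pattern apart from the unfair pattern would additionally require looking at who rushed on consecutive one-sided trials, which this sketch leaves out. The last line reproduces the time estimate: a million experiments at 100 per day is 10,000 days, or a little over 27 years.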

Thus, modeling and simulation offer an effective way to conduct research. They can save you time and money. They even allow you to study risky or rare situations. In principle, there is little limit to what you can do with modeling and simulation as long as you can express the model of interest mathematically. The beauty of linear regression analysis is that it is one of the simplest, yet still powerful, techniques for simulation and modeling (Shallahamer, 2007). Thanks to its simplicity, it is now widely used in marketing, logistics, and many other areas. However, there are also occasions when simple regression cannot capture the true function of a complex population (James et al., 2013). Therefore, one must not treat it as an all-in-one tool for modeling and simulation, or one risks oversimplifying the data or the model.

References

James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An introduction to statistical learning (Vol. 6). New York: Springer.

Shallahamer, C. (2007). Linear regression modeling. In Forecasting Oracle performance (pp. 199–228).
