Limitations of the Multiple Regression Model

Hansol Rheem
Published in Human Systems Data
Mar 28, 2017

Can we see the forest for the trees? When examining a phenomenon with multiple causes, does it help us understand the phenomenon if we look at the causes one at a time? Or is it better to see the overall pattern that the causes create together? For this week’s blog post, I wanted to explore methods and issues in visualizing the multiple regression model in comparison to the simple regression model. As you are aware, the simple linear regression model is a method of mapping a presumed causal relationship between a predictor (the cause of a phenomenon) and a response. It can be read and interpreted easily.

Figure 1. An example of the simple linear regression model

Take figure 1 as an example. We can infer that the x-axis represents advertising dollars (the predictor) and the y-axis represents sales (the response). We can also infer that the person who created this plot was interested in evaluating the causal relationship between sales and advertising dollars. The interpretation of this plot is simple: sales change as a function of advertising dollars.
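To make the idea behind figure 1 concrete, here is a minimal Python sketch of fitting a straight line by ordinary least squares. It is not the code behind the figure: the numbers are made up for illustration, and the names `fit_simple_regression`, `ad_dollars`, and `sales` are my own assumptions.

```python
def fit_simple_regression(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up illustration data (not from figure 1): sales = 1 + 2 * ad_dollars.
ad_dollars = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

intercept, slope = fit_simple_regression(ad_dollars, sales)
print(f"sales = {intercept:.2f} + {slope:.2f} * ad_dollars")
```

With one predictor, the whole model is summarized by two numbers, which is why the plot is so easy to read.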

How about the multiple regression model? Is it as easy to read and interpret? The multiple regression model allows us to examine the causal relationship between a response and multiple predictors. Let’s look at the plot I created for this week’s blog assignment (see figure 2). It uses the dataset from our class project, in which Pouria, Vipin, and I are trying to find non-medical predictors of type 2 diabetes (T2D). The body mass index (BMI) is basically a weight-to-height ratio (703 × weight / height², with weight in pounds and height in inches), and a person with a high BMI value is considered obese. The pedigree variable reflects the family medical history of diabetes, and a higher pedigree value indicates that one has a more direct family member who was diagnosed with diabetes. The last predictor variable is age. Finally, the response variable is the diabetes variable, which has two levels: T2D patients are coded as 1 (red dots), and healthy people are coded as 0 (black dots). So, what is our verdict now?
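As a quick illustration of the BMI formula mentioned above, here is a small Python sketch. The function name `bmi_imperial` is my own, and the factor 703 assumes that weight is in pounds and height is in inches; the example person is made up and does not come from the class dataset.

```python
def bmi_imperial(weight_lb, height_in):
    """Body mass index from weight in pounds and height in inches.

    The constant 703 converts lb/in^2 to the usual kg/m^2 scale.
    """
    if height_in <= 0:
        raise ValueError("height must be positive")
    return 703 * weight_lb / height_in ** 2


# Hypothetical person: 150 lb, 65 in tall.
example_bmi = bmi_imperial(150, 65)
print(round(example_bmi, 1))
```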

Figure 2. 3D Scatter plot depicting the linear relationship between the variables BMI, age, pedigree, and diabetes

Fortunately, this plot (figure 2) was not difficult to interpret. The red dots (diabetes patients) are mostly observed at the upper right side of the plot. Considering that all variables increase in value toward the upper right, we can easily infer that people are more likely to be diagnosed with T2D if they are older, have a higher BMI, and have a direct relative diagnosed with diabetes. This conclusion is supported by the linear relationship between pedigree, BMI, and age, which is represented as a grid surface in the middle of figure 2. Note that this relationship is represented as a 2-dimensional plane, unlike the 1-dimensional line of the simple regression model. This is because the multiple regression model considers multiple predictors, whereas the simple regression model considers only one predictor.
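To show why two predictors give a plane instead of a line, here is a minimal Python sketch that fits y = b0 + b1·x1 + b2·x2 by least squares using the normal equations. It is not the R code behind figure 2. The data are synthetic, and the helper names (`solve_linear_system`, `fit_plane`) and the generic `response` variable are assumptions of mine.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Swap in the row with the largest pivot for numerical stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_plane(x1, x2, y):
    """Least-squares fit of y = b0 + b1*x1 + b2*x2 via (X'X) b = X'y."""
    rows = [[1.0, u, v] for u, v in zip(x1, x2)]
    k = 3
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic data lying exactly on the plane 0.5 + 0.1*bmi + 0.02*age.
bmi = [20.0, 25.0, 30.0, 35.0, 22.0, 28.0]
age = [30.0, 45.0, 50.0, 60.0, 70.0, 25.0]
response = [0.5 + 0.1 * b + 0.02 * a for b, a in zip(bmi, age)]

b0, b1, b2 = fit_plane(bmi, age, response)
print(f"response = {b0:.3f} + {b1:.3f} * bmi + {b2:.3f} * age")
```

Each additional predictor adds one coefficient and one dimension to the fitted surface, which is exactly what makes the plot harder to draw on a flat page.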

Again, we were fortunate to observe a clear data pattern this time. However, the multiple regression model does not always work out like this. Moreover, figure 2 has a critical problem: it depicts 4D data on a 2D plane, which makes the plot harder to read. For example, see the black dot inside the purple circle in figure 2. How old do you think that person is? 70? 80? The answer is… 91. To avoid this problem, one can create an interactive plot like the following one.

But what if we are to report the results of the multiple regression model in a paper journal? As in the example of figure 2, we might be able to put every result on a single 2D plane. Nevertheless, this would only cause confusion and make interpretation harder, which is a mistake that Gelman and Unwin (2013) were worried about. The conventional approach would be to break the results down into subsets of variables. For example, we could include three plots with two variables each, instead of only one plot with three variables. In this way, readers will get a better understanding of the data. Moreover, breaking down the multiple regression model seems to conform to the way regression analysis identifies significant predictors. For instance, model selection procedures in multiple regression examine subsets of predictors to find the combination that best predicts the response. For these reasons, I believe that breaking the multiple regression model down into its subsets is the most appropriate way to plot it at this point (at least for paper journals).
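The breakdown described above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not a plotting script: the variable names `predictors`, `panels`, and `candidate_models` are my own, and no model is actually fitted here.

```python
from itertools import combinations

# The three predictors discussed in this post.
predictors = ["BMI", "age", "pedigree"]

# Three 2-variable panels replace the single 3-variable plot.
panels = list(combinations(predictors, 2))
for x_name, y_name in panels:
    print(f"Panel: {x_name} vs. {y_name}, points colored by diabetes status")

# Every non-empty subset of predictors that a subset-based
# model selection procedure could consider.
candidate_models = [
    subset
    for size in range(1, len(predictors) + 1)
    for subset in combinations(predictors, size)
]
print(f"{len(candidate_models)} candidate predictor subsets")
```

With three predictors this stays manageable, but the number of subsets doubles with every predictor added, so the same strategy gets crowded quickly.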

However, this has to change eventually. We might be able to create plots that make the dataset’s details easier to understand, but only at the cost of understanding the overall data pattern (or the forest). In the previous blog post, I criticized Gelman and Unwin’s (2013) idea that data visualization should have more influence over the statistical side (people who are interested in finding effective and precise ways of representing data, even at the cost of audiences’ interest). It seems to me that the multiple regression model is an exception, because the current plots of the multiple regression model seem unable to communicate efficiently even to educated audiences. So, to answer my initial question of “Can we see the forest for the trees?”, I would say yes and no. Neither the tree nor the forest is superior, and we need each to understand the other.

P.S. I am also leaving the link to the R code I wrote for this week’s blog post, in case you would like to check it out.

Reference

Gelman, A., & Unwin, A. (2013). Infovis and statistical graphics: Different goals, different looks. Journal of Computational and Graphical Statistics, 22(1), 2–28.
