Exploratory data analysis (EDA) using the multivariable scatterplot smoother (MSS)

The call for doing plenty of exploratory data analysis (EDA) before turning to estimation and testing of variable associations has been made many times. EDA is important for two reasons in this regard: (1) it points us in the right direction in terms of how to model the associations, and (2) it suggests how to visualize the associations one finally chooses to report. In my 30-plus years as a data analyst and researcher, I have increasingly come to rely on the multivariable scatterplot smoother (MSS) as the device for the first steps of EDA. That is, almost every time I get access to a new data set, I use the MSS to get a first grip on the associations between the dependent variable, y, and the independent variables (the x-es) in the data. In this post, I am going to show you the benefits of using the MSS in EDA. As usual, I eschew equations, formulas, and abstract reasoning any way I can.

The setting: The determinants of the price of second-hand recreational vehicles (RVs)

I have a data set on 188 second-hand recreational vehicles or RVs, and the dependent variable, y, is the sales price of the RV. Figure 1 shows that the median sales price is 39,000 Euro. The mean is 41,000 Euro (not shown). Also, about 50 percent of the RVs lie in the price range from about 25,000 to 50,000 Euro. Finally, there are some outliers north of about 80,000 Euro.

Figure 1.

The key independent variables are the RVs’ age (number of years), mileage (km), weight (kg), and horsepower (count). We ask how these independent variables simultaneously are associated with the sales price of the RVs.

The multivariable scatterplot smoother: the bivariate case

For starters, let’s use the MSS to look at how the age variable alone is associated with the price variable. Figure 2 takes care of this. Two things are noteworthy. First, the data-driven blue line (i.e., the scatterplot smoother) suggests that there is an approximately linear relationship between the two variables. Second, there are some outliers (i.e., some very costly RVs) that might need special attention before proceeding to formal estimation.

Figure 2.
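
For readers who want to see the idea behind a scatterplot smoother in code, here is a minimal sketch in Python. It is not how Stata's mrunning works internally; it is a plain nearest-neighbour running mean, and the function name running_mean_smooth and the span parameter are my own assumptions for illustration. For each RV, the smoothed price is simply the average price of the RVs closest to it in age.

```python
def running_mean_smooth(x, y, span=0.5):
    """Smooth y against x with a nearest-neighbour running mean.

    span is the share of observations used in each local average
    (an illustrative parameter, not a setting taken from mrunning).
    """
    n = len(x)
    k = max(2, int(round(span * n)))  # number of neighbours per window
    smoothed = []
    for xi in x:
        # indices of the k observations closest to xi on the x-axis
        nearest = sorted(range(n), key=lambda j: abs(x[j] - xi))[:k]
        smoothed.append(sum(y[j] for j in nearest) / k)
    return smoothed


# Synthetic, made-up numbers (NOT the RV data): price falls with age.
age = list(range(1, 11))
price = [60 - 3 * a for a in age]
line = running_mean_smooth(age, price, span=0.3)
```

Plotting line against age on top of the raw points would give something like the blue line in Figure 2. A smaller span follows the data more closely, and a larger span gives a smoother line.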

So far, this exercise is only bivariate. Yet the benefit of the MSS is its ability to handle several independent variables simultaneously. That is, the MSS yields one smoothed line for each independent variable, adjusted for the other independent variables, much akin to how multiple regression analysis controls for the effect of other independent variables.
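
One common way to obtain such adjusted lines is backfitting: smooth y against one x after subtracting the current lines of all the other x-es, and cycle through the variables until the lines settle. The sketch below illustrates that general idea in Python. It is an assumption on my part that it resembles what an MSS does; it is not the mrunning code, and the names backfit, running_mean_smooth, span, and n_iter are hypothetical.

```python
def running_mean_smooth(x, y, span=0.5):
    """Nearest-neighbour running mean of y against x (illustrative smoother)."""
    n = len(x)
    k = max(2, int(round(span * n)))
    out = []
    for xi in x:
        nearest = sorted(range(n), key=lambda j: abs(x[j] - xi))[:k]
        out.append(sum(y[j] for j in nearest) / k)
    return out


def backfit(columns, y, span=0.5, n_iter=20):
    """Fit one smoothed curve per predictor, each adjusted for the others.

    columns maps a predictor name to its list of values.
    Returns the intercept (mean of y) and a dict of centred curves.
    """
    n = len(y)
    intercept = sum(y) / n
    curves = {name: [0.0] * n for name in columns}
    for _ in range(n_iter):
        for name, x in columns.items():
            # partial residuals: y minus the intercept and all OTHER curves
            partial = [
                y[i] - intercept
                - sum(curves[other][i] for other in columns if other != name)
                for i in range(n)
            ]
            fitted = running_mean_smooth(x, partial, span)
            centre = sum(fitted) / n
            curves[name] = [f - centre for f in fitted]  # centre at zero
    return intercept, curves


# Synthetic grid of two made-up predictors (NOT the RV data).
x1 = [a for a in range(5) for b in range(5)]
x2 = [b for a in range(5) for b in range(5)]
y = [50 - 4 * a + 2 * b for a, b in zip(x1, x2)]
intercept, curves = backfit({"x1": x1, "x2": x2}, y, span=0.2, n_iter=5)
```

Each curve is centred at zero, so it shows how y deviates from its overall mean as that x changes, with the other x-es held in check. This is the same reading as for the panels of a multivariable smoother plot.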

The multivariable scatterplot smoother: the multivariate case

Figure 3 presents the multivariate scenario. The relationship between age and sales price is much as it was in Figure 2, whereas the relationship between mileage and sales price is slightly negative but with a “bump” for RVs with low mileage. There are also some peculiar non-linearities in the relationships between weight and sales price and between horsepower and sales price. Both need attention (and perhaps the removal of some outliers) before one can proceed to more formal estimation. There is perhaps also a multicollinearity problem regarding age and mileage, but let’s not get into that.

Figure 3.

The MSS can also incorporate all sorts of categorical variables. Figure 4 presents the “effects” of the RV having a rearview camera (coded 1) or not (coded 0) and of the RV being smoke free (coded 1) or not (coded 0). Both “effects” or lines are adjusted for the numerical independent variables in Figure 3, but I have omitted those from the figure for space reasons. In any event, you will probably have to pay extra for an RV with a rearview camera and for a smoke-free RV.

Figure 4.
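
For a 0/1 variable there is nothing to smooth along, so in a backfitting-style procedure the adjusted “effect” boils down to the average partial residual within each category. The Python sketch below shows that step only. It rests on my assumption about how such a term can be handled, not on the mrunning source, and the function name group_mean_effect and the numbers are made up for illustration.

```python
def group_mean_effect(category, partial_residuals):
    """Return the mean partial residual for each level of a categorical variable."""
    totals, counts = {}, {}
    for level, r in zip(category, partial_residuals):
        totals[level] = totals.get(level, 0.0) + r
        counts[level] = counts.get(level, 0) + 1
    return {level: totals[level] / counts[level] for level in totals}


# Made-up residuals (NOT the RV data): price left over after removing
# the smoothed curves of the numerical variables.
camera = [0, 1, 0, 1]
leftover_price = [1.0, 5.0, 3.0, 7.0]
effects = group_mean_effect(camera, leftover_price)
```

The difference between the two group means is the adjusted price gap between RVs with and without the feature, which is what the two levels in a panel like those in Figure 4 convey.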

Takeaways

Exploratory data analysis (EDA) is a fundamental and necessary first step in all serious data analysis. In this regard, and in my experience, the MSS is an invaluable tool.

Acknowledgement

I have used the Stata-command “mrunning” to obtain the MSS-results in this post. This command is an add-on to official Stata, and it was developed by Patrick Royston and Nicholas J. Cox. I would be surprised if there is no equivalent MSS-command in R, LIMDEP, or SAS.

About me

I’m Christer Thrane, a sociologist and professor at Inland University College, Norway. I have written two textbooks on applied regression modeling and applied statistical modeling. Both are published by Routledge, and you find them here and here. I am on ResearchGate here, and you can also reach me at christer.thrane@inn.no
