# How to be a better used-car salesman

## An example application of Principal Component Analysis (PCA)

In the following post, I describe how Principal Component Analysis (PCA) works using a notional example of selling used cars. A fully documented, reproducible R script is also available on GitHub.

Imagine for a moment you are a used-car salesman working at a dealership that offers cars from several manufacturers. Your dealership has tons of options to choose from, but each customer is going to be looking for something specific to their taste, personality, and budget. Your livelihood depends on helping customers find the “right” car, but knowing the specific features of each car is a lot to ask. On the other hand, maybe it’s reasonable for you to understand the primary factors that differentiate one car brand from another.

Fortunately, Bob from the front office gave you a spreadsheet of data with ratings on how manufacturers performed across 6 categories (Note: All data used in this example is made up). It looks like useful data, but some of the categories are highly ambiguous and, depending on how they are interpreted, seem redundant. For example, what does the “For Family” rating really mean? Mothers and fathers might view “safety” as a key feature for a family car. Other families on a tight budget might view low-cost, “practical” purchases as being family friendly; or, perhaps, the customer is just looking for a roomy car with lots of seats, like a minivan, and views safety and practicality as completely separate categories.

**Correlation analysis** provides a means to explore relationships between the different ratings. For example, the correlation matrix below shows that car ratings in the “For Family” category have a moderate positive correlation with the safety ratings (0.505) and a strong positive correlation with the practicality ratings (0.903). In contrast, “For Family” ratings have a strong negative correlation with the luxury (-0.723), sporty (-0.961), and exciting (-0.869) ratings. Overall, the correlation matrix reveals we are dealing with highly correlated multivariate data.
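As a rough illustration of how such a correlation matrix can be computed, here is a minimal Python sketch using NumPy. The ratings below are invented for this snippet (three categories, five unnamed brands) and are not the spreadsheet used in this post; the function name `correlation_matrix` is also just a label chosen here.

```python
import numpy as np

def correlation_matrix(X):
    """Pearson correlation matrix of the columns of X (rows = observations)."""
    Xc = X - X.mean(axis=0)                 # center each column
    Xs = Xc / Xc.std(axis=0, ddof=1)        # scale to unit sample standard deviation
    return Xs.T @ Xs / (X.shape[0] - 1)     # averaged cross-products of standardized columns

# Hypothetical ratings for 5 made-up brands (NOT the post's data).
# Columns: family, practical, sporty
ratings = np.array([
    [9.0, 8.0, 2.0],
    [8.0, 9.0, 3.0],
    [5.0, 6.0, 5.0],
    [3.0, 3.0, 8.0],
    [2.0, 4.0, 9.0],
])

R = correlation_matrix(ratings)
print(np.round(R, 3))
```

The result should match `np.corrcoef(ratings, rowvar=False)`; the manual version is spelled out only to show what the calculation involves.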

This is where PCA comes in handy. PCA provides a way to re-express highly correlated multivariate data in uncorrelated components that capture independent pieces of information represented in the larger data set.

Principal components (PCs) are calculated as linear combinations of multivariate data. We determine the weights for the linear combinations and how many components to use by performing **eigenanalysis** of the correlation matrix. As a general rule of thumb, we should only use PCs with an eigenvalue greater than 1. In the eigenmatrix below, two eigenvalues are greater than 1. Hence, we will only use the eigenvectors in columns `U1` and `U2` to calculate the PCs.
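The eigenanalysis step and the eigenvalue-greater-than-1 rule can be sketched in Python as follows. The 3×3 correlation matrix is a made-up stand-in (the post's 6×6 matrix is not reproduced here), so the number of components it keeps will differ from the two retained in this example.

```python
import numpy as np

# Hypothetical correlation matrix for three ratings (family, practical, luxury).
# These values are assumptions for illustration, not the post's matrix.
R = np.array([
    [ 1.0,  0.9, -0.7],
    [ 0.9,  1.0, -0.6],
    [-0.7, -0.6,  1.0],
])

# eigh is meant for symmetric matrices and returns eigenvalues in ascending order.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]        # re-sort so the largest eigenvalue comes first
eigvals, U = eigvals[order], eigvecs[:, order]

keep = eigvals > 1.0                     # rule of thumb: retain eigenvalues greater than 1
U_kept = U[:, keep]                      # eigenvectors used to build the PCs

print("eigenvalues:", np.round(eigvals, 3))
print("components kept:", int(keep.sum()))
```

Because the variables are standardized, the eigenvalues sum to the number of variables, which is why an eigenvalue of 1 is a natural cutoff: a PC below it carries less variance than a single original variable.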

Another technique for determining how many PCs to retain is to examine a **scree plot** of the eigenvalues. A scree plot is a line plot that shows eigenvalues on the y-axis and PC numbers on the x-axis. The name comes from scree, the loose rock that piles up at the base of a cliff: the steep drop across the first few eigenvalues resembles the cliff face, and the flat tail resembles the rubble below it.

Typically, the number of PCs we retain should be “one less than the elbow” of the scree plot. Perhaps a more appropriate way to say this is, “find the base of the cliff, and retain the PCs that fall on the cliff face.” Regardless of which analogy we use, the scree plot of eigenvalues suggests we should retain 2 PCs, since the “elbow” or “base of the cliff” appears to be located at the 3rd PC.
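To get a feel for the cliff-and-rubble shape without a plotting library, a scree plot can be approximated with text bars. This is only a sketch: the eigenvalues are hypothetical (chosen to sum to 6), and `scree_rows` is a helper name invented here.

```python
# Hypothetical eigenvalues for six standardized variables (NOT the post's values).
eigvals = [4.2, 1.2, 0.3, 0.15, 0.1, 0.05]

def scree_rows(values, width=40):
    """Return one text bar per PC, scaled so the largest eigenvalue fills `width`."""
    top = max(values)
    rows = []
    for i, v in enumerate(values, start=1):
        bar = "#" * round(width * v / top)   # bar length proportional to the eigenvalue
        rows.append(f"PC{i} {bar} {v:.2f}")
    return rows

rows = scree_rows(eigvals)
print("\n".join(rows))
```

With values like these, the bars for the first two PCs form the "cliff face," and the bars from the 3rd PC onward flatten out into the rubble.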

We can also take a more precise approach and analyze the eigenmatrix, which describes how much of the total variance each PC captures. In this example, the table below indicates the first two PCs describe about 90.3% of the variation in the original 6-dimensional data; the remaining variation is treated as noise (e.g., measurement error).
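The proportion of variance explained is simply each eigenvalue divided by the sum of all eigenvalues. A minimal sketch, again with hypothetical eigenvalues rather than the ones in the table:

```python
import numpy as np

# Hypothetical eigenvalues for six standardized variables (they sum to 6).
# These are assumptions for illustration, not the post's eigenvalues.
eigvals = np.array([4.2, 1.2, 0.3, 0.15, 0.1, 0.05])

proportion = eigvals / eigvals.sum()     # share of total variance captured by each PC
cumulative = np.cumsum(proportion)       # running total across PCs

for i, (p, c) in enumerate(zip(proportion, cumulative), start=1):
    print(f"PC{i}: {p:.3f} of variance, {c:.3f} cumulative")
```

Reading down the cumulative column shows how many PCs are needed to reach a target share of the variance.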

Next, we reduce the dimensionality of the data by transforming the 6-dimensional data to 2-dimensional data. The transformation produces a matrix of PCs (**Z**), calculated by multiplying the original data (**X**, standardized because we analyzed the correlation matrix) by the matrix of retained eigenvectors (**U**): **Z** = **XU**.
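The projection can be sketched in Python as below. The data is simulated from two made-up latent factors plus noise, purely so the snippet runs on its own; it is not the ratings spreadsheet from this post, and the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 10 made-up brands driven by two latent factors plus noise.
cost, safety = rng.normal(size=(2, 10))
noise = 0.2 * rng.normal(size=(10, 6))
# Columns: family, safety, practical, luxury, sporty, exciting
X_raw = np.column_stack([-cost, safety, -cost, cost, cost, cost]) + noise

# Standardize each column, since the eigenanalysis is done on the correlation matrix.
X = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0, ddof=1)
R = X.T @ X / (X.shape[0] - 1)           # correlation matrix of the ratings

eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]        # largest eigenvalue first
eigvals, U = eigvals[order], eigvecs[:, order]

U2 = U[:, :2]                            # keep the first two eigenvectors
Z = X @ U2                               # 10 x 2 matrix of PC scores

# The covariance of Z should be diagonal, with the retained eigenvalues on the diagonal.
print(np.round(np.cov(Z, rowvar=False), 3))
```

The off-diagonal entries of the printed covariance matrix should be zero up to rounding, which is the "uncorrelated components" property in action.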

On the plus side, the transformed variables are now mutually uncorrelated and account for about 90% of the total variance: (4.3526 + 1.076)/6 ≈ 0.90. On the negative side, our old variable labels are no longer useful, and we have to determine how to interpret the new variables.

To determine new meanings for the PCs, it’s often helpful to examine the relationship between the PCs and the original variables. **Principal component loadings** refer to the correlations between the PCs and the original variables. We obtain the loadings (**F**) by multiplying each eigenvector *uᵢ* by the scalar *√λᵢ* (i.e., the standard deviation of the i-th PC). Squaring each element of the loading matrix gives the proportion of each original variable’s variance that is captured by each PC.
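A short Python sketch of the loading calculation, using a made-up 3×3 correlation matrix rather than the post's data:

```python
import numpy as np

# Hypothetical correlation matrix (family, practical, luxury); values are assumptions.
R = np.array([
    [ 1.0,  0.9, -0.7],
    [ 0.9,  1.0, -0.6],
    [-0.7, -0.6,  1.0],
])

eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]        # largest eigenvalue first
eigvals, U = eigvals[order], eigvecs[:, order]

F = U * np.sqrt(eigvals)                 # scale column i of U by sqrt(lambda_i)
explained = F ** 2                       # element-wise square: variance share per variable, per PC

print("loadings:\n", np.round(F, 3))
print("explained variance:\n", np.round(explained, 3))
```

Two sanity checks follow from the definitions: each row of `explained` sums to 1 when all PCs are included (every standardized variable has unit variance), and each column sums to the corresponding eigenvalue. Note that the sign of an eigenvector is arbitrary, so a whole column of loadings may come out flipped without changing the interpretation.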

Based on the table of loading values and the plot of the two PC loading values, the 1st PC appears to be a classification variable that captures vehicle cost. Safe, practical, family cars are usually cheap and appear on the left side of the loading plot, whereas exciting, luxurious, sporty cars are usually expensive and appear on the right.

The interpretation of the 2nd PC is not as clear in the loading matrix or plot. However, the table of explained variance indicates the 2nd PC captures 68.7% of the variance for the safety rating, 27.7% of the luxury rating, and almost none of the variance in the remaining rating categories. Hence, the 2nd PC appears to be related primarily to safety, which makes sense if we think about how safety relates to the other categories. Exciting, sporty cars that go fast are not going to be as safe as a car with lots of safety features (e.g., four-wheel drive). Likewise, practical, family cars will be less safe because additional safety features cost money.

Based on our PCA, we believe car options can be reduced to two primary considerations, not six: 1) how much does a car cost? and 2) how safe is it? To explore this theory, we can examine a **score plot** of PC values for the 10 manufacturers.

The score plot appears to confirm our intuition about how to interpret the PCs. Luxury brands are clustered together on the right side of the plot; affordable brands are clustered together on the left. It makes sense that Volvo plots far away from the other affordable options on the safety axis, because Volvo’s brand identity is based on its enduring reputation for being safe. Likewise, Lexus and Mercedes tend to have high safety ratings among luxury cars.

Now that we have sufficiently covered how PCA works, let’s discuss why our PCA results are useful. PCA makes explaining the differences in car options to a customer much simpler. For instance, the first question you should ask is:

**“Are you interested in looking at any luxury cars today?” or “What price range are you looking to target today?”**

If the customer says they have a tight budget, you probably should start by showing them a Volvo, Jeep, Ford, or Chrysler, or you might mention the year-long 0% APR sales event your dealership is offering and see if they want to check out an Infiniti or Saab. If they are dressed in a suit, wearing a Rolex, and look like they have money to spend, maybe you should start by asking:

**“Are you looking for something fun and exciting or something that handles well in bad weather?”**

If the customer says they want a car to impress people, you probably should start by showing them a BMW or Porsche. These are the types of questions salesmen ask when you visit a car dealership. When a salesman describes a car as “sporty”, “exciting”, or “luxurious”, the implication is you’re going to pay more for it. PCA sees through the semantics and identifies the primary factor that distinguishes car options: how much are you willing to spend?