Just got Chi-Squared!
One of these graphs is telling a false story! The question we are attempting to answer is, “Do marital status and/or gender affect the type of car an individual drives?”
This question reminds me of an impressive experience I had with an Adobe Analytics product and UX manager, whose goal was to verify whether I could understand and navigate a new feature of their product. I walked away wondering how they would quantify my response and apply their findings to a larger population. The intent of this article is to show one standard way to investigate a relationship between categorical variables. As always, the immediate audience for this article is myself.
The mosaic graph of marital status vs. type of car tells a story: the married segment drives more family cars than the single segment, and the single segment drives more sporty cars than the married segment. From my experience and observations, this makes sense! Similarly, the mosaic graph of gender vs. type of car tells another story. The female population tends to drive more family cars than the male population. In contrast, males tend to drive more sporty cars than females. From my experience and observations, this also makes sense!
How do we determine if a relationship exists if all we have are several nominal and ordinal measurements? How can we confirm if marital status and gender indeed have an effect on the type of car an individual drives?
This is where the chi-square test of independence comes in handy. A chi-square statistic is used to learn about the relationship between two qualitative variables: we can investigate whether the distributions of binomial or multinomial measures differ from one another. Responses to such questions as “What is your marital status?” or “What type of car do you own?” are categorical because they yield data such as “Single” or “Sporty”. In some cases we have to dummy code the responses.
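To make the idea of categorical responses concrete, here is a minimal sketch of how raw answers can be tallied into a contingency table. The responses below are made up for illustration and are not the data behind the graphs in this article; the helper name `contingency_table` is my own.

```python
from collections import Counter

# Hypothetical survey responses (invented for illustration only).
# Each tuple is (marital status, type of car).
responses = [
    ("Married", "Family"),
    ("Married", "Family"),
    ("Married", "Sporty"),
    ("Single", "Sporty"),
    ("Single", "Sporty"),
    ("Single", "Family"),
]

def contingency_table(pairs):
    """Count how often each (row category, column category) pair occurs."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    # Counter returns 0 for combinations that never occur.
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

rows, cols, table = contingency_table(responses)
print("columns:", cols)
for name, row in zip(rows, table):
    print(name, row)
```

Each cell of the table is simply the number of respondents who gave that combination of answers, which is the observed count the test works with.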
The chi-square statistic is based on the difference between the observed count in each cell and the count we would expect if the two variables were independent. Look at the deviations in the contingency tables.
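As a rough sketch of that calculation: the expected count for a cell is (row total × column total) / grand total, and the statistic sums (observed − expected)² / expected over all cells. The counts below are hypothetical, and the three car types (Family, Sporty, Work) are an assumption on my part, not the article's actual table.

```python
def expected_counts(observed):
    """Expected cell counts if the row and column variables were independent."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square_statistic(observed):
    """Sum of (observed - expected)^2 / expected over every cell."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )

# Hypothetical counts. Rows: Married, Single. Columns: Family, Sporty, Work.
observed = [
    [30, 10, 20],
    [15, 25, 20],
]

expected = expected_counts(observed)
statistic = chi_square_statistic(observed)
# Degrees of freedom: (number of rows - 1) * (number of columns - 1).
degrees_of_freedom = (len(observed) - 1) * (len(observed[0]) - 1)

print("expected:", expected)
print("chi-square:", round(statistic, 3), "df:", degrees_of_freedom)
```

The larger the gaps between observed and expected counts, the larger the statistic, and the harder it is to blame the pattern on chance.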
Marital Status and Type of Car:
Here we see the p-value is < 0.05. This indicates statistical significance: the difference between the expected and observed counts is too large to be explained by sampling error alone. Hence, we conclude that marital status is related to the type of car an individual drives.
Gender and Type of Car:
Here we see the p-value is > 0.05, indicating that the difference between the expected and observed counts is not large enough to rule out sampling error as the explanation. Hence, the data show no evidence that gender has an effect on the type of car an individual drives.
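The decision rule used for both tables can be sketched in a few lines. This assumes 2 degrees of freedom (a two-category variable crossed with three car types, which is my assumption about the tables), because in that special case the upper-tail probability of the chi-square distribution has the simple closed form exp(−x/2). The two statistics fed in below are hypothetical, not the values from this article.

```python
import math

ALPHA = 0.05  # the significance threshold used in this article

def p_value_df2(chi_square):
    """Upper-tail probability of a chi-square distribution with 2 degrees of freedom.

    Only valid for df = 2, where the survival function is exp(-x / 2).
    """
    return math.exp(-chi_square / 2)

def decide(chi_square, alpha=ALPHA):
    """Return the p-value and the verdict for a test of independence (df = 2)."""
    p = p_value_df2(chi_square)
    verdict = "reject independence" if p < alpha else "fail to reject independence"
    return p, verdict

# Hypothetical statistics: one large deviation, one small deviation.
for label, stat in [("large deviation", 11.43), ("small deviation", 1.20)]:
    p, verdict = decide(stat)
    print(f"{label}: chi-square={stat:.2f}, p={p:.4f} -> {verdict}")
```

For tables with other degrees of freedom, a statistics library is the safer route; SciPy's `scipy.stats.chi2_contingency`, for example, takes the observed table and returns the statistic, p-value, degrees of freedom, and expected counts.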
This test brings a bit more transparency to what “makes sense” in categorical data! From the chi-squared test we learned that even though the relationship between gender and car type seemed to make sense, the deviation is not big enough to confirm any relationship between these variables. This understanding of relationships between variables can help us avoid faulty conclusions and expensive managerial mistakes. That’s how I got chi-squared ;)