Analyzing Survey Results: Efficiently Performing Lots of Chi-Squared Tests

Kevin Stern
Apr 3, 2019 · 4 min read

Surveys provide a direct window into your members' feelings, thoughts, and opinions. Connecting survey results to data you already hold can unearth deeper insights about your member base.

One way to go about this integrated analysis is to look for relationships between existing member characteristics (e.g., age and gender) and the survey responses. For example, after sending out a survey, you may hope to answer questions like:

- Do older members think frequency of contact is more important?
- Are female members more likely to recommend the product to their friends?
- Are members who receive the product more frequently (but in smaller amounts) more satisfied with the quality of the product?

In Python, it is fairly straightforward to find the pairwise correlation between all columns in a data set using the pandas .corr() method; however, this measures the strength and direction of a linear relationship between continuous variables, which is often not suitable for categorical variables.

To test whether two categorical variables are significantly related, you can use a chi-squared test of independence. However, if there are many questions in the survey and you want to test them against many member characteristics, running each test by hand to find the strongest relationships quickly becomes tedious.
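As a minimal sketch of a single test, the snippet below cross-tabulates two binary columns with pandas and passes the table to SciPy's chi2_contingency. The column names IS_FEMALE and WILL_RECOMMEND come from the example later in this post; the twelve rows of data are made up purely for illustration.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical data: one binary member characteristic and one binary survey response.
df = pd.DataFrame({
    "IS_FEMALE":      [1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0],
    "WILL_RECOMMEND": [1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0],
})

# Cross-tabulate the two columns into a 2x2 contingency table of observed counts.
table = pd.crosstab(df["IS_FEMALE"], df["WILL_RECOMMEND"])

# chi2_contingency returns the test statistic, the p-value, the degrees of
# freedom, and the expected counts under independence.
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"p-value: {p_value:.3f}")
```

Note that chi2_contingency applies Yates' continuity correction to 2x2 tables by default; pass correction=False if you want the uncorrected statistic.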

Our primary goal is to efficiently perform chi-squared tests, and collect:
1. p-values
2. minimum observed frequency in each contingency table
3. odds ratio from the contingency table
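The three values listed above can all be read off one 2x2 contingency table. The helper below is a sketch, not the code we ran: summarize_table is a hypothetical name, the table layout (rows are the characteristic, columns are the response) is an assumption, and the counts in the usage example are invented.

```python
import numpy as np
from scipy.stats import chi2_contingency

def summarize_table(table):
    """Return (p_value, min_observed, odds_ratio) for a 2x2 table of counts.

    Assumed layout: rows are the member characteristic (0, then 1) and
    columns are the survey response (0, then 1).
    """
    table = np.asarray(table)
    # p-value from the chi-squared test of independence.
    _, p_value, _, _ = chi2_contingency(table)
    # Smallest observed count in any of the four cells.
    min_observed = int(table.min())
    # Odds of a positive response when the characteristic is 1,
    # divided by the odds of a positive response when it is 0.
    neg0, pos0 = table[0]  # characteristic == 0
    neg1, pos1 = table[1]  # characteristic == 1
    if neg1 * pos0 > 0:
        odds_ratio = float((pos1 * neg0) / (neg1 * pos0))
    else:
        odds_ratio = float("nan")  # undefined when a denominator cell is empty
    return p_value, min_observed, odds_ratio

# Hypothetical counts: 40 of 100 members without the characteristic responded
# negatively and 60 positively; with the characteristic, 25 and 75.
p_value, min_observed, odds_ratio = summarize_table([[40, 60], [25, 75]])
print(min_observed, odds_ratio)
```

With these counts the odds of a positive response are 75/25 = 3.0 for members with the characteristic and 60/40 = 1.5 for members without it, so the odds ratio is 2.0.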

The first step we took in preparing the survey results data was to remap positive responses to 1 and negative responses to 0. For example, if a member said that frequency of contact with your company was important, or that they were satisfied with the quality of the product, their response was mapped to a 1. For the purposes of this exercise, we mapped every response to either 1 or 0, but as the number of survey respondents increases, it may be worthwhile to map responses to ordinal values (e.g., ‘Strongly agree’ would be assigned a higher value than ‘Somewhat agree’).
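The remapping step might look something like the sketch below. The question names and answer labels are hypothetical stand-ins for the real survey, and the ordinal scale at the end is just one possible way to encode the alternative mentioned above.

```python
import pandas as pd

# Hypothetical raw responses; question names and answer labels are illustrative.
raw = pd.DataFrame({
    "CONTACT_FREQUENCY_IMPORTANT": ["Strongly agree", "Somewhat disagree", "Somewhat agree"],
    "SATISFIED_WITH_QUALITY": ["Somewhat agree", "Strongly disagree", "Strongly agree"],
})

# Positive responses become 1 and negative responses become 0.
# Any label missing from the dictionary becomes NaN rather than silently 0.
binary_map = {
    "Strongly agree": 1,
    "Somewhat agree": 1,
    "Somewhat disagree": 0,
    "Strongly disagree": 0,
}
binary = raw.apply(lambda col: col.map(binary_map))

# Alternative for larger samples: keep the ordering of the answers.
ordinal_map = {
    "Strongly disagree": 1,
    "Somewhat disagree": 2,
    "Somewhat agree": 3,
    "Strongly agree": 4,
}
ordinal = raw.apply(lambda col: col.map(ordinal_map))

print(binary)
```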

In this example, the survey data had already been augmented with member characteristics before it was imported into Python.

Next, we created three empty dataframes where the rows were all the different survey questions, and the columns were all the member characteristics that we wanted to test against. Each cell would contain the p-value, minimum observed frequency, or odds ratio for that specific question/member characteristic chi-squared test.
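A rough sketch of this step is shown below. The data is randomly generated so the snippet runs on its own; IS_OVER_45 and SATISFIED_WITH_QUALITY are hypothetical column names, and the nested loop is one straightforward way to fill the three dataframes, not necessarily the fastest.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical, randomly generated 0/1 data standing in for the real survey.
rng = np.random.default_rng(0)
questions = ["WILL_RECOMMEND", "SATISFIED_WITH_QUALITY"]
characteristics = ["IS_FEMALE", "IS_OVER_45"]
df = pd.DataFrame(
    rng.integers(0, 2, size=(400, 4)),
    columns=questions + characteristics,
)

# One empty dataframe per metric: rows are questions, columns are characteristics.
p_values = pd.DataFrame(index=questions, columns=characteristics, dtype=float)
min_counts = pd.DataFrame(index=questions, columns=characteristics, dtype=float)
odds_ratios = pd.DataFrame(index=questions, columns=characteristics, dtype=float)

for q in questions:
    for c in characteristics:
        # 2x2 table: rows are the characteristic, columns are the response.
        table = pd.crosstab(df[c], df[q]).reindex(
            index=[0, 1], columns=[0, 1], fill_value=0
        )
        _, p, _, _ = chi2_contingency(table)
        p_values.loc[q, c] = p
        min_counts.loc[q, c] = table.to_numpy().min()
        # Odds of a positive response with the characteristic vs. without it.
        odds_ratios.loc[q, c] = (table.loc[1, 1] * table.loc[0, 0]) / (
            table.loc[1, 0] * table.loc[0, 1]
        )

print(p_values.round(3))
```

Because the data here is random noise, the p-values will generally be large; the point is only the shape of the output, one cell per question/characteristic pair.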

To narrow down the relationships to focus on, we ruled out tests where the minimum observed frequency count was less than 15 in any of the 2x2 contingency table cells (we believed there weren’t enough data in these tables), and tests where the p-value was higher than 0.1. We don’t want to advocate the use of p-values alone in determining if there is a relationship, but they can be useful in identifying areas where further investigation may be appropriate.
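Applying both thresholds can be sketched as a boolean mask over the result dataframes. In the made-up results below, only the 0.44 p-value for IS_FEMALE and WILL_RECOMMEND and the 1.9 odds ratio for members over 45 echo numbers from this post; every other value, and the column name IS_OVER_45, is hypothetical.

```python
import pandas as pd

questions = ["WILL_RECOMMEND", "SATISFIED_WITH_QUALITY"]
characteristics = ["IS_FEMALE", "IS_OVER_45"]

# Hypothetical results, laid out as rows = questions, columns = characteristics.
p_values = pd.DataFrame([[0.44, 0.02], [0.08, 0.03]],
                        index=questions, columns=characteristics)
min_counts = pd.DataFrame([[40, 22], [9, 31]],
                          index=questions, columns=characteristics)
odds_ratios = pd.DataFrame([[1.1, 1.9], [1.6, 0.7]],
                           index=questions, columns=characteristics)

MIN_COUNT = 15     # rule out tables with any cell below this count
MAX_P_VALUE = 0.1  # rule out tests with a p-value above this

# Keep a cell only if it passes both thresholds; everything else becomes NaN.
keep = (min_counts >= MIN_COUNT) & (p_values <= MAX_P_VALUE)
filtered_odds_ratios = odds_ratios.where(keep)

print(filtered_odds_ratios)
```

Here the IS_FEMALE / WILL_RECOMMEND cell is dropped because of its p-value, and the IS_FEMALE / SATISFIED_WITH_QUALITY cell is dropped because its smallest table cell holds only 9 observations.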

This provides us with an organized dataframe of all the values we were looking for, and we can even use the seaborn heatmap function to quickly visualize the most significant relationships.
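A heatmap of the filtered odds ratios could be drawn roughly as follows. The values are hypothetical, the color map and figure size are arbitrary choices, and the Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical filtered odds ratios; NaN marks tests that failed a threshold.
filtered_odds_ratios = pd.DataFrame(
    [[np.nan, 1.9], [np.nan, 0.7]],
    index=["WILL_RECOMMEND", "SATISFIED_WITH_QUALITY"],
    columns=["IS_FEMALE", "IS_OVER_45"],
)

fig, ax = plt.subplots(figsize=(6, 3))
# seaborn leaves NaN cells blank, so only the surviving tests are colored.
sns.heatmap(filtered_odds_ratios, annot=True, fmt=".2f", cmap="viridis", ax=ax)
ax.set_title("Odds ratios")
fig.tight_layout()
```

The same call works for the p-value and minimum-frequency dataframes; only the input and the title change.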

After plotting the heatmaps, it is easy to see the minimum frequency counts from each contingency table, and the p-values from each chi-squared test. You can use these to confirm why a cell may not be populated in the final odds ratio dataframe.

Heatmap showing minimum frequency count from contingency tables
Heatmap showing p-values

For example, the p-value from the chi-squared test between the variables IS_FEMALE and WILL_RECOMMEND is 0.44, which is higher than the 0.1 threshold we set earlier, so the odds ratio for that question/member characteristic combination will not show up in the dataframe below.

Heatmap showing odds ratios if the minimum frequency count and p-value met the thresholds

From this odds ratio dataframe, it appears that there is a significant relationship between member age and willingness to recommend the product to others. Using this data, we could say that the odds of a member older than 45 recommending the product are 1.9 times the odds for a younger member.

When testing numerous survey responses and member characteristics, this methodology will allow you to efficiently and effectively pinpoint the strongest relationships between categorical variables. This can then drive further analyses to help you gain a better understanding of what will keep your members happy and satisfied.

Ro Data Team Blog

Ro Data Team Blog: data analytics, data engineering, data science
