Analyzing Survey Results: Efficiently Performing Lots of Chi-Squared Tests

Kevin Stern
Apr 3, 2019 · 4 min read

Surveys can provide a direct view into your members' feelings, thoughts, and opinions. Connecting survey results to other existing data can unearth deeper insights about your member base.

One way to go about this integrated analysis is to find relationships between existing member characteristics (e.g., age and gender) and the survey responses. For example, after sending out a survey, you may hope to answer questions like:

- Do older members think frequency of contact is more important?
- Are female members more likely to recommend the product to their friends?
- Are members who receive the product more frequently (but in smaller amounts) more satisfied with the quality of the product?

In Python, it is fairly straightforward to find the pairwise correlation between all columns in a data set using the pandas .corr() method; however, this measures the strength and direction of a linear relationship between continuous variables, which is often not suitable for categorical variables.

To test whether two categorical variables are significantly related, you can use a chi-squared test of independence. However, if there are many questions in the survey and you want to test them against many member characteristics, running every test by hand and finding the strongest relationships quickly becomes tedious.
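As a minimal sketch of a single test, the snippet below builds a contingency table with pandas and passes it to SciPy's chi2_contingency. The toy data is made up for illustration, and the column names are borrowed from the examples later in this post. Note that chi2_contingency applies Yates' continuity correction to 2x2 tables by default.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data: one member characteristic and one survey response,
# both already coded as 1/0.
df = pd.DataFrame({
    "IS_FEMALE":      [1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0],
    "WILL_RECOMMEND": [1, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0],
})

# 2x2 contingency table of observed counts (rows: IS_FEMALE, columns: WILL_RECOMMEND).
table = pd.crosstab(df["IS_FEMALE"], df["WILL_RECOMMEND"])

# Returns the test statistic, the p-value, the degrees of freedom,
# and the counts expected if the two variables were independent.
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2={chi2:.3f}, p-value={p_value:.3f}, dof={dof}")
```

With only a dozen rows this is far too little data for a meaningful test; it only shows the mechanics.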

Our primary goal is to efficiently perform many chi-squared tests and, for each test, collect:
1. p-values
2. minimum observed frequency in each contingency table
3. odds ratio from the contingency table
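One way to gather these three values for a single question/characteristic pair is sketched below. The helper name pair_stats and the toy counts are hypothetical, and the odds ratio is computed as the usual cross-product ratio of the 2x2 table, which is an assumption about how the original analysis calculated it.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def pair_stats(df, characteristic, question):
    """Return (p_value, min_count, odds_ratio) for two 1/0-coded columns.

    Assumes both columns contain both values, so the table is a full 2x2.
    """
    table = pd.crosstab(df[characteristic], df[question])
    _, p_value, _, _ = chi2_contingency(table)

    counts = table.to_numpy()
    min_count = counts.min()
    # counts[i, j] = members with characteristic == i and response == j.
    # Odds of a positive response when the characteristic is 1,
    # divided by the odds when it is 0.
    odds_ratio = (counts[1, 1] / counts[1, 0]) / (counts[0, 1] / counts[0, 0])
    return p_value, min_count, odds_ratio


# Hypothetical data: 40 older members (30 would recommend),
# 40 younger members (20 would recommend).
df = pd.DataFrame({
    "AGE_OVER_45":    [1] * 40 + [0] * 40,
    "WILL_RECOMMEND": [1] * 30 + [0] * 10 + [1] * 20 + [0] * 20,
})

p_value, min_count, odds_ratio = pair_stats(df, "AGE_OVER_45", "WILL_RECOMMEND")
print(p_value, min_count, odds_ratio)
```

Here the smallest cell holds 10 members and the odds ratio is (30/10) / (20/20) = 3.0.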

The first step in preparing the survey results was to remap positive responses to 1 and negative responses to 0. For example, if a member said that frequency of contact with your company was important, or that they were satisfied with the quality of the product, their response was mapped to 1. For the purposes of this exercise, we mapped every response to either 1 or 0, but as the number of survey respondents increases, it may be worthwhile to map responses to ordinal values (e.g., ‘Strongly agree’ would be assigned a higher value than ‘Somewhat agree’).
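A minimal sketch of this remapping, assuming a four-point agreement scale. The column names and the two "disagree" labels are hypothetical, since the actual survey wording is not shown here.

```python
import pandas as pd

# Hypothetical raw survey answers.
responses = pd.DataFrame({
    "CONTACT_FREQUENCY_IMPORTANT": ["Strongly agree", "Somewhat disagree", "Somewhat agree"],
    "SATISFIED_WITH_QUALITY":      ["Somewhat agree", "Strongly disagree", "Strongly agree"],
})

# Binary coding: any positive response becomes 1, any negative response 0.
binary_map = {
    "Strongly agree": 1,
    "Somewhat agree": 1,
    "Somewhat disagree": 0,
    "Strongly disagree": 0,
}

# Ordinal alternative: stronger agreement gets a higher value.
ordinal_map = {
    "Strongly disagree": 1,
    "Somewhat disagree": 2,
    "Somewhat agree": 3,
    "Strongly agree": 4,
}

# Series.map leaves any answer missing from the dictionary as NaN,
# which makes unexpected labels easy to spot.
binary = responses.apply(lambda col: col.map(binary_map))
ordinal = responses.apply(lambda col: col.map(ordinal_map))

print(binary)
print(ordinal)
```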

In this example, the survey data had already been augmented with member characteristics prior to being imported into Python.

Next, we created three empty dataframes where the rows were all the different survey questions, and the columns were all the member characteristics that we wanted to test against. Each cell would contain the p-value, minimum observed frequency, or odds ratio for that specific question/member characteristic chi-squared test.
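The setup and the loop that fills the three dataframes might look roughly like the sketch below. The data is randomly generated and the column names are placeholders, so the resulting numbers carry no meaning; the odds ratio is again assumed to be the cross-product ratio of the 2x2 table.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical 1/0-coded data standing in for the merged survey and member table.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "IS_FEMALE": rng.integers(0, 2, n),
    "AGE_OVER_45": rng.integers(0, 2, n),
    "WILL_RECOMMEND": rng.integers(0, 2, n),
    "SATISFIED_WITH_QUALITY": rng.integers(0, 2, n),
})

questions = ["WILL_RECOMMEND", "SATISFIED_WITH_QUALITY"]
characteristics = ["IS_FEMALE", "AGE_OVER_45"]

# Three empty (NaN-filled) dataframes: rows are questions, columns are characteristics.
p_values = pd.DataFrame(index=questions, columns=characteristics, dtype=float)
min_counts = pd.DataFrame(index=questions, columns=characteristics, dtype=float)
odds_ratios = pd.DataFrame(index=questions, columns=characteristics, dtype=float)

for question in questions:
    for characteristic in characteristics:
        table = pd.crosstab(df[characteristic], df[question])
        _, p_value, _, _ = chi2_contingency(table)
        counts = table.to_numpy()  # counts[i, j]: characteristic == i, response == j

        p_values.loc[question, characteristic] = p_value
        min_counts.loc[question, characteristic] = counts.min()
        odds_ratios.loc[question, characteristic] = (
            (counts[0, 0] * counts[1, 1]) / (counts[0, 1] * counts[1, 0])
        )

print(p_values)
print(min_counts)
print(odds_ratios)
```

Looping over column names keeps the code the same size no matter how many questions or characteristics are added; only the two lists change.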

To narrow down the relationships to focus on, we ruled out tests where the minimum observed frequency count was less than 15 in any of the 2x2 contingency table cells (we believed there weren’t enough data in these tables), and tests where the p-value was higher than 0.1. We don’t want to advocate the use of p-values alone in determining if there is a relationship, but they can be useful in identifying areas where further investigation may be appropriate.
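Applying both thresholds can be sketched with a boolean mask and DataFrame.where. In the illustrative values below, only the 0.44 p-value and the 1.9 odds ratio come from the results discussed in this post; every other number is made up.

```python
import numpy as np
import pandas as pd

questions = ["WILL_RECOMMEND", "SATISFIED_WITH_QUALITY"]
characteristics = ["IS_FEMALE", "AGE_OVER_45"]

# Illustrative results; most values are hypothetical.
p_values = pd.DataFrame([[0.44, 0.03], [0.08, 0.02]],
                        index=questions, columns=characteristics)
min_counts = pd.DataFrame([[40, 22], [9, 31]],
                          index=questions, columns=characteristics)
odds_ratios = pd.DataFrame([[1.1, 1.9], [0.6, 1.4]],
                           index=questions, columns=characteristics)

MIN_COUNT = 15      # smallest acceptable cell count in a contingency table
MAX_P_VALUE = 0.1   # largest acceptable p-value

# Keep an odds ratio only if its test passes both thresholds;
# every other cell becomes NaN.
keep = (min_counts >= MIN_COUNT) & (p_values <= MAX_P_VALUE)
filtered_odds_ratios = odds_ratios.where(keep)

print(filtered_odds_ratios)
```

In this example the IS_FEMALE / WILL_RECOMMEND cell is dropped because of its p-value, and the IS_FEMALE / SATISFIED_WITH_QUALITY cell is dropped because its smallest cell count is below 15.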

This provides us with an organized dataframe of all the values we were looking for, and we can even use the seaborn heatmap function to quickly visualize the most significant relationships.
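A heatmap of such a dataframe could be drawn roughly as follows. The values, color map, and figure size are arbitrary choices for illustration, not the settings used for the figures in this post. seaborn leaves NaN cells blank, which is convenient for a filtered odds ratio table.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical filtered odds ratios; NaN marks tests that failed a threshold.
odds_ratios = pd.DataFrame(
    [[np.nan, 1.9], [np.nan, 1.4]],
    index=["WILL_RECOMMEND", "SATISFIED_WITH_QUALITY"],
    columns=["IS_FEMALE", "AGE_OVER_45"],
)

fig, ax = plt.subplots(figsize=(5, 3))
# annot=True writes each value inside its cell.
sns.heatmap(odds_ratios, annot=True, fmt=".2f", cmap="Blues", linewidths=0.5, ax=ax)
ax.set_title("Odds ratios")
fig.tight_layout()
```

The same call works for the p-value and minimum frequency dataframes.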

After plotting the heatmaps, it is easy to see the minimum frequency counts from each contingency table, and the p-values from each chi-squared test. You can use these to confirm why a cell may not be populated in the final odds ratio dataframe.

[Figure: Heatmap showing minimum frequency count from contingency tables]

[Figure: Heatmap showing p-values]

For example, the p-value from the chi-squared test between the variables IS_FEMALE and WILL_RECOMMEND is 0.44, which is higher than the 0.1 threshold we set earlier, so the odds ratio for that question/member characteristic combination will not show up in the dataframe below.

[Figure: Heatmap showing odds ratios if the minimum frequency count and p-value met the thresholds]

From this odds ratio dataframe, it appears that there is a significant relationship between being an older member and being willing to recommend the product to others. Using this data, we could say that the odds of a member older than 45 recommending the product are 1.9 times the odds for a younger member.
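To make the interpretation concrete, the sketch below uses made-up counts chosen so that the odds ratio comes out to 1.9; they are not the actual survey counts. It also shows that an odds ratio of 1.9 does not mean older members are 1.9 times as likely to recommend the product.

```python
# Hypothetical counts, chosen only so that the odds ratio equals 1.9.
older_yes, older_no = 95, 50        # members older than 45
younger_yes, younger_no = 100, 100  # younger members

# Odds = positive responses divided by negative responses.
older_odds = older_yes / older_no
younger_odds = younger_yes / younger_no
odds_ratio = older_odds / younger_odds

# Probability = positive responses divided by all responses.
older_prob = older_yes / (older_yes + older_no)
younger_prob = younger_yes / (younger_yes + younger_no)
probability_ratio = older_prob / younger_prob

print(f"odds ratio: {odds_ratio:.2f}")
print(f"probability ratio: {probability_ratio:.2f}")
```

With these counts, the ratio of probabilities is noticeably smaller than the ratio of odds, so it is worth keeping the word "odds" when reporting the result.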

When testing numerous survey responses and member characteristics, this methodology will allow you to efficiently and effectively pinpoint the strongest relationships between categorical variables. This can then drive further analyses to help you gain a better understanding of what will keep your members happy and satisfied.

Ro Data Team Blog
