TURF Analysis in Python

Total Unduplicated Reach and Frequency

Nick Campanelli
Analytics Vidhya
3 min read · Apr 28, 2020


A Total Unduplicated Reach and Frequency (TURF) analysis is a statistical method that marketers and market researchers use to identify the combination of products or services that reaches the largest share of an audience. If you want more background, I like this explanation.

This analysis is relatively straightforward to implement but can be time-consuming if done manually in a tool like Excel. To streamline my workflow, and because I couldn’t find a Python package to conduct this analysis, I wrote a short program that gives me a quick TURF output. Please read ahead if you’d like to know about the process, or head over to my GitHub repository if you’re more interested in the code.

Part 1 | Data Format

A TURF analysis is built on a series of preference choices from survey data, where 1 = prefers and 0 = does not prefer. How you classify the two depends on the particular problem at hand and the question that was asked of respondents.

Data setup for TURF analysis

To run a TURF analysis with this tool in Python, the data needs to be in a matrix like the one above, where each unique id is linked to a preference across all features (here, labeled reasons). The setup for this question allowed respondents to select every option they preferred. For this analysis, assume each selected option carries equal weight.
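As a rough illustration of that layout, here is a minimal sketch of such a matrix in pandas. The ids and 0/1 values are made up for the example (they are not the survey data from this article), and the variable names are my own assumptions.

```python
import pandas as pd

# Illustrative, made-up responses: 1 = prefers, 0 = does not prefer.
# Each row is one respondent id; each column is one feature ("reason").
responses = pd.DataFrame(
    {
        "id": [101, 102, 103, 104, 105],
        "reason1": [1, 0, 1, 0, 0],
        "reason2": [1, 1, 0, 0, 0],
        "reason3": [0, 0, 1, 1, 0],
    }
).set_index("id")

# Individual reach: the share of respondents who selected each feature.
individual_reach = responses.mean()

print(responses)
print(individual_reach)
```

Because the values are 0 or 1, the column mean is the same as the share of respondents who chose that feature.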

Part 2 | Sets in Python

A set is an unordered collection of unique values that allows for simple, speedy comparisons and unions. Using sets to create unique groupings of user ids that prefer each feature allows for lightning-fast comparisons across large datasets (as opposed to using lists). Iterating over a pandas DataFrame column-wise makes this simple.
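One way to build those sets, sketched under the assumption that respondent ids are the DataFrame index and that a value of 1 means "prefers" (the data and the name `feature_sets` are hypothetical):

```python
import pandas as pd

# Illustrative, made-up responses indexed by respondent id.
responses = pd.DataFrame(
    {
        "reason1": [1, 0, 1, 0, 0],
        "reason2": [1, 1, 0, 0, 0],
        "reason3": [0, 0, 1, 1, 0],
    },
    index=pd.Index([101, 102, 103, 104, 105], name="id"),
)

# DataFrame.items() yields (column name, column Series) pairs.
# For each feature, keep the ids of the respondents whose value is 1.
feature_sets = {
    feature: set(column[column == 1].index.tolist())
    for feature, column in responses.items()
}

print(feature_sets)
```

I used a dictionary keyed by feature name here so that each set stays attached to its label; a plain list of sets kept in column order would work as well.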

Now, using this list of sets and the feature with the greatest individual reach, I can identify the ordering and reach of each successive feature. To do this, I compare the selected set with every other set individually using the difference method, which returns the user ids in the other set that are not already in the selected set. After finding the set that contributes the most new user ids, I use the union method to join the two sets and drop duplicates.
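A single step of that comparison might look like the sketch below. It uses plain Python sets with made-up ids, and the variable names are assumptions for illustration.

```python
# Illustrative, made-up sets of respondent ids that prefer each feature.
feature_sets = {
    "reason1": {1, 2, 3, 4},
    "reason2": {3, 4, 5},
    "reason3": {1, 6, 7},
}

# Start from the feature with the greatest individual reach.
first = max(feature_sets, key=lambda name: len(feature_sets[name]))
reached = set(feature_sets[first])

# For every other feature, count the ids it would add that are not yet reached.
gains = {
    name: len(ids.difference(reached))
    for name, ids in feature_sets.items()
    if name != first
}

# Pick the feature with the most new ids and merge it in; union drops duplicates.
second = max(gains, key=gains.get)
reached = reached.union(feature_sets[second])

print(first, second, sorted(reached))
```

Note that `reason2` overlaps heavily with `reason1`, so it adds only one new id, while `reason3` adds two and is picked second.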

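The length of this new set, divided by the number of respondents, gives the unduplicated reach of the two features selected so far. Because each step adds whichever feature contributes the most new respondents, this is a greedy, stepwise ordering; an exhaustive search over every combination could occasionally find a pair with slightly higher reach. Repeat this process through every feature and I have the full ordering and the total unduplicated reach. The full code that performs these operations is in my GitHub repository.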
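Putting the steps together, a minimal sketch of the whole loop could look like this. The function name `turf_order` and the sample data are hypothetical, and this is not necessarily how the repository code is organized.

```python
def turf_order(feature_sets, n_respondents):
    """Greedy ordering: repeatedly add the feature with the most new ids.

    feature_sets: dict mapping feature name -> set of respondent ids.
    n_respondents: total number of respondents in the survey.
    Returns (ordered feature names, cumulative reach after each feature).
    """
    remaining = dict(feature_sets)
    reached = set()
    order, cumulative_reach = [], []

    while remaining:
        # With nothing reached yet, this picks the greatest individual reach.
        best = max(
            remaining,
            key=lambda name: len(remaining[name].difference(reached)),
        )
        reached = reached.union(remaining.pop(best))
        order.append(best)
        cumulative_reach.append(len(reached) / n_respondents)

    return order, cumulative_reach


# Illustrative, made-up data: 8 respondents, one of whom chose nothing.
feature_sets = {
    "reason1": {1, 2, 3, 4},
    "reason2": {3, 4, 5},
    "reason3": {1, 6, 7},
}
order, cumulative_reach = turf_order(feature_sets, n_respondents=8)
print(order)
print(cumulative_reach)
```

Ties are broken by whichever feature comes first in the dictionary, which is an arbitrary choice in this sketch.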

Part 3 | Output and Notes

The code outputs two lists. The first is an ordered list of the feature names, and the second is the cumulative reach obtained with each additional feature. Using the two together, I can build the plot below, my TURF output. I can see that my first feature reaches ~45% of the population, and adding four more features (through reason4) brings reach to a little over three-quarters of the population.
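To read the two lists side by side before plotting, a small sketch like the one below can help. The values are placeholders I made up for illustration, not the results plotted in this article.

```python
# Placeholder output lists (illustrative values only).
order = ["reason1", "reason3", "reason2"]
cumulative_reach = [0.5, 0.75, 0.875]

# Incremental reach: how much each feature adds on top of the ones before it.
incremental = [cumulative_reach[0]] + [
    later - earlier
    for earlier, later in zip(cumulative_reach, cumulative_reach[1:])
]

# Print feature name, cumulative reach, and incremental gain as percentages.
for name, total, gain in zip(order, cumulative_reach, incremental):
    print(f"{name:<10}{total:>8.1%}{gain:>+9.1%}")
```

The incremental column usually shrinks quickly, which is a handy cue for where adding more features stops paying off.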

TURF Output

Thanks for reading. Please check out my GitHub repository for this project if you’d like to perform your own simple TURF analyses.
