Segmenting the academic audience

Moving beyond the .edu approximation

B Nelson
Data Science at Microsoft
9 min read · Jan 19, 2021


The academic market — with over one billion students and educators globally — carries with it a special purchasing status. For example, educators can tap into great deals at craft outlets and bookstores. Flashing a student ID gets discounts at movie theatres, museums, and sporting events. Students who need cloud resources can access the Azure for Students offer, as well as GitHub’s Student Developer Pack — both of which offer a range of free developer tools. For students with credit cards, another way to access discount cloud services is through the Azure Free Trial, which includes $200 in credit and additional free tier services for 12 months. Each of these offers seeks to appeal to the academic market with low prices that do not sacrifice utility or relevance.

For customers opted into communications, additional academic support includes notifications of student-related programs, offers, and events. But in the case of Azure Free Trial, there are multiple audiences using the offer — and therein lies the challenge. To support the Free Trial audience appropriately, we need to segment it — distinguishing students and educators from professional developers, small businesses, researchers, and other professionals. In this article I highlight how the applied data science techniques I used to segment the academic audience can help you do the same while preserving customer privacy.

The .edu approximation

One of the simplest ways to segment the academic audience is through an academic email domain. That is, if the domain ends in .edu, the associated owner is very likely a member of the academic audience, because .edu is a reserved top-level domain (TLD). This method of segmentation can be described as “the .edu approximation.” It’s an approximation because the .edu domain also includes customers who are not necessarily students or educators, such as administrators. It’s also not sufficient, by itself, to identify all relevant students and educators.

While it’s only approximate, I use this method all the time — and for good reason. It’s intuitive and easy for stakeholders to understand. It’s also simple to implement. And it’s privacy friendly, because it requires only the TLD of an email address instead of the full account name.
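As a sketch of how simple the implementation is, the whole check fits in a few lines (the function name and addresses here are illustrative, not from a production system):

```python
def is_academic_edu(email: str) -> bool:
    """Approximate academic membership via the .edu top-level domain.

    Privacy friendly: only the domain part of the address is examined.
    """
    domain = email.rsplit("@", 1)[-1].lower()
    return domain.endswith(".edu")

print(is_academic_edu("student@mit.edu"))   # True
print(is_academic_edu("dev@example.com"))   # False
```

Note that only the domain ever needs to be inspected, which is what makes the approach privacy friendly.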

The problem with the .edu approximation

Despite the usefulness of the .edu approximation, the method is not perfect. First, customers in the academic audience segment do not always have — or use — academic email addresses. This means we inevitably miss some customers with this approach. Second, because the .edu domain is limited to U.S. post-secondary institutions, this approach is geographically biased.

While we can’t control the email addresses that customers use to register for our services, we can improve our understanding of academic domains. Below I describe an internal “hackathon” project I participated in that used domain extensions, natural language processing, and frequentist inference to create a scalable, privacy-friendly way to segment customers by domain extension.

Getting started

I modeled my analysis on the design of a Genome-wide Association (GWA) study — an observational method in bioinformatics that uses frequency statistics across populations to determine whether genetic variants are associated with specific physical traits. My project was similar in that I wanted to leverage two populations of customers to identify domain patterns associated with a specific audience. More specifically, I was looking for functional equivalents of the .edu domain that are used in other countries.

Because our goal was to identify audience-specific domain patterns in other geographies, we started with a labeled set of domains within a specific geography. Using such data allowed us to test the hypothesis that some domain patterns are audience specific.

Inputs and hypotheses

The inputs of this project were email domains from populations A and B. Population A included domains from customers in a general audience experience. Conversely, population B included domains from customers enrolled in an academic experience. In terms of a GWA study, the general experience audience was the “control” group, and the academic population was the “case” group.

With these case-control populations, we can hypothesize that our case population has a higher frequency of audience-specific domains compared to the control population. We can test this hypothesis by calculating the odds ratio of domain patterns appearing in these populations and using a chi-square test to quantify significance. In carrying out this procedure, we expect to be left with a set of domain patterns that are most and least academic.

Working with domain extensions

Domains are hierarchical. They decrease in importance from right to left, meaning that the right-most letter combination is the most important, hence the term “top-level domain.”

The domains .com and .edu are both examples of top-level domains. Globally, second-level domains (the letter combinations immediately prior to TLDs) are often considered the “brand” of a domain. For example, the “Microsoft” in microsoft.com or the “MIT” in mit.edu. The pattern represented by this model in its simplest form is brand.TLD.

The model becomes more complex with the introduction of country codes. Did you know that each country has its own two-letter top-level domain in the form of a country code? (Even the United States has one, though not often seen, which is .us.) It’s called a country-code top-level domain, or ccTLD. For example, Amazon.co.uk is the local Amazon equivalent for the United Kingdom. In this example, the top-level domain is .uk and the second-level domain is .co. Given this information, the relevant pattern expands to include brand.2LD.ccTLD.
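To make the hierarchy concrete, here is a small illustrative helper (not part of the original analysis) that lists a domain’s labels from most to least significant:

```python
def domain_levels(domain: str) -> list[str]:
    """Return a domain's dot-separated labels, most significant first."""
    return list(reversed(domain.lower().split(".")))

print(domain_levels("amazon.co.uk"))   # ['uk', 'co', 'amazon']
print(domain_levels("mit.edu"))        # ['edu', 'mit']
```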

This model is further complicated by the variety of domains that can exist within the academic space. For example, in higher education, some universities parse their domains by role or research area, resulting in these patterns:

alumni.brand.TLD

student.program.brand.TLD

In primary and secondary schools, domains can include permutations of school name, school district, state, county, country, and so on, with these patterns:

brand.district.ccTLD

brand.county.k12.state.ccTLD

The complexity of the known domain patterns, combined with our desire to include them in our analysis, suggested using Natural Language Processing (NLP) to process the data.

Natural Language Processing: Words, stop words, and n-grams

The fundamentals of NLP involve breaking down bodies of text into words and n-grams, where words are unique sequences of characters and n-grams are unique sequences of n words. For this project, I transformed domains into word sequences by using the dot character as the delimiter between domain levels, in a process known as tokenization. Then, analogous to the use of stop-words in NLP — words filtered out of natural language data — I treated the first position in every sequence as a “stop position” by removing it from the data set.

As mentioned earlier, domains are hierarchical, and the left-most position in the sequence is the least important part of the domain. Removing this position from our analysis reduces its scope and complexity, while also removing many of the identifiable brands (such as “outlook” and “hotmail”) that would otherwise appear as noise.

I then used the remaining sequence of words in each domain to identify unigrams (individual words), bigrams (sequences of two adjacent words), and trigrams (adjacent triplets of words) that appear in a population of domains.
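A minimal sketch of this tokenization (the example domain is hypothetical, and the production pipeline’s details are assumed) could look like:

```python
def domain_ngrams(domain: str, max_n: int = 3) -> list[str]:
    """Tokenize a domain on dots, drop the left-most "stop position,"
    and emit unigrams, bigrams, and trigrams over the remaining levels."""
    tokens = domain.lower().split(".")[1:]  # remove the stop position
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append("." + ".".join(tokens[i : i + n]))
    return grams

# Hypothetical academic domain in India
print(domain_ngrams("student.cs.manipal.edu.in"))
```

The left-most label (“student” here) never enters the analysis, which is also what keeps brands like “outlook” and “hotmail” out of the n-gram counts.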

Domains are transformed using Natural Language Processing techniques to generate n-grams, or contiguous sequences of n words, from an original text.

After processing, the final data set contained a finite number of unigrams, bigrams, and trigrams — collectively, the n-grams — appearing with different frequencies in the starting populations of domains. The final step involved using frequentist inference to find the n-grams most strongly associated with the academic audience.

Frequentist statistics

Knowing the frequencies of n-grams in each population allows us to construct a contingency table for each domain pattern. By using the following table and employing a chi-square test, we can test the null hypothesis that the frequencies of a given n-gram in populations A and B are equivalent.

Sample contingency table for a single n-gram.

For each n-gram, we do a single chi-square test, which yields a single p-value ranging from 0 to 1. The most extreme p-values are very close to zero, and the n-grams associated with them are the ones we’re interested in identifying so we can use them to segment our audience.
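As an illustrative sketch (the counts below are invented), the odds ratio and chi-square statistic for a single n-gram’s 2×2 table can be computed directly; with one degree of freedom, the p-value follows from the complementary error function:

```python
import math

# Hypothetical counts for one n-gram:
# rows: domains containing the pattern / domains without it
# columns: case (academic) population / control (general) population
a, b = 120, 30     # pattern present: case, control
c, d = 880, 970    # pattern absent:  case, control

n = a + b + c + d
odds_ratio = (a * d) / (b * c)
chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
p_value = math.erfc(math.sqrt(chi2 / 2))  # chi-square p-value, 1 dof

print(f"odds ratio = {odds_ratio:.2f}")   # ≈ 4.41
print(f"chi-square = {chi2:.1f}, p = {p_value:.2e}")
```

An odds ratio above 1 with a tiny p-value is exactly the signature of an academic-associated pattern; in practice, a library routine such as SciPy’s chi-square test for contingency tables serves the same purpose.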

Plotting and significance thresholds

Visualizing tiny numbers is not easy. One way to visualize extreme p-values is with a Manhattan plot, which places the negative log transform of each p-value on the y-axis so that the smallest values stand out in a big way.
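The transform itself is simple; a small numeric sketch (with p-values invented for illustration) shows why tiny p-values dominate the y-axis:

```python
import math

# Hypothetical p-values for a handful of n-grams
p_values = {".edu": 1e-240, ".ac": 3e-180, ".com": 2e-35, ".co": 0.4}

# Manhattan plots chart -log10(p), so the smallest p-values become
# the tallest points on the chart.
scores = {gram: -math.log10(p) for gram, p in p_values.items()}
for gram, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{gram:6s} {score:7.1f}")
```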

The figures below show the Manhattan plots for the unigram and bigram scores of domains for a sample of customers in India. I’ve annotated the six most extreme unigrams and the three most extreme bigrams. Together, the plots tell a story about the academic domain patterns in our data, as they become readily identifiable by visually rising above the main sequence of the plot.

Color is used to indicate whether a domain is positively or negatively associated with our case population. This information does not come from the p-value, but from the relative n-gram frequencies in the contingency table. N-grams more associated with the case population are blue, whereas n-grams more associated with the control population are gray.

Manhattan plot for domain patterns in India annotating six unigrams that can be used to segment the academic audience in India.

We start by focusing attention on the unigrams. We recognize .edu, .com, and .org as common top-level domains. We confirm that .in is India’s ccTLD. In our research we learn that Manipal is a city in India — but also that there is an academic institution of the same name. It’s not immediately clear what the .ac unigram means, but nevertheless we can look at the bigram scores to complete the unigram narrative.

Manhattan plot for domain patterns in India reveals two bigrams associated with the academic audience, and a separate bigram (.co.in) that is particularly unaffiliated with the academic audience.

In the bigram scores, we see .in acting as a ccTLD in all three cases. Furthermore, while the .co.in bigram mirrors the earlier .co.uk example, we see that .ac and .edu appear to be second-level domains associated with India’s academic institutions. Indeed, we can confirm that these are reserved academic domains in India by trying to register for them and observing the domain restrictions that result.

Determining an exact threshold of significance with this method is context specific. For exploratory analysis, a threshold may not be needed at all. When one is needed, testing every n-gram in the dataset requires a multiple-testing correction. The Bonferroni correction is one approach, but it’s a very conservative method that may not be suitable for all use cases. Ultimately, the threshold of significance depends on the scenario and the intended use of the data.
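As a sketch, the Bonferroni correction simply divides the family-wise significance level by the number of tests performed (the counts here are illustrative):

```python
def bonferroni_threshold(alpha: float, num_tests: int) -> float:
    """Per-test p-value threshold under the Bonferroni correction."""
    return alpha / num_tests

# e.g., a family-wise alpha of 0.05 spread across 50,000 tested n-grams
threshold = bonferroni_threshold(0.05, 50_000)
print(threshold)  # on the order of 1e-06
```

Its conservatism is visible in the formula: every additional n-gram tested shrinks the threshold each individual n-gram must clear.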

Lessons learned

The methods used in this analysis resulted in an improvement on the simple .edu approximation. Today the expanded .edu approximation we use, based on the approach I’ve outlined, includes patterns from more than 60 countries, resulting in a more geographically representative approximation and a 100 percent increase in our ability to segment the academic audience successfully. (Along the way, we also identified countries without academic domains, indicating the need for a separate model to further reduce our geographic bias.) Since our original analysis in 2019, we’ve also observed the continued growth of Wikipedia entries for academic second-level domains, suggesting that the desire to understand academic domains is growing (see the entry for .ac and the one for .edu).

In its final form, our analysis involved applying frequentist analysis to geographically rich n-grams derived from email domains, which resulted in an easy-to-understand, easy-to-implement, and privacy-friendly way to segment the academic audience.

Conclusion

In this article I’ve shared how I leveraged domain patterns, natural language processing, and frequentist statistics to help segment a customer population in a scalable, privacy-compliant manner. In this case, we now have better means to connect students and educators with Azure offerings to help meet their specific needs and budgets. While the focus here is on the academic audience, the method can be applied to any population of interest.

Brittany Nelson is on LinkedIn.

Data Scientist at Microsoft. Scientist, Software Engineer, and SDET. Formerly at SMART Technologies and Wellspring Worldwide.