Infinite Segmentation: Scalable Mutual Information Ranking on Real World Graphs

Ankit Srivastava, Joshua Allen, Ken Johnston, Microsoft

Ankit Srivastava

Published in

Microsoft Azure

16 min readMar 28, 2019

Abstract

These days it’s not about normal growth, it’s about finding and driving hockey-stick levels of growth. Sales and marketing organizations are looking to AI to help growth hack their way to new markets and new segments. We have used Mutual Information (MI) Ranking for many years to help filter out noise and find the critical insights to new cohorts of users, devices, businesses, and networks and now we can do it at nearly infinite granularity across massive data sources and dimensions. In this paper, we will explain the problem space of fine grained segmentation and introduce an implementation of Mutual Information Ranking that works not just on binary partitions but can scale to real world graphs with billions of entities and millions of partitions leveraging Hadoop / Spark cluster computing frameworks. We shall showcase cross industry use cases of Mutual Information ranking in scenarios ranging from segmentation, feature engineering, newsfeed ranking, cohort validation, query expansion and link prediction.

Introduction

Random variables share information with each other. The amount of information shared between them is symmetric and can be computed via Mutual Information [1]. If you know the temperature is 90 degrees in a city like Seattle, you have information about what month of the year it might be. If the month is December, there is information about what temperature to expect in the city. We employed MI ranking in the past to distinguish devices used by teacher vs student, developer vs non-developer, and IT Professional vs non-IT professionals. This helps us make Windows better for the applications these key segments care about. The alternative is to develop labelled data for a custom ML model for each new segment which is expensive, time consuming, and many times not viable.

1.1. Needs for segmentation

Given a population of users for any product / service, segmentation is a fundamental need to learn about sub-populations of consumers and identify distinguishing patterns in which they interact with a product or service. These distinguishing patterns can then subsequently be fed into recommender systems, supervised learning classification, and regression models to assist personalization and deliver improved customer satisfaction. They can be fed into an organizations’ BI reports as inferred dimensions to help generate insights on growing and improving the services and overall market share.

1.2. Identifying Blue Lakes

There is an important marketing concept of identifying blue oceans [2] i.e. an entire segment where there is minimal competition and maximum opportunity to capture the market. Organizations are always on the lookout for such blue oceans. With pervasive nature of services and organizations in virtually all domains, it is hard to find such large blue oceans, however, it is possible to find small blue lakes i.e. sub-populations where focus of competitors is minimal and the ability for a company to deliver unique and differentiated value remains. Often these opportunities require rapid and timely identification and can allow for significant gains in market share. Highly scaled Mutual Information Ranking helps identify these blue lakes in a swift, highly repeatable, and privacy compliant way.

1.3. Look-a-like modelling vs Seed Expansion

Look-a-like modelling [3] is a popular advertising technique productized in many platforms like Facebook [4], Google Display Network [5] and Yandex [6]. Marketing teams use attributes available in the advertising platform to create cohorts that are like their current customers. using attributes available about the current customers. This technique is top down rules based where certain set of rules criterion helps filter down to the audience worth advertising to. Mutual Information (MI) ranking is a statistical approach which works bottom up by expanding on a set of seed entities like website hosted by the organization or set of search queries searched for on the web about an organization’s services. MI helps identify other websites / search queries that are distinguishing of the visitors of your organization’s websites / search queries. For example, visitors of microsoft.com are likely to visit azure.com, bing.com, msn.com and so on.

Mutual Information

Mutual Information of two discrete random variables X and Y can be defined as [1]:

where p(x, y) is the joint probability function of X and Y, and p(x) and p(y) are the marginal probability distribution of X and Y respectively. Mutual Information is a common measure of association used in machine learning applications, with one desirable property being that it is scale invariant, reducing the need for feature normalization.

2.1. Data Format

We can apply the above equation with input datasets stored in a cluster computing framework like Hadoop / Spark. The inputs are structured as two row sets — features and partitions. These are then passed through the Mutual Information Ranking implementation in Hive / PySpark to rank features by MI gain per partition.

Input Data Format required for Binary and N-ary MI Ranking

Features common in B2B can be any information carrying user intent like web page visit, search query, application used, network visited. Consumer features might be segmented using products purchased, music listened to, or locations visited. Partitions can be any categorical variable associated to the visitors like country they belong to, industry they work in, role they have in their organization. In the naïve binary scenario, partitions are binary as in Table IIA like visitor belongs to country X vs rest of population, works in specific industry Y vs rest of the industries, has a certain role in the organization Z or not.

2.2. Algorithm

The binary case is just a special case of N-ary MI with N partitions. Table III lists the steps to implement it in a scalable map-reduce environment given the row sets available to us in Table IIB.

Table III: N-ary Mutual Information Ranking Psuedocode

To calculate information gain, ‘CalcBinaryMI’ makes 4 calls to ‘CalcSingleMI’ and sums the results –

Equation II: Mutual Information Gain Per Feature And Partition Label

Equation III: Direction of Association of Feature ‘x’ towards Partition Label ‘y’ denoted by a Boolean value ‘tip’

2.3. Results

The MI implementation in PySpark returns a result row set that contains all the intermediate probabilities and final information gain and tip columns as per Table III. The way we interpret the results is that the features who are ranked with high information gain are the ones whose presence or absence is very distinguishing of that label. Many times, we are interested only in positive association. For example, when we ran MI ranking on a sample of 5K StackOverflow posts subset to 3 arbitrarily chosen tags — “ml”, “iot” and “neo4j”, we found the most distinguishing unigrams for the 3 labels are ones whose presence is indicative that the post is related to that tag.

Table IV: Distinguishing Unigrams per 3 StackOverflow Tags

Table IV shows that the top unigram is the same as the tag name itself like in the case of “iot” and “neo4j”. But “ml” being an abbreviation of “machine learning” has “learning” show up just below “training”. If we build a multi-class classifier to predict these classes using unigrams as predictors, unigrams in Table IV shall rank highest in feature importance. So, we look at Mutual Information ranking as a scalable way to compute feature importance per label without need for building the entire classifier. It can easily scale to terabyte and petabyte scale graphs within minutes. In later sections, we shall demonstrate use of MI ranking on unsupervised, semi-supervised and positive-unlabelled (PU) learning problems.

Mutual Information Ranking Implementation in PySpark

Below is our implementation of Scalable Mutual Information Ranking in PySpark run on Azure DataBricks environment —

3.1. User Defined Functions

import math
def CalcSingleMI(pr_x, pr_y, pr_xy):
  if(pr_xy / (pr_x * pr_y) > 0):
    return pr_xy * math.log(pr_xy / (pr_x * pr_y))
  else:
    return 0
spark.udf.register("CalcSingleMI", CalcSingleMI, DoubleType())def CalcBinaryMI(pr_x, pr_y, pr_xy):
  pr_notx = 1 - pr_x
  pr_noty = 1 - pr_y
  pr_xnoty = pr_x - pr_xy
  pr_notxy = pr_y - pr_xy
  pr_notxnoty = 1 - pr_xy - pr_xnoty - pr_notxy
  return CalcSingleMI(pr_x, pr_y, pr_xy) +\
                    CalcSingleMI(pr_x, pr_noty, pr_xnoty) +\
                    CalcSingleMI(pr_notx, pr_y, pr_notxy) +\
                    CalcSingleMI(pr_notx, pr_noty, pr_notxnoty)
spark.udf.register("CalcBinaryMI", CalcBinaryMI, DoubleType())

3.2. MI Implementation

from pyspark.sql.types import DoubleType
import pyspark.sql.functions as func# Create a view or table
partitionlabels = partitionlabels.selectExpr("Id", "LabelId", "LabelName")
featureData = featureusage.selectExpr("Id", "Feature", "Observation")
featureData = featureData.withColumn("Observation", featureData["Observation"].cast("double"))
featureData = featureData.filter(featureData.Observation > 0.5)
featureData.createOrReplaceTempView( "FeatureData")
partitionlabels.createOrReplaceTempView("PartitionLabels")all = sqlContext.sql("SELECT FeatureData.Id,\
               FeatureData.Feature,\
               FeatureData.Observation,\
               PartitionLabels.LabelId,\
               PartitionLabels.LabelName\
        FROM FeatureData\
    INNER JOIN PartitionLabels\
    ON FeatureData.Id == PartitionLabels.Id\
    WHERE Observation > 0")per_user = spark.sql("SELECT Id,\
               SUM(Observation) AS UserSum\
        FROM all\
        GROUP BY Id")
per_user.createOrReplaceTempView("per_user")#  Normalize it
all = spark.sql("SELECT all.Id,\
               all.Feature,\
               Observation /UserSum * 100.0 AS Observation,\
               LabelId,\
               LabelName\
        FROM all\
        INNER JOIN per_user\
        ON all.Id == per_user.Id")
all.createOrReplaceTempView("all")n = spark.sql("select 1 AS key,\
               sum(Observation) AS n\
        from all")
n.createOrReplaceTempView("n")# P(X)
p_x = spark.sql("select Feature,\
               SUM(Observation) AS Observation\
        from all\
        group by Feature")
p_x.createOrReplaceTempView("p_x")p_x = spark.sql("select 1 AS key,\
                       * from p_x")
p_x.createOrReplaceTempView("p_x")p_x = spark.sql("select p_x.*,\
               n.n\
        from p_x\
        inner join n\
        on p_x.key == n.key")
p_x.createOrReplaceTempView("p_x")p_x = spark.sql("select key,\
               Feature,\
               Observation/n AS p_x\
        from p_x")
p_x.createOrReplaceTempView("p_x")# P(X,Y)
p_xy = spark.sql("SELECT LabelId,\
               LabelName,\
               Feature,\
               Observation\
        FROM all")
p_xy.createOrReplaceTempView("p_xy")p_xy = spark.sql("SELECT LabelId,\
               LabelName,\
               Feature,\
               SUM(Observation) AS Observation\
        FROM p_xy\
        GROUP BY LabelId, LabelName, Feature")
p_xy.createOrReplaceTempView("p_xy")p_xy = spark.sql("select 1 AS key,\
                       *\
        from p_xy")
p_xy.createOrReplaceTempView("p_xy")p_xy = spark.sql("SELECT p_xy.*,\
               n.n\
        FROM p_xy\
        INNER JOIN n\
        ON p_xy.key == n.key")
p_xy.createOrReplaceTempView("p_xy")p_xy = spark.sql("SELECT key,\
               LabelId,\
               LabelName,\
               Feature,\
               Observation/n AS p_xy\
        FROM p_xy")
p_xy.createOrReplaceTempView("p_xy")# P(Y)
p_y = spark.sql("SELECT 1 AS key,\
               LabelId,\
               LabelName,\
               SUM(Observation) AS Observation\
        FROM all\
        GROUP BY LabelId, LabelName")
p_y.createOrReplaceTempView("p_y")p_y = spark.sql("SELECT p_y.key,\
               LabelId,\
               LabelName,\
               p_y.Observation /n.n AS p_y\
        FROM p_y\
        INNER JOIN n\
        ON p_y.key == n.key")
p_y.createOrReplaceTempView("p_y")# Combined
mi = spark.sql("SELECT p_xy.LabelId,\
               p_xy.LabelName,\
               p_xy.Feature,\
               p_x,\
               p_y,\
               p_xy\
        FROM p_xy\
        INNER JOIN p_x\
        ON p_xy.Feature == p_x.Feature\
        INNER JOIN p_y\
        ON p_xy.LabelId == p_y.LabelId")
mi.createOrReplaceTempView("mi")mi = spark.sql("SELECT LabelId,\
               LabelName,\
               Feature,\
               p_x,\
               p_y,\
               p_xy,\
               CalcBinaryMI(p_x, p_y, p_xy) AS mi\
               FROM mi")
mi.createOrReplaceTempView("mi")ordered_mi = spark.sql("SELECT LabelId,\
               LabelName,\
               Feature,\
               p_x,\
               p_y,\
               p_xy,\
               mi\
        FROM mi\
        ORDER BY mi DESC")
ordered_mi.createOrReplaceTempView("ordered_mi")# Direction
ordered_mi = spark.sql("SELECT *,\
               (p_x - p_xy) AS pxminusy,\
               (p_xy/p_y) AS x\
        FROM ordered_mi")
ordered_mi.createOrReplaceTempView("ordered_mi")ordered_mi = spark.sql("SELECT *,\
               y>x AS tip\
        FROM ordered_mi")
ordered_mi.createOrReplaceTempView("ordered_mi")# Positive Association to label
%sql
select LabelId, LabelName, Feature, mi, tip from ordered_mi WHERE tip == false

Applications of Mutual Information Ranking

4.1. Segmentation without labels

One of early applications we applied MI ranking was to an ask where we were about to release Windows 10S. Educators wanted a locked down experience for the devices owned by students with good security and manageability of applications that could be installed on the devices. Therefore, Win10S only supported apps in the Microsoft Store. The ask was to find which Win32 applications are used predominantly by teachers and students but not by rest of population. We wanted to make sure such Win32 applications are available in the Microsoft Store. However, there was no easy way to get labels for teacher and student devices in Windows 10 K-12 US install base.

From domain knowledge, we knew ‘Smart Ink’ was a popular Win32 application used for digitizing the markerboard heavily used by teacher devices but not on student devices. Using that as a seed application, we identified a cohort of US K-12 devices that had opted in for Full diagnostic data and Tailored experiences. We then categorized devices using ‘Smart Ink’ in one category and the rest of K-12 US devices population in second category. MI ranking helped us identify other such applications that were distinguishing of teacher devices vs student / admin devices.

Example of Binary MI: Distinguishing Teacher vs Student Apps

This helped expand the cohort from just initial seed app we began with to then identify hundreds of applications that have high information gain in distinguishing teachers’ and students’ devices and thus let us building a segment of teacher and student devices. We also gained insight about student devices’ browser apps being locked down by applications like “lockdownbrowser.exe” for purposes of student testing. Instead of MI ranking, if we had just gone with top apps used, the browser apps would come up on top. But they do not have high information gain between teacher and student/admin devices.

4.2. Multi-Class Classification at scale

For Edge browser, website compatibility is a big focus area. We used sampled anonymized webpage visit logs per country from devices opted into collecting full diagnostic data. There are a lot of top websites that get visited by users across countries. To understand country specific compatibility needs, we applied MI ranking to understand websites predominantly visited by users of a specific country but not by others. MI ranking in a single run was able to identify top distinguishing websites for each of the 200+ countries. We pick 3 arbitrary countries from the per country results and showcase top distinguishing websites for them in Table VI.

Example of scalable MI: Distinguishing websites per Country

We found that telecom / ISP websites, national banks, government websites, regional sports, entertainment, news, search engines with geolocation prefixes/suffixes are distinguishing of the visitor’s geography.

4.3. Cohort Validation

A similar application of MI ranking was to validate a cohort of IT professionals and non-IT professionals who provided us their roles by filling our surveys or enrolling in our newsletters and opted into collecting full diagnostic data. Survey data suffers from self-enrolment biases. MI ranking helped identify if there were distinguishing patterns of application usage between the two cohorts of users on Microsoft Apps. We can clearly see distinguishing patterns of use and the labels being of sufficient quality to use for building a classifier in Table VII.

Example of Cohort Validation: IT Pro vs non-IT Pro Survey Validation using Microsoft Apps

4.4. Query Expansion

One of the key functions of marketing is to buy keywords on search engines. Often this is done with some basic analysis of query volumes and click through rates on competitors ads. This creates a bidding war over a small set of fairly generalized keywords. MI allows us to identify even more distinguishing keywords and often those will have a lower bid threshold CPC (Cost per Click). We had a similar need for understanding IoT enthusiasts who query on Bing search engine. Azure offers a suite of IoT services. So, we started with a seed query of “azure iot” and applied MI ranking to find queries distinguishing of visitors who search for “azure iot” in a time span of 28 days.

Example of Query Expansion: Finding Azure IoT enthusiast queries

From just a seed query with thousands of visitors, in just hours of processing and a few days of analysis we were able to derive a breadth of related search queries that are distinguishing of IoT enthusiasts and from there build a target audience that is suitable for ads around our service offering.

4.5. Link Prediction

In the StackOverflow posts example in Section 2.3, we showed that we could link unigrams that are distinguishing of certain tags. In a graph with a set of known linkages, information gain can help with weighted link prediction and establish graph triangles. Graph triangles help in ego-net splitting and overlapping community detection algorithms [8].

Example of link prediction using MI Ranking

4.6. Newsfeed Ranking and Anomaly Detection

For business reporting, we track hundreds of metrics across our products and services across thousands of dimensions. Metrics change month over month and current dashboards hardcode the trends around some key metrics. But those are not necessarily the best ones with high information gain to look at. We developed a new ranked list of metric changes per dimension vector (per segment, country) sorted by information gain. If someone wanted to filter down to a specific set of dimensions, for example focus on highly informative metric changes in commercial segment in US, they could do so. Such a ranked list of metric changes also helps identify anomalous changes that might be caused by errors in data transformation.

Symmetric Mutual Information

Mutual Information gain between two random variables is symmetric i.e.

where X and Y are two random variables.

So, we can reverse MI results from distinguishing feature per partition to distinguishing partitions per feature. This is of interest where we want to find specific segments where a given feature, for example a specific application is used as a distinguishing application.

Example of reversing MI results to find distinguishing partitions per feature

We can see in Table IX that many entities are only distinguishing of a single segment (built for that segment) like Minecraft for Education is used as a distinguishing app in K-12 segment. Some entities are distinguishing of more than 1 segment. The feature or application might be used across all segments but some leverage it significantly more than others. For example, given the collaborative nature of industries like Partner Professional Services, Media, Telecommunications and Technology, Microsoft Teams is distinguishing app within those segments.

MI ranking works best when at least one of the sides of comparison is not too high-dimensional. For example, it’s fine to have billions of web sites compared with thousands of segments but comparing billions of entities on one side with billions of partitions on the other side will not perform well.

N-Fold Mutual Information Ranking

In certain applications, looking at one entity type is not enough. For example, to build a propensity model on which organizations are likely to purchase a service, there’s a need to tap into various customer touch points like Bing Search results, Microsoft marketing documentation visits, and Microsoft service subscriptions. So, we built a methodology to apply 2-fold MI.

Step 1: Start with seed entities on each entity type (doc visits, search queries, services) and expand the list of distinguishing entities.

Step 2: Cluster the distinguishing entities across entity types as a binary segment (positive or negative association to the offering for which we build the propensity model). We can link the entity to segments to now find distinguishing segments that have high positive association to entities that are in turn distinguishing of seed entities we pass in.

This process is illustrated in Fig 2. and the steps can be alternated repeatedly to expand the number of associated entities discovered.

Mutual Information Cut-Off

In the example mentioned in Section 4.2, we plot the MI gain from top distinguishing websites per country. We find that MI gain is disproportionate across partitions. For example, MSN with US suffix is more predictive that the visitor is from US (as the browser has already resolved the page to return using the visitor’s geolocation) compared to a travel website like IRCTC is for visitors from India. Visitors could be booking travel from another country for their connections in India.

MI gain ranked results follow a long tail distribution. There’s a need for MI cut-off after which the gain is not high enough for it to be considered a distinguishing feature. We can do label shuffling [9] and then set a false discovery rate (for example 0.01%) to get the row number after which MI gain is insignificant. Monte Carlo [10] is another common way to set false discovery rate. It’s sometimes more convenient to do label shuffling in our context because we can just re-run the analysis on the cluster.

We recommend a sample size of at least 1,000 Ids per partition to keep a low false discovery rate.

Conclusion

We demonstrated a scalable implementation of Mutual Information Ranking that can be ported in any cluster computing framework. We then applied the implementation in a diverse range of problems and demonstrated sample results. We extended Mutual Information ranking to more than binary partitions and beyond single step. As future work, this can be leveraged as an operationalized library for use in commercial applications and on large-scale graphs.

Leveraging such a ranking algorithm at scale has potential for abuse. We have implemented data protection guidance and policies in place to help prevent that in compliance with GDPR [11] and other regulations. These include limited retention and expungement, delete processors, aggregation to prevent re-identification. Each use case scenario where MI ranking is leverage is reviewed by privacy team to ensure use is within policy. Such data protection measures and policies are imperative to prevent abuse and ensure compliance.

Acknowledgment

We would like to thank Core Data Engineering team in Azure COSINE who helped us setup the Azure Databricks environment. We work with a lot of partner teams within Microsoft and their collaboration have helped us immensely in developing and applying this methodology on a wide range of business problems.

References

[1] Cover, T.M.; Thomas, J.A. (1991). Elements of Information Theory (Wiley ed.). ISBN 978–0–471–24195–9.

[2] Kim, W. C., & Mauborgne, R. (2005). Blue ocean strategy: how to create uncontested market space and make the competition irrelevant. Boston, Mass, Harvard Business School Press.

[3] Pai D., Sharang A., Yadagiri M.M., Agrawal S. (2014) Modelling Visit Similarity Using Click-Stream Data: A Supervised Approach. In: Benatallah B., Bestavros A., Manolopoulos Y., Vakali A., Zhang Y. (eds) Web Information Systems Engineering — WISE 2014. WISE 2014. Lecture Notes in Computer Science, vol 8786. Springer, Cham

[4] Yu D., Houg A. (2014) Facebook Analytics, Advertising, and Marketing. In: Facebook Nation. Springer, New York, NY

[5] Ruohonen, Anna-Aleksandra. (2018) “Planning and running an internet marketing campaign: case SockAdore. com.”

[6] Baker, S. and Baker, K., (1995). Desktop Direct Marketing: How to Use Up-to-the-minute Technologies to Find and Reach New Customers. New York, NY: McGraw-Hill.

[7] Arnab Sinha, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June (Paul) Hsu, and Kuansan Wang. 2015. An Overview of Microsoft Academic Service (MA) and Applications. In Proceedings of the 24th International Conference on World Wide Web (WWW ’15 Companion). ACM, New York, NY, USA, 243–246. DOI=http://dx.doi.org/10.1145/2740908.2742839

[8] Epasto, Alessandro, Silvio Lattanzi, and Renato Paes Leme. “Ego-Splitting Framework: from Non-Overlapping to Overlapping Clusters.” Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2017.

[9] Kim, Seoung Bum, et al. “Controlling the false discovery rate for feature selection in high‐resolution NMR Spectra.” Statistical Analysis and Data Mining: The ASA Data Science Journal 1.2 (2008): 57–66.

[10] Lin, D. Y. “An efficient Monte Carlo approach to assessing statistical significance in genomic studies.” Bioinformatics 21.6 (2004): 781–787.

[11] Voigt, Paul, and Axel Von dem Bussche. “The EU General Data Protection Regulation (GDPR).” A Practical Guide, 1st Ed., Cham: Springer International Publishing (2017).