Extending the mode into the k coverage

Uri Itai
8 min read · Jun 26, 2023


Constructing metrics is fundamental to how data scientists extract valuable insights from data. The choice of metric largely depends on the nature of the data being analyzed. Numeric data offers a variety of metrics to consider, such as the mean, median, and standard deviation. Categorical data, by contrast, has fewer metrics at hand, with the mode being the most commonly used. Nevertheless, categorical features contain additional information beyond the mode.

Categorical values

Categorical values hold immense significance within the realm of data science as they represent distinct categories or labels. Categorical data can be classified into two types: nominal and ordinal. Nominal data refers to categories without any inherent order or ranking, while ordinal data involves categories with a specific order or ranking.

To illustrate this concept, let’s consider a dataset that contains customer information for an e-commerce site. Within the categorical variables, we can identify:

  1. Product category: This variable demonstrates nominal categorical data as the categories lack any inherent order or ranking.
  2. Customer satisfaction level: This variable serves as an example of ordinal categorical data since the categories exhibit a specific order or ranking (e.g., “Very satisfied,” “Somewhat satisfied,” “Neutral,” “Somewhat dissatisfied,” “Very dissatisfied”).

When working on data science projects, it is crucial to appropriately handle categorical data, especially when constructing models or analyzing the data. Due to their distinct characteristics and properties, different methods may be required to handle nominal versus ordinal data.
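As a minimal sketch of that distinction (the column values here are invented purely for illustration), pandas lets you encode nominal data as a plain categorical and ordinal data as an ordered categorical:

import pandas as pd

# Invented illustrative customer data
df = pd.DataFrame({
    "product_category": ["electronics", "books", "toys", "books"],
    "satisfaction": ["Neutral", "Very satisfied", "Somewhat dissatisfied", "Very satisfied"],
})

# Nominal: a plain categorical, no order implied
df["product_category"] = df["product_category"].astype("category")

# Ordinal: an ordered categorical, so comparisons and sorting respect the ranking
levels = ["Very dissatisfied", "Somewhat dissatisfied", "Neutral",
          "Somewhat satisfied", "Very satisfied"]
df["satisfaction"] = pd.Categorical(df["satisfaction"], categories=levels, ordered=True)

print(df["satisfaction"].min())  # order-aware: "Somewhat dissatisfied"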

The mode is the most commonly used metric for analyzing categorical values. It represents the category that occurs most frequently within a dataset, offering insights into the predominant category or label.

By comprehending the nature of categorical data and employing suitable techniques, data scientists can effectively harness this valuable information to gain deeper insights and make informed decisions.

The mode

The mode, as a statistical measure, represents the value or values that occur most frequently in a dataset. It identifies the most common value(s) within a given set of data.

To illustrate this concept, let’s consider a couple of examples.

In the set S = {dog, dog, cat, dog, horse, dog}, the mode is “dog” since it appears most frequently. In this case, the mode provides a reasonable description of the data. However, in the set L = {dog, mice, cat, dog, horse, dog, cat, man, monkey, man, elephant, donkey, seal}, the mode is still “dog.” While it accurately represents the most common value, it fails to provide a comprehensive description of set L.
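A quick sketch in Python makes this concrete: the mode is the same for both sets, but the share of the data it accounts for drops sharply.

import pandas as pd

S = pd.Series(["dog", "dog", "cat", "dog", "horse", "dog"])
L = pd.Series(["dog", "mice", "cat", "dog", "horse", "dog", "cat",
               "man", "monkey", "man", "elephant", "donkey", "seal"])

for name, s in [("S", S), ("L", L)]:
    mode = s.mode()[0]
    share = s.value_counts(normalize=True).iloc[0]  # fraction of the set taken by the mode
    print(f"{name}: mode = {mode}, covers {share:.0%} of the set")
# the mode "dog" covers about two thirds of S but under a quarter of L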

To further emphasize this point, let’s consider a real-life scenario. Imagine examining the feature “language spoken” in different countries. For the Netherlands, the mode would be “Dutch.” However, relying solely on the mode to understand the language distribution in the neighboring country, Belgium, would lead to a poor understanding. While Dutch (Flemish) is indeed the mode, it accounts for a little less than 60% of the population; approximately 40% of the population speaks French. Therefore, in this case, the mode fails to provide an accurate description of the language distribution.

Similar challenges can arise when analyzing data for countries like the USA and Canada. Such issues can be particularly problematic when dealing with “long tail” data, where there is a significant presence of rare or less frequent categories. Relying solely on the mode for analysis in these cases may lead to misunderstandings or oversimplifications.

It is important to be aware of these limitations and consider a more comprehensive approach when analyzing data, especially in situations where the mode may not provide an accurate representation of the overall distribution.

Long tail

The long-tail distribution phenomenon in categorical data refers to a pattern where a small set of categories appears frequently, while a larger number of categories occur infrequently. This distribution is characterized by a high occurrence of a few categories (the “head”) and a low occurrence of numerous other categories (the “tail”).

This pattern commonly emerges in datasets with a large number of possible categories, such as user ratings, product reviews, or website traffic data. In such datasets, a handful of popular categories tend to dominate the majority of occurrences, while the remaining categories exhibit only a few instances.

To illustrate, consider a dataset of product ratings on an e-commerce platform. The most popular products might accumulate hundreds or even thousands of ratings, whereas less popular products might only receive a few ratings. Consequently, a long-tail distribution arises, where a small number of highly popular products account for the majority of ratings, while numerous less popular products have only a handful of ratings.
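To make the shape concrete, here is a small synthetic sketch (the product count and the 1/rank popularity law are assumptions for illustration) that draws rating events from a long-tailed distribution and compares the head to the tail:

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_products = 1_000
ranks = np.arange(1, n_products + 1)
probs = (1 / ranks) / (1 / ranks).sum()              # popularity decays with rank (Zipf-like)
ratings = rng.choice(ranks, size=100_000, p=probs)   # each draw is one rating event

shares = pd.Series(ratings).value_counts(normalize=True)
print(f"top 10 products receive {shares.iloc[:10].sum():.0%} of all ratings")
print(f"least popular half receives {shares.iloc[n_products // 2:].sum():.0%}")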

Recognizing and comprehending the long-tail distribution in categorical data can offer valuable insights for businesses and researchers alike. It aids in identifying the most popular categories or products, as well as highlighting less popular categories or products that may require additional attention or promotional efforts. Furthermore, this understanding informs decision-making processes related to resource allocation, marketing strategies, and product development, allowing for more effective and targeted actions.

Long tail examples

We can see long-tail distributions in the following examples:

  • YouTube — a small number of content creators gets most of the views.
  • Football fans by club — the top ten clubs attract most of the fans worldwide.
  • Academic citations — a small number of papers receives most of the citations, while most receive very few.
  • Movies — a few movies draw most of the audience.
  • Search keywords — most searches use a limited number of keywords.

The long tail can often be modeled with a power-law distribution; this will be discussed in a future post.

Therefore, it is important to determine whether the data has a long tail.

Coverage

Before delving into the concept of coverage, let’s revisit the mode. As we know, the mode represents the most frequently occurring value in a feature. It is natural to ask what percentage of the data corresponds to the mode. As the language examples for Belgium and the Netherlands showed, this percentage tells us how much confidence we can place in the mode. Building on this idea, we can ask what happens if we consider more than one value: the two most common values, or the top n most common values. To formalize this, we introduce the concept of coverage.

Definition: k coverage — when we consider the k most common values, the percentage of data that corresponds to these values is referred to as the k coverage.

Definition: p coverage — the minimum number of unique values that must be selected for their coverage to reach a given threshold p.
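As a minimal sketch of these two definitions (the function names k_coverage and p_coverage are my own), both quantities fall out of the normalized value counts and their cumulative sum:

import pandas as pd

def k_coverage(series: pd.Series, k: int) -> float:
    """Fraction of the data accounted for by the k most common values."""
    return series.value_counts(normalize=True).iloc[:k].sum()

def p_coverage(series: pd.Series, p: float) -> int:
    """Minimum number of unique values whose combined share reaches the threshold p."""
    cum = series.value_counts(normalize=True).cumsum()
    return int((cum >= p).values.argmax()) + 1  # first position where the cumulative share hits p

jobs = pd.Series(["admin", "admin", "technician", "blue-collar", "admin", "services"])
print(k_coverage(jobs, 1))    # share of the mode alone
print(p_coverage(jobs, 0.8))  # how many values are needed to cover 80% of the rows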

This concept is reminiscent of the decision involved in choosing the number of components in principal component analysis (PCA), where the goal is to determine the minimum number of components necessary to capture a desired level of variance.

By introducing the concepts of k coverage and p coverage, we can explore the extent to which the most common values capture the data and the minimum number of values required to achieve a certain level of coverage. This approach provides us with a framework similar to that of PCA, enabling us to make informed choices and gain insights from categorical data.

For the bank marketing data set, we choose the feature job.

The code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

path_data = r"somepathfile.csv"  # path to the bank marketing data set
data = pd.read_csv(path_data)

# Normalized value counts of the job feature, from most to least frequent
job_df = data['job'].value_counts(normalize=True).to_frame(name='job')
job_df['cover_value'] = job_df['job'].cumsum()      # k coverage for k = 1, 2, ...
job_df['place'] = list(range(1, len(job_df) + 1))   # the value of k itself

job_df['cover_value'].plot()
plt.xlabel('number of values')
plt.ylabel('coverage percentage')
plt.title('values vs coverage')
plt.show()

The plot of the k coverage is:

The p coverage, in turn, can be computed as follows:

thresholds = np.linspace(0, 1, 11)
# a small tolerance keeps the 100% threshold reachable despite floating-point rounding
coverage_ind = [job_df[job_df.cover_value >= j - 1e-9].place.min() for j in thresholds]

plt.plot(thresholds, coverage_ind)
plt.ylabel('number of values')
plt.xlabel('coverage percentage')
plt.title('coverage vs values')
plt.show()

number of values vs coverage

Plotting these graphs is straightforward. However, we still need a method for choosing k and p.

It is crucial to observe that this graph is strictly monotonically increasing with negative curvature, which means that selecting additional values yields diminishing returns. As in other comparable settings, our focus lies in determining the minimum number of values that offers a meaningful representation. The elbow method accomplishes this objective.

Elbow method

Similar to selecting the number of clusters in the K-means algorithm or the number of components in Principal Component Analysis (PCA), determining the appropriate number of unique values for analysis depends on the specific application. However, there is a technique called the elbow method, also known as the “knee of the curve,” which can be helpful in the initial analysis.

The elbow method involves identifying the point on the cumulative curve where it bends most sharply, i.e., where the marginal gain from adding another value drops off. In practice, one can look for the most negative value of the second difference of the coverage curve, or of the second difference of the value counts (its third derivative). These techniques help in determining a suitable number of unique values by examining how quickly the curve flattens. It is analogous to determining the optimal number of clusters in K-means clustering.

It’s important to note that the specific approach chosen may vary depending on the dataset and analysis goals. However, these methods provide a starting point for initial exploration and offer a quantitative basis for decision-making in categorical data analysis.

In Python, you can apply the elbow method by inspecting the second difference of the value counts with the following code: np.diff(np.diff(job_df['job']))

The most negative entry of this array marks the bend (the elbow) of the coverage curve. Further details about this method will be explored in a future post.
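As a small sketch of that heuristic (continuing with the job_df built above, and choosing the elbow as the point where the cumulative coverage curve bends most sharply — one possible rule rather than the only one):

import numpy as np

coverage = job_df['cover_value'].to_numpy()
second_diff = np.diff(coverage, n=2)       # discrete second derivative of the coverage curve
elbow_k = int(np.argmin(second_diff)) + 2  # +2 compensates for the two points lost to differencing
print(f"elbow at k = {elbow_k}, covering {coverage[elbow_k - 1]:.0%} of the rows")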

Summing it up

This blog post delves into an expanded concept of the mode, with the aim of enhancing our understanding of categorical data. By presenting a practical example along with accompanying code, we demonstrate a method that surpasses a simple diversity index. This method empowers us to extract interesting values from categorical features and gain valuable insights.

However, our exploration doesn’t conclude here. In an upcoming blog post, we will delve into a comparison between the coverage method and other approaches such as entropy and Gini. Through examining these different methods, our objective is to provide a comprehensive analysis of their effectiveness in handling categorical data and extracting meaningful information.

We invite you to stay tuned for our forthcoming post, where we will delve deeper into these methodologies and evaluate their capabilities, expanding our understanding of categorical data analysis.
