An Introduction to Non-IID Data and the Challenges of Data Heterogeneity

5 min read · Nov 5, 2024

Quick Recap:

In the last article, we discussed the issue of data silos and concluded that migrating all data generated from different sources to a central server for machine learning training is often impractical or restricted due to privacy concerns or bandwidth constraints.

Note: This article is the second installment in the Unlocking Data for AI series, following Data Silos: The Hidden Barriers to AI Innovation.

In this article, we will start with an imaginary scenario: what if we could train models directly on these sources instead of collecting all the data on a central server? Then, somehow, we could combine these models to create a global model that incorporates the intelligence of all models trained on different sources. Sounds interesting?

Caution: If this rings a bell and makes you think of distributed data-parallel training techniques for deep learning models [1], hold that thought. Keep following the series — we’re onto something different ;)

The Challenge of Data Heterogeneity

To build such on-device model training, or what I should rather call on-source-of-data model training, we face a significant challenge: differences in the statistical distribution of data generated by different sources, even when the purpose of data collection is the same.

For example, consider traffic sensors collecting data to predict traffic flow and congestion. Sensors placed in urban areas will observe different traffic patterns than sensors in suburban areas, due to differences in population density and commuting behaviour. The result is different data distributions across sensors — that is, heterogeneity.

Heterogeneity in training data can lead to several issues, such as model convergence problems, sampling bias, poor adaptability, and reduced robustness.

This article dives deeper into the various types of data heterogeneity that arise when data is generated from different sources or, in some cases, even from the same source.

What is IID?

I want to introduce you to the term “IID”, which stands for Independent and Identically Distributed. Serving a definition hot from Wikipedia [2] below:

- Identically distributed means that there are no overall trends — the distribution does not fluctuate, and all items in the sample are taken from the same probability distribution.
- Independent means that the sample items are all independent events. In other words, they are not connected to each other in any way; knowledge of the value of one variable gives no information about the value of the other, and vice versa.

Think of it this way: if you roll a fair six-sided die multiple times, the results are IID because each roll is independent (the outcome of one roll doesn’t affect another) and all rolls come from the same distribution (the probability of rolling each number is the same).
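To make both properties concrete, here is a minimal sketch in Python (standard library only — everything in it is illustrative):

```python
import random
from collections import Counter

# Roll a fair six-sided die many times.
rolls = [random.randint(1, 6) for _ in range(10_000)]

# Identically distributed: every face shows up with roughly equal frequency.
print(Counter(rolls))

# Independent: knowing one roll tells us nothing about the next. Empirically,
# the rolls that follow a 6 are still (roughly) uniformly distributed.
after_six = [nxt for prev, nxt in zip(rolls, rolls[1:]) if prev == 6]
print(Counter(after_six))
```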

Understanding Non-IID Data

So now, as we are clear on what IID is, it should be pretty obvious that a distribution not following these properties would be considered Non-IID. Non-IID will be the center of the following discussion, and you can consider it a synonym for heterogeneous data for the rest of the article.

Non-IID Data Generated from the Same Source

Let’s begin with a slightly unintuitive idea. It is easy to see that data generated from different sources can be Non-IID, but how can data generated from the same source be Non-IID? A good example is a recommendation system: our preferences for particular brands, types of items, spending ranges, and so on change over time, and these shifting user preferences can render the data Non-IID.

This is also called temporal correlation: the data points are time-dependent, meaning their distribution changes over time. This is more common than you might think. In cyber-physical systems, it’s not just the external environment that changes; sometimes changes in the sensors themselves, or major software updates, also alter the distribution of the data a source collects over time.
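Here is a toy simulation of this kind of drift (the function and numbers are made up for illustration): a single “sensor” whose readings slowly shift over time, so samples taken in different periods are not identically distributed.

```python
import random

def sensor_reading(t: int) -> float:
    """One reading at time step t; the mean drifts slowly upward over time."""
    drifting_mean = 0.01 * t  # the environment (or the sensor) slowly changes
    return random.gauss(drifting_mean, 1.0)

early = [sensor_reading(t) for t in range(1_000)]
late = [sensor_reading(t) for t in range(9_000, 10_000)]

print(sum(early) / len(early))  # mean near 5
print(sum(late) / len(late))    # mean near 95: same source, shifted distribution
```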

Another type of correlation is spatial correlation: people in nearby geographical regions might have similar preferences. This also introduces dependency between data points, so the distribution is no longer IID.

In general, Non-IID data often arises due to either temporal dependencies or spatial correlations between data points.

Non-IID Data from Different Sources

When data is generated from different sources, we can broadly categorize the reasons for Non-IID data into the following three categories. I really liked the examples given by M. F. Criado et al. [4] for these categories, so I am going to use the same ones here.

1. Different Input Spaces

This could be the case, for example, with participants collecting data for training an autonomous car. Some users may drive on the left, while others drive on the right, and they will encounter different circumstances. This results in a skewed input space for the different participants. However, they gather data with one common objective and are expected to act similarly.

2. Different Outputs for Analogous Inputs

This scenario occurs when the input spaces perceived by the clients are analogous, but their outputs are not. A real-world example related to training an autonomous car is the case of a yellow traffic light. When encountering a yellow traffic light, the correct action for some participants might be to stop the car, while for others it might be to continue driving without any change. This causes incompatibilities among the clients.

3. Combination of Different Input Spaces and Different Outputs for Analogous Inputs

This is a combination of the two previous situations: participants want to learn a common task, such as driving, but their input spaces are significantly unequal, and their reactions to some of the inputs differ as well.
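As a side note, research papers often simulate exactly this kind of cross-source heterogeneity by partitioning a dataset with skewed label distributions per source, commonly drawn from a Dirichlet distribution (a standard trick in the literature, not something specific to [4]). A hedged NumPy sketch, where all names and numbers are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
num_sources, num_classes, alpha = 4, 10, 0.1  # small alpha -> heavy skew

# One categorical label distribution per source, drawn from a Dirichlet.
label_dist = rng.dirichlet([alpha] * num_classes, size=num_sources)

labels = rng.integers(0, num_classes, size=5_000)  # a toy labeled dataset

# Assign each example to a source in proportion to how much
# that source "likes" the example's label.
source_of = np.array([
    rng.choice(num_sources, p=label_dist[:, y] / label_dist[:, y].sum())
    for y in labels
])

for s in range(num_sources):
    counts = np.bincount(labels[source_of == s], minlength=num_classes)
    print(f"source {s}: label counts = {counts}")  # very uneven per source
```

With a small alpha each source ends up dominated by a few classes; with a large alpha the sources look nearly IID.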

This is how data heterogeneity, or Non-IID data, originates.

I began this article by presenting an imaginary scenario:

“what if we could train models directly on these sources instead of collecting all the data on a central server? Then, somehow, we could combine these models to create a global model that incorporates the intelligence of all models trained on different sources.”

Guess what: it’s not imaginary! The Google keyboard you use employs similar methods to predict the next word you might type. However, in these techniques, data heterogeneity remains a significant challenge. Interestingly, some people have found ways to leverage this heterogeneity to provide personalized experiences for users. So yes, there are a lot of such interesting things.
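To preview the core idea (this is a toy illustration, not how Gboard actually works), here is a sketch of training a simple model per source and averaging the parameters into a global model, in the spirit of federated averaging:

```python
import numpy as np

rng = np.random.default_rng(1)

def local_fit(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Least-squares linear model trained only on one source's local data."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Two sources observing the same underlying relationship y = 2x + noise,
# but over different input ranges (a mild form of input-space skew).
X1 = rng.uniform(0, 1, size=(100, 1))
y1 = 2 * X1[:, 0] + rng.normal(0, 0.1, 100)
X2 = rng.uniform(5, 6, size=(100, 1))
y2 = 2 * X2[:, 0] + rng.normal(0, 0.1, 100)

# Combine the local models by simply averaging their parameters.
w_global = np.mean([local_fit(X1, y1), local_fit(X2, y2)], axis=0)
print(w_global)  # close to [2.0], since here the sources happen to agree
```

When the sources do not agree — as in the Non-IID cases above — this naive averaging is exactly where things get hard, which is what the rest of the series explores.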

This series is all about discussing such techniques. So stay tuned if you also want to unlock data for your AI use cases!

References:
[1] https://www.run.ai/guides/gpu-deep-learning/distributed-training
[2] https://en.wikipedia.org/wiki/Independent_and_identically_distributed_random_variables
[3] https://arxiv.org/html/2401.00809v1#:~:text=Non%2DIID%20data%20implies%20that,same%20underlying%20distribution%20breaks%20down
[4] https://arxiv.org/abs/2111.13394
