Chebyshev’s Inequality

Amrita Aash
Published in Analytics Vidhya
Oct 14, 2020

In data science and data analytics, questions of the following type are common and important:

  • What percentage of students in a given school have heights within a given interval (say, [135 cm, 200 cm])?
  • What percentage of people have their salaries within a given range?
  • What percentage of children have weights in a given interval? And so on; the list is endless.

These questions become easy to answer if we know that the random variable under consideration follows a normal distribution with a given mean and standard deviation. We can then apply the 68–95–99.7 rule to get the answers.

The rule states that about 68% of data points lie within 1 std-dev of the mean ([mean-1*std-dev, mean+1*std-dev]), about 95% lie within 2 std-dev of the mean ([mean-2*std-dev, mean+2*std-dev]) and about 99.7% lie within 3 std-dev of the mean ([mean-3*std-dev, mean+3*std-dev]).
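The normal-distribution percentages can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the function name `normal_coverage` is a hypothetical helper introduced for illustration, and it relies on the standard identity that a normal variable falls within k standard deviations of its mean with probability erf(k/√2).

```python
import math


def normal_coverage(k: float) -> float:
    """Probability that a normal variable lies within k std-dev of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"within {k} std-dev: {normal_coverage(k):.2%}")
# within 1 std-dev: 68.27%
# within 2 std-dev: 95.45%
# within 3 std-dev: 99.73%
```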

The problem arises when we don't know the underlying distribution of the random variable we are dealing with. This is when Chebyshev’s Inequality comes to our rescue!

It requires two conditions to be met:

  • the mean of the concerned random variable should be finite and
  • its standard deviation must be finite and non-zero.

Chebyshev’s Inequality does not give us the exact percentage of data lying within a particular range; rather, it gives a lower bound, that is, a guaranteed minimum value for that percentage.

“In probability theory, Chebyshev’s inequality (also called the Bienaymé–Chebyshev inequality) guarantees that, for a wide class of probability distributions, no more than a certain fraction of values can be more than a certain distance from the mean.” - Wikipedia

Probabilistic statement

Let X (integrable) be a random variable with finite expected value μ and finite non-zero variance σ². Then for any real number k > 0,

Pr(|X - μ| ≥ kσ) ≤ 1/k²

(Chebyshev’s inequality)

Let’s dive into the formula a bit more!

To do so, we will consider an example:

Let X be a random variable representing the salaries of individuals in a country, with a mean of $40,000 and a standard deviation of $10,000. We have the following two questions to answer:

Q1) What percentage of people have their salaries within the range [$20,000, $60,000]? In other words, what percentage of people have their salaries within 2 std-dev of the mean?

Q2) What percentage of people have their salaries within the range [$10,000, $70,000]? In other words, what percentage of people have their salaries within 3 std-dev of the mean?

If X had followed a Gaussian distribution, we could have answered the above questions directly, without any further calculation:

Answers: Q1) about 95% and Q2) about 99.7%

But now let’s try to solve the above questions with the formula of Chebyshev’s inequality.

At first glance, it is a bit difficult to read a meaning or a solution out of the formula. Let’s take it one part at a time.

The part |X - μ| ≥ kσ can also be written as:

X ≥ μ+kσ or X ≤ μ-kσ

Now we can read the whole formula as follows:

Pr(X ≥ μ+kσ or X ≤ μ-kσ) ≤ 1/k²

That is, the probability of finding a value which is greater than or equal to μ+kσ or less than or equal to μ-kσ is at most 1/k².

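Since X either lies outside the interval or strictly inside it, the two probabilities add up to 1. Taking the complement, the above formula can be rewritten as:

Pr(μ-kσ < X < μ+kσ) ≥ 1 - (1/k²)

which can be read as: the probability that the random variable X lies within k std-dev of the mean is at least 1 - (1/k²).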
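The bound can also be checked empirically on data that is clearly not normal. The sketch below is illustrative only: it draws a sample from an exponential distribution (an arbitrary choice of skewed distribution, not something from the example above), and the helper name `fraction_within` is hypothetical. It compares the observed fraction of points within k std-dev of the sample mean against 1 - 1/k².

```python
import math
import random


def fraction_within(data, k):
    """Fraction of points strictly within k population std-dev of the mean."""
    n = len(data)
    mu = sum(data) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / n)
    return sum(1 for x in data if abs(x - mu) < k * sigma) / n


random.seed(0)  # fixed seed so the run is repeatable
# Assumption: an exponential sample stands in for "unknown, non-normal" data.
sample = [random.expovariate(1.0) for _ in range(20_000)]

for k in (1.5, 2, 3):
    observed = fraction_within(sample, k)
    bound = 1 - 1 / k**2
    print(f"k={k}: observed {observed:.3f} >= Chebyshev bound {bound:.3f}")
```

The observed fractions are typically well above the bound. That is expected: Chebyshev’s inequality has to hold for every distribution with finite mean and variance, so it is conservative for any particular one.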

Going back to the two questions, we can now answer them easily using the above formula:

Answers:

Q1) Here k=2, since the range covers 2 std-dev on either side of the mean. Putting k=2 in the above formula, we get 1 - 1/2² = 3/4 (75%). So we can write:

At least 75% of people’s salaries lie between $20,000 and $60,000.

Q2) Here k=3, since the range covers 3 std-dev on either side of the mean. Putting k=3 in the above formula, we get 1 - 1/3² = 8/9 (about 88.9%). So we can write:

At least 88.9% (roughly 89%) of people’s salaries lie between $10,000 and $70,000.
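The arithmetic for both questions can be reproduced in Python. This is a minimal sketch; the function name `chebyshev_lower_bound` and its parameters are hypothetical, and it assumes the interval is symmetric around the mean, as in the salary example.

```python
import math


def chebyshev_lower_bound(mean, std, low, high):
    """Minimum fraction of values guaranteed to lie between low and high.

    Assumes the interval is symmetric around the mean, i.e. [mean - k*std, mean + k*std].
    """
    if not math.isclose(mean - low, high - mean):
        raise ValueError("interval must be symmetric around the mean")
    k = (high - mean) / std
    if k <= 1:
        return 0.0  # for k <= 1 the bound 1 - 1/k^2 is not informative
    return 1 - 1 / k**2


q1 = chebyshev_lower_bound(40_000, 10_000, 20_000, 60_000)
q2 = chebyshev_lower_bound(40_000, 10_000, 10_000, 70_000)
print(f"Q1: at least {q1:.1%}")  # Q1: at least 75.0%
print(f"Q2: at least {q2:.1%}")  # Q2: at least 88.9%
```

Note that for k ≤ 1 the formula gives zero or a negative number, so the inequality only says something useful for intervals wider than one standard deviation on each side.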

Conclusion

Thus Chebyshev’s inequality helps us answer important data analytics questions when the underlying distribution is unknown: it gives a minimum percentage of points that lie within k standard deviations of the mean.
