Probability Distribution, Part-3
Topics Covered
- Bernoulli Distribution
- Beta Distribution
- T-distribution or Fat-tailed distribution
Bernoulli Distribution
The Bernoulli distribution is a discrete probability distribution that takes a binary outcome, typically 1 for success and 0 for failure. It can be used to model the probability of success or failure in a single experiment or trial.
Here is example code that plots a Bernoulli distribution with a success probability of 0.6:
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import bernoulli
# Define the success probability p
p = 0.6
# Create a Bernoulli distribution object
dist = bernoulli(p)
# Generate some random samples
samples = dist.rvs(size=1000)
# Calculate the probability mass function for all possible outcomes
x = np.arange(2)
pmf = dist.pmf(x)
# Plot the probability mass function
fig, ax = plt.subplots()
# The use_line_collection argument was deprecated and later removed from Matplotlib, so it is omitted here
ax.stem(x, pmf)
ax.set_xlabel('Outcome')
ax.set_ylabel('Probability')
ax.set_title('Bernoulli Distribution (p=0.6)')
plt.show()
Insights:
The Bernoulli distribution is a discrete probability distribution with only two possible outcomes (usually 1 for success and 0 for failure).
The success probability parameter p determines the shape of the distribution. In this example, p is set to 0.6, which means that the probability of success is higher than the probability of failure.
The probability mass function (PMF) of the Bernoulli distribution takes the value p for the success outcome and 1-p for the failure outcome, and is zero everywhere else. In this example, the PMF shows that the probability of success is exactly 0.6 and the probability of failure is exactly 0.4.
The stem plot is a common way to visualize discrete probability distributions, as it shows the probability mass function for all possible outcomes.
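As a quick sanity check on the insights above, the sketch below compares the theoretical mean and variance of a Bernoulli variable (p and p(1-p)) with the values estimated from random samples. The seed and the sample size of 10,000 are arbitrary choices made for this illustration.

```python
import numpy as np
from scipy.stats import bernoulli

p = 0.6
rng = np.random.default_rng(0)  # fixed seed so the run is repeatable (arbitrary choice)
samples = bernoulli.rvs(p, size=10_000, random_state=rng)

# For a Bernoulli(p) variable: mean = p and variance = p * (1 - p)
theoretical_mean = bernoulli.mean(p)
theoretical_var = bernoulli.var(p)

# Estimates from the simulated 0/1 outcomes
empirical_mean = samples.mean()
empirical_var = samples.var()

print(f"mean:     theoretical={theoretical_mean:.3f}, empirical={empirical_mean:.3f}")
print(f"variance: theoretical={theoretical_var:.3f}, empirical={empirical_var:.3f}")
```

The empirical values should land close to the theoretical ones, and the gap shrinks as the sample size grows.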
Beta Distribution
The Beta distribution is a probability distribution that is commonly used to model probabilities or proportions of events that occur within a bounded interval, typically between 0 and 1.
It is often used when dealing with probabilities, rates, proportions, or percentages.
Some situations where the Beta distribution can be used include:
Modeling the probability of success or failure in a binary event, such as the probability of a customer buying a product.
Analyzing the proportion of a population that has a certain characteristic, such as the proportion of a population that is left-handed.
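Serving as a prior distribution for the parameters of other probability distributions, such as the success probability of a Bernoulli or binomial distribution. (The Dirichlet distribution is the multivariate generalization of the Beta distribution.)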
In general, the Beta distribution is useful when dealing with probabilities or proportions that have a known upper and lower bound. It is also useful when there is prior knowledge or information about the distribution of the data being analyzed.
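One way to see how prior knowledge enters is a Bayesian update of a Beta prior with binary observations. The sketch below is only an illustration: the Beta(2, 5) prior and the list of purchase outcomes are hypothetical values, not data from this article. It relies on the standard result that a Beta prior combined with Bernoulli observations gives a Beta posterior.

```python
from scipy.stats import beta

# Hypothetical prior belief about a purchase probability: Beta(2, 5)
prior_a, prior_b = 2, 5

# Hypothetical observations: 1 = customer bought, 0 = customer did not buy
observations = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0]
successes = sum(observations)
failures = len(observations) - successes

# Conjugate update: add the successes to a and the failures to b
post_a = prior_a + successes
post_b = prior_b + failures

prior_mean = beta.mean(prior_a, prior_b)
post_mean = beta.mean(post_a, post_b)
low, high = beta.interval(0.95, post_a, post_b)  # central 95% interval of the posterior

print(f"prior mean     = {prior_mean:.3f}")
print(f"posterior      = Beta({post_a}, {post_b}), mean = {post_mean:.3f}")
print(f"95% interval   = ({low:.3f}, {high:.3f})")
```

The posterior mean sits between the prior mean and the observed success rate, and the interval narrows as more observations are added.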
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta
# Define parameters
a = 2
b = 5
# Generate random samples from beta distribution
samples = beta.rvs(a, b, size=10000)
# Plot histogram of samples
plt.hist(samples, bins=50, density=True, alpha=0.7)
# Plot PDF of beta distribution
x = np.linspace(0, 1, 100)
plt.plot(x, beta.pdf(x, a, b), 'r-', lw=5, alpha=0.6)
# Add titles and labels
plt.title('Beta Distribution')
plt.xlabel('Value')
plt.ylabel('Density')
plt.show()
The beta distribution is a continuous probability distribution with a range of [0, 1]. It is commonly used in Bayesian analysis to model probabilities. The distribution has two shape parameters, a and b, which control the shape of the distribution. In the code example above, we set a=2 and b=5 to generate the beta distribution. The resulting histogram and PDF plot show the shape of the distribution, with a peak at around 0.2 and a long tail towards 1. (For a > 1 and b > 1, the mode of a Beta(a, b) distribution is (a - 1)/(a + b - 2), which is 0.2 here.)
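The location of the peak can be checked numerically. The sketch below computes the mean and mode of Beta(2, 5) from their closed-form expressions and compares the mode with the maximum of the PDF on a fine grid. The grid resolution is an arbitrary choice for this illustration.

```python
import numpy as np
from scipy.stats import beta

a, b = 2, 5

# Closed-form summaries of a Beta(a, b) distribution
mean = a / (a + b)
mode = (a - 1) / (a + b - 2)  # valid only when a > 1 and b > 1

# Numerical check: locate the maximum of the PDF on a fine grid
x = np.linspace(0, 1, 10_001)
numeric_mode = x[np.argmax(beta.pdf(x, a, b))]

print(f"mean (formula) = {mean:.4f}, mean (scipy) = {beta.mean(a, b):.4f}")
print(f"mode (formula) = {mode:.4f}, mode (grid)  = {numeric_mode:.4f}")
```

Because a < b, the mass is concentrated toward 0 and the mean lies to the right of the mode, which is what a right-skewed shape looks like.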
T-distribution or Fat-tailed distribution
Fat-tailed (heavy-tailed) distributions are used when the data is not normally distributed and extreme values occur more often than the normal distribution predicts, that is, when the tails are heavier than those of the normal distribution. The t-distribution is a heavy-tailed distribution that is used when the sample size is small and the population standard deviation is unknown.
The t-distribution is similar to the normal distribution but has heavier tails, so it assigns more probability to values far from the mean. It is commonly used in hypothesis testing and confidence intervals for a mean when the sample size is small and the population standard deviation must be estimated from the sample. The heavier tails account for the extra uncertainty in that estimate, which leads to wider intervals than the normal distribution would give. As the degrees of freedom increase, the t-distribution approaches the normal distribution.
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import t
# Define degrees of freedom
df = 3
# Generate random samples from t-distribution
samples = t.rvs(df, size=10000)
# Plot histogram of samples
plt.hist(samples, bins=50, density=True, alpha=0.7)
# Plot PDF of t-distribution
x = np.linspace(-5, 5, 100)
plt.plot(x, t.pdf(x, df), 'r-', lw=5, alpha=0.6)
# Add titles and labels
plt.title('T-Distribution')
plt.xlabel('Value')
plt.ylabel('Density')
plt.show()
The t-distribution is a continuous probability distribution with a range of (-∞, ∞). It is similar to the normal distribution but has heavier tails, making it a “fat-tailed” distribution. The t-distribution has a parameter called degrees of freedom (df), which controls the shape of the distribution. In the code example above, we set df=3 to generate the t-distribution. The resulting histogram and PDF plot show the shape of the distribution, with heavier tails than a normal distribution.
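To put a number on "heavier tails", the sketch below compares the two-sided tail probability P(|X| > 3) and the 97.5% critical value of the t-distribution with those of the standard normal distribution. The threshold of 3 and the list of degrees of freedom are arbitrary choices for this illustration.

```python
from scipy.stats import norm, t

threshold = 3.0
dfs = (3, 10, 30, 100)

# Two-sided tail probability and 97.5% critical value for the standard normal
normal_tail = 2 * norm.sf(threshold)
normal_crit = norm.ppf(0.975)

# The same quantities for t-distributions with increasing degrees of freedom
t_tails = {df: 2 * t.sf(threshold, df) for df in dfs}
t_crits = {df: t.ppf(0.975, df) for df in dfs}

for df in dfs:
    print(f"t(df={df:>3}): P(|X| > 3) = {t_tails[df]:.4f}, critical value = {t_crits[df]:.3f}")
print(f"normal    : P(|X| > 3) = {normal_tail:.4f}, critical value = {normal_crit:.3f}")
```

With few degrees of freedom the t-distribution puts far more probability beyond 3 than the normal does, and both the tail probability and the critical value move toward the normal values as df grows.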
Next, we experiment with the New York City Airbnb listings dataset: we check the distribution of the “price” column, estimate the probability of a price being less than $100, and then analyse the “availability_365” column.
#importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, lognorm
import seaborn as sns
from scipy.stats import kstest
# Load the dataset (the CSV file is hosted on GitHub)
df = pd.read_csv("https://raw.githubusercontent.com/Venkata-Bhargavi/ML_Data_cleaning_and_feature_selection_2724793/main/AB_NYC_2019.csv")
# Select the "price" column
price = df["price"]
# Clean the data by removing missing values and outliers
price = price.dropna()
price = price[price.between(0, 500)]
# Plot a histogram of the prices
plt.hist(price, bins=50, density=True)
plt.xlabel("Price ($)")
plt.ylabel("Density")
plt.show()
# Using statistical methods to describe the distribution
mu = price.mean()
median = price.median()
sigma = price.std()
print("Mean: ${:.2f}".format(mu))
print("Median: ${:.2f}".format(median))
print("Standard deviation: ${:.2f}".format(sigma))
# Fit a normal distribution to the data
params = norm.fit(price)
print("Normal distribution parameters: loc={:.2f}, scale={:.2f}".format(params[0], params[1]))
# Use the fitted distribution to calculate probabilities and quantiles
x = np.linspace(price.min(), price.max(), 1000)
pdf = norm.pdf(x, *params)
cdf = norm.cdf(x, *params)
p_less_than_100 = norm.cdf(100, *params)
q_95_percentile = norm.ppf(0.95, *params)
print("P(price < $100) = {:.2f}".format(p_less_than_100))
print("95th percentile price = ${:.2f}".format(q_95_percentile))
STEPS INVOLVED:
- Import the dataset into a Pandas data frame and select the “price” column.
- Clean the data by removing any missing values and outliers.
- Plot a histogram of the prices to visualize the distribution.
- Use statistical methods (such as mean, median, and standard deviation) to describe the distribution and determine if it follows a particular probability distribution (such as a normal or lognormal distribution).
- Fit a probability distribution to the data using a suitable library (such as SciPy’s stats module).
- Use the fitted distribution to calculate probabilities and quantiles for different price levels.
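The steps above mention checking whether the data follows a normal or a lognormal distribution. One possible way to do that, sketched below, is to fit both with SciPy and compare their Kolmogorov-Smirnov statistics. To keep the example self-contained, it uses synthetic lognormal data as a stand-in for the cleaned price column; the parameters of that synthetic data are assumptions and are not estimated from the real dataset.

```python
import numpy as np
from scipy.stats import kstest, lognorm, norm

rng = np.random.default_rng(42)

# Synthetic stand-in for the cleaned "price" column (NOT the real data)
prices = rng.lognormal(mean=4.7, sigma=0.6, size=5_000)
prices = prices[(prices > 0) & (prices <= 500)]  # lognormal fitting needs strictly positive values

# Fit both candidate distributions
norm_params = norm.fit(prices)              # (loc, scale)
lognorm_params = lognorm.fit(prices, floc=0)  # (shape, loc, scale) with loc fixed at 0

# Smaller KS statistic = fitted CDF is closer to the empirical CDF
ks_norm = kstest(prices, "norm", args=norm_params)
ks_lognorm = kstest(prices, "lognorm", args=lognorm_params)

print(f"KS statistic, normal fit:    {ks_norm.statistic:.4f}")
print(f"KS statistic, lognormal fit: {ks_lognorm.statistic:.4f}")
```

Only the KS statistics are compared here. The p-values are not reliable in this setting because the parameters were estimated from the same data that is being tested. With the real column, `prices` would be replaced by the cleaned price series, after dropping any zero prices.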
Insights
- The output shows that the price variable follows a right-skewed distribution, which is evident from the long tail on the right-hand side of the histogram and from the mean being higher than the median. The median is about $100.
- Under the fitted normal distribution, the probability of a price being below $100 is 0.36 (36%). Because the data is right-skewed, the normal fit is only a rough approximation, so this figure should be read with caution.
- The mean is approximately $132.
Together, the histogram and the summary statistics describe the central tendency and spread of the price variable. The mean being higher than the median confirms that the distribution of prices is positively skewed.
This analysis helps in making data-driven decisions related to pricing strategies, marketing, and other aspects of the business.
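The skew described above can be quantified, and the model-based probability can be compared with the share of prices that actually fall below $100. The sketch below shows one way to do both. It again uses synthetic right-skewed data as a stand-in for the cleaned price column, so the printed numbers are illustrative only and are not results for the real dataset.

```python
import numpy as np
from scipy.stats import norm, skew

rng = np.random.default_rng(7)

# Synthetic right-skewed stand-in for the cleaned "price" column (NOT the real data)
prices = rng.lognormal(mean=4.7, sigma=0.6, size=5_000)
prices = prices[prices <= 500]

# Skewness coefficient: positive values indicate a longer right tail
skewness = skew(prices)

# Probability of a price below $100, computed two ways
empirical_p = np.mean(prices < 100)     # share of observations below 100
mu, sigma = norm.fit(prices)
model_p = norm.cdf(100, mu, sigma)      # probability under the fitted normal

print(f"mean = {prices.mean():.2f}, median = {np.median(prices):.2f}, skewness = {skewness:.2f}")
print(f"P(price < 100): empirical = {empirical_p:.2f}, normal model = {model_p:.2f}")
```

On right-skewed data the two probabilities can differ noticeably, which is a practical reason to check the empirical share before relying on a normal fit.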
Now let's analyse the “availability_365” column, which gives the number of days in a year that a listing is available for booking.
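# Select the column and drop any missing values
# (assigning the result avoids an in-place edit on a slice of the data frame)
availability = df['availability_365'].dropna()
# Plot a KDE plot to visualize the distribution
# fill=True replaces the deprecated shade=True argument
sns.kdeplot(availability, fill=True)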
plt.xlabel('Availability (in days)')
plt.ylabel('Density')
plt.title('KDE plot of availability')
plt.show()
- From the KDE plot, we can observe that the distribution of the “availability_365” feature is right-skewed: most of the density sits at low values, with a long tail extending towards higher availability.
- The peak of the distribution is around 0, indicating that a significant proportion of the listings are not available for booking for most of the year.
- The long tail towards the right suggests that there are also a significant number of listings that are available for a very high number of days in a year.
This information can be useful for potential guests to make informed decisions about their booking, and for hosts to adjust their pricing and availability strategies based on the demand patterns.
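The observations from the KDE plot can be backed up with a few summary numbers. The sketch below computes the share of listings with zero availability, the quartiles, and the skewness. The availability values are a small hypothetical sample made up for this illustration, not values from the dataset.

```python
import pandas as pd

# Hypothetical availability values in days per year (NOT taken from the dataset)
availability = pd.Series([0, 0, 0, 0, 5, 12, 30, 45, 90, 180, 250, 300, 340, 365, 365])

# Share of listings that are never available
share_zero = (availability == 0).mean()

# Quartiles describe where the bulk of the listings sit
quartiles = availability.quantile([0.25, 0.5, 0.75])

# Positive skewness indicates a longer tail towards high availability
skewness = availability.skew()

print(f"share with 0 days available: {share_zero:.2%}")
print(quartiles)
print(f"skewness: {skewness:.2f}")
```

With the real data, `availability` would be the cleaned “availability_365” column. Note also that a KDE smooths past the ends of the data, so the curve can extend below 0 or above 365 even though such values cannot occur; passing `cut=0` to `sns.kdeplot` limits the curve to the observed range.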
Conclusion
In conclusion, probability distributions are a fundamental concept in statistics and data analysis. They help us understand the likelihood of various outcomes and events, and enable us to make predictions and informed decisions based on data. There are various types of probability distributions, each with its own characteristics and use cases.
Some commonly used probability distributions include normal distribution, Poisson distribution, exponential distribution, uniform distribution, and Bernoulli distribution. These distributions can be used to model various types of data and phenomena, such as stock prices, customer arrivals, and product defects.
By analyzing the probability distribution of a dataset, we can gain insights into the central tendency, variability, and skewness of the data. We can also use probability distributions to calculate probabilities, quantiles, and confidence intervals for different scenarios.
Overall, understanding and using probability distributions is essential for any data analyst, researcher, or decision-maker who wants to make sense of data and make informed decisions based on it.
References:
1. https://www.kaggle.com/code/hamelg/python-for-data-22-probability-distributions
2. https://www.kaggle.com/datasets/dgomonov/new-york-city-airbnb-open-data
Refer to this article for Probability Distribution Worked Examples: https://medium.com/@bhargavi.sikhakolli31/probability-distribution-worked-examples-ac81465c7210
License
All code in this article is available as open source through the MIT license.
All text and images are free to use under the Creative Commons Attribution 3.0 license. https://creativecommons.org/licenses/by/3.0/us/
These licenses let people distribute, remix, tweak, and build upon the work, even commercially, as long as they give credit for the original creation.
Copyright 2023 AI Skunks https://github.com/aiskunks
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.