Probability Distribution: Worked Examples

Introduction

bhargavi sikhakolli
5 min read · Mar 13, 2023

This article works through examples of probability distributions using two datasets: the New York City Airbnb listings and S&P 500 stock data.

New York City Airbnb Open Data: This dataset includes information about Airbnb listings in New York City, such as location, price, number of reviews, and availability.

#importing required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm, lognorm
import seaborn as sns
from scipy.stats import kstest

# Load the dataset (hosted on GitHub)
df = pd.read_csv("https://raw.githubusercontent.com/Venkata-Bhargavi/ML_Data_cleaning_and_feature_selection_2724793/main/AB_NYC_2019.csv")

# Select the "price" column
price = df["price"]

# Clean the data: drop missing values and keep prices between $0 and $500 (higher prices are treated as outliers)
price = price.dropna()
price = price[price.between(0, 500)]



# Plot a histogram of the prices
plt.hist(price, bins=50, density=True)
plt.xlabel("Price ($)")
plt.ylabel("Density")
plt.show()

# Using statistical methods to describe the distribution
mu = price.mean()
median = price.median()
sigma = price.std()

print("Mean: ${:.2f}".format(mu))
print("Median: ${:.2f}".format(median))
print("Standard deviation: ${:.2f}".format(sigma))

# Fit a normal distribution to the data
params = norm.fit(price)
print("Normal distribution parameters: loc={:.2f}, scale={:.2f}".format(params[0], params[1]))

# Use the fitted distribution to calculate probabilities and quantiles
x = np.linspace(price.min(), price.max(), 1000)
pdf = norm.pdf(x, *params)
cdf = norm.cdf(x, *params)

p_less_than_100 = norm.cdf(100, *params)
q_95_percentile = norm.ppf(0.95, *params)

print("P(price < $100) = {:.2f}".format(p_less_than_100))
print("95th percentile price = ${:.2f}".format(q_95_percentile))

STEPS INVOLVED:

  • Importing the dataset into a Pandas dataframe and selecting the “price” column.
  • Cleaning the data by removing any missing values and outliers.
  • Plotting a histogram of the prices to visualize the distribution.
  • Using summary statistics (mean, median, and standard deviation) to describe the distribution and to judge whether it follows a particular probability distribution (such as a normal or lognormal distribution).
  • Fitting a probability distribution to the data using a suitable library (such as Scipy’s stats module).
  • Using the fitted distribution to calculate probabilities and quantiles for different price levels.
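The step of judging whether the data looks normal or lognormal can be sketched with a Kolmogorov-Smirnov comparison, using the `lognorm` and `kstest` imports from the code above. This is a minimal sketch, not the article's own analysis: the sample is synthetic (an assumption standing in for the cleaned `price` series), and fixing `floc=0` for the lognormal fit is an illustrative choice that requires strictly positive values.

```python
import numpy as np
from scipy.stats import norm, lognorm, kstest

# Synthetic right-skewed sample standing in for the cleaned `price` series
# (assumption: with real data, use the positive prices from the dataset instead).
rng = np.random.default_rng(0)
sample = rng.lognormal(mean=4.7, sigma=0.6, size=5000)

# Fit both candidate distributions by maximum likelihood.
norm_params = norm.fit(sample)                # (loc, scale)
lognorm_params = lognorm.fit(sample, floc=0)  # (shape, loc, scale), loc fixed at 0

# KS statistic: the largest gap between the empirical CDF and the fitted CDF.
ks_norm = kstest(sample, "norm", args=norm_params)
ks_lognorm = kstest(sample, "lognorm", args=lognorm_params)

print("KS statistic (normal):    {:.3f}".format(ks_norm.statistic))
print("KS statistic (lognormal): {:.3f}".format(ks_lognorm.statistic))
```

A smaller KS statistic means the fitted curve tracks the data more closely. Because the parameters are estimated from the same sample, the p-values are only a rough guide, so the sketch compares statistics rather than p-values.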

Insights

  • The output shows that the price variable follows a right-skewed distribution, which is evident from the long tail on the right-hand side of the histogram and the high value of the skewness coefficient. The median is $100.
  • Under the fitted normal distribution, the probability of a price below $100 is 0.36 (36%).
  • The mean is approximately $132.
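One way to sanity-check a fitted probability is to compare it with the empirical share of observations below the same threshold. The sketch below does this on a synthetic right-skewed sample (an assumption, not the Airbnb data), with the threshold set to the sample median so that the empirical share is about one half.

```python
import numpy as np
from scipy.stats import norm

# Synthetic right-skewed sample (assumption: stands in for the cleaned prices).
rng = np.random.default_rng(1)
sample = rng.lognormal(mean=4.6, sigma=0.7, size=10000)

threshold = np.median(sample)

# Probability below the threshold according to a fitted normal distribution.
loc, scale = norm.fit(sample)
fitted = norm.cdf(threshold, loc, scale)

# Share of observations actually below the threshold.
empirical = np.mean(sample < threshold)

print("Fitted normal P(X < median): {:.2f}".format(fitted))
print("Empirical    P(X < median): {:.2f}".format(empirical))
```

On skewed data the normal fit centres on the mean, which lies above the median, so it tends to understate the probability of values below the median. A gap between the two numbers is a sign that the normal model describes the data poorly in that region.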

The summary statistics also describe the central tendency and spread of the price variable. Because the mean is higher than the median, the distribution of prices is positively skewed.
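The skewness coefficient itself can be computed with `scipy.stats.skew`. The following is a small sketch on synthetic gamma-distributed data (an assumption chosen only because it is right-skewed); with the real data, the cleaned `price` series would be passed instead.

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed sample (assumption: stands in for the cleaned prices).
rng = np.random.default_rng(2)
sample = rng.gamma(shape=2.0, scale=60.0, size=5000)

sample_skew = skew(sample)                        # > 0 for a right-skewed sample
mean_minus_median = sample.mean() - np.median(sample)

print("Skewness: {:.2f}".format(sample_skew))
print("Mean - median: {:.2f}".format(mean_minus_median))
```

A positive skewness together with a mean above the median points to a long right tail, which is the pattern described above.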

This analysis helps in making data-driven decisions related to pricing strategies, marketing, and other aspects of the business.

Now let's analyze the “availability_365” column, which gives the number of days in a year that a listing is available.

availability = df['availability_365']

# Clean the data by removing any missing values
availability = availability.dropna()

# Plot a KDE plot to visualize the distribution
# (fill=True replaces the deprecated shade=True in recent seaborn versions)
sns.kdeplot(availability, fill=True)
plt.xlabel('Availability (in days)')
plt.ylabel('Density')
plt.title('KDE plot of availability')
plt.show()

From the KDE plot, we can observe that the distribution of the “availability_365” feature is heavily right-skewed: most of the density sits at low availability values, with a tail stretching toward 365 days.

The peak of the distribution is around 0, indicating that a significant proportion of the listings are not available for booking for most of the year.

The long tail towards the right suggests that there are also a significant number of listings that are available for a very high number of days in a year.
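Readings taken from a KDE curve can be backed up with simple counts. The sketch below uses a small hypothetical series of availability values (not taken from the dataset) to compute the share of listings at zero days, the share at 300 days or more, and the quartiles; the 300-day cutoff is an arbitrary illustrative choice.

```python
import pandas as pd

# Hypothetical availability values in days per year (assumption: illustrative only).
availability = pd.Series([0, 0, 0, 5, 12, 30, 90, 180, 300, 365])

share_zero = (availability == 0).mean()    # fraction of listings never available
share_high = (availability >= 300).mean()  # fraction available most of the year
quartiles = availability.quantile([0.25, 0.5, 0.75])

print("Share with 0 days: {:.0%}".format(share_zero))
print("Share with >= 300 days: {:.0%}".format(share_high))
print(quartiles)
```

Counts like these are not affected by KDE smoothing, which can draw density below 0 or above 365 days even though such values cannot occur; seaborn's `kdeplot` accepts a `clip` argument that can limit the curve to the valid range.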

This information can be useful for potential guests to make informed decisions about their booking, and for hosts to adjust their pricing and availability strategies based on the demand patterns.

Now let's experiment with the “S&P 500 stock data”.

Column details:

Date: in format yy-mm-dd

Open: price of the stock at market open (this is NYSE data so all in USD)

High: highest price reached in the day

Low: lowest price reached in the day

Close: price of the stock at market close

Volume: number of shares traded

Name: the stock’s ticker name

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy.stats as stats
from scipy.stats import norm

# Load S&P 500 stock data from CSV file
df = pd.read_csv('https://raw.githubusercontent.com/Venkata-Bhargavi/Datasets/main/A_data.csv')
# Clean the data by removing any missing values
df = df.dropna()

# Select the "Close" column
close_prices = df['close']

# Calculate the mean and standard deviation
mean = close_prices.mean()
std = close_prices.std()

# Calculate the PDF using a normal distribution with the same mean and standard deviation as the data
pdf = stats.norm.pdf(close_prices, mean, std)

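# Calculate the probability of closing below 50 under the fitted normal distribution
# (uses the mean and std computed above; mu and sigma are only defined later)
prob_below_50 = stats.norm.cdf(50, loc=mean, scale=std)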

# Plot a histogram of the data
plt.hist(close_prices, bins=30, density=True, alpha=0.6, label='Close Prices')

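# Plot the PDF over sorted prices so the curve is drawn left to right
x_sorted = np.sort(close_prices)
plt.plot(x_sorted, stats.norm.pdf(x_sorted, mean, std), label='Normal PDF')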

# Set the title and labels for the plot
plt.title('Distribution of S&P 500 Stock Close Prices')
plt.xlabel('Close Price ($)')
plt.ylabel('Density')
plt.legend()



# Show the plot
plt.show()



mu = np.mean(df['close'])
sigma = np.std(df['close'])

# Print mean and standard deviation
print("Mean:", mu)
print("Standard Deviation:", sigma)
print("Probability of closing below 50:", prob_below_50)

Steps:

  • Loading the S&P 500 stock data from the “A_data.csv” file
  • Cleaning the data by removing any missing values
  • Selecting the “Close” column and calculating the mean and standard deviation of the data
  • Using these values to calculate the PDF of a normal distribution
  • Finally, plotting a histogram of the data and overlaying the PDF on top of it

The resulting graph shows that the distribution of S&P 500 stock close prices is approximately normal, with a mean of around 49 and a standard deviation of around 9.
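A histogram with an overlaid curve gives a visual impression of normality; a normal probability (Q-Q) check is one way to put a number on it. The sketch below applies `scipy.stats.probplot` to synthetic normal data with the mean and standard deviation quoted above (an assumption standing in for the real close prices).

```python
import numpy as np
from scipy import stats

# Synthetic sample (assumption: stands in for the `close` column).
rng = np.random.default_rng(3)
sample = rng.normal(loc=49, scale=9, size=1000)

# probplot pairs the sorted data with theoretical normal quantiles and
# fits a straight line through them; r is the correlation of that fit.
(theoretical_q, ordered_values), (slope, intercept, r) = stats.probplot(sample, dist="norm")

print("Q-Q correlation r: {:.4f}".format(r))
print("Slope (approx. std): {:.2f}, intercept (approx. mean): {:.2f}".format(slope, intercept))
```

An `r` close to 1 means the points lie near a straight line, which is consistent with normality. Real price series often deviate in the tails, so a lower value there would be a reason to treat the normal model as a rough approximation.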

It also shows that the probability of the stock closing below $50 is 0.53 (53%).
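The probability above comes from the normal CDF, which can also be evaluated by standardizing the threshold first: z = (50 - mean) / std. The sketch below uses the rounded values quoted in the text (mean 49, standard deviation 9), so it is an approximation of the article's calculation rather than a reproduction of it.

```python
from scipy.stats import norm

mean, std = 49.0, 9.0  # rounded values quoted in the text (assumption)

# Standardize the threshold, then evaluate the standard normal CDF.
z = (50 - mean) / std
p_via_z = norm.cdf(z)

# Equivalent call that passes the location and scale directly.
p_direct = norm.cdf(50, loc=mean, scale=std)

print("z-score: {:.3f}".format(z))
print("P(close < 50): {:.3f}".format(p_direct))
```

With these rounded inputs the result is roughly 0.54; the small difference from the 0.53 reported above is presumably due to the unrounded mean and standard deviation used in the original calculation.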
