Unlocking Insights: Mastering Descriptive Statistics with the Ames Housing Dataset

Hasan Khan
7 min read · Jul 22, 2024


You begin your data science journey with the Ames Housing dataset and descriptive statistics. This dataset is rich and allows descriptive statistics to summarize data in a meaningful way. This first step in analysis provides a clear overview of the main features of a dataset. Descriptive statistics are important because they simplify complex information, help explore data, make comparisons, and tell data-driven stories.

As you dive into the Ames properties dataset, you’ll see how powerful descriptive statistics can be, turning large amounts of data into useful summaries. You’ll learn about important metrics and their meanings, such as how the average being higher than the median shows skewness. (Download data from here)

Let’s get started.

Overview

This post is divided into three parts; they are:

  • Fundamentals of Descriptive Statistics
  • Data Dive with the Ames Dataset
  • Visual Narratives

Fundamentals of Descriptive Statistics

This post will show you how to use descriptive statistics to make sense of data. Let’s start with a refresher on which statistics can help describe data.

Central Tendency: The Heart of the Data

Central tendency captures the dataset’s core or typical value. The most common measures include:

  • Mean (average): The sum of all values divided by the number of values.
  • Median: The middle value when the data is ordered.
  • Mode: The value(s) that appear most frequently.
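As a quick illustration of the three measures above, here is a minimal sketch using Python's standard `statistics` module. The `prices` list is made-up toy data chosen for easy arithmetic, not values from the Ames dataset.

```python
import statistics

# Hypothetical toy data (not from the Ames dataset): six sale prices in dollars
prices = [120_000, 150_000, 150_000, 180_000, 200_000, 400_000]

mean_price = statistics.mean(prices)        # sum of values / number of values
median_price = statistics.median(prices)    # middle of the sorted values
mode_prices = statistics.multimode(prices)  # every value tied for most frequent

print("Mean:", mean_price)
print("Median:", median_price)
print("Mode(s):", mode_prices)
```

In this toy list, the single $400,000 sale lifts the mean well above the median, which is the same effect you will see later in the Ames sale prices.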

Dispersion: The Spread and Variability

Dispersion uncovers the spread and variability within the dataset. Key measures comprise:

  • Range: Difference between the maximum and minimum values.
  • Variance: Average of the squared differences from the mean.
  • Standard Deviation: Square root of the variance.
  • Interquartile Range (IQR): Range between the 25th and 75th percentiles.
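The dispersion measures above can be sketched the same way with the standard library. The data is again a made-up toy list. One detail worth knowing: variance comes in a population form (divide by n) and a sample form (divide by n - 1), and pandas' `var()` and `std()` use the sample form by default.

```python
import statistics

# Hypothetical toy data (not from the Ames dataset): six sale prices in dollars
prices = [120_000, 150_000, 150_000, 180_000, 200_000, 400_000]

value_range = max(prices) - min(prices)

# Population variance divides by n; sample variance divides by n - 1
pop_variance = statistics.pvariance(prices)
sample_variance = statistics.variance(prices)
sample_std = statistics.stdev(prices)  # square root of the sample variance

# Quartile cut points; method="inclusive" interpolates linearly between data points
q1, q2, q3 = statistics.quantiles(prices, n=4, method="inclusive")
iqr = q3 - q1

print("Range:", value_range)
print("Population variance:", pop_variance)
print("Sample variance:", sample_variance)
print("Sample standard deviation:", sample_std)
print("IQR:", iqr)
```

Note how the one large value inflates the range and the standard deviation, while the IQR, which only looks at the middle half of the data, is far less affected.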

Shape and Position: The Contour and Landmarks of Data

Shape and Position reveal the dataset’s distributional form and critical markers, characterized by the following measures:

  • Skewness: Asymmetry of the distribution. As a rule of thumb, if the mean is greater than the median, the data is right-skewed (a long tail of large values pulls the mean up). If the mean is less than the median, the data is left-skewed (a long tail of small values pulls the mean down).
  • Kurtosis: “Tailedness” of the distribution, or how prone it is to producing outliers. If extremely large or extremely small values appear more often than they would under a normal distribution, the data is said to be leptokurtic.
  • Percentiles: Values below which a percentage of observations fall. The 25th, 50th, and 75th percentiles are also called the quartiles.
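To see where skewness and kurtosis come from, here is a minimal sketch that computes both from central moments on a made-up toy list. It uses the plain population formulas; pandas' `skew()` and `kurt()` apply a sample-size correction, so their results on the same data will differ somewhat.

```python
# Hypothetical toy data (not from the Ames dataset): one large value on the right
prices = [120_000, 150_000, 150_000, 180_000, 200_000, 400_000]

def central_moment(values, k):
    """k-th central moment: the average of (x - mean) ** k."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** k for x in values) / len(values)

m2 = central_moment(prices, 2)  # population variance
m3 = central_moment(prices, 3)
m4 = central_moment(prices, 4)

skewness = m3 / m2 ** 1.5           # > 0 means a longer right tail
excess_kurtosis = m4 / m2 ** 2 - 3  # 0 for a normal distribution

print(f"Skewness: {skewness:.2f}")
print(f"Excess kurtosis: {excess_kurtosis:.2f}")
```

Because the cube keeps the sign of each deviation, the single large value on the right dominates the third moment and the skewness comes out positive.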

Descriptive Statistics gives voice to data, allowing it to tell its story succinctly and understandably.

Data Dive with the Ames Dataset

To delve into the Ames dataset, our spotlight is on the “SalePrice” attribute.

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns

Ames = pd.read_csv('AmesHousing.csv')
sales_price_description = Ames['SalePrice'].describe()
print(sales_price_description)

This summarizes “SalePrice,” showing the count, mean, standard deviation, and percentiles.

median_saleprice = Ames['SalePrice'].median()
print("Median Sale Price:", median_saleprice)

mode_saleprice = Ames['SalePrice'].mode().values[0]
print("Mode Sale Price:", mode_saleprice)

The average “SalePrice” (or mean) of homes in Ames is about $178,053.44. The median price is $159,900, meaning half the homes sold at or below this value. The mean sitting above the median suggests that high-value homes are pushing the average up. The mode is the most frequent sale price; because mode() can return several tied values, the code above keeps only the first one.

range_saleprice = Ames['SalePrice'].max() - Ames['SalePrice'].min()
print("Range of Sale Price:", range_saleprice)

variance_saleprice = Ames['SalePrice'].var()
print("Variance of Sale Price:", variance_saleprice)

std_dev_saleprice = Ames['SalePrice'].std()
print("Standard Deviation of Sale Price:", std_dev_saleprice)

iqr_saleprice = Ames['SalePrice'].quantile(0.75) - Ames['SalePrice'].quantile(0.25)
print("IQR of Sale Price:", iqr_saleprice)

“SalePrice” runs from $12,789 to $755,000, a range of $742,211, which shows the wide variety in Ames’ property values. The variance of about 5.63 billion is measured in squared dollars and is hard to interpret on its own, which is why we usually read its square root, the standard deviation of around $75,044.98, as the typical distance of a sale price from the mean. The Interquartile Range (IQR), which represents the middle 50% of the data, is $79,800, showing the spread of the central bulk of housing prices.
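The figures quoted above can be cross-checked with simple arithmetic. This short sketch copies the minimum, maximum, and standard deviation from the text rather than recomputing them from the dataset, and shows how the range and the variance follow from them.

```python
import math

# Figures quoted in the text above (not recomputed from the dataset)
min_price = 12_789
max_price = 755_000
std_dev = 75_044.98

price_range = max_price - min_price  # the range is a single number: max - min
variance = std_dev ** 2              # variance is the square of the standard deviation

print("Range:", price_range)
print(f"Variance: {variance:,.0f} squared dollars")
print(f"Back to standard deviation: {math.sqrt(variance):,.2f} dollars")
```

Squaring a dollar amount gives squared dollars, which is why the variance looks so large and why the standard deviation is the easier number to reason about.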

skewness_saleprice = Ames['SalePrice'].skew()
print("Skewness of Sale Price:", skewness_saleprice)

kurtosis_saleprice = Ames['SalePrice'].kurt()
print("Kurtosis of Sale Price:", kurtosis_saleprice)

tenth_percentile = Ames['SalePrice'].quantile(0.10)
ninetieth_percentile = Ames['SalePrice'].quantile(0.90)
print("10th Percentile:", tenth_percentile)
print("90th Percentile:", ninetieth_percentile)

q1_saleprice = Ames['SalePrice'].quantile(0.25)
q2_saleprice = Ames['SalePrice'].quantile(0.50)
q3_saleprice = Ames['SalePrice'].quantile(0.75)
print("Q1 (25th Percentile):", q1_saleprice)
print("Q2 (Median/50th Percentile):", q2_saleprice)
print("Q3 (75th Percentile):", q3_saleprice)

The “SalePrice” in Ames shows a positive skewness of about 1.76, which means the distribution has a longer tail on the right. Most homes sell for less than the mean, while a relatively small number of high-priced properties pull the mean above the median, which is the pattern we saw earlier when comparing the two. The kurtosis of around 5.43 points the same way. pandas reports excess kurtosis, for which a normal distribution scores 0, so a value this far above 0 indicates heavy tails: extreme values or outliers occur more often than they would under a normal distribution.

Looking closer, the quartile and percentile values show where prices sit within the distribution. With Q1 at $129,950 and Q3 at $209,750, the middle 50% of sale prices falls between these two values. The 10th percentile is $107,500 and the 90th percentile is $272,100, so the middle 80% of home prices lies between them. This shows a wide range in property values and highlights the diversity of the Ames housing market.

Visual Narratives

Visualizations breathe life into data, narrating its story. Let’s dive into the visual narrative of the “SalePrice” feature from the Ames dataset.

import matplotlib.pyplot as plt
import seaborn as sns

# Set style and calculate statistics
sns.set_style("whitegrid")
mean = Ames['SalePrice'].mean()
median = Ames['SalePrice'].median()
mode = Ames['SalePrice'].mode().values[0]
skewness = Ames['SalePrice'].skew()
kurtosis = Ames['SalePrice'].kurt()

# Plot histogram with KDE and reference lines
plt.figure(figsize=(14, 7))
sns.histplot(Ames['SalePrice'], bins=30, kde=True, color="skyblue")
for stat, color, linestyle, label in [(mean, 'r', '--', f"Mean: ${mean:.2f}"),
                                      (median, 'g', '-', f"Median: ${median:.2f}"),
                                      (mode, 'b', '-.', f"Mode: ${mode:.2f}")]:
    plt.axvline(stat, color=color, linestyle=linestyle, label=label)

# Show skewness and kurtosis in a text box on the right side of the plot
plt.annotate(f'Skewness: {skewness:.2f}\nKurtosis: {kurtosis:.2f}', xy=(500000, 100),
             fontsize=14, bbox=dict(boxstyle="round,pad=0.3", edgecolor="black", facecolor="aliceblue"))

plt.title('Histogram of Ames\' Housing Prices with KDE and Reference Lines')
plt.xlabel('Housing Prices')
plt.ylabel('Frequency')
plt.legend()
plt.show()

The histogram above provides a clear view of Ames’ housing prices. The noticeable peak around $150,000 highlights a large number of homes in this price range. The Kernel Density Estimation (KDE) curve adds a smoothed version of the data distribution, offering a more continuous view compared to the histogram’s discrete bins. This KDE curve refines the histogram’s data representation, capturing details that might be lost with binning.
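To make the idea behind the KDE curve concrete, here is a minimal sketch of a Gaussian kernel density estimate: place a small bell curve on every data point and average them. The toy data and the hand-picked `bandwidth` are assumptions for illustration only; seaborn chooses a bandwidth automatically and its result on real data will look different.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that estimates the density as an average of Gaussian bumps."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)

    return density

# Hypothetical toy data (not from the Ames dataset), in thousands of dollars
prices_k = [120, 150, 150, 180, 200, 400]
density = gaussian_kde(prices_k, bandwidth=30)  # bandwidth picked by hand for illustration

for x in (150, 300, 400):
    print(f"density at {x}k: {density(x):.5f}")
```

A larger bandwidth produces a smoother curve that hides detail, while a smaller one follows individual points more closely, much like choosing wider or narrower histogram bins.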

The rightward tail of the KDE curve reflects the positive skewness we calculated earlier, showing that most homes are priced below the average. The colored lines — red for mean, green for median, and blue for mode — help quickly compare and understand the distribution’s central tendencies. Together, these visualizations provide a thorough look at the distribution and characteristics of housing prices in Ames.

from matplotlib.lines import Line2D
import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 8))

# Box plot
sns.boxplot(x=Ames['SalePrice'], color='skyblue', showmeans=True,
            meanprops={"marker": "D", "markerfacecolor": "red",
                       "markeredgecolor": "red", "markersize": 10})

# Annotations: (label, x position of the arrow tip, horizontal offset of the text)
annotations = [('Q1', q1_saleprice, -70000), ('Q3', q3_saleprice, 20000), ('Median', q2_saleprice, -90000)]
for label, x, x_offset in annotations:
    plt.annotate(label, xy=(x, 0.30 if label != 'Median' else 0.20),
                 xytext=(x + x_offset, 0.45 if label != 'Median' else 0.05),
                 arrowprops=dict(edgecolor='black', arrowstyle='->'), fontsize=14)

# Titles, labels, and legends
plt.title('Box Plot of Ames\' Housing Prices', fontsize=16)
plt.xlabel('Housing Prices', fontsize=14)
plt.yticks([])

plt.legend(handles=[Line2D([0], [0], marker='D', color='w', markerfacecolor='red', markersize=10, label='Mean')],
loc='upper left', fontsize=14)

plt.tight_layout()
plt.show()

The box plot offers a clear view of central tendencies, ranges, and outliers, providing insights that aren’t as evident in the KDE curve or histogram. The Interquartile Range (IQR), which stretches from Q1 to Q3, shows the middle 50% of prices, highlighting the central range. The red diamond, representing the mean, being to the right of the median, indicates that high-value properties are pulling the average up.

The “whiskers” of the box plot are crucial for understanding the data spread. The left whisker extends from the box’s left edge to the smallest data point within 1.5 times the IQR below Q1, while the right whisker extends from the box’s right edge to the largest data point within 1.5 times the IQR above Q3. These whiskers mark the boundaries beyond the central 50%, with points outside them often considered potential outliers.

Outliers, shown as individual points, highlight exceptionally priced homes, possibly luxury properties, or those with unique features. In the plot, there are no outliers on the lower end but many on the higher end. Identifying these outliers is important as they can reveal unique market trends or anomalies in the Ames housing market.
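The 1.5 × IQR whisker rule described above can be checked against the quartiles and extremes reported earlier. This sketch copies those figures from the text rather than recomputing them from the dataset, and works out the cutoffs (often called fences) beyond which a point is drawn as an outlier.

```python
# Quartiles and extremes quoted earlier in the text (not recomputed from the dataset)
q1, q3 = 129_950, 209_750
min_price, max_price = 12_789, 755_000

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr  # values below this would be drawn as low outliers
upper_fence = q3 + 1.5 * iqr  # values above this are drawn as high outliers

print("IQR:", iqr)
print("Lower fence:", lower_fence)
print("Upper fence:", upper_fence)
print("Any low outliers? ", min_price < lower_fence)
print("Any high outliers?", max_price > upper_fence)
```

The cheapest home, at $12,789, sits above the lower fence of $10,250, which is why no points appear on the low end, while any sale above $329,450 shows up as an individual point on the right.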

Visualizations like this transform raw data into engaging stories, uncovering insights that numbers alone might not reveal. As we continue, it’s important to appreciate how visualization enriches data analysis, offering a deeper understanding of complex data.
