Karthik
Published in Variablz Academy
5 min read · Oct 24, 2023

In the world of data analysis, the need to efficiently process and analyze vast datasets is a common challenge. You might find yourself faced with slow processing speeds, limited capabilities, or incompatible libraries when working with data in Python.

This is where NumPy, the Numerical Python library, steps in to offer a compelling solution.

If you are a beginner, you may ask what NumPy is and why we use it.

NumPy (Numerical Python) is a fundamental and widely used Python library for data analysis. It makes it easy to process vast amounts of data in matrix (array) form.

Many consider NumPy to be the most powerful package in Python.

As for why NumPy: the main reason is processing speed.

It is always a good idea to move data processing into NumPy rather than relying on explicit conditional and looping statements, which tend to slow processing down significantly.
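As a quick sketch of this point (my own illustration, not from the original article), compare a pure-Python loop with the equivalent vectorized NumPy call; both compute the same sum of squares, but the NumPy version runs in optimized C code and is typically orders of magnitude faster on large arrays:

```python
import numpy as np

data = np.arange(1_000_000, dtype=np.float64)

# Pure-Python loop: iterate element by element in the interpreter
def loop_square_sum(values):
    total = 0.0
    for v in values:
        total += v * v
    return total

# Vectorized NumPy: a single call, executed in compiled code
def numpy_square_sum(values):
    return np.dot(values, values)

# Both produce the same result
assert np.isclose(loop_square_sum(data), numpy_square_sum(data))
```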

Key features of NumPy include:

  1. Multidimensional Arrays: Provides the ‘ndarray’ data structure, which lets you create and manipulate arrays with any number of dimensions.
  2. Mathematical Operations: Includes a wide range of mathematical functions, including element-wise operations, linear algebra, statistical functions, and more.
  3. Random Number Generation: Provides tools for generating random numbers and arrays.
  4. Broadcasting: NumPy allows you to perform operations on arrays of different but compatible shapes.
  5. Integration with Other Libraries: NumPy integrates well with other scientific and data analysis libraries like SciPy (for scientific computing), Matplotlib (for data visualization), and Pandas (for data manipulation).
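To make points 1 and 4 concrete, here is a minimal sketch (my addition, not part of the original article) showing broadcasting: a one-dimensional row vector is automatically stretched across each row of a two-dimensional matrix:

```python
import numpy as np

# A 3x3 matrix and a length-3 row vector
matrix = np.arange(9).reshape(3, 3)   # shape (3, 3)
row = np.array([10, 20, 30])          # shape (3,)

# Broadcasting: the row is added to every row of the matrix
result = matrix + row
print(result)
# [[10 21 32]
#  [13 24 35]
#  [16 27 38]]
```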

Now let’s look at descriptive statistics.

It is used to summarize and describe the main features of a dataset.

The most important part of working with data is analyzing it, which can be done using both descriptive and inferential statistics. Both are used to analyze the data and draw conclusions from it.

Data is everything in statistics. Calculating the range, median, and mode of the data set is all a part of descriptive statistics.

Say, for instance: “Is it going to rain today? Should I bring my umbrella to work or not?” We pull out our phones and check the weather forecast to answer these questions. How is this accomplished? Computer programs use statistics to compare past and present weather conditions and forecast the weather in the future.

Descriptive statistics in NumPy can be divided into two groups: Measures of Central Tendency and Measures of Variability (or Dispersion).

Let’s go through them one by one.

Measures of Central Tendency:

Section 1: Mean

The mean is the average value of a data set. To compute it, sum all the values and divide the sum by the number of values.

import numpy as np

temperatures = np.array([72, 75, 68, 70, 74, 73, 71])
mean_temperature = np.mean(temperatures)
print(f"The mean temperature is {mean_temperature} degrees.")

The mean temperature is 71.85714285714286 degrees.

Section 2: Median

The median is the middle value of a data set. It is obtained by ordering all data points and picking the one in the middle (or, for an even number of values, the average of the two middle values).

import numpy as np

prices = np.array([250000, 275000, 310000, 425000, 725000, 290000])
median_price = np.median(prices)
print(f"The median house price is ${median_price}.")

The median house price is $300000.0.

Section 3: Mode

The mode is the value that occurs most frequently in a data set. NumPy itself has no mode function, so we use SciPy’s stats module.

import numpy as np
from scipy import stats  # SciPy provides the mode calculation

# Sample dataset
data = np.array([2, 3, 4, 5, 3, 4, 6, 5, 5, 2, 7, 7, 2])

# Calculate the mode (keepdims=False returns plain scalars
# in recent SciPy versions; if there is a tie, the smallest
# value is returned)
mode_result = stats.mode(data, keepdims=False)

print(f"The mode of the dataset is: {mode_result.mode}")
print(f"It occurs {mode_result.count} times.")


The mode of the dataset is: 2
It occurs 3 times.

Now come the Measures of Variability (Dispersion).

Section 1: Minimum and Maximum — Identifying Extremes

The minimum and maximum values in a dataset provide insights into the data’s range. They are crucial when dealing with constraints or identifying outliers.

NumPy also offers numpy.amin() and numpy.amax() (aliases of numpy.min() and numpy.max()), as well as numpy.ptp() (“peak to peak”) to calculate the range directly.

import numpy as np

scores = np.array([85, 92, 77, 98, 63, 100])
min_score = np.min(scores)
max_score = np.max(scores)
print(f"Minimum score: {min_score}, Maximum score: {max_score}")

Minimum score: 63, Maximum score: 100
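As a small follow-up sketch (my addition, not from the article), np.ptp computes the same range in a single call:

```python
import numpy as np

scores = np.array([85, 92, 77, 98, 63, 100])

# ptp ("peak to peak") is simply max - min
score_range = np.ptp(scores)
print(f"Score range: {score_range}")  # 100 - 63 = 37
```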

Section 2: Variance and Standard Deviation — Measuring Data Spread

Variance and standard deviation quantify how spread out the data is. Variance measures the average of the squared differences from the mean, while standard deviation is the square root of the variance.

import numpy as np

returns = np.array([0.02, 0.03, -0.01, -0.02, 0.01])
variance = np.var(returns)
std_deviation = np.std(returns)
print(f"Variance: {variance}, Standard Deviation: {std_deviation}")

Variance: 0.000344, Standard Deviation: 0.01854723699099141
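To make the formula concrete, here is a sketch (my addition) that computes the variance by hand and checks it against NumPy. Note that np.var defaults to the population variance (ddof=0, dividing by n); passing ddof=1 gives the sample variance (dividing by n - 1):

```python
import numpy as np

returns = np.array([0.02, 0.03, -0.01, -0.02, 0.01])

# Variance by hand: mean of the squared deviations from the mean
mean = returns.sum() / returns.size
manual_variance = ((returns - mean) ** 2).sum() / returns.size

# Matches NumPy's default (population variance, ddof=0)
assert np.isclose(manual_variance, np.var(returns))

# Sample variance divides by n - 1 instead of n
sample_variance = np.var(returns, ddof=1)
print(manual_variance, sample_variance)
```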

Section 3: Percentiles and Interquartile Range — Dividing Data into Quarters

Percentiles divide data into hundredths, while quartiles (Q1 and Q3) divide data into four quarters. The interquartile range (IQR) represents the range between Q1 and Q3.

import numpy as np

income = np.array([40000, 55000, 60000, 75000, 85000, 95000, 120000])
q1 = np.percentile(income, 25)
q3 = np.percentile(income, 75)
iqr = q3 - q1
print(f"Q1: {q1}, Q3: {q3}, IQR: {iqr}")

Q1: 57500.0, Q3: 90000.0, IQR: 32500.0

Conclusion:

In essence, descriptive statistics, combined with the capabilities of NumPy, equips us with the tools to dig into datasets, understand their characteristics, and draw meaningful conclusions.

Putting it all together, here is a short script that computes the mean, median, and mode and plots them:

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Sample dataset
data = np.array([2, 3, 4, 5, 3, 4, 6, 5, 5, 2, 7, 7, 2])

# Calculate mean, median, and mode
mean_value = np.mean(data)
median_value = np.median(data)
mode_result = stats.mode(data, keepdims=False)
mode_value = mode_result.mode

# Labels and corresponding values
labels = ['Mean', 'Median', 'Mode']
values = [mean_value, median_value, mode_value]

# Create a line plot
plt.plot(labels, values, marker='o', linestyle='-', color='b')

# Add labels and a title
plt.xlabel('Statistics')
plt.ylabel('Values')
plt.title('Mean, Median, and Mode Line Plot')

# Display the plot
plt.grid(True)
plt.show()

In conclusion, you’ve not only explored the fundamental principles of descriptive statistics but also harnessed the remarkable capabilities of NumPy, equipping yourself with the tools to uncover valuable insights and make informed decisions. Whether you’re a novice data enthusiast or an experienced statistician, you’ve learned to find the mean, median, mode, and measures of variability, paving the way for data analysis excellence.

With curiosity as your guide and perseverance as your companion, you’re well-prepared to embrace a data-driven future, where your newfound skills will undoubtedly lead to positive and impactful outcomes.

I will come up with another article soon. Until then, adios!

Karthik Saravanan

https://www.linkedin.com/in/karthik-sa/
