The Ultimate Guide to Machine Learning: Statistics and Statistical Modelling — Part 3

Simranjeet Singh
20 min read · Mar 3, 2023


Introduction

In this third installment of the ultimate machine learning guide, we will look at the fundamentals of statistical modeling and how to implement them in Python, a powerful programming language widely used in data analysis and scientific computing. We’ll go over fundamental concepts like probability distributions, hypothesis testing, regression analysis, and classification, as well as hands-on techniques like data preparation, model selection, and evaluation.

👉 Before Starting the Blog, Please Subscribe to my YouTube Channel and Follow Me on Instagram 👇
📷 YouTube — https://bit.ly/38gLfTo
📃 Instagram — https://bit.ly/3VbKHWh

👉 Do Donate 💰 or Give me Tip 💵 If you really like my blogs, Because I am from India and not able to get into Medium Partner Program. Click Here to Donate or Tip 💰 — https://bit.ly/3oTHiz3

Fig.1 — Statistics and Statistical Modelling

In many fields, including data science, machine learning, and finance, statistical modeling is a fundamental tool for understanding and analyzing data. We can make predictions, test hypotheses, and gain insights into complex phenomena by developing mathematical models that capture the relationships between variables.

This guide will prepare you with the knowledge and tools you need to tackle real-world data problems and make data-driven decisions. So, let’s get started with Python and explore the exciting world of statistical modeling.

Table of Contents

  1. Probability Theory
  2. Descriptive Statistics
  3. Inferential Statistics
  4. Generalized Linear Models
  5. Bayesian Statistics and Inference
  6. Markov Chain Monte Carlo (MCMC)
  7. Conclusion

Probability Theory

Probability theory is a branch of mathematics that studies random events. It provides a framework for understanding how likely events are to occur and how we can make predictions based on those probabilities.

Fig.2 — Probability Theory

We begin with a sample space, which is the set of all possible outcomes of an experiment. We then define events as subsets of the sample space and assign probabilities to them. An event's probability is a number between 0 and 1, with 0 indicating that the event is impossible and 1 indicating that it is certain. Consider the roll of a fair six-sided die: the random variable is the number shown on the die, and its probability distribution is uniform, meaning each of the six values has an equal chance (1/6) of occurring.
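
As a quick sketch of this example (the number of simulated rolls is an arbitrary choice), we can check that the empirical frequencies come out close to 1/6:

import numpy as np

# Simulate 60,000 rolls of a fair six-sided die
rolls = np.random.randint(1, 7, size=60000)

# Compare the empirical frequency of each face with the theoretical 1/6
faces, counts = np.unique(rolls, return_counts=True)
for face, count in zip(faces, counts):
    print(f"P({face}) ≈ {count / len(rolls):.3f} (theoretical: {1/6:.3f})")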

1. Expected value and variance: The concepts of expected value and variance are essential in probability theory. A random variable's expected value is the average value it would take if the experiment were repeated many times. The variance measures how widely the random variable's values spread around that expected value; formally, it is the expected squared deviation from the mean.

Consider calculating the expected value and variance of a random variable with a normal distribution and a mean of 0 and a standard deviation of 1. To generate random numbers from a normal distribution, we can use the NumPy library:

import numpy as np

# Generate 10000 random numbers from a normal distribution
rand_nums = np.random.normal(loc=0, scale=1, size=10000)

# Calculate the expected value and variance
expected_value = np.mean(rand_nums)
variance = np.var(rand_nums)

print(f"Expected value: {expected_value:.2f}")
print(f"Variance: {variance:.2f}")

In probability theory, conditional probability and Bayes’ theorem are also important concepts. The probability of an event given that another event has occurred is known as conditional probability. The Bayes’ theorem is a formula that describes how to update an event’s probability based on new information.

For example, consider a medical test that is 95% accurate in detecting a disease when it is present, but also has a 5% false positive rate. If the prevalence of the disease in the population is 1%, what is the probability that a person who tests positive actually has the disease?

# Define the probabilities
p_disease = 0.01
p_no_disease = 1 - p_disease
p_positive_given_disease = 0.95
p_positive_given_no_disease = 0.05

# Calculate the probability of testing positive
p_positive = (p_positive_given_disease * p_disease) + (p_positive_given_no_disease * p_no_disease)

# Calculate the probability of having the disease given a positive test
p_disease_given_positive = (p_positive_given_disease * p_disease) / p_positive

print(f"Probability of having the disease given a positive test: {p_disease_given_positive:.2f}")

2. Probability Distributions

Probability distributions are mathematical functions that describe the probability of different outcomes in a random event. There are numerous types of probability distributions, each with its own set of characteristics and applications. Some of the most common probability distributions and their properties are as follows:

Fig.3 — Probability Distributions

a. Normal Distribution: It is a continuous distribution with a bell-shaped curve, also known as the Gaussian distribution. It is commonly used in statistics to describe real-world phenomena such as IQ scores, heights, and weights. It has two parameters: the mean (μ) and the standard deviation (σ).

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Create a normal distribution with mean 0 and standard deviation 1
mu, sigma = 0, 1
x = np.linspace(mu - 3*sigma, mu + 3*sigma, 100)
plt.plot(x, norm.pdf(x, mu, sigma))
plt.show()

b. Binomial Distribution: It is a discrete distribution that describes the likelihood of a given number of successes in a fixed number of trials. It has two parameters: n (the number of trials) and p (the probability of success in each trial).

from scipy.stats import binom

# Create a binomial distribution with n=10 and p=0.5
n, p = 10, 0.5
x = np.arange(binom.ppf(0.01, n, p), binom.ppf(0.99, n, p) + 1)
plt.plot(x, binom.pmf(x, n, p), 'bo', ms=8)
plt.show()

c. Poisson Distribution: It is a discrete distribution that describes the likelihood of a certain number of events occurring in a given time or space interval. It only has one parameter λ (the average number of events per interval).

from scipy.stats import poisson

# Create a Poisson distribution with lambda=2
mu = 2
x = np.arange(poisson.ppf(0.01, mu), poisson.ppf(0.99, mu) + 1)
plt.plot(x, poisson.pmf(x, mu), 'bo', ms=8)
plt.show()

d. Exponential Distribution: It is a continuous distribution that describes the time between events in a Poisson process, where events happen at random and independently. It only has one parameter λ (the rate of the Poisson process).

from scipy.stats import expon

# Create an exponential distribution with lambda=0.5
rate = 0.5
x = np.linspace(0, 10, 100)
plt.plot(x, expon.pdf(x, scale=1/rate))
plt.show()

e. Gamma Distribution: It is a continuous distribution that describes the waiting time until the k-th event in a Poisson process with rate λ. It has two parameters, k (shape) and λ (rate).

from scipy.stats import gamma

# Create a gamma distribution with k=2 and lambda=0.5
k, theta = 2, 1/0.5
x = np.linspace(gamma.ppf(0.01, k, scale=theta), gamma.ppf(0.99, k, scale=theta), 100)
plt.plot(x, gamma.pdf(x, k, scale=theta))
plt.show()

Probability theory is a challenging topic, but it is essential for understanding and analyzing random events and their outcomes. Working through these concepts in Python alongside the mathematics makes them easier to understand and apply to real-world problems.

Descriptive Statistics

Descriptive statistics can be used to describe the characteristics of a data set by creating summaries of data samples. A population census may include descriptive information such as the proportion of males and females in a particular city.

Fig.4 — Descriptive Statistics

  1. Measures of central tendency and variability

These are statistical concepts used to describe the characteristics of a dataset.

Central Tendency: The central tendency is the single value that best represents the entire dataset. It can be measured in several ways, including the mean, median, and mode.

  • Mean: It is calculated by dividing the sum of all values by the number of data points and is the most commonly used measure of central tendency: mean = (sum of all values) / (number of values). We can calculate the mean of a dataset in Python using the NumPy library, as shown below:
import numpy as np
data = np.array([2, 4, 6, 8, 10])
mean = np.mean(data)
print("Mean:", mean)
  • Median: It is the dataset’s middle value. It is the value that separates the upper 50% of the data from the lower 50% of the data in a sorted dataset. If there are an even number of values in the dataset, the median is the average of the two middle values. We can calculate the median of a dataset in Python using the NumPy library, as shown below:
import numpy as np
data = np.array([2, 4, 6, 8, 10])
median = np.median(data)
print("Median:", median)
  • Mode: It is the most frequently occurring value in the dataset. There can be one or more modes in a dataset. We can calculate the mode of a dataset in Python using the SciPy library, as shown below:
import scipy.stats as stats
data = [2, 4, 6, 8, 10, 10]
mode = stats.mode(data)
print("Mode:", mode)

Variability: The spread or dispersion of data points around the central tendency is measured by variability. Variability is commonly measured using range, variance, and standard deviation.

  • Range: It is the difference between the dataset’s maximum and minimum values. We can calculate the range of a dataset in Python using the NumPy library, as shown below:
import numpy as np
data = np.array([2, 4, 6, 8, 10])
data_range = np.max(data) - np.min(data)
print("Range:", data_range)
  • Variance: It is the average of the squared deviations from the mean and measures how far the data spread around the mean. For a sample, the variance is calculated using the following formula: variance = sum((xi − mean)^2) / (n − 1). We can calculate the variance of a dataset in Python using the NumPy library, as shown below:
import numpy as np
data = np.array([2, 4, 6, 8, 10])
variance = np.var(data, ddof=1)
print("Variance:", variance)
  • Standard Deviation: It is the square root of the variance and expresses the spread of the data in the same units as the data itself: standard deviation = sqrt(variance). The standard deviation of a dataset can be calculated in Python using the NumPy library, as shown below:
import numpy as np
data = np.array([2, 4, 6, 8, 10])
std_dev = np.std(data, ddof=1)
print("Standard Deviation:", std_dev)

For histograms and visualisation plots, plus correlation and covariance, check out EDA — Part 1; for skewness and kurtosis, check out Feature Engineering — Part 2.

Inferential Statistics

Inferential statistics is a type of statistics that employs sample data to draw conclusions about a larger population. By analysing a sample of data from a population, inferential statistics can make predictions and draw conclusions about that population.

To make these inferences, inferential statistics employs techniques such as hypothesis testing and confidence intervals, which are based on probability theory.

Fig.5 — Inferential Statistics

  1. Hypothesis Testing

Hypothesis testing is a statistical method for determining whether data support a hypothesis about a population parameter. It involves stating a null hypothesis, which represents the assumption that no difference exists between two groups, and an alternative hypothesis, which represents the claim that a difference does exist.

The basic steps involved in hypothesis testing are:

  1. State the null hypothesis and the alternative hypothesis
  2. Choose a significance level (alpha)
  3. Collect data and calculate a test statistic
  4. Determine the p-value
  5. Draw a conclusion based on the p-value and the significance level

Here is an example of hypothesis testing:

Suppose a company claims that its new product will increase sales. To test this claim, we can set up the null hypothesis as:

H0: There is no increase in sales due to the new product (μ ≤ 0)

And the alternative hypothesis as:

Ha: There is an increase in sales due to the new product (μ > 0)

Following that, we can collect sales data before and after the launch of the new product and compute the test statistic (e.g., the t-statistic). We can then calculate the p-value using this test statistic, which represents the probability of obtaining a test statistic as extreme as the one observed, assuming the null hypothesis is true.

If the p-value is less than the chosen significance level (alpha), we reject the null hypothesis and conclude that there is evidence for the alternative hypothesis. If the p-value is greater than alpha, we fail to reject the null hypothesis and conclude that there is insufficient evidence to support the alternative hypothesis.

In Python, we can use the scipy.stats module to perform hypothesis testing. Here is an example of a t-test:

import numpy as np
from scipy.stats import ttest_ind

# Generate some sample data
group1 = np.random.normal(10, 2, size=50)
group2 = np.random.normal(12, 2, size=50)

# Perform a two-sample t-test
t, p = ttest_ind(group1, group2)

print("t-statistic:", t)
print("p-value:", p)

2. Confidence Intervals

Inferential statistics employ confidence intervals to estimate a population parameter, such as a mean or proportion, based on a sample of data. A confidence interval is a range of values that, at a specified level of confidence, is believed to contain the true value of the parameter.

Assume we want to estimate the average height of all students in a school. We can measure the heights of a random sample of students and use the sample mean to estimate the population mean. However, we don’t know how precise this estimate is. We can provide a range of values that are likely to contain the true population mean with a certain level of confidence by constructing a confidence interval.

Fig.6 — Confidence Intervals

To create a confidence interval, we must first select a level of confidence, which is typically 90%, 95%, or 99%. The level of confidence represents the proportion of times the interval would contain the true population parameter if the sampling process were repeated many times. A 95% confidence interval, for example, means that if we took 100 samples and built a confidence interval for each one, approximately 95 of those intervals would contain the true population parameter.

The formula for a confidence interval is determined by the sample size and the distribution of the sample data. For example, if the sample size is large and the data is normally distributed, we can use the following formula to calculate a 95% confidence interval for the population mean:

CI = x̄ ± 1.96 * (s / √n)

where CI is the confidence interval, x̄ is the sample mean, s is the sample standard deviation, n is the sample size, and 1.96 is the z-score for a 95% confidence level.

In Python, we can use the scipy.stats module to calculate confidence intervals. For example, to calculate a 95% confidence interval for the population mean from a sample of data, we can use the following code:

import numpy as np
from scipy.stats import t

# Generate some sample data
data = np.random.normal(0, 1, size=100)

# Calculate the sample mean and standard deviation
xbar = np.mean(data)
s = np.std(data, ddof=1)

# Calculate the t-value for a 95% confidence level with n-1 degrees of freedom
tval = t.ppf(0.975, len(data)-1)

# Calculate the confidence interval
ci = (xbar - tval*s/np.sqrt(len(data)), xbar + tval*s/np.sqrt(len(data)))

print(ci)

3. Power Analysis

Power analysis is a statistical method for determining the sample size a study needs in order to detect a significant effect, if one exists. The power of a study is the probability of detecting such an effect when it truly exists. Power analysis helps avoid Type II errors (false negatives) and ensures that a study's results are reliable and valid.

Fig.7 — Power Analysis

Several factors must be considered when conducting a power analysis, including the desired level of significance (alpha), the effect size, and the sample size. The magnitude of the difference or relationship between two variables is referred to as the effect size, and it is usually expressed as a standardised measure of difference, such as Cohen’s d. Power analysis also considers sample size, as larger sample sizes generally result in higher power.

Here is an example of conducting a power analysis in Python using the statsmodels library:

import statsmodels.stats.power as smp

# Set parameters for power analysis
effect_size = 0.5
alpha = 0.05
power = 0.8

# Conduct power analysis
nobs = smp.tt_ind_solve_power(effect_size=effect_size, alpha=alpha, power=power)

print("Sample size: ", round(nobs))

In this example, we set the effect size to 0.5 (a moderate effect), the alpha level to 0.05 (a commonly used significance level), and the power to 0.8 (a commonly used power level). The tt_ind_solve_power function is then used to calculate the sample size required for a two-sample t-test with the specified parameters. The code returns the required sample size, rounded to the nearest integer.
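
If the effect size is not known in advance, Cohen's d can be estimated from pilot data as the difference in means divided by the pooled standard deviation. A small sketch with made-up pilot samples:

import numpy as np

# Hypothetical pilot data for two groups
group1 = np.random.normal(10, 2, size=30)
group2 = np.random.normal(11, 2, size=30)

# Cohen's d: difference in means divided by the pooled standard deviation
n1, n2 = len(group1), len(group2)
pooled_std = np.sqrt(((n1 - 1) * np.var(group1, ddof=1) +
                      (n2 - 1) * np.var(group2, ddof=1)) / (n1 + n2 - 2))
cohens_d = (np.mean(group1) - np.mean(group2)) / pooled_std

print(f"Cohen's d: {cohens_d:.2f}")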

Generalized Linear Models

GLMs are a type of statistical model that extends the linear regression model to handle non-normal and non-continuous response variables. The basic idea behind GLMs is to use a link function and a probability distribution function to model the relationship between the response variable and the predictor variables.

Fig.8 — GLM Model for Regression (Examples)

The three key components of a GLM are:

  1. The random component: This describes the probability distribution of the response variable. The most commonly used probability distributions in GLMs are the normal, binomial, and Poisson distributions.
  2. The systematic component: This describes the relationship between the response variable and the predictor variables. It is modeled through a linear combination of the predictor variables and the link function.
  3. The link function: This is a function that links the mean of the response variable to the linear predictor in the systematic component. It is chosen based on the nature of the response variable and the distribution being used.

Some examples of GLMs and their associated probability distributions are:

  1. Linear regression: Normal distribution
  2. Logistic regression: Binomial distribution
  3. Poisson regression: Poisson distribution

The process of fitting a GLM involves specifying the random component, choosing an appropriate link function, and estimating the parameters using maximum likelihood estimation.

Differences Between GLM Distributions

Python has several packages for fitting GLMs, including statsmodels and scikit-learn. Here is an example of fitting a logistic regression model using statsmodels:

import statsmodels.api as sm

# Load data
data = sm.datasets.get_rdataset("TitanicSurvival", "carData").data

# Encode the yes/no outcome as 1/0 for the binomial family
data["survived"] = (data["survived"] == "yes").astype(int)

# Fit logistic regression model
model = sm.formula.glm("survived ~ age + sex", data=data, family=sm.families.Binomial()).fit()

# Print model summary
print(model.summary())

The response variable is binary (survived or not survived), so the binomial distribution is used. The link function is the logistic function, which is the default for the binomial distribution. The output of the model.summary() function gives information about the estimated coefficients, standard errors, and significance tests for each predictor variable.
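
Since the logistic link works on the log-odds scale, the estimated coefficients are often easier to read after exponentiating them into odds ratios. A short follow-up, assuming the fitted model object from the snippet above:

import numpy as np

# Convert log-odds coefficients into odds ratios
odds_ratios = np.exp(model.params)
print(odds_ratios)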

Let me explain one regression problem using GLM, so let’s checkout Poisson Regression:

Poisson regression is a type of generalised linear model (GLM) used to model count data. It is especially useful when the response variable is a count, such as the number of times an event happens in a given time period. Poisson regression assumes a Poisson distribution for the response variable and models the logarithm of its expected value as a linear function of the predictor variables.

The Poisson distribution is a probability distribution that describes how many times an event happens in a given time interval given the average rate of occurrence. It is defined by a single parameter called lambda, which represents the average number of events per unit of time. The Poisson probability mass function gives the probability of observing k events in a given time interval:

P(k) = (lambda^k * e^(-lambda)) / k!
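
As a quick sanity check of this formula, the hand-computed probability can be compared against scipy.stats.poisson.pmf (the values lambda = 2 and k = 3 below are arbitrary):

import math
from scipy.stats import poisson

lam, k = 2, 3

# P(k) = (lambda^k * e^(-lambda)) / k!
manual = (lam**k * math.exp(-lam)) / math.factorial(k)

print(manual)               # ≈ 0.180
print(poisson.pmf(k, lam))  # should match the manual calculation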

In Poisson regression, the response variable y is assumed to follow a Poisson distribution with a mean of lambda, where lambda is modeled as a linear function of the predictor variables:

E(y) = lambda = exp(b0 + b1x1 + b2x2 + … + bpxp)

where b0 is the intercept, b1, b2, …, bp are the regression coefficients, and x1, x2, …, xp are the predictor variables.

One important assumption of Poisson regression is that the variance of the response variable is equal to its mean. This is known as the equidispersion assumption. If the data violate this assumption, a negative binomial regression model may be more appropriate.

import statsmodels.api as sm
import pandas as pd

# Load data
data = pd.read_csv("data.csv")

# Fit Poisson regression model (adding an intercept to the predictors)
X = sm.add_constant(data[["x1", "x2", "x3"]])
model = sm.GLM(data["y"], X, family=sm.families.Poisson()).fit()

# Print model summary
print(model.summary())
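
One rough way to check the equidispersion assumption after fitting is to divide the Pearson chi-square statistic by the residual degrees of freedom; values well above 1 suggest overdispersion. A brief sketch, assuming the fitted model from the snippet above:

# A ratio near 1 is consistent with equidispersion;
# much larger values point towards overdispersion
dispersion = model.pearson_chi2 / model.df_resid
print(f"Dispersion: {dispersion:.2f}")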

Bayesian Statistics and Inference

Bayesian statistics is a branch of statistics that deals with updating beliefs or probabilities in light of new evidence. It allows prior knowledge or beliefs to be incorporated into statistical inference and decision-making.

Bayesian statistics employs Bayes’ theorem, a mathematical formula that describes the relationship between two events’ conditional probabilities. According to Bayes’ theorem, the probability of a hypothesis (H) given some observed data (D) is proportional to the likelihood of the data given the hypothesis multiplied by the hypothesis’s prior probability.

Fig.9 — Bayesian Statistics


P(H|D) = P(D|H) * P(H) / P(D)

where:

  • P(H|D) is the posterior probability of hypothesis H given data D
  • P(D|H) is the likelihood of the data D given hypothesis H
  • P(H) is the prior probability of hypothesis H
  • P(D) is the marginal probability of data D

The snippet below fits a mixed-effects (hierarchical) linear regression to the iris dataset using statsmodels:

import pandas as pd
import statsmodels.api as sm

# Load the iris dataset
iris = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
                   names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'species'])

# Fit a mixed-effects linear regression model with species-level random effects
random = {'sepal_width': '1 + petal_width', 'petal_length': '1 + petal_width*sepal_length'}
model = sm.MixedLM.from_formula('sepal_length ~ petal_length + C(species)',
                                data=iris, re_formula='1', vc_formula=random, groups='species')

# Print the summary of the model
result = model.fit()
print(result.summary())

Bayesian statistics has several advantages over traditional frequentist statistics, including the ability to incorporate prior knowledge, estimate the probability of hypotheses directly, and update beliefs as new data become available. Bayesian statistics has many applications, including medical research, social science, finance, and engineering. One example is Bayesian inference for drug development, which uses prior knowledge and data from clinical trials to estimate the likelihood that a new drug will be effective.

Bayesian inference is a statistical method for updating the probability of a hypothesis or model based on new evidence or data. It is named after the mathematician Thomas Bayes and involves the application of Bayes’ theorem, which relates the conditional probabilities of events.

A prior probability distribution represents prior knowledge or beliefs about the hypothesis or model in Bayesian inference. Using Bayes’ theorem, this prior distribution is converted to a posterior distribution based on the observed data. Given the observed data, the posterior distribution represents the hypothesis or model’s updated probability.
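
A small worked sketch of this prior-to-posterior update, using a Beta prior on a coin's probability of heads and binomial data (the prior parameters and observed counts below are made up for illustration):

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import beta

# Prior belief about the probability of heads: Beta(2, 2)
a_prior, b_prior = 2, 2

# Observed data: 7 heads in 10 flips
heads, flips = 7, 10

# Beta prior + binomial likelihood gives a Beta posterior (conjugacy)
a_post = a_prior + heads
b_post = b_prior + (flips - heads)

p = np.linspace(0, 1, 200)
plt.plot(p, beta.pdf(p, a_prior, b_prior), label='Prior')
plt.plot(p, beta.pdf(p, a_post, b_post), label='Posterior')
plt.legend()
plt.show()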

Markov Chain Monte Carlo (MCMC)

Markov Chain Monte Carlo (MCMC) methods are a family of algorithms used to sample from complex probability distributions. They are widely used in Bayesian inference to estimate the posterior distributions of model parameters. In this section, we will go over MCMC methods and provide an example Python implementation.

To generate a sequence of samples from a complex distribution that is difficult to sample directly, MCMC methods are used. The fundamental idea behind MCMC methods is to build a Markov chain whose stationary distribution is the target distribution from which we want to sample. We can generate a sequence of samples from the target distribution by simulating this Markov chain for a long enough time.

Fig.10 — Markov Chain Monte Carlo (MCMC)

A Markov chain is a sequence of random variables X1, X2, …, Xn, where each Xi is drawn from a probability distribution that depends only on the previous state, X{i-1}. The probability of transitioning from one state to another is given by the transition kernel, K(x{i-1}, xi). The Markov chain is said to be reversible if the following detailed balance condition holds:

pi(x{i-1}) * K(x{i-1}, xi) = pi(xi) * K(xi, x{i-1})

where pi(x) is the stationary distribution of the Markov chain. This means that the probability of transitioning from x{i-1} to xi is the same as the probability of transitioning from xi to x{i-1}.
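
To make the idea of a stationary distribution concrete, here is a minimal two-state sketch: pi satisfies pi = pi * K, and it can be found numerically by applying the transition matrix repeatedly (the transition probabilities below are arbitrary):

import numpy as np

# Transition matrix K for a two-state Markov chain
K = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# Start from an arbitrary distribution and iterate pi <- pi @ K
pi = np.array([1.0, 0.0])
for _ in range(1000):
    pi = pi @ K

print("Stationary distribution:", pi)  # approximately [0.833, 0.167]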

The Metropolis-Hastings algorithm is a popular MCMC algorithm used for sampling from a complex distribution. The algorithm works by proposing a new state y based on the current state x, and accepting or rejecting the proposed state based on the acceptance probability:

alpha(x, y) = min(1, (pi(y) * q(x|y)) / (pi(x) * q(y|x)))

where pi(x) is the target distribution we want to sample from, q(x|y) is the proposal distribution for transitioning from y to x, and q(y|x) is the proposal distribution for transitioning from x to y.

If the proposed state y is accepted, then x is updated to y. If the proposed state is rejected, then x remains the same. The acceptance probability ensures that the detailed balance condition is satisfied, and the resulting Markov chain has pi(x) as its stationary distribution.

Here is an example Python implementation of the Metropolis-Hastings algorithm for sampling from a bimodal target distribution (a mixture of two Gaussians):

import numpy as np
import matplotlib.pyplot as plt

# Define the target distribution
def target(x):
    return np.exp(-0.5*(x-2)**2) + np.exp(-0.5*(x+2)**2)

# Define the proposal distribution
def proposal(x, sigma=1):
    return np.random.normal(x, sigma)

# Initialize the Markov chain
x = 0
samples = []

# Run the algorithm
for i in range(10000):
    # Propose a new state
    y = proposal(x)

    # Compute the acceptance probability
    alpha = min(1, target(y)/target(x))

    # Accept or reject the proposed state
    if np.random.rand() < alpha:
        x = y

    # Add the current sample to the list of samples
    samples.append(x)

# Plot the samples
plt.hist(samples, bins=50, density=True, label='Estimated distribution')
plt.legend()
plt.show()

After plotting the histogram of the samples, the next step would be to estimate the posterior distribution of the parameter(s) of interest. There are different methods to perform this estimation, such as kernel density estimation, Gaussian mixture models, or simply fitting a parametric distribution (e.g., Gaussian distribution) to the samples.

Here is an example of how to estimate the posterior distribution using kernel density estimation:

from scipy.stats import gaussian_kde

# Estimate the posterior distribution using kernel density estimation
kde = gaussian_kde(samples)

# Define the range of the x-axis for the plot
x_min = min(samples)
x_max = max(samples)
x_range = np.linspace(x_min, x_max, 1000)

# Plot the posterior distribution
plt.plot(x_range, kde(x_range), label='Posterior distribution')
plt.legend()
plt.show()

After estimating the posterior distribution, we can use it to compute different statistics of interest, such as the mean, standard deviation, or quantiles. For example, to compute the 95% credible interval of the parameter, we can use the percentile function from numpy as follows:

# Compute the 95% credible interval
lower_bound = np.percentile(samples, 2.5)
upper_bound = np.percentile(samples, 97.5)
print(f"95% credible interval: [{lower_bound}, {upper_bound}]")

There are also different diagnostics that can be used to assess the performance of the MCMC algorithm and check for convergence. Some common diagnostics include trace plots, autocorrelation plots, and the Gelman-Rubin statistic. These diagnostics are beyond the scope of this article, but you can find more information about them in the literature.

Conclusion

Ultimately, statistics is a vital field of study that provides a framework for understanding and analysing data. Probability theory is the foundation of statistical inference, which involves drawing conclusions about a population based on a sample.

  1. Inferential statistics are used to make predictions and test hypotheses, whereas descriptive statistics are used to summarise and describe data.
  2. Generalized Linear Models (GLMs) are widely used in fields such as economics, biology, and epidemiology because they provide a flexible framework for modelling a wide range of data types.
  3. Bayesian statistics and inference provide a powerful alternative to traditional frequentist methods, allowing for more nuanced and realistic uncertainty modelling.
  4. Methods such as Markov Chain Monte Carlo (MCMC) provide a powerful tool for simulating complex probability distributions and can be used to estimate posterior distributions that would otherwise be intractable.

For Coding and Examples, Checkout my GitHub Profile.

If you like the article and would like to support me make sure to:

👏 Clap for the story (100 Claps) and follow me 👉🏻Simranjeet Singh

📑 View more content on my Medium Profile

🔔 Follow Me: LinkedIn | Medium | GitHub | Twitter | Telegram

🚀 Help me in reaching to a wider audience by sharing my content with your friends and colleagues.

🎓 If you want to start a career in Data Science and Artificial Intelligence and you do not know how? I offer data science and AI mentoring sessions and long-term career guidance.

📅 Consultation or Career Guidance

📅 1:1 Mentorship — About Python, Data Science, and Machine Learning

Book your Appointment
