1. Introduction to Statistical Methods in AI — Overview

Yash Bhaskar
Sep 22, 2023


Introduction

In the realm of machine learning, models are the heart and soul of data-driven decision-making. They serve as tools that help us make sense of complex data and make predictions or draw insights. These models can be broadly classified into two categories: parametric and non-parametric. Additionally, data can be categorized as fully observed or partially observed, and tasks can be divided into predictive and descriptive analysis. In this article, we will delve into these fundamental aspects of model-based machine learning to gain a deeper understanding of how they shape our approach to solving real-world problems.

Models in Machine Learning

Machine learning models come in various forms, but they can be categorized into two main types: parametric and non-parametric.

Parametric Models

Parametric models make certain assumptions about the data distribution, such as linearity or normality. These assumptions simplify the learning process but can lead to high bias and underfitting if the assumptions do not hold true. Some common examples of parametric models include linear regression, logistic regression, and Naive Bayes.

Benefits of Parametric Models:

1. Simplicity: These models are easier to interpret and understand.
2. Speed: Parametric models are computationally efficient and can learn from data quickly.
3. Less Data Dependency: They do not require as much training data and can perform reasonably well even when the assumed functional form is only an approximate fit to the data.

Limitations of Parametric Models:

1. Constraint: The chosen functional form limits their ability to represent complex relationships.
2. Limited Complexity: They may struggle with highly intricate data patterns.
3. Poor Fit: In practice, they may not accurately capture the true data distribution.

Non-Parametric Models

Non-parametric models make fewer assumptions about data distributions and offer greater flexibility in capturing complex relationships. However, they can suffer from high variance and overfitting, especially when trained on small datasets. Examples of non-parametric models include k-Nearest Neighbors and decision trees.

Benefits of Non-Parametric Models:

1. Flexibility: They can fit a wide range of functional forms, making them versatile.
2. No Assumptions: Non-parametric models make minimal assumptions about the underlying data.
3. Performance: They can yield higher performance when the data is complex.

Limitations of Non-Parametric Models:

1. More Data Required: These models often need larger training datasets to estimate complex mappings.
2. Slower Training: Due to their flexibility, non-parametric models can be slower to train.
3. Overfitting Risk: There’s a higher risk of overfitting, and explaining predictions can be challenging.

Data Types

Data can be categorized into two main types: fully observed and partially observed.

Fully Observed Data

Fully observed data contain complete information for all variables in the dataset, with no missing values. This type of data simplifies analysis and modeling, aligning with the assumptions of many machine learning algorithms.

Partially Observed Data

Partially observed data have missing values for some variables in certain data points. Handling such data requires strategies like imputation or using specialized models designed to work with incomplete information. Partially observed data is a common challenge in data analysis and machine learning.

Tasks in Machine Learning

Machine learning tasks can be broadly classified into two categories: predictive analysis and descriptive analysis.

Predictive Analysis

Predictive analysis involves building models that make predictions based on historical data. Two common tasks in predictive analysis are regression and classification.

1. Regression: This supervised learning task aims to predict a continuous numeric value based on input features. Examples include predicting house prices or stock prices.

2. Classification: In classification, also a supervised task, the goal is to assign data points to predefined categories or classes. It’s used in tasks like spam email detection or image classification.
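
As a concrete illustration, here is a minimal scikit-learn sketch of both tasks on synthetic data (scikit-learn and NumPy are assumed to be available; the data and targets below are made up purely for illustration):

```python
# Regression vs. classification on synthetic data; a sketch assuming scikit-learn.
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # 100 samples, 3 input features

# Regression: predict a continuous numeric target (e.g., a price).
y_reg = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=100)
reg = LinearRegression().fit(X, y_reg)
print("predicted value:", reg.predict(X[:1]))

# Classification: assign each sample to a class (e.g., spam / not spam).
y_clf = (X[:, 0] + X[:, 2] > 0).astype(int)
clf = LogisticRegression().fit(X, y_clf)
print("predicted class:", clf.predict(X[:1]))
```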

Descriptive Analysis

Descriptive analysis focuses on summarizing and understanding data patterns without making predictions. It includes tasks like dimensionality reduction, density estimation, and clustering.

1. Dimensionality Reduction: This technique reduces the number of input features while preserving important information, aiding in simplifying complex datasets.

2. Density Estimation: Density estimation helps in understanding the underlying probability distribution of data, useful for generating new samples from the learned distribution.

3. Clustering: Clustering groups similar data points together, revealing hidden patterns and structures within the data.
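
As a rough sketch of two of these tasks, the snippet below (assuming scikit-learn; the data is synthetic) reduces a 10-dimensional dataset to 2 dimensions with PCA and then groups the result with k-means clustering:

```python
# Dimensionality reduction (PCA) followed by clustering (k-means) on synthetic data.
# A sketch assuming scikit-learn and NumPy.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))                        # 200 samples, 10 features

X_2d = PCA(n_components=2).fit_transform(X)           # keep the 2 main directions
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_2d)

print(X_2d.shape)                                     # (200, 2)
print(np.bincount(labels))                            # samples per cluster
```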

Workflow of an ML Problem

The Initial Experiment

  • The objective of the task can be anything, such as detecting spam emails, predicting stock values, or driving a car safely without human intervention.
  • Once the task is clear, collect the data relevant to the objective. In the case of spam detection, we collect past emails, both legitimate messages and mails reported as spam.
  • Data is often not directly usable and may not always be digital in nature; it may be multimodal (coming from several modes or sources). Hence the needed data has to be filtered from the raw data and transformed into numerical form before it can be fed into learning algorithms.
  • The next steps involve choosing a learning algorithm and training the model on the training dataset.
  • After the model is trained, it can be evaluated based on the performance and accuracy of its output.
  • The model can then be published and operated.
  • Based on the observations and inferences from the model, the objective may be redefined and more data collected, which brings us back to square one. Hence, this workflow runs in a cycle.

With each cycle, tweak the data sample according to previous results to make concrete inferences.

Data Collection

Sources of Data

Data is collected from various sources depending on the task and nature of data.

Example Task: Detect spam email
Source of Data: Emails from multiple inboxes labeled as spam or not spam.

Example Task: Predict values of a stock
Source of Data: Previous few years’ data of stock prices of a company.

Example Task: Predict the effect of advertising on sales
Source of Data: Statistics of past advertising campaigns and the corresponding sales figures.

Example Task: Drive a car safely without human intervention
Source of Data: Data collected from various sensors (LIDAR, Camera, GPS, ultrasonic, etc.) installed in the car.

Example Task: Translate text from one language to another
Source of Data: Paired sentences (S1, S2): S1 from language 1 and its translation S2 in language 2.

  • A human domain expert might be required as a source of data, e.g., to translate sentences.
  • Raw data might not always be in digital format, e.g., extracting information from a handwritten cash memo.
  • Data may be multi-modal and may need to be synchronized.
    — Multi-modal: Input data might be in multiple formats, e.g., text, images, audio, etc.
    — Synchronized: Input data from multiple sources might need to be synchronized.

What Data to Collect?

- Not all of the raw data will be relevant. It is necessary to extract the relevant information from it. For example, in the case of spam email prediction, we need to extract the email text content from the raw data and ignore all the metadata about senders, receivers, etc.
- Raw data is often not directly usable.
- Filter out the required data.
- Transform all data to numerical data.

How (much) Data to Collect?

- Raw data might be too little in quantity, which makes it difficult for learning algorithms to generalize.
- On the other hand, if there is too much raw data, the available computation and storage might not be able to accommodate it.

Sampling Techniques

A sampling frame is the list of members of the target population who can contribute to the research. Even though the target population is much larger than the sample, we assume the sample is capable of representing the entire population.

- Sample as diverse a dataset as possible to avoid biases.
- Collect the right amount of data; too much data might include irrelevant information and mislead inferences, apart from causing storage problems, while too little data may not support reliable conclusions.
- Data collected may be multi-modal and needs to be synchronized. For example, a self-driving car may rely on data being fed to it via cameras, RADAR, LiDAR, ultrasonic sensors, etc., and all of it must be synchronized to successfully drive the car without human aid.
- Raw data needs to be filtered to remove irrelevant data and then transformed for evaluation.

Data Preparation

Taxonomy of Data Variables

Source : IIIT-H — SMAI Lecture

Data Type:

All the collected data can be classified into two types.

1. Quantitative:

It is the data type that can be counted, measured, and compared. Quantitative data can be either discrete or continuous.

Examples:
— There are 3 cones; cone 1 has two scoops. Here, the number of cones and scoops is discrete, hence it is an example of discrete data.
— Cone 3 weighs 79.4 g. Cone 2 is at 8.3°C. Here we have mentioned the weight and temperature of the cones, which are examples of continuous data.
— Temperature of surroundings, humidity are examples of continuously varying quantities.
— Number of CPU cores, courses taken in a semester are examples of discretely varying quantities.

Quantitative data is meaningful only when the number is paired with what it measures. For example, “13 trees” is meaningful, but ‘13’ and ‘trees’ individually convey little and are of no use to us.

2. Qualitative:

Qualitative data describes attributes or labels rather than measurable quantities. It can be further classified into 3 types, namely:

a) Binary:
The attributes which have only two distinct values are called Binary. Examples include Yes/No, Spam/Not Spam, -1/1, and 0/1. Binary attributes cannot be compared.

b) Nominal:
The attributes which have a finite set of distinct values are called Nominal attributes. These can’t be compared. Examples include a set of distinct colors, pin codes, and watch brands.

Food for thought:
— A set of distinct colors might not be finite (at least theoretically). How would you address this issue?
— Why is Pincode Nominal? Can we use these codes as numbers?

c) Ordinal:
The attributes which have a finite set of distinct values, where comparison between values is meaningful and there is a particular ordering among them, are called Ordinal. For example, letter grades (A, A-, B, B-, …), clothing sizes, and the Likert scale all have a clear ordering among their values.

Food for Thought:
— Why should we not treat Nominal Data as Ordinal? (Think about the penalty on mistakes during learning.)
— What changes when Ordinal Data is treated as Nominal? (Think about comparison/association, encoding, etc.)
— Which of these mistreatments can blow up your data size?

From Qualitative to Quantitative

Ultimately, for real-world scenarios and tasks, all data — even qualitative — must become quantitative so that mathematical operations/comparisons can be performed on it. To do this, we take qualitative data and encode it into numbers.

1. For Binary data:

a) If we have a column in a dataset that contains Yes and No’s, we can replace each Yes with 1 and each No with 0.

b) Another way to encode in the above case is to use 1 and -1. Depending on the type of data present, one of the ways of encoding can be better or more intuitive than the other. For example, if we want to count the upvotes and downvotes, a 1/-1 encoding is superior to 1/0.

2. In the case of Ordinal data, the scale is user-dependent. Depending on the quantities being compared, one can decide if polarity in the scale is required.

a) Choosing between nominal and ordinal when comparing certain data can be crucial when constraints such as memory and/or storage come into the picture. This also does not mean that if one method consumes more memory, the other method is automatically better.

3. Nominal data can be directly encoded using numbers because it takes a finite number of distinct values.

a) If we have 4 distinct cars, we can encode them as 1, 2, 3, and 4.
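
A minimal pandas sketch of these encodings (the column names, values, and mappings below are made up for illustration):

```python
# Encoding qualitative columns as numbers; a sketch assuming pandas.
# The DataFrame, column names, and mappings are illustrative only.
import pandas as pd

df = pd.DataFrame({
    "is_spam": ["Yes", "No", "Yes"],      # binary attribute
    "vote": ["up", "down", "up"],         # binary attribute where +1/-1 is more natural
    "size": ["S", "L", "M"],              # ordinal attribute
    "car": ["BMW", "Audi", "Tesla"],      # nominal attribute
})

df["is_spam_enc"] = df["is_spam"].map({"Yes": 1, "No": 0})
df["vote_enc"] = df["vote"].map({"up": 1, "down": -1})
df["size_enc"] = df["size"].map({"S": 1, "M": 2, "L": 3})   # ordering is meaningful here
df["car_enc"] = df["car"].astype("category").cat.codes      # imposes an arbitrary order
print(df)
```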

However, this method of encoding finite, distinct values with numbers poses a problem, as numbers have a natural order. Ideally, all entities in this encoding are equally important, but the model may become biased by giving higher preference to the entities encoded as larger numbers. To solve this issue, we use one-hot encoding.

One-Hot Encoding

Source : IIIT-H — SMAI Lecture

One-Hot Encoding can be defined as a process of transforming categorical variables into numerical features that can be fed as input to learning algorithms.

Essentially, it is a binary vector representation of a categorical variable where the length of each vector is equal to the number of distinct categories in the variable; and in this vector, all values would be 0 except the ith value which will be 1, representing the ith category of the variable.

In the above figure, we have 3 different categories for the color variable. Hence we create 3 distinct vectors and 3 distinct columns in the dataset. Encoded vectors are [1,0,0], [0,1,0], and [0,0,1] for Red, Green, and Yellow, respectively.
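
A minimal sketch of the same color example, assuming pandas (scikit-learn's OneHotEncoder gives an equivalent result):

```python
# One-hot encoding the color variable from the figure; a sketch assuming pandas.
import pandas as pd

df = pd.DataFrame({"color": ["Red", "Green", "Yellow", "Green"]})
one_hot = pd.get_dummies(df["color"], prefix="color", dtype=int)
print(one_hot)
# Columns come out in alphabetical order (Green, Red, Yellow):
#    color_Green  color_Red  color_Yellow
# 0            0          1             0
# 1            1          0             0
# 2            0          0             1
# 3            1          0             0
```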

Food for thought: Encodings that are more Natural are better for Analysis and Visualization. Therefore, it is helpful to know about the domain while encoding.

When to use One-Hot Encoding:

One-hot Encoding should be used when:
— The categorical features in the dataset are not ordinal, i.e., there is no natural ordering to their categories.

Data Imputation

- Data imputation is a technique for replacing missing data with a substitute value in order to retain most of the information in the dataset.
- These techniques are used because dropping every record with missing values is not always feasible: it can shrink the dataset considerably, which not only risks biasing it but can also lead to incorrect analysis.

Some methods to fill the missing data are:

1. Filling with a suitable constant depending on the domain of the variable.
2. Filling with a statistical measure like mean, median, or mode.
3. Use a learning method that can account for missing data.

If we have quantitative data that is missing, we can fill it with the mean or median. But in the case of missing nominal data, we can fill it with the most frequent value (mode).
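
A small sketch of these strategies, assuming pandas and scikit-learn (the tiny DataFrame and its column names are made up for illustration):

```python
# Filling missing values: mean/median for quantitative data, mode for nominal data.
# A sketch assuming pandas and scikit-learn; the DataFrame is illustrative only.
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({
    "age": [25.0, np.nan, 31.0, 40.0],              # quantitative feature
    "city": ["Delhi", "Mumbai", np.nan, "Delhi"],   # nominal feature
})

filled = df.copy()
filled["age"] = filled["age"].fillna(filled["age"].median())      # or .mean()
filled["city"] = filled["city"].fillna(filled["city"].mode()[0])  # most frequent value

# The same idea with scikit-learn's SimpleImputer:
age_imputed = SimpleImputer(strategy="median").fit_transform(df[["age"]])
city_imputed = SimpleImputer(strategy="most_frequent").fit_transform(df[["city"]])
print(filled)
```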

Resources: Data Imputation, Handling Missing Data

Food for thought:
— Which statistical measure amongst Mean/Median would be a better choice for filling missing data?
— How would you deal with a situation where you have more noise than signal? (Think along the lines of Anomaly Detection)
— Think about modeling sin x (or any known function) via Machine Learning for Sanity Check of that learning method.

Data Sample Representations

Source : IIIT-H — SMAI Lecture

Data representation depicts the relationships between facts, ideas, information, and concepts, often in the form of a diagram. It is a fundamental learning strategy that is simple and easy to understand.

As seen in the figure above, a dataset consists of:
— Label: the output value for a sample.
— Feature / Attribute: an input value that describes a characteristic of a sample.
— Sample: one observation in a dataset. Multiple samples together form a dataset.

Source : IIIT-H — SMAI Lecture

These samples can be represented in:
![Scalar, Vector, Matrix, and Tensor](https://cdn.jsdelivr.net/gh/debanga/GPT-3/assets/scalar-vector-matrix-tensor.png)

- Scalar: holds a single number.
- Vector: a collection of numbers arranged along a single axis.
- Matrix: a collection of vectors with two axes, often referred to as rows and columns; it can be visualized as a rectangular grid of numbers.
- Tensor: a generalization of matrices to any number of axes (dimensions).

All these samples together form a dataset, which can be represented in tabular form, as 2-D images, videos, graphs, etc.
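
In NumPy terms, these four representations are simply arrays of increasing rank (a minimal sketch):

```python
# Scalar, vector, matrix, and tensor as NumPy arrays of increasing rank.
import numpy as np

scalar = np.array(3.14)               # 0 axes -> shape ()
vector = np.array([1.0, 2.0, 3.0])    # 1 axis  -> shape (3,)
matrix = np.array([[1, 2], [3, 4]])   # 2 axes  -> shape (2, 2)
tensor = np.zeros((2, 3, 4))          # 3 axes  -> shape (2, 3, 4), e.g. a small image stack

for a in (scalar, vector, matrix, tensor):
    print(a.ndim, a.shape)
```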

Feature Extraction

Feature extraction is any algorithm that transforms raw data into features that can be used as input for a learning algorithm.

Bag of Words

Bag of Words is a feature extraction method that converts text data into numerical feature vectors, where each number is the count of a word (token) in a document. It is generally used in NLP, but the same concept is also applied to images.
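
A minimal text example, assuming scikit-learn's CountVectorizer (the two toy documents are made up):

```python
# Bag of Words on two toy documents; a sketch assuming scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)          # sparse matrix of word counts

print(vectorizer.get_feature_names_out())
# ['cat' 'dog' 'log' 'mat' 'on' 'sat' 'the']
print(counts.toarray())
# [[1 0 0 1 1 1 2]
#  [0 1 1 0 1 1 2]]
```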

Source : IIIT-H — SMAI Lecture

In this concept, the same idea is applied to images. We detect features and extract descriptors from each image in the dataset and build a visual dictionary. Next, we cluster the descriptors (we can use K-Means, DBSCAN, or another clustering algorithm); the center of each cluster serves as a visual word in the dictionary's vocabulary. Finally, for each image, we build a frequency histogram counting how often each visual word appears in that image. Those histograms are our bag of visual words (BOVW).


Hierarchical Data Representation

  • Feature-based hierarchical data representations refer to a way of organizing and representing data in a hierarchical structure, where each level of the hierarchy corresponds to different levels of abstraction of features present in the data.
    — These features could be numerical values, categorical labels, or even more complex structures like images, audio, or text. The first step involves extracting these features from the raw data and using them to describe the data in a more organized manner.
    — A hierarchy is a system of organizing things or concepts in a tree-like structure, where each level of the hierarchy represents a different level of abstraction or detail. In the context of data representation, a hierarchical structure involves organizing the features into levels based on their relationships and dependencies.
    — At the top level of the hierarchy, you might have general features that provide a broad overview of the data. As you move down the hierarchy, each subsequent level might introduce more specific and detailed features, resulting in a refined understanding of the data. These levels can represent different aspects or dimensions of the data.
Source : IIIT-H — SMAI Lecture

Examples:

1. In computer vision, images are often represented using a hierarchy of features. At the lower levels, you might have raw pixel values as features. Moving up the hierarchy, features could be edges, textures, shapes, and eventually more complex objects or scenes.

2. In natural language processing, textual data can be represented hierarchically. At lower levels, you might have individual words as features. As you move up, features could be phrases, syntactic structures, and eventually semantic representations.

3. In genomics, genetic data can be represented hierarchically. At lower levels, individual nucleotides could be features. As you move up, features could be codons, genes, functional elements, and finally, entire genomic pathways.

Data — A Probabilistic Perspective

- A probabilistic view of data provides the basis for statistical learning theory, which draws heavily from probability theory, optimization theory, and computational complexity theory. It plays a fundamental role in guiding the design of learning algorithms, providing insights into the trade-offs and limitations of various approaches, and advancing our understanding of the theoretical aspects of machine learning.
- "Domain" refers to the set of all possible outcomes or elements that a random phenomenon can produce.
- "Data" refers to the observed instances or measurements of the random variables in the domain. In other words, data is what we collect or observe from the real world. Each data point is an instantiation of one or more random variables.

Basic Data Transformations

Quantization

Quantization in machine learning involves converting continuous data (like real numbers) into discrete representations. This is useful for making models more efficient and suitable for devices with limited resources.

1. Continuous to Discrete (‘Rounding off’):

Source : IIIT-H — SMAI Lecture

(a) Continuous Data: Machine learning often uses continuous numbers (e.g., decimals) which can be resource-intensive.

(b) Discrete Representation: Quantization turns continuous values into limited discrete values. It reduces precision, like using fewer bits to represent a number.

(c) Quantization Levels: The number of discrete values used determines the detail level. More levels allow more detail but need more bits.

(d) Quantization Error: Precision loss during quantization is called quantization error. The goal is to minimize this while reducing precision.

(e) Quantization Schemes: Different methods for assigning discrete values: • Uniform: equally spaced intervals. • Non-Uniform: uneven intervals matched to the data distribution. • Vector: grouping similar values together. (A uniform-quantization sketch in NumPy follows this list.)

(f) Training: You can teach models to handle quantization during training, adding controlled noise.

(g) Post-Training: You can quantize a trained model afterward by mapping floating-point values to discrete ones.

(h) Fine-Tuning: After quantization, fine-tuning can recover some accuracy loss.

(i) Evaluation and Optimization: Quantized models are evaluated and optimized for desired performance. Remember, quantization balances efficiency with model accuracy, and the choice depends on the application and resource limits.
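
A minimal NumPy sketch of uniform quantization (the value range [0, 1] and the choice of 8 levels are arbitrary):

```python
# Uniform quantization of continuous values into a fixed number of discrete levels.
# A sketch assuming NumPy; the [0, 1] range and 8 levels are illustrative choices.
import numpy as np

x = np.random.default_rng(0).uniform(0.0, 1.0, size=5)   # continuous data in [0, 1]
levels = 8
step = 1.0 / (levels - 1)

x_quantized = np.round(x / step) * step                   # snap to the nearest level
quantization_error = np.abs(x - x_quantized)

print(x)
print(x_quantized)
print("max error:", quantization_error.max())             # bounded by step / 2
```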

2. Binary Quantization (‘Thresholding’):

Binary quantization, often referred to as "thresholding," is a specific form of quantization where continuous data is mapped to just two discrete values, usually 0 and 1. This extreme form of quantization can be useful in certain applications where only the presence or absence of a certain characteristic is of interest, and the fine details of the continuous data are not important. Here's how binary quantization or thresholding works:

(a) Threshold Selection: Choose a specific value (threshold) that separates your continuous data into two groups. Data points above the threshold are assigned one discrete value (e.g., 1), while those below are assigned another value (e.g., 0).

(b) Mapping Continuous to Binary: Compare each data point to the chosen threshold. If a data point is greater than the threshold, assign it the "presence" value (1); if it's less than or equal to the threshold, assign the "absence" value (0).

(c) Loss of Information: Binary quantization simplifies data to just two values, causing significant information loss. Fine details are discarded, and only the general distinction between above and below the threshold is preserved.

Binary quantization is simple and memory-efficient, but may not preserve fine details or nuanced information. Consider trade-offs and your application’s needs when choosing binary or traditional quantization.
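
A short NumPy sketch of thresholding (the 0.5 threshold is an arbitrary choice):

```python
# Binary quantization ("thresholding"); a sketch assuming NumPy.
import numpy as np

x = np.array([0.12, 0.48, 0.50, 0.73, 0.95])
threshold = 0.5
x_binary = (x > threshold).astype(int)    # 1 above the threshold, 0 at or below it

print(x_binary)                           # [0 0 0 1 1]
```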

Data Normalization

Data normalization (or feature scaling) is vital in machine learning. It standardizes feature ranges, enhancing algorithm performance. By maintaining consistent scales, normalization boosts convergence and fairness. It doesn’t alter distribution shape, just ensures similar scales for all features. Here’s a concise explanation of data normalization:

  1. Why Normalize: Features in a dataset often have different scales, units, and ranges. Some machine learning algorithms, like gradient descent-based methods, are sensitive to these variations, leading to slow convergence or biased results.
Source : IIIT-H — SMAI Lecture

2. Types of Normalization:

(a) Min-Max Scaling:
i. Choose a Range: Decide on the desired range for your scaled data. The common range is [0, 1], but you can adjust it based on your specific needs.
ii. Compute Min and Max: Calculate the minimum and maximum values for each feature (column) in your dataset.
iii. Apply Min-Max Scaling Formula: For each feature X, apply the Min-Max scaling formula:
X_scaled = (X - X_min) / (X_max - X_min)
where X_scaled is the scaled value of the feature X, X_min is the minimum value of the feature, and X_max is the maximum value of the feature.
iv. Repeat for All Features: Perform the Min-Max scaling for all features in your dataset.

(b) Standardization (Unit Normal Scaling):
i. For each feature (column) in the dataset, compute its mean (average) and standard deviation.
ii. Subtract the mean from each data point and then divide by the standard deviation. This centers the data around zero and scales it based on the spread of the data.
Z = (X - µ) / σ
µ = Σ(xi) / N
σ = sqrt(Σ(xi - µ)² / N)
There are more methods like Robust Scaling, Unit Vector Normalization which can be learned from here.
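
A minimal sketch of both methods described above, assuming NumPy and scikit-learn (the toy feature matrix is illustrative):

```python
# Min-Max scaling and standardization on a toy feature matrix.
# A sketch assuming NumPy and scikit-learn.
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 600.0]])

# Min-Max scaling: (X - X_min) / (X_max - X_min), column by column -> values in [0, 1]
X_minmax = MinMaxScaler().fit_transform(X)

# Standardization: (X - mean) / std, column by column -> zero mean, unit variance
X_std = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_std)
```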

3. When to Normalize: Normalization is essential when features have different scales or when using algorithms like k-nearest neighbors, support vector machines, and neural networks.

4. Benefits of Data Normalization:

(a) Improved Convergence: Normalization can help algorithms converge faster during training, especially those sensitive to scale.
(b) Balanced Influence: All features have a similar impact on the model, preventing some features from dominating others based solely on scale.
(c) Regularization: Some regularization techniques assume normalized data, promoting model stability.

5. Caution with Normalization:

(a) Domain Knowledge: Consider whether normalization is appropriate based on the problem domain and the algorithm being used.
(b) Outliers: Some normalization methods can be sensitive to outliers, which might require preprocessing.

Connect with me : https://www.linkedin.com/in/yash-bhaskar/
More Articles like this: https://medium.com/@yash9439
