Stories by Ranjit maity on Medium

Serverless vs JobCluster in Databricks: Choosing the Right Compute Strategy

Ranjit maity — Sat, 12 Jul 2025 16:18:00 GMT

👋 Introduction
Databricks gives you multiple ways to run your workloads, and one of the biggest choices you’ll make is between Serverless Compute and Job Clusters. 🤔 Each option comes with its own trade-offs in terms of performance, cost, scalability, and complexity.

In this post, we’ll explore both models, side-by-side, and help you understand which one best fits your Databricks workflows. Whether you’re running data pipelines, notebooks, or production jobs, this guide has you covered. ✅

☁️ What is Serverless Compute in Databricks?
Serverless in Databricks (specifically Serverless SQL or Serverless Jobs Compute) means Databricks manages the compute infrastructure on your behalf. You don’t deal with VMs, clusters, or provisioning. You just write code and run it, and the compute scales automatically. ⚡

Key Features:
- Fully managed by Databricks
- Automatically scales based on demand
- Very fast startup using warm pools
- Charged based on usage (no costs for idle time)
- Simplified security model (no VPC required)

🎯 Job Clusters Are Best For:
- Heavier ETL jobs
- ML model training workloads 🧠
- Data transformations requiring custom libraries
- Jobs that need advanced configuration or external connections

💡 Example (Job Cluster):
A PySpark job that trains an ML model on large datasets, using a high-memory cluster with custom Python packages.

When to Use Serverless in Databricks
Opt for Serverless when:

You want instant compute without managing clusters.
Jobs are lightweight or moderate in size.
You’re looking for cost savings (pay only when jobs run).
You want faster scheduling and low-latency start times.

When to Use Job Clusters in Databricks
Choose Job Clusters when:

You need to run data-intensive or ML jobs.
You require custom libraries or environment setups.
Jobs have longer runtimes or complex dependencies.
Security or networking configurations (e.g., VPC) matter.
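
The two checklists above can be sketched as a small decision helper. This is only an illustration of the rules of thumb in this post, not a Databricks API: the `JobProfile` class, its field names, and `choose_compute` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class JobProfile:
    """Hypothetical description of a workload; field names are illustrative."""
    heavy_data_or_ml: bool = False          # data-intensive or ML training job
    needs_custom_environment: bool = False  # custom libraries or environment setup
    long_running: bool = False              # long runtimes or complex dependencies
    needs_custom_networking: bool = False   # e.g. specific VPC or security settings


def choose_compute(job: JobProfile) -> str:
    """Return 'job_cluster' if any job-cluster criterion applies, else 'serverless'."""
    if (job.heavy_data_or_ml or job.needs_custom_environment
            or job.long_running or job.needs_custom_networking):
        return "job_cluster"
    return "serverless"


print(choose_compute(JobProfile()))                       # lightweight job
print(choose_compute(JobProfile(heavy_data_or_ml=True)))  # ML training job
```

In practice you would still benchmark cost and runtime for borderline jobs; a helper like this only encodes the starting point.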

Tip: Use Both Strategically
You don’t have to choose just one. A smart architecture uses serverless for orchestration and lightweight tasks, and job clusters for the heavy lifting.

Example: Use a Serverless Job to monitor data drift and trigger a retraining job that runs on a GPU-enabled job cluster.

Final Thoughts
In the world of Databricks, both Serverless and Job Clusters are powerful — but they serve different needs.

Serverless is perfect when simplicity, speed, and cost savings matter most.

Job Clusters shine when you need horsepower, customization, or full control.

The best part? Databricks makes it easy to switch between the two depending on your job’s nature. So, evaluate your workloads, benchmark costs, and pick the right tool for the job.

Pandas vs PySpark DataFrame: The Ultimate Guide for Data Enthusiasts

Ranjit maity — Sat, 05 Apr 2025 14:40:38 GMT

When your data grows bigger than your memory, who do you call — Pandas or PySpark?

Whether you’re a data scientist wrangling millions of records or an engineer building pipelines for enterprise-scale systems, choosing the right data manipulation tool can make or break your workflow. Two of the most popular options in the Python ecosystem — Pandas and PySpark — offer similar functionality but serve vastly different purposes.

In this blog, we’ll dive deep into the world of Pandas DataFrame and PySpark DataFrame, comparing them on everything from performance and scalability to ease of use and ecosystem compatibility. Let’s explore how they stack up and when you should choose one over the other.

📦 What Is a DataFrame, Anyway?

A DataFrame is a tabular data structure — think of it like an in-memory spreadsheet with labeled rows and columns. Both Pandas and PySpark offer DataFrame APIs, but the way they operate under the hood is fundamentally different.

🐼 Pandas DataFrame: The In-Memory Powerhouse
Pandas is the go-to tool for data analysis in Python. It offers intuitive APIs for reading, manipulating, and analyzing small to moderately sized datasets.

🔧 Features of Pandas:
Data processing in memory

Comprehensive capabilities for data manipulation

User-friendly API

Ideal for exploratory data analysis (EDA) and prototyping

When to Use Pandas:
Datasets that fit into your system memory (RAM)

Quick prototyping and analysis

Data cleaning and transformation

Personal projects and small-scale jobs

Limitations:
Not scalable for big data

Memory-intensive operations can crash your system

Single-threaded (mostly), so not optimized for parallel processing

PySpark DataFrame: Built for Big Data
PySpark is the Python API for Apache Spark, a distributed computing engine designed for processing massive datasets across clusters.

Features of PySpark:
Lazy evaluation (optimizes execution plans)

Distributed processing across multiple machines

Fault-tolerant

Supports SQL-like operations and integrates with Hadoop/Hive

Works well with big data stored in HDFS, Amazon S3, Google Cloud Storage, etc.

When to Use PySpark:
Data is too large for local memory

Need for distributed computing

Working in enterprise environments

Integrating with big data systems like Hadoop, Hive, Kafka, etc.

Limitations:
Steeper learning curve

More setup required (SparkSession, environment)

Slower than Pandas for small datasets

Debugging and performance tuning can be complex

Pandas vs PySpark: Head-to-Head Comparison

Benchmark Example
Let’s run a quick example to compare performance:

Dataset: 10 million rows of synthetic data
Pandas

import pandas as pd
import numpy as np
import time

df = pd.DataFrame({
    'id': range(10_000_000),
    'value': np.random.rand(10_000_000)
})

start = time.time()
mean_val = df['value'].mean()
end = time.time()

print("Pandas Mean:", mean_val)
print("Time Taken:", end - start)

PySpark

from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, rand
import time

spark = SparkSession.builder.appName("DataFrameTest").getOrCreate()

# 10 million random values, mirroring the Pandas example
df = spark.range(0, 10_000_000).withColumn("value", rand())

start = time.time()
# Spark is lazy: the data is generated and averaged only when this action runs
mean_val = df.select(avg("value")).first()[0]
end = time.time()

print("PySpark Mean:", mean_val)
print("Time Taken:", end - start)

Observation:

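Pandas is faster for small data and has less overhead. At 10 million rows the data still fits comfortably in memory, so Pandas will usually win this particular test.

PySpark performs better as data size increases beyond memory constraints.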
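
To judge whether a dataset "fits in memory", a rough back-of-the-envelope estimate is often enough. The sketch below assumes two 8-byte numeric columns (an int64 `id` and a float64 `value`, as in the Pandas benchmark) and ignores index and object overhead; `estimate_bytes` is a hypothetical helper, not a Pandas function.

```python
from typing import Sequence


def estimate_bytes(rows: int, bytes_per_column: Sequence[int]) -> int:
    """Rough size of a purely numeric table: rows x sum of column widths."""
    return rows * sum(bytes_per_column)


rows = 10_000_000
size = estimate_bytes(rows, [8, 8])  # int64 id + float64 value
print(f"{size / 1e6:.0f} MB")        # 160 MB
```

String columns and intermediate copies made during transformations can push real memory use well above an estimate like this, so leave generous headroom.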

Syntax Comparison
Here’s how some common operations look in both:

Reading Data

# Pandas
df = pd.read_csv("data.csv")

# PySpark
df = spark.read.csv("data.csv", header=True, inferSchema=True)

Filtering

# Pandas
df[df['age'] > 30]

# PySpark
df.filter(df.age > 30)

Group By

# Pandas
df.groupby('department')['salary'].mean()

# PySpark
from pyspark.sql.functions import mean
df.groupBy('department').agg(mean('salary'))

Real-World Use Cases

Can They Work Together?
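Absolutely! A common practice is to prototype in Pandas and scale with PySpark. You can convert between them easily, but keep in mind that toPandas() collects the entire dataset onto the driver, so use it only on data that fits in memory: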

# From Pandas to PySpark
spark_df = spark.createDataFrame(pandas_df)

# From PySpark to Pandas
pandas_df = spark_df.toPandas()

Final Thoughts: Which One Should You Choose?

👉 Use Pandas if:

Your dataset fits in memory

You want simplicity and speed for local data analysis

You’re in the exploratory phase

👉 Use PySpark if:

You’re working with large-scale or distributed data

You’re building production-level data pipelines

You need scalability and integration with big data tools

In a nutshell, Pandas is your bicycle, fast and nimble for city rides. PySpark is your freight train, powerful and capable for cross-country hauls. Pick the one that best suits your journey.

Both Pandas and PySpark have their strengths. Instead of viewing them as rivals, think of them as complementary tools. As a data professional, knowing when and how to use each will level up your game.

Let me know in the comments — what’s your go-to DataFrame tool and why?

Happy coding!

Difference Between INT and BIGINT

Ranjit maity — Tue, 18 Feb 2025 08:18:26 GMT

When working with databases, choosing the right data type is crucial for performance and storage optimization. Among numeric data types, INT (Integer) and BIGINT are widely used for storing whole numbers. While they may seem similar, they have distinct differences in terms of size, range, and performance.

1. Definition of INT and BIGINT

INT (Integer)

The INT data type is a 4-byte (32-bit) signed or unsigned integer used to store whole numbers. It is suitable for values that fit within the range of a 32-bit integer.

BIGINT (Big Integer)

The BIGINT data type is an 8-byte (64-bit) signed or unsigned integer, allowing for a significantly larger range of values than INT. It is used when handling very large numbers that exceed the range of INT.

2. Storage Size and Range

| Data Type | Storage Size | Signed Range | Unsigned Range |
|---|---|---|---|
| INT | 4 bytes | -2,147,483,648 to 2,147,483,647 | 0 to 4,294,967,295 |
| BIGINT | 8 bytes | -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 | 0 to 18,446,744,073,709,551,615 |
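
These ranges follow directly from the bit width. As a quick check, the sketch below recomputes them in Python; `int_range` and `fits_in_int` are hypothetical helper names used only for illustration.

```python
def int_range(bits: int, signed: bool = True):
    """Return (min, max) for an integer type with the given bit width."""
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1


def fits_in_int(value: int, signed: bool = True) -> bool:
    """True if value fits in a 32-bit INT column."""
    low, high = int_range(32, signed)
    return low <= value <= high


for name, bits in (("INT", 32), ("BIGINT", 64)):
    print(name, int_range(bits), int_range(bits, signed=False))

print(fits_in_int(2_147_483_647))  # True: the largest signed INT
print(fits_in_int(2_147_483_648))  # False: needs BIGINT (or unsigned INT)
```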

3. Performance Considerations

The choice between INT and BIGINT impacts database performance in multiple ways:

Memory Usage

INT requires only 4 bytes per value, whereas BIGINT requires 8 bytes. If your dataset contains millions of records, using BIGINT unnecessarily can double the storage space.
Choosing INT when BIGINT is not needed improves memory efficiency.
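
To make the difference concrete, here is a small sketch that estimates the raw size of a single integer column. The row count is a made-up example, and the figures cover only the values themselves, not row headers, indexes, or any engine-specific overhead.

```python
import struct

INT_BYTES = struct.calcsize("<i")     # 4 bytes, 32-bit integer
BIGINT_BYTES = struct.calcsize("<q")  # 8 bytes, 64-bit integer


def column_megabytes(rows: int, bytes_per_value: int) -> float:
    """Raw size of one integer column in decimal megabytes."""
    return rows * bytes_per_value / 1_000_000


rows = 100_000_000  # hypothetical table size
print(column_megabytes(rows, INT_BYTES))     # 400.0
print(column_megabytes(rows, BIGINT_BYTES))  # 800.0
```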

Processing Speed

INT can be faster to process than BIGINT, mainly because twice as many values fit in each memory page and cache line; on 64-bit CPUs the arithmetic itself is rarely the bottleneck.
BIGINT can slow down indexing and sorting operations due to its larger size.
Using BIGINT in primary keys and indexes increases index size, potentially affecting query performance.

Index Performance

Indexes on BIGINT columns take more space and can result in slower index lookups compared to INT indexes.
INT is preferred in indexing whenever possible to optimize performance.

4. Use Cases

When to Use INT?

When the values do not exceed the INT range.
For most applications, such as user IDs, small numerical counters, and moderate-sized datasets.
When optimizing storage and performance is a priority.

When to Use BIGINT?

When dealing with large datasets where numbers exceed the INT range.
In applications that require handling large numerical IDs, such as blockchain transactions and financial systems. (Note that GUIDs/UUIDs are 128-bit values and do not fit in a BIGINT.)
When working with datasets expected to grow significantly over time.

5. Examples in SQL

Creating a Table with INT and BIGINT

CREATE TABLE example (
    id INT AUTO_INCREMENT PRIMARY KEY,
    big_value BIGINT NOT NULL
);

Checking Storage Size

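To check the declared type of each column in MySQL (the storage size follows from the type: 4 bytes for INT, 8 bytes for BIGINT):

SELECT COLUMN_NAME, DATA_TYPE, COLUMN_TYPE, NUMERIC_PRECISION
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = DATABASE()
  AND TABLE_NAME = 'example';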

6. Conclusion

Choosing between INT and BIGINT depends on the nature of the data and its expected growth. If your values will never exceed the INT range, using INT is more efficient. However, if your application requires handling large numbers, BIGINT is the better choice despite its increased storage cost.

In summary:

Use INT when values fit within its range and performance is a priority.
Use BIGINT when dealing with large numbers and future scalability is required.

By making an informed decision, you can optimize your database storage and performance effectively.

Logistic regression

Ranjit maity — Thu, 22 Dec 2022 03:21:23 GMT

Logistic regression is a statistical method used for classification tasks, such as predicting whether a person has a disease or not, based on a set of predictor variables. It is a widely used and well-established method in the field of machine learning and data analytics.

In logistic regression, the dependent variable is a binary variable that takes on only two values, such as “Yes” or “No”, “True” or “False”, or “1” or “0”. The independent variables (also known as predictor variables) are continuous or categorical variables that are used to predict the value of the dependent variable.

The goal of logistic regression is to find the best combination of independent variables that can predict the value of the dependent variable with the highest accuracy. To do this, logistic regression uses a special type of equation called the logistic function, which maps the values of the independent variables to probabilities between 0 and 1.

The mathematical formula for the logistic function is given by:

p = e^(b0 + b1x1 + b2x2 + … + bnxn) / (1 + e^(b0 + b1x1 + b2x2 + … + bnxn))

where p is the predicted probability, x1, x2, …, xn are the independent variables, and b0, b1, b2, …, bn are the coefficients of the equation. The coefficients are estimated using maximum likelihood estimation, which finds the values that maximize the likelihood of the observed data.
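
As a minimal sketch of this formula, the function below computes p for a given set of coefficients. It uses the algebraically equivalent form 1 / (1 + e^(-z)); the coefficient values are invented for illustration and are not fitted to any data.

```python
import math


def predict_probability(coefficients, features):
    """Logistic function: coefficients = [b0, b1, ..., bn], features = [x1, ..., xn]."""
    b0, *bs = coefficients
    z = b0 + sum(b * x for b, x in zip(bs, features))
    # e^z / (1 + e^z) rewritten as 1 / (1 + e^(-z))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical model with one predictor (age): b0 = -4.0, b1 = 0.08
coefficients = [-4.0, 0.08]
for age in (30, 50, 70):
    print(age, round(predict_probability(coefficients, [age]), 3))
```

With these made-up coefficients the linear part z is zero at age 50, so the predicted probability there is 0.5, and it rises towards 1 as age increases.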

Here is an example of how you can use Python to fit a logistic regression model to a dataset:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load the data (feature columns are assumed to be numeric;
# encode categorical columns such as education before fitting)
data = pd.read_csv("data.csv")

# Split the data into training and testing sets
X = data[["age", "income", "education"]]
y = data["has_disease"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the logistic regression model
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict the probability of having the disease
probabilities = model.predict_proba(X_test)[:, 1]
print(probabilities)

# Predict the class (Yes or No)
predictions = model.predict(X_test)
print(predictions)

This code uses the LogisticRegression class from the scikit-learn library to fit a logistic regression model to the data. It first splits the data into training and testing sets, using the train_test_split function from the model_selection module. It then fits the model to the training data using the fit method, and uses the predict_proba method to predict the probability of having the disease for each sample in the testing set. Finally, it uses the predict method to predict the class (Yes or No) for each sample in the testing set.

The output of this code will be an array of predicted probabilities and an array of predicted classes for the testing set.

Some advantages of using logistic regression include:

It is a simple and easy-to-understand method, making it suitable for a wide range of applications.
It can handle categorical variables once they are encoded numerically (for example with one-hot encoding); scikit-learn’s LogisticRegression expects numeric input.
It provides a probabilistic interpretation of the predictions, which can be useful in certain applications.
It is computationally efficient and can be applied to large datasets quickly.

Some disadvantages of using logistic regression include:

It assumes a linear relationship between the independent variables and the log odds of the dependent variable, which may not always be the case in real-world data.
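In its basic form it is limited to binary classification tasks; multiclass problems need an extension such as multinomial or one-vs-rest logistic regression.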

Logistic regression is a widely used and well-established statistical method for binary classification tasks. It is a simple and easy-to-understand method that works with numerically encoded categorical variables and provides a probabilistic interpretation of the predictions. However, it has some limitations, such as the assumption of a linear relationship between the independent variables and the log odds of the dependent variable, and the restriction of the basic model to binary classification tasks.

Despite these limitations, logistic regression is a valuable tool for data scientists and analysts, and it is widely used in a variety of applications, including marketing, finance, healthcare, and more. If you are working on a classification problem and the data meets the assumptions of logistic regression, it is worth considering as a potential solution.

To use logistic regression in Python, you can use the LogisticRegression class from the scikit-learn library. This class provides a variety of methods and functions that you can use to fit a logistic regression model to your data, predict the probability of the dependent variable, and evaluate the performance of the model. With a little bit of code and some basic knowledge of statistics, you can quickly and easily apply logistic regression to your data and solve a variety of classification problems.

Linear Regression

Ranjit maity — Thu, 22 Dec 2022 03:12:21 GMT

Linear regression is a statistical method used to model the relationship between a dependent variable (also known as the response variable) and one or more independent variables (also known as explanatory variables). It is used to predict the value of the dependent variable based on the values of the independent variables.

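y = b0 + b1x1 + b2x2 + … + bnxn + e, where y is the dependent variable, x1, x2, …, xn are the independent variables, b0 is the intercept, b1, b2, …, bn are the coefficients, and e is the error term.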

Intuitively, linear regression assumes that there is a linear relationship between the dependent and independent variables. In other words, it assumes that the change in the dependent variable is directly proportional to the change in the independent variables.

For example, let’s say we want to predict the price of a house based on its size (in square feet). We can use linear regression to model this relationship by fitting a line to the data points. The slope of the line represents the rate at which the price of the house changes with respect to the size, and the y-intercept represents the price of the house when the size is zero.
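
For a single predictor, the slope and intercept of the least-squares line have a simple closed form. The sketch below computes them in plain Python; the sizes and prices are illustrative values, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


sizes = [1000, 1500, 2000, 2500, 3000]              # sq ft (illustrative)
prices = [100000, 125000, 150000, 175000, 200000]   # dollars (illustrative)

slope, intercept = fit_line(sizes, prices)
print(slope, intercept)          # 50.0 50000.0
print(slope * 2500 + intercept)  # 175000.0
```

Here the slope says each extra square foot adds 50 dollars, and the intercept of 50,000 dollars is the modelled price at size zero, which is an extrapolation rather than a meaningful price.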

Here is an example of how you can use Python to fit a linear regression model to a dataset:

import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Sample data
x = np.array([1000, 1500, 2000, 2500, 3000])  # Size of the house (in sq ft)
y = np.array([100000, 125000, 150000, 175000, 200000])  # Price of the house (in dollars)

# Fit the linear regression model
model = LinearRegression()
model.fit(x.reshape(-1, 1), y)

# Predict the price of a house with size 2500 sq ft
prediction = model.predict(np.array([2500]).reshape(-1, 1))
print(prediction)  # Output: [175000.]

# Plot the data points and the fitted line
plt.scatter(x, y, color='red')
plt.plot(x, model.predict(x.reshape(-1, 1)), color='blue')
plt.show()

The output is a scatter plot of the data points (red) with the fitted line (blue) passing through them.

Some advantages of using linear regression include:

Simplicity: Linear regression is a simple and easy-to-understand method, making it suitable for a wide range of applications.
Interpretability: The coefficients of the linear regression model can be interpreted as the effect of each independent variable on the dependent variable, making it easy to understand the relationships between variables.
Speed: Linear regression is computationally efficient and can be applied to large datasets quickly.
Minimal preparation: Linear regression needs relatively little data preparation, although it is not robust to outliers (see the disadvantages below).
Widely used: Linear regression is a widely used and well-established statistical method, and it is available in many programming languages and libraries.

However, it is important to note that linear regression has some limitations, such as the assumption of a linear relationship between the dependent and independent variables and the inability to model complex relationships. In such cases, other methods, such as non-linear regression or machine learning algorithms, may be more appropriate.

Some disadvantages of using linear regression include:

Assumption of linearity: Linear regression assumes a linear relationship between the dependent and independent variables, which may not always be the case in real-world data. If the relationship is non-linear, linear regression may not be accurate or reliable.
Limited flexibility: Linear regression is a simple and straightforward method, which means it cannot model complex relationships between variables. If the data has multiple non-linear patterns or interactions between variables, linear regression may not be able to capture them.
Sensitivity to outliers: Linear regression is sensitive to outliers, which means that a few extreme or abnormal data points can have a significant impact on the fitted model. This can affect the accuracy and reliability of the model.
Limited to a single dependent variable: Linear regression is limited to modelling the relationship between a single dependent variable and one or more independent variables. It cannot model multiple dependent variables or relationships between variables.
Limitations in high-dimensional data: Linear regression may not perform well in high-dimensional datasets, where the number of variables is much larger than the number of observations. In such cases, the model is prone to overfitting unless regularization (for example ridge or lasso regression) is applied.

Overall, it is important to carefully evaluate the assumptions and limitations of linear regression before applying it to real-world data. If the data does not meet the assumptions or the model is not suitable for the problem, you may need to use a different method or approach.

Principal Component Analysis (PCA)

Ranjit maity — Sat, 17 Dec 2022 15:29:00 GMT

Principal Component Analysis (PCA) is a statistical technique used to reduce the dimensionality of a dataset by identifying the underlying structure in the data and projecting it onto a lower-dimensional space. It is often used to visualize high-dimensional data or to extract the most important features from a dataset for further analysis.

PCA works by finding a linear combination of the original variables that explains the maximum variance in the data. The resulting combination is called the first principal component, and the process can be repeated to find additional principal components. The number of principal components is typically chosen to be less than or equal to the number of original variables, and the principal components are ranked by the amount of variance they explain.

To find the principal components, PCA performs the following steps:

Standardize the variables: The variables are transformed to have zero mean and unit variance. This step is optional but recommended, as it helps to give all the variables equal weight in the analysis.
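Calculate the covariance matrix: The covariance matrix is a square matrix that shows the covariance between each pair of variables (with the variances on the diagonal). It is calculated by multiplying the transpose of the standardized data matrix by the data matrix itself and dividing by n − 1, where n is the number of observations.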
Calculate the eigenvectors and eigenvalues: The eigenvectors of the covariance matrix correspond to the principal components, and the eigenvalues correspond to the amount of variance explained by each principal component. The eigenvectors and eigenvalues are found by solving the eigenvalue equation for the covariance matrix.
Select the number of components: The number of principal components is typically chosen to be less than or equal to the number of original variables. The number of components can be chosen based on the amount of variance explained by each component, or by using techniques such as cross-validation or the elbow method.
Transform the data: To transform the data onto the principal component space, we multiply the standardized data by the matrix whose columns are the selected eigenvectors. The resulting matrix will have the same number of rows as the original data, but the number of columns will be equal to the number of principal components. Each column of the transformed matrix holds the scores for one principal component, and the values in each row are the coordinates of that observation along the principal components.
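
The steps above can be sketched in plain Python for the first principal component. Instead of a full eigendecomposition, this sketch uses power iteration, a simpler technique that only finds the eigenvector with the largest eigenvalue; the data and all function names are made up for illustration. In practice you would normally use a library such as scikit-learn.

```python
import math


def standardize(rows):
    """Step 1: give each column zero mean and unit variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]


def covariance_matrix(z):
    """Step 2: Z^T Z / (n - 1) for already-centred data."""
    n, d = len(z), len(z[0])
    return [[sum(r[i] * r[j] for r in z) / (n - 1) for j in range(d)]
            for i in range(d)]


def first_component(cov, iterations=200):
    """Step 3 (simplified): power iteration for the top eigenvector."""
    d = len(cov)
    v = [1.0 / (j + 1) for j in range(d)]  # arbitrary non-zero starting vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(d))
                     for i in range(d))
    return v, eigenvalue


data = [[2.0, 1.9], [4.0, 4.2], [6.0, 5.8], [8.0, 8.1], [10.0, 9.9]]
z = standardize(data)
cov = covariance_matrix(z)
component, variance = first_component(cov)

# Step 5: project each observation onto the first principal component
scores = [sum(r[j] * component[j] for j in range(len(component))) for r in z]
print(component, variance)
print(scores)
```

Because the two made-up variables are almost perfectly correlated, the first component points along the diagonal and captures nearly all of the total variance (close to 2 out of 2 for standardized data).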

Advantages of PCA:

It reduces the dimensionality of the data: One of the main advantages of PCA is that it reduces the dimensionality of the data, making it easier to visualize and analyze. This is especially useful for datasets with a large number of variables, as it can be difficult to understand the relationships between all the variables. By reducing the number of variables, PCA makes it easier to understand the underlying structure in the data.
It can identify the underlying structure in the data: PCA is able to identify the underlying structure in the data by finding the principal components that explain the most variance. This can be useful for feature extraction, as the leading principal components summarize most of the variation in the data.
It is relatively simple and easy to implement: PCA is a relatively simple and straightforward technique, and it is easy to implement using tools such as scikit-learn in Python.
It is widely used: PCA is a widely used technique in many fields, including machine learning, data analysis, and image processing. This means that there is a wealth of resources and support available for using PCA, making it a popular choice for many applications.

Disadvantages of PCA:

It can only capture linear relationships in the data: PCA is based on finding linear combinations of the original variables, so it can only capture linear relationships in the data. If the relationships between the variables are non-linear, PCA may not be able to identify them.
It can be sensitive to the scaling of the variables: PCA is sensitive to the scaling of the variables, meaning that variables with a larger scale can have more influence on the resulting principal components. This can be mitigated by standardizing the variables before performing PCA.
It can be affected by noise and outliers: PCA is driven by variance, so correlated noise or a few extreme data points can dominate the leading principal components and make them unstable.
It can lose important information: While PCA reduces the dimensionality of the data, it can also lose important information in the process. This is because PCA only keeps the principal components that explain the most variance in the data, and discards the rest. This can be a problem if the discarded components contain important information.
It can be difficult to interpret the results: The results of PCA can be difficult to interpret, as the principal components are linear combinations of the original variables and do not have a clear meaning. This can make it difficult to understand the relationships between the variables and the principal components.

BIG DATA What, why and how

Ranjit maity — Mon, 31 Oct 2022 16:10:17 GMT

What is Big Data?

The Oxford Dictionary defines Big Data as “Extremely large data sets that may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions.”

In its June 2011 report titled Big Data: The next frontier for innovation, competition, and productivity, McKinsey Global Institute defined Big Data as:

“Big Data” refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze. This definition is intentionally subjective and incorporates a moving definition of how big a dataset needs to be in order to be considered Big Data — i.e., we don’t define Big Data in terms of being larger than a certain number of terabytes (thousands of gigabytes). We assume that, as technology advances over time, the size of datasets that qualify as Big Data will also increase. Also note that the definition can vary by sector, depending on what kinds of software tools are commonly available and what sizes of datasets are common in a particular industry. With those caveats, Big Data in many sectors today will range from a few dozen terabytes to multiple petabytes (thousands of terabytes).

Essentially, Big Data is a term which is used to mean a massive volume of both structured and unstructured data that is so large that it is difficult to process using traditional database and software techniques. In most enterprise scenarios the volume of data is too big, or it moves too fast, or it exceeds current processing capacity. Interestingly, it is difficult to agree upon a standard definition because different people will use the term Big Data in different contexts and that will determine what they mean when they talk about “Big Data.”

When someone is talking about data-storage capacity, Big Data means the size or volume of the data. When computing capability is under discussion, Big Data refers to the processing capability needed to handle it. The vendors specializing in this technology will refer to the technology and tools which will be used to analyze this data when they talk about Big Data. Organizations will talk more about data generation and accumulation when they mention Big Data. However, there are certain characteristics of Big Data which will be there most of the time and they are: size, unstructured nature and the need for processing to make sense of it.

The history

The term Big Data has been in use since the 1990s, with some giving credit to John Mashey for coining it, or at least making it popular. Big Data encompasses unstructured, semi-structured and structured data; however, the main focus is on unstructured data. Big Data “size” is a constantly moving target as storage capacity increases and processing gets faster.

In a 2001 research report and related lectures, META Group (now Gartner) defined data-growth challenges and opportunities as being three-dimensional, i.e. increasing volume (amount of data), velocity (speed of data in and out) and variety (range of data types and sources). Most of the industry continues to use this volume, velocity and variety (or 3Vs) model for describing Big Data. In 2012, Gartner updated its definition as follows: “Big Data is high-volume, high-velocity and/or high-variety information assets that demand cost-effective, innovative forms of information processing that enable enhanced insight, decision making, and process automation.”

For the last few years, we have also started to see and use the concept of 4Vs (including the fourth one called Veracity, implying that Big Data could have ambiguities and uncertainties because of the nature of the data) or 5Vs (Veracity and Value, implying that Big Data will need to be associated with meaningful benefits to have value) of Big Data. However, Gartner’s definition of the 3Vs is still widely used and is in agreement with a consensual definition that states that “Big Data represents the Information assets characterized by such a High Volume, Velocity and Variety to require specific Technology and Analytical Methods for its transformation into Value.” The 3Vs have been expanded to other complementary characteristics of Big Data:

Volume: Big Data doesn’t sample; it just observes and tracks what happens.

Velocity: Big Data is often available in real-time.

Variety: Big Data draws from text, images, audio, and video; plus it completes missing pieces through data fusion.

Machine learning: Big Data often doesn’t ask why and simply detects patterns.

Digital footprint: Big Data is often a cost-free byproduct of digital interaction.

The data must be processed with advanced tools (analytics and algorithms) to reveal meaningful information so that both visible and invisible issues with various components can be considered and taken into analysis.

The terminology

In the context of Big Data, the first term is data storage. The units of data measurement describe disk space (data-storage space) and system memory, and typical capacities are growing very fast: from bulky, difficult-to-handle floppy disks (8-inch, 5¼-inch and 3½-inch) holding 1.2 MB to 1.44 MB, in use up until the early years of the 21st century, to pen drives today, which can easily carry 64 GB.

According to the IBM Dictionary of Computing, when used to describe disk storage capacity, a megabyte is 1,000,000 bytes in decimal notation. But when the term megabyte is used for real and virtual storage and channel volume, 2 to the 20th power or 1,048,576 bytes is the appropriate notation. According to the Microsoft Press Computer Dictionary, a megabyte means either 1,000,000 bytes or 1,048,576 bytes. According to Eric S. Raymond in The New Hacker’s Dictionary, a megabyte is 1,048,576 bytes on the argument that bytes should be computed in powers of 2.

The 1,000 can be replaced with 1,024 and still be correct using the other acceptable standards. For the processor or virtual storage:

1 bit = binary digit, 8 bits = 1 byte, 1,024 bytes = 1 kilobyte, 1,024 kilobytes = 1 megabyte, etc.

For the disk storage:

1 bit = binary digit, 8 bits = 1 byte, 1,000 bytes = 1 kilobyte, 1,000 kilobytes = 1 megabyte, etc.
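To make the two conventions concrete, here is a minimal Python sketch that computes a megabyte both ways. The helper name unit_size is hypothetical and only used for illustration.

```python
# Decimal (disk storage) vs. binary (processor / virtual storage) conventions.
DECIMAL_BASE = 1000   # 1 kilobyte = 1,000 bytes
BINARY_BASE = 1024    # 1 kilobyte = 1,024 bytes


def unit_size(power: int, base: int) -> int:
    """Bytes in one unit: power=1 is kilo, 2 is mega, 3 is giga, and so on."""
    return base ** power


megabyte_decimal = unit_size(2, DECIMAL_BASE)
megabyte_binary = unit_size(2, BINARY_BASE)

print(megabyte_decimal)  # 1000000
print(megabyte_binary)   # 1048576

# Relative gap between the two conventions, in percent
gap_percent = 100 * (megabyte_binary - megabyte_decimal) / megabyte_decimal
print(round(gap_percent, 2))  # 4.86
```

The gap grows with every step up the ladder, which is why a drive sold in decimal gigabytes looks smaller when an operating system reports it in binary units.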

Bit: A Bit is the smallest unit of data that a computer uses. It can be used to represent two states of information, such as Yes or No.

Byte: A Byte is equal to 8 Bits. A Byte can represent 256 states of information, for example, numbers or a combination of numbers and letters. 1 Byte could be equal to one character. 10 Bytes could be equal to a word. 100 Bytes would equal an average sentence.

Kilobyte: A Kilobyte is approximately 1,000 Bytes, actually 1,024 Bytes, depending on which definition is used. 1 Kilobyte would be equal to this paragraph you are reading, whereas 100 Kilobytes would equal an entire page.

Megabyte: A megabyte is approximately 1,000 Kilobytes. In the early days of computing, a megabyte was considered to be a large amount of data. These days with a 500-Gigabyte hard drive on a computer being common, a megabyte doesn’t seem like much anymore. One of those old 3–1/2 inch floppy disks can hold 1.44 Megabytes or the equivalent of a small book. 100 Megabytes might hold a couple volumes of Encyclopedias. 600 Megabytes is about the amount of data that will fit on a CD-ROM disk.

source: https://whatsabyte.com/

Gigabyte: A Gigabyte is approximately 1,000 Megabytes. A Gigabyte is still a very common term used these days when referring to disk space or drive storage. 1 Gigabyte of data is almost twice the amount of data that a CD-ROM can hold. But it’s about one thousand times the capacity of a 3–1/2 floppy disk. 1 Gigabyte could hold the contents of about ten yards of books on a shelf. 100 gigabytes could hold the entire library floor of academic journals.

Terabyte: A Terabyte is approximately one trillion bytes or 1,000 gigabytes. This was unimaginable even a few years back but now 1 and 2-terabyte drives are the normal specs for many new computers. A Terabyte could hold about 3.6 million 300 Kilobyte images or about 300 hours of good quality video. A Terabyte could hold 1,000 copies of the Encyclopedia Britannica. 10 Terabytes could hold the printed collection of the Library of Congress.

Petabyte: A Petabyte is approximately 1,000 Terabytes or one million gigabytes. It’s hard to visualize what a Petabyte could hold. 1 Petabyte could hold approximately 20 million 4-door filing cabinets full of text. It could hold 500 billion pages of standard printed text. It would take about 500 million floppy disks to store the same amount of data.

Exabyte: An Exabyte is approximately 1,000 Petabytes. Another way to look at it is that an Exabyte is about one quintillion bytes or one billion gigabytes. There is not much to compare an Exabyte to. It has been said that 5 Exabytes would be equal to all of the words ever spoken by mankind.

Zettabyte: A Zettabyte is approximately 1,000 Exabytes. There is nothing to compare a Zettabyte to but to say that it would take a whole lot of ones and zeroes to fill it up.

Yottabyte: A Yottabyte is approximately 1,000 Zettabytes. It would take approximately 11 trillion years to download a Yottabyte file from the Internet using high-powered broadband. You can compare it to the World Wide Web as the entire Internet almost takes up about a Yottabyte.

Brontobyte: A Brontobyte is approximately 1,000 Yottabytes. One thing we can say about a Brontobyte is that it is a 1 followed by 27 zeroes.

Geopbyte: A Geopbyte is about 1,000 Brontobytes. One way of looking at a Geopbyte is 1,267,650,600,228,229,401,496,703,205,376 bytes (2 to the 100th power)!
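As a quick sanity check on the largest unit, Python's arbitrary-precision integers can print the binary figure exactly. This is only a sketch comparing the binary convention (2 to the 100th power) with the decimal one (1,000 Brontobytes of 10 to the 27th power bytes each); the variable names are illustrative.

```python
geopbyte_binary = 2 ** 100    # binary convention: 1,024 multiplied by itself ten times
geopbyte_decimal = 10 ** 30   # decimal convention: 1,000 Brontobytes of 10**27 bytes

print(f"{geopbyte_binary:,}")
print(len(str(geopbyte_binary)))  # 31 digits

# The binary figure is roughly 27% larger than the decimal one
ratio = geopbyte_binary / geopbyte_decimal
print(round(ratio, 3))
```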

Big Data: getting bigger and bigger, faster and faster

For Big Data applications to be useful, the data needs to be stored, processed and delivered in a comprehensible format so that it can be used effectively. Data volumes are proliferating — especially unstructured data — at a rate, typically, of around 50% annually. Josh James, Founder, CEO and Chairman of the Board at Domo, an American computer software company specializing in business intelligence tools and data visualization, recently released their fifth annual Data Never Sleeps report, which highlights the fact that 90% of all data today was created in the last two years — that is 2.5 quintillion bytes of data per day.13

There are many reasons behind this massive explosion of data. As the capabilities of hardware increase and prices decline, the digitization of information becomes cost-effective and sensible from a cost vs. benefit standpoint. A June 2017 report14 from the Pew Research Center15 states that about three-quarters of U.S. adults (77%) say they own a smartphone, up from 35% in 2011, making the smartphone one of the most quickly adopted consumer technologies in history. This is natural because as people become more well-off and literate, the appetite for information and the tendency to share increase. Software programming and algorithms are getting better, simpler and easier, and hence more complex tasks can be achieved, including things which were unthinkable until a few years ago.

There are several techniques available for Big Data analysis, and they are continuously evolving. There are also ready-made software packages available for Big Data, and several of them are being used across industries. These technologies can be used to aggregate, manipulate, manage and analyze Big Data. Another emerging and very interesting area is presenting information in such a way that people can use it effectively. This is a key challenge if Big Data is to lead to concrete action. There are human limitations on how much data can be visualized and understood; this increases the relevance of visualization, i.e., techniques and technologies to understand and improve the results of Big Data analyses.

7 Tactics to Combat Imbalanced Classes in Machine Learning Dataset

Ranjit maity — Fri, 06 May 2022 11:42:01 GMT

You are working on your dataset. You create a classification model and get 95% accuracy immediately. “Fantastic” 👌 you think. You dive a little deeper and discover that 95% of the data belongs to one class. Damn! what to do??? 🤨🤔

Don’t worry 🤫

You will discover the tactics that you can use to deliver great results on machine learning datasets with imbalanced data.

Frustration!

Imbalanced data can cause you a lot of frustration.

You feel very frustrated when you discover that your data has imbalanced classes and that all of the great results you thought you were getting turn out to be a lie.

The next wave of frustration hits when the books, articles and blog posts don’t seem to give you good advice about handling the imbalance in your data.

Relax, there are many options and we’re going to go through them all. It is possible to build predictive models for imbalanced data.

What is Imbalanced Data?

Imbalanced data typically refers to classification problems where the classes are not represented equally.

For example, you may have a 2-class (binary) classification problem with 100 instances (rows). A total of 80 instances are labeled with Class-1 and the remaining 20 instances are labeled with Class-2.

This is an imbalanced dataset and the ratio of Class-1 to Class-2 instances is 80:20 or more concisely 4:1.

You can have a class imbalance problem on two-class classification problems as well as multi-class classification problems. Most techniques can be used on either.

The remaining discussions will assume a two-class classification problem because it is easier to think about and describe.

Imbalance is Common

Most classification datasets do not have an exactly equal number of instances in each class, but a small difference often does not matter.

There are problems where a class imbalance is not just common, it is expected. For example, datasets that characterize fraudulent transactions are imbalanced. The vast majority of the transactions will be in the “Not-Fraud” class and a very small minority will be in the “Fraud” class.

Another example is customer churn datasets, where the vast majority of customers stay with the service (the “No-Churn” class) and a small minority cancel their subscription (the “Churn” class).

Even a modest class imbalance, like the 4:1 ratio in the example above, can cause problems.

Accuracy Paradox

The accuracy paradox is the name for the exact situation in the introduction to this post. 👉🏻

It is the case where your accuracy measures tell the story that you have excellent accuracy (such as 90%), but the accuracy is only reflecting the underlying class distribution.

It is very common, because classification accuracy is often the first measure we use when evaluating models on our classification problems.

Put it All On Red!

What is going on in our models when we train on an imbalanced dataset?

As you might have guessed, the reason we get 90% accuracy on an imbalanced dataset (with 90% of the instances in Class-1) is that our models look at the data and cleverly decide that the best thing to do is to always predict “Class-1” and achieve high accuracy.

This is best seen when using a simple rule based algorithm. If you print out the rule in the final model you will see that it is very likely predicting one class regardless of the data it is asked to predict.
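The effect is easy to reproduce. The sketch below uses hypothetical labels (90 instances of Class-1, 10 of Class-2) and a "model" that ignores its input and always predicts the majority class. Accuracy looks excellent while not a single minority instance is found.

```python
# Hypothetical labels: 90 instances of Class-1 and 10 of Class-2
y_true = ["Class-1"] * 90 + ["Class-2"] * 10

# A "model" that ignores its input and always predicts the majority class
y_pred = ["Class-1"] * len(y_true)

# Fraction of all predictions that are correct
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Fraction of Class-2 instances that were actually predicted as Class-2
minority_hits = sum(t == p == "Class-2" for t, p in zip(y_true, y_pred))
minority_recall = minority_hits / y_true.count("Class-2")

print(accuracy)         # 0.9
print(minority_recall)  # 0.0
```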

7 Tactics To Combat Imbalanced Training Data

We now understand what class imbalance is and why it provides misleading classification accuracy.

1) Try Changing Your Performance Metric

Accuracy is not the metric to use when working with an imbalanced dataset. We have seen that it is misleading.

There are metrics that have been designed to tell you a more truthful story when working with imbalanced classes.

I give more advice on selecting different performance measures in “Classification Accuracy is Not Enough: More Performance Measures You Can Use”.

In that post I look at an imbalanced dataset that characterizes the recurrence of breast cancer in patients.

From that post, I recommend looking at the following performance measures that can give more insight into the accuracy of the model than traditional classification accuracy:

Confusion Matrix: A breakdown of predictions into a table showing correct predictions (the diagonal) and the types of incorrect predictions made (what classes incorrect predictions were assigned).
Precision: A measure of a classifier’s exactness.
Recall: A measure of a classifier’s completeness.
F1 Score (or F-score): The harmonic mean of precision and recall.

I would also advise you to take a look at the following:

Kappa (or Cohen’s kappa): Classification accuracy normalized by the imbalance of the classes in the data.
ROC Curves: Like precision and recall, accuracy is divided into sensitivity and specificity and models can be chosen based on the balance thresholds of these values.

You can learn a lot more about using ROC Curves to compare classification accuracy in our post “Assessing and Comparing Classifier Performance with ROC Curves”.

Still not sure? Start with kappa, it will give you a better idea of what is going on than classification accuracy.
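To see how these measures differ from plain accuracy, here is a small pure-Python sketch. The function name binary_metrics and the example labels are hypothetical; the formulas are the standard definitions of precision, recall, F1 and Cohen's kappa for a two-class problem.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, F1 and Cohen's kappa for two classes."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    n = len(pairs)

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # Kappa compares observed agreement with the agreement expected by chance
    expected = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((tn + fp) / n)
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0

    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "kappa": kappa}


# 10 positives and 90 negatives (hypothetical data)
y_true = [1] * 10 + [0] * 90
always_majority = [0] * 100                          # never predicts the minority
some_effort = [1] * 5 + [0] * 5 + [1] * 5 + [0] * 85  # finds half the positives

majority_scores = binary_metrics(y_true, always_majority)
effort_scores = binary_metrics(y_true, some_effort)
print(majority_scores)
print(effort_scores)
```

Both predictors reach 90% accuracy, but kappa is zero for the always-majority predictor and clearly above zero for the second one, which is the kind of difference accuracy alone hides.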

2) Can You Collect More Data?

You might think it’s silly, but collecting more data is almost always overlooked.

Can you collect more data? Take a second and think about whether you are able to gather more data on your problem.

A larger dataset might expose a different and perhaps more balanced perspective on the classes.

More examples of minor classes may be useful later when we look at resampling your dataset.

3) Try Resampling Your Dataset

You can change the dataset that you use to build your predictive model to have more balanced data.

This change is called sampling your dataset and there are two main methods that you can use to even-up the classes:

You can add copies of instances from the under-represented class called over-sampling (or more formally sampling with replacement), or
You can delete instances from the over-represented class, called under-sampling.

These approaches are often very easy to implement and fast to run. They are an excellent starting point.
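Both methods fit in a few lines. The sketch below uses only the standard library; the function names and the toy rows are hypothetical, and it assumes the majority list is at least as long as the minority list.

```python
import random


def random_oversample(majority, minority, seed=0):
    """Add randomly chosen copies of minority rows until the classes match."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra


def random_undersample(majority, minority, seed=0):
    """Keep a random subset of majority rows the same size as the minority."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority


# Toy dataset with the 80:20 split used earlier (rows are placeholders)
majority = [("Class-1", i) for i in range(80)]
minority = [("Class-2", i) for i in range(20)]

over_major, over_minor = random_oversample(majority, minority)
under_major, under_minor = random_undersample(majority, minority)

print(len(over_major), len(over_minor))    # 80 80
print(len(under_major), len(under_minor))  # 20 20
```

Resampling should be applied to the training split only, so that evaluation still reflects the real class distribution.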

In fact, I would advise you to always try both approaches on all of your imbalanced datasets, just to see if it gives you a boost in your preferred accuracy measures.

You can learn a little more in the Wikipedia article titled “Oversampling and undersampling in data analysis”.

Some Rules of Thumb

Consider testing under-sampling when you have a lot of data (tens or hundreds of thousands of instances or more)
Consider testing over-sampling when you don’t have a lot of data (tens of thousands of records or fewer)
Consider testing random and non-random (e.g. stratified) sampling schemes.
Consider testing different resampled ratios (e.g. you don’t have to target a 1:1 ratio in a binary classification problem, try other ratios)

4) Try Generating Synthetic Samples

A simple way to generate synthetic samples is to randomly sample the attributes from instances in the minority class.

You could sample them empirically within your dataset or you could use a method like Naive Bayes that can sample each attribute independently when run in reverse. You will have more and different data, but the non-linear relationships between the attributes may not be preserved.

There are systematic algorithms that you can use to generate synthetic samples. The most popular of such algorithms is called SMOTE or the Synthetic Minority Over-sampling Technique.

As its name suggests, SMOTE is an oversampling method. It works by creating synthetic samples from the minor class instead of creating copies. The algorithm selects two or more similar instances (using a distance measure) and perturbs an instance one attribute at a time by a random amount within the difference to the neighboring instances.

To learn more about SMOTE, see the original 2002 paper titled “SMOTE: Synthetic Minority Over-sampling Technique”.
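The core idea can be sketched in plain Python. This is a simplified illustration, not the reference SMOTE implementation: the function name smote_like is hypothetical, it uses squared Euclidean distance, and it draws a single random gap per synthetic sample rather than one per attribute.

```python
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of `base`, excluding `base` itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Hypothetical minority-class rows with two numeric attributes
minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.5), (1.2, 2.2), (1.8, 2.1)]
new_points = smote_like(minority, n_new=10)
print(new_points[:3])
```

Every synthetic point lies on a line segment between two real minority points, so the new data stays inside the region the minority class already occupies.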

There are a number of implementations of the SMOTE algorithm, for example:

In Python, take a look at the “UnbalancedDataset” module. It provides a number of implementations of SMOTE as well as various other resampling techniques that you could try.
In R, the DMwR package provides an implementation of SMOTE.
In Weka, you can use the SMOTE supervised filter.

5) Try Penalized Models

You can use the same algorithms but give them a different perspective on the problem.

Penalized classification imposes an additional cost on the model for making classification mistakes on the minority class during training. These penalties can bias the model to pay more attention to the minority class.

Often the handling of class penalties or weights is specialized to the learning algorithm. There are penalized versions of algorithms such as penalized-SVM and penalized-LDA.

It is also possible to have generic frameworks for penalized models. For example, Weka has a CostSensitiveClassifier that can wrap any classifier and apply a custom penalty matrix for misclassification.

Using penalization is desirable if you are locked into a specific algorithm and are unable to resample or you’re getting poor results. It provides yet another way to “balance” the classes. Setting up the penalty matrix can be complex. You will very likely have to try a variety of penalty schemes and see what works best for your problem.
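As one possible starting point for a penalty scheme, the sketch below weights each class by the inverse of its frequency and scores mistakes by the weight of the true class. The weighting rule and both function names are assumptions made for illustration; the text above does not prescribe a particular scheme.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * count) for c, count in counts.items()}


def weighted_error(y_true, y_pred, weights):
    """Share of total weight that falls on misclassified instances."""
    total = sum(weights[t] for t in y_true)
    cost = sum(weights[t] for t, p in zip(y_true, y_pred) if t != p)
    return cost / total


# The 80:20 example from earlier, with a predictor that always says Class-1
y_true = ["Class-1"] * 80 + ["Class-2"] * 20
y_pred = ["Class-1"] * 100

weights = balanced_class_weights(y_true)
plain_error = sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)
penalized_error = weighted_error(y_true, y_pred, weights)

print(weights)          # {'Class-1': 0.625, 'Class-2': 2.5}
print(plain_error)      # 0.2
print(penalized_error)  # 0.5
```

Under the weighted score, ignoring the minority class is no better than guessing, which is the shift in perspective that penalized training relies on.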

6) Try a Different Perspective

There are fields of study dedicated to imbalanced datasets. They have their own algorithms, measures and terminology.

Taking a look and thinking about your problem from these perspectives can sometimes shake loose some ideas.

Two you might like to consider are anomaly detection and change detection.

Anomaly detection is the detection of rare events. This might be a machine malfunction indicated through its vibrations or a malicious activity by a program indicated by its sequence of system calls. The events are rare when compared to normal operation.

This shift in thinking considers the minor class as the outliers class which might help you think of new ways to separate and classify samples.

Change detection is similar to anomaly detection except rather than looking for an anomaly it is looking for a change or difference. This might be a change in behavior of a user as observed by usage patterns or bank transactions.

Both of these shifts take a more real-time stance to the classification problem that might give you some new ways of thinking about your problem and maybe some more techniques to try.

7) Try Different Algorithms

As always, I strongly advise you not to use your favorite algorithm on every problem. You should at least be spot-checking a variety of different types of algorithms on a given problem.

For more on spot-checking algorithms, see “Why you should be Spot-Checking Algorithms on your Machine Learning Problems”.

That being said, decision trees often perform well on imbalanced datasets. The splitting rules that look at the class variable used in the creation of the trees can force both classes to be addressed.

If in doubt, try a few popular decision tree algorithms like C4.5, C5.0, CART, and Random Forest.

For some example R code using decision trees, see “Non-Linear Classification in R with Decision Trees”.

For an example of using CART in Python and scikit-learn, see “Get Your Hands Dirty With Scikit-Learn Now”.

Thank you

How to Calculate Correlation Between Variables in Python3

Ranjit maity — Wed, 30 Mar 2022 10:03:05 GMT

data correlation

There may be complex and unknown relationships between the variables in your dataset.

It is important to discover and quantify the degree to which variables in your dataset are dependent upon each other. This knowledge can help you better prepare your data to meet the expectations of machine learning algorithms, such as linear regression, whose performance will degrade with the presence of these interdependencies.

In this article, you will discover that correlation is the statistical summary of the relationship between variables and how to calculate it for different types of variables and relationships.

After completing this article, you will know:

How to calculate a covariance matrix to summarize the linear relationship between two or more variables.
How to calculate the Pearson’s correlation coefficient to summarize the linear relationship between two variables.
How to calculate the Spearman’s correlation coefficient to summarize the monotonic relationship between two variables.

This article is divided into 5 parts; they are:

What is Correlation?
Test Dataset
Covariance
Pearson’s Correlation
Spearman’s Correlation

code →sourcecode

What is Correlation?

Variables within a dataset can be related for lots of reasons.

For example:

One variable could cause or depend on the values of another variable.
One variable could be lightly associated with another variable.
Two variables could depend on a third unknown variable.

It can be useful in data analysis and modelling to better understand the relationships between variables. The statistical relationship between two variables is referred to as their correlation.

A correlation could be positive, meaning both variables move in the same direction, or negative, meaning that when one variable’s value increases, the other variable’s value decreases. Correlation can also be neutral or zero, meaning that the variables are unrelated.

Positive Correlation: both variables change in the same direction.
Neutral Correlation: No relationship in the change of the variables.
Negative Correlation: variables change in opposite directions.

The performance of some algorithms can deteriorate if two or more variables are tightly related, called multicollinearity. An example is linear regression, where one of the offending correlated variables should be removed in order to improve the skill of the model.

We may also be interested in the correlation between input variables and the output variable in order to provide insight into which variables may or may not be relevant as input for developing a model.

The structure of the relationship may be known, e.g. it may be linear, or we may have no idea whether a relationship exists between two variables or what structure it may take. Depending on what is known about the relationship and the distribution of the variables, different correlation scores can be calculated.

In this tutorial, we will look at one score for variables that have a Gaussian distribution and a linear relationship and another that does not assume a distribution and will report on any monotonic (increasing or decreasing) relationship.

Test Dataset

Before we look at correlation methods, let’s define a dataset we can use to test the methods.

We will generate 1,000 samples of two variables with a strong positive correlation. The first variable will be random numbers drawn from a Gaussian distribution with a mean of 100 and a standard deviation of 20. The second variable will be the values of the first variable with Gaussian noise added, where the noise has a mean of 50 and a standard deviation of 10.

We will use the randn() function to generate random Gaussian values with a mean of 0 and a standard deviation of 1, then multiply the results by our own standard deviation and add the mean to shift the values into the preferred range.
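A minimal sketch of this dataset with NumPy might look as follows. The variable names data1 and data2 and the seed value are assumptions made for repeatability, not necessarily what the linked code uses.

```python
import numpy as np

np.random.seed(1)  # fixed seed so the sketch is repeatable

# First variable: Gaussian with mean 100 and standard deviation 20
data1 = 20 * np.random.randn(1000) + 100

# Second variable: the first variable plus Gaussian noise (mean 50, std 10)
data2 = data1 + (10 * np.random.randn(1000) + 50)

# Summarize each variable
print('data1: mean=%.3f stdv=%.3f' % (data1.mean(), data1.std()))
print('data2: mean=%.3f stdv=%.3f' % (data2.mean(), data2.std()))
```

Because the second variable is built from the first, a scatter plot of the two should show a clear upward trend.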

code →sourcecode

Before we look at calculating some correlation scores, we must first look at an important statistical building block, called covariance.

Covariance

Variables can be related by a linear relationship. This is a relationship that is consistently additive across the two data samples.

This relationship can be summarized between two variables, called the covariance. It is calculated as the average of the product between the values from each sample, where the values have been centered (had their mean subtracted).

The use of the mean in the calculation suggests the need for each data sample to have a Gaussian or Gaussian-like distribution.

The sign of the covariance can be interpreted as whether the two variables change in the same direction (positive) or change in different directions (negative). The magnitude of the covariance is not easily interpreted. A covariance value of zero indicates that there is no linear relationship between the variables; it does not by itself show that they are completely independent.

The cov() NumPy function can be used to calculate a covariance matrix between two or more variables.

The diagonal of the matrix contains the covariance between each variable and itself. The other values in the matrix represent the covariance between the two variables; in this case, the remaining two values are the same given that we are calculating the covariance for only two variables.

We can calculate the covariance matrix for the two variables in our test problem.
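As a hedged sketch, the calculation with NumPy's cov() could look like this. The data is regenerated with an assumed seed so the block runs on its own.

```python
import numpy as np

np.random.seed(1)  # assumed seed, for repeatability only
data1 = 20 * np.random.randn(1000) + 100
data2 = data1 + (10 * np.random.randn(1000) + 50)

# 2x2 matrix: variances on the diagonal, covariance off the diagonal
covariance = np.cov(data1, data2)
print(covariance)

# The off-diagonal entry is the covariance between the two variables
cov_xy = covariance[0, 1]
print('Covariance: %.3f' % cov_xy)
```

Since the noise is independent of the first variable, the covariance should land near the variance of the first variable, roughly 20 squared, and be positive.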

code →sourcecode

The covariance and covariance matrix are used widely within statistics and multivariate analysis to characterize the relationships between two or more variables.

A problem with covariance as a statistical tool alone is that it is challenging to interpret. This leads us to the Pearson’s correlation coefficient next.

Pearson’s Correlation

The Pearson correlation coefficient (named for Karl Pearson) can be used to summarize the strength of the linear relationship between two data samples.

The Pearson’s correlation coefficient is calculated as the covariance of the two variables divided by the product of the standard deviation of each data sample. It is the normalization of the covariance between the two variables to give an interpretable score.

The use of mean and standard deviation in the calculation suggests the need for the two data samples to have a Gaussian or Gaussian-like distribution.

The result of the calculation, the correlation coefficient, can be interpreted to understand the relationship.

The coefficient returns a value between -1 and 1 that represents the limits of correlation from a full negative correlation to a full positive correlation. A value of 0 means no correlation. The value must be interpreted: often a value below -0.5 or above 0.5 indicates a notable correlation, and values closer to zero suggest a less notable correlation.

The pearsonr() SciPy function can be used to calculate the Pearson’s correlation coefficient between two data samples with the same length.

We can calculate the correlation between the two variables in our test problem.
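The sketch below computes the coefficient directly from its definition with NumPy (covariance divided by the product of the standard deviations) and cross-checks it against np.corrcoef. SciPy's pearsonr(data1, data2) returns the same coefficient together with a p-value. The seed and variable names are assumptions.

```python
import numpy as np

np.random.seed(1)  # assumed seed, for repeatability only
data1 = 20 * np.random.randn(1000) + 100
data2 = data1 + (10 * np.random.randn(1000) + 50)

# Pearson's r = cov(X, Y) / (std(X) * std(Y)), using sample statistics (ddof=1)
cov_xy = np.cov(data1, data2)[0, 1]
pearson = cov_xy / (data1.std(ddof=1) * data2.std(ddof=1))
print('Pearsons correlation: %.3f' % pearson)

# Cross-check against NumPy's built-in correlation matrix
matrix = np.corrcoef(data1, data2)
print(matrix)
```

The off-diagonal entries of the matrix equal the hand-computed value, and the diagonal is 1.0 because each variable correlates perfectly with itself.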

The complete example in the code →sourcecode

The Pearson’s correlation coefficient can be used to evaluate the relationship between more than two variables.

This can be done by calculating a matrix of the relationships between each pair of variables in the dataset. The result is a symmetric matrix called a correlation matrix with a value of 1.0 along the diagonal as each column always perfectly correlates with itself.

Spearman’s Correlation

Two variables may be related by a nonlinear relationship, such that the relationship is stronger or weaker across the distribution of the variables.

Further, the two variables being considered may have a non-Gaussian distribution.

In this case, the Spearman’s correlation coefficient (named for Charles Spearman) can be used to summarize the strength between the two data samples. This test of relationship can also be used if there is a linear relationship between the variables, but will have slightly less power (e.g. may result in lower coefficient scores).

As with the Pearson correlation coefficient, the scores are between -1 and 1 for perfectly negatively correlated variables and perfectly positively correlated respectively.

Instead of calculating the coefficient using covariance and standard deviations on the samples themselves, these statistics are calculated from the relative rank of values on each sample. This is a common approach used in non-parametric statistics, e.g. statistical methods where we do not assume a distribution of the data such as Gaussian.

A linear relationship between the variables is not assumed, although a monotonic relationship is assumed. This is a mathematical name for an increasing or decreasing relationship between the two variables.

If you are unsure of the distribution and possible relationships between two variables, Spearman correlation coefficient is a good tool to use.

The spearmanr() SciPy function can be used to calculate the Spearman’s correlation coefficient between two data samples with the same length.

We can calculate the correlation between the two variables in our test problem.

We know that the data is Gaussian and that the relationship between the variables is linear.
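A rank-based sketch with NumPy is shown below: Spearman's coefficient is the Pearson coefficient of the ranks. The helper rank is hypothetical and assumes there are no tied values, which holds for continuous random data; SciPy's spearmanr() handles ties and also returns a p-value.

```python
import numpy as np


def rank(values):
    """Ranks from 1 to n (smallest value gets rank 1); assumes no ties."""
    order = np.argsort(values)
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks


np.random.seed(1)  # assumed seed, for repeatability only
data1 = 20 * np.random.randn(1000) + 100
data2 = data1 + (10 * np.random.randn(1000) + 50)

# Spearman's coefficient: Pearson's correlation computed on the ranks
spearman = np.corrcoef(rank(data1), rank(data2))[0, 1]
print('Spearmans correlation: %.3f' % spearman)
```

Even though the rank-based approach makes no Gaussian or linearity assumption, it should still report a strong positive correlation on this dataset.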

As with the Pearson’s correlation coefficient, the coefficient can be calculated pair-wise for each variable in a dataset to give a correlation matrix for review.

How to Calculate the Bias-Variance Trade-off with Python

Ranjit maity — Thu, 30 Sep 2021 11:09:53 GMT

Bias variance

The performance of a machine learning model can be characterized in terms of the bias and the variance of the model.

A model with high bias makes strong assumptions about the form of the unknown underlying function that maps inputs to outputs in the dataset, such as linear regression. A model with high variance is highly dependent upon the specifics of the training dataset, such as unpruned decision trees. We desire models with low bias and low variance, although there is often a trade-off between these two concerns.

The bias-variance trade-off is a useful conceptualization for selecting and configuring models, although generally cannot be computed directly as it requires full knowledge of the problem domain, which we do not have. Nevertheless, in some cases, we can estimate the error of a model and divide the error down into bias and variance components, which may provide insight into a given model’s behavior.

In this tutorial, you will discover how to calculate the bias and variance for a machine learning model.

After completing this tutorial, you will know:

Model error consists of model variance, model bias, and irreducible error.
We seek models with low bias and variance, although typically reducing one results in a rise in the other.
How to decompose mean squared error into model bias and variance terms.

What do We get To Know From here? 🤟

Bias, Variance, and Irreducible Error
Bias-Variance Trade-off
Calculate the Bias and Variance

1 👉Bias, Variance, and Irreducible Error

Consider a machine learning model that makes predictions for a predictive modeling task, such as regression or classification.

The performance of the model on the task can be described in terms of the prediction error on all examples not used to train the model. We will refer to this as the model error.

Error(Model)

The model error can be decomposed into three sources of error: the variance of the model, the bias of the model, and the variance of the irreducible error in the data.

Error(Model) = Variance(Model) + Bias(Model) + Variance(Irreducible Error)
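For squared-error loss, the standard form of this decomposition uses the square of the bias. As a sketch, assume the targets follow y = f(x) + ε with noise variance σ², and write f̂ for the fitted model; these symbols are introduced here for illustration and do not appear in the text above.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{Variance}}
  + \underbrace{\sigma^2}_{\text{Irreducible error}}
```

The expectations are taken over training sets and noise, which is why the quantity can only be estimated, for example by refitting the model on many resampled training sets.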

Let’s take a closer look at each of these three terms.

Model Bias

The bias is a measure of how closely the model can capture the mapping function between inputs and outputs.

It captures the rigidity of the model: the strength of the assumption the model has about the functional form of the mapping between inputs and outputs.

A model with high bias is helpful when the bias matches the true but unknown underlying mapping function for the predictive modeling problem. Yet, a model with a large bias will be completely useless when the functional form for the problem is mismatched with the assumptions of the model, e.g. assuming a linear relationship for data with a highly non-linear relationship.

Low Bias: Weak assumptions regarding the functional form of the mapping of inputs to outputs.
High Bias: Strong assumptions regarding the functional form of the mapping of inputs to outputs.

The bias is always positive.👍

Model Variance

The variance of the model is the amount the performance of the model changes when it is fit on different training data.

It captures the impact that the specifics of the data have on the model.

A model with high variance will change a lot with small changes to the training dataset. Conversely, a model with low variance will change little with small or even large changes to the training dataset.

Low Variance: Small changes to the model with changes to the training dataset.
High Variance: Large changes to the model with changes to the training dataset.

The variance is always positive. 👍

On the whole, the error of a model consists of reducible error and irreducible error.

Model Error = Reducible Error + Irreducible Error

The reducible error is the element that we can improve. It is the quantity that we reduce when the model is learning on a training dataset and we try to get this number as close to zero as possible.

The irreducible error is the error that we can not remove with our model, or with any model.

The error is caused by elements outside our control, such as statistical noise in the observations.

As such, although we may be able to squash the reducible error to a very small value close to zero, or even zero in some cases, we will always have some irreducible error. It defines a lower bound in performance on a problem.

It is a reminder that no model is perfect.

Bias-Variance Trade-off

The bias and the variance of a model’s performance are connected.

Ideally, we would prefer a model with low bias and low variance, although in practice, this is very challenging. In fact, this could be described as the goal of applied machine learning for a given predictive modeling problem.

Reducing the bias can easily be achieved by increasing the variance. Conversely, reducing the variance can easily be achieved by increasing the bias.

This relationship is generally referred to as the bias-variance trade-off. It is a conceptual framework for thinking about how to choose models and model configuration.

We can choose a model based on its bias or variance. Simple models, such as linear regression and logistic regression, generally have a high bias and a low variance. Complex models, such as random forest, generally have a low bias but a high variance.

We may also choose model configurations based on their effect on the bias and variance of the model. The k hyperparameter in k-nearest neighbors controls the bias-variance trade-off. Small values, such as k=1, result in a low bias and a high variance, whereas large k values, such as k=21, result in a high bias and a low variance.
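The effect of k can be checked with a small simulation. This is only a sketch: the k-nearest neighbors regressor is hand-written for one feature, and the target (sin(x) plus Gaussian noise), the sample size, and the test grid are all assumptions picked for the demo.

```python
import math
import random
import statistics

def knn_predict(xs, ys, x0, k):
    """Average the targets of the k training points closest to x0."""
    nearest = sorted(range(len(xs)), key=lambda i: abs(xs[i] - x0))[:k]
    return statistics.fmean(ys[i] for i in nearest)

def bias_variance_for_k(k, rounds=200, n=60, noise=0.5, seed=3):
    """Estimate squared bias and variance of k-NN on y = sin(x) + noise."""
    rng = random.Random(seed)
    test_xs = [0.5 + 0.5 * i for i in range(12)]  # fixed grid inside (0, 2*pi)
    preds = [[] for _ in test_xs]
    for _ in range(rounds):
        # Draw a fresh training set each round and refit (k-NN just stores it).
        xs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n)]
        ys = [math.sin(x) + rng.gauss(0.0, noise) for x in xs]
        for j, x0 in enumerate(test_xs):
            preds[j].append(knn_predict(xs, ys, x0, k))
    # Squared bias: gap between the average prediction and the noise-free target.
    bias_sq = statistics.fmean(
        (statistics.fmean(p) - math.sin(x0)) ** 2 for p, x0 in zip(preds, test_xs)
    )
    # Variance: spread of the predictions across training sets.
    variance = statistics.fmean(statistics.pvariance(p) for p in preds)
    return bias_sq, variance

results = {k: bias_variance_for_k(k) for k in (1, 21)}
for k, (bias_sq, variance) in results.items():
    print(f"k={k:>2}  bias^2={bias_sq:.4f}  variance={variance:.4f}")
```

With these settings, k=1 should show a tiny squared bias and a large variance, and k=21 the reverse, because averaging 21 neighbors smooths out noise but also flattens the curve.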

High bias is not always bad, nor is high variance, but they can lead to poor results.

We often must test a suite of different models and model configurations in order to discover what works best for a given dataset. A model with a large bias may be too rigid and underfit the problem. Conversely, a model with a large variance may be too flexible and overfit the problem.

We may decide to increase the bias or the variance as long as it decreases the overall estimate of model error.

Calculate the Bias and Variance

👀👀👀

Code for the same

Running the example reports the estimated error as well as the estimated bias and variance for the model error.

Note: Your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.

In this case, we can see that the model has a high bias and a low variance. This is to be expected given that we are using a linear regression model. We can also see that the sum of the estimated bias and variance equals the estimated error of the model, e.g. 20.726 + 1.761 = 22.487.
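For readers who want to try the decomposition themselves, here is a minimal from-scratch sketch. It is not the listing that produced the numbers quoted above, and its output will differ: it assumes synthetic data (a hypothetical curved target, y = x² plus noise), a one-feature least-squares line, and bootstrap resampling of the training set to stand in for "different training data". The function names are made up for this example.

```python
import random
import statistics

def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b for a single feature."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def bias_variance_mse(x_train, y_train, x_test, y_test, rounds=200, seed=1):
    """Return (error, bias, variance) estimated over bootstrap refits."""
    rng = random.Random(seed)
    n = len(x_train)
    preds = [[] for _ in x_test]
    for _ in range(rounds):
        # Resample the training set with replacement and refit the line.
        idx = [rng.randrange(n) for _ in range(n)]
        a, b = fit_line([x_train[i] for i in idx], [y_train[i] for i in idx])
        for j, x in enumerate(x_test):
            preds[j].append(a * x + b)
    # Average squared error over all refits and test points.
    error = statistics.fmean(
        statistics.fmean((p - y) ** 2 for p in ps) for ps, y in zip(preds, y_test)
    )
    # Squared gap between the average prediction and the (noisy) test target,
    # so this "bias" term also absorbs the irreducible noise.
    bias = statistics.fmean(
        (statistics.fmean(ps) - y) ** 2 for ps, y in zip(preds, y_test)
    )
    # Spread of the predictions across refits.
    variance = statistics.fmean(statistics.pvariance(ps) for ps in preds)
    return error, bias, variance

data_rng = random.Random(0)

def sample(n):
    """Hypothetical curved target that a straight line cannot capture."""
    xs = [data_rng.uniform(0.0, 5.0) for _ in range(n)]
    ys = [x * x + data_rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys

x_train, y_train = sample(200)
x_test, y_test = sample(100)
error, bias, variance = bias_variance_mse(x_train, y_train, x_test, y_test)
print(f"MSE: {error:.3f}  Bias: {bias:.3f}  Variance: {variance:.3f}")
```

Because a straight line is fit to a curved target, the bias term should dominate the variance term, and the bias and variance should add up to the reported error, mirroring the pattern described above.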
