Stories by Atul Kumar on Medium

K-Nearest Neighbors (KNN)

Atul Kumar — Tue, 30 Dec 2025 19:11:31 GMT

Definition:

KNN (K-Nearest Neighbors) is a supervised machine learning algorithm used for:

Classification (most common)
Regression

The idea behind KNN is simple:

Tell me who your neighbors are, and I will tell you who you are.

KNN does not fit an explicit mathematical model (no weights or equations are learned).
Instead, it stores the entire training data and makes decisions only when a prediction is needed.

What does “K” mean in KNN?

K = number of nearest neighbors to consider

Example:

K = 3 → look at 3 nearest points
K = 5 → look at 5 nearest points

Odd values of K are usually chosen to avoid ties in two-class problems.

How KNN Works :

When a new data point arrives, KNN follows these steps:

Calculate the distance between the new point and all training points
Select the K nearest neighbors
Take a majority vote (classification)
Assign the most common class

Real-Life Intuition

Imagine you move to a new city and want to know if a place is safe or unsafe.

You ask your 5 nearest neighbors:

If most say “safe” → you believe it is safe
If most say “unsafe” → you believe it is unsafe

This is exactly how KNN works.

Important Note About Distance:

KNN depends completely on distance.

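The most common distance used is Euclidean distance. For two points p and q with n features:

d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)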

In simple words:

Smaller distance → more similarity
Larger distance → less similarity
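As a minimal sketch of the distance idea, the function below computes Euclidean distance in plain Python. The name euclidean is only an illustrative helper, not part of any library used in this article.

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points given as equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# One feature (hours studied): 5 hours vs 6 hours
print(euclidean([5], [6]))        # 1.0

# Two features: the classic 3-4-5 triangle
print(euclidean([0, 0], [3, 4]))  # 5.0
```

A smaller number means the two points are more similar, which is all KNN needs in order to rank neighbors.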

Let’s understand this using a problem:

Predict whether a student will Pass or Fail based on hours studied.

A new student studies 5 hours.
Will the student pass or fail?

Solve this problem using the following steps:

Step 1: Import Required Libraries

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

Explanation:

numpy → handles numerical data
KNeighborsClassifier → KNN classification model

Step 2: Prepare the Data

X = np.array([[1], [2], [3], [6], [7], [8]])
y = np.array(['Fail', 'Fail', 'Fail', 'Pass', 'Pass', 'Pass'])

Explanation:

X contains input features (hours studied)
y contains output labels
Double brackets create a 2D array

Step 3: Choose the Value of K

knn = KNeighborsClassifier(n_neighbors=3)

Explanation:

We choose K = 3
Model will consider 3 nearest neighbors

Step 4: Train the Model

knn.fit(X, y)

Explanation:

KNN simply stores the dataset
No real “learning” happens here

Step 5: Make a Prediction

new_student = np.array([[5]])
prediction = knn.predict(new_student)

print("Prediction:", prediction[0])

Explanation:

Input = 5 hours studied
Model finds nearest neighbors
Majority vote decides the result

How KNN Makes This Decision

Nearest values to 5 hours:

3 → Fail
6 → Pass
7 → Pass

Votes:

Pass → 2
Fail → 1

Final Prediction: PASS
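
The vote above can be checked by hand with a small from-scratch sketch. It uses the same six students and K = 3, but it is plain Python written for illustration (knn_predict is a hypothetical helper), not what scikit-learn runs internally.

```python
from collections import Counter

hours = [1, 2, 3, 6, 7, 8]
labels = ["Fail", "Fail", "Fail", "Pass", "Pass", "Pass"]

def knn_predict(query, k=3):
    """Return the k nearest (hours, label) pairs and the majority label."""
    # Sort training points by distance to the query (one feature, so |a - b|)
    by_distance = sorted(zip(hours, labels), key=lambda pair: abs(pair[0] - query))
    neighbors = by_distance[:k]
    # Majority vote among the k nearest labels
    votes = Counter(label for _, label in neighbors)
    return neighbors, votes.most_common(1)[0][0]

neighbors, prediction = knn_predict(5)
print(neighbors)   # [(6, 'Pass'), (3, 'Fail'), (7, 'Pass')]
print(prediction)  # Pass
```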

Some important points about KNN:

Pros:

Easy to understand
Almost no training time (fitting only stores the data)
Works well with small datasets
No assumptions about data

Cons:

Slow for large datasets
High memory usage
Sensitive to noisy data
Needs feature scaling
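
To see why feature scaling matters, here is a small sketch with two features. The second feature (family income), the three students, and the assumed feature ranges are all hypothetical values invented for this illustration.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# (hours studied, family income) -- hypothetical values
query     = (5.0, 50000.0)
student_a = (5.5, 52000.0)   # similar study hours, slightly different income
student_b = (1.0, 50100.0)   # very different study hours, similar income

# Without scaling, income dominates the distance, so B looks closer
raw_a = euclidean(query, student_a)
raw_b = euclidean(query, student_b)
print(raw_a > raw_b)        # True

# Divide each feature by an assumed range so both lie on a comparable scale
ranges = (10.0, 100000.0)

def scale(point):
    return tuple(value / r for value, r in zip(point, ranges))

scaled_a = euclidean(scale(query), scale(student_a))
scaled_b = euclidean(scale(query), scale(student_b))
print(scaled_a < scaled_b)  # True: after scaling, A is the nearer neighbor
```

In practice a scaler from scikit-learn would usually do this step; the manual division here only shows the effect.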

Conclusion:

KNN is not the fastest or smartest algorithm, but it is one of the best teachers in Machine Learning.

If you truly understand KNN:

You understand distance
You understand classification
You understand prediction logic

And that makes learning other algorithms much easier.

Decision Tree

Atul Kumar — Mon, 29 Dec 2025 18:56:53 GMT

Introduction:

A Decision Tree is a supervised machine learning algorithm used for classification and regression.

It works by:

Asking simple questions
Splitting data based on answers
Reaching a final decision

Because of their simple structure and logic, decision trees are widely used in real-world applications.

Definition:

A Decision Tree is a machine learning algorithm that makes decisions step by step, just like how humans think.

It works by asking simple questions and splitting data based on the answers.

Real-life example:

Should I play cricket today?

Is it raining?
Yes → Don’t play
No →
Do I have free time?
Yes → Play
No → Don’t play

This question–answer structure is exactly how a decision tree works

Why Is It Called a “Tree”?

Because it looks like a tree

Root node → first question
Decision nodes → middle questions
Leaf nodes → final answer (Yes / No)

Where Decision Trees Are Used:

Exam pass / fail prediction
Loan approval
Spam detection
Medical diagnosis
Customer behavior analysis

They are popular because they are easy to understand.

Simple Problem We Will Solve

Predict whether a student will PASS or FAIL based on hours studied

1 → Pass
0 → Fail

Now we solve a problem and explain the code:

Step 1: Import Required Libraries

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

Explanation :

numpy → helps work with numbers and arrays
matplotlib → used to draw the tree
DecisionTreeClassifier → decision tree model
plot_tree → shows the tree visually

Step 2: Create Dataset

# Input feature: Hours studied
X = np.array([[1], [2], [3], [4], [5], [6]])
# Output label: 0 = Fail, 1 = Pass
y = np.array([0, 0, 0, 1, 1, 1])

Explanation:

X → hours studied
y → result (fail or pass)
Each row in X matches one value in y

Step 3: Create Decision Tree Model

model = DecisionTreeClassifier()

Explanation:

This line creates the decision tree
At this point, the model is empty
It has not learned anything yet

Step 4: Train the Model

model.fit(X, y)

Explanation:

The model looks at X and y
Finds the best questions to split the data
Learns a rule such as: if hours studied > 3.5 → Pass (the threshold falls midway between 3 and 4)

Step 5: Make Predictions

predictions = model.predict(X)
print(predictions)

Model predicts pass/fail for given data
Output will be something like:

[0 0 0 1 1 1]

This matches our real data

Step 6: Predict for a New Student

hours = [[4.5]]
result = model.predict(hours)
print("Prediction:", result)

Explanation:

Predicts the result for 4.5 hours of study
Output: [1], which means Pass (1 → Pass, 0 → Fail)

Step 7: Visualize the Decision Tree :

plt.figure(figsize=(10,6))
plot_tree(
    model,
    feature_names=["Hours Studied"],
    class_names=["Fail", "Pass"],
    filled=True
)
plt.show()

Explanation (Line by Line):

plt.figure(figsize=(10,6)) → sets the size of the figure
plot_tree(model, ...) → draws the decision tree
feature_names=["Hours Studied"] → names the input feature
class_names=["Fail", "Pass"] → names the output classes
filled=True → colors the boxes by class for easier reading
plt.show() → displays the tree

What the Tree Shows

Top box → first decision
Left branch → Fail
Right branch → Pass
Final boxes → predictions

How Decision Tree Makes Decisions

Looks at all possible questions
Chooses the best split
Repeats until data is clear
Gives final decision
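
What "best split" means can be sketched in a few lines. The example below scores every candidate threshold with Gini impurity, which is the default criterion in scikit-learn's DecisionTreeClassifier. The helpers gini and best_split are illustrative names and a simplification of what the library does.

```python
def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys):
    """Try the midpoint between each pair of neighboring values.

    Returns (threshold, weighted impurity) of the best candidate.
    """
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold fits between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (threshold, score)
    return best

hours = [1, 2, 3, 4, 5, 6]
passed = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_split(hours, passed)
print(threshold, impurity)  # 3.5 0.0
```

A weighted impurity of 0.0 means both sides of the split contain only one class, so the tree can stop after a single question.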

No heavy math is needed to follow a tree: each step is a simple comparison.

Some important points about Decision Trees:

Very easy to understand
No complex math
Works with numbers & categories
Can overfit data
Not good for very large datasets
Small change in data can change tree

Thank you

Confusion Matrix

Atul Kumar — Mon, 29 Dec 2025 17:40:28 GMT

Definition:

A confusion matrix is a table used to check how good a classification model is.

When a machine learning model makes predictions, it can:

Predict correctly
Predict wrongly

A confusion matrix helps us see these results clearly.

Why Is It Called “Confusion” Matrix?

Because it shows:

Where the model is confused
Where it is correct
Which type of mistake it makes more

It is mainly used in classification problems like:

Spam / Not Spam
Pass / Fail
Disease / No Disease

The layout of a two-class confusion matrix:

             Predicted NO    Predicted YES
Actual NO    TN              FP
Actual YES   FN              TP

Understanding the 4 Terms

1. True Positive (TP)

Model says YES
Actual answer is also YES
Correct prediction

Example:
Patient has disease → Model predicts disease

2. True Negative (TN)

Model says NO
Actual answer is also NO
Correct prediction

Example:
Email is not spam → Model predicts not spam

3. False Positive (FP)

Model says YES
Actual answer is NO
Wrong prediction

Example:
Email is not spam → Model predicts spam

4. False Negative (FN)

Model says NO
Actual answer is YES
Wrong prediction

Example:
Patient has disease → Model predicts no disease

Worked example:

Let’s understand this with a small problem.

Dataset :

Let’s say we have:

Actual results (y_true)
Model predictions (y_pred)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

Where:

1 → Positive
0 → Negative

Python Code for Confusion Matrix

Step 1: Import Libraries

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

Explanation:

matplotlib → for plotting
confusion_matrix → creates the confusion matrix
ConfusionMatrixDisplay → displays it visually

Step 2: Define Actual and Predicted Values

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

Explanation:

y_true → real answers
y_pred → model predictions

Step 3: Create Confusion Matrix

cm = confusion_matrix(y_true, y_pred)
print(cm)

Explanation:

This line compares actual vs predicted
Output will look like:

[[3 1]
 [1 3]]

How to Read This Output

[[TN  FP]
 [FN  TP]]

So here:

TN = 3
FP = 1
FN = 1
TP = 3
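
The four numbers can also be counted by hand, which is a useful check on the matrix layout. This is a plain Python sketch on the same y_true and y_pred lists, and the accuracy line is simply the share of predictions that fall on the diagonal.

```python
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

pairs = list(zip(y_true, y_pred))

tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # actual YES, predicted YES
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # actual NO,  predicted NO
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # actual NO,  predicted YES
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # actual YES, predicted NO

print([[tn, fp], [fn, tp]])  # [[3, 1], [1, 3]]

# Correct predictions (the diagonal) divided by all predictions
accuracy = (tp + tn) / len(y_true)
print(accuracy)  # 0.75
```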

Step 4: Visualize Confusion Matrix

disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.title("Confusion Matrix Example")
plt.show()

ConfusionMatrixDisplay → prepares the matrix for display
.plot() → draws the matrix
plt.title() → adds a title
plt.show() → shows the graph

What the Graph Shows

Each box shows number of predictions
Diagonal boxes → correct predictions
Other boxes → mistakes

Thank you

Logistic Regression

Atul Kumar — Sun, 28 Dec 2025 18:03:04 GMT

Introduction:

Logistic Regression is one of the most important algorithms in Machine Learning, especially for classification problems. Even though its name contains the word “regression,” Logistic Regression is actually used to predict categories, not continuous values.

Definition:

Logistic Regression is a supervised machine learning algorithm used for binary classification problems.

Binary classification means:

Output has only two possible values
Examples:
Pass / Fail
Yes / No
Spam / Not Spam
Disease / No Disease
0 / 1

Logistic Regression predicts the probability of an event happening and then converts it into a class.

Mathematical Foundation of Logistic Regression

1. Linear Equation (Same as Linear Regression):

z = wx + b

For multiple features:

z = w1x1 + w2x2 + ... + wnxn + b

Where:

x → input features
w → weights
b → bias
z → linear output (can be any real number)

2. The Logistic (Sigmoid) Function:

Logistic Regression uses a special function called the Sigmoid Function.

Sigmoid Function:

σ(z) = 1 / (1 + e^(-z))

The sigmoid function is a mathematical function that converts any real-valued number into a value between 0 and 1.
It has an S-shaped curve and is widely used in machine learning, especially in logistic regression and neural networks.
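
A minimal sketch of the sigmoid function in plain Python shows how it squeezes any real number into the range 0 to 1. The function name sigmoid is just an illustrative helper.

```python
import math

def sigmoid(z):
    """Map any real number z to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

for z in (-5, 0, 5):
    print(z, round(sigmoid(z), 3))
# -5 0.007
# 0 0.5
# 5 0.993
```

Large negative inputs land near 0, large positive inputs land near 1, and an input of 0 gives exactly 0.5.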

How Logistic Regression Works :

Step 1: Take Input Features

Example:

Hours studied
Attendance
Previous marks

Step 2: Apply Linear Equation

The model calculates a weighted sum of inputs.

Step 3: Apply Sigmoid Function

The result is converted into a probability.

Step 4: Make Final Decision

Based on a threshold (usually 0.5), the output is 1 if the probability is at least the threshold, otherwise 0.
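
The four steps can be sketched end to end for a single feature. The weight w = 1.2 and bias b = -4.2 below are made-up values chosen only for illustration; a real model learns them from data.

```python
import math

# Hypothetical parameters (not learned from any data)
w = 1.2
b = -4.2
THRESHOLD = 0.5

def predict_probability(hours):
    z = w * hours + b                   # Step 2: linear equation
    return 1.0 / (1.0 + math.exp(-z))   # Step 3: sigmoid gives a probability

def predict_label(hours):
    # Step 4: compare the probability with the threshold
    return 1 if predict_probability(hours) >= THRESHOLD else 0

print(predict_label(2))  # 0 (Fail)
print(predict_label(5))  # 1 (Pass)
```

With these values the decision flips at 3.5 hours, because that is where z equals 0 and the probability equals 0.5.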

Graphical Interpretation

Unlike linear regression (straight line), logistic regression produces an S-shaped curve.

Why S-Shape?

Probability slowly increases at first
Then increases rapidly
Finally saturates near 1

This behavior is perfect for classification problems.

Some important points about logistic regression:

Simple and easy to understand
Fast to train
Outputs probabilities
Works well for linearly separable data
Less computational power required
Cannot handle complex non-linear data
Sensitive to outliers
Requires feature engineering
Not suitable for multi-class problems (without extensions)

Some real-World Applications:

Medical diagnosis (disease: yes/no)
Credit approval systems
Spam email detection
Customer churn prediction
Fraud detection

Problem statement and explanation using logistic regression:

We want to predict:

Will a student pass the exam or not based on hours studied?

Input (X) → Hours studied
Output (y) → Pass or fail
1 = Pass
0 = Fail

This is a binary classification problem, so we use logistic regression.

Step 1: Import Required Libraries:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

numpy → used for numerical calculations
matplotlib → used for plotting graph
LogisticRegression → logistic regression model from sklearn

Step 2: Create Dataset:

# Input data (Hours studied)
X = np.array([1, 2, 3, 4, 5, 6]).reshape(-1, 1)

# Output data (0 = Fail, 1 = Pass)
y = np.array([0, 0, 0, 1, 1, 1])

X contains hours studied
y contains result (fail or pass)
.reshape(-1, 1) is required because sklearn expects 2D input

Step 3: Create Logistic Regression Model:

model = LogisticRegression()

Explanation:

This line creates the logistic regression model
At this point, the model knows how to learn, but not what to learn

Step 4: Train the Model

model.fit(X, y)

Explanation:

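The model learns the relationship between X and y

Internally:

Applies the linear equation
Passes the result through the sigmoid function
Adjusts the weights step by step to reduce the error (scikit-learn's default solver is lbfgs, a more refined relative of plain gradient descent)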
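
To make the idea of adjusting weights concrete, here is a rough sketch of plain gradient descent on the same six students. It is a teaching simplification and not what scikit-learn runs by default; the learning rate, the number of iterations, and the centering of the feature are all choices made for this illustration.

```python
import math

hours = [1, 2, 3, 4, 5, 6]
passed = [0, 0, 0, 1, 1, 1]

# Center the feature so plain gradient descent behaves well
mean_hours = sum(hours) / len(hours)
xs = [h - mean_hours for h in hours]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

w, b = 0.0, 0.0
learning_rate = 0.1

for _ in range(1000):
    probs = [sigmoid(w * x + b) for x in xs]
    # Average gradient of the log loss with respect to w and b
    grad_w = sum((p - y) * x for p, y, x in zip(probs, passed, xs)) / len(xs)
    grad_b = sum(p - y for p, y in zip(probs, passed)) / len(xs)
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

def predict(h):
    return 1 if sigmoid(w * (h - mean_hours) + b) >= 0.5 else 0

print([predict(h) for h in hours])  # [0, 0, 0, 1, 1, 1]
print(predict(4.5))                 # 1
```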

Step 5: Make Predictions

y_pred = model.predict(X)

Explanation:

Predicts class values (0 or 1)
Uses probability + threshold (0.5)

Step 6: Get Prediction Probabilities

y_prob = model.predict_proba(X)

Explanation:

Gives probability for both classes
Example: [0.2, 0.8]
20% → Fail
80% → Pass

Step 7: Visualize Logistic Regression Curve

# Generate smooth values for curve
X_test = np.linspace(0, 7, 100).reshape(-1, 1)
y_test_prob = model.predict_proba(X_test)[:, 1]
# Plot
plt.scatter(X, y, color='blue', label="Actual Data")
plt.plot(X_test, y_test_prob, color='red', label="Logistic Curve")
plt.xlabel("Hours Studied")
plt.ylabel("Probability of Passing")
plt.title("Logistic Regression Example")
plt.legend()
plt.show()

Explanation:

np.linspace() creates smooth, evenly spaced input values
predict_proba(X_test)[:, 1] selects the probability of class 1
Blue dots → real data
Red curve → logistic regression curve

Logistic Regression Curve

Curve Meaning:

Left side → low probability of passing
Middle → decision boundary (probability near 0.5)
Right side → high probability of passing

Step 8: Predict for New Student

hours = [[4.5]]
result = model.predict(hours)
probability = model.predict_proba(hours)
print("Prediction:", result)
print("Probability:", probability)

Explanation:

Predicts result for a student who studied 4.5 hours
Output:
1 → Pass
Probability shows confidence level

Final Output Meaning

Example output (illustrative; the exact probabilities depend on the fitted model):

Prediction: [1]
Probability: [[0.18 0.82]]

What the model says:

82% chance student will pass
Final decision → PASS

Thank you

Linear regression

Atul Kumar — Sun, 28 Dec 2025 14:09:23 GMT

Introduction:

Linear Regression is one of the foundations of Machine Learning. It helps beginners understand how data relationships work and how predictions are made. Even though it is simple, Linear Regression is still widely used in real-world applications. Learning it properly makes advanced machine learning concepts much easier to understand.

Definition :

Linear Regression is a supervised learning algorithm used to predict continuous values. Continuous values are numbers that can change smoothly, such as price, salary, marks, temperature, or distance.

The main idea behind Linear Regression is very simple:

Find a straight line that best represents the relationship between input and output data.

Examples:

Study hours → Exam marks
Area of house → House price
Experience → Salary

If we know the input, the line helps us predict the output.

Equation of Linear Regression:

It is based on a simple mathematical equation, the equation of a straight line:

y = mx + c

Let’s understand each term clearly:

y → Output value (what we want to predict)
x → Input value (feature)
m → Slope of the line
c → Y-intercept (value of y when x = 0)

The slope (m) tells us how much y changes when x increases by one unit.

How Linear Regression Actually Works

The algorithm follows these basic steps:

Take input data (X) and output data (Y)
Assume a straight line
Predict output values using the line
Calculate the error (difference between actual and predicted values)
Adjust the line to reduce the error
Repeat the process until the error becomes minimum
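The error being minimized is the sum of squared differences between actual and predicted values, which is why this approach is called the Least Squares Method. For a straight line, the best m and c can also be computed directly without repeated adjustment.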
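
As a small sketch of what "calculate the error" means, the code below compares two candidate lines on three made-up points. The points and the helper name sse are assumptions for illustration only.

```python
# Illustrative points lying exactly on y = 2x
points = [(1, 2), (2, 4), (3, 6)]

def sse(m, c):
    """Sum of squared errors of the line y = m*x + c on the points."""
    return sum((y - (m * x + c)) ** 2 for x, y in points)

print(sse(1, 1))  # 5  -> the line y = x + 1 misses two of the points
print(sse(2, 0))  # 0  -> the line y = 2x passes through every point
```

The line with the smaller sum of squared errors is the better fit, and least squares looks for the line where this sum is as small as possible.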

Visual Understanding of Linear Regression:

In the graph:

The dots represent actual data points
The straight line represents the best-fit line
The goal is to keep the line as close as possible to all points

Types of Linear Regression

1. Simple Linear Regression

Only one input variable
Example: Study hours → Marks

2. Multiple Linear Regression

More than one input variable
Example: Area + Rooms + Location → House price

Python Implementation of Linear Regression

Step 1: Import Required Libraries

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

numpy → Used for numerical calculations and arrays
matplotlib.pyplot → Used for data visualization (graphs)
LinearRegression → a built-in Linear Regression model from Scikit-learn

Step 2: Create Sample Dataset:

X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 4, 6, 8, 10])

X represents the input feature
y represents the output values
reshape(-1, 1) is required because Scikit-learn expects input data in a 2D format

This dataset represents a simple relationship:

When X increases, y increases proportionally (here y = 2X)

Step 3: Create the Linear Regression Model:

model = LinearRegression()

This line creates a Linear Regression object
The model will automatically calculate the best values of slope (m) and intercept (c)

Step 4: Train the Model:

model.fit(X, y)

fit() trains the model using input data (X) and output data (y)
During training, the model learns the relationship between X and y
It finds the best-fit line by minimizing error
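
For one input feature, the best-fit line can be worked out by hand with the standard least squares formulas. The sketch below applies them to the same five points; it is plain Python for illustration and is not a copy of scikit-learn's internal code.

```python
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 6, 8, 10]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# Slope: how x and y vary together, divided by how much x varies
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
    (x - mean_x) ** 2 for x in xs
)
# Intercept: forces the line through the point (mean_x, mean_y)
c = mean_y - m * mean_x

print(m, c)       # 2.0 0.0
print(m * 6 + c)  # 12.0 -> prediction for a new input x = 6
```

The result m = 2 and c = 0 matches the pattern in the data, where every y value is exactly twice its x value.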

Step 5: Make Predictions:

y_pred = model.predict(X)

This line uses the trained model to predict output values
y_pred contains predicted values based on the best-fit line

Step 6: Visualize the Result:

plt.scatter(X, y)
plt.plot(X, y_pred)
plt.xlabel("Input Feature (X)")
plt.ylabel("Output Value (Y)")
plt.title("Simple Linear Regression")
plt.show()

scatter() plots the actual data points
plot() draws the best-fit line
Labels and title make the graph readable
show() displays the graph

Important points about Linear Regression:

Simple and easy to understand
Works well with small datasets
Fast and efficient
Good for trend prediction
Assumes a linear relationship
Sensitive to outliers
Cannot handle complex data patterns
Performance drops with non-linear data

City Traffic Signal Violation Analysis

Atul Kumar — Fri, 26 Dec 2025 18:17:58 GMT

Overview:

This project analyzes city traffic signal violations using Python, Pandas, and Matplotlib to understand common traffic rule violations, high-risk areas, and vehicle-related patterns. The dataset includes details like date, city, signal ID, location, violation type, vehicle type, and violation count. After cleaning and inspecting the data, violations were analyzed by type, vehicle, signal, and location. Traffic signal violations play a significant role in causing road accidents in urban areas.
The goal of this analysis is to understand real-world traffic behavior and highlight how data-driven insights can help improve traffic safety and encourage better rule compliance.

Dataset :

A dataset is a collection of related data stored together in a structured form, usually in rows and columns, so that it can be easily analyzed

To start, I created a small dataset in CSV format.

It contains:

Date
City
Signal ID
Location
Violation type
Vehicle type
Violation count

Here is the code with a complete explanation:

import pandas as pd
import matplotlib.pyplot as plt

This code is used to import the required libraries for the project.

pandas :

It is used to work with the dataset. It helps in reading the data into a table and performing analysis such as counting and summarizing traffic violations. The name pd is just a short alias that makes the code easier to write.

matplotlib.pyplot:

It is used to create graphs. It helps in drawing bar charts and other plots so that the traffic violation data can be understood visually. The name plt is a commonly used short form.
These two libraries are used together to analyze the data and show the results in a simple and clear way.
— — — — — — — — — — — — — — — — — — — — — — — — — — — — —

df=pd.read_csv("indian_Traffic_violation.csv")

This line is used to read the dataset file.

pd.read_csv("indian_Traffic_violation.csv") loads data from a CSV file
The file (indian_Traffic_violation.csv) contains traffic signal violation records
The data is stored in a Pandas DataFrame named df
A DataFrame works like a table with rows and columns
After loading the data, it becomes easy to analyze and visualize
— — — — — — — — — — — — — — — — — — — — — — — — — — — — —

df.head()

This function is used to display the first few rows of the dataset.
By default, it shows the first 5 rows.
It helps in quickly checking whether the data has loaded correctly.
It gives an idea about the columns and sample values.

It looks like:

In this table:
→ date : shows the day when the traffic violation was recorded.
→ city : shows the name of the city where the violation occurred.
→ signal_id : represents the unique ID of the traffic signal.
→ location : shows the area where the traffic signal is located.
→ violation_type : tells the type of traffic rule that was broken.
→ vehicle_type : shows the type of vehicle involved in the violation.
→ violation_count : shows the total number of violations recorded.
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

df.info()

This function:

Shows the total number of rows and columns in the dataset
Displays the names of all columns
Shows the data type of each column
Tells how many non-null values are present in each column
Helps in identifying missing values
Gives a clear overview of the dataset structure

It looks like:

→ The dataset is a Pandas DataFrame
→ It contains a total of 20 rows
→ The row index starts from 0 and ends at 19
→ There are 7 columns in the dataset
→ All columns have 20 non-null values, so there are no missing values
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

violation_counts = df['violation_type'].value_counts()
violation_counts

This code counts how many times each type of traffic violation appears in the dataset
value_counts() calculates the frequency of each violation type
The result is stored in a variable named violation_counts
Writing violation_counts on its own line displays the counted values
It helps identify the most common traffic violations
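
One detail is worth a small sketch: value_counts() counts rows, not the numbers stored in the violation_count column. The plain Python example below uses a few hypothetical rows (not the article's dataset) to show that counting rows and summing counts can give different rankings.

```python
from collections import Counter

# Hypothetical rows: (violation_type, violation_count). Not the real dataset.
rows = [
    ("Speeding", 12),
    ("Red Light Jumping", 30),
    ("Speeding", 8),
    ("No Helmet", 5),
]

# What value_counts() measures: how many rows mention each type
row_counts = Counter(violation for violation, _ in rows)

# Alternative: add up the violation_count values for each type
totals = Counter()
for violation, count in rows:
    totals[violation] += count

print(row_counts.most_common(1)[0][0])  # Speeding
print(totals.most_common(1)[0][0])      # Red Light Jumping
```

In pandas, the summed version would typically be written as df.groupby('violation_type')['violation_count'].sum(). Which of the two is appropriate depends on what each row of the dataset represents.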

It looks like:

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

vehicle_counts = df['vehicle_type'].value_counts()
vehicle_counts

This code counts how many times each type of vehicle appears in the dataset

value_counts() calculates the frequency of each vehicle type
The result is stored in a variable named vehicle_counts
Writing vehicle_counts displays the counted values
It helps understand which vehicles are involved in more traffic violations

The output is:

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

signal_counts = df['signal_id'].value_counts()
signal_counts

This code counts how many times each traffic signal ID appears in the dataset

value_counts() is used to calculate the frequency of each signal ID
The result is stored in a variable named signal_counts
Writing signal_counts displays the counted values
It helps identify which traffic signals have more violations

The output is:

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

location_counts = df['location'].value_counts()
location_counts

This code counts how many times each location appears in the dataset

value_counts is used to calculate the frequency of each location
The result is stored in a variable named location_counts
Writing location_counts displays the counted values
It helps identify locations where traffic violations occur more frequently

The output is:

— — — — — — — — — — — — — — — — — — — — — — — — — — — —-

violation_counts.plot(kind='bar')
plt.title("Traffic Violations by Type")
plt.xlabel("Violation Type")
plt.ylabel("Number of Violations")
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

violation_counts.plot(kind='bar') → creates the bar chart
plt.title("Traffic Violations by Type") → sets the chart title
plt.xlabel("Violation Type") → labels the X-axis
plt.ylabel("Number of Violations") → labels the Y-axis
plt.xticks(rotation=45) → rotates the X-axis labels by 45 degrees
plt.tight_layout() → adjusts spacing so labels are not cut off
plt.show() → displays the chart

Graph:

The graph shows different types of traffic violations.

Speeding is the most common violation.
Red light jumping is the second most common violation.
No helmet, signal jumping, and wrong lane violations occur at similar levels.
No seatbelt is the least common violation.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

vehicle_counts.plot(kind='bar')
plt.title("Violations by Vehicle Type")
plt.xlabel("Vehicle Type")
plt.ylabel("Count")
plt.show()

vehicle_counts.plot(kind='bar') → creates a bar chart from the vehicle counts
plt.title("Violations by Vehicle Type") → sets the title of the graph
plt.xlabel("Vehicle Type") → labels the X-axis as vehicle type
plt.ylabel("Count") → labels the Y-axis as number of violations
plt.show() → displays the graph on the screen

Graph :

The graph shows traffic violations by vehicle type.

Bikes are involved in the highest number of violations.
Cars are the second highest in violation count.
Autos have fewer violations compared to bikes and cars.
Buses have the least number of violations.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

violation_counts.idxmax()

This code returns the violation type that occurs most often.

In line with the graph, speeding is the most frequent violation.
-— — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

location_counts.idxmax()

This code returns the location with the most recorded violations.
In this dataset, that location is ‘Connaught Place’.
— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Conclusions :

Speeding is the most common traffic violation, followed by red light jumping.
Bikes are involved in the highest number of violations.
Connaught Place and Kashmere Gate are high-risk locations.
Repeated violations at the same signals indicate the need for stricter monitoring.
Traffic safety awareness and enforcement should be improved during peak hours.

- — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —