Stories by Santhosh Kannan on Medium

Text Classification Using Naive Bayes

Santhosh Kannan — Mon, 08 May 2023 15:26:26 GMT

What is Text Classification?

Text classification is the process of categorizing text data into predefined classes based on its content. It is used in many NLP applications such as sentiment analysis and spam detection. Text classification is a challenging task due to the complex and nuanced nature of human language. Several models can be used to classify unstructured text, including Naive Bayes, Support Vector Machines, and neural networks.

What is Bayes’ Theorem?

Bayes’ theorem is used to find the probability of an event based on prior knowledge or evidence. It is based on conditional probability, which is the probability of an event occurring given that another event has occurred. Mathematically, Bayes’ theorem can be expressed as

$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}$$

Bayes’ Theorem

where P(A | B) is the conditional probability of A given B, P(B | A) is the conditional probability of B given A, P(A) is the prior probability of A, and P(B) is the marginal probability of B.
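As a quick numeric illustration, here is a minimal sketch of the calculation. The events and all the probabilities are invented for the example: A is "the email is spam" and B is "the email contains the word offer". The function name `bayes_posterior` is hypothetical, and P(B) is expanded with the law of total probability.

```python
def bayes_posterior(prior_a, likelihood_b_given_a, likelihood_b_given_not_a):
    """Return P(A | B) using Bayes' theorem.

    P(B) is expanded as P(B | A) P(A) + P(B | not A) P(not A).
    """
    evidence = (likelihood_b_given_a * prior_a
                + likelihood_b_given_not_a * (1 - prior_a))
    return likelihood_b_given_a * prior_a / evidence


# Hypothetical numbers: 20% of emails are spam, "offer" appears in
# 60% of spam emails and in 5% of non-spam emails.
p_spam_given_offer = bayes_posterior(0.2, 0.6, 0.05)
print(round(p_spam_given_offer, 3))  # 0.75
```

Even though only 20% of emails are spam in this made-up setting, seeing the word raises the probability of spam to 75%.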

What is Naive Bayes?

Naive Bayes is a probabilistic machine learning model based on Bayes’ theorem. It is called “naive” because it assumes that the features are independent of one another given the class. It calculates the probability that a given data point belongs to each class and selects the class with the highest probability as the predicted output.
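To make the idea concrete, the following is a minimal from-scratch sketch of a multinomial Naive Bayes classifier for tokenised text. It is an illustration only, not how scikit-learn implements it. The function names, the toy corpus, and the choice of add-one smoothing (`alpha=1.0`) are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Collect the counts needed by a tiny multinomial Naive Bayes model."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab, alpha


def predict_nb(model, tokens):
    """Return the class with the highest log-posterior for `tokens`."""
    class_counts, word_counts, vocab, alpha = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        # log P(class) + sum of log P(word | class): words are treated as
        # independent given the class.
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for w in tokens:
            if w not in vocab:
                continue  # ignore words never seen in training
            p = (word_counts[label][w] + alpha) / (total_words + alpha * len(vocab))
            score += math.log(p)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy corpus, for illustration only
docs = [["cheap", "pills", "offer"], ["meeting", "agenda", "notes"],
        ["offer", "expires", "cheap"], ["project", "meeting", "notes"]]
labels = ["spam", "ham", "spam", "ham"]
model = train_nb(docs, labels)
print(predict_nb(model, ["cheap", "offer", "today"]))  # spam
```

Working with log-probabilities avoids numerical underflow when many small probabilities are multiplied together.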

Text classification using Naive Bayes

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

categories = ['rec.motorcycles', 'sci.electronics',
              'comp.graphics', 'sci.med']

train_data = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
test_data = fetch_20newsgroups(subset='test',
                               categories=categories, shuffle=True, random_state=42)

text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

text_clf.fit(train_data.data, train_data.target)
docs_test = test_data.data
predicted = text_clf.predict(docs_test)
print('We got an accuracy of',np.mean(predicted == test_data.target)*100, '% over the test data.')

The 20 newsgroups dataset comprises around 18000 newsgroups posts on 20 topics split into two subsets: one for training and the other one for testing.

CountVectorizer converts each document into a vector of word counts, with one entry for every word in the vocabulary learned from the training text.

TF-IDF stands for Term Frequency-Inverse Document Frequency. It measures how relevant a word is to a document within a collection (corpus). The weight of a word increases in proportion to the number of times it appears in the document, but it is offset by the number of documents in the corpus that contain the word.
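The two steps can be sketched in plain Python on a made-up corpus. This uses the textbook formula tf × ln(N / df). scikit-learn's TfidfTransformer uses a smoothed IDF and normalises each vector by default, so its numbers will differ. The function name `tf_idf` and the corpus are assumptions for the example.

```python
import math
from collections import Counter

# Made-up three-document corpus, already lower-cased and tokenised
corpus = [
    ["the", "engine", "is", "loud"],
    ["the", "circuit", "is", "small"],
    ["the", "engine", "needs", "oil"],
]


def tf_idf(corpus):
    """Return one {word: weight} dict per document (textbook TF-IDF)."""
    n_docs = len(corpus)
    # Document frequency: in how many documents does each word occur?
    df = Counter(word for doc in corpus for word in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)  # raw term counts for this document
        weights.append({
            word: (count / len(doc)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return weights


weights = tf_idf(corpus)
for word, weight in sorted(weights[0].items(), key=lambda kv: -kv[1]):
    print(f"{word:8s}{weight:.3f}")
```

In the first document, "the" occurs in every document and gets a weight of zero, while "loud", which occurs nowhere else, gets the highest weight.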

What are the advantages of using Naive Bayes for text classification?

Can be implemented quickly and efficiently
Can handle very large volumes of data and high dimensional data
Good performance on many real-world problems, often outperforming complex models
Interpretable results

What are the disadvantages of using Naive Bayes for text classification?

Assumes that the features are independent of each other which is often not the case for real-world problems.
Faces the ‘zero-frequency problem’ where it assigns zero probability to a categorical variable whose category in the test data set wasn’t available in the training dataset.
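The zero-frequency problem is usually handled with additive (Laplace) smoothing, which scikit-learn's MultinomialNB exposes through its `alpha` parameter. A minimal sketch of the effect, with hypothetical counts and a hypothetical helper name:

```python
def word_likelihood(count, total_words, vocab_size, alpha):
    """Estimate P(word | class) with additive smoothing."""
    return (count + alpha) / (total_words + alpha * vocab_size)


# Hypothetical counts: the word never occurred in this class during training
unsmoothed = word_likelihood(count=0, total_words=1000, vocab_size=5000, alpha=0.0)
smoothed = word_likelihood(count=0, total_words=1000, vocab_size=5000, alpha=1.0)
print(unsmoothed, smoothed)
```

Without smoothing the estimate is exactly zero, which wipes out the whole product of probabilities for that class. With `alpha=1.0` the unseen word gets a small but non-zero probability.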

Text Classification Using Naive Bayes was originally published in featurepreneur on Medium, where people are continuing the conversation by highlighting and responding to this story.

Combating overfitting with L1 and L2 regularization

Santhosh Kannan — Mon, 08 May 2023 15:24:44 GMT

What is overfitting?

Overfitting is a very common problem in machine learning where a model performs impressively on training data but performs poorly on new, unseen data. This is because the model has learned patterns in the training data which don’t generalize to the whole population.

How to deal with overfitting?

Use simpler models
Collect more data
Use cross-validation
Early stopping
Regularization

What is Regularization?

Regularization is a technique in which the model parameters are constrained, which reduces the risk of overfitting. It works by adding a penalty to the cost function the model is trying to minimize. By doing so, regularization pushes the model to capture the overall structure of the data rather than fit the training data as closely as possible. The two common types of regularization are L1 (Lasso) and L2 (Ridge).

What is L1 Regularization?

L1 Regularization, also known as Lasso regularization, adds a penalty to the cost function that is proportional to the absolute value of the model weights.

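$$\text{Loss} = \text{Error}(y, \hat{y}) + \lambda \sum_{i=1}^{n} |w_i|$$

Loss function during L1 regularization, where Error is the unregularized cost, $w_i$ are the model weights, and $\lambda \ge 0$ sets the strength of the penalty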

This results in a sparse model in which many weights are exactly zero. This can be helpful when the number of features is large and only a subset of important ones is to be selected. L1 regularization is also robust to noise in data, as it is less sensitive to small, outlying values.

L1 regularization is commonly used with linear models, such as linear regression. L1 regularization is sensitive to the scaling of data, hence it is a good idea to standardize the features before using L1 regularization.
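To see why an absolute-value penalty produces weights that are exactly zero, it helps to look at a single weight in isolation. A minimal sketch, assuming a one-dimensional cost 0.5·(w − a)² + λ|w|, where a is the value the weight would take without regularization. The minimiser of that cost is the soft-thresholding rule below. The weights and λ are invented for the example, and a full Lasso solver has to handle many interacting weights at once.

```python
def soft_threshold(a, lam):
    """Minimiser of 0.5 * (w - a)**2 + lam * abs(w) over w."""
    if a > lam:
        return a - lam
    if a < -lam:
        return a + lam
    return 0.0  # small weights are set exactly to zero


# Hypothetical unregularized weights
unregularized = [3.0, 0.4, -0.2, -1.5]
lasso_like = [soft_threshold(a, lam=0.5) for a in unregularized]
print(lasso_like)  # [2.5, 0.0, 0.0, -1.0]
```

Every weight whose magnitude is below λ is removed outright, and the remaining ones are pulled toward zero by λ.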

What is L2 Regularization?

L2 Regularization, also known as Ridge regularization, adds a penalty to the cost function proportional to the square of the model weights.

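$$\text{Loss} = \text{Error}(y, \hat{y}) + \lambda \sum_{i=1}^{n} w_i^2$$

Loss function during L2 regularization, where Error is the unregularized cost, $w_i$ are the model weights, and $\lambda \ge 0$ sets the strength of the penalty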

This results in a model with small, non-zero weights. L2 regularization is sensitive to outliers; hence, outliers should be handled before using L2 regularization.

L2 Regularization is commonly used with linear models, such as linear regression. Like L1 regularization, L2 regularization is also sensitive to the scaling of data.
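The shrinking effect of a squared penalty can be seen on a single weight. A minimal sketch, assuming a one-dimensional cost 0.5·(w − a)² + λw², where a is the value the weight would take without regularization. Setting the derivative to zero gives w = a / (1 + 2λ). The weights and λ are invented for the example.

```python
def ridge_shrink(a, lam):
    """Minimiser of 0.5 * (w - a)**2 + lam * w**2 over w."""
    return a / (1.0 + 2.0 * lam)


# Hypothetical unregularized weights
unregularized = [3.0, 0.4, -0.2, -1.5]
ridge_like = [ridge_shrink(a, lam=0.5) for a in unregularized]
print(ridge_like)  # [1.5, 0.2, -0.1, -0.75]
```

Every weight is scaled down by the same factor, so none of them becomes exactly zero unless it was zero to begin with.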

Implementation of Regularization using Scikit-learn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split
# Note: load_boston was removed in scikit-learn 1.2, so this example needs an older release
from sklearn.datasets import load_boston
from sklearn.metrics import mean_squared_error

bos = load_boston()
df = pd.DataFrame(bos.data, columns = bos.feature_names)
df['PRICE'] = bos.target

x_train, x_test, y_train, y_test = train_test_split(df.drop("PRICE",axis=1),df["PRICE"],test_size=.2)

lreg = LinearRegression()
lreg.fit(x_train,y_train)
lreg_y_pred = lreg.predict(x_test)

lrid = Ridge()
lrid.fit(x_train,y_train)
lrid_y_pred = lrid.predict(x_test)

lasso = Lasso()
lasso.fit(x_train,y_train)
lasso_y_pred = lasso.predict(x_test)

plt.style.use('ggplot')  # set the style before the figure is created
fig, ax = plt.subplots(figsize=(20,7))
x = np.arange(len(x_train.columns))  # one group of bars per feature
ax.bar(x-0.2, lreg.coef_, width=0.2, color='b', align='center',label="Linear Regression")
ax.bar(x, lrid.coef_, width=0.2, color='g', align='center', label="Ridge Regression")
ax.bar(x+0.2, lasso.coef_, width=0.2, color='r', align='center',label="Lasso Regression")
ax.set_xticks(x)
ax.set_xticklabels(x_train.columns)  # label each group with its feature name
ax.spines['bottom'].set_position('zero')
plt.legend()
plt.show()

Coefficients for linear, ridge, and lasso regressions

What are the disadvantages of Ridge Regularization?

It may not perform well when there are a lot of features.
Doesn’t perform feature selection — it keeps all the features in the model and only reduces the magnitude of their coefficients.

What are the disadvantages of Lasso Regularization?

Eliminates some of the features by setting their coefficients to zero, potentially excluding important features.
It may not perform well when there is multicollinearity between features.

Combating overfitting with L1 and L2 regularization was originally published in featurepreneur on Medium, where people are continuing the conversation by highlighting and responding to this story.

Beautify your Python code with Decorators

Santhosh Kannan — Mon, 08 May 2023 15:23:20 GMT

What are Decorators?

Decorators are functions that modify or add additional functionality to the behavior of other functions or classes without changing the source code of the other functions. In Python, decorators can be identified by the “@” symbol followed by the decorator's name. This is placed immediately before the definition of a function or class that is being decorated.

What is the need for Decorators?

Consider the user info endpoint in a website that displays the user’s information. This endpoint must display the information only if the user is logged in. If the user is not logged in, it should redirect to the login page.

One can modify the source code of this endpoint to check if the user is logged in and then decide whether to display the info or redirect. Although it seems like an easy problem to fix, consider that there are many endpoints that must work only if the user is logged in. Changing the source code for each endpoint to check and redirect leads to code duplication.

Decorators provide the best solution to this problem and allow not only code re-usability and maintenance but also various other functionalities. Let’s look at how decorators can be used to solve the above problem and a few other examples where decorators can be used.

What is the syntax for Decorators?

import functools

def decorator_function(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # do something before the function call
        value = func(*args, **kwargs)
        # do something after the function call
        return value
    
    return wrapper

@decorator_function
def new_func():
    # do something
    pass

This is the general syntax for a decorator function. The decorator_function is defined just like any other function except that it takes a function reference that needs to be wrapped as the argument.

It is a good practice to use the functools.wraps decorator to preserve the information about the original function or class, otherwise, information such as the original function’s name, docstring, arguments, etc., will be overwritten with the decorator's information.
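The difference is easy to check by decorating the same kind of function with and without functools.wraps. A small sketch with hypothetical decorator and function names:

```python
import functools


def plain_decorator(func):
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


def wrapping_decorator(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs)
    return wrapper


@plain_decorator
def greet_plain():
    """Say hello."""


@wrapping_decorator
def greet_wrapped():
    """Say hello."""


print(greet_plain.__name__, greet_plain.__doc__)      # wrapper None
print(greet_wrapped.__name__, greet_wrapped.__doc__)  # greet_wrapped Say hello.
```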

A wrapper function that accepts any number of positional and keyword arguments is defined, and it calls the original function with those same arguments. It can run additional code before and/or after calling the original function.

The “@” symbol along with the decorator function name can then be used to wrap any function and provide it with additional functionalities.
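The “@” line is shorthand for passing the function to the decorator and rebinding the name to the result. A short sketch showing both spellings side by side (the `shout` decorator is a hypothetical example):

```python
import functools


def shout(func):
    """Hypothetical decorator that upper-cases a function's string result."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        return func(*args, **kwargs).upper()
    return wrapper


@shout
def greet(name):
    return f"hello {name}"


def greet_manual(name):
    return f"hello {name}"


# Equivalent to placing @shout above the definition
greet_manual = shout(greet_manual)

print(greet("ada"), greet_manual("ada"))  # HELLO ADA HELLO ADA
```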

What are some examples where decorators can be used?

1. Logging

A simple logging decorator can be created that logs all the arguments and the return value when a function is called.

import functools

def log_function(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        args_repr = [repr(a) for a in args]
        kwargs_repr = [f"{k}={v!r}" for k, v in kwargs.items()]
        signature = ", ".join(args_repr + kwargs_repr)
        
        print(f"Calling function {func.__name__}({signature})")
        
        value = func(*args, **kwargs)
        print(f"function {func.__name__!r} returned {value!r}")
        
        return value
    
    return wrapper

@log_function
def is_prime(n):
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    k = 3
    while k*k <= n:
        if n % k == 0:
            return False
        k += 2
    return True

@log_function
def hello():
    print("Hello World")
    
hello()
print(is_prime(25))

#########################
# Output: 
# 
# Calling function hello()
# Hello World
# function 'hello' returned None
# Calling function is_prime(25)
# function 'is_prime' returned False
# False
#########################

2. Timing

A timing decorator can be created that outputs the time taken to complete the function call.

import functools
import time

def timer(func):
    @functools.wraps(func)
    def wrapper(*args, **kargs):
        start_time = time.perf_counter()
        value = func(*args, **kargs)
        end_time = time.perf_counter()
        run_time = end_time - start_time
        print(f"Finished {func.__name__!r} in {run_time:.4f} secs")
        return value
    
    return wrapper

@timer
def sum_upto(n):
    s = 0
    for i in range(1,n+1):
        s+=i
    print(s)

sum_upto(100000)

#########################
# Output: 
# 
# 5000050000
# Finished 'sum_upto' in 0.0029 secs
#########################

3. Login verification

import functools
from flask import Flask, abort, session

app = Flask(__name__)
app.secret_key = "dsajdnsakj"

def require_login(func):
    @functools.wraps(func)
    def wrapper(*args, **kargs):
        if "logged_in" in session:
            return func(*args, **kargs)
        return abort(401)
    
    return wrapper

@app.route('/user_info')
@require_login
def user_info():
    return "User Info"

@app.route('/')
def index():
    return "Index Page"

@app.route('/login')
def login():
    session["logged_in"] = True
    return "Logged in"


@app.route('/logout')
@require_login
def logout():
    session.pop("logged_in")
    return "Logged out"

if __name__== "__main__":
    app.run(host="0.0.0.0", debug = True, port = 8000)
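The guard logic can also be tried without starting a web server. The sketch below is a framework-free stand-in, not Flask code: a plain dict plays the role of Flask's session, and a hypothetical NotLoggedIn exception replaces the HTTP 401 response.

```python
import functools

session = {}  # stand-in for Flask's session object


class NotLoggedIn(Exception):
    """Raised instead of returning an HTTP 401 response."""


def require_login(func):
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        if "logged_in" not in session:
            raise NotLoggedIn(func.__name__)
        return func(*args, **kwargs)
    return wrapper


@require_login
def user_info():
    return "User Info"


try:
    user_info()
except NotLoggedIn as exc:
    print("blocked:", exc)  # blocked: user_info

session["logged_in"] = True
print(user_info())          # User Info
```

Because the check lives in one decorator, every protected function gets the same behavior by adding a single line above its definition.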

Beautify your Python code with Decorators was originally published in featurepreneur on Medium, where people are continuing the conversation by highlighting and responding to this story.

Modeling spread of infectious disease — SIR Model

Santhosh Kannan — Fri, 07 Apr 2023 06:47:24 GMT

Predicting the Spread of Infectious Disease: An SIR Model Approach

What is Mathematical Modeling?

Mathematical Modeling is the process of describing a real-world system using mathematical structures, equations, and functions. It is used to describe, analyze and predict various real-world systems like the spread of infectious disease, the interaction of species in an ecosystem, design and optimization in engineering, etc.

How is mathematical modeling used in Epidemiology?

Mathematical modeling is used in epidemiology to study the spread of infectious diseases through the population. This helps in understanding the transmission of the disease and predicting the extent of the outbreak. It can also be used to evaluate the impact of vaccination programs, social distancing, and the use of masks.

What is the SIR Model?

The SIR Model is one of the most commonly used models to study the spread of infectious diseases in a population. The model gets its name from the three groups of the population according to this model — Susceptible, Infected, and Recovered. The model tracks the number of people in each group over time.

Susceptible refers to people who have not been infected yet but can get infected if they come into contact with an infected person. Infected refers to people who are currently infected with the disease and can spread it to susceptible people. Recovered refers to the people who have recovered from the disease and are no longer infectious.

What are the equations for the SIR Model?

Let N be the total population. Let S(t), I(t), and R(t) represent the number of susceptible, infected, and recovered people at time t respectively. The total population is spread across these three groups.

Let Beta and Gamma represent the rate of infection of the susceptible population and the rate of recovery of the infected population respectively.

The rates of change of the susceptible, infected, and recovered populations with respect to time are as follows:

$$\frac{dS}{dt} = -\frac{\beta S I}{N}, \qquad \frac{dI}{dt} = \frac{\beta S I}{N} - \gamma I, \qquad \frac{dR}{dt} = \gamma I$$

The rate of change of the susceptible population depends on the infection rate and on the interaction between the susceptible and infected populations. The negative sign indicates that the susceptible population decreases over time.

The rate of change of the recovered population depends on the size of the infected population and the rate of recovery. The positive sign indicates that the recovered population increases over time.

The rate of change of the infected population is the number of people who become infected minus the number of people who recover.
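These rates of change can be stepped forward in time by hand before reaching for an ODE solver. A minimal forward-Euler sketch in plain Python follows. The function name, the step size `dt`, and the parameter values are illustrative assumptions, and Euler's method is less accurate than a proper solver.

```python
def sir_euler(s0, i0, r0, beta, gamma, days, dt=0.1):
    """Integrate the SIR equations with fixed-step forward Euler."""
    n = s0 + i0 + r0
    s, i, r = float(s0), float(i0), float(r0)
    history = [(s, i, r)]
    steps_per_day = round(1 / dt)
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / n  # flow from S to I
            new_recoveries = gamma * i         # flow from I to R
            s -= new_infections * dt
            i += (new_infections - new_recoveries) * dt
            r += new_recoveries * dt
        history.append((s, i, r))  # one entry per simulated day
    return history


# Illustrative values only
history = sir_euler(s0=9_999_500, i0=500, r0=0, beta=0.39, gamma=0.1, days=100)
s_end, i_end, r_end = history[-1]
print(round(s_end + i_end + r_end))  # the total population is conserved
```

Whatever leaves one group enters another, so S + I + R stays equal to N at every step.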

Implementation of the SIR Model in Python

import numpy as np
import matplotlib.pyplot as plt
from scipy.integrate import odeint

def dAdt(A,t,beta,gamma,N):
    S = A[0]
    I = A[1]
    R = A[2]
    return [
        -beta*S*I/N,
        beta*S*I/N - gamma*I,
        gamma*I
    ]

days = np.arange(0, 100, 1)
GAMMA = 0.1
N = 1e7
BETA = 0.39
S0, I0, R0 = N-500, 500, 0

sol = odeint(dAdt,y0=[S0,I0,R0], t=days,args=(BETA,GAMMA,N))

S,I,R = sol.T[0],sol.T[1],sol.T[2]

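plt.plot(days,S,label="Susceptible")
plt.plot(days,I,label="Infected")
plt.plot(days,R,label="Recovered")
plt.xlabel("Days")
plt.ylabel("Number of people")
plt.legend()
plt.show()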

Here, we run the model for 100 days. The total population is taken as 10 million, the initial infected population is 500, and the initial recovered population is zero.

How does the SIR Model help?

Predicting the extent of the spread of the disease will help public health officials make informed decisions to slow or stop the spread of the disease.
It can be used to evaluate the effectiveness of measures like vaccination camps, social distancing, and quarantine policies.
It helps healthcare officials allocate resources such as beds and medical supplies to respond to the disease appropriately.

What are the limitations of the SIR Model?

Assumes that all individuals have the same likelihood of getting infected. In reality, different groups of people have different levels of susceptibility to the disease.
Assumes that the population is closed, that is, no new individuals enter or leave the population.
Assumes that all infected individuals recover from the disease and become immune to it. It does not take into consideration that people may never recover and also may get re-infected in the future.

Modeling spread of infectious disease — SIR Model was originally published in featurepreneur on Medium, where people are continuing the conversation by highlighting and responding to this story.

Pandas Performance: Optimizing Your Data Analysis Workflow

Santhosh Kannan — Sun, 05 Mar 2023 13:20:09 GMT

What is Pandas?

Pandas is an open-source library for data manipulation and analysis. Its DataFrame and Series provide an easy and simple way of handling numerical tables and time series data. It has methods for filtering, grouping, and reshaping data, as well as handling missing data, and merging and visualizing the data.

Why might your Pandas code need to be faster?

Although Pandas is a powerful library for handling data, it can perform slower than expected in various scenarios. These include working with large datasets, complex data transformation, iterative operations, and non-vectorized operations.

But, most of the time, the slowness of Pandas arises from using unoptimized code and techniques. Here, we will look at such examples and provide the necessary correction that needs to be made to reduce the run-time of your data analysis.

Dataset used

The CSV files of the datasets used in the examples can be generated by running this Python code.

import pandas as pd
import numpy as np
import random

companies_list = ["Amazon", "Google", "Infosys", "Mastercard", "Microsoft", 
            "Uber", "IBM", "Apple", "Wipro", "Cognizant"]

jobs_list = ["Software Development Engineer", "Machine Learning Engineer",
            "Data Scientist", "Data Analyst", 
            "Artificial Intelligence Engineer", "Back-end Developer", 
            "Front-end Developer", "Research Scientist", 
            "IOS Developer", "Android Developer"]

cities_list = ["Alberta", "British Columbia", "Manitoba", "New Brunswick", "Newfoundland and Labrador",
             "Ontario", "Quebec", "Nunavut", "Prince Edward Island", "Northwest Territories"]

data1 = []
data2 = []
for i in range(4_096_000):

    emp_id = i+1
    company = random.choice(companies_list)
    job = random.choice(jobs_list)
    city = random.choice(cities_list)
    salary = int(round(np.random.rand(), 3)*10**6)
    employment = random.choices(["Full Time", "Intern"])[0]
    rating = round((np.random.rand()*5), 1)
    
    data1.append([emp_id,company, job, city])
    data2.append([emp_id, salary, employment, rating])
    
data = pd.DataFrame(data1, columns=["Employee Id", "Company Name", "Title",
                                   "Location"])
data.to_csv("dataset1.csv",index=False)
data = pd.DataFrame(data2, columns=["Employee Id","Employee Salary", "Employment Status", "Employee Rating"])
data.to_csv("dataset2.csv",index=False)

1. Filtering categorical data

We frequently have to filter out data frames and select only a part of it that satisfies a particular condition. Normally filtering is done by indexing the data frame with a boolean mask as follows

dataset1 = pd.read_csv("dataset1.csv")
dataset2 = pd.read_csv("dataset2.csv")

%%timeit
dataset1[dataset1["Location"] == "Ontario"]

############################################################
# 175 ms ± 3.46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
############################################################

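A faster way of doing the same is to group the dataset by a column once with the groupby() method and then fetch individual groups with the get_group() method. The grouping is built once and reused, so this pays off mainly when several groups are retrieved from the same DataFrame.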

data_grp = dataset1.groupby("Location")

%%timeit
data_grp.get_group("Ontario")

############################################################
# 18.5 ms ± 287 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
############################################################
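Before swapping one approach for the other, it is worth confirming that both return the same rows. A small check on a made-up frame that reuses the column names of dataset1:

```python
import pandas as pd

# Tiny made-up frame with the same column names as dataset1
df = pd.DataFrame({
    "Employee Id": [1, 2, 3, 4],
    "Location": ["Ontario", "Quebec", "Ontario", "Manitoba"],
})

by_mask = df[df["Location"] == "Ontario"]
by_group = df.groupby("Location").get_group("Ontario")

# Both selections keep the original row labels and columns
print(by_mask.equals(by_group))
```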

2. Combining DataFrames

Two data frames can be combined based on a common column to get a single data frame. This is similar to the join statement in SQL. Two data frames can be merged as follows:

%%timeit
pd.merge(dataset1,dataset2,on="Employee Id", how="inner")

############################################################
# 970 ms ± 32.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
############################################################

A faster way to do the same is to use the join() method by setting the index of both the DataFrames as the common column to join on.

%%timeit
dataset1.set_index("Employee Id").join(dataset2.set_index("Employee Id"))

############################################################
# 386 ms ± 6.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
############################################################
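One difference to keep in mind is that join() performs a left join by default and leaves the key in the index, whereas the merge above is an inner join with the key as a column. When both frames contain the same keys, as in the generated datasets, the results match. A small check on made-up miniature frames:

```python
import pandas as pd

# Made-up miniature versions of dataset1 and dataset2
left = pd.DataFrame({"Employee Id": [1, 2, 3], "Title": ["A", "B", "C"]})
right = pd.DataFrame({"Employee Id": [1, 2, 3], "Employee Salary": [10, 20, 30]})

merged = pd.merge(left, right, on="Employee Id", how="inner")
joined = left.set_index("Employee Id").join(right.set_index("Employee Id"))

# join() keeps the key in the index; reset_index() moves it back to a column
print(joined.reset_index().equals(merged))
```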

3. Iterating over a DataFrame

The process of visiting every row in a data frame and performing an operation on it is called iterating or looping over a data frame. Iterating over a data frame is costly, so it is usually avoided in favor of vectorized operations.

%%timeit
lst = []
for row in dataset2["Employee Salary"]:
    lst.append(row/12)

############################################################
# 412 ms ± 17.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
############################################################

%%timeit
lst = dataset2["Employee Salary"]/12

############################################################
# 6.08 ms ± 66.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
############################################################

However, there are situations where looping is unavoidable. Pandas provides two methods to iterate through a DataFrame: iterrows() and itertuples(). Although both methods can be used to do the same operation, itertuples() is much faster than iterrows().

%%timeit
lst = []
for row in dataset2.itertuples():
    lst.append(row._2/12)

############################################################
# 2.48 s ± 160 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
############################################################
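The attribute name `_2` in the loop comes from how itertuples() builds its named tuples: column names that are not valid Python identifiers, such as "Employee Salary", are replaced by positional names. A short sketch on a made-up frame, which also shows the `index=False, name=None` variant that yields plain tuples:

```python
import pandas as pd

df = pd.DataFrame({"Employee Id": [1, 2], "Employee Salary": [1200, 2400]})

first = next(df.itertuples())
print(first._fields)  # ('Index', '_1', '_2')

# index=False, name=None yields plain tuples that can be unpacked by position
monthly = [salary / 12 for _emp_id, salary in df.itertuples(index=False, name=None)]
print(monthly)
```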

4. Mentioning the datatypes

When Pandas reads a CSV file and converts it into a DataFrame, it infers the data type of each column and assigns a wide default such as int64 or float64. In many cases, such a large datatype is unnecessary. Hence, the datatypes for the columns can be specified when reading the CSV itself by passing a dtype dict.

dataset2 = pd.read_csv("dataset2.csv")
dataset2.info()

"""

RangeIndex: 4096000 entries, 0 to 4095999
Data columns (total 4 columns):
 #   Column             Dtype  
---  ------             -----  
 0   Employee Id        int64  
 1   Employee Salary    int64  
 2   Employment Status  object 
 3   Employee Rating    float64
dtypes: float64(1), int64(2), object(1)
memory usage: 125.0+ MB
"""

# Pick types wide enough for the data: Employee Id goes up to 4,096,000 and
# Employee Salary up to 1,000,000, so both need uint32 (uint8 stops at 255
# and uint16 at 65,535).
dtypes = {
    "Employee Id":"uint32",
    "Employee Salary":"uint32",
    "Employment Status":"object",
    "Employee Rating":np.float16
}
dataset2 = pd.read_csv("dataset2.csv",dtype=dtypes)
dataset2.info()

"""

RangeIndex: 4096000 entries, 0 to 4095999
Data columns (total 4 columns):
 #   Column             Dtype  
---  ------             -----  
 0   Employee Id        uint32 
 1   Employee Salary    uint32 
 2   Employment Status  object 
 3   Employee Rating    float16
dtypes: float16(1), object(1), uint32(2)
memory usage: 70.3+ MB
"""

By specifying smaller datatypes that still fit the data, the size of the DataFrame drops from about 125 MB to about 70 MB (4,096,000 rows at 18 bytes per row instead of 32).
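When choosing a smaller integer type, it helps to check that the type can actually hold the largest value in the column. A brief sketch using NumPy's `iinfo` and pandas' `to_numeric` with `downcast`, which picks the smallest type that fits. The sample values are made up to mirror the range of the Employee Id column.

```python
import numpy as np
import pandas as pd

# Largest value each unsigned type can store
print(np.iinfo("uint8").max, np.iinfo("uint16").max, np.iinfo("uint32").max)
# 255 65535 4294967295

ids = pd.Series([1, 2_000_000, 4_096_000])       # int64 by default
small = pd.to_numeric(ids, downcast="unsigned")  # smallest unsigned type that fits
print(small.dtype, ids.memory_usage(index=False), small.memory_usage(index=False))
```

Here the values need uint32, which still halves the memory used by the column compared with int64.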

Pandas Performance: Optimizing Your Data Analysis Workflow was originally published in featurepreneur on Medium, where people are continuing the conversation by highlighting and responding to this story.

Factory Method Pattern in Python

Santhosh Kannan — Sun, 26 Feb 2023 13:34:50 GMT

What are Design Patterns?

A Design Pattern is a general solution applicable to commonly occurring problems in software design. In simpler words, it is a template for how to solve a problem, which can be used in several situations. By using design patterns, the code can be made more flexible, reusable and maintainable.

What are the types of design patterns?

Creational Design Patterns
Structural Design Patterns
Behavioral Design Patterns

What is a Creational Design Pattern?

Creational Design Patterns deal with object creation mechanisms. This involves creating the object based on the current situation. They reduce complexities and instability by creating objects in a controlled manner.

What is the Factory Method Pattern?

The Factory Method pattern provides a common interface for creating objects in a superclass, but allows child classes to change the type of object that will be created. Using this, the process of creating an object is separated from the code that depends on the interface of the object.

The main objective of a factory method is to encapsulate the object creation logic in a separate component. This component can then be used to create objects without knowing the object’s implementation details.

How to implement the Factory Method?

Create an abstract base class for the factory method. This class is used to implement a common interface for creating objects of a specific type, but not the actual implementation.
Create two or more child classes that implement the factory method. These classes should have the actual implementation for creating the objects.
The client code will use the factory method to create the object of a specific type without actually knowing the implementation details of how the object is created.

Example — File Writer in Python

Let’s look at an example implementation of the factory method. In this example, a file writer will be implemented which will save data in text, CSV, or JSON format.

Creating the abstract base class

from abc import ABC, abstractmethod

class FileWriter(ABC):
    def get_writer_type(self):
        return self.writer_type
    
    @abstractmethod
    def save(self, filepath, data):
        pass

The abc library provides the necessary tools to create an abstract base class. Here, we define an abstract base class FileWriter with two methods: the factory method save() and get_writer_type().

The factory method save() will be defined by the child classes with the actual implementation to create the files. The method get_writer_type() will be used to return the type of writer that was created.
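Marking save() as abstract also means Python refuses to instantiate the base class, or any subclass that does not implement it. A short sketch of that behavior (IncompleteWriter is a hypothetical subclass added for the demonstration):

```python
from abc import ABC, abstractmethod


class FileWriter(ABC):
    @abstractmethod
    def save(self, filepath, data):
        """Write `data` to `filepath`."""


class IncompleteWriter(FileWriter):
    """Hypothetical subclass that forgets to implement save()."""


for cls in (FileWriter, IncompleteWriter):
    try:
        cls()
    except TypeError as exc:
        print(f"{cls.__name__}: {exc}")
```

Both attempts raise a TypeError, so a missing implementation is caught when the object is created rather than when save() is first called.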

Creating the subclasses

import json

class TextFileWriter(FileWriter):
    def __init__(self):
        self.writer_type = "txt"
    
    def save(self, filepath, data):
        with open(filepath,"w") as f:
            f.write(data)

class CsvFileWriter(FileWriter):
    def __init__(self):
        self.writer_type = "csv"
    
    def save(self, filepath, data):
        with open(filepath,"w") as f:
            for row in data:
                f.write(",".join(row)+"\n")

class JSONFileWriter(FileWriter):
    def __init__(self):
        self.writer_type = "json"
    
    def save(self, filepath, data):
        with open(filepath,"w") as f:
            json.dump(data, f)

Here, three child classes that inherit from the abstract base class FileWriter are created: TextFileWriter, CsvFileWriter, and JSONFileWriter. These child classes provide the actual implementation of the factory method to create the necessary files. The writer_type is also defined in the __init__() method with the respective format.

Creating the Factory Class

class WriterFactory:
    
    def get_writer(self,format):
        if format == "txt":
            return TextFileWriter()
        elif format == "csv":
            return CsvFileWriter()
        elif format == "json":
            return JSONFileWriter()
        else:
            raise TypeError("Specify a valid file format")

The WriterFactory will take care of creating the necessary FileWriter from the argument of the file format. The get_writer() method will create and return the respective file writer objects. This will be used by the client code to create the file writer objects.
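As the number of formats grows, the if/elif chain can be replaced by a dictionary that maps each format to its class. This is a variation on the article's factory, not its method. The sketch below uses stripped-down stand-in writer classes, the parameter is named `fmt` to avoid shadowing the built-in `format`, and it raises ValueError rather than TypeError for an unknown format.

```python
class TextFileWriter:
    writer_type = "txt"


class JSONFileWriter:
    writer_type = "json"


class WriterFactory:
    # Maps a format string to the class that handles it
    _writers = {"txt": TextFileWriter, "json": JSONFileWriter}

    def get_writer(self, fmt):
        try:
            return self._writers[fmt]()
        except KeyError:
            raise ValueError(f"Unsupported file format: {fmt!r}") from None


factory = WriterFactory()
print(factory.get_writer("json").writer_type)  # json
```

Adding a new format then only requires a new class and one extra dictionary entry.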

The whole code

from abc import ABC, abstractmethod
import json

class FileWriter(ABC):
    def get_writer_type(self):
        return self.writer_type
    
    @abstractmethod
    def save(self, filepath, data):
        pass
    
class TextFileWriter(FileWriter):
    def __init__(self):
        self.writer_type = "txt"
    
    def save(self, filepath, data):
        with open(filepath,"w") as f:
            f.write(data)

class CsvFileWriter(FileWriter):
    def __init__(self):
        self.writer_type = "csv"
    
    def save(self, filepath, data):
        with open(filepath,"w") as f:
            for row in data:
                f.write(",".join(row)+"\n")

class JSONFileWriter(FileWriter):
    def __init__(self):
        self.writer_type = "json"
    
    def save(self, filepath, data):
        with open(filepath,"w") as f:
            json.dump(data, f)

class WriterFactory:
    
    def get_writer(self,format):
        if format == "txt":
            return TextFileWriter()
        elif format == "csv":
            return CsvFileWriter()
        elif format == "json":
            return JSONFileWriter()
        else:
            raise TypeError("Specify a valid file format")

The client code

from FileWriterFactory import WriterFactory

writer_factory = WriterFactory()

text_writer = writer_factory.get_writer("txt")
print(text_writer.get_writer_type())
text_writer.save("data.txt","hello world!")

The client code imports the WriterFactory class (the import assumes the whole code above is saved as FileWriterFactory.py). After creating a WriterFactory object, the get_writer() method can be used to create the respective file writer based on the file type passed as the argument. The save() method can then be called with the file path and the data to be saved.

What are the advantages of Factory Method?

Hides the actual implementation of the objects
Provides a common interface for the creation of objects with different implementations
Ensures reusability and uniformity while minimising potential code errors

What are the disadvantages of Factory Method?

Can be difficult to read, as all the code sits behind an abstraction that may in turn hide more abstractions
Produces many parallel class hierarchies as more types of objects have to be implemented

Factory Method Pattern in Python was originally published in featurepreneur on Medium, where people are continuing the conversation by highlighting and responding to this story.

Speeding up image processing using multiprocessing in Python

Santhosh Kannan — Fri, 03 Feb 2023 14:33:05 GMT

When building a machine learning model, the key factor that decides how well the model will perform is the data. So it is a necessary step to always preprocess the data before using it to train the model. In the case of images, their size, orientation, color, etc., have to be adjusted into the desired format to get better results.

However, image preprocessing is computationally expensive. It can be sped up with various techniques, one of which is to process the images concurrently rather than one by one. Let us look at how to process images concurrently using multi-threading and multiprocessing in Python.

Why do we have to go for multi-threading or multiprocessing?

Let’s say you have a function process_images that takes an image as input and returns the processed image as output. You loop through all the images in your dataset and call process_images on each one.

processed_images = [process_images(image) for image in dataset]

This will process the images one by one, i.e., the next image will be processed only after the previous image has been processed.

Normal flow

However, this can be sped up by processing the images concurrently with several images being processed simultaneously.

What is multithreading?

Multi-threading splits the work of a single process across multiple threads that run concurrently and share the same memory. In our example, processing all the images in the dataset is the process, and it is broken into several threads where each thread processes one image at a time.

Multi-threading

How to implement multi-threading in Python?

Threading can be implemented by using the built-in threading module, which contains the methods for creating and working with threads. However, an easier way to start a group of threads is by using the ThreadPoolExecutor, which is part of the built-in concurrent.futures library.

from concurrent.futures import ThreadPoolExecutor

# import the dataset
# define the process_images() function

with ThreadPoolExecutor() as executor:
    futures = []
    for image in dataset:
        futures.append(executor.submit(process_images, image))

processed_images = [] # To store the processed images
for future in futures:
    processed_images.append(future.result())
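A more compact variant can be sketched with executor.map(), which submits one task per item and returns the results in input order. The process_images stand-in and the toy dataset below are placeholders for illustration, not real image code.

```python
from concurrent.futures import ThreadPoolExecutor


def process_images(image):
    # Placeholder for real preprocessing: here an "image" is a list of pixel
    # values and processing just inverts each one.
    return [255 - pixel for pixel in image]


# Toy dataset standing in for real images.
dataset = [[0, 10, 20], [30, 40, 50], [60, 70, 80]]

with ThreadPoolExecutor(max_workers=4) as executor:
    # map() returns results in the same order as the inputs.
    processed_images = list(executor.map(process_images, dataset))

print(processed_images)
```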

What is the Python Global Interpreter Lock?

The Python Global Interpreter Lock, or GIL, is a lock that allows only one thread at a time to hold control of the Python interpreter. That means only one thread can be executing Python code at any moment.

The GIL exists because of a thread-safety issue in CPython, the widely used C implementation of Python (the one downloadable from python.org). CPython manages memory with reference counting, which is not thread-safe on its own: if multiple threads updated the same data concurrently, race conditions and synchronization issues would follow. The GIL prevents this by letting only one thread run at a time.

Therefore, even when multi-threading is used, the GIL prevents CPU-bound Python code from fully exploiting multiple processors. Threads still help with I/O-bound work, because the GIL is released while a thread waits on I/O.

What is multi-processing?

Multi-processing splits a task into multiple sub-processes that can run on separate processors, each with its own memory space and independent of the others.

Since each process has its own Python interpreter and memory, and therefore its own GIL, multi-processing is not limited by the GIL and can significantly boost performance for CPU-bound work.

Multi-processing

How to implement multi-processing in Python?

Multi-processing can be implemented by using the built-in multiprocessing module, which contains the methods for creating and working with processes. However, an easier way to start a group of processes is by using the ProcessPoolExecutor, which is part of the built-in concurrent.futures library.

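from concurrent.futures import ProcessPoolExecutor

# import the dataset
# define the process_images() function at module level so worker processes can import it

# The guard is required on platforms that start workers by spawning a new
# interpreter (Windows, macOS); without it each worker would re-run this code.
if __name__ == "__main__":
    with ProcessPoolExecutor() as executor:
        futures = []
        for image in dataset:
            futures.append(executor.submit(process_images, image))

    processed_images = [] # To store the processed images
    for future in futures:
        processed_images.append(future.result())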

Speeding up image processing using multiprocessing in Python was originally published in featurepreneur on Medium, where people are continuing the conversation by highlighting and responding to this story.

OpenCV — Basic image processing in Python

Santhosh Kannan — Tue, 17 Jan 2023 14:30:47 GMT

What is OpenCV?

Open Source Computer Vision (OpenCV) is one of the most widely used libraries for computer vision and image processing tasks. It includes various algorithms for image and video analysis, such as object detection, motion analysis, image processing, and feature detection.

How to install the OpenCV library?

OpenCV can be easily installed with the pip package manager as follows:

pip install opencv-python

The installation can be verified by running the following in Python:

import cv2

print(cv2.__version__)

How to read an image?

An image can be represented as a multidimensional array of pixel information, where each element is a numeric value representing the color of the pixel.

img = cv2.imread("cat.jpg")
print(img)

The output of the above code is an array of numeric values. If you print the variable type of img, it will be a numpy.ndarray. Hence the image can be manipulated by numpy methods too.
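To make the array structure concrete without needing an image file, here is a small sketch that builds a synthetic image with NumPy. The 2x3 size and pixel values are made up for illustration. One related caveat: cv2.imread() returns None instead of raising an error when the file cannot be read, so it is worth checking the result before using it.

```python
import numpy as np

# A synthetic 2x3 image with 3 colour channels, standing in for cv2.imread().
# Values are 8-bit unsigned integers in the range 0-255.
img = np.zeros((2, 3, 3), dtype=np.uint8)

# Set the top-left pixel to pure blue; OpenCV stores channels as B, G, R.
img[0, 0] = [255, 0, 0]

print(type(img).__name__)  # ndarray
print(img.shape)           # (2, 3, 3) -> (height, width, channels)
print(img[0, 0].tolist())  # [255, 0, 0]
```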

How to display an image?

An image that has been read by OpenCV can be displayed using the Matplotlib library.

import matplotlib.pyplot as plt
fig = plt.imshow(img)
fig.axes.get_xaxis().set_visible(False)
fig.axes.get_yaxis().set_visible(False)
plt.show()

This will display the image, but its colors will be off. This is because OpenCV reads the image in BGR channel order, whereas Matplotlib expects RGB. To convert it into RGB format:

img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
fig = plt.imshow(img)
fig.axes.get_xaxis().set_visible(False)
fig.axes.get_yaxis().set_visible(False)
plt.show()
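Since the image is a NumPy array, the same channel swap can also be sketched without OpenCV by reversing the last axis. The one-pixel array below is a made-up example, and this is only an illustration of what the conversion does to the channel order.

```python
import numpy as np

# One-pixel "image" in BGR order: blue=10, green=20, red=30.
bgr = np.array([[[10, 20, 30]]], dtype=np.uint8)

# Reversing the last axis turns B, G, R into R, G, B.
rgb = bgr[..., ::-1]

print(rgb[0, 0].tolist())  # [30, 20, 10]
```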

How to get the dimensions of the image?

The height, width and color channels of an image can be obtained by

height, width, channels = img.shape
print((height, width, channels)) # (2500, 2392, 3)

How to resize an image?

An image can be resized by using the resize function of OpenCV by providing the new dimensions

resized_img = cv2.resize(img,(500,500),interpolation=cv2.INTER_AREA)
print(resized_img.shape) # (500, 500, 3)

Images can also be resized by providing a scaling factor instead of the new dimensions.

resized_img = cv2.resize(img, dsize = None,fx=0.5, fy=0.5, interpolation=cv2.INTER_AREA)
print(resized_img.shape) # (1250, 1196, 3)

Here, fx and fy are the scaling factors by which the image is resized along the width and height respectively. Therefore, in this example, the image is resized to half its original width and height.
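One detail that is easy to trip over: img.shape is ordered (height, width, channels), while the size passed to resize is ordered (width, height). The hypothetical helper below only does the arithmetic for the scaled size and assumes the result is rounded to the nearest integer; it does not call OpenCV.

```python
def scaled_size(shape, fx, fy):
    """Return the (width, height) that scaling by fx and fy would produce."""
    height, width = shape[:2]
    return round(width * fx), round(height * fy)


# img.shape is (height, width, channels) but the result is (width, height).
print(scaled_size((2500, 2392, 3), fx=0.5, fy=0.5))  # (1196, 1250)
```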

How to crop an image?

Since an image is represented as a numpy array, the image can be easily cropped by using array slicing.

x, y = 600, 100
h, w = 1500, 1300
cropped_img = img[y:y+h, x:x+w]
fig = plt.imshow(cropped_img)
fig.axes.get_xaxis().set_visible(False)
fig.axes.get_yaxis().set_visible(False)
plt.show()

Here, (x, y) is the coordinate of the top-left pixel of the cropped image, and h and w are the height and width of the cropped image respectively.
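The slicing can be tried on a synthetic array of the same size, as in the sketch below (the blank image is a stand-in for the cat photo). It also shows that a NumPy slice is a view of the original data, so a copy is needed when the crop should be independent.

```python
import numpy as np

# Synthetic image with the same dimensions as the example above.
img = np.zeros((2500, 2392, 3), dtype=np.uint8)

x, y = 600, 100    # top-left corner of the crop
h, w = 1500, 1300  # crop height and width

# Rows (y) come first, columns (x) second.
cropped_img = img[y:y + h, x:x + w]
print(cropped_img.shape)  # (1500, 1300, 3)

# Slicing returns a view: writing to it also changes the original image.
cropped_img[0, 0] = 255
print(img[y, x].tolist())  # [255, 255, 255]

# Take a copy when the crop must be independent of the original.
independent_crop = img[y:y + h, x:x + w].copy()
independent_crop[0, 0] = 0
print(img[y, x].tolist())  # [255, 255, 255]
```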

How to convert a color image to grayscale?

A color image has 3 color channels but a grayscale image has only 1 color channel.

img = cv2.imread("cat.jpg")
print("color image dimensions: ",img.shape)  # (2500, 2392, 3)
gray_img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
print("grayscale image dimensions: ",gray_img.shape) # (2500, 2392)

In order to display the grayscale image using imshow(), the cmap parameter has to be set to “gray”.

fig = plt.imshow(gray_img, cmap="gray")
fig.axes.get_xaxis().set_visible(False)
fig.axes.get_yaxis().set_visible(False)
plt.show()
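To see how three channels collapse into one, here is a rough NumPy sketch of the weighted sum that OpenCV documents for grayscale conversion (0.299 R + 0.587 G + 0.114 B). The two sample pixels are made up, and the rounding here is an approximation that may differ slightly from OpenCV’s own output.

```python
import numpy as np

# Two pixels in BGR order: pure blue and pure white.
bgr = np.array([[[255, 0, 0], [255, 255, 255]]], dtype=np.uint8)

# Weights for the B, G and R channels respectively.
weights = np.array([0.114, 0.587, 0.299])

# A weighted sum over the channel axis collapses 3 channels into 1.
gray = np.rint(bgr @ weights).astype(np.uint8)

print(gray.shape)     # (1, 2)
print(gray.tolist())  # [[29, 255]]
```

Blue contributes the least to perceived brightness, which is why the pure blue pixel ends up dark.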

How to save an image?

Images can be saved by using the imwrite() function.

cv2.imwrite("gray_cat.jpg",gray_img)

OpenCV — Basic image processing in Python was originally published in featurepreneur on Medium, where people are continuing the conversation by highlighting and responding to this story.

Activation functions in Neural Networks

Santhosh Kannan — Tue, 03 Jan 2023 13:50:24 GMT

What is an Activation Function?

An activation function decides whether a neuron’s input is important to the neural network’s output prediction or not. Its main function is to transform the summed input of the node into an output value that will be fed to the next layer.

Activation function

Why is an Activation Function needed?

An activation function adds non-linearity to the neural network. Without an activation function, a neuron performs only a linear transformation on the inputs using the weights and bias. The model would then be just a linear regression model and would not be able to solve complex problems.

What are the different activation functions?

Linear activation function

The linear activation function, also called the identity function or no activation, multiplies the weighted sum of inputs by 1. Thus it doesn’t transform the input, and the output is the same as the input.

Linear activation function

Limitations
1. Not possible to use backpropagation effectively, as the derivative of the function is a constant and has no relation to the input.
2. All layers of the neural network will collapse into one. Even if there are 100 layers, the last layer will be a linear function of the first layer, essentially turning the network into just 1 layer.
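The collapse described in the second limitation can be checked with a tiny sketch. The weights and biases below are arbitrary numbers chosen for illustration: two stacked linear neurons always give the same result as a single linear neuron with combined weight and bias.

```python
def linear_layer(x, weight, bias):
    """A single neuron with a linear (identity) activation."""
    return weight * x + bias


def two_layers(x):
    hidden = linear_layer(x, weight=2.0, bias=1.0)
    return linear_layer(hidden, weight=3.0, bias=-4.0)


def one_layer(x):
    # 3 * (2x + 1) - 4 simplifies to 6x - 1.
    return linear_layer(x, weight=6.0, bias=-1.0)


for x in (-2.0, 0.0, 5.0):
    print(x, two_layers(x), one_layer(x))
```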

Binary step function

The binary step function decides whether the output is 0 or 1 based on a threshold value. If the weighted sum is greater than the threshold, it outputs 1; otherwise it outputs 0.

Binary step function

Limitations
1. Cannot be used for multi-class classification problems
2. Hinders the backpropagation process, as the gradient is zero everywhere except at the threshold, where it is undefined
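A minimal sketch of the binary step function, assuming a threshold of 0 by default:

```python
def binary_step(x, threshold=0.0):
    """Return 1 when x is greater than the threshold, otherwise 0."""
    return 1 if x > threshold else 0


print([binary_step(x) for x in (-2.0, 0.0, 0.5, 3.0)])  # [0, 0, 1, 1]
```

The output is flat on both sides of the threshold, so a small change in the input almost never changes the output, which is why gradient-based training gets no signal from it.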

Logistic Activation function

The Logistic Activation function, also known as the sigmoid function, transforms the input into a value between 0 and 1. The larger the input (more positive), the closer the output will be to 1, and the smaller the input (more negative), the closer the output will be to 0.

Logistic Activation function

Advantages
1. Commonly used in models where probability is the output.
2. Prevents jumps in output values since the function has a smooth gradient.

Limitations
1. Suffers from the Vanishing Gradient Problem: for inputs of large magnitude the gradient is close to zero, so the network is unable to backpropagate useful information
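The vanishing gradient can be seen numerically with a short sketch that uses the standard definition of the logistic function, 1 / (1 + e^(-x)), and its derivative. The sample inputs are arbitrary.

```python
import math


def logistic(x):
    """Logistic (sigmoid) function: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def logistic_gradient(x):
    """Derivative of the logistic function: s * (1 - s), where s = logistic(x)."""
    s = logistic(x)
    return s * (1.0 - s)


for x in (0.0, 2.0, 5.0, 10.0):
    print(f"x={x:>4}: output={logistic(x):.5f}, gradient={logistic_gradient(x):.5f}")
```

The gradient peaks at 0.25 for an input of 0 and shrinks towards zero as the input grows, so very little error signal passes back through a saturated neuron.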

Tanh Function

The Tanh function, or hyperbolic tangent function, is similar to the logistic function, the main difference being that the output of the tanh function is between -1 and 1. The larger the input (more positive), the closer the output will be to 1, and the smaller the input (more negative), the closer the output will be to -1.

Tanh activation function

Advantages
1. Output can be mapped as strongly negative, neutral and strongly positive
2. Has a maximum gradient 4 times greater than that of the logistic function (1 versus 0.25, both at input 0), giving rise to bigger learning steps when training.
3. Symmetric around 0, leading to faster convergence.

Limitations
1. Suffers from the Vanishing Gradient Problem: for inputs of large magnitude the gradient is close to zero, so the network is unable to backpropagate useful information
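The gradient comparison with the logistic function can be verified with a small sketch using the standard derivatives of both functions. The inputs are arbitrary.

```python
import math


def tanh_gradient(x):
    """Derivative of tanh: 1 - tanh(x)^2."""
    return 1.0 - math.tanh(x) ** 2


def logistic_gradient(x):
    """Derivative of the logistic function: s * (1 - s)."""
    s = 1.0 / (1.0 + math.exp(-x))
    return s * (1.0 - s)


print(tanh_gradient(0.0))      # 1.0
print(logistic_gradient(0.0))  # 0.25

# Like the logistic function, tanh saturates for large inputs.
print(tanh_gradient(5.0))
```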

Rectified Linear Unit (ReLU)

Rectified Linear Unit transforms the input into 0 if it is negative and returns the input itself if it is positive. Although it looks like a linear function, ReLU is non-linear (it is only piecewise linear), has a usable derivative, and allows complex relationships in the data to be learned.

Rectified Linear Unit

Advantages
1. Doesn’t activate neurons with negative inputs, thus being computationally efficient.
2. Tends to show better convergence
3. Faster to compute than some other activation functions like logistic function

Limitations
1. Activations can grow very large, since there is no upper bound on the output when the input is positive.
2. Dying ReLU problem: if a neuron’s input stays below zero, it outputs zero and its gradient is zero, so its weights and biases are no longer updated. When this happens to many neurons in a layer, a large part of the network becomes dead.
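A minimal sketch of ReLU and its gradient, with arbitrary sample inputs. The gradient at exactly 0 is undefined; returning 0 there is a convention assumed in this example.

```python
def relu(x):
    """Return x for positive inputs and 0 otherwise."""
    return max(0.0, x)


def relu_gradient(x):
    """Gradient is 1 for positive inputs and 0 for negative inputs.

    The derivative is undefined at exactly 0; this sketch returns 0 there.
    """
    return 1.0 if x > 0 else 0.0


inputs = [-3.0, -0.5, 0.0, 2.0]
print([relu(x) for x in inputs])           # [0.0, 0.0, 0.0, 2.0]
print([relu_gradient(x) for x in inputs])  # [0.0, 0.0, 0.0, 1.0]
```

The zero gradient for negative inputs is what makes a neuron "die": once its input is always negative, no update can reach its weights.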

Leaky ReLU function

Leaky ReLU is an improved version of the ReLU function that addresses the Dying ReLU problem. Instead of transforming negative values into 0 like ReLU, Leaky ReLU multiplies them by a small, non-zero constant parameter a (normally 0.01).

Leaky ReLU activation function

Advantages
1. Prevention of Dying ReLU problem by allowing a small gradient for negative inputs
2. Faster to compute than some other activation functions like logistic function

Limitations
1. Sensitive to the parameter a: a value that is too small may result in slow convergence, while a value that is too large may result in unstable behaviour
2. Prediction for negative input values may not be consistent
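A minimal sketch of Leaky ReLU, assuming the usual default of a = 0.01. Unlike ReLU, the gradient for negative inputs is small but not zero.

```python
def leaky_relu(x, a=0.01):
    """Return x for positive inputs and a * x otherwise."""
    return x if x > 0 else a * x


def leaky_relu_gradient(x, a=0.01):
    """Gradient is 1 for positive inputs and a otherwise."""
    return 1.0 if x > 0 else a


print(leaky_relu(4.0))            # 4.0
print(leaky_relu(-2.0))           # -0.02
print(leaky_relu_gradient(-2.0))  # 0.01
```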

How to choose the right activation function?

Choosing the right activation function is an important decision in the design of a neural network, as it can significantly impact the network’s performance. Some general guidelines for choosing an activation function are

The characteristics of the data and the requirements of the task: The logistic function may be more suitable for tasks that involve binary classification, while the ReLU function may be more suitable for tasks that involve large, positive input values.
The computational complexity: Activation functions with higher computational complexity may require more time and resources to compute, which can impact the overall performance of the network.
The type of layer: ReLU activation function is mostly used in the hidden layers whereas Logistic and Tanh functions are mostly used in output layers.
Trial and error: It is often a good idea to try out different activation functions and compare their performance on the specific task at hand. This can help to identify the best activation function for the task.

Activation functions in Neural Networks was originally published in featurepreneur on Medium, where people are continuing the conversation by highlighting and responding to this story.

Metrics to evaluating classification models

Santhosh Kannan — Sat, 03 Dec 2022 12:08:09 GMT

Photo by Afif Ramdhasuma on Unsplash

Classification is a type of supervised machine learning technique used to predict the class that data points belong to.

One of the most important steps in any machine learning workflow is the evaluation of the trained model. At this step, the trained model is used to make predictions on unseen (not used in training) labelled data. The model is judged based on how many predictions it got right.

But how many predictions a model got right will not always be a good metric to judge the model’s performance. We have to take into account how many predictions were wrong and how they were wrong: whether a positive class was predicted as negative or vice versa.

For example, if we are predicting whether a tumour is cancerous or not, we would rather have the model incorrectly predict a benign tumour as cancerous than have it miss a cancerous one. On the other hand, if we are predicting whether an email is spam or not, we would rather have the model miss a spam mail than mark an important mail as spam.

So, we need to use different metrics in order to evaluate the model based on the problem at hand and optimise the trade-off between different outcomes.

Let’s look at various classification metrics and how they can be used to evaluate the model on different cases.

Accuracy

The accuracy of a model is simply the number of correct predictions divided by the total number of predictions.

Accuracy. Image by author

Accuracy will be a value between 0 and 1 and a value of 1 indicates that all the predictions made by the model are correct.

Accuracy can often be misleading, especially on imbalanced datasets, where one class has far more entries than the other. For example, if only 1% of our tumour dataset is cancerous, then a model can predict all the data as benign and get an accuracy of 99%. This model is useless and highly dangerous.
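The tumour example can be reproduced in a few lines. The dataset below is synthetic (1000 made-up labels with 1% positives) and the "model" simply predicts benign every time.

```python
# 1000 tumours, 1% cancerous (label 1), the rest benign (label 0).
actual = [1] * 10 + [0] * 990

# A "model" that always predicts benign.
predicted = [0] * 1000

correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(accuracy)  # 0.99

# Yet not a single cancerous tumour was detected.
detected = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
print(detected)  # 0
```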

Confusion Matrix

A confusion matrix is a tabular summary of the number of correct and incorrect predictions made by a model. Confusion matrices are widely used because they give a better understanding of the model’s performance than accuracy does.

Confusion matrix. Image by author

True Positive: Predicted value is positive and the actual value is also positive
False Positive: Predicted value is positive but the actual value is negative
False Negative: Predicted value is negative but the actual value is positive
True Negative: Predicted value is negative and the actual value is also negative

Confusion Matrix is extremely useful as it can be used to calculate other classification metrics like precision, recall, F1-score, etc.
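As an illustration, the four cells can be counted directly from two lists of labels. The function name and the sample labels below are made up for this sketch, which assumes a binary problem with 1 as the positive class.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count TP, FP, FN and TN for a binary classification problem."""
    tp = fp = fn = tn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
print(confusion_counts(actual, predicted))  # (3, 1, 1, 3)
```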

Precision

Precision is used to evaluate how good a model is at identifying the positive class. In simpler terms, out of all predictions for the positive class, how many were actually right?

Precision. Image by author

Precision can be used to optimise the model to reduce the number of false positives. Thus, this metric suits cases like the email spam detection example.

Recall

Recall or Sensitivity measures how good a model is at correctly predicting all positive observations in the dataset. In simpler terms, out of all the actual positive class, how many were correctly identified as positive?

Recall. Image by author

Recall can be used to optimise the model to reduce the number of false negatives. Thus, this metric suits cases like the cancerous tumour prediction example.
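Both metrics follow directly from the confusion matrix counts: precision is TP / (TP + FP) and recall is TP / (TP + FN). The sketch below uses hypothetical counts, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def precision(tp, fp):
    """Fraction of positive predictions that were actually positive."""
    predicted_positive = tp + fp
    # Returning 0.0 when nothing was predicted positive is a convention chosen here.
    return tp / predicted_positive if predicted_positive else 0.0


def recall(tp, fn):
    """Fraction of actual positives that were predicted positive."""
    actual_positive = tp + fn
    # Same convention when there are no actual positives.
    return tp / actual_positive if actual_positive else 0.0


# Hypothetical counts: 30 true positives, 10 false positives, 20 false negatives.
print(precision(tp=30, fp=10))  # 0.75
print(recall(tp=30, fn=20))     # 0.6
```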

Usually, precision and recall are used together by plotting a precision-recall graph to visualise the trade-off between them.

F1-Score

F1-Score combines the information provided by precision and recall into a single value. It is the harmonic mean of precision and recall.

F1-Score. Image by author

F1-Score is a value between 0 and 1. A high F1-Score indicates both high precision and high recall.

F1-score is used when there is an imbalanced dataset. It is also used to compare performance of different machine learning algorithms.
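The harmonic mean can be sketched as 2PR / (P + R), where P is precision and R is recall. The input values below are made up, and returning 0.0 when both are zero is a convention chosen for this example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall: 2PR / (P + R)."""
    if precision + recall == 0:
        return 0.0  # convention chosen here to avoid dividing by zero
    return 2 * precision * recall / (precision + recall)


# Similar precision and recall give a score close to both.
print(round(f1_score(0.75, 0.6), 3))  # 0.667

# One weak component drags the score far below the arithmetic mean of 0.55.
print(round(f1_score(1.0, 0.1), 3))   # 0.182
```

This is why the harmonic mean is used: a model cannot get a high F1-Score by doing well on only one of the two metrics.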

Cohen’s Kappa statistics

Cohen’s kappa statistic compares the predictions made by the model to random guessing based on the frequency of each class.

Cohen’s Kappa Statistics. Image by author

Kappa’s value is at most 1 (perfect agreement) and can also be negative (worse than random guessing). Although there is no standardised way to interpret its value, Landis and Koch provided a way to characterise the value.

Landis and Koch kappa interpretation. Image by author

Kappa value is used when there is an imbalanced dataset and in multi-class classification problems.
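Here is a rough sketch of the calculation using the standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement (the accuracy) and p_e is the agreement expected by chance from the class frequencies. The function name and sample labels are made up for illustration.

```python
from collections import Counter


def cohen_kappa(actual, predicted):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(actual)
    # Observed agreement: fraction of predictions that match the labels.
    observed = sum(a == p for a, p in zip(actual, predicted)) / n

    # Chance agreement: for each class, the probability that the label and
    # the prediction both fall in that class if they were independent.
    actual_counts = Counter(actual)
    predicted_counts = Counter(predicted)
    expected = sum(
        actual_counts[label] * predicted_counts[label] for label in actual_counts
    ) / (n * n)

    # Note: undefined (division by zero) when expected agreement is exactly 1.
    return (observed - expected) / (1 - expected)


# Balanced toy example: 75% accuracy against 50% chance agreement.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
print(cohen_kappa(actual, predicted))  # 0.5

# The always-benign model: 99% accuracy but no better than chance.
imbalanced_actual = [1] * 10 + [0] * 990
always_benign = [0] * 1000
print(cohen_kappa(imbalanced_actual, always_benign))  # 0.0
```

The second result shows why kappa is helpful on imbalanced data: the model that scored 99% accuracy earns a kappa of 0.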

Metrics to evaluating classification models was originally published in featurepreneur on Medium, where people are continuing the conversation by highlighting and responding to this story.