Stories by Dr Sujoy K Goswami, hc on Medium

RAG vs. Fine-Tuning in LLM

Dr Sujoy K Goswami, hc — Sun, 10 Nov 2024 15:40:57 GMT

In recent years, as large language models (LLMs) have grown in size and complexity, two prominent techniques — Retrieval-Augmented Generation (RAG) and Fine-Tuning — have emerged to improve their relevance, accuracy, and applicability across diverse fields. These methods address key limitations in LLMs: RAG enables real-time data retrieval, providing contextually accurate information from external knowledge bases, while fine-tuning specializes LLMs for specific tasks or domains, resulting in responses that align with specialized terminology and task requirements.

These methods not only improve the models’ utility and domain expertise but also extend their lifespans and reduce the need for frequent, costly retraining, giving them significant advantages in production environments.

Retrieval-Augmented Generation (RAG)

Definition and Purpose:
Retrieval-Augmented Generation (RAG) is a technique that combines information retrieval with text generation, enhancing language models by giving them access to external knowledge bases, documents, or databases. This approach helps the model provide more accurate, up-to-date, and contextually relevant information. In RAG, an initial retrieval step is used to pull relevant documents or snippets from a knowledge base, and then a generative model (such as GPT) synthesizes the retrieved information into a coherent answer.

How it Works:

Retrieval Step: The RAG model uses a retriever (often based on models like BERT or specialized retrieval models) to identify the top-k most relevant documents or pieces of information in response to a user’s query.
Generation Step: A generative model then uses this retrieved information to generate a response, often improving the factual accuracy and relevance of the generated output.
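
The two steps above can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the documents, the bag-of-words cosine scoring, and the helper names (`retrieve`, `build_prompt`) are all hypothetical, and a real system would typically use a learned retriever and pass the prompt to an actual LLM.

```python
import math
import re
from collections import Counter

# Hypothetical in-memory knowledge base (illustrative sentences only)
DOCUMENTS = [
    "The Eiffel Tower in Paris was completed in 1889 for the World's Fair.",
    "The Louvre in Paris is a large art museum.",
    "Fine-tuning adjusts model weights on domain data.",
]

def tokenize(text):
    # lowercase word tokens
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    # cosine similarity between two bag-of-words Counters
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, documents, k=2):
    # Retrieval step: rank documents by similarity to the query, keep the top-k
    q = Counter(tokenize(query))
    ranked = sorted(documents,
                    key=lambda d: cosine(q, Counter(tokenize(d))),
                    reverse=True)
    return ranked[:k]

def build_prompt(query, passages):
    # Generation step (input side): the retrieved passages become the context
    # that a generative model would be asked to answer from
    context = "\n".join("- " + p for p in passages)
    return ("Answer using only the context below.\n"
            + context + "\nQuestion: " + query)

query = "When was the Eiffel Tower completed?"
top_passages = retrieve(query, DOCUMENTS, k=2)
prompt = build_prompt(query, top_passages)
print(prompt)
```

The generation call itself is deliberately left out; `prompt` is what would be sent to whichever generative model is in use.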

Example:
Imagine you’re asking about the history of the Eiffel Tower. A RAG model would first retrieve relevant passages from a knowledge database or documents on Paris landmarks. Then, it generates an answer combining these details, which could result in a more accurate and informative response than a standalone LLM (which might be limited by the static training data available to it).

Fine-Tuning

Definition and Purpose:
Fine-tuning is the process of training a pre-existing large language model (LLM) on additional, domain-specific data to improve its performance for a specific application or task. By exposing the model to custom data, fine-tuning enables it to learn the language, style, terminology, and nuances of the target domain. Fine-tuning can be either supervised, where the model learns from labeled data, or unsupervised, using relevant but unlabeled text.

How it Works:

Data Preparation: Curate a dataset specific to the target task or domain (e.g., medical records for a healthcare model).
Training: The model is trained on this dataset, adjusting its parameters to better align with the new, specific data.
Evaluation and Tuning: The model is evaluated, and additional adjustments are made as needed to improve its accuracy and alignment with the target use case.
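
The three steps above can be illustrated with a deliberately tiny stand-in. In this sketch a linear model plays the role of the LLM, and the "general" and "domain" datasets are synthetic; all names and numbers are assumptions made for illustration, not part of any real fine-tuning pipeline.

```python
import numpy as np

def mse(w, X, y):
    # mean squared error of a linear model with weights w
    return float(np.mean((X @ w - y) ** 2))

def train(w, X, y, lr, steps):
    # plain gradient descent on the mean squared error
    w = w.copy()
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

rng = np.random.default_rng(0)

# "Pre-training": a large, general dataset
X_general = rng.normal(size=(200, 3))
y_general = X_general @ np.array([1.0, -2.0, 0.5])

# Data preparation: a small dataset from the target domain,
# whose underlying relationship differs slightly
X_domain = rng.normal(size=(20, 3))
y_domain = X_domain @ np.array([1.0, -2.0, 2.0])

w_pretrained = train(np.zeros(3), X_general, y_general, lr=0.1, steps=200)

# Evaluation before fine-tuning
loss_before = mse(w_pretrained, X_domain, y_domain)

# Training: continue from the pre-trained weights with a small learning rate
w_finetuned = train(w_pretrained, X_domain, y_domain, lr=0.01, steps=50)

# Evaluation after fine-tuning
loss_after = mse(w_finetuned, X_domain, y_domain)
print(loss_before, loss_after)
```

The point of the sketch is the workflow: start from existing parameters, train briefly on domain data, and compare the domain error before and after.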

Example:
Suppose a model is being fine-tuned to assist medical professionals. The base LLM is exposed to medical literature, guidelines, and research data, allowing it to become proficient in understanding and responding with medical terminology and evidence-based information. This fine-tuned model would then deliver responses that are more accurate for medical questions, compared to a general-purpose LLM.

Comparison Between RAG and Fine-Tuning

RAG vs. Fine-Tuning

Summary:
RAG is highly effective for tasks requiring current, context-specific responses and can adapt quickly to new information. Fine-tuning, however, excels when there is a need for deep, domain-specific expertise that a model must consistently demonstrate.

RAG vs. Fine-Tuning in LLM was originally published in Analytics Vidhya on Medium, where people are continuing the conversation by highlighting and responding to this story.

MediaPipe with Python for Dummies

Dr Sujoy K Goswami, hc — Mon, 22 Aug 2022 12:20:46 GMT

MediaPipe is a project by Google that offers “open-source, cross-platform, customizable ML solutions for live and streaming media”. In other words, MediaPipe provides access to a wide variety of powerful Machine Learning models built with the hardware limitations of mobile devices in mind.

MediaPipe is available for C++, Android, and more, but in this tutorial we will work only with Python. For the basic ideas, see reference [1]. Here we present a few examples with simple code. Note that we used MediaPipe version 0.8.3.

Example-1: 3D Face Mesh

Here we capture the 3D face mesh and redraw it on a blank canvas to get an output like the one below:

#Code with comments
import cv2 as cv
import mediapipe as mp
import numpy as np

mpfacemesh = mp.solutions.face_mesh
FaceMesh = mpfacemesh.FaceMesh(max_num_faces=1)
mpdraw = mp.solutions.drawing_utils
drawspec1 = mpdraw.DrawingSpec(color = (255,255,0), circle_radius = 0, thickness = 1)
drawspec2 = mpdraw.DrawingSpec(color = (0,255,0), circle_radius = 0, thickness = 1)
webcam = cv.VideoCapture(0)

while True:
  
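 scc,img = webcam.read()
 if not scc: # stop if no frame could be read from the webcam
  break
 img = cv.flip(img,1)
 h,w,c = img.shape
 blank_img = np.zeros((h,w,c), np.uint8)
 # MediaPipe expects RGB input, whereas OpenCV delivers BGR frames
 results = FaceMesh.process(cv.cvtColor(img, cv.COLOR_BGR2RGB))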
 
 if results.multi_face_landmarks:
  for face_lm in results.multi_face_landmarks:
   img = blank_img
   mpdraw.draw_landmarks(img,face_lm,
         mpfacemesh.FACE_CONNECTIONS,
         drawspec1,drawspec2)
 k = cv.waitKey(1)
 if k == ord('q'):
  break
 cv.imshow('face mesh 3d', img)

webcam.release()  
cv.destroyAllWindows()

Example-2: Simple Augmented Reality

Here we first detect the eyes and eyebrows, then draw virtual spectacles (2D) to get an output like the one below:

#Code with comments
import cv2 as cv
import mediapipe as mp
import numpy as np

mpfacemesh = mp.solutions.face_mesh
FaceMesh = mpfacemesh.FaceMesh(max_num_faces=1)
mpdraw = mp.solutions.drawing_utils
drawspec1 = mpdraw.DrawingSpec(color = (255,255,0), circle_radius = 0, thickness = 1)
drawspec2 = mpdraw.DrawingSpec(color = (0,255,0), circle_radius = 0, thickness = 1)
webcam = cv.VideoCapture(0)

#following indices are available in mediapipe dev site
EYE_LEFT_CONTOUR = [
    249, 263, 362, 373, 374,
    380, 381, 382, 384, 385,
    386, 387, 388, 390, 398, 466]
EYE_RIGHT_CONTOUR = [
    7, 33, 133, 144, 145,
    153, 154, 155, 157, 158,
    159, 160, 161, 163, 173, 246]
LEFT_EYEBROW = [
    276, 282, 283, 285, 293, 295, 296, 300, 334, 336]
 
RIGHT_EYEBROW = [
    46, 52, 53, 55, 63, 65, 66, 70, 105, 107]

while True:
  
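 scc,img = webcam.read()
 if not scc: # stop if no frame could be read from the webcam
  break
 img = cv.flip(img,1)
 h,w,c = img.shape
 # MediaPipe expects RGB input, whereas OpenCV delivers BGR frames
 results = FaceMesh.process(cv.cvtColor(img, cv.COLOR_BGR2RGB))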
 
 if results.multi_face_landmarks:
  for face_lm in results.multi_face_landmarks:
   X=[]
   Y=[]
   for lm in face_lm.landmark:
    X.append(int(lm.x*w))
    Y.append(int(lm.y*h))
   #left eye center
   xl = int(np.mean([X[i] for i in EYE_LEFT_CONTOUR]))
   yl = int(np.mean([Y[i] for i in EYE_LEFT_CONTOUR]))
   cv.circle(img,(xl,yl),9,(255,0,255),7)
   #right eye center
   xr = int(np.mean([X[i] for i in EYE_RIGHT_CONTOUR]))
   yr = int(np.mean([Y[i] for i in EYE_RIGHT_CONTOUR]))
   cv.circle(img,(xr,yr),9,(255,0,255),7)
   cv.line(img,(xl,yl),(xr,yr),(255,0,255),3)
   #eyebrows
   xlb = int(np.mean([X[i] for i in LEFT_EYEBROW]))
   ylb = int(np.mean([Y[i] for i in LEFT_EYEBROW]))
   xrb = int(np.mean([X[i] for i in RIGHT_EYEBROW]))
   yrb = int(np.mean([Y[i] for i in RIGHT_EYEBROW]))
   #final drawing
   cv.putText(img,'*',(xl-9,yl+9),cv.FONT_HERSHEY_SIMPLEX,1,(0,255,0),3)
   cv.putText(img,'*',(xr-9,yr+9),cv.FONT_HERSHEY_SIMPLEX,1,(0,255,0),3)
   cv.putText(img,'^',(xlb-9,ylb),cv.FONT_HERSHEY_SIMPLEX,1,(0,255,0),3)
   cv.putText(img,'^',(xrb-9,yrb),cv.FONT_HERSHEY_SIMPLEX,1,(0,255,0),3)
   
 k = cv.waitKey(1)
 if k == ord('q'):
  break
 cv.imshow('augmented reality', img)

webcam.release()  
cv.destroyAllWindows()
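
The centre computation used in Example-2 (scale the normalized landmarks to pixels, then average over a set of indices) can be isolated into a small helper. The sketch below is an assumption-laden illustration: `Landmark` is a hypothetical stand-in for a MediaPipe landmark (only its `x` and `y` fields are used), and `region_center` is a name introduced here, not a MediaPipe function.

```python
from collections import namedtuple
import numpy as np

# Hypothetical stand-in for a MediaPipe landmark with normalized x, y in [0, 1]
Landmark = namedtuple("Landmark", ["x", "y"])

def region_center(landmarks, indices, width, height):
    # scale the selected normalized landmarks to pixel coordinates
    xs = [int(landmarks[i].x * width) for i in indices]
    ys = [int(landmarks[i].y * height) for i in indices]
    # the centre of the region is the mean of its points
    return int(np.mean(xs)), int(np.mean(ys))

landmarks = [Landmark(0.25, 0.5), Landmark(0.75, 0.5), Landmark(0.5, 0.25)]
center = region_center(landmarks, [0, 1], width=640, height=480)
print(center)
```

With real data, `landmarks` would be `face_lm.landmark` and `indices` one of the lists such as `EYE_LEFT_CONTOUR`.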

Note that Example-1 gives 3D output, while Example-2 gives 2D output. If you like the post, please do clap. Stay connected for more posts on Vision. Thanks.

References:

[1] https://google.github.io/mediapipe/


MediaPipe with Python for Dummies was originally published in Analytics Vidhya on Medium, where people are continuing the conversation by highlighting and responding to this story.

SUJOY Filter: A Generic First- Derivative Filter For Image Edge Detection

Dr Sujoy K Goswami, hc — Sun, 12 Jun 2022 16:11:23 GMT

The SUJOY filter gives a better first-derivative approach to image edge detection than the other commonly used first-derivative methods (such as the Roberts, Prewitt, and Sobel operators).

The most general masks for SUJOY filter to detect image edges are given below:

horizontal & vertical masks of SUJOY filter

Full paper can be found here. Open-source code is available here.

[Android App for Sujoy Filter]

Note also that the SUJOY filter is generic and robust because it can use averages, medians, or weighted averages of the neighbors around pixels (r-1,c) and (r+1,c) for the horizontal mask, and similarly around pixels (r,c-1) and (r,c+1) for the vertical mask (see figure below).

(r,c) — candidate pixel

Note: [1] SUJOY filter has been accepted by a few open-source communities. [2] Please cite the publication as given here. [3] I am seeking developers proficient in any programming language to help contribute this algorithm to various other open-source platforms.

SUJOY Filter: A Generic First- Derivative Filter For Image Edge Detection was originally published in Analytics Vidhya on Medium, where people are continuing the conversation by highlighting and responding to this story.

Multilevel thresholding for image segmentation

Dr Sujoy K Goswami, hc — Tue, 07 Sep 2021 11:52:48 GMT

Thresholding techniques can be divided into bi-level and multi-level categories, depending on the number of image segments. In bi-level thresholding, the image is segmented into two regions: pixels with gray values greater than a certain value T are classified as object pixels, and the remaining pixels are classified as background pixels.
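
As a minimal sketch of bi-level thresholding, assuming a grayscale image held in a NumPy array and an illustrative threshold of 127 (the function name `bilevel_threshold` is introduced here for illustration):

```python
import numpy as np

def bilevel_threshold(gray, T):
    # object pixels (gray > T) become 255, background pixels become 0
    gray = np.asarray(gray)
    return np.where(gray > T, 255, 0).astype(np.uint8)

gray = np.array([[10, 200],
                 [128, 90]], dtype=np.uint8)
mask = bilevel_threshold(gray, T=127)
print(mask)
```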

Multilevel thresholding is a process that segments a gray level image into several distinct regions. This technique determines more than one threshold for the given image and segments the image into certain brightness regions, which correspond to one background and several objects. The method works very well for objects with colored or complex backgrounds, on which bi-level thresholding fails to produce satisfactory results.

The full paper can be found here. The authors use the mean and the variance of the image to find optimum thresholds for segmenting the image into multiple levels. The algorithm is applied recursively on sub-ranges computed from the previous step, so as to find a threshold and a new sub-range for the next step.

The Python (>3.0) code for the above approach for n Thresholds is given below:

import cv2
import numpy as np
import math

img = cv2.imread('path-to-image')
img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
a = 0
b = 255
n = 6 # number of thresholds (better choose even value)
k = 0.7 # free variable to take any positive value
T = [] # list which will contain 'n' thresholds

def multiThresh(img, a, b):
    if a>b:
        s=-1
        m=-1
        return m,s

    img = np.array(img)
    t1 = (img>=a)
    t2 = (img<=b)
    X = np.multiply(t1,t2)
    Y = np.multiply(img,X)
    s = np.sum(X)
    m = np.sum(Y)/s
    return m,s

for i in range(int(n/2-1)):
    img = np.array(img)
    t1 = (img>=a)
    t2 = (img<=b)
    X = np.multiply(t1,t2)
    Y = np.multiply(img,X)
    mu = np.sum(Y)/np.sum(X)

    Z = Y - mu
    Z = np.multiply(Z,X)
    W = np.multiply(Z,Z)
    sigma = math.sqrt(np.sum(W)/np.sum(X))

    T1 = mu - k*sigma
    T2 = mu + k*sigma

    x, y = multiThresh(img, a, T1)
    w, z = multiThresh(img, T2, b)

    T.append(x)
    T.append(w)

    a = T1+1
    b = T2-1
    k = k*(i+1)

T1 = mu
T2 = mu+1
x, y = multiThresh(img, a, T1)
w, z = multiThresh(img, T2, b)    
T.append(x)
T.append(w)
T.sort()
print(T)
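
Once a list of thresholds such as `T` is available, it still has to be applied to the image to obtain the segments. One possible way, sketched here with `np.digitize` on a tiny made-up image and made-up thresholds (the helper name `apply_thresholds` is an assumption), is:

```python
import numpy as np

def apply_thresholds(gray, thresholds):
    # n sorted thresholds split the gray range into n+1 regions;
    # each pixel gets the index (0..n) of the region it falls into
    bins = np.sort(np.asarray(thresholds, dtype=float))
    return np.digitize(gray, bins)

gray = np.array([[5, 60],
                 [130, 250]], dtype=np.uint8)
labels = apply_thresholds(gray, [50, 120, 200])
print(labels)
```

Each label can then be mapped to a display gray level or colour of your choice.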

You can find another approach, Multi-Otsu Thresholding, in the scikit-image library. Several other approaches exist as well. Thank you!

Multilevel thresholding for image segmentation was originally published in Analytics Vidhya on Medium, where people are continuing the conversation by highlighting and responding to this story.

“AI Based COVID Social Distance Monitoring System”: Cost effective & easy deployable approach!

Dr Sujoy K Goswami, hc — Fri, 03 Sep 2021 07:49:00 GMT

The Coronavirus pandemic posed an unprecedented global threat, making physical distancing and mask-wearing essential in curbing the virus’s spread. Last year, I developed and deployed an AI-Based COVID Social Distance Monitoring and Mask Detection System across various facilities of my employer, TVS Motor and its Group of Companies, both in India and abroad. Today, this system continues to function effectively, alerting teams to any safety violations in real-time. I am deeply honored to have received multiple awards and recognition globally for this work, especially given its importance during such a critical time.

Prerequisites: Computer Vision, Pedestrian Detection, Deep Learning, YOLO, OpenCV, COVID Protocol

The base for the Social Distance Monitoring System was taken from Andrew Ng’s Landing-AI. That approach warps the perspective view into a bird’s-eye view to find the actual distances between persons.
However, this is computationally expensive, as every frame needs to be warped. We don’t need a very accurate distance between two persons, right? It should be roughly 6 ft.
Also, the camera calibration in that approach requires the deployment team to be present at the site.
So, to remove these challenges, I tweaked the idea a bit as described below, and with this I deployed remotely at multiple places (India and abroad), i.e. without actually going to the site.
The basic idea is that as a person moves away from the camera, their height on screen appears smaller, and it varies (mostly) linearly with distance. Of course, I am assuming that the world coordinate system and the camera coordinate system have the same axis directions, which is usually the case.

Let, h-> minimum height of the person in screen to monitor (i.e. person with height in screen less than h would not be considered); d-> minimum safe distance in screen (== 6 ft. in real world);
H-> average of heights in screen of the 2 persons detected;
D-> projected safe distance in screen between 2 above persons detected;

Then,

(d/h) == (D/H)

We have the bounding boxes’ coordinates, so H is available. So, if we know the ratio (d/h), we can get D. For most scenarios (I worked with ~15 different types/brands of cameras placed at different locations), I found that (h/d) ~2.5 gives pretty good results (assuming that no kids, who are shorter, are present in the working area). You may fine-tune this value after observing snapshots of a few alerts.
Now, if E is the Euclidean distance between the above 2 persons (we can get it from the bounding boxes’ coordinates), then an alert for a safe-distance violation is raised when E < D.
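
The rule above can be written as a short check. This is a minimal sketch: the function names and the sample pixel values are made up for illustration, and the default ratio d/h = 1/2.5 is simply the (h/d) ~2.5 value quoted above.

```python
import math

def projected_safe_distance(height_a, height_b, d_over_h):
    # H: average on-screen height of the two persons
    H = (height_a + height_b) / 2.0
    # from (d/h) == (D/H):  D = (d/h) * H
    return d_over_h * H

def is_violation(center_a, center_b, height_a, height_b, d_over_h=1 / 2.5):
    # E: Euclidean distance between the two bounding-box centres
    E = math.hypot(center_a[0] - center_b[0], center_a[1] - center_b[1])
    D = projected_safe_distance(height_a, height_b, d_over_h)
    return E < D

# two persons with on-screen heights 100 px and 120 px, so H = 110 and D = 44
close_pair = is_violation((100, 200), (130, 200), 100, 120)  # E = 30
far_pair = is_violation((100, 200), (200, 200), 100, 120)    # E = 100
print(close_pair, far_pair)
```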

I have deployed the solution with this tweak to ~100 cameras successfully; the system is still working fine in all the places, giving alerts for violations. Needless to say, there will be a few false positives (Can you say why? Write in a comment.); but in this system we are worried about false negatives, right? There will be no false negatives as long as the persons get detected.

The Python class “PeopleDetector”, using the OpenCV DNN module, is given below:

import itertools
import cv2
import numpy as np

class PeopleDetector:
    flag = 0
    def __init__(self, mindist, minheight,
                yolocfg='yolo_weights/yolov3.cfg',
                yoloweights='yolo_weights/yolov3.weights',
                labelpath='yolo_weights/coco.names',
                confidence=0.5,
                nmsthreshold=0.5,
                ):
        self._yolocfg = yolocfg
        self._yoloweights = yoloweights
        self._confidence = confidence
        self._nmsthreshold = nmsthreshold
        self._labels = open(labelpath).read().strip().split("\n")
        self._colors = np.random.randint(
            0, 255, size=(len(self._labels), 3), dtype="uint8")
        self._net = None
        self._layer_names = None
        self._boxes = []
        self._confidences = []
        self._classIDs = []
        self._centers = []
        self._layerouts = []
        self._MIN_DIST = mindist
        self._mindistances = {}
        self._heights = []
        self._MIN_HEIGHT = minheight

    def load_network(self):
        self._net = cv2.dnn.readNetFromDarknet(
            self._yolocfg, self._yoloweights)
        self._net.setPreferableBackend(cv2.dnn.DNN_BACKEND_OPENCV)
        self._net.setPreferableTarget(cv2.dnn.DNN_TARGET_CPU)
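        # getUnconnectedOutLayers() returns an Nx1 array in older OpenCV
        # releases and a flat array in newer ones; flatten() handles both
        layer_names = self._net.getLayerNames()
        self._layer_names = [layer_names[i - 1] for i in
                             np.array(self._net.getUnconnectedOutLayers()).flatten()]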
        print("people-detector model loaded successfully\n")

    def predict(self, image):
        blob = cv2.dnn.blobFromImage(image, 1 / 255.0, (416, 416),
                                     [0, 0, 0], 1, crop=False)
        self._net.setInput(blob)
        self._layerouts = self._net.forward(self._layer_names)
        return(self._layerouts)

    def process_preds(self, image, outs, bbox_flag):
        (frameHeight, frameWidth) = image.shape[:2]
        for out in outs:
            for detection in out:
                scores = detection[5:]
                classId = np.argmax(scores)
                if classId != 0:  # filter person class
                    continue
                confidence = scores[classId]
                if confidence > self._confidence:
                    center_x = int(detection[0] * frameWidth)
                    center_y = int(detection[1] * frameHeight)
                    width = int(detection[2] * frameWidth)
                    height = int(detection[3] * frameHeight)
                    left = int(center_x - width / 2.0)
                    top = int(center_y - height / 2.0)
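                    # keep only detections taller than the minimum height
                    # (the source's additional width condition is truncated
                    # and is not reproduced here)
                    if height > self._MIN_HEIGHT:
                        self._classIDs.append(classId)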
                        self._confidences.append(float(confidence))
                        self._boxes.append([left, top, width, height])
                        #self._centers.append((center_x, center_y))
                        #self._heights.append(height)
        indices = cv2.dnn.NMSBoxes(
            self._boxes, self._confidences, self._confidence, self._nmsthreshold)

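        # NMSBoxes returns Nx1 indices in older OpenCV releases and a flat
        # array in newer ones; flatten() handles both
        for i in np.array(indices).flatten():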
            box = self._boxes[i]
            left = box[0]
            top = box[1]
            width = box[2]
            height = box[3]
            center_x = int(left + width/2.0)
            center_y = int(top + height/2.0)
            self._centers.append((center_x, center_y))
            self._heights.append(height)
            self.find_min_distance(self._centers, self._heights)
            if len(self._mindistances)>0: PeopleDetector.flag = 1
            else: PeopleDetector.flag = 0
            if bbox_flag:
                self.draw_pred(image, self._classIDs[i], self._confidences[i], left,
                           top, left + width, top + height)

        return PeopleDetector.flag #self._centers

    def clear_preds(self):
        self._boxes = []
        self._confidences = []
        self._classIDs = []
        self._centers = []
        self._layerouts = []
        self._mindistances = {}
        self._heights = []
        PeopleDetector.flag = 0

    def draw_pred(self, frame, classId, conf, left, top, right, bottom):
        cv2.rectangle(frame, (left, top), (right, bottom), (255, 178, 50), 2)
        for k in self._mindistances:
            cv2.line(frame, k[0], k[1], (0, 0, 255), 3)

    def find_min_distance(self, centers, heights):
        centers = self._centers
        heights = self._heights
        temp = list(itertools.combinations(heights, 2))
        comp = list(itertools.combinations(centers, 2))
        ecdist = []
        avghgt = []
        for pts in comp:
            ecdist.append(np.linalg.norm(np.asarray(pts[0])-np.asarray(pts[1])))
        for hts in temp:
            avghgt.append((hts[0]+hts[1])/2.0)
        for i in range(len(avghgt)):
            rel_dist = self._MIN_DIST*avghgt[i]/self._MIN_HEIGHT
            if ecdist[i] < rel_dist:
                self._mindistances.update({comp[i]: ecdist[i]})

Sample output results can be found here. Please note that the RED lines/ boxes in images/ videos are representing COVID protocol violations.

One can take width of the person instead of height, however, that is not recommended; could you say the reason? Write in comment!

“AI Based COVID Social Distance Monitoring System”: Cost effective & easy deployable approach! was originally published in Analytics Vidhya on Medium, where people are continuing the conversation by highlighting and responding to this story.

Video Classification Based On Action (from scratch without GPU)

Dr Sujoy K Goswami, hc — Thu, 16 Apr 2020 08:56:07 GMT

Video Classification Based On Action (from scratch & without GPU support)

NO GPU!! NO EXTERNAL HEAVY DATA-SET!! Read to learn & implement the basic video classification technique based on temporal action in any machine.

Here I create my own video data, in which a rectangle moves in different directions. The sample code (use Jupyter Notebook) is below:

import numpy as np
import skvideo.io as sk

# creating sample video data
num_vids = 5
num_imgs = 100
img_size = 50
min_object_size = 1
max_object_size = 5
 
for i_vid in range(num_vids):
    imgs = np.zeros((num_imgs, img_size, img_size)) # set background to 0
    vid_name = 'vid' + str(i_vid) + '.mp4'
    w, h = np.random.randint(min_object_size, max_object_size, size=2)
    x = np.random.randint(0, img_size - w)
    y = np.random.randint(0, img_size - h)
    i_img = 0
    while x>0:
        imgs[i_img, y:y+h, x:x+w] = 255 # set rectangle as foreground
        x = x-1
        i_img = i_img+1
    sk.vwrite(vid_name, imgs.astype(np.uint8))

# play a video
from IPython.display import Video
Video("vid3.mp4") # the script & video should be in same folder

Now I create 4 different types of videos, in which a rectangle moves in 4 directions: left, right, up, down. Accordingly, there are 4 classes, which I classify from these video data using Deep Learning. Go through the code below (Python 3.6.9, Keras 2.2.4, in Jupyter Notebook), and be sure to read the comments.

import numpy as np

# preparing dataset
X_train = []
Y_train = []
labels = enumerate(['left', 'right', 'up', 'down']) #4 classes

num_vids = 30
num_imgs = 30
img_size = 20
min_object_size = 1
max_object_size = 5

# video frames with left moving object
for i_vid in range(num_vids):
    imgs = np.zeros((num_imgs, img_size, img_size)) # set background to 0
    w, h = np.random.randint(min_object_size, max_object_size, size=2)
    x = np.random.randint(0, img_size - w)
    y = np.random.randint(0, img_size - h)
    i_img = 0
    while x>0:
        imgs[i_img, y:y+h, x:x+w] = 255 # set rectangle as foreground
        x = x-1
        i_img = i_img+1
    X_train.append(imgs)
for i in range(0,num_vids): # one label per video
    Y_train.append(0)

# video frames with right moving object
for i_vid in range(num_vids):
    imgs = np.zeros((num_imgs, img_size, img_size)) # set background to 0
    w, h = np.random.randint(min_object_size, max_object_size, size=2)
    x = np.random.randint(0, img_size - w)
    y = np.random.randint(0, img_size - h)
    i_img = 0
    while x<img_size: # bound assumed: the comparison was lost in the source
        imgs[i_img, y:y+h, x:x+w] = 255 # set rectangle as foreground
        x = x+1
        i_img = i_img+1
    X_train.append(imgs)
for i in range(0,num_vids): # one label per video
    Y_train.append(1)

# video frames with up moving object
for i_vid in range(num_vids):
    imgs = np.zeros((num_imgs, img_size, img_size)) # set background to 0
    w, h = np.random.randint(min_object_size, max_object_size, size=2)
    x = np.random.randint(0, img_size - w)
    y = np.random.randint(0, img_size - h)
    i_img = 0
    while y>0:
        imgs[i_img, y:y+h, x:x+w] = 255 # set rectangle as foreground
        y = y-1
        i_img = i_img+1
    X_train.append(imgs)
for i in range(0,num_vids): # one label per video
    Y_train.append(2)
 
# video frames with down moving object
for i_vid in range(num_vids):
    imgs = np.zeros((num_imgs, img_size, img_size)) # set background to 0
    w, h = np.random.randint(min_object_size, max_object_size, size=2)
    x = np.random.randint(0, img_size - w)
    y = np.random.randint(0, img_size - h)
    i_img = 0
    while y<img_size: # bound assumed: the comparison was lost in the source
        imgs[i_img, y:y+h, x:x+w] = 255 # set rectangle as foreground
        y = y+1
        i_img = i_img+1
    X_train.append(imgs)
for i in range(0,num_vids): # one label per video
    Y_train.append(3)

# data pre-processing
from keras.utils import np_utils
X_train=np.array(X_train, dtype=np.float32) /255
X_train=X_train.reshape(X_train.shape[0], num_imgs, img_size, img_size, 1)
print(X_train.shape)
Y_train=np.array(Y_train, dtype=np.uint8)
Y_train = Y_train.reshape(X_train.shape[0], 1)
print(Y_train.shape)
Y_train = np_utils.to_categorical(Y_train, 4)

(120, 30, 20, 20, 1)
(120, 1)

# building model
from keras.models import Sequential
from keras.layers import Dense, Conv2D, Flatten, Dropout
from keras.layers.pooling import MaxPooling2D
from keras.layers.recurrent import LSTM
from keras.layers.wrappers import TimeDistributed

model = Sequential()
# TimeDistributed layer is to pass temporal information to the n/w
model.add(TimeDistributed(Conv2D(8, (3, 3), strides=(1, 1), activation='relu', padding='same'), input_shape=(num_imgs, img_size, img_size, 1)))
model.add(TimeDistributed(Conv2D(8, (3,3), kernel_initializer="he_normal", activation='relu')))
model.add(TimeDistributed(MaxPooling2D((1, 1), strides=(1, 1))))
model.add(TimeDistributed(Flatten()))
model.add(Dropout(0.3))
model.add(LSTM(64, return_sequences=False, dropout=0.3))
model.add(Dense(4, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.summary()

# model training
model.fit(X_train, Y_train, epochs=50, verbose=1) # 'epochs' replaces the deprecated 'nb_epoch'

# model testing with new data (4 videos)
X_test=[]
Y_test=[]
# 2nd class: 'right'
for i_vid in range(2):
    imgs = np.zeros((num_imgs, img_size, img_size)) # set background to 0
    w, h = np.random.randint(min_object_size, max_object_size, size=2)
    x = np.random.randint(0, img_size - w)
    y = np.random.randint(0, img_size - h)
    i_img = 0
    while x<img_size: # bound assumed: the comparison was lost in the source
        imgs[i_img, y:y+h, x:x+w] = 255 # set rectangle as foreground
        x = x+1
        i_img = i_img+1
    X_test.append(imgs)

# 4th class: 'down'
for i_vid in range(2):
    imgs = np.zeros((num_imgs, img_size, img_size)) # set background to 0
    w, h = np.random.randint(min_object_size, max_object_size, size=2)
    x = np.random.randint(0, img_size - w)
    y = np.random.randint(0, img_size - h)
    i_img = 0
    while y<img_size: # bound assumed: the comparison was lost in the source
        imgs[i_img, y:y+h, x:x+w] = 255 # set rectangle as foreground
        y = y+1
        i_img = i_img+1
    X_test.append(imgs)

X_test=np.array(X_test, dtype=np.float32) /255
X_test=X_test.reshape(X_test.shape[0], num_imgs, img_size, img_size, 1)

pred=model.predict_classes(X_test)
pred

array([1, 1, 3, 3], dtype=int64)

Here the 4 test videos are getting classified correctly.
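
The data-generation loops above differ only in the direction of motion, so they could be folded into one helper. The sketch below is one possible way to do that; the names `STEPS` and `make_video` and the exact stopping rule (stop once the rectangle's top-left corner leaves the frame) are assumptions for illustration, not the code used to produce the results above.

```python
import numpy as np

# per-frame step (dx, dy) for each class
STEPS = {"left": (-1, 0), "right": (1, 0), "up": (0, -1), "down": (0, 1)}

def make_video(direction, num_imgs=30, img_size=20,
               min_size=1, max_size=5, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    dx, dy = STEPS[direction]
    # random rectangle size and start position, as in the loops above
    w, h = rng.integers(min_size, max_size, size=2)
    x = int(rng.integers(0, img_size - w))
    y = int(rng.integers(0, img_size - h))
    imgs = np.zeros((num_imgs, img_size, img_size), dtype=np.uint8)
    for i in range(num_imgs):
        if not (0 <= x < img_size and 0 <= y < img_size):
            break  # the rectangle has left the frame; remaining frames stay blank
        imgs[i, y:y + h, x:x + w] = 255  # set rectangle as foreground
        x += dx
        y += dy
    return imgs

video = make_video("right", rng=np.random.default_rng(0))
print(video.shape)
```

A labelled dataset can then be built by calling `make_video` once per video for each of the four directions.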

Thanks for reading. Also go through my very first related post here.

Video Classification Based On Action (from scratch without GPU) was originally published in Analytics Vidhya on Medium, where people are continuing the conversation by highlighting and responding to this story.

Ensemble Learning : Simple Techniques Implemented On Image Data

Dr Sujoy K Goswami, hc — Fri, 10 Apr 2020 19:44:41 GMT


Ensemble models in machine learning combine the decisions from multiple models to improve the overall performance. This can be achieved in various ways. Here I will implement two simple ways (on Image Data):

Averaging: Multiple models are used to make predictions for each data point. The average of the predictions from all the models is used as the final prediction.
Max Voting: Multiple models are used to make predictions for each data point. The prediction of each model is considered a ‘vote’. The prediction that receives the majority of the votes is used as the final prediction.
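
Before the full MNIST implementation, the two rules can be seen on a tiny made-up example. The probabilities below are hypothetical outputs of 3 models for 2 samples and 3 classes, and the function names are chosen only for this sketch.

```python
import numpy as np

# shape: (n_models, n_samples, n_classes)
probs = np.array([
    [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]],  # model 1
    [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]],  # model 2
    [[0.2, 0.7, 0.1], [0.3, 0.2, 0.5]],  # model 3
])

def average_predictions(probs):
    # Averaging: mean probability over models, then the most probable class
    return np.argmax(probs.mean(axis=0), axis=1)

def vote_predictions(probs):
    # Max voting: each model votes for its own most probable class
    votes = np.argmax(probs, axis=2)  # shape: (n_models, n_samples)
    n_classes = probs.shape[2]
    counts = np.stack([np.bincount(votes[:, j], minlength=n_classes)
                       for j in range(votes.shape[1])])
    # ties are resolved in favour of the lowest class index
    return counts.argmax(axis=1)

avg_pred = average_predictions(probs)
vote_pred = vote_predictions(probs)
print(avg_pred, vote_pred)
```

For the first sample the two rules disagree: two models vote for class 0, but the third model is confident enough about class 1 to win the average.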

Implementation on MNIST data (python 3.6.9, keras 2.2.4)

#CNN models

from keras.callbacks import ModelCheckpoint
from keras.datasets import mnist
from keras.layers import Conv2D, MaxPooling2D, GlobalAveragePooling2D, Dropout, Activation, Average
from keras.losses import categorical_crossentropy
from keras.models import Model, Input
from keras.optimizers import Adam
from keras.utils import to_categorical

from tensorflow.python.framework.ops import Tensor
from scipy.stats import mode
from typing import List
import glob
import numpy as np
import os

# data processing
def load_data():
    
    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    x_train = x_train / 255.
    x_test = x_test / 255.
    y_train = to_categorical(y_train, num_classes=10)
    return x_train, x_test, y_train, y_test

x_train, x_test, y_train, y_test = load_data()
x_train = x_train.reshape((x_train.shape[0], 28, 28, 1))
x_test = x_test.reshape((x_test.shape[0], 28, 28, 1))
input_shape = x_train[0].shape
model_input = Input(shape=input_shape)

# models(3) building
def first(model_input: Tensor):
    
    x = Conv2D(96, kernel_size=(3, 3), activation='relu', padding = 'same')(model_input)
    x = Conv2D(96, (3, 3), activation='relu', padding = 'same')(x)
    x = Conv2D(96, (3, 3), activation='relu', padding = 'same')(x)
    x = MaxPooling2D(pool_size=(3, 3), strides = 2)(x)
    x = Conv2D(192, (3, 3), activation='relu', padding = 'same')(x)
    x = Conv2D(192, (3, 3), activation='relu', padding = 'same')(x)
    x = Conv2D(192, (3, 3), activation='relu', padding = 'same')(x)
    x = MaxPooling2D(pool_size=(3, 3), strides = 2)(x)
    x = Conv2D(192, (3, 3), activation='relu', padding = 'same')(x)
    x = Conv2D(192, (1, 1), activation='relu')(x)
    x = Conv2D(10, (1, 1))(x)
    x = GlobalAveragePooling2D()(x)
    x = Activation(activation='softmax')(x)
    
    model = Model(model_input, x, name='first')
    return model

def second(model_input: Tensor):
    
    x = Conv2D(96, kernel_size=(3, 3), activation='relu', padding = 'same')(model_input)
    x = Conv2D(96, (3, 3), activation='relu', padding = 'same')(x)
    x = Conv2D(96, (3, 3), activation='relu', padding = 'same', strides = 2)(x)
    x = Conv2D(192, (3, 3), activation='relu', padding = 'same')(x)
    x = Conv2D(192, (3, 3), activation='relu', padding = 'same')(x)
    x = Conv2D(192, (3, 3), activation='relu', padding = 'same', strides = 2)(x)
    x = Conv2D(192, (3, 3), activation='relu', padding = 'same')(x)
    x = Conv2D(192, (1, 1), activation='relu')(x)
    x = Conv2D(10, (1, 1))(x)
    x = GlobalAveragePooling2D()(x)
    x = Activation(activation='softmax')(x)
        
    model = Model(model_input, x, name='second')
    return model

def third(model_input: Tensor):
    
    #mlpconv block 1
    x = Conv2D(32, (5, 5), activation='relu',padding='valid')(model_input)
    x = Conv2D(32, (1, 1), activation='relu')(x)
    x = Conv2D(32, (1, 1), activation='relu')(x)
    x = MaxPooling2D((2,2))(x)
    x = Dropout(0.5)(x)
    
    #mlpconv block2
    x = Conv2D(64, (3, 3), activation='relu',padding='valid')(x)
    x = Conv2D(64, (1, 1), activation='relu')(x)
    x = Conv2D(64, (1, 1), activation='relu')(x)
    x = MaxPooling2D((2,2))(x)
    x = Dropout(0.5)(x)
    
    #mlpconv block3
    x = Conv2D(128, (3, 3), activation='relu',padding='valid')(x)
    x = Conv2D(32, (1, 1), activation='relu')(x)
    x = Conv2D(10, (1, 1))(x)
    
    x = GlobalAveragePooling2D()(x)
    x = Activation(activation='softmax')(x)
    
    model = Model(model_input, x, name='third')
    return model

first_model = first(model_input)
second_model = second(model_input)
third_model = third(model_input)

# models compilation & training
def compile_and_train(model: Model, num_epochs: int): 
    
    model.compile(loss=categorical_crossentropy, optimizer=Adam(), metrics=['acc']) 
    filepath = 'weights/' + model.name + '.hdf5'
    checkpoint = ModelCheckpoint(filepath, monitor='loss', verbose=0, save_weights_only=True,
                                                 save_best_only=True, mode='auto', period=1)
    history = model.fit(x=x_train, y=y_train, batch_size=32, 
                     epochs=num_epochs, verbose=1, callbacks=[checkpoint], validation_split=0.2)
    return filepath

NUM_EPOCHS = 5
first_weight_file = compile_and_train(first_model, NUM_EPOCHS)
second_weight_file = compile_and_train(second_model, NUM_EPOCHS)
third_weight_file = compile_and_train(third_model, NUM_EPOCHS)

# models evaluation
def evaluate_error(model: Model):
    pred = model.predict(x_test, batch_size = 32)
    pred = np.argmax(pred, axis=1)
    error = np.sum(np.not_equal(pred, y_test))/ y_test.shape[0]  
    return error

e1=evaluate_error(first_model); print(e1)
e2=evaluate_error(second_model); print(e2)
e3=evaluate_error(third_model); print(e3)

Output errors:

0.0083
0.0112
0.0113

#Ensemble models

all_models = [first_model, second_model, third_model]
first_model.load_weights(first_weight_file)
second_model.load_weights(second_weight_file)
third_model.load_weights(third_weight_file)

def ensemble_average(models: List[Model]): # averaging
    
    # use the models passed in rather than the global list
    outputs = [model.outputs[0] for model in models]
    y = Average()(outputs)
    
    model = Model(model_input, y, name='ensemble_average')
    E = evaluate_error(model)
    return E

def ensemble_vote(models: List[Model]): # max-voting
    
    pred = []
    yhats = [model.predict(x_test) for model in models]
    yhats = np.argmax(yhats, axis=2)
    yhats = np.array(yhats)
    #print(yhats.shape)
    for i in range(0,len(x_test)):
        # most frequent label among all models for sample i
        m = mode(yhats[:, i])
        pred = np.append(pred, m[0])
    E = np.sum(np.not_equal(pred, y_test))/ y_test.shape[0]  
    return E

E1 = ensemble_average(all_models); print(E1)
E2 = ensemble_vote(all_models); print(E2)

Output errors:

0.0061
0.0068

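Both ensembles reach a lower error (0.0061 for averaging, 0.0068 for max-voting) than the best single model (0.0083), so ensemble learning gives better accuracy here.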
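
To make the two combination rules concrete, here is a minimal NumPy sketch on made-up softmax outputs. The toy array and the helper names `average_predict` and `vote_predict` are assumptions for illustration only; they are not part of the code above. Averaging takes the mean of the class probabilities and then the argmax, while max-voting takes each model's argmax and then the most frequent label.

```python
import numpy as np

def average_predict(probs: np.ndarray) -> np.ndarray:
    """probs has shape (n_models, n_samples, n_classes)."""
    return np.argmax(probs.mean(axis=0), axis=1)

def vote_predict(probs: np.ndarray) -> np.ndarray:
    """Majority vote over the per-model argmax labels."""
    labels = np.argmax(probs, axis=2)          # (n_models, n_samples)
    n_classes = probs.shape[2]
    # count votes per class for each sample; ties go to the lowest class index
    counts = np.apply_along_axis(np.bincount, 0, labels, minlength=n_classes)
    return np.argmax(counts, axis=0)

# made-up outputs of 3 models for 2 samples and 3 classes
toy = np.array([
    [[0.9, 0.1, 0.0], [0.1, 0.2, 0.7]],   # model 1
    [[0.4, 0.6, 0.0], [0.2, 0.2, 0.6]],   # model 2
    [[0.4, 0.6, 0.0], [0.0, 0.3, 0.7]],   # model 3
])

print(average_predict(toy))
print(vote_predict(toy))
```

On the first toy sample the two rules disagree: one very confident model outweighs two mildly confident ones under averaging, but is outvoted under max-voting.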

Ensemble Learning : Simple Techniques Implemented On Image Data was originally published in Analytics Vidhya on Medium, where people are continuing the conversation by highlighting and responding to this story.

Learning CNN (with Image Data) using Simple PYTHON Programs

Dr Sujoy K Goswami, hc — Sat, 07 Apr 2018 14:58:18 GMT

CNN

[Edited & revised in July 2023]

Here I share what I learned while studying convolutional neural networks (CNNs). The examples are small and simple so that they can be understood quickly. They use Python (≥3.6) & TensorFlow (≥2.3), and a Jupyter notebook is needed to run them as written. Run the code & have fun!

1. Handwriting Recognition

This example downloads the MNIST dataset, trains & validates a small CNN, and then estimates its performance on the test data. A GPU or a machine with plenty of RAM is recommended. An Internet connection is needed for the download.

#importing libraries
from tensorflow.keras.datasets import mnist
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPooling2D
from tensorflow.keras import utils

#loading MNIST data & reshaping
(X_train, y_train), (X_test, y_test) = mnist.load_data()
X_train = X_train.reshape(X_train.shape[0], 28, 28, 1).astype('float32')
X_test = X_test.reshape(X_test.shape[0], 28, 28, 1).astype('float32')

#data pre-processing
X_train = X_train / 255
X_test = X_test / 255
y_train = utils.to_categorical(y_train)
y_test = utils.to_categorical(y_test)
num_classes = y_test.shape[1]

#function for creating deep network model
def create_model():
 model = Sequential()
 model.add(Conv2D(32, (3, 3), input_shape=(28, 28,1), activation='relu'))
 model.add(MaxPooling2D(pool_size=(2, 2)))
 model.add(Dropout(0.2))
 model.add(Flatten())
 model.add(Dense(128, activation='relu'))
 model.add(Dense(num_classes, activation='softmax'))
 model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
 return model

#training, validating & testing
model = create_model()
model.summary()
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=200, verbose=1)
scores = model.evaluate(X_test, y_test, verbose=1)
print("CNN Error: %.2f%%" % (100-scores[1]*100))
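
The layer sizes that `model.summary()` prints can be checked by hand. Below is a rough sketch of the arithmetic, assuming the Keras defaults used above ('valid' padding and stride 1 for the convolution, pooling stride equal to the pool size). The helper `conv_out` is a hypothetical name introduced only for this illustration.

```python
def conv_out(size: int, kernel: int, stride: int = 1) -> int:
    """Output width/height of a 'valid' convolution or pooling window."""
    return (size - kernel) // stride + 1

side = conv_out(28, 3)               # Conv2D 3x3 on a 28x28 image -> 26
side = conv_out(side, 2, stride=2)   # MaxPooling2D 2x2 -> 13
flat = side * side * 32              # Flatten over the 32 feature maps

conv_params = (3 * 3 * 1 + 1) * 32   # 3x3x1 weights + 1 bias per filter
dense1_params = (flat + 1) * 128     # weights + bias of Dense(128)
dense2_params = (128 + 1) * 10       # weights + bias of Dense(10)
total = conv_params + dense1_params + dense2_params

print(side, flat, total)
```

Almost all of the parameters sit in the first dense layer, which is why the flattened feature size matters so much for memory use.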

2. Object Recognition

Here a VGG16 network pre-trained on the ImageNet dataset is used to recognize a common real-life object in a photo. A GPU is not required. An Internet connection is required to download the pre-trained weights.

#importing libraries
import numpy as np
from IPython.display import Image, display
from tensorflow.keras.applications import VGG16, imagenet_utils
from tensorflow.keras.preprocessing.image import img_to_array, load_img

#pre-processing input
inputShape = (224, 224)
preprocess = imagenet_utils.preprocess_input

#loading VGG16 with 'imagenet' pre-trained weights
model = VGG16(weights="imagenet")

#displaying, loading & pre-processing the test image (set the path to your own test image)
display(Image('./test.jpg'))
image = load_img("./test.jpg", target_size=inputShape)
image = img_to_array(image)
image = np.expand_dims(image, axis=0)
image = preprocess(image)

#predicting the output
preds = model.predict(image)
P = imagenet_utils.decode_predictions(preds)
for (i, (imagenetID, label, prob)) in enumerate(P[0]):
 print("{}. {}: {:.2f}%".format(i + 1, label, prob * 100))
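
`decode_predictions` returns the highest-scoring classes for each image in the batch. The idea can be sketched with plain NumPy on a made-up score vector. The class names and scores below are invented for illustration and are not ImageNet outputs, and `top_k` is a hypothetical helper, not a Keras function.

```python
import numpy as np

def top_k(scores: np.ndarray, labels: list, k: int = 3) -> list:
    """Return the k (label, score) pairs with the highest scores, best first."""
    order = np.argsort(scores)[::-1][:k]
    return [(labels[i], float(scores[i])) for i in order]

labels = ["cat", "dog", "car", "tree"]          # made-up class names
scores = np.array([0.10, 0.60, 0.05, 0.25])     # made-up softmax output

for rank, (label, prob) in enumerate(top_k(scores, labels), start=1):
    print("{}. {}: {:.2f}%".format(rank, label, prob * 100))
```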

3. Single Object Detection (with Bounding Box)

Here the dataset is generated by the code itself. Each image contains one rectangle as the object, and a simple fully connected neural network predicts its bounding box. Neither a GPU nor an Internet connection is required.

#importing libraries
import numpy as np
import matplotlib.pyplot as plt
import matplotlib

#creating database
num_imgs = 1000
img_size = 8
min_object_size = 1
max_object_size = 4
num_objects = 1
bboxes = np.zeros((num_imgs, num_objects, 4))
imgs = np.zeros((num_imgs, img_size, img_size))  # set background to 0
for i_img in range(num_imgs):
    for i_object in range(num_objects):
        w, h = np.random.randint(min_object_size, max_object_size, size=2)
        x = np.random.randint(0, img_size - w)
        y = np.random.randint(0, img_size - h)
        imgs[i_img, x:x+w, y:y+h] = 1.  # set rectangle to 1
        bboxes[i_img, i_object] = [x, y, w, h]
        
imgs.shape, bboxes.shape

#plotting sample data
i = 0
plt.imshow(imgs[i].T, cmap='Greys', interpolation='none', origin='lower', extent=[0, img_size, 0, img_size])
for bbox in bboxes[i]:
    plt.gca().add_patch(matplotlib.patches.Rectangle((bbox[0], bbox[1]), bbox[2], bbox[3], ec='r', fc='none'))
    
#reshaping input
X = (imgs.reshape(num_imgs, -1) - np.mean(imgs)) / np.std(imgs)
X.shape, np.mean(X), np.std(X)

#reshaping output
y = bboxes.reshape(num_imgs, -1) / img_size
y.shape, np.mean(y), np.std(y)

#final training & testing data
i = int(0.8 * num_imgs)
train_X = X[:i]
test_X = X[i:]
train_y = y[:i]
test_y = y[i:]
test_imgs = imgs[i:]
test_bboxes = bboxes[i:]

#creating deep network model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout, Convolution2D, MaxPooling2D 
from tensorflow.keras.optimizers import SGD
model = Sequential([
        Dense(500, input_dim=X.shape[-1]),
        Activation('relu'),
        Dense(300), 
        Activation('relu'), 
        Dense(100), 
        Activation('relu'), 
        Dropout(0.2), 
        Dense(y.shape[-1])
    ])
model.compile('adadelta', 'mse')

#training & validating
model.fit(train_X, train_y, epochs=100, validation_data=(test_X, test_y), verbose=2)

#predicting on test data
pred_y = model.predict(test_X)
pred_bboxes = pred_y * img_size
pred_bboxes = pred_bboxes.reshape(len(pred_bboxes), num_objects, -1)
pred_bboxes.shape

#plotting the prediction
plt.figure(figsize=(12, 3))
for i_subplot in range(1, 6):
    plt.subplot(1, 5, i_subplot)
    i = np.random.randint(len(test_imgs))
    plt.imshow(test_imgs[i].T, cmap='Greys', interpolation='none', origin='lower', extent=[0, img_size, 0, img_size])
    for pred_bbox, exp_bbox in zip(pred_bboxes[i], test_bboxes[i]):
        plt.gca().add_patch(matplotlib.patches.Rectangle((pred_bbox[0], pred_bbox[1]), pred_bbox[2], pred_bbox[3], ec='r', fc='none'))

Sample Outputs
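
The plots give only a visual impression. One common way to put a number on how well a predicted box matches the true one is intersection over union (IoU). The code above does not compute it, so the sketch below is an optional addition, and `iou` is a hypothetical helper. It assumes boxes in the same `[x, y, w, h]` format as `bboxes`, with positive width and height.

```python
def iou(box_a, box_b) -> float:
    """Intersection over union of two [x, y, w, h] boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # width and height of the overlapping region (non-positive if no overlap)
    inter_w = min(ax + aw, bx + bw) - max(ax, bx)
    inter_h = min(ay + ah, by + bh) - max(ay, by)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union

print(iou([0, 0, 2, 2], [0, 0, 2, 2]))   # identical boxes
print(iou([0, 0, 2, 2], [1, 1, 2, 2]))   # partial overlap
print(iou([0, 0, 1, 1], [2, 2, 1, 1]))   # no overlap
```

With the variables from the code above, a mean IoU over the test set could be computed as `np.mean([iou(p[0], t[0]) for p, t in zip(pred_bboxes, test_bboxes)])`; values close to 1 mean the predicted box nearly coincides with the true one.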

4. Multiple Objects Detection (with Shapes)

Here the dataset is again generated by the code itself. Read the comments carefully. Neither a GPU nor an Internet connection is required.

# importing libraries
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
# creating dataset
# here 0-4 black objects (different shapes with random sizes) are placed in a noisy image (24 x 24).
# the image is divided into 4 quadrants (w.r.t. the image center) & each quadrant randomly contains 0 or 1 object.
# 4000 such images are generated.
# the rectangular & lower-triangular objects are the objects of interest.
# the upper-triangular objects are dummies.
# due to randomness, a few images may be blank or contain only an upper-triangular (dummy) object.
# bounding boxes of the objects of interest are also saved.
num_imgs = 4000
img_size = 24
min_rect_size = 3
max_rect_size = 9
max_num_objects = 5
bboxes = np.zeros((num_imgs, max_num_objects, 4))
imgs = np.random.rand(num_imgs, img_size, img_size)
shapes = np.zeros((num_imgs, max_num_objects, 1))
for i_img in range(num_imgs):
    i_object = 0
    if np.random.choice([True, False]):
        width, height = np.random.randint(min_rect_size, max_rect_size, size=2)
        x = np.random.randint(0, img_size // 2 - width)
        y = np.random.randint(0, img_size // 2 - height)
        imgs[i_img, x:x+width, y:y+height] = 1.
        bboxes[i_img, i_object] = [x, y, width, height]
        shapes[i_img, i_object] = [0]
        i_object += 1
    if np.random.choice([True, False]):
        size = np.random.randint(min_rect_size, max_rect_size)
        x, y = np.random.randint(img_size // 2, img_size - size, size=2)
        mask = np.tril_indices(size)
        imgs[i_img, x + mask[0], y + mask[1]] = 1.
        bboxes[i_img, i_object] = [x, y, size, size]
        shapes[i_img, i_object] = [1]
        i_object += 1
    if np.random.choice([True, False]):
        width, height = np.random.randint(min_rect_size, max_rect_size, size=2)
        x = np.random.randint(img_size // 2, img_size - width)
        y = np.random.randint(0, img_size // 2 - height)
        imgs[i_img, x:x+width, y:y+height] = 1.
        bboxes[i_img, i_object] = [x, y, width, height]
        shapes[i_img, i_object] = [0]
        i_object += 1
    if np.random.choice([True, False]):
        size = np.random.randint(min_rect_size, max_rect_size)
        x = np.random.randint(0, img_size // 2 - size)
        y = np.random.randint(img_size // 2, img_size - size)
        mask = np.triu_indices(size)
        imgs[i_img, x + mask[0], y + mask[1]] = 1.
        #bboxes[i_img, i_object] = [x, y, size, size]
        #shapes[i_img, i_object] = [1]
        #i_object += 1
    for i in range(i_object, max_num_objects):
        bboxes[i_img, i] = [-1, -1, -1, -1]
        shapes[i_img, i] = [-1]
            
imgs.shape, bboxes.shape
# plotting sample input data
# see 5 randomly chosen input images. the bounding boxes of the objects of interest are marked in red.
plt.figure(figsize=(24, 8))
for i_subplot in range(1, 6):
    plt.subplot(1, 5, i_subplot)
    i = np.random.randint(num_imgs)
    plt.imshow(imgs[i].T, cmap='Greys', interpolation='none', origin='lower', extent=[0, img_size, 0, img_size])
    for bbox, shape in zip(bboxes[i], shapes[i]):
        plt.gca().add_patch(matplotlib.patches.Rectangle((bbox[0], bbox[1]), bbox[2], bbox[3], ec='r', fc='none'))
# pre-processing data
X = (imgs.reshape(num_imgs, img_size, img_size, 1) - np.mean(imgs)) / np.std(imgs)
y = np.concatenate([bboxes / img_size, shapes], axis=-1).reshape(num_imgs, -1)
X.shape, y.shape
# final training & testing data
i = int(0.8 * num_imgs)
train_X = X[:i]
test_X = X[i:]
train_y = y[:i]
test_y = y[i:]
test_imgs = imgs[i:]
test_bboxes = bboxes[i:]
# creating deep network model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation, Dropout, Convolution2D, MaxPooling2D, Flatten
from tensorflow.keras.optimizers import SGD
model = Sequential([
        Convolution2D(8, (3, 3), activation='relu', input_shape=(24, 24, 1)),
        Convolution2D(8, (3, 3), activation='relu'),
        MaxPooling2D(pool_size=(2, 2)),
        Convolution2D(8, (3, 3), activation='relu'),
        MaxPooling2D(pool_size=(2, 2)),
        Flatten(),
        Dense(3000),
        Activation('relu'),
        Dropout(0.3),
        Dense(1500), 
        Activation('relu'), 
        Dense(500), 
        Activation('relu'),
        Dropout(0.3),
        Dense(50),
        Activation('relu'),
        Dense(y.shape[-1])
    ])
model.compile('adadelta', 'mse')
# training the model & validating
model.fit(train_X, train_y, epochs=100, validation_data=(test_X, test_y), verbose=2)
# predicting on test data
pred_y = model.predict(test_X)
pred_y = pred_y.reshape(len(pred_y), max_num_objects, -1)
pred_bboxes = pred_y[..., :4] * img_size
pred_shapes = pred_y[..., 4:5]
pred_bboxes.shape, pred_shapes.shape
# plotting the predictions
# see 5 randomly chosen output predictions (blue rectangles / green triangles).
# note that no upper-triangular shape is predicted.
# accuracy could be improved with other deep models and/or by tuning the associated parameters & methods.
plt.figure(figsize=(24, 8))
for i_subplot in range(1, 6):
    plt.subplot(1, 5, i_subplot)
    i = np.random.randint(len(test_X))
    plt.imshow(test_imgs[i].T, cmap='Greys', interpolation='none', origin='lower', extent=[0, img_size, 0, img_size])
    for pred_bbox, pred_shape in zip(pred_bboxes[i], pred_shapes[i]):
        if pred_shape[0] <= 0.5:
            plt.gca().add_patch(matplotlib.patches.Rectangle((pred_bbox[0], pred_bbox[1]), pred_bbox[2], pred_bbox[3], fc='b', alpha=0.5))
        else:
            xy = ([[pred_bbox[0]+pred_bbox[2], pred_bbox[1]+pred_bbox[3]],
                    [pred_bbox[0]+pred_bbox[2], pred_bbox[1]],
                    [pred_bbox[0], pred_bbox[1]]])
            plt.gca().add_patch(matplotlib.patches.Polygon(xy, closed=True, fc='g', alpha=0.5))

Sample Random Inputs

Sample Random Outputs
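
The target vector packs five numbers per object slot (x, y, width, height, shape flag), and the prediction code unpacks them again. The round trip can be sketched on a single hand-made image description. The box values below are invented for illustration; the encoding and decoding lines mirror the ones used above, and empty slots are filled with -1 as in the dataset code.

```python
import numpy as np

img_size, max_num_objects = 24, 5

# one image: a rectangle in slot 0, a lower triangle in slot 1, the rest empty (-1)
bboxes = np.full((1, max_num_objects, 4), -1.0)
shapes = np.full((1, max_num_objects, 1), -1.0)
bboxes[0, 0], shapes[0, 0] = [2, 3, 6, 4], [0]
bboxes[0, 1], shapes[0, 1] = [14, 15, 5, 5], [1]

# encoding: scale boxes to the image size, append the shape flag, flatten
y = np.concatenate([bboxes / img_size, shapes], axis=-1).reshape(1, -1)
print(y.shape)   # 5 slots x 5 numbers per image

# decoding: undo the reshape and the scaling
slots = y.reshape(1, max_num_objects, -1)
dec_bboxes = slots[..., :4] * img_size
dec_shapes = slots[..., 4:5]

for box, shape in zip(dec_bboxes[0], dec_shapes[0]):
    if box[2] <= 0:      # empty slot, filled with -1 during dataset creation
        continue
    kind = "rectangle" if shape[0] <= 0.5 else "lower triangle"
    print(kind, box)
```

Because empty slots carry a shape flag of -1, they fall on the "rectangle" side of the 0.5 threshold used in the plotting code; skipping slots with non-positive width, as done here, is one possible way to ignore them.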

References:

- https://towardsdatascience.com/object-detection-with-neural-networks-a4e2c46b4491

Learning CNN (with Image Data) using Simple PYTHON Programs was originally published in Analytics Vidhya on Medium, where people are continuing the conversation by highlighting and responding to this story.