Understanding Convolutional Neural Networks (CNNs) Through Image Classification — Analyzing surveillance footage to identify pedestrians

Apr 30, 2024

In real-life scenarios, such as monitoring a busy intersection to detect pedestrians and alert drivers or traffic control systems, a Convolutional Neural Network (CNN) has to focus on the most relevant features, tolerate noise and variations in lighting or weather, and make accurate, real-time decisions that can help prevent accidents and improve traffic flow. The steps below describe how a CNN learns from large amounts of data and applies that learning to pedestrian safety in traffic environments.

Step 1: Image Acquisition

The first step is capturing the footage: a surveillance camera records a busy street, and each frame becomes an input image for the CNN. Every frame carries three color channels, red, green, and blue (RGB), which the following image processing steps rely on.

Step 2: Preprocessing

2.1 Convert to RGB

Ensure the image is in RGB format (Red, Green, Blue), where each pixel’s color is represented by a combination of these three primary colors. This matters because most pretrained CNN models expect RGB input, and a consistent channel order keeps processing uniform across frames.
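As a minimal sketch of this conversion, assume the frame arrives as a NumPy array in BGR channel order (the order OpenCV uses when it loads images). Reversing the last axis then yields RGB. The function name `bgr_to_rgb` and the sample pixel are illustrative, not part of any particular pipeline.

```python
import numpy as np

def bgr_to_rgb(frame: np.ndarray) -> np.ndarray:
    """Reverse the channel axis of an H x W x 3 array (BGR -> RGB)."""
    if frame.ndim != 3 or frame.shape[2] != 3:
        raise ValueError("expected an H x W x 3 array")
    return frame[:, :, ::-1]

# Hypothetical 1x1 frame holding a pure-blue pixel in BGR order.
bgr_frame = np.array([[[255, 0, 0]]], dtype=np.uint8)
rgb_frame = bgr_to_rgb(bgr_frame)
print(rgb_frame[0, 0])  # blue now sits in the last (B) position
```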

2.2 Resize Image

Resize the image to the dimensions required by the CNN (e.g., 224x224 pixels for models like VGG16). This standardization is necessary to match the input shape that the CNN architecture expects.
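The resizing idea can be sketched with nearest-neighbor sampling in NumPy. This is only an illustration: real pipelines usually rely on an image library with bilinear or bicubic interpolation, and the 480x640 frame size below is an assumed camera resolution.

```python
import numpy as np

def resize_nearest(image: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Resize an H x W x C image by picking the nearest source pixel."""
    in_h, in_w = image.shape[:2]
    rows = np.arange(out_h) * in_h // out_h   # source row for each output row
    cols = np.arange(out_w) * in_w // out_w   # source column for each output column
    return image[rows[:, None], cols]

frame = np.zeros((480, 640, 3), dtype=np.uint8)  # assumed camera frame size
resized = resize_nearest(frame, 224, 224)
print(resized.shape)  # (224, 224, 3)
```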

2.3 Normalize the Pixel Values

Normalize pixel values, typically to the range 0–1, by dividing each value by 255. Keeping inputs on a small, consistent scale stabilizes gradient updates and speeds up convergence during training.
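A short sketch of this scaling, assuming 8-bit pixel values stored in a NumPy array (the helper name `normalize` is illustrative):

```python
import numpy as np

def normalize(image: np.ndarray) -> np.ndarray:
    """Scale 8-bit pixel values (0-255) to floats in the range 0-1."""
    return image.astype(np.float32) / 255.0

pixels = np.array([0, 128, 255], dtype=np.uint8)
scaled = normalize(pixels)
print(scaled.min(), scaled.max())  # 0.0 1.0
```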

Step 3: Create Matrices from RGB Values

Matrix Representation: An RGB image of size 𝑚×𝑛 is represented as three 𝑚×𝑛 matrices, one for each color channel.
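To make the matrix view concrete, the sketch below builds a tiny 2×3 image with made-up values and slices out the three channel matrices. It assumes the common height × width × channel array layout.

```python
import numpy as np

m, n = 2, 3
# Hypothetical m x n RGB image; the pixel values are placeholders.
image = np.arange(m * n * 3, dtype=np.uint8).reshape(m, n, 3)

red = image[:, :, 0]    # m x n matrix of red intensities
green = image[:, :, 1]  # m x n matrix of green intensities
blue = image[:, :, 2]   # m x n matrix of blue intensities
print(red.shape, green.shape, blue.shape)  # (2, 3) (2, 3) (2, 3)
```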

Step 4: Define the CNN Architecture

Architecture Setup: A suitable architecture stacks several layers that extract progressively more complex features, starting from basic edges and colors and building up to features specific to pedestrian forms, such as heads, arms, and legs.
VGG-16 is a deep CNN that was originally developed for image classification and can be adapted for pedestrian detection. It consists of 13 convolutional layers, 5 pooling layers, and 3 fully connected layers. The depth of this network allows it to learn high-level features in images, which makes it a reasonable choice for detecting pedestrians even in complex urban scenes.
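The layer counts above can be checked with a small sketch. The list below writes the VGG-16 convolutional stack as output-channel counts, with "M" marking a 2×2 max-pooling layer. The constant names are illustrative, and for pedestrian detection the final 1000-class layer of the original model would presumably be replaced with a task-specific output.

```python
# VGG-16 convolutional stack: numbers are conv output channels, "M" is max pooling.
VGG16_CONV_CFG = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
                  512, 512, 512, "M", 512, 512, 512, "M"]
FC_UNITS = [4096, 4096, 1000]  # 1000 = ImageNet classes in the original model

num_conv = sum(1 for v in VGG16_CONV_CFG if v != "M")
num_pool = VGG16_CONV_CFG.count("M")

# The 3x3 convolutions use padding that keeps the size; each pool halves it.
size = 224
for v in VGG16_CONV_CFG:
    if v == "M":
        size //= 2

print(num_conv, num_pool, len(FC_UNITS), size)  # 13 5 3 7
```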

Step 5: Convolution Operations

5.1 Filter Definition

Feature Extraction: Filters, also known as kernels, are applied during convolution operations to detect edges and textures. For example, vertical edges might indicate the sides of bodies, while horizontal lines might delineate the horizon or a crosswalk.
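As an illustration, the sketch below uses the classic Sobel kernels as hand-written edge filters. In a trained CNN the filter values are learned from data rather than set by hand, so these kernels are a stand-in for what early layers often end up resembling.

```python
import numpy as np

# Sobel kernel that responds to vertical edges (dark-to-bright, left to right).
SOBEL_VERTICAL = np.array([[-1, 0, 1],
                           [-2, 0, 2],
                           [-1, 0, 1]], dtype=np.float32)
# Its transpose responds to horizontal edges instead.
SOBEL_HORIZONTAL = SOBEL_VERTICAL.T

# Hypothetical 3x3 patch: dark on the left, bright on the right (a vertical edge).
patch = np.array([[0, 0, 1],
                  [0, 0, 1],
                  [0, 0, 1]], dtype=np.float32)

vertical_response = float(np.sum(patch * SOBEL_VERTICAL))
horizontal_response = float(np.sum(patch * SOBEL_HORIZONTAL))
print(vertical_response, horizontal_response)  # 4.0 0.0
```

The vertical filter fires strongly on this patch while the horizontal one stays silent, which is how different filters separate different structures in the scene.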

5.2 Stride Settings

The stride determines the step size the filter moves each time it slides over the image. A larger stride reduces the spatial dimensions of the output feature map, which can decrease computational complexity.
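The effect of stride on the output size follows the standard formula floor((n + 2p - k) / s) + 1 for input size n, kernel size k, stride s, and padding p. A small sketch (the helper name `conv_output_size` is illustrative):

```python
def conv_output_size(input_size: int, kernel_size: int,
                     stride: int = 1, padding: int = 0) -> int:
    """Spatial size of a convolution output along one dimension."""
    return (input_size + 2 * padding - kernel_size) // stride + 1

size_stride_1 = conv_output_size(224, 3, stride=1)
size_stride_2 = conv_output_size(224, 3, stride=2)
print(size_stride_1, size_stride_2)  # 222 111
```

Doubling the stride roughly halves each spatial dimension, so the feature map has about a quarter as many positions to compute.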

5.3 Padding

Padding adds a border of zeros around the original image so that the filter can also be centered on the border elements of the input matrix. With a suitable amount of padding (for example, one pixel for a 3x3 filter with stride 1), the output keeps the same spatial dimensions as the input.
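A minimal sketch of zero padding with NumPy's `np.pad`, using a made-up 3×3 input and assuming a 3×3 filter with stride 1:

```python
import numpy as np

image = np.arange(1, 10, dtype=np.float32).reshape(3, 3)

# Add a one-pixel border of zeros on every side: 3x3 becomes 5x5.
padded = np.pad(image, pad_width=1, mode="constant", constant_values=0)

# A 3x3 filter with stride 1 fits (5 - 3 + 1) = 3 times along each axis,
# so the output is 3x3 again, the same size as the unpadded input.
kernel_size = 3
out_size = padded.shape[0] - kernel_size + 1
print(padded.shape, out_size)  # (5, 5) 3
```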

Activation Functions: These introduce non-linearity, enabling the network to learn complex patterns. ReLU (Rectified Linear Unit), which outputs max(0, x), is widely used because it is cheap to compute and passes gradients through unchanged for positive inputs, which speeds up training and reduces the vanishing-gradient problem seen with saturating activations.
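ReLU and its gradient are simple enough to sketch directly. The input values below are arbitrary, and the helper names are illustrative.

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    """Keep positive values, replace negative values with zero."""
    return np.maximum(x, 0)

def relu_grad(x: np.ndarray) -> np.ndarray:
    """Gradient of ReLU: 1 where the input is positive, 0 elsewhere."""
    return (x > 0).astype(x.dtype)

feature_map = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(feature_map))       # negatives are clipped to zero
print(relu_grad(feature_map))  # gradient flows only through positive inputs
```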

5.4 Performing Convolution
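Putting the filter, stride, and padding together, the convolution itself slides the filter over the input and takes a weighted sum at each position. The sketch below is a deliberately simple single-channel version with explicit loops. Deep learning libraries use heavily optimized multi-channel implementations, and the 4×4 input here is made up.

```python
import numpy as np

def conv2d(image: np.ndarray, kernel: np.ndarray,
           stride: int = 1, padding: int = 0) -> np.ndarray:
    """Single-channel 2D convolution as used in CNNs (no kernel flipping)."""
    if padding > 0:
        image = np.pad(image, padding, mode="constant")
    k_h, k_w = kernel.shape
    out_h = (image.shape[0] - k_h) // stride + 1
    out_w = (image.shape[1] - k_w) // stride + 1
    output = np.zeros((out_h, out_w), dtype=np.float32)
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            window = image[top:top + k_h, left:left + k_w]
            output[i, j] = np.sum(window * kernel)  # weighted sum of the window
    return output

# Hypothetical 4x4 input: dark left half, bright right half.
image = np.array([[0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1],
                  [0, 0, 1, 1]], dtype=np.float32)
kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]], dtype=np.float32)  # vertical-edge filter

feature_map = conv2d(image, kernel)
print(feature_map.shape)  # (2, 2)
print(feature_map)        # every window straddles the edge, so every response is 4
```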

Step 6: Pooling (Down sampling)

Apply pooling layers to reduce the spatial dimensions of the feature maps. This reduces the computational load, helps control overfitting, and helps the CNN generalize better.
Reduction of Spatial Size: Pooling keeps the strongest responses in each region while discarding fine spatial detail, which is useful in busy traffic scenes where background elements can distract from the key task of pedestrian detection.
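A minimal sketch of 2×2 max pooling with stride 2, the variant used in VGG-style networks. The feature map values are invented for illustration.

```python
import numpy as np

def max_pool2d(feature_map: np.ndarray, size: int = 2, stride: int = 2) -> np.ndarray:
    """Keep the maximum of each size x size window."""
    out_h = (feature_map.shape[0] - size) // stride + 1
    out_w = (feature_map.shape[1] - size) // stride + 1
    output = np.empty((out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            output[i, j] = feature_map[top:top + size, left:left + size].max()
    return output

feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 5],
                        [0, 1, 7, 2],
                        [3, 2, 1, 6]], dtype=np.float32)
pooled = max_pool2d(feature_map)
print(pooled)  # 4x4 input reduced to 2x2: [[4, 5], [3, 7]]
```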

Step 7: Flattening

The output from convolution and pooling layers is flattened into a vector for the fully connected layers.
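As a sketch of flattening, assume the final pooled output has the shape VGG-16 produces for a 224×224 input, namely 7×7 spatial positions with 512 channels. The array below is filled with zeros purely as a placeholder.

```python
import numpy as np

# Placeholder for the last pooling output of a VGG-16-style network.
pooled_maps = np.zeros((7, 7, 512), dtype=np.float32)

# Flatten the 3D block of features into one long vector.
flat = pooled_maps.reshape(-1)
print(flat.shape)  # (25088,) because 7 * 7 * 512 = 25088
```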

Step 8: Fully Connected Layers

After feature extraction, these layers classify the image based on the detected features. They learn non-linear combinations of those features, such as body shapes and postures, to make the final prediction about the presence of pedestrians.
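A fully connected layer is a matrix multiplication plus a bias, usually followed by an activation. The sketch below chains two such layers on a made-up 8-value feature vector. The layer sizes and the random weights are placeholders, since real weights come from training.

```python
import numpy as np

def dense(x: np.ndarray, weights: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """Fully connected layer: every output depends on every input."""
    return x @ weights + bias

rng = np.random.default_rng(0)
features = rng.standard_normal(8)          # stand-in for the flattened features

w1, b1 = rng.standard_normal((8, 4)) * 0.1, np.zeros(4)   # hidden layer (4 units)
w2, b2 = rng.standard_normal((4, 2)) * 0.1, np.zeros(2)   # output scores (2 classes)

hidden = np.maximum(dense(features, w1, b1), 0)  # ReLU adds the non-linearity
logits = dense(hidden, w2, b2)                   # raw scores, one per class
print(hidden.shape, logits.shape)  # (4,) (2,)
```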

Step 9: Output Layer

The output layer classifies each input image as containing pedestrians or not. For classification tasks, this typically uses a softmax activation function that outputs a probability distribution over the classes, so the probabilities of all possible outcomes (pedestrian or no pedestrian) sum to one. With only two classes, a single sigmoid unit that outputs the probability of a pedestrian is an equivalent and common alternative.
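A small sketch of softmax over two class scores. The logit values are arbitrary, and index 0 is assumed here to mean "pedestrian".

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Turn raw scores into probabilities that sum to one."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

logits = np.array([2.0, 0.5])  # assumed order: [pedestrian, no pedestrian]
probs = softmax(logits)
print(probs, probs.sum())
```

For two classes, the first softmax probability equals a sigmoid applied to the difference of the two scores, which is why a single sigmoid output is often used for binary problems.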

Step 10: Compile and Train the Model

Model Compilation

Loss Function: The choice of a loss function depends on the specific task. For binary classification (pedestrian or not), Binary Cross-Entropy is a common choice.
Optimizer: Commonly used optimizers include SGD (Stochastic Gradient Descent), Adam, or RMSprop, which help to minimize the loss function by adjusting the weights of the network.
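Both pieces can be sketched in a few lines: binary cross-entropy as the loss, and a plain SGD update as the simplest optimizer. The labels, predictions, and learning rate below are made up, and Adam or RMSprop would add per-parameter scaling on top of this basic update.

```python
import numpy as np

def binary_cross_entropy(y_true: np.ndarray, y_prob: np.ndarray,
                         eps: float = 1e-7) -> float:
    """Average binary cross-entropy; eps avoids taking log(0)."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(y_prob)
                          + (1 - y_true) * np.log(1 - y_prob)))

def sgd_step(weights: np.ndarray, grads: np.ndarray, lr: float) -> np.ndarray:
    """Plain SGD: move the weights a small step against the gradient."""
    return weights - lr * grads

y_true = np.array([1.0, 0.0, 1.0, 0.0])        # 1 = pedestrian, 0 = none
confident = np.array([0.9, 0.1, 0.8, 0.2])     # mostly correct predictions
uncertain = np.array([0.5, 0.5, 0.5, 0.5])     # no information

loss_confident = binary_cross_entropy(y_true, confident)
loss_uncertain = binary_cross_entropy(y_true, uncertain)
print(loss_confident < loss_uncertain)  # True: better predictions, lower loss
```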

Model Training

During training, the model learns by adjusting the weights to minimize the loss. This is done by feeding the input data forward through the network, calculating the loss, and then propagating the error back through the network (backpropagation) to update the weights.
Example: Given a batch of urban scene images, some with pedestrians and some without, the model adjusts its weights so that it labels more of those images correctly on the next pass.
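The forward pass, loss, gradient, and update cycle can be sketched on a toy problem. To keep it short, the example below trains a single sigmoid unit on two made-up clusters of 2D points instead of a full CNN on images, so the "backpropagation" is just one chain-rule step. The data, learning rate, and epoch count are all assumptions.

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# Toy stand-in for image features: two clusters, label 0 and label 1.
x = np.vstack([rng.normal(-1.0, 0.5, (50, 2)), rng.normal(1.0, 0.5, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

w, b, lr = np.zeros(2), 0.0, 0.5
losses = []
for epoch in range(100):
    p = sigmoid(x @ w + b)                       # forward pass
    p_safe = np.clip(p, 1e-7, 1 - 1e-7)
    loss = -np.mean(y * np.log(p_safe) + (1 - y) * np.log(1 - p_safe))
    losses.append(float(loss))
    grad_z = (p - y) / len(y)                    # gradient of the loss w.r.t. x @ w + b
    w -= lr * (x.T @ grad_z)                     # propagate back to the weights
    b -= lr * float(grad_z.sum())                # and to the bias

accuracy = float(np.mean((sigmoid(x @ w + b) > 0.5) == (y == 1)))
print(losses[0] > losses[-1], accuracy)
```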

Step 11: Model Evaluation

Evaluating the Model

The model is evaluated on unseen data to gauge how well it generalizes. Common metrics include accuracy, precision, recall, and F1-score. For imbalanced datasets, where frames without pedestrians may far outnumber frames with them, precision, recall, and F1-score are more informative than accuracy alone.

If the model is tested on a separate dataset from a different city and achieves high precision and recall, this suggests that it has learned effectively and generalizes well.
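These metrics follow directly from counts of true positives, false positives, and false negatives. The sketch below computes them for a made-up set of eight labels, treating 1 as "pedestrian".

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical ground truth and predictions for eight frames.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, precision, recall, f1)
```

Here accuracy is 0.75 while precision and recall are both 2/3: one missed pedestrian and one false alarm are partly hidden by the many correctly classified empty frames.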

Step 12: Prediction

Use the trained and validated model to make predictions on new, real-world data.
For a new surveillance video feed, the model can predict the presence of pedestrians in each frame, potentially triggering alerts or informing traffic management systems.
In smart city infrastructure, the model could be integrated into traffic cameras to continuously monitor pedestrian traffic and enhance pedestrian safety at intersections.
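As a rough sketch of the alerting logic, assume the trained model has already produced one pedestrian probability per video frame. The function name, the probabilities, and the 0.8 threshold below are all hypothetical; a real deployment would tune the threshold against the cost of missed pedestrians versus false alarms.

```python
def frames_with_pedestrians(frame_probabilities, threshold=0.5):
    """Return the indices of frames whose pedestrian probability meets the threshold."""
    return [i for i, p in enumerate(frame_probabilities) if p >= threshold]

# Hypothetical model outputs for five consecutive frames.
frame_probabilities = [0.02, 0.10, 0.85, 0.91, 0.40]
alerts = frames_with_pedestrians(frame_probabilities, threshold=0.8)
print(alerts)  # [2, 3]
```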