Addressing The Effect of Bad Data Quality on Computer Vision: A Case Study

Joshua Sirusstara
Jun 10, 2024


My eyes at the moment of the apparitions by August Natterer

Introduction

In this case study, we break down how overfitting can occur in a computer vision modelling task, showcasing its impact through a classical model, the convolutional neural network (CNN). We explore how poor-quality data with limited variation can produce misleadingly high performance metrics, ultimately resulting in a subpar model when tested in dynamic environments. To illustrate this concept, we focus on a quintessential task: American Sign Language (ASL) alphabet classification. ASL classification poses a unique challenge because many signs differ only by small variations in hand pose, making it susceptible to overfitting when trained on insufficiently diverse datasets.

Through this case study, we highlight the crucial role of data quality and its significant impact on the reliability and robustness of AI systems in real-world applications.

Dataset

The dataset we use is the ASL alphabet dataset, which contains shots of real human hands performing the ASL alphabet signs. The dataset is split with a 4:1 ratio into 20,800 training images and 5,200 validation images, stratified to maintain the class distribution across the 26 alphabet categories. For the test set, we compare the model's final performance against a human baseline.
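A stratified split like this can be done with scikit-learn; the snippet below is an illustrative sketch, where the dataframe and column names are assumptions rather than the exact ones used in the original notebook:

from sklearn.model_selection import train_test_split

# df is an illustrative dataframe with one row per image: a 'path' column
# and a 'label' column holding the letter ('A'..'Z').
train_df, valid_df = train_test_split(
    df,
    test_size=0.2,         # 4:1 split -> 20,800 train / 5,200 validation images
    stratify=df['label'],  # keep all 26 classes balanced in both splits
    random_state=42,
)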

Before splitting, we noticed that the dataset contains similar, and possibly redundant, shots of the same hand pose. This redundancy in the training set could compromise data quality, since near-duplicate images can leak into the validation set.

Redundant sample found in the dataset

To quickly reduce the leakage effect, we apply data augmentation after splitting the data into training and validation sets, as sketched below.
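One way to set this up is with Keras' ImageDataGenerator, augmenting only the training generator while the validation generator is only rescaled. The transforms and generator arguments below are illustrative assumptions, not the exact settings of the original notebook:

from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Augment only the training images; validation images are only rescaled.
train_datagen = ImageDataGenerator(
    rescale=1. / 255,
    rotation_range=10,
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
)
valid_datagen = ImageDataGenerator(rescale=1. / 255)

# IMG_SHAPE is the model's input shape, defined in the notebook.
train_generator = train_datagen.flow_from_dataframe(
    train_df, x_col='path', y_col='label',
    target_size=IMG_SHAPE[:2], class_mode='sparse', batch_size=32)
valid_generator = valid_datagen.flow_from_dataframe(
    valid_df, x_col='path', y_col='label',
    target_size=IMG_SHAPE[:2], class_mode='sparse', batch_size=32)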

Modelling

The baseline model is a CNN with the architecture shown in the code snippet below:

import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import (Conv2D, MaxPooling2D, Flatten, Dense,
                                     Activation, BatchNormalization, Dropout)

tf.keras.backend.clear_session()  # start from a clean state before building the model

# Baseline CNN: three conv/pool blocks followed by a small dense head.
model = Sequential()
model.add(Conv2D(64, (5, 5), padding='same',
                 input_shape=IMG_SHAPE))  # IMG_SHAPE is defined earlier in the notebook
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(32, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(32, (3, 3), padding='same'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(256))
model.add(Activation('relu'))
model.add(Dense(26))  # one logit per letter of the alphabet

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

# checkpoint_cb and early_stopping_cb are the usual ModelCheckpoint and
# EarlyStopping callbacks, defined elsewhere in the notebook.
history = model.fit(train_generator,
                    validation_data=valid_generator,
                    epochs=100,
                    callbacks=[checkpoint_cb, early_stopping_cb])
...
loss: 0.1105 - accuracy: 0.9675 - val_loss: 0.1015 - val_accuracy: 0.9721

From the early-stopping result, we can see that the first model, acting as the baseline, already performs very well. Next, we want to improve the model by tuning its complexity while also adding regularization to avoid overfitting on the data.

When tuning the complexity of a model, such as adjusting the number of layers or neurons in the architecture, a few ideas help guide the search toward convergence (the shape arithmetic sketched after this list illustrates the second point):

  • The deeper the convolutional network, the more detailed the features it picks up, increasing the number of features extracted
  • Pooling reduces the spatial size of the feature maps, making it affordable to use more channels
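To make the second point concrete, here is the rough activation-size arithmetic for the tuned model described below, assuming a 64x64x3 input (the actual IMG_SHAPE is defined in the notebook):

# Rough activation sizes per conv block; each 2x2 max pooling halves
# the height and width, which is what makes wider channel counts affordable.
h = w = 64
for channels in (64, 128, 256):
    print(f"conv block output: {h}x{w}x{channels} = {h * w * channels:,} activations")
    h, w = h // 2, w // 2  # effect of the following 2x2 max pooling
print(f"flattened size fed to the dense head: {h * w * 256:,}")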

Other than addressing model complexity, it is also a good idea to apply batch normalization and Monte Carlo Dropout to our use case. Batch normalization normalizes the contribution of each neuron during training, while dropout forces different neurons to learn varied features rather than letting each neuron specialize in a single one. We use Monte Carlo Dropout, which stays active not only during training but also during validation, as it tends to improve the performance of convolutional networks more than regular dropout.

The final tuned model's architecture is shown in the code snippet below:

class MCDropout(tf.keras.layers.Dropout):
    """Dropout that stays active at inference time (Monte Carlo Dropout)."""
    def call(self, inputs):
        return super().call(inputs, training=True)


tf.keras.backend.clear_session()  # reset state before building the tuned model

# Tuned CNN: wider conv blocks with batch normalization and MC Dropout,
# plus a larger dense head regularized with regular dropout.
model = Sequential()
model.add(Conv2D(64, (5, 5), padding='same',
                 input_shape=IMG_SHAPE))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MCDropout(.25))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(128, (3, 3), padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MCDropout(.25))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Conv2D(256, (3, 3), padding='same'))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(MCDropout(.25))
model.add(MaxPooling2D(pool_size=(2, 2)))
model.add(Flatten())
model.add(Dense(512))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(.25))
model.add(Dense(256))
model.add(BatchNormalization())
model.add(Activation('relu'))
model.add(Dropout(.25))
model.add(Dense(26))

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])

history = model.fit(train_generator,
                    validation_data=valid_generator,
                    epochs=100,
                    callbacks=[checkpoint_cb, early_stopping_cb])
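Because MCDropout stays active outside of training, every forward pass of the tuned model is stochastic. A common way to use this at prediction time, shown here as a sketch rather than necessarily what the notebook does, is to average the class probabilities over several passes:

import numpy as np

# x_batch is an illustrative batch of preprocessed images.
n_passes = 20
probas = np.stack([
    tf.nn.softmax(model(x_batch, training=False), axis=-1).numpy()
    for _ in range(n_passes)
])                                          # shape: (n_passes, batch_size, 26)
mean_probas = probas.mean(axis=0)           # average over the stochastic passes
predicted_classes = mean_probas.argmax(axis=-1)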

Training Result

...
loss: 0.0495 - accuracy: 0.9883 - val_loss: 0.0150 - val_accuracy: 0.9973

From the result, we can see that the early-stopped tuned model performs better than the early-stopped baseline. There is still some possibility of overfitting, which we will try to assess from the training history data.

Tuned model training history: training and validation accuracy (left) and training and validation loss (right)

From the training history plots, the tuned model appears to generalize well on both the training set and the validation set without overfitting, which we will next investigate on the test set.
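For reference, curves like these can be plotted directly from the History object returned by model.fit; a minimal sketch:

import matplotlib.pyplot as plt

# history is the object returned by model.fit.
fig, (ax_acc, ax_loss) = plt.subplots(1, 2, figsize=(12, 4))
ax_acc.plot(history.history['accuracy'], label='training accuracy')
ax_acc.plot(history.history['val_accuracy'], label='validation accuracy')
ax_acc.set_xlabel('epoch')
ax_acc.legend()
ax_loss.plot(history.history['loss'], label='training loss')
ax_loss.plot(history.history['val_loss'], label='validation loss')
ax_loss.set_xlabel('epoch')
ax_loss.legend()
plt.show()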

Test Result

Tuned model predictions on the test set (pandas index column omitted):

Image Id     Predicted
A_test.jpg   J
B_test.jpg   B
C_test.jpg   C
D_test.jpg   D
E_test.jpg   E
F_test.jpg   F
G_test.jpg   G
H_test.jpg   H
I_test.jpg   I
J_test.jpg   J
K_test.jpg   K
L_test.jpg   L
M_test.jpg   M
N_test.jpg   N
O_test.jpg   O
P_test.jpg   P
Q_test.jpg   Q
R_test.jpg   R
S_test.jpg   S
T_test.jpg   T
U_test.jpg   U
V_test.jpg   V
W_test.jpg   W
X_test.jpg   X
Y_test.jpg   Y
Z_test.jpg   Z

On the test set, the tuned model is off by 1 image out of 26 compared to the human baseline. A quick investigation shows that the test images have extreme differences in lighting compared to the training data. This highlights one of the weaknesses of convolutional networks in dynamic environments, unlike contextual (attention-based) models. A possible solution is image pre-processing, for example equalizing the histogram of pixel intensities, or using a contextual model that can attend to specific points of interest.
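As an illustration of the pre-processing idea, the luminance histogram can be equalized before an image is fed to the network. The snippet below uses OpenCV, which is not part of the original pipeline, so treat it as a sketch:

import cv2

def equalize_lighting(bgr_image):
    """Equalize the luminance histogram while leaving color information intact."""
    ycrcb = cv2.cvtColor(bgr_image, cv2.COLOR_BGR2YCrCb)
    ycrcb[:, :, 0] = cv2.equalizeHist(ycrcb[:, :, 0])  # equalize only the Y (luminance) channel
    return cv2.cvtColor(ycrcb, cv2.COLOR_YCrCb2BGR)

# Example: normalize the lighting of a test image before prediction.
test_img = equalize_lighting(cv2.imread('A_test.jpg'))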

Other than the difference in lighting, we also found that the dataset used for training contains very little variation between images. As a result, our convolutional model overfits easily, which explains the very high training and validation scores but the lower score in testing.

Conclusion

We have trained a convolutional model on an ASL hand-sign dataset. We found that data variation is critical to the outcome of the model, especially for a convolutional network. We tried data augmentation to increase the variation of the data and found several directions worth exploring along the way, such as:

  • More dataset variation
  • Image preprocessing
  • Contextual models capable of attending to points of interest

We hope this article emphasizes the importance of good data quality and the problems that arise from poor data quality in convolutional models, which can lead to misleading results.

Acknowledgement

The full notebook can be accessed here.

