Finding NFL Helmets (and attempting to find impacts) with Object Detection

David Bartholomew
Feb 20 · 17 min read
Image for post
Photo by Adrian Curiel on Unsplash

The NFL continues its quest to improve player safety, working closely with Amazon Web Services to develop the “digital athlete” in order to virtually analyze, predict and prevent player injuries. Preventing concussions remains at the top of the list of priorities. As most recently reported by the NFL, the 2019–2020 season involved 224 total concussions. As part of the initiative to reduce concussions and related injuries, the NFL created a Kaggle data science competition for finding helmet impacts with computer vision models using a combination of thousands of images, 120 labeled videos with end zone and sideline views, and player tracking data provided by Next Gen Stats, powered by AWS.

The provided data included images and videos with labeled bounding boxes surrounding each helmet with a goal of using the location of the bounding boxes to predict the location of the helmets in unlabeled videos and images. Some of the descriptions of the data are below along with a preview of the labels (modified version from the original data) and a specific frame in the video with labeled bounding boxes.

- gameKey: the ID code for the game.
- playID: the ID code for the play.
- view: the camera orientation (either end zone or sideline view)
- video: the filename of the associated video.
- frame: the frame number for this play.
- labels: whether an impact occurred or not
- [left/width/top/height]: the specification of the bounding box of the prediction. The right/bottom columns were added to be used later in the model.
- impactType: a description of the type of helmet impact: helmet, shoulder, body, ground, etc.

Image for post
Image for post
Labeled bounding boxes. Image provided by the NFL.

Detecting helmet impacts is no easy task. Before detecting impacts, the initial task is detecting helmets. Technically, this is considered the “easier” part of the project; including player tracking information and considering the 3D aspects of impacts would be more difficult. To be upfront, I didn’t use the player tracking information because object detection in and of itself was completely new to me when I began the project, but I certainly aspire to improve on the work I’ve done. Below I’ll share my process of detecting helmets as well as my attempt at detecting impacts.

Preprocessing data

On a high level, detecting objects in an image involves two central tasks, localization and classification, i.e. identifying where objects are located in the image as well as identifying which objects they are (and how many). There are multiple algorithms that can be used to accomplish this, but I chose the RetinaNet model as it is a single-stage object detector that optimizes both speed and accuracy. The majority of the original structure of the model I used is available at this link.

Significant preprocessing was required to create an efficient input data pipeline for the RetinaNet model (without using PyTorch). In retrospect, some of the steps I took may have been unnecessary, but it was part of the learning process of what type of structure was required. I won’t cover every step taken, but here are some of the important ones.

Capturing Video Frames

Rather than using the images provided by the NFL, I used the cv2 module (OpenCV) in Python to capture frames from the video files as arrays, then compressed the images to an in-memory cache and encoded them as JPEG files (this eliminates the need to download the 10,000 images). Below is the code for capturing video frames (images) as arrays using a DataFrame with previously created unique image ids for both end zone and sideline images.

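#Capture images from videos
import cv2
import numpy as np
from tqdm.notebook import tqdm

video_list = []
missing_list = []
#Pair each video filename with its frame number (avoids relying on the DataFrame index)
for i, (video, frame) in enumerate(tqdm(zip(unique_df.video, unique_df.frame), total=len(unique_df))):
    cap = cv2.VideoCapture('train/' + video)
    #Jump to the requested frame before reading
    cap.set(cv2.CAP_PROP_POS_FRAMES, frame)
    success, img_array = cap.read()
    #Release the capture handle once the frame has been read
    cap.release()
    if success:
        video_list.append(img_array)
    else:
        #Record the position of frames that could not be read
        missing_list.append(i)
#Convert the video_list to an array
train_arrays = np.array(video_list)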

Converting Annotations

The original bounding box annotations in the video CSV file were defined as Left, Top, Width, Height (x1, y1, w, h). The width and height needed to be converted to Right, Bottom (x1+w=right, y1+h=bottom) and normalized. To normalize the annotations, the x values were divided by the image width and y values were divided by the image height. Since all images were 1280x720, this was a simple calculation. Additionally, the final order of annotations needed to be y1 (top), x1 (left), y2 (bottom), x2 (right).
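The conversion described above can be sketched as follows. This is a minimal illustration rather than the notebook's actual code: the function name `to_normalized_yxyx` and its arguments are hypothetical, the 1280x720 frame size comes from the text, and the pixel values in the usage line are illustrative (chosen so that they reproduce the bbox values in the sample annotation further down).

```python
def to_normalized_yxyx(left, top, width, height, image_width=1280, image_height=720):
    """Convert a (left, top, width, height) pixel box to normalized [y1, x1, y2, x2]."""
    # Hypothetical helper: derive the right and bottom edges from width and height
    right = left + width
    bottom = top + height
    # Normalize x values by the image width and y values by the image height,
    # then reorder to top, left, bottom, right
    return [
        top / image_height,
        left / image_width,
        bottom / image_height,
        right / image_width,
    ]


# Illustrative pixel values for a single helmet box
box = to_normalized_yxyx(left=832, top=303, width=19, height=22)
print(box)
```

Since every frame has the same size, the same arithmetic could be applied to whole DataFrame columns at once instead of one box at a time.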

Train/Test Split

To properly evaluate the model, the image arrays as well as the annotations needed to be split into training and testing sets. I used sklearn’s train_test_split to do so, with 80% allocated to training and 20% allocated to the test set.

#Train/Test Split
from sklearn.model_selection import train_test_split

X_train, X_test, train_df, test_df = train_test_split(
    training_images, unique_df, test_size=.20, random_state=set_seed)

Skipping some steps in between, the final image count used was 1,081 and the overall annotations count was 21,843.

Converting Data to COCO format

The RetinaNet implementation I used expects the data to be in COCO format, which stands for Common Objects in Context (https://cocodataset.org/#detection-2020). I first converted the data to multiple JSON files in the following format:

Images:
Image filenames and details of the image (height, width, id)

#Example image
{'file_name': '57992301426.png',
'height': 720,
'width': 1280,
'x_train_id': 0,
'id': 57992301426}

Categories:
One category: helmet

[{'id': 0, 'name': 'helmet'}]

Objects:
Bounding box annotations.

{'id': 0,
'category_id': 0,
'iscrowd': 0,
'image_id': 580942819185,
'objects': 17,
'bbox': [0.42083333333333334, 0.65, 0.4513888888888889, 0.66484375]}

Annotations:
Combining all above JSON files as nested dictionaries

#Combine files as nested dictionary
annotations = {"images": train_images, "annotations": train_annotations, "categories": train_categories}
Image sample:
{'file_name': '57992301426.png', 'height': 720, 'width': 1280, 'x_train_id': 0, 'id': 57992301426}
Annotations sample:
{'id': 0, 'category_id': 0, 'iscrowd': 0, 'image_id': 580942819185, 'objects': 17, 'bbox': [0.42083333333333334, 0.65, 0.4513888888888889, 0.66484375]}
Categories sample:
{'id': 0, 'name': 'helmet'}

Using pycocotools API

The pycocotools API is used to assist in parsing, loading, and visualizing annotations in the COCO format.

#Load train annotations file for parsing, loading, visualizing
from pycocotools.coco import COCO

train_annotations_file = 'data/coco/2020/annotations/train/annotations.json'
train_coco = COCO(train_annotations_file)

loading annotations into memory...
Done (t=0.04s)
creating index...
index created!

Creating TFRecords (TensorFlow)

Converting data to TFRecords has multiple advantages. The main one is that TFRecord is TensorFlow’s binary file storage format, which can be read extremely efficiently from disk and distributed across multiple TPUs (Tensor Processing Units) when training, significantly reducing training time. I’ve left out the code since it is somewhat lengthy and a multi-step process, but it is included in the preprocessing notebook in my GitHub repo.

Intersection over Union & Anchor Boxes

Prior to looking at the structure of RetinaNet, it’s important to understand Intersection over Union (IoU) and Anchor Boxes. As described by the open source project, “[a]nchor boxes are fixed sized boxes that the model uses to predict the bounding box for an object. It does this by regressing the offset between the location of the object’s center and the center of an anchor box, and then uses the width and height of the anchor box to predict a relative scale of the object. In the case of RetinaNet, each location on a given feature map has nine anchor boxes (at three scales and three ratios)” (Srihari Humbarwadi, https://keras.io/examples/vision/retinanet/). Intersection over union measures how much an anchor box overlaps a ground truth box (our original annotations). It is calculated as the area where the anchor box intersects the ground truth box divided by the area of their union, i.e. the sum of the two box areas minus the intersection area.
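To make the IoU calculation concrete, here is a minimal sketch in plain Python. It assumes boxes are given as (x1, y1, x2, y2) corner coordinates; the function name `iou` is my own and is not taken from the open source project.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if the boxes overlap at all)
    inter_x1 = max(box_a[0], box_b[0])
    inter_y1 = max(box_a[1], box_b[1])
    inter_x2 = min(box_a[2], box_b[2])
    inter_y2 = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes get an intersection area of 0
    inter_area = max(0.0, inter_x2 - inter_x1) * max(0.0, inter_y2 - inter_y1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])

    # Union = both areas minus the part counted twice
    union_area = area_a + area_b - inter_area
    return inter_area / union_area if union_area > 0 else 0.0


# Two 10x10 boxes offset by 5 pixels in each direction
overlap = iou((0, 0, 10, 10), (5, 5, 15, 15))
print(round(overlap, 4))
```

Here the intersection is 25 square pixels and the union is 100 + 100 - 25 = 175, so the IoU is 25 / 175.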

While it is common to set the IoU threshold to 0.5, I altered the threshold to be less strict at 0.3 with an ignore threshold value at 0.2. This entails the following:

  • An anchor box with an IoU of 0.3 or higher with a ground truth box is matched to that box and treated as a helmet.
  • An anchor box with an IoU less than 0.2 is considered background and the class label predicted is 0.
  • If the anchor box predicts an object with an IoU less than 0.2, it is penalized by the loss function.
  • An anchor box with an IoU between 0.2 and 0.3 is ignored by the loss function entirely.
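The thresholds above can be sketched with NumPy as follows. This is an illustration under assumptions, not the project's label encoder: the function names and the label encoding (1 for helmet, 0 for background, -1 for ignored) are hypothetical, and boxes are assumed to be (x1, y1, x2, y2) corners.

```python
import numpy as np


def pairwise_iou(anchors, gt_boxes):
    """IoU matrix of shape (num_anchors, num_gt) for (x1, y1, x2, y2) boxes."""
    top_left = np.maximum(anchors[:, None, :2], gt_boxes[None, :, :2])
    bottom_right = np.minimum(anchors[:, None, 2:], gt_boxes[None, :, 2:])
    # Negative widths/heights mean no overlap, so clip them to zero
    wh = np.clip(bottom_right - top_left, 0.0, None)
    intersection = wh[..., 0] * wh[..., 1]
    anchor_area = (anchors[:, 2] - anchors[:, 0]) * (anchors[:, 3] - anchors[:, 1])
    gt_area = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    union = anchor_area[:, None] + gt_area[None, :] - intersection
    return intersection / union


def assign_anchor_labels(anchors, gt_boxes, match_iou=0.3, ignore_iou=0.2):
    """Label each anchor: 1 = helmet, 0 = background, -1 = ignored (hypothetical encoding)."""
    # Best overlap of each anchor with any ground truth box
    max_iou = pairwise_iou(anchors, gt_boxes).max(axis=1)
    labels = np.full(len(anchors), -1, dtype=np.int64)  # default: ignored
    labels[max_iou >= match_iou] = 1
    labels[max_iou < ignore_iou] = 0
    return labels


gt_boxes = np.array([[0.0, 0.0, 10.0, 10.0]])
anchors = np.array([
    [0.0, 0.0, 10.0, 10.0],    # IoU 1.0  -> helmet
    [0.0, 0.0, 10.0, 40.0],    # IoU 0.25 -> ignored
    [20.0, 20.0, 30.0, 30.0],  # IoU 0.0  -> background
])
labels = assign_anchor_labels(anchors, gt_boxes)
print(labels)
```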

With the IoU threshold set to 0.3, a statistical summary of bounding box width and height, along with a few distribution plots, gives some intuition for determining the scales, aspect ratios and areas of the anchor boxes.

Image for post

Plotting Width / Height helps determine that the majority of bounding boxes are relatively square (with some clear outliers), so keeping the aspect ratios at 0.5, 1.0, and 1.5 would cover most of the ratios appropriately.

Image for post

Plotting the width * height gives us a general idea of the areas of the bounding boxes. Scales were left the same as the open source project (2^x for x in 0, 1/3, 2/3). The original areas were set to x**2 for x in [32, 64, 128, 256, 512]. Since the helmets in most cases are very small objects, the areas were altered to a range closer to the dataset: x**2 for x in [12, 24, 36, 48, 60].
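As a rough sketch of how these settings translate into anchor sizes, the snippet below computes a (width, height) pair for every combination of area, aspect ratio and scale. It assumes the aspect ratio is width divided by height and that each area belongs to one pyramid level; the function name `anchor_dimensions` is hypothetical, and the real project may organize this differently.

```python
import numpy as np


def anchor_dimensions(areas, aspect_ratios, scales):
    """Return an array of shape (num_areas, num_ratios * num_scales, 2) holding (width, height)."""
    dims = []
    for area in areas:
        per_level = []
        for ratio in aspect_ratios:
            # Solve width * height = area and width / height = ratio
            height = np.sqrt(area / ratio)
            width = area / height
            for scale in scales:
                per_level.append((scale * width, scale * height))
        dims.append(per_level)
    return np.array(dims)


# Values taken from the text above
areas = [x ** 2 for x in [12, 24, 36, 48, 60]]
aspect_ratios = [0.5, 1.0, 1.5]
scales = [2 ** x for x in [0, 1 / 3, 2 / 3]]

anchor_dims = anchor_dimensions(areas, aspect_ratios, scales)
print(anchor_dims.shape)
print(anchor_dims[0])
```

Each of the five areas yields nine anchor shapes (three ratios times three scales), which matches the nine anchor boxes per location mentioned in the quote earlier.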

Model Structure

RetinaNet uses a Feature Pyramid Network (FPN) built on the backbone’s output, with a subnetwork for classification (predicting class labels) and a subnetwork for regression (localization, size and shape of bounding boxes) attached to the FPN’s feature maps. Some details regarding the FPN:

  • Bottom-up pathway: the last feature map of each group of consecutive layers (c3_output, c4_output, c5_output shown below) is extracted from ResNet50, a prebuilt convolutional neural network. One important note is that the original open source project did not freeze the backbone layers; I froze them (backbone.trainable = False) so that the model would not train them as well. Freezing the backbone significantly reduces training time.
def get_backbone():
    """Builds ResNet50 with pre-trained imagenet weights"""
    backbone = keras.applications.ResNet50(include_top=False, input_shape=[None, None, 3])
    #Freeze the backbone so its weights are not updated during training
    backbone.trainable = False
    c3_output, c4_output, c5_output = [
        backbone.get_layer(layer_name).output
        for layer_name in ["conv3_block4_out", "conv4_block6_out", "conv5_block3_out"]]
    return keras.Model(inputs=[backbone.inputs], outputs=[c3_output, c4_output, c5_output])
  • Top-down pathway: using nearest neighbor sampling, the last feature map from the bottom-up pathway is expanded to the same scale as the second-to-last feature map. The two feature maps are then merged by element-wise addition to form a new feature map until each feature map from the bottom-up pathway has a corresponding feature map connected with lateral connections.
  • There are 5 levels in the pyramid (P3 through P7), and each level generates predictions using the classification and regression subnetwork
  • The output for classification is the probability distribution over object classes
  • The output for regression is the offset between anchor boxes and ground truth boxes (4 values for each object)

Higher level feature maps are proficient at detecting larger objects, covering larger areas of an image, while lower level feature maps are more proficient at detecting smaller objects. The original number of filters in the open source project was set to 256. Although increasing the number of filters could potentially improve the accuracy of the model, it would also increase training time, so I left it as is.

When originally training with this structure, I found the model was significantly overfitting. In the end, the model was still slightly overfit to the training data, but adding 2D Spatial Dropout layers at outputs p3 through p6 helped reduce overfitting and reduce the loss for the test set.

def call(self, images, training=False):
    c3_output, c4_output, c5_output = self.backbone(images, training=training)
    p3_output = self.conv_c3_1x1(c3_output)
    p4_output = self.conv_c4_1x1(c4_output)
    p5_output = self.conv_c5_1x1(c5_output)
    p4_output = p4_output + self.upsample_2x(p5_output)
    p3_output = p3_output + self.upsample_2x(p4_output)
    p3_output = keras.layers.SpatialDropout2D(rate=0.4)(p3_output)
    p3_output = self.conv_c3_3x3(p3_output)
    p4_output = self.conv_c4_3x3(p4_output)
    p4_output = keras.layers.SpatialDropout2D(rate=0.4)(p4_output)
    p5_output = self.conv_c5_3x3(p5_output)
    p5_output = keras.layers.SpatialDropout2D(rate=0.4)(p5_output)
    p6_output = self.conv_c6_3x3(c5_output)
    p6_output = keras.layers.SpatialDropout2D(rate=0.4)(p6_output)
    p7_output = self.conv_c7_3x3(tf.nn.relu(p6_output))
    return p3_output, p4_output, p5_output, p6_output, p7_output

Non-Max Suppression

Another important technique utilized in the original open source project is non-max suppression. With multiple anchor boxes, you could potentially have several predict the same ground truth bounding box. Each prediction generates a probability (between 0 and 1) of whether the anchor box contains an object. This probability is also called the confidence score. The parameters and how non-max suppression works are as follows:

Parameters:

  • Number of classes: (in this case, 1)
  • Confidence threshold: A value between 0 and 1. It’s the lowest class probability you want to keep. The confidence threshold can be adjusted later, so keeping this low is ideal (0.05).
  • NMS IoU Threshold: A value between 0 and 1. This value determines the NMS operation, which I’ll explain.
  • Max Detections Per Class: Max detections you want to keep per class.
  • Max Detections: The total number of detections you want to keep. One suggestion is to look at training data and determine the max count of objects in your dataset for a specific image.
  • Box variance: Scaling factors used to scale the bounding box predictions.

The NMS operation discards predictions under the confidence threshold and selects the prediction with highest confidence score. Other predictions above the confidence threshold are compared to the highest confidence score using intersection over union. If the IoU is above the set NMS IoU Threshold, then that prediction is also discarded.
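The operation just described can be sketched in NumPy as a simple greedy loop. This is a simplified single-class illustration, not the TensorFlow implementation used by the project; the function name `nms`, the (x1, y1, x2, y2) box layout and the 0.5 IoU default are assumptions made for the example.

```python
import numpy as np


def nms(boxes, scores, iou_threshold=0.5, confidence_threshold=0.05):
    """Greedy non-max suppression; returns the indices of the boxes that are kept."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)

    # Discard predictions under the confidence threshold, then sort by score (highest first)
    candidates = np.where(scores >= confidence_threshold)[0]
    order = candidates[np.argsort(-scores[candidates])]

    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))

        # IoU between the current best box and every remaining box
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0.0, None) * np.clip(y2 - y1, 0.0, None)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        ious = inter / (area_best + area_rest - inter)

        # Boxes overlapping the best box too much are discarded as duplicates
        order = rest[ious <= iou_threshold]
    return keep


boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30], [0, 0, 10, 10]]
scores = [0.9, 0.8, 0.7, 0.01]
kept = nms(boxes, scores)
print(kept)
```

In this toy example the second box overlaps the first with an IoU of roughly 0.68 and is suppressed, the third box is far away and survives, and the fourth falls below the confidence threshold.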

The problem with this method is that there are multiple overlapping ground truth bounding boxes in the dataset, so selecting an NMS IoU Threshold may eliminate overlapping anchor boxes that actually match ground truth boxes, thus eliminating true positives. Rather than returning the NMS results after decoding predictions, this was altered to return all results (RetinaNet limits results to 1k for each prediction). While this returns more predictions than desired in the end, ultimately, a higher confidence threshold will be used after training the model to filter out unwanted results. The original NMS code has been commented out so the difference is clear.

def call(self, images, predictions):
    image_shape = tf.cast(tf.shape(images), dtype=tf.float32)
    anchor_boxes = self._anchor_box.get_anchors(image_shape[1], image_shape[2])
    box_predictions = predictions[:, :, :4]
    cls_predictions = tf.nn.sigmoid(predictions[:, :, 4:])
    boxes = self._decode_box_predictions(anchor_boxes[None, ...], box_predictions)
    scores = cls_predictions
    # return tf.image.combined_non_max_suppression(
    #     tf.expand_dims(boxes, axis=2),
    #     cls_predictions,
    #     self.max_detections_per_class,
    #     self.max_detections,
    #     self.nms_iou_threshold,
    #     self.confidence_threshold,
    #     clip_boxes=False)
    return boxes, scores

Loss Function

The loss function for RetinaNet is a dual-task loss function that includes a term for localization and a term for classification. A Smooth L1 loss is used for the regression/localization task of matching ground truth boxes to anchor boxes. The regression subnet predicts 4 numbers, the first two numbers being the offset of the centers of ground truth and anchor boxes and the second two numbers being the offset of width and height.

The classification term uses focal loss, which extends the cross entropy loss function with two additional parameters, alpha and gamma, that help address class imbalance. Especially with small object detection, there is a significant imbalance between the background class and the objects being detected. The gamma parameter is used to down-weight the loss of examples that are easy to classify, which forces the network to focus on harder detections. Alpha, on the other hand, is a weighting factor that balances the loss contribution of object examples against examples in the background class.

Both the alpha and gamma parameters were left as the original open source project with alpha = 0.25 and gamma = 2.0. The num_classes parameter was changed to 1 since we are only detecting helmets. The loss functions are added together, combined as a single loss function when training.
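To show what alpha and gamma do numerically, here is a small NumPy sketch of the focal loss for a single class. It takes probabilities rather than logits and is meant only as an illustration of the formula, not as a replacement for the TensorFlow loss class; the function name `focal_loss` is hypothetical.

```python
import numpy as np


def focal_loss(y_true, probs, alpha=0.25, gamma=2.0):
    """Per-example focal loss for binary labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # Clip to avoid log(0)
    probs = np.clip(np.asarray(probs, dtype=float), 1e-7, 1.0 - 1e-7)

    # p_t is the probability assigned to the true class
    p_t = np.where(y_true == 1.0, probs, 1.0 - probs)
    # alpha for object examples, (1 - alpha) for background examples
    alpha_t = np.where(y_true == 1.0, alpha, 1.0 - alpha)

    cross_entropy = -np.log(p_t)
    # (1 - p_t) ** gamma shrinks the loss of confidently correct predictions
    return alpha_t * (1.0 - p_t) ** gamma * cross_entropy


# Two helmet examples: one predicted confidently (easy), one missed (hard)
easy, hard = focal_loss([1, 1], [0.9, 0.1])
print(easy, hard)
```

With gamma = 2.0 the easy example contributes a loss several orders of magnitude smaller than the hard one, whereas plain cross entropy would separate them by only a factor of about 20.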

The regression loss:

class RetinaNetBoxLoss(tf.losses.Loss):
    """Implements Smooth L1 loss"""

    def __init__(self, delta):
        super(RetinaNetBoxLoss, self).__init__(reduction="none", name="RetinaNetBoxLoss")
        self._delta = delta

    def call(self, y_true, y_pred):
        difference = y_true - y_pred
        absolute_difference = tf.abs(difference)
        squared_difference = difference ** 2
        loss = tf.where(
            tf.less(absolute_difference, self._delta),
            0.5 * squared_difference,
            absolute_difference - 0.5,)
        return tf.reduce_sum(loss, axis=-1)

The classification loss:

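class RetinaNetClassificationLoss(tf.losses.Loss):
    """Implements Focal loss"""

    #Constructor storing alpha and gamma (not shown in the original snippet,
    #but required since call() reads self._alpha and self._gamma)
    def __init__(self, alpha, gamma):
        super(RetinaNetClassificationLoss, self).__init__(
            reduction="none", name="RetinaNetClassificationLoss")
        self._alpha = alpha
        self._gamma = gamma

    def call(self, y_true, y_pred):
        cross_entropy = tf.nn.sigmoid_cross_entropy_with_logits(
            labels=y_true, logits=y_pred)
        probs = tf.nn.sigmoid(y_pred)
        alpha = tf.where(tf.equal(y_true, 1.0), self._alpha, (1.0 - self._alpha))
        pt = tf.where(tf.equal(y_true, 1.0), probs, 1 - probs)
        loss = alpha * tf.pow(1.0 - pt, self._gamma) * cross_entropy
        return tf.reduce_sum(loss, axis=-1)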

The wrapper to combine both loss functions:

class RetinaNetLoss(tf.losses.Loss):
    """Wrapper to combine both the losses"""

    def __init__(self, num_classes=1, alpha=0.25, gamma=2.0, delta=1.0):
        super(RetinaNetLoss, self).__init__(reduction="auto", name="RetinaNetLoss")
        self._clf_loss = RetinaNetClassificationLoss(alpha, gamma)
        self._box_loss = RetinaNetBoxLoss(delta)
        self._num_classes = num_classes

    def call(self, y_true, y_pred):
        y_pred = tf.cast(y_pred, dtype=tf.float32)
        box_labels = y_true[:, :, :4]
        box_predictions = y_pred[:, :, :4]
        cls_labels = tf.one_hot(
            tf.cast(y_true[:, :, 4], dtype=tf.int32),
            depth=self._num_classes,
            dtype=tf.float32,)
        cls_predictions = y_pred[:, :, 4:]
        positive_mask = tf.cast(tf.greater(y_true[:, :, 4], -1.0), dtype=tf.float32)
        ignore_mask = tf.cast(tf.equal(y_true[:, :, 4], -2.0), dtype=tf.float32)
        clf_loss = self._clf_loss(cls_labels, cls_predictions)
        box_loss = self._box_loss(box_labels, box_predictions)
        clf_loss = tf.where(tf.equal(ignore_mask, 1.0), 0.0, clf_loss)
        box_loss = tf.where(tf.equal(positive_mask, 1.0), box_loss, 0.0)
        normalizer = tf.reduce_sum(positive_mask, axis=-1)
        clf_loss = tf.math.divide_no_nan(tf.reduce_sum(clf_loss, axis=-1), normalizer)
        box_loss = tf.math.divide_no_nan(tf.reduce_sum(box_loss, axis=-1), normalizer)
        loss = clf_loss + box_loss
        return loss

Model Training

Using TPUs (Tensor Processing Units) is necessary to avoid extremely long training times. This model was run in Google Colab, so here is the list of steps to ensure the input pipeline was efficient:

  • Set the Hardware Accelerator to TPU (Runtime>Change Runtime Type)
  • Initialize the TPU System
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu='grpc://' + os.environ['COLAB_TPU_ADDR'])
tf.config.experimental_connect_to_cluster(resolver)
# This is the TPU initialization code that has to be at the beginning.
tf.tpu.experimental.initialize_tpu_system(resolver)
print("All devices: ", tf.config.list_logical_devices('TPU'))
  • Set the distribution strategy
strategy = tf.distribute.TPUStrategy(resolver)
  • Establish model training parameters. Important note: a batch size that is divisible by 8 takes advantage of distributing training amongst 8 TPU cores. The model directory is set to my Google Cloud bucket path since that is where I am storing data.
#Create output directory for the model weights
model_dir = "gs://nfl-object-detection123/"
#Label encoder object
label_encoder = LabelEncoder()
#Number of classes and batch size for training/preprocessing
num_classes = 1
batch_size = 8
  • Compiling the model. Compiling the model must be included within strategy.scope() to take advantage of training across multiple TPUs.
#Establishing the resnet50_backbone, loss function, and model structure
resnet50_backbone = get_backbone()
loss_fn = RetinaNetLoss(num_classes)
model = RetinaNet(num_classes, resnet50_backbone)
#The Stochastic Gradient Descent optimizer is used with a learning rate of 0.001 and momentum of 0.95
optimizer = tf.optimizers.SGD(learning_rate=0.001, momentum=0.95)
#Compile the model
with strategy.scope():
    model.compile(loss=loss_fn, optimizer=optimizer)
  • Setting up the input pipeline

Setting up the input pipeline involves the following steps as mentioned in the original open source project:

  • Apply the preprocessing function to the samples
  • Create batches with fixed batch size. Since images in the batch can have different dimensions, and can also have a different number of objects, we use padded_batch to add the necessary padding to create rectangular tensors
  • Autotune will automatically determine the appropriate level of parallelism and dynamically tune the value at runtime (preprocessing and encoding labels is a costly operation, so utilizing all cores in parallel is much more efficient)
  • Create targets for each sample in the batch using LabelEncoder
#Preprocess data, autotune defines appropriate number of threads to read the data
autotune = tf.data.experimental.AUTOTUNE
train_dataset = train_dataset.map(preprocess_data, num_parallel_calls=autotune)
#Randomly shuffles the dataset
train_dataset = train_dataset.shuffle(8*batch_size)
# Creates batches with fixed batch size and pads image
train_dataset = train_dataset.padded_batch(batch_size=batch_size, padding_values=(0.0, 1e-8, -1), drop_remainder=True)
#Encode labels
train_dataset = train_dataset.map(label_encoder.encode_batch, num_parallel_calls=autotune)
train_dataset = train_dataset.apply(tf.data.experimental.ignore_errors())
train_dataset = train_dataset.prefetch(autotune)

test_dataset = test_dataset.map(preprocess_data, num_parallel_calls=autotune)
test_dataset = test_dataset.padded_batch(batch_size=batch_size, padding_values=(0.0, 1e-8, -1), drop_remainder=True)
test_dataset = test_dataset.map(label_encoder.encode_batch, num_parallel_calls=autotune)
test_dataset = test_dataset.apply(tf.data.experimental.ignore_errors())
test_dataset = test_dataset.prefetch(autotune)
  • Finally, training the model. I used 25 epochs and established train_steps and test_steps per epoch by dividing the total number of samples in each by the batch size of 8, then using the repeat() function to ensure the model doesn’t run out of data.
#Set number of epochs
epochs = 25
#Establish steps per epoch for train and test sets
#(integer division, since drop_remainder=True discards any partial batch)
train_steps = train_samples // batch_size
test_steps = test_samples // batch_size
#Fit the model
model.fit(
    train_dataset.repeat(),
    validation_data=test_dataset.repeat(),
    epochs=epochs,
    steps_per_epoch=train_steps,
    validation_steps=test_steps,
    callbacks=callbacks_list,
    verbose=1)
Image for post

Implementing Soft Non-Max Suppression

Since I replaced Non-Max Suppression earlier with predicting all possible results, I needed an alternate method to filter undesired results. Using TensorFlow’s tf.image.non_max_suppression_with_scores function, I was able to implement Soft-NMS, which works very similarly to the normal NMS function. Rather than eliminating predictions over a certain IoU threshold, Soft-NMS uses a sigma value (float between 0 and 1) to down-weight the confidence score of detections over the IoU threshold. Using an IoU threshold of 1.0 allowed me to keep all confidence scores as is and select the predictions for each object based on the confidence threshold alone. Ultimately, this eliminates any predictions under the specified confidence score and keeps potential true positives where bounding boxes are intersecting.
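For intuition about score down-weighting, below is a NumPy sketch of the Gaussian variant of Soft-NMS as it is generally described. It is not TensorFlow's internal implementation and does not reproduce the exact settings I used; the function names, the sigma value and the (x1, y1, x2, y2) box layout are assumptions for illustration.

```python
import numpy as np


def box_iou(box_a, box_b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.05):
    """Gaussian Soft-NMS sketch: returns (index, adjusted score) pairs in selection order."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float).copy()
    remaining = list(range(len(scores)))
    kept = []
    while remaining:
        # Select the remaining box with the highest (possibly decayed) score
        best = max(remaining, key=lambda i: scores[i])
        if scores[best] < score_threshold:
            break
        kept.append((best, float(scores[best])))
        remaining.remove(best)
        # Decay, rather than discard, the scores of boxes that overlap the selected one
        for i in remaining:
            overlap = box_iou(boxes[best], boxes[i])
            scores[i] *= np.exp(-(overlap ** 2) / sigma)
    return kept


boxes = [[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]]
scores = [0.9, 0.8, 0.7]
results = soft_nms(boxes, scores)
print(results)
```

Unlike hard NMS, the heavily overlapping second box is still returned here, just with a reduced score, which is the behavior that helps when two helmets genuinely overlap.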

Evaluating Results

Since it’s difficult to evaluate results based on the loss function alone, I built a Pandas DataFrame including the model predictions and then built a residual DataFrame, labeling true positives (any detections above the confidence threshold), false positives (any detections above the confidence threshold but not in the original labeled data), and false negatives (comparing the counts of detections in the original labeled data to the predicted counts).

I compared the results of several confidence thresholds ranging from 0.15 to 0.90, allowing me to check which confidence threshold had the highest Precision, Recall, and F1 Score. I decided to use the highest F1 Score (the harmonic mean of precision and recall) as my determining evaluation metric, since that was also recommended in the Kaggle competition. With a confidence threshold of 0.22, the highest F1 Score was 0.949 for detecting helmets.
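For reference, the three metrics can be computed from the true positive, false positive and false negative counts as sketched below. The function name and the counts in the usage line are hypothetical and are not results from my model.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall and F1 from detection counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    # Guard against division by zero when there are no predictions or no labels
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    denominator = precision + recall
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical counts for a single confidence threshold
precision, recall, f1 = precision_recall_f1(true_positives=80, false_positives=20, false_negatives=80)
print(precision, recall, f1)
```

Running this for each candidate confidence threshold and keeping the one with the highest F1 mirrors the selection process described above.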

Image for post

Although a little convoluted due to the number of objects, you can get an idea of how the model performed on helmet detection using a scatterplot:

Image for post

Here is a visualization of an image with confidence scores:

Image for post

Detecting Impacts

I set a strict filter for impacts using the IoU between bounding box predictions within each image, knowing that helmets with impacts were far less common than helmets without impacts. Any pair of boxes with an IoU between 0.1 and 0.4 was labeled as an impact. Since I was only dealing with 2D data and didn’t attempt to convert to 3D, this wasn’t very accurate, as mentioned before. There were several helmets without an impact that had an IoU between 0.1 and 0.4 with another helmet, and the F1 Score for impact detections was extremely low.
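The filter can be sketched as a pairwise comparison of the predicted boxes in a single frame. This is a simplified illustration rather than the notebook code: the function names are hypothetical and the boxes are assumed to be (x1, y1, x2, y2) corners.

```python
from itertools import combinations


def box_iou(box_a, box_b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def flag_impacts(boxes, low=0.1, high=0.4):
    """Return the indices of boxes whose IoU with another box falls within [low, high]."""
    flagged = set()
    # Compare every pair of predicted helmet boxes in the frame
    for (i, box_a), (j, box_b) in combinations(enumerate(boxes), 2):
        if low <= box_iou(box_a, box_b) <= high:
            flagged.update((i, j))
    return sorted(flagged)


frame_boxes = [
    (0, 0, 10, 10),
    (5, 5, 15, 15),    # partially overlaps the first box
    (50, 50, 60, 60),
    (51, 51, 61, 61),  # overlaps the third box almost entirely
]
impact_candidates = flag_impacts(frame_boxes)
print(impact_candidates)
```

The last two boxes overlap with an IoU of about 0.68, which is above the upper bound and more likely a duplicate detection than an impact, so only the first pair is flagged.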

Predicting Unseen Test Videos

I used the model weights to predict helmet and impact detections on unseen test videos for matching end zone and sideline views. In order to leave open the possibility of evaluating results on these unseen test videos, I set aside 2 videos that were included in the original training set with labels (they were not included when training the model) rather than using the provided test videos without labels. For both views, I used the IoU filter for impacts of 0.1 to 0.4 based on previous EDA.

Here is the scatterplot of helmet detections:

Image for post

The scatterplot for impact detections:

Image for post

Finally, I created a side-by-side video of the end zone and sideline views showing the predicted helmet and impact detections.

Conclusion/Further Work

Although the model did reasonably well with detecting helmets with an F1 Score of .957 for the end zone view and .96 for the sideline view, it is obvious that further work is necessary to improve the detection of helmet impacts. If you’re interested in the notebook with additional details, please check out my GitHub repository. If you fork this notebook, note that the links for writing files to Google Cloud will need to be changed since you will only have reader access. If you want to download the entire dataset, it is best to download directly from Kaggle. Further work can include but is not limited to the following:

  • Combining previous data with player tracking data and potentially using homography to locate players in videos based on tracking key points.
  • Potentially using 3D Convolutional Neural Network models to identify impacts (computationally expensive).
  • Using a Temporal Shift Module (the winner of the Kaggle competition used TSM). Videos have an additional dimension over images, the temporal dimension, which is the shift of objects in images between frames. Using this information, the TSM predicts actions based on a specified range of frames. This is a less computationally-expensive model as it only utilizes 2D CNNs. (With ResNet, this could be added after each convolutional block).

The Startup

Medium's largest active publication, followed by +775K people. Follow to join our community.

David Bartholomew

Written by

Aspiring Data Scientist, recent graduate of Flatiron School’s online data science program.
