Written by Iris Fu and Divyaa Ravichandran, Computer Vision Scientists, on April 19, 2018
GumGum Sports processes enormous amounts of media each day, coming from a variety of sources and forms, including social media posts and broadcast/streaming videos. Our goal is to identify media relevant to our clients in order to estimate the value of their sponsorships and placements.
The challenge is in processing massive volumes of posts in a short period of time. For example, to estimate the value of each brand’s exposure in a basketball game, we would need to consider all of the available data and metadata to identify sports, players, teams, locations of the signage, arenas, and many more such criteria. Many of these can be described as an object detection problem within images. One possible way to solve this problem would be through the use of neural networks trained under supervised learning.
To train an object detection model under supervised learning, annotated training data is necessary. However, for many tasks, annotated data does not exist. To circumvent this issue, semi-supervised and unsupervised learning approaches can identify and aggregate media that are visually similar. This approach only produces groups of similar media; it cannot assign names or labels to those groups. Even so, it is useful in two ways: a) to improve annotation efficiency and b) to identify duplicate posts.
K-means clustering is a popular clustering algorithm that partitions a data set around a predetermined number k of centroids, one per cluster. The initial centroids are chosen randomly or empirically, ideally so as to maximize separation amongst clusters in the data set. Each data point is then compared to every centroid and, based on a nearest-distance metric, is assigned to the most similar one. Each iteration over the data set triggers a recalculation of the centroids, and the process repeats until all data points are sorted into a cluster and the k centroids are fixed.
In equation form, the k-means algorithm minimizes the following objective:

J(V) = \sum_{i=1}^{C} \sum_{j=1}^{c_i} \left( \lVert x_j - v_i \rVert \right)^2

where \lVert x_j - v_i \rVert is the Euclidean distance between data point x_j and cluster centroid v_i, i iterates through the clusters (of which there are a total of C), and j iterates through the c_i members of cluster i. J(V) is the objective function we want to minimize, which is the intra-cluster distance with respect to the cluster centroids.
To recalculate the center of a given cluster, average the data points assigned to that cluster:

v_i = \frac{1}{c_i} \sum_{j=1}^{c_i} x_j
This iterative process of computing distances within each cluster continues until a) the cluster centroids no longer change or b) the change in cluster centroids falls below a pre-set threshold. At that point an optimal clustering has been reached and the intra-cluster distance can no longer be minimized.
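The alternation between the assignment step and the centroid-update step described above can be sketched in a few lines of NumPy. This is a minimal illustration on synthetic 2-D data, not a production implementation (scikit-learn's KMeans would be used in practice):

```python
import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal k-means: alternate nearest-centroid assignment and
    centroid recomputation until the centroids stop moving."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid
        # (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its members
        # (an empty cluster keeps its old centroid).
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        # Stop when the change in centroids falls below the threshold.
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels, centroids = kmeans(X, k=2)
```

On data this well separated, the two recovered centroids settle near the blob centers after only a few iterations.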
K-means clustering is a good starting approach, but selecting k requires some intuition about the dataset. Instead, we turn to an alternative clustering method, the Affinity Propagation algorithm, which does not require the number of clusters to be known ahead of time, yet rests on principles similar to those of k-means. Most importantly, this algorithm is already implemented in scikit-learn, a module available in Python.
The Affinity Propagation algorithm finds representative cluster centers (exemplars) within the dataset itself. It assigns elements to those exemplars based on the similarity of data points to one another, refining the clusters over a number of iterations. Because it is a data-driven way of finding clusters, it does not need an initial cluster count to be specified. A damping factor controls how strongly the messages exchanged between points are updated from one iteration to the next, which keeps the algorithm from oscillating. The similarity metric can in theory be anything that suits the particular application; the negative squared Euclidean distance is, however, the simplest and most popular choice.
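As a small sketch of how this works in scikit-learn, the example below builds the similarity matrix explicitly as the negative squared Euclidean distance and passes it in precomputed form; the three-blob data is synthetic and chosen only so that the number of discovered clusters is easy to check:

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Synthetic data: three separated 2-D groups; note that no cluster
# count is supplied anywhere below.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.2, (30, 2))
               for c in ((0.0, 0.0), (4.0, 4.0), (8.0, 0.0))])

# Similarity = negative squared Euclidean distance between all pairs.
diff = X[:, None, :] - X[None, :, :]
S = -np.sum(diff ** 2, axis=2)

# damping (between 0.5 and 1.0) smooths the message updates across
# iterations; it is a convergence control, not a distance threshold.
ap = AffinityPropagation(affinity="precomputed", damping=0.7,
                         random_state=0).fit(S)
```

`ap.labels_` then holds one label per point, and `ap.cluster_centers_indices_` indexes the exemplars the algorithm selected from the data itself.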
Our task: cluster a set of sports-themed images (obtained from social media posts) using Affinity Propagation.
We have about 100 images each for NBA, NHL and MLB. Many of these are not images of the sport itself, but graphic posters regarding a game, selfies of people at venues, or even press conferences related to the sport (Figure 1).
While using a neural network to classify these images might seem straightforward, in practice the varied nature of the images means a classifier does not perform well. Instead, we try a clustering approach here to group similar kinds of images and make the annotation task easier, based on the premise that all images in a cluster should have similar annotations.
The feature vectors for each image are derived from a scene classifier: a neural network with the GoogLeNet architecture, trained on the Places365 dataset. This gives each image a representation in the domain that enables grouping; each sport is assumed to carry different contextual information that differentiates it from the others.
The penultimate (before softmax) layer of this scene-classification network is a 365-dimensional vector, so n sample images form an n × 365 matrix, which is the dataset for clustering.
For clustering, we use the Affinity Propagation algorithm bundled in Python’s scikit-learn module. The implementation of the clustering is straightforward, and we are finally left with 25 clusters. Some representative clusters are illustrated in Figure 2.
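A sketch of this clustering step is below. The feature matrix here is a synthetic stand-in (three loose groups in 365 dimensions) for the real scene-classifier outputs, and the filenames are hypothetical; the grouping of images by cluster label at the end is what feeds the annotation review:

```python
import numpy as np
from collections import defaultdict
from sklearn.cluster import AffinityPropagation

# Stand-in for the real n x 365 scene-classifier features; in practice
# each row would be the penultimate-layer output for one image.
rng = np.random.default_rng(1)
features = np.vstack([rng.normal(loc=m, scale=0.1, size=(100, 365))
                      for m in (0.0, 0.5, 1.0)])
filenames = [f"img_{i:04d}.jpg" for i in range(len(features))]  # hypothetical

ap = AffinityPropagation(damping=0.9, random_state=0).fit(features)

# Group filenames by cluster label so each cluster can be reviewed
# (and annotated) as a unit.
clusters = defaultdict(list)
for name, label in zip(filenames, ap.labels_):
    clusters[label].append(name)
```

Iterating over `clusters` then presents one batch of visually similar images at a time, which is exactly what makes bulk labelling fast.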
As expected, similar-looking images cluster together. For instance, in Figure 2 the blue cluster contains press conferences, the green cluster groups basketball graphic posters with a predominantly black background, and so on.
It is important to bear in mind that some clusters will have too few samples to form a regular-sized cluster; clusters of 2–3 images are effectively outliers in feature space. What makes this useful in an annotation pipeline is that the more cohesive clusters can be labelled/annotated quickly and painlessly, while more time can be devoted afterwards to annotating the more mixed clusters.
The same approach can be extended to attack another major problem: with the influx of thousands of social media posts, the likelihood of duplicate posts is high. Filtering exact duplicates is a simpler problem, but what about posts that have been slightly altered?
Consider posts such as a retweet of a 1-minute highlight clip trimmed down to a few seconds showcasing a key powerplay, or clips edited to add a different introduction. If these types of posts can be identified quickly, redundant processing can be avoided. Of course, one could spin up more instances on the cloud or request larger and more powerful machines, but the emphasis here is on algorithmic efficiency at minimal resource cost.
The clustering approach exemplified above will also place similar or identical media in the same cluster; a simple examination of each cluster then lets duplicates be deleted, avoiding unnecessary processing overhead. This can also be used to remove shorter videos that are excerpts of longer videos already in the dataset.
Processing both such videos would mean processing the overlapping frames twice; at a scale of millions of media items per day, the compute and cost overhead would be significant. This can simply be avoided by clustering and removing duplicates/near-duplicates.
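One simple way to act on the clusters, sketched below, is to keep a single representative from every group of near-identical feature vectors within each cluster. The `threshold` parameter and the toy features are illustrative assumptions, not values from our pipeline:

```python
import numpy as np

def deduplicate(features, labels, threshold=1e-3):
    """Within each cluster, keep one representative from every group of
    near-identical feature vectors (pairwise distance below `threshold`)."""
    keep = []
    for label in np.unique(labels):
        idx = np.where(labels == label)[0]
        kept = []
        for i in idx:
            # Drop i if it is nearly identical to an item already kept.
            if all(np.linalg.norm(features[i] - features[j]) > threshold
                   for j in kept):
                kept.append(i)
        keep.extend(kept)
    return sorted(keep)

# Toy features where item 3 duplicates item 0 and item 4 duplicates item 1.
feats = np.array([[0.0, 0.0], [5.0, 5.0], [10.0, 0.0],
                  [0.0, 0.0], [5.0, 5.0]])
labels = np.array([0, 1, 2, 0, 1])  # e.g. labels from a clustering pass
print(deduplicate(feats, labels))  # → [0, 1, 2]
```

Only the indices in the returned list would then be sent downstream for full processing; everything else is skipped as a near-duplicate.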