How to Classify Data Without Markup
iFunny users upload about 1,000,000 pieces of content to the app every day, including not only memes but also racism, violence, pornography, and other inappropriate material.
Previously, we checked all this manually, but now we are developing automatic moderation based on convolutional neural networks. We have already trained the system to divide content into three classes: it recognizes what can be included in user feeds, what needs to be removed, and what should be hidden from the shared feed. To make the algorithms more accurate, we decided to specify more precisely why content is removed, even though the data had no such markup (labels) before.
I will show you how we did it with the help of an illustrative example. This post is aimed at people familiar with Python (not necessarily familiar with Data Science and Machine Learning).
Classification without markup
Task: To implement object classification.
Initial data and conditions: A lot of data without markup or any details.
Solution: To begin with, we will upload the data and conduct the initial analysis:
We have a dataset of the following size: (1,797, 64). This is a relatively small dataset, with fewer than 2,000 objects; however, it may be enough if the sample is representative and reflects the characteristics of the entire set under consideration. In this case, each object has 64 features, and if they were all binary (with values of 0 or 1), we would need 2⁶⁴ examples to cover all possible combinations. The required sample size is even larger for features that take one of three or more possible values. In real life, however, only a few features carry essential information about the object, and they take far fewer values than the permissible set allows.
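The arithmetic behind this estimate is easy to check. Below is a small sketch that only uses the shape reported above; the variable names are ours and purely illustrative.

```python
# Shape of the dataset reported above.
n_samples, n_features = 1797, 64

# Number of distinct objects if every feature were binary or ternary.
binary_combinations = 2 ** n_features
ternary_combinations = 3 ** n_features

print(f"binary features:  {binary_combinations:,} combinations")
print(f"ternary features: {ternary_combinations:,} combinations")

# Best-case share of the binary feature space that 1,797 objects could cover.
coverage = n_samples / binary_combinations
print(f"coverage: {coverage:.2e}")
```

Even in the binary case, the sample could cover only a vanishing fraction of all combinations, which is why representativeness matters more than raw size.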
To begin with, let’s display a few lines from the set on the screen:
Sometimes it is helpful to look at raw data without any aggregation. In this example, we can see that the array is stored in float format, yet no element has a fractional part, as if they were all integers.
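This observation can be verified in code rather than by eye. The sketch below uses a small random array as a hypothetical stand-in for the dataset (float storage, integer-valued content); `is_integer_valued` is a helper name we made up for illustration.

```python
import numpy as np

def is_integer_valued(arr):
    """Return True when no element of the array has a fractional part."""
    return bool(np.all(arr == np.round(arr)))

# Hypothetical stand-in for the real data: integers stored as floats.
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(5, 8)).astype(np.float64)

print(X.dtype)                    # storage format of the array
print(is_integer_valued(X))       # content check
print(is_integer_valued(X + 0.5)) # counterexample with fractional parts
```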
Before processing any data, you should look at the statistics on different features (columns). Let’s take a look at a few arbitrary columns. We will take columns 30 through 35 and display their statistics using the pandas library.
The “describe” method lets you view a set of the most commonly used statistics, shown in the table below. The values of the features are grouped around zero, as indicated by their average. There are also features that are zero for all objects in the sample, so they are uninformative and can be excluded from further analysis.
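A minimal sketch of this step is shown below. It runs on a random stand-in array rather than the real data, treats the column numbers as zero-based indices (an assumption), and plants one constant column on purpose so that the check has something to find.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the dataset, with one deliberately constant column.
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(100, 64)).astype(float)
X[:, 32] = 0.0

df = pd.DataFrame(X)

# Summary statistics for columns 30 through 35 (the slice end is exclusive).
stats = df.iloc[:, 30:36].describe()
print(stats)

# A column whose minimum equals its maximum is constant and uninformative.
constant_cols = [c for c in stats.columns if stats.loc["min", c] == stats.loc["max", c]]
print("constant columns:", constant_cols)
```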
There are many methods for analyzing data, and a lot of them rely on graphical representation. Data Scientists like to use pairwise correlation graphs. They let you detect relationships between features, which can lead to a reduction of the feature space. They can also reveal a correlation between a feature and the target (the desired value), but we have no markup, so this scenario is not feasible for us.
In our case, we can only see that all the features take integer values. The absence of pairwise correlations does not rule out a relationship among a large number of features at once. But such structure is impossible to see directly, since we have a 64-dimensional feature space. Even if there are areas where objects are grouped, it will be extremely difficult or impossible to detect this by any graphical method.
In such a situation, we need to reduce the dimensionality of the feature space and display it in two or three dimensions, which we can actually perceive.
First, let’s get rid of constant features. We have already found features with a value of 0 for all objects, so we can safely remove them from the entire sample. Our goal is to separate objects, and a feature that is identical for every object carries no information for telling them apart.
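Dropping constant features takes a couple of lines. The sketch below again uses a hypothetical stand-in array with two planted constant columns; the mask name `keep` is ours.

```python
import numpy as np

# Hypothetical stand-in for the dataset with two constant (all-zero) columns.
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(100, 64)).astype(float)
X[:, [0, 32]] = 0.0

# A feature is constant when its minimum equals its maximum.
keep = X.min(axis=0) != X.max(axis=0)
X_reduced = X[:, keep]

print(X.shape, "->", X_reduced.shape)
```

The boolean mask is worth saving: any new data has to go through the same column selection before it reaches the later steps.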
There are many ways to reduce the dimensionality of the feature space while keeping it informative. In this post we use the UMAP algorithm, since we already use it in our own tasks. One of its advantages over other nonlinear dimensionality reduction algorithms is that the model can be trained on one dataset and later applied to new data with the same transformation.
To do this, we use a ready-made library. The most important parameter here is the number of components you want in the output (that is, the dimension to which the current feature space should be compressed). We choose two because a 2D plane can be displayed in a diagram:
Next, we train the model using the “fit” method. We don’t have a lot of data, so we train the model on the entire set, although, as mentioned earlier, the training set may be smaller than the full dataset:
Then, we convert all the data:
As a result, we get a reduced dimensionality, where the number of samples is the same, but there are only two features: (1797, 2).
Here is a brief explanation of how this works: UMAP builds a weighted graph by connecting each point to its nearest neighbors in the n-dimensional space, then creates another graph in the low-dimensional space and brings it closer to the original one so that the relative positions of the objects are preserved. Objects that are close stay close to each other, and distant objects stay far apart, all in reduced dimensionality.
Let’s plot the resulting 2D vectors:
The graph shows ten large groups of points and several smaller ones. Next, we will perform clustering, that is, we will break the space into areas based on a parameter or rule.
Let’s use the k-means algorithm (KMeans), which minimizes the total squared deviation of cluster points from the centers of their clusters.
Let’s set a search for ten clusters (there are ten clusters in the previous graph), do the training and prediction for the final classes:
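A minimal sketch of this step with scikit-learn's `KMeans` is given below. Since the snippet has to run on its own, it clusters ten synthetic 2D groups from `make_blobs` as a hypothetical stand-in for the real embedding; `n_init` and `random_state` are our choices. It also recomputes the total squared deviation by hand to show what the algorithm minimizes.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical stand-in for the 2D embedding: ten synthetic groups of points.
points, _ = make_blobs(n_samples=500, centers=10, n_features=2, random_state=0)

# Search for ten clusters; fit_predict trains the model and returns cluster numbers.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0)
clusters = kmeans.fit_predict(points)

# Total squared deviation of the points from the centers of their own clusters.
assigned_centers = kmeans.cluster_centers_[clusters]
manual_inertia = float(((points - assigned_centers) ** 2).sum())

print(manual_inertia, kmeans.inertia_)
```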
Let’s color the plot by cluster; the algorithm separated the groups very well:
The resulting cluster numbers can be treated as the classes of the unmarked sample. To classify new data, you apply the pre-trained UMAP and KMeans models to it in sequence and get the cluster number for each object.
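The two-stage prediction can be wrapped in a small helper. In the sketch below, `predict_cluster` is a hypothetical function of ours, the data is random, and `PCA` is swapped in as a lightweight stand-in for UMAP so that the snippet needs only scikit-learn; a fitted `umap.UMAP` object exposes the same `transform` method and could be passed in its place.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def predict_cluster(reducer, clusterer, X_new):
    """Hypothetical helper: project new objects, then assign cluster numbers."""
    return clusterer.predict(reducer.transform(X_new))

# Random stand-in data: a training set and a few new objects.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(300, 64))
X_new = rng.normal(size=(5, 64))

# Stage 1: dimensionality reduction (PCA here, as a stand-in for UMAP).
reducer = PCA(n_components=2).fit(X_train)

# Stage 2: clustering in the reduced space.
clusterer = KMeans(n_clusters=10, n_init=10, random_state=0)
clusterer.fit(reducer.transform(X_train))

# New objects pass through both pre-trained stages in the same order.
new_classes = predict_cluster(reducer, clusterer, X_new)
print(new_classes)
```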
And now I’ll reveal a little secret: it wasn’t just data.
In our training example, the data are 8×8 pixel images of handwritten digits. If the intensity values of all pixels are written left to right and top to bottom in one line, we get a vector of length 64, precisely the one we have worked with so far. Pixel intensity is an integer quantity (a uint8 image, for example, stores values from 0 to 255; this particular dataset uses a narrower range, from 0 to 16), which means that our observations at the very beginning were correct.
In total, the dataset contains digits from 0 to 9. That is, it has just ten classes (we managed to define the same number of clusters):
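For readers who want to look at the data themselves, here is a short sketch. It assumes the dataset is the handwritten digits set that scikit-learn ships as `load_digits`, which matches the (1797, 64) shape and the 8×8 images described above.

```python
import numpy as np
from sklearn.datasets import load_digits

# Assumption: the dataset in this post is scikit-learn's digits set.
digits = load_digits()
X, y = digits.data, digits.target

print(X.shape)       # objects x features
print(np.unique(y))  # the classes present in the markup

# Each 64-element row is an 8x8 image written out row by row.
first_image = X[0].reshape(8, 8)
print(np.array_equal(first_image, digits.images[0]))

# Range of pixel intensities in this dataset.
print(X.min(), X.max())
```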
Now we have the actual classes, and we know which row corresponds to which label. If we depict the true distribution of classes in the reduced space using the transformation we found, we get the following:
The figure above shows that, in most cases, the only difference is in the colors that stand for the cluster numbers. The k-means method numbered the clusters arbitrarily, so cluster 0 did not necessarily contain images of zeros. If you change the numbering, you will see how many examples were classified correctly.
Many metrics describe how good a method is with a single number. The most well-known one is accuracy, the ratio of correct responses to all examples in a test set. This approach has a significant drawback: it does not specify what exactly the errors are. This and other integral metrics are especially inconvenient for multiclass classification, where a single number cannot show which classes are confused with each other.
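A toy illustration of this drawback, in plain Python: the two made-up prediction lists below make different mistakes, yet their accuracy is identical. The labels and the `accuracy` helper are invented for the example.

```python
def accuracy(y_true, y_pred):
    """Share of examples whose predicted class equals the actual class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

y_true = [1, 1, 1, 6, 6, 6, 7, 7, 7]

# Model A confuses a one with a six and a seven with a one.
pred_a = [1, 1, 6, 6, 6, 6, 7, 7, 1]
# Model B confuses a one with a seven and a seven with a six.
pred_b = [1, 1, 7, 6, 6, 6, 7, 7, 6]

print(accuracy(y_true, pred_a))
print(accuracy(y_true, pred_b))
```

Both models score the same, so the number alone cannot tell us which classes get mixed up.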
This is precisely the situation we are in now, so we should use an error matrix (also known as a confusion matrix). To build it, we will use the pycm library:
In this code, y_pred contains the renumbered cluster values that we found earlier: each cluster number was replaced with the most common actual class among its objects. The resulting error matrix is shown below:
- The classes predicted by our method run horizontally.
- The actual classes run vertically.
- Each cell at an intersection shows the number of objects with that actual class and that predicted class.
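To make the renumbering and the layout of the matrix concrete, here is a plain-Python sketch on a tiny made-up example. It is a dependency-free stand-in written for illustration, not the pycm call used in the post; the helper names and the toy labels are ours.

```python
from collections import Counter, defaultdict

def renumber_clusters(y_true, clusters):
    """Replace each cluster number with the most common actual class inside it."""
    members = defaultdict(list)
    for label, cluster in zip(y_true, clusters):
        members[cluster].append(label)
    mapping = {c: Counter(labels).most_common(1)[0][0] for c, labels in members.items()}
    return [mapping[c] for c in clusters]

def error_matrix(y_true, y_pred, classes):
    """Rows are actual classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix

# Toy data: five ones and three sixes; the last one fell into the "six" cluster.
y_true = [1, 1, 1, 1, 6, 6, 6, 1]
clusters = [3, 3, 3, 3, 0, 0, 0, 0]  # arbitrary numbers assigned by clustering

y_pred = renumber_clusters(y_true, clusters)
matrix = error_matrix(y_true, y_pred, classes=[1, 6])

print(y_pred)
for actual, row in zip([1, 6], matrix):
    print(actual, row)
```

The off-diagonal cell in the first row counts the ones that were labeled as sixes, which is exactly the kind of detail a single accuracy number hides.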
Twenty-seven samples from the true class of ones were defined as sixes for some reason. Let’s see why this happened and look at the images from the dataset.
At first glance, these objects do not look like sixes. But if we go back to the actual class markup, we see a small group of ones that is very far from the rest of them and is closer to the sixes.
And the real sixes are really similar to ones, so in this case, the problems are not related to our model but to those who have such handwriting:
Similarly, we will ensure that risky content, such as landscapes, guns, and girls in swimsuits, is not available to all users but only to those who do not mind such content. However, instead of pixel values, as was the case in our example, specific patterns are considered.
These patterns are defined by a neural network that was pretrained on a large dataset. However, in its original state, this neural network does not suit us for solving the main task of removing unwanted content because it does not know our three classes:
- Approved: images are placed in the collective section of the application;
- Not suitable: images are not displayed in the general feed but remain in the user’s feed (girls in swimsuits and men in swimming trunks, selfies, and anything that is not memes);
- Risked: images are banned and no longer available to any iFunny users (racism, pornography, dismemberment, and anything that falls under the definition of “illegal content”).
We had to retrain the network further on these classes. But we will talk about this in detail in the next post.
Author: Yaroslav Murzaev, Data Scientist at FUNCORP