Lookism in TikTok

Enryu
Sep 26, 2022


Intro

TikTok is a platform that has gained enormous popularity over the last several years, with some reports indicating that it has more than 1 billion users and close to 1 billion content creators (although few of them upload much). It has faced a lot of criticism over the years, including privacy concerns and accusations of promoting addiction (e.g. [1], [2]).

However, what surprised me when I first looked into it was not the format or the content, but the people in the videos recommended by the app. The faces I saw there seemed very different from what I see in everyday life, skewed in the direction of, how can I put it, “conventional attractiveness”. This was very unusual and significantly different from other video hosting platforms; I never experienced it while watching YouTube, for example.

I thought it could be just my internal biases and the way I personally perceive the system, so I decided to analyze the data and see whether there is actual evidence to support my impression. This post is an attempt at a more rigorous analysis, verifying the hypothesis that TikTok actively (implicitly or explicitly) promotes lookism.

Note that this write-up is aimed at people who understand data science/machine learning, and contains many technical details. If you are not interested in those, you can skip to the “Conclusion” section.

Collecting the data

Our goal is to verify whether the face of a content creator is predictive of their popularity on the platform, and if it is, to what extent. Surprisingly, there are no open, ready-to-use datasets for this, so we'll make our own. We are going to collect open data from TikTok, so there are no privacy issues: we just aggregate information which is already available to everyone.

We take the latest snapshot of Common Crawl index, and collect all TikTok accounts mentioned in it (around 16000). Of course this is biased towards active accounts, which we should keep in mind for later.

Next, we scrape these accounts. This is quite a tedious and bandwidth-intensive step. There are some tools for TikTok scraping (e.g. TikTok-Api); however, as of now, they haven't been updated for a while and many functions don't work, so some manual work is necessary.

Our goal is to obtain the faces of channel creators, and unfortunately profile pics aren't usable for this purpose (they are low-resolution and often contain something other than a face). So we have to scrape the videos. We take at most 16 random videos per channel, yielding around 1200 GB of data. Additionally, we save channel metadata (in particular, the subscriber count, which is going to be our proxy for popularity). We also save the metadata of all videos on the channel, which will be useful later for analysis (in particular, for determining channel language, age and topic).

For obvious reasons, TikTok is not happy with large-scale scrapes, so there are many obstacles along the way, the most annoying being the daily captcha and occasional IP address blacklisting. There are many interesting technical challenges in the scraping process (e.g. reverse-engineering the weird (mis-)use of AES in TikTok request headers), but discussing them is beyond the scope of this post. After some days of scraping, we finally have the data and can start the analysis.

Note: we don’t publish the raw dataset, since it is not easy to host this amount of data. However, we do publish smaller derived datasets — see below.

Processing the data

As mentioned above, for our labels, we take the number of subscribers of the channel (to be more precise — log(1 + Subscribers)). However, how do we get the faces?

The idea is to cut all of the videos into frames (we take a frame every 1 second), extract all of the faces from each frame, cluster them, and pick the largest cluster as the face of the channel owner.

Extracting faces is quite easy — we can use any of the numerous existing face recognition/extraction libraries, e.g. face_recognition Python package. As a result, we get a bunch of faces occurring in the channel videos, e.g.:

Next, we want to find the actual face of the channel owner. For this, we can use any reasonable face embedding (e.g. the one coming from the same face_recognition library). We pick the most “central” face, the one minimizing the sum of distances from its embedding to the embeddings of all other faces, and sort all faces by their distance from it. Usually there is a very clear separation between the cluster with the “main” face and the rest of the faces, with the “main” cluster being bigger than everything else combined (if that were not the case, this simplistic approach wouldn't work, and we'd need to run proper clustering and pick the largest cluster).
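For illustration, a minimal sketch of this selection step could look as follows (assuming frames have already been extracted from the videos; the 0.6 distance cutoff is a made-up value for the sketch, not the one used in the actual pipeline):

```python
import numpy as np
import face_recognition

def main_faces(frame_paths, keep_threshold=0.6):
    """Pick the 'main' face cluster for one channel from its video frames."""
    encodings, crops = [], []
    for path in frame_paths:
        image = face_recognition.load_image_file(path)
        locations = face_recognition.face_locations(image)
        for (top, right, bottom, left), enc in zip(
                locations, face_recognition.face_encodings(image, locations)):
            crops.append(image[top:bottom, left:right])
            encodings.append(enc)
    if not encodings:
        return []  # channels without any faces are discarded later

    enc = np.stack(encodings)
    # Pairwise L2 distances between all face embeddings.
    dists = np.linalg.norm(enc[:, None, :] - enc[None, :, :], axis=-1)
    # The most "central" face minimizes the sum of distances to all others.
    central = dists.sum(axis=1).argmin()
    # Keep faces close to the central one: the presumed channel owner.
    keep = dists[central] < keep_threshold
    return [crop for crop, k in zip(crops, keep) if k]
```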

Below is the t-SNE projection of the faces from the example above, with color marking “main” faces vs outliers (interactive link):

t-SNE projection of face embeddings for a particular user; x/y axes don’t have clear semantics, as this is a projection.

After filtering, we see that our algorithm makes sense:

We process all users like this, discarding the ones with no or too few faces (surprisingly, a non-trivial fraction of them have zero faces, in contrast with what we see in the recommendations on the main TikTok page). As a result, we get a dataset of 12947 users with multiple faces per user.

We share the derived dataset here:

  • TFRecord with faces (capped at 32 faces per channel), with training/test split: link
  • Metadata per channel: link

Predictive model: first attempt

Now we can treat our problem as a simple supervised regression task: given the face, predict the label (log(1 + Subscribers)). As the training objective, we use Mean Squared Error; as the metric, Pearson correlation (it is a bit easier to reason about than MSE, which depends a lot on the label distribution). If someone is skeptical of Pearson correlation, they can also look at Spearman correlation, which is even more intuitive, since it is agnostic both to the scale and to the distribution of the labels. We do the usual training/validation split (by users, obviously, not by faces), with a 90%:10% ratio.
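As a quick illustration, both metrics are one-liners with scipy (the arrays below are toy stand-ins, not the real data):

```python
import numpy as np
from scipy import stats

# Toy stand-ins for per-user validation predictions and raw subscriber counts.
rng = np.random.default_rng(0)
subscriber_counts = rng.integers(0, 1_000_000, size=1000)
predictions = rng.normal(size=1000)

labels = np.log1p(subscriber_counts)                # log(1 + Subscribers)
pearson = stats.pearsonr(predictions, labels)[0]    # main metric used in this post
spearman = stats.spearmanr(predictions, labels)[0]  # rank-based: agnostic to scale and label distribution
print(pearson, spearman)
```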

Let’s look at the distribution of the subscriber counts first (before applying the logarithm):

It looks kind of weird: pretty much bimodal. Thinking about it, this is not surprising; there must be some clusters like “English vs other languages”, or topic-based ones (more on this later).

There are some choices to be made with respect to the model (e.g. how many faces we take from each user, what the hyperparameters are, etc.). We pick the simplest possible convolutional model, augmented with the embeddings from the face_recognition library (the augmentation takes the following form: first we pass the image through the convolutional NN to get an embedding; then we concatenate it with the embedding from the library and pass the result through a small MLP). As it turns out, the particular hyperparameters don't matter that much.
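The exact architecture isn't the point, but a model of roughly the described shape could look like this in Keras (the image size and layer widths here are assumptions for the sketch, not the actual hyperparameters):

```python
import tensorflow as tf

IMG_SIZE = 128   # assumed face-crop resolution
EMB_SIZE = 128   # face_recognition embeddings are 128-dimensional

image_in = tf.keras.Input(shape=(IMG_SIZE, IMG_SIZE, 3))
emb_in = tf.keras.Input(shape=(EMB_SIZE,))

# Small convolutional tower over the raw face crop.
x = image_in
for filters in (32, 64, 128):
    x = tf.keras.layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(x)
x = tf.keras.layers.GlobalAveragePooling2D()(x)

# Concatenate with the library embedding and pass through a small MLP.
h = tf.keras.layers.Concatenate()([x, emb_in])
h = tf.keras.layers.Dense(128, activation="relu")(h)
out = tf.keras.layers.Dense(1)(h)   # predicts log(1 + Subscribers)

model = tf.keras.Model([image_in, emb_in], out)
model.compile(optimizer="adam", loss="mse")   # MSE objective, as described above
```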

However, the number of faces per user does affect the resulting performance a lot (and so does using multiple faces at the inference stage, ensembling the result across them). Using multiple faces at the training stage can be considered a form of data augmentation, so this is not very surprising. We achieve a Pearson correlation of 0.33 with 1 face per user, 0.42 with 32 faces per user in training and 1 in validation, and 0.45 with 32 faces per user in both cases (i.e. ensembling in validation).
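Ensembling at validation time simply means averaging the per-face predictions of each user before computing the metric, roughly like this (toy data, hypothetical column names):

```python
import pandas as pd

# Hypothetical per-face validation predictions: one row per (user, face).
per_face = pd.DataFrame({
    "user_id": ["a", "a", "b", "b", "b", "c", "c"],
    "prediction": [7.1, 6.8, 3.2, 3.5, 3.0, 5.0, 5.2],
    "label": [7.0, 7.0, 4.0, 4.0, 4.0, 6.0, 6.0],
})

# Ensemble across faces: average the predictions of all faces of the same user.
per_user = per_face.groupby("user_id").mean()
print(per_user[["prediction", "label"]].corr(method="pearson"))
```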

Note that all of these numbers are actually crazily high! A correlation of 0.45 means that roughly 20% of the variance of the subscriber count is explained by the face of the content creator alone, rather than by “insignificant” things like the quality of the content or personality. It is hard to believe such a high correlation can be organic, without “help” from the TikTok algorithms.

Model prediction (x) vs log-label (y). Note that low predictions pretty much prevent high subscriber counts.

The trained model can be found here, Jupyter notebook — here. Let’s look at the faces in the validation set which the model likes (left), and the ones which it doesn’t (right):

Highest validation set predictions from the model (left), and lowest (right)

Intuitively, it does look like the “conventional beauty” bias is confirmed by the model. Also, looking at the lowest predictions, there appear to be gender and age biases. Let's dig into what exactly is picked up by the model in more detail.

Slice and factor analysis

There are many ways of getting insights into model behavior. Perhaps the simplest one is to decompose the input into a set of easy-to-understand concepts, and slice by them/train simpler models on top of them.

Where to get these concepts from? The famous Celeb-A dataset contains a set of faces annotated with a bunch of concepts, which fits our use-case very well. So we train a separate model on Celeb-A (Jupyter, weights) to predict these attributes, and apply it to the faces in our TikTok dataset.

Now for each user we have a bunch of predictions in the range [0, 1] for attributes like “Smiling”, “Pale Skin” and “Male”. The first thing we can do is compute the correlation with the label for each attribute:
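(As a sketch, with hypothetical columns and made-up values, this is just a column-wise correlation:)

```python
import pandas as pd

# Hypothetical frame: one row per user, CelebA-attribute predictions in [0, 1] plus the label.
df = pd.DataFrame({
    "Smiling": [0.9, 0.1, 0.7, 0.4],
    "Eyeglasses": [0.0, 0.8, 0.1, 0.2],
    "Young": [0.8, 0.3, 0.9, 0.5],
    "log_subscribers": [9.1, 4.2, 8.5, 6.0],
})

attr_cols = [c for c in df.columns if c != "log_subscribers"]
correlations = df[attr_cols].corrwith(df["log_subscribers"]).sort_values()
print(correlations)
```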

The “Bushy_Eyebrows” one is weird — all values are low, and higher values (around 0.2) indicate people with heavy makeup, in particular fake eyebrows. The rest of the high (in absolute value) correlations seem to make sense (takeaway: if you want to become popular on TikTok — wear lots of makeup and no glasses).

Next, we can train a model based on these attributes to predict the label. The most fitting model type is gradient-boosted decision trees, e.g. LightGBM. Just throwing the raw predictions at the model gets us a correlation around 0.41, which is close to the convolutional NN model, but this is not entirely fair: the float predictions contain more information than just the attributes, and the tree model extracts it, producing significantly non-monotonic dependencies on the features. In other words, it picks up superficial correlations baked into the model which produces these concepts, rather than learning from the concepts themselves.

A cleaner way is to “discretize” the features to 0/1 and train a model on top of that (e.g. the aforementioned Bushy_Eyebrows effect disappears with this). An alternative would be to learn simple models (e.g. linear ones), which are monotonic by construction, making it harder to exploit superficial correlations.
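A minimal sketch of the discretized variant (toy data; the 0.5 threshold and the LightGBM parameters are assumptions for the sketch):

```python
import numpy as np
import lightgbm as lgb

# Toy stand-ins: [num_users, num_attributes] float predictions in [0, 1],
# and labels equal to log(1 + Subscribers).
rng = np.random.default_rng(0)
attrs = rng.uniform(size=(1000, 40))
labels = rng.normal(size=1000)

# Discretize to hard 0/1 attributes so the tree model can only use the
# concepts themselves, not residual information in the raw scores.
binary_attrs = (attrs > 0.5).astype(np.float32)

model = lgb.LGBMRegressor(n_estimators=200)
model.fit(binary_attrs, labels)
print(model.feature_importances_)
```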

This loses a lot of information and gives us a model with a correlation of just 0.21; however, now it is learning what we want it to learn, and we can look at the feature importances:

OK, it seems that smiling is important, but now we see that the “Male” feature is very high. This is rather suspicious, let’s take a closer look.

The first thing that stands out is that the prediction quality of the original CNN model on the Male vs non-Male slices differs a lot: Pearson correlation 0.40 vs 0.47! The average label is slightly in favor of non-Male, but the difference in correlations explains what we've seen above: all the lowest-prediction faces were non-Male not because of the averages, but because the model is more confident on that slice, so it makes more extreme predictions (in particular, in the low range).

Next, we can slice the data by the “Male” feature, and look at correlations and feature importances. E.g. correlations:

We see some differences, e.g. being young is more important for males, while wearing heavy makeup — for non-males.

Besides gender, there are many worrisome patterns. E.g. ageism: “Young” is a very important positive feature across the board, while concepts associated with aging (e.g. “Grey Hair”) are negative. Or anti-body-positivity: concepts like “Chubby” or “Double Chin” are negative. Makeup and makeup-associated concepts are perhaps the strongest across the board.

Adding more signals

We have seen the staggering 0.45 correlation with subscriber count from the face-based models alone. However, there are factors which should be orthogonal (or mostly orthogonal) to the face and which should significantly affect this quantity. E.g., the older the channel, the more subscribers it should have; channels in English should have more subscribers than channels in more obscure languages. What happens if we use these signals in the predictive models as well? How good a predictor can we get? Can we check whether the subscriber count is pretty much predetermined (and if it is, it is likely predetermined by TikTok algorithms), or whether there is still a substantial stochastic component?

Let's note an important aspect before we go into more detail. A fair-and-square setup would be to take a snapshot of the data at channel creation time (potentially a long time ago), compute signals from it, and then use the current subscriber count as the label. What we're doing is somewhat different. There is clearly a survivor bias: we took the channels from the recent CommonCrawl snapshot, so the sample is biased towards channels which are active now, meaning that channels which died out are not captured appropriately. Another (similar) effect is the feedback loop: TikTok itself affecting the signals we're measuring.

Language and face are the signals which are the least subject to these effects, so they’re the most trustworthy in this sense (although one can imagine face changing as part of a feedback loop, e.g. adding more makeup). Channel age is less trustworthy because of survivor bias. The least trustworthy one is channel topic (to be precise — the way we model it, see more details below).

Let's start by adding language, as it is the least biased signal. Determining it is quite straightforward: we run the descriptions of all videos on the channel through a language detector and pick the majority. Let's do some sanity checks: the most popular languages (in order) are EN, ES, AR, ID, PT, DE, JA, VI. The order makes sense, with English leading by a large margin. What about the label distribution? Below are the distributions for EN, AR, and the smaller languages combined:
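(As an aside, the detection step itself could look roughly like this; langdetect is used here only as an example, the post does not name a specific detector:)

```python
from collections import Counter
from langdetect import detect

def channel_language(descriptions):
    """Majority-vote language over all video descriptions of a channel."""
    votes = Counter()
    for text in descriptions:
        try:
            votes[detect(text)] += 1
        except Exception:
            continue  # empty / emoji-only descriptions fail detection
    return votes.most_common(1)[0][0] if votes else None

print(channel_language(["hello world", "new dance video!", "hola amigos"]))
```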

We see that with some exceptions, bi-modality is preserved even within languages. What about the mean label per language? It does vary significantly, but in a non-obvious way:

So, let’s feed the language to the model, and co-train a predictor of (face, language) -> label. What do we get? The resulting correlation is 0.47.

The next step is to add channel age. As mentioned above, this feature is not entirely “fair” because of survivor bias; nevertheless, let's try to use it. First of all, how do we determine it? TikTok doesn't appear to show when a user registered, but we do have the timestamps of all videos, so we can take the earliest one to determine the age of the channel. The resulting feature has a 0.5 correlation with the label; part of this is likely due to survivor bias, but a high correlation is expected anyway: the longer the channel exists, the more subscribers it accumulates. Surprisingly, it also correlates with the face-based prediction; this part is probably due to survivor bias (the “old channels have older users” component is unlikely to be strong, but there can be another effect of people with certain faces having been the first TikTok content creators). What happens if we add it to face and language? The resulting correlation is 0.65, which is already very solid (extremely high for something which has to be stochastic).

Age in days (X) vs log-subscribers (Y). Old channels seem to always have many subscribers, which might indicate survivor bias
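(The age feature itself is trivial to derive from the scraped video metadata; roughly, with hypothetical field names:)

```python
import time

def channel_age_days(video_timestamps, now=None):
    """Approximate channel age as days since the earliest scraped video."""
    now = now or time.time()
    return (now - min(video_timestamps)) / 86400.0

print(channel_age_days([1577836800, 1609459200]))  # earliest video on 2020-01-01
```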

Finally, what about the topic of the channel? One can imagine that channels about dancing vs cooking vs makeup all have different average subscriber counts. But how do we determine the topic? We didn't find any ready-to-use models, so we did the following instead: take the video descriptions and hashtags, train word2vec on them, and compute a per-user embedding as the average of the embeddings of all tokens mentioned by the user. One can see that this is pretty shaky: it will likely capture topics, but it can capture many other things as well (in the worst case, labels can leak into the features, if someone decides to write “I have 10M subscribers!” in a description).
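A rough sketch of this embedding step (toy corpus; the vector size and other parameters are assumptions):

```python
import numpy as np
from gensim.models import Word2Vec

# Toy corpus: one token list per video description (hashtags kept as tokens).
corpus = [
    ["new", "dance", "#fyp"],
    ["cooking", "pasta", "#recipe"],
    ["makeup", "tutorial", "#beauty"],
]

w2v = Word2Vec(sentences=corpus, vector_size=64, window=5, min_count=1, epochs=20)

def user_embedding(tokens, model):
    """Average embedding over all tokens the user ever mentioned."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.wv.vector_size)

print(user_embedding(["dance", "#fyp", "makeup"], w2v).shape)
```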

Nevertheless, let's see what happens when we add it. Projections (t-SNE) of the term embeddings can be explored here. After aggregating to the level of the user, we get the following (color coding indicates labels). It seems that the label is rather continuous in this space, but there also appears to be some leakage, so something to try in the future is to use only the earliest textual descriptions from the channel, to reduce it.

What happens if we add it as a feature to the training, in addition to face, language and channel age? The correlation sky-rockets to 0.79, but for the reasons mentioned above, we shouldn't read too much into it. Note that correlations of such magnitude for a predictive model indicate that there is almost no stochastic component left, which would be extremely surprising for something like a subscriber count.

Prediction (X) vs label (Y) for the model with all signals

Bonus: face tiktokifier

For the sake of fun, in addition to the above more theoretical analysis, let’s build a model which “tiktokifies” the given face, increasing the predicted number of subscribers.

There are many ways to do it, perhaps the best one quality-wise might be to use some of the modern image generation architectures, e.g. Diffusion + img2img + classifier guidance. However, we’re going to take a simpler and faster route.

Let's use StyleGAN3 as our face model, and do model inversion, based on this repo. The StyleGAN architecture generates a face from a latent encoding of shape [16, 512], which is supposed to look like random noise. The “inversion” produces a latent encoding given a face (this task is not as trivial as it looks: sure, one can produce some encoding, but it can end up in an out-of-distribution region of the latent space, and the editing magic won't work). It then turns out that “natural” face edits correspond to linear transformations in this space (see this for more details).

There are pre-trained transformations for attributes like “Age” and “Gender” (see the official Colab), as well as all of the CelebA attributes, including “Attractiveness”. We train another linear transformation for maximizing TikTok subscribers, and visually compare it with the CelebA “Attractiveness” transformation.
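The post doesn't spell out how exactly such a transformation is fit; one simple way to obtain a direction like this (purely an illustration and an assumption on my part) is to regress log-subscribers on the flattened latents and take the normalized coefficient vector:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Hypothetical inputs: inverted StyleGAN3 latents of shape [num_users, 16, 512]
# and the corresponding log(1 + Subscribers) labels (toy stand-ins here).
rng = np.random.default_rng(0)
latents = rng.normal(size=(1000, 16, 512))
labels = rng.normal(size=1000)

reg = Ridge(alpha=1.0).fit(latents.reshape(len(latents), -1), labels)
direction = reg.coef_.reshape(16, 512)
direction /= np.linalg.norm(direction)   # unit-norm edit direction in latent space

def edit(latent, alpha):
    """Move a latent along the direction; positive alpha should push the
    predicted subscriber count up, negative alpha down."""
    return latent + alpha * direction
```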

Side note: while this is a fun exercise, it is not very rigorous, e.g. because there are no guarantees the “optimal” transformation will be linear. It is nice to play around with, but for drawing actual conclusions, previous sections are more useful.

So, we encode our dataset into the StyleGAN3 latent space, and use this data to train the linear transformation for increasing the subscriber count. Let's look at how it affects some faces. On the charts below, the first row shows variations (in both positive and negative directions) of the face along our trained vector, and the second row shows variations along the “Attractiveness” vector from CelebA (for comparison). The “no edit” image in the middle is indicated with a black rectangle.

Barack Obama

One can see that the “TikTok” vector doesn’t have the whitening and feminizing effects, unlike the CelebA-Attractiveness one. It does have the effect of adding more makeup though, and reducing the perceived age.

Let’s look at some other examples:

Maryna Viazovska
Ada Lovelace
Yann LeCun
Jürgen Schmidhuber
Timnit Gebru
Greta Thunberg
Kanye West
Meryl Streep

We see that the pattern more or less holds: the TikTok vector doesn't do whitening or feminization, but it does add more makeup, makes people younger and, in some cases, adds more hair. The CelebA-Attractiveness vector is much more biased in this sense.

To play with this model, one can use the aforementioned Colab, loading the following vector in the “Editor” section: link. Again, as a fun exercise, one could build a tool which processes a photo/video on commodity hardware without the need to run custom colabs, but this is outside the scope of this write-up. It might even be helpful to real-life tiktokers who are unlucky enough to have an “un-tiktoky” face but want to gain subscribers. Perhaps one should go down the GFPGAN route when building such a tool.

Conclusion

We have shown that the appearance/face of a content creator is very predictive of their subscriber count on TikTok. What can we conclude from this practically?

One thing to note is that all the models we train are correlational, and it is impossible to determine causation without running randomized A/B tests. However, it is possible to reason even with correlational models: we don't know whether wearing heavy makeup causes your subscriber count to grow, or whether having a high subscriber count causes people to wear heavy makeup, but the presence of the correlation indicates that something is going on.

In particular, the scale of the correlation (0.45 from faces alone, 0.8 (speculatively) from face+language+age+topic) is so high that it is hard to imagine it arising organically. Even if users naturally clicked on such faces more, achieving such numbers without a reinforcement loop is close to impossible. This leads us to the following conclusion: most likely, TikTok recommendation algorithms use faces as a signal, thus promoting lookism (either explicitly, or implicitly, by training on user preferences).

Is it bad to do so? I'd argue that even implicit lookism is bad, because of bias amplification: by basing recommendations on faces, they make the bias even stronger (i.e. in a counterfactual world where faces were not used for prediction, people with less “TikTok-y” faces would get more subscribers). In other words, this not only mirrors the biases of the ecosystem, but takes the worst parts of them (e.g. ageism) and makes them even worse. If there is indeed a feedback loop, this will eventually be amplified to the extreme; we can repeat the analysis in a couple of years and check whether the correlations have become even stronger (I bet they will).

What is the solution? The most obvious thing to do is to make the recommendation models blind to faces. I think this alone would already help a lot, but the consensus in the “AI Fairness” community is that it is not enough, since the information can be extracted from correlated features (although this community doesn't really propose any solutions and mostly criticizes existing ones, so I wouldn't listen to them too much).

A more extreme option would be to stop optimizing for engagement, at least directly. Engagement is good as a topline metric because it correlates well with a dollar sign, but as we know, once we optimize for topline metrics directly, they stop being good metrics (Goodhart's law). Incorporating something along the lines of creator-independent content quality would definitely help, but it would hurt short-term engagement (while it might actually help long-term growth). However, I'm skeptical this approach will ever be adopted; all modern social networks seem laser-focused on engagement optimization, and TikTok is no exception.

Pointers

Code and notebooks

Data

  • TikTok faces dataset: link
  • Channel metadata for the dataset: link
  • CommonCrawl — open web index
  • CelebA — large dataset of annotated faces

Libraries

Author & contacts

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA512

https://medium.com/@enryu9000/lookism-in-tiktok-3def0f20cf78
-----BEGIN PGP SIGNATURE-----

iHUEARYKAB0WIQS6MzboE1mQihro0YYJtk434j9gEgUCYzGIFgAKCRAJtk434j9g
EtRzAQCoNj5K9wQ3sNkQQyzXBDlDKs38+UpE9wL1qs4QmaAWGAEAu6M2ztaYUqLW
r7TG8TqNAhuQSD/2lwtgL1vZEfQPhw8=
=BCeE
-----END PGP SIGNATURE-----

