Identifying potential ‘micro-influencers’ on TikTok with data science techniques

Kirstin Nichols
INST414: Data Science Techniques
6 min read · Mar 11, 2024

TikTok is one of the most popular social media platforms, so engagement on it is a meaningful way to study how people interact online. I’m interested in how behavioral norms spread, especially through online platforms, and I believe it could be valuable to identify which users are most similar in terms of engagement so that we can better test who is effective at spreading behaviors.

Photo by Olivier Bergeron on Unsplash

A smaller start-up company may be interested in finding a micro-influencer: not someone with a massive following, but someone who is very engaged on the platform and, as a result, receives strong engagement from others on their content. It would be interesting to compare a micro-influencer promoting a product with a more mainstream influencer doing the same, and to see which is better at getting people to adopt the product. I think a micro-influencer who is very engaged online could have stronger ties to a TikTok community, rather than the many weak ties we would typically see for a mainstream influencer. My thinking is that a large number of weak ties is most successful at spreading information, while a smaller number of strong ties is most successful at driving adoption of behaviors; in this case, using a product because a strong tie also uses it.

While I don’t have the time or resources to fully complete this experiment, I wanted to complete the first step by identifying which highly engaged users are similar to each other, so that we have an idea of whom we could target. Specifically, I want to find the user in a dataset who follows the most accounts, the user with the most likes, and the user with the most videos posted, and then find the ten users most similar to each of them. These three groups could be used to explore how influence varies among highly engaged TikTok users depending on how they use the platform. I think these three metrics are interesting because following a large number of accounts indicates that a user is receiving a lot of information and is perhaps trying to engage with other users, a high like count shows a high level of interaction from other users, and a high number of videos could indicate an effort to connect with others by sharing lots of content. I was curious whether users who rank high in one of these categories tend to engage in a particular way in the other two. If I were to continue the experiment, I would test these three groups of ten individuals to see which is most effective at initiating others’ adoption of a product.

I began my analysis by looking for TikTok user data on Kaggle. I found this dataset, updated 7 months ago, which has information about TikTok user profiles. The data is in CSV format and includes multiple metrics for analyzing engagement activity on the platform. I was interested in the columns ‘following’ (how many accounts the user follows), ‘likes’ (how many likes the user has on their videos), and ‘videos_count’ (how many videos the user has posted).

I first cleaned the data by creating a data frame with the columns ‘followers’, ‘following’, ‘likes’, and ‘videos_count’, with account_id (the username) as the index.
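As a rough illustration of that cleaning step, here is a minimal sketch with pandas. It is not the exact notebook code: the rows are made up, the ‘nickname’ column is a hypothetical stand-in for the columns I did not use, and dropping rows with missing counts is an assumption about what "cleaning" involves.

```python
import io
import pandas as pd

# Made-up rows standing in for the Kaggle CSV. Only the names 'account_id',
# 'followers', 'following', 'likes' and 'videos_count' come from the post;
# 'nickname' is a hypothetical extra column.
toy_csv = io.StringIO(
    "account_id,nickname,followers,following,likes,videos_count\n"
    "user_a,A,120,300,450,12\n"
    "user_b,B,5000,40,90000,85\n"
    "user_c,C,,10,25,3\n"
)

raw = pd.read_csv(toy_csv)

columns = ["followers", "following", "likes", "videos_count"]
profiles = (
    raw.set_index("account_id")[columns]  # username as index, four metrics kept
    .dropna()                             # assumption: drop rows with missing counts
    .astype("int64")
)
print(profiles)
```

In this toy input, user_c has no follower count, so only user_a and user_b remain in the resulting data frame.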

Next, I applied L1 normalization, which rescales each user’s values so that they sum to 1 and can be compared as proportions across users. The normalized data no longer includes the ‘followers’ column, but I wasn’t as interested in that column because I felt that ‘likes’ was a similar metric. The beginning of my normalized dataset is as follows:

I checked that the normalized columns for each row add up to 1, and this is indeed the case, meaning that normalization was successfully completed.
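For readers who want to see the normalization and the row-sum check together, here is a small sketch on made-up counts. It divides each row by its own total using plain pandas, which is one way to do L1 normalization and not necessarily how the notebook does it; scikit-learn’s `normalize` function with `norm="l1"` gives the same proportions.

```python
import numpy as np
import pandas as pd

# Made-up counts for three hypothetical users
counts = pd.DataFrame(
    {
        "following": [300, 40, 10],
        "likes": [450, 90000, 25],
        "videos_count": [12, 85, 3],
    },
    index=pd.Index(["user_a", "user_b", "user_c"], name="account_id"),
)

# L1 normalization: divide every row by the sum of its absolute values
row_totals = counts.abs().sum(axis=1)
nonzero = row_totals > 0  # rows that are all zeros cannot be normalized
normalized = counts[nonzero].div(row_totals[nonzero], axis=0)

# Sanity check: each normalized row should add up to 1
assert np.allclose(normalized.sum(axis=1), 1.0)
print(normalized.round(3))
```

Each value is now the share of a user’s combined count that comes from that column, so a user with 10 likes and a user with 10,000 likes can have the same ‘likes’ proportion.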

Then, I looked to find the user with the highest normalized ‘following’, the highest normalized ‘likes’, and the highest normalized ‘videos_count’. This way, we could see which users were highest in proportion for each of these values rather than the specific counts themselves, allowing us to better understand the influence of these values.
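Finding the top user per column takes one call in pandas. The sketch below uses made-up proportions for hypothetical users; `idxmax` returns, for each column, the index label of the row holding the largest value (the first one if there is a tie).

```python
import pandas as pd

# Made-up normalized proportions; each row sums to 1
normalized = pd.DataFrame(
    {
        "following": [0.80, 0.01, 0.30],
        "likes": [0.15, 0.98, 0.40],
        "videos_count": [0.05, 0.01, 0.30],
    },
    index=["user_a", "user_b", "user_c"],
)

# For each column, the username with the highest proportion
top_users = normalized.idxmax()
print(top_users)
```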

Next, I found the ten users with the highest cosine similarity to each of these three top-ranked users based on my previously explained conditions.
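The similarity search can be sketched as follows. The helper `most_similar` is a hypothetical name, the data is made up, and the sketch assumes unique usernames, no all-zero rows, and that the target user is left out of their own results; the notebook may handle these details differently.

```python
import numpy as np
import pandas as pd


def most_similar(normalized, target, k=10):
    """Return the k users with the highest cosine similarity to `target`."""
    matrix = normalized.to_numpy(dtype=float)
    target_vec = normalized.loc[target].to_numpy(dtype=float)
    # cosine similarity = dot product divided by the product of the vector lengths
    lengths = np.linalg.norm(matrix, axis=1) * np.linalg.norm(target_vec)
    sims = pd.Series(matrix @ target_vec / lengths, index=normalized.index)
    # drop the target (similarity 1 with itself) and keep the k best matches
    return sims.drop(target).sort_values(ascending=False).head(k)


# Made-up normalized proportions for four hypothetical users
normalized = pd.DataFrame(
    {
        "following": [0.80, 0.01, 0.30, 0.70],
        "likes": [0.15, 0.98, 0.40, 0.20],
        "videos_count": [0.05, 0.01, 0.30, 0.10],
    },
    index=["user_a", "user_b", "user_c", "user_d"],
)

nearest = most_similar(normalized, "user_a", k=2)
print(nearest)
```

Here user_d comes out closest to user_a because both put most of their weight on ‘following’, while user_b, who is dominated by ‘likes’, is the least similar.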

We can examine the normalized counts for each of these groups to find trends. First, we have the users most similar to nbeil10 in terms of cosine similarity; this user had the highest ‘following’ proportion.

We can see that similar users also tend to have a much higher proportion of accounts that they follow than of likes or videos. Perhaps these users don’t have much influence on others, but rather are influenced themselves by the accounts that they follow (in proportion to other engagement aspects).

Next, we can examine the users most similar to diana_aster, who had the highest ‘likes’ proportion.

Similar users had extremely low proportions of accounts followed and videos posted compared to their proportion of ‘likes’. I decided to look up diana_aster in my map of users to see why.

After looking at the un-normalized data for diana_aster, we can see that this user has a high number of followers and an even higher number of likes, but does not follow many people. This explains the normalized results we are seeing. Diana isn’t a micro-influencer after all; it looks like they have a large enough following to be considered a significant influencer. Users similar to Diana in terms of these metrics are also probably more significant influencers, and it would likely cost more to get them to promote a product. It would still be useful to test the effectiveness of an influencer like Diana in promoting adoption of a product versus that of someone with a smaller following but strong levels of engagement.

Next, I looked at users similar to the user with the highest ‘videos_count’ proportion: __a716. The results we get are more well-rounded than our last group.

At first glance, these users seem to be more strongly connected to the TikTok community, receiving (proportionally) a lot of likes, but not so many that the number is out of proportion compared to how many users they follow and how many videos they have.

However, when I looked closer at the data for __a716, I saw that this user only has 4 followers, is following 1 person, has 6 likes, and has 2 videos. These small counts skewed the proportions, making it look like the user was receiving more engagement than they really are.
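The reason this happens is that both L1 normalization and cosine similarity ignore scale. The sketch below uses the three counts reported above for __a716 (following, likes, videos) and compares them with a hypothetical account whose counts are exactly 1,000 times larger; it assumes these are the three columns that were normalized.

```python
import numpy as np


def l1_normalize(values):
    """Scale a vector so that its absolute values sum to 1."""
    values = np.asarray(values, dtype=float)
    return values / np.abs(values).sum()


def cosine(u, v):
    """Cosine similarity between two vectors."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


small = [1, 6, 2]            # following, likes, videos_count for __a716
large = [1000, 6000, 2000]   # hypothetical account, 1,000 times the activity

# Both accounts normalize to the same proportions: 1/9, 6/9 and 2/9
small_norm = l1_normalize(small)
large_norm = l1_normalize(large)
print(small_norm, large_norm)

# ...so their cosine similarity is 1 (up to floating-point rounding)
similarity = cosine(small_norm, large_norm)
print(similarity)
```

An account with 6 likes and an account with 6,000 likes are indistinguishable under these measures, which is why a nearly inactive user can look well-rounded.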

My results suggest that the proportion of likes is the most informative of the three metrics for judging how much engagement a user receives online. They also show that tracking down potential ‘micro-influencers’ is difficult and cannot be done simply by looking at cosine similarities between normalized values. If I had more time, I would set minimum values for ‘likes’ and ‘videos_count’ so that the results would better fit the profile of the ‘micro-influencer’ we are looking for. However, I’m not sure at the moment what those thresholds should be, and I was hoping that normalized values would provide some insight into who is most engaged with the TikTok community. Although we were able to identify who is reaching the biggest audiences, we were not able to identify smaller but influential users this way. To proceed, I would want to do more research on what constitutes an influential ‘micro-influencer’ and speak to start-up company owners to see what they are looking for. From there, I would be able to narrow down the data.
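As a sketch of what that filtering step could look like, the example below removes low-activity accounts before any normalization. The data is made up, and `MIN_LIKES` and `MIN_VIDEOS` are placeholder thresholds I have not validated; the real values would come from the research described above.

```python
import pandas as pd

# Made-up raw counts for four hypothetical users
counts = pd.DataFrame(
    {
        "following": [1, 40, 350, 90],
        "likes": [6, 90000, 4500, 2000],
        "videos_count": [2, 85, 40, 3],
    },
    index=["user_a", "user_b", "user_c", "user_d"],
)

# Placeholder thresholds, not researched values
MIN_LIKES = 1000
MIN_VIDEOS = 10

# Keep only users who clear both minimums
eligible = counts[
    (counts["likes"] >= MIN_LIKES) & (counts["videos_count"] >= MIN_VIDEOS)
]
print(eligible)
```

Only the remaining users would then be normalized and compared, so an account with a handful of likes could no longer surface as a top match.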

Below is a link to a GitHub repository for my code.

https://github.com/kirstinnichols/INST414/blob/028b7b748d40a78ecef7d300743e0f308dc355ad/tiktok3.ipynb
