Finding friends via Hashtags

Michael Campbell
INST414: Data Science Techniques
3 min read · Apr 25, 2022
Photo by Joshua Hoehne on Unsplash

Have you ever wondered which Twitter users share your interests and also have friends in common with you? To analyze this, I wanted a reasonably large dataset. After some searching, I found a dataset named Twitter Friends on Kaggle. It has several useful features:

  • id
  • screen name
  • hashtags
  • followers count
  • friends count
  • lang
  • friends

After finding the dataset, the first step was to clean it and load it into a pandas dataframe. The data was a bit messy, so I had to convert the columns to the correct datatypes, which I did using a few functions that other people on Kaggle had written.

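    import json
    import pandas as pd

    # Split on commas that are followed by a non-space character
    df = pd.read_csv("data.csv", sep=r',(?=\S)', engine='python')

    def delete_quotes(x):
        # Strip the surrounding quote characters
        return x[1:-1]

    for column in ["id", "screenName", "avatar", "lang", "tweetId"]:
        df[column] = df[column].apply(delete_quotes)

    # Parse the JSON-encoded list columns into Python lists
    for column in ["tags", "friends"]:
        df[column] = df[column].apply(json.loads)

Code from: Visualization twitter followers and friends | Kaggle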
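To see what the regex separator and the two cleaning passes do, here is a minimal sketch on a single made-up line. The sample line and its values are hypothetical. It only assumes the layout implied by the code above: quoted scalar fields, and JSON-style lists whose items are separated by a comma plus a space.

```python
import json
import re

# Hypothetical raw line; assumes list items are separated by ", " (comma + space).
line = '"42","example_user",[ "tagone", "tagtwo" ],[ "7", "8" ]'

# Split only on commas that are immediately followed by a non-space character,
# so the commas inside the lists are left alone.
fields = re.split(r',(?=\S)', line)

def delete_quotes(x):
    # Drop the first and last character (the surrounding double quotes).
    return x[1:-1]

user_id = delete_quotes(fields[0])
screen_name = delete_quotes(fields[1])
tags = json.loads(fields[2])      # JSON text -> Python list
friends = json.loads(fields[3])   # JSON text -> Python list

print(user_id, screen_name, tags, friends)
```

If the real file separated list items with a bare comma and no space, this pattern would split inside the lists, so it is worth checking a few raw lines first.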

With the data cleaned and loaded into a dataframe, I could start choosing the model users I would compare against the rest of the dataset. After some thinking, I decided to randomly choose three people who each had three hashtags. My first account was named Happylouistommo, and their latest hashtags were louisweloveyou, liamsbirthdayproject, and larryisoverparty. To measure how similar they were to the other users, I used the Jaccard similarity index, which ranges from 0 (no hashtags in common) to 1 (identical sets of hashtags). I calculated it by dividing the number of hashtags that appear in both accounts by the number of unique hashtags across both accounts.

    distance = []
    base = '3151187359'  # id of the user being compared against everyone else
    for index, row in df.iterrows():
        user = df[df['id'] == base]['tags'].tolist()[0]
        target = df[df['id'] == row['id']]['tags'].tolist()[0]
        # Jaccard similarity: shared tags divided by all unique tags
        numerator = len(set(user).intersection(set(target)))
        denominator = len(set(user).union(set(target)))
        distance.append({
            'user': row['screenName'],
            'similarity': numerator / denominator})

The results showed that only one person had a similarity score of 1, which means they used exactly the same hashtags. Other people scored .66, which means they had only two hashtags and both were shared. Finally, I found one person with a score of .50, which means they had three hashtags, two of them shared (two shared out of four unique hashtags overall). For the second user, I chose sarahhh8042, who used the hashtags nationaldogday, happybirthdaydylanobrien, and respecttylerjoseph. In this case, no one scored 1, and the closest matches scored .66. The third user was Paris_trudeau, who used the hashtags respecttylerjoseph, liamsbirthdayproject, and louisweloveyou. Five people scored 1 against them and four scored .66.
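The three scores reported for the first user can be checked by hand. The sketch below uses a small helper called `jaccard` (a name chosen for this example, not taken from the notebook) and the first user's three hashtags. The tag `someothertag` is a made-up placeholder standing in for any hashtag the first user did not use.

```python
def jaccard(a, b):
    """Size of the intersection divided by size of the union (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

base = ["louisweloveyou", "liamsbirthdayproject", "larryisoverparty"]

same = list(base)                                          # identical tags
two_shared = ["louisweloveyou", "liamsbirthdayproject"]    # two tags, both shared
two_of_three = ["louisweloveyou", "liamsbirthdayproject",
                "someothertag"]                            # hypothetical third tag

print(jaccard(base, same))          # 3 shared / 3 unique
print(jaccard(base, two_shared))    # 2 shared / 3 unique
print(jaccard(base, two_of_three))  # 2 shared / 4 unique
```

The second case is 2/3, which the post rounds down to .66.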

After seeing the results, I wondered whether these similar users also had friends in common. To find out, I wrote a loop similar to the one above that checks each of them for shared friends.

    nameList = ['LiamJamesTommo', 'moonlitlarrie', 'aya_elbasyony']
    # Friends of the base user, looked up once
    original = set(df[df['screenName'] == 'sarahhh8042']['friends'].tolist()[0])
    for curr in nameList:
        target = set(df[df['screenName'] == curr]['friends'].tolist()[0])
        # Friends shared by the base user and the current user
        print(target.intersection(original))

In the end, I wasn’t able to find anyone with similar tags who also had friends in common. Even so, you could probably find someone with similar interests just by looking at the tags they use, and I thought it was interesting to look at the data this way. I would have liked to dig deeper, but the data was hard to process given its size. Comparing tag similarity took around 6 minutes per user, and that was only for tags, where each list had at most 3 entries. The friends lists had upwards of 100 ids each.
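Part of that runtime likely comes from filtering the whole dataframe twice for every row. One possible speed-up, sketched here on a tiny made-up dataframe, is to build the base user's tag set once and read each row's tags directly. The rows below are hypothetical; the column names `id`, `screenName`, and `tags` follow the code earlier in the post. This has not been timed against the full dataset.

```python
import pandas as pd

# Tiny hypothetical stand-in for the cleaned dataframe.
df = pd.DataFrame({
    "id": ["1", "2", "3"],
    "screenName": ["base_user", "user_a", "user_b"],
    "tags": [["x", "y", "z"], ["x", "y"], ["q"]],
})

base = "1"
# Look up the base user's tags once instead of once per row.
base_tags = set(df.loc[df["id"] == base, "tags"].iloc[0])

def jaccard_with_base(tags):
    # Jaccard similarity between the base user's tags and one row's tags.
    other = set(tags)
    union = base_tags | other
    return len(base_tags & other) / len(union) if union else 0.0

result = pd.DataFrame({
    "user": df["screenName"],
    "similarity": df["tags"].apply(jaccard_with_base),
}).sort_values("similarity", ascending=False)

print(result)
```

The same pattern should carry over to the friends column, since set intersection stays cheap even for lists with 100 or more ids.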

Link to Code Repository: twitter.ipynb

Link to dataset: Twitter Friends | Kaggle
