Exploring Users’ Data on Twitter Using the Twitter API

Elvis David
Analytics Vidhya
7 min read · Jul 21, 2020

Measuring African users’ influence on Twitter.

This is my first post on data science and indeed a remarkable step in my journey as a data scientist. My gratitude goes to 10 Academy, for they have made me 10X better than yesterday. Every week for the next three months, my skills will keep growing through the weekly challenges.

After reading the Million Follower Fallacy paper, I saw how studying influence can help us understand why certain trends and innovations are adopted faster than others, and how we could help advertisers and marketers design more effective digital campaigns. To that end, I decided to narrow down to Africa and measure the influence of some proposed African influencers and African government officials. Here is an explanation of the analysis.

Getting data through web scraping

The first step is getting a list of African Twitter users and top government officials who were proposed as being influential. This was made possible using these two Python packages:

  • BeautifulSoup package for handling all HTML processing
  • Requests package for performing HTTP requests

The two websites from which I scraped data are the Atlantic Council’s blog post on African leaders responding to coronavirus on Twitter and Africa Freak’s list of the 100 most influential Twitter users in Africa.

These websites had proposed the names as the top influential people in Africa, and using data from Twitter, we are going to validate the list by comparing different measures of influence. After understanding the way forward, I installed the libraries I would be using.

import requests
from bs4 import BeautifulSoup

# Getting the content from the URL by making a request
url = "https://www.atlanticcouncil.org/blogs/africasource/african-leaders-respond-to-coronavirus-on-twitter/#southern-africa"
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
# <blockquote> is the tag that contains the information we need
links = soup.find_all('blockquote')

In the code above, the requests package fetches the page content from the URL, and BeautifulSoup parses it. Touring the website, we find a pattern common to the list we need: all the Twitter handles are embedded in <blockquote> tags. Using BeautifulSoup’s find_all, I collected the content of those tags and converted it to a list of strings. Using Python string operations, I cleaned the data and extracted the Twitter handles. I did the same for the other website and got the list of African influencers.
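
The string cleaning step can be sketched with a small helper that pulls handles out of the raw blockquote text. This is a minimal illustration, assuming the handles appear as @name mentions in the text (the sample strings and handle names here are illustrative, not the real scraped content):

```python
import re

def extract_handles(blocks):
    """Pull unique Twitter handles (without the @) out of raw blockquote text."""
    handles = []
    for text in blocks:
        # Twitter handles are 1-15 word characters following an @
        for match in re.findall(r"@(\w{1,15})", text):
            if match not in handles:  # keep first-seen order, drop duplicates
                handles.append(match)
    return handles

# Illustrative blockquote text, not the real scraped content
blocks = ["Tweet by @PresidencyZA on lockdown", "Update from @MoeketsiMajoro today"]
print(extract_handles(blocks))  # ['PresidencyZA', 'MoeketsiMajoro']
```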

Accessing the Twitter API

I first installed Tweepy, a Python wrapper for interacting with the Twitter API.

!pip install tweepy

Next, I created an authentication object to be able to access data from Twitter.

# Creating the authentication object (replace the placeholders with your credentials)
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
# Setting your access token and secret
auth.set_access_token(access_token, access_token_secret)
# Creating the API object while passing in the auth information
api = tweepy.API(auth, wait_on_rate_limit=True,
                 wait_on_rate_limit_notify=True)

Focusing on an individual’s potential to lead others to engage in certain acts, I was interested in the “interpersonal” activities on Twitter. I wanted to obtain:

  • Number of followers of a user
  • Number of friends of a user
  • Number of retweets of a user
  • Number of likes
  • Number of mentions

Next, using the user object’s attributes, I was able to get the followers_count and friends_count of all the users in our list.

# Getting followers_count and friends_count of the users
followers = []
friends = []
for screen_name in names:
    u = api.get_user(screen_name)
    followers.append(u.followers_count)
    friends.append(u.friends_count)

Next, I wrote a function to get the number of retweets (retweet_count) and the number of likes (favorite_count). The function fetches a user’s tweets along with the retweet_count and favorite_count for each tweet, then writes the data into a CSV file for that user.

import csv

def get_tweets(screen_name):
    # Initialize a list to hold all the tweepy Tweet objects
    alltweets = []
    # Make the initial request for the most recent tweets (200 is the maximum allowed count)
    new_tweets = api.user_timeline(screen_name=screen_name, count=200)
    # Save the most recent tweets
    alltweets.extend(new_tweets)
    # Save the id of the oldest tweet minus one
    oldest = alltweets[-1].id - 1
    # Keep grabbing tweets until there are none left to grab
    # (the API caps timeline history at roughly 3.2k tweets per user)
    while len(new_tweets) > 0:
        print("getting tweets before %s" % oldest)
        # All subsequent requests use the max_id arg to prevent duplicates
        new_tweets = api.user_timeline(screen_name=screen_name, count=200, max_id=oldest)
        # Save the fetched tweets
        alltweets.extend(new_tweets)
        # Update the id of the oldest tweet, less one
        oldest = alltweets[-1].id - 1
        print("...%s tweets downloaded so far" % len(alltweets))
    # Transform the tweets into a 2D array that will populate the CSV
    outtweets = [[tweet.id_str, tweet.created_at, tweet.retweet_count,
                  tweet.favorite_count, tweet.text.encode("utf-8")]
                 for tweet in alltweets]
    # Write the CSV
    with open('%s_tweets.csv' % screen_name, 'w') as f:
        writer = csv.writer(f)
        writer.writerow(["id", "created_at", "retweet_count", "favorite_count", "text"])
        writer.writerows(outtweets)

Now I needed to get the mentions: the number of times the users were mentioned in other tweets. To avoid the code breaking while running, for instance when a username is no longer operational, we catch the Tweepy error with a try/except block.

# Collecting mentions for the users
# list100 is the list where I stored the Twitter handles
mentions = []
for influencer in list100:
    try:
        for status in tweepy.Cursor(api.user_timeline, id=influencer).items():
            if hasattr(status, "entities"):
                entities = status.entities
                for ent in entities.get("user_mentions", []):
                    if ent is not None:
                        name = ent.get("screen_name")
                        if name is not None:
                            mentions.append(name)
    except tweepy.TweepError:
        pass
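
Once the mentions list is filled, the per-user mention counts can be tallied with collections.Counter. A small sketch (the handles below are illustrative):

```python
from collections import Counter

# mentions as collected above: one entry per time a handle was mentioned
mentions = ["PresidencyZA", "MoeketsiMajoro", "PresidencyZA", "EswatiniGovern1"]

mention_count = Counter(mentions)
print(mention_count.most_common(2))
# [('PresidencyZA', 2), ('MoeketsiMajoro', 1)]
```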

Having obtained all the required information about the users, I happily proceeded to the analysis. Congratulations, we have just finished 60% of the project.

Analysing the Twitter data

Because it is hard to determine the influence of Twitter users who have few tweets, I focused on the users with a minimum level of activity: those who had tweeted more than 50 times. I also dropped the users with invalid screen names, because the screen name is what identifies the number of times a user is mentioned or retweeted by others.
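
This activity filter can be expressed as a simple pass over the per-user records. Here is a minimal sketch, assuming the data is held as a list of dicts; the field names and sample users are my own, not the project’s actual column names:

```python
MIN_TWEETS = 50

# Illustrative per-user records
users = [
    {"screen_name": "active_user", "tweet_count": 320},
    {"screen_name": "quiet_user", "tweet_count": 12},
    {"screen_name": None, "tweet_count": 80},  # invalid screen name
]

# Keep users with a valid handle and more than 50 tweets
filtered = [u for u in users
            if u["screen_name"] and u["tweet_count"] > MIN_TWEETS]
print([u["screen_name"] for u in filtered])  # ['active_user']
```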

From the activities of a user on Twitter, we get the three types of influence outlined in the earlier-mentioned Million Follower Fallacy paper. These are:

  • Indegree influence (reach_score): the number of followers of a user, which indicates the size of the user’s audience. I computed this by combining the followers and friends counts.
  • Retweet influence (popularity_score): which I measured by adding the number of retweets and likes. This indicates the ability of a user to generate content with pass-along value.
  • Mention influence (relevance_score): which we measure through the number of mentions containing one’s name. This indicates the ability of a user to engage others in a conversation.
# Calculating the popularity_score and storing the result in a new column
data2["popularity_score"] = data2["retweet_count"] + data2["favorite_count"]
# Calculating the reach_score
data2["reach_score"] = data2["No.of followers"] + data2["friends_count"]

A smile, because we now have the whole dataset, but we still have not answered our question. We will now sort the data by the different influence metrics, draw insights, and try to answer: does influence depend on the number of followers one has?

We will first sort the data by the popularity score to get the top 10 by retweet influence, and then plot the result using the Plotly package.

#sorting data according to popularity score
data3.sort_values(by=['popularity_score'], ascending=False).head(10)

Second, we sorted the list by reach_score, got the top 10 influencers, and plotted a bar chart using the Plotly package.

#sorting data according to reach_score
data3.sort_values(by=['reach_score'], ascending=False).head(10)

Finally, I sorted the data by the number of mentions to determine the top 10 in mention influence.

#sorting data according to mention_count
data3.sort_values(by=['mention_count'], ascending=False).head(10)
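
How much the three rankings agree can be checked directly with set intersections over the top-10 lists. A sketch with illustrative handles (the real lists come from the three sorted views above):

```python
# Illustrative top lists; in the project these come from the three sorted views
top_popularity = {"user_a", "user_b", "user_c"}
top_reach = {"user_c", "user_d", "user_e"}
top_mentions = {"user_f", "user_b", "user_g"}

# Users who rank highly on every metric at once
in_all_three = top_popularity & top_reach & top_mentions
print(sorted(in_all_three))  # [] -- no user tops every metric here

# Pairwise overlap between popularity and reach
pairwise = top_popularity & top_reach
print(sorted(pairwise))  # ['user_c']
```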

After the analysis, I drew several insights from the results.

Conclusion

In this project, I analysed the influence of the proposed Twitter influencers in Africa, employing three different measures that capture different perspectives. These were the findings:

  • Indegree influence represents a user’s popularity, but it is not related to other important metrics such as engaging the audience (retweets and mentions). It is therefore not enough to conclude that a user is influential from the number of followers or friends alone.
  • Retweets are driven by the content value of a tweet, while mentions are driven by the name value of a user. Hence a recommendation for ordinary users: post rich and creative content to gain influence.
  • There was little overlap among the top 10 influencers by popularity_score, reach_score and relevance_score, which suggests that indegree, retweet or mention influence alone reveals very little about a user’s overall influence.

This disproves the follower fallacy that many hold onto.

Find the whole project and materials in my GitHub repository.

What about the dynamics of influence across topics and time? Is influence gained spontaneously or accidentally?

This will form part of my next post. If you found this resourceful, don’t forget to share it with friends.

References

Some of the materials I found resourceful during the exercise.

  1. Practical Introduction to Web Scraping in Python
  2. https://africafreak.com/100-most-influential-twitter-users-in-africa
  3. Blog: How to Calculate Twitter Impressions and Reach

Elvis David
Machine learning engineer and technical writer. Crafting innovative solutions and clear communication. 3 years of experience.