Analyzing Instagram data with Python

Innovi Tech Academy
5 min read · Mar 24, 2020

In this post I want to talk a bit about how to explore your own Instagram account data and generate interesting insights. I will be using Google Colab for this, with the following packages:

  • Pandas
  • InstagramApi (an unofficial Python client for the Instagram API)

I assume you already have pandas installed. If not, you can install both libraries with:

!pip install pandas
!pip install InstagramApi

Then, let’s import the packages used throughout this demonstration:

from InstagramAPI import InstagramAPI
import pandas as pd
try:
    from pandas import json_normalize  # pandas 1.0 and later
except ImportError:
    from pandas.io.json import json_normalize  # older pandas versions

Then, create a small function to log in to Instagram with your account:

def login_to_instagram(username, password):
    api = InstagramAPI(username, password)
    api.login()

    return api

api = login_to_instagram('instagram_username', 'instagram_password')

Once executed, you should receive this return message: Login success!
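
If you would rather not type your password straight into a notebook cell, one option is to read it from environment variables. This is only a minimal sketch: the variable names IG_USERNAME and IG_PASSWORD and the helper read_credentials are names I made up for illustration, not something the InstagramApi package defines.

```python
import os


def read_credentials(env=None):
    """Return (username, password) read from environment variables.

    IG_USERNAME and IG_PASSWORD are hypothetical names chosen for this sketch.
    """
    env = os.environ if env is None else env
    try:
        return env["IG_USERNAME"], env["IG_PASSWORD"]
    except KeyError as missing:
        raise RuntimeError(f"Missing environment variable: {missing}") from None


# A plain dict stands in for os.environ here so the example runs anywhere.
username, password = read_credentials(
    {"IG_USERNAME": "demo_user", "IG_PASSWORD": "demo_pass"}
)
# api = login_to_instagram(username, password)
```

Calling read_credentials() without arguments reads from the real environment and raises a RuntimeError if either variable is missing.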

Cool, we are logged in to Instagram! Now we can start exploring it. A natural first step is to retrieve all your posts:

def get_my_posts(api):
    '''Retrieve all posts from own profile'''
    my_posts = []
    has_more_posts = True
    max_id = ''

    while has_more_posts:
        api.getSelfUserFeed(maxid=max_id)
        if api.LastJson['more_available'] is not True:
            has_more_posts = False  # stop condition
        max_id = api.LastJson.get('next_max_id', '')
        my_posts.extend(api.LastJson['items'])  # merge lists
        if has_more_posts:
            print(str(len(my_posts)) + ' posts retrieved so far...')
    print('Total posts retrieved: ' + str(len(my_posts)))
    return my_posts

my_posts = get_my_posts(api)

The output should look like this:

18 posts retrieved so far...
36 posts retrieved so far...
54 posts retrieved so far...
72 posts retrieved so far...
90 posts retrieved so far...
108 posts retrieved so far...
126 posts retrieved so far...
144 posts retrieved so far...
162 posts retrieved so far...
180 posts retrieved so far...
198 posts retrieved so far...
Total posts retrieved: 209
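
The loop in get_my_posts follows a cursor pattern: each response says whether more pages exist (more_available) and where the next one starts (next_max_id). Here is a small sketch of that pattern that runs without a network connection. FakeFeedApi and fetch_all are hypothetical names of my own; the fake class only mimics the three response fields used above and is not how the real service pages its data.

```python
class FakeFeedApi:
    """Hypothetical stand-in that mimics the paging fields used above."""

    def __init__(self, items, page_size):
        self._items = items
        self._page_size = page_size
        self.LastJson = {}

    def getSelfUserFeed(self, maxid=''):
        # In this fake, the cursor is simply the index of the next item.
        start = int(maxid) if maxid else 0
        end = start + self._page_size
        self.LastJson = {
            'items': self._items[start:end],
            'more_available': end < len(self._items),
        }
        if end < len(self._items):
            self.LastJson['next_max_id'] = str(end)


def fetch_all(api):
    """Follow next_max_id until more_available is no longer True."""
    posts, max_id = [], ''
    while True:
        api.getSelfUserFeed(maxid=max_id)
        page = api.LastJson
        posts.extend(page['items'])
        if page.get('more_available') is not True:
            return posts
        max_id = page.get('next_max_id', '')


fake_api = FakeFeedApi(items=[{'id': str(n)} for n in range(7)], page_size=3)
all_posts = fetch_all(fake_api)
print(len(all_posts))  # 7
```

With 7 items and a page size of 3, the loop makes three requests and stops on the last page, where more_available is False.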

my_posts will be a list of dictionaries, where each item represents a single post from your Instagram account. In my case, I have 209 posts. I encourage you to explore the fields available on each post. There is a lot of interesting stuff in there :)
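
A quick way to start exploring is to count how often each top-level field shows up across your posts. The sketch below uses a hand-made sample; the field names in it (like_count, caption) are illustrative assumptions rather than a guaranteed description of what Instagram returns, and field_coverage is a helper name I made up.

```python
from collections import Counter

# Hand-made sample; the field names are illustrative assumptions only.
sample_posts = [
    {'id': '1_10', 'like_count': 12, 'caption': {'text': 'Sunset'}},
    {'id': '2_10', 'like_count': 7},
    {'id': '3_10', 'like_count': 30, 'caption': {'text': 'Harbour'}},
]


def field_coverage(posts):
    """Count in how many posts each top-level key appears."""
    counts = Counter()
    for post in posts:
        counts.update(post.keys())
    return dict(counts)


coverage = field_coverage(sample_posts)
for field, n in sorted(coverage.items()):
    print(f'{field}: present in {n} of {len(sample_posts)} posts')
```

Running field_coverage(my_posts) on your real data should show which fields are always present and which appear only on some posts.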

Now that we have all the posts, let’s retrieve all the “post likers” to see which users like your posts:

def get_posts_likers(api, my_posts):
    '''Retrieve all likers on all posts'''
    likers = []
    # Rough estimate, assuming about 2 seconds per request
    print('wait %.1f minutes' % (len(my_posts)*2/60.))
    for i in range(len(my_posts)):
        m_id = my_posts[i]['id']
        api.getMediaLikers(m_id)
        likers += [api.LastJson]
        # Include post_id in likers dict list
        likers[i]['post_id'] = m_id
    print('done')
    return likers

likers = get_posts_likers(api, my_posts)

This should take a few minutes, depending on the number of posts you have. An approximate wait time will be displayed:

wait 7.0 minutes

You will see done printed when it has finished. likers will also be a list of dictionaries, and should have the same length as my_posts. Inside each dictionary, you will find the key users, which contains all the users that liked a specific post.
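
Before moving to pandas, you can already get a like count per post straight from this structure. Here is a minimal sketch on a hand-made sample shaped like the likers list described above; likes_per_post is a hypothetical helper name.

```python
# Hand-made sample shaped like the likers list described above.
sample_likers = [
    {'post_id': 'a', 'users': [{'username': 'ann'}, {'username': 'bob'}]},
    {'post_id': 'b', 'users': [{'username': 'ann'}]},
    {'post_id': 'c', 'users': []},
]


def likes_per_post(likers):
    """Map each post_id to the number of users found under 'users'."""
    return {entry['post_id']: len(entry.get('users', [])) for entry in likers}


counts = likes_per_post(sample_likers)
print(counts)  # {'a': 2, 'b': 1, 'c': 0}
```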

Ok, let’s do a similar operation, but to return the post commenters this time:

def get_posts_commenters(api, my_posts):
    '''Retrieve all commenters on all posts'''
    commenters = []
    print('wait %.1f minutes' % (len(my_posts)*2/60.))
    for i in range(len(my_posts)):
        m_id = my_posts[i]['id']
        api.getMediaComments(m_id)
        commenters += [api.LastJson]
        # Include post_id in commenters dict list
        commenters[i]['post_id'] = m_id
    print('done')
    return commenters

commenters = get_posts_commenters(api, my_posts)

Expect to wait about as long as you did for the likers; done will be printed once it has finished. commenters will also be a list of dictionaries, and should have the same length as my_posts. The actual comments are stored under the key comments of each item in the commenters list.

Converting to pandas DataFrames

It’s time to use the powerful pandas package and structure this data a little bit. I’ll use the json_normalize function from pandas. The data transformation differs a bit between likers and commenters:

def posts_likers_to_df(likers):
    '''Transforms likers list of dicts into pandas DataFrame'''
    # Normalize likers by getting the 'users' list and the post_id of each like
    df_likers = json_normalize(likers, 'users', ['post_id'])
    # Add 'content_type' column to know the rows are likes
    df_likers['content_type'] = 'like'
    return df_likers

def posts_commenters_to_df(commenters):
    '''Transforms commenters list of dicts into pandas DataFrame'''
    # Include username and full_name of commenter in 'comments' list of dicts
    for i in range(len(commenters)):
        if len(commenters[i]['comments']) > 0:  # checks if there is any comment on the post
            for j in range(len(commenters[i]['comments'])):
                # Puts username/full_name one level up
                commenters[i]['comments'][j]['username'] = commenters[i]['comments'][j]['user']['username']
                commenters[i]['comments'][j]['full_name'] = commenters[i]['comments'][j]['user']['full_name']
    # Create DataFrame
    # Normalize commenters to have 1 row per comment, and get 'post_id' from parent
    df_commenters = json_normalize(commenters, 'comments', 'post_id')
    # Get rid of the nested 'user' data as we already handled it above
    # (depending on the pandas version this is one 'user' column or several 'user.*' columns)
    user_cols = [c for c in df_commenters.columns if c == 'user' or c.startswith('user.')]
    df_commenters = df_commenters.drop(columns=user_cols)

    return df_commenters

df_likers = posts_likers_to_df(likers)
df_commenters = posts_commenters_to_df(commenters)
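
To see what the second and third arguments of json_normalize do, here is a tiny sketch on hand-made data. The second argument (record_path) names the nested list that becomes the rows, and the third (meta) names parent fields to copy onto every row. I call it as pd.json_normalize here, which assumes pandas 1.0 or later; the sample values are made up.

```python
import pandas as pd

# Hand-made sample shaped like the likers list: one dict per post,
# each with a nested list of users.
demo_likers = [
    {'post_id': 'a', 'users': [{'username': 'ann', 'full_name': 'Ann A'},
                               {'username': 'bob', 'full_name': 'Bob B'}]},
    {'post_id': 'b', 'users': [{'username': 'ann', 'full_name': 'Ann A'}]},
]

# One row per entry in 'users', with the parent's post_id repeated on each row.
df_demo = pd.json_normalize(demo_likers, record_path='users', meta=['post_id'])
print(df_demo)
```

Two posts with three likes in total become three rows, each carrying the post_id of the post it came from.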

With this, we have two pandas DataFrames: df_likers and df_commenters. Each row of df_likers represents a single like, and each row of df_commenters represents a single comment. We can now get some interesting numbers. Let’s start with some basic counts:

print('Total posts: ' + str(len(my_posts)))
print('---------')
print('Total likes on profile: ' + str(df_likers.shape[0]))  # shape[0] represents number of rows
print('Distinct users that liked your posts: ' + str(df_likers.username.nunique()))  # nunique() counts distinct values of a column
print('---------')
print('Total comments on profile: ' + str(df_commenters.shape[0]))
print('Distinct users that commented on your posts: ' + str(df_commenters.username.nunique()))

Output:

Total posts: 209
---------
Total likes on profile: 2324
Distinct users that liked your posts: 991
---------
Total comments on profile: 85
Distinct users that commented on your posts: 68

Top 10 likers of my Instagram account:

# As each row represents a like, we can perform a value_counts on username
# and slice it to the first 10 items (pandas already orders it for us)
df_likers.username.value_counts()[:10]

Output:

hc_ka_sky             77
ceddie112             72
tcyinv3v              55
cheunghotinfeat.cm    52
ngtszkin18            42
lawlaw0925            34
yanyuwong             28
eric_khshing          27
photo.by.min          24
raychongtk            22
Name: username, dtype: int64

Hmm, “hc_ka_sky” is the person who liked my posts the most: out of 209 posts, he liked 77 of them! Let’s plot the distribution of this Top 10:

Bar plot

df_likers.username.value_counts()[:10].plot(kind='bar', title='Top 10 media likers', grid=True, figsize=(12,6))

Pie plot

df_likers.username.value_counts()[:10].plot(kind='pie', title='Top 10 media likers distribution', autopct='%1.1f%%', figsize=(12,6))
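
Besides plotting raw counts, it can be handy to express each liker's count as a share of all your posts. Below is a minimal sketch on a hand-made DataFrame with one row per like, the same shape as df_likers. It assumes a user can like a given post at most once, and the sample names and values are made up.

```python
import pandas as pd

# Hand-made sample: one row per like, as in df_likers.
sample_likes = pd.DataFrame({
    'username': ['ann', 'bob', 'ann', 'cy', 'ann', 'bob'],
    'post_id':  ['p1',  'p1',  'p2',  'p2', 'p3',  'p3'],
})

total_posts = sample_likes['post_id'].nunique()

# Likes per user, most active first.
top = sample_likes['username'].value_counts().head(10)

# Share of posts each user liked, in percent.
share = (top / total_posts * 100).round(1)
print(share)
```

On the real data you would divide by len(my_posts) instead, since posts without any likes do not appear in df_likers.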
