Web Scraping Instagram to build your own profile’s dashboard — With Instaloader and Google Data Studio

Felipe de Pontes Adachi · Published in Analytics Vidhya · Sep 19, 2020

Looking for an excuse to learn a bit more about web scraping and Google Data Studio, I decided to begin a project based on my wife’s commercial Instagram profile. The goal was to build an online, updatable dashboard with some useful metrics, like top hashtags, frequently used words, and post distribution per weekday.

Requirements:

  • Fast and easy to create/update
  • Usable for any Instagram account, as long as it’s public

In this article, I want to share with you my approach to doing that.

The whole project can be found at the project’s GitHub repository, and if you are only interested in usage rather than knowing how it works, you might consider going straight to the project’s README.

So, in order to achieve my goal, I would need to do the following:

  1. Extract information from Instagram
  2. Transform this information into useful metrics
  3. Upload these metrics and information to a data source accessible to Google Data Studio
  4. Connect the data source and build the dashboard at Data Studio

Note that I won’t cover number 4 here. In this article, I’ll limit myself to steps 1 to 3 and leave the explanation of how to actually build the dashboard for some other time.

To summarize, I used the Instaloader package to extract the information and then processed it in a Python script using Pandas.

As a data source, I decided to use a Google Sheet in my personal Drive account. To manipulate this spreadsheet, I used a Python library called gspread.

The dashboard also uses some images for the logo and for generating the word cloud (which I will discuss later). At the time I was building the dashboard, Data Studio didn’t recognize image URLs from my Drive account, so I created an Imgur profile and used imgurpython, its Python API.

Let’s detail the whole process a bit more.

The Pipeline

I want to do all these tasks sequentially, so I wrote a shell script in order to generate/update the report in a single command:

./insta_pipe.sh <your-profile> <language>

where <your-profile> is your Instagram profile and <language> is the language you want the report to be generated in (currently en and pt).

The shell script looks something like the sketch below. I reconstructed it here from the steps described in the following sections, so the actual script in the repository may differ in its details:
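#!/bin/bash
# Sketch of insta_pipe.sh; $1 is the profile, $2 is the report language

# 1. Scrape post metadata (and the profile pic) with Instaloader
instaloader --no-pictures --no-videos --no-metadata-json --post-metadata-txt="date:\n{date_local}\ntypename:\n{typename}\nlikes:\n{likes}\ncomments:\n{comments}\ncaption:\n{caption}" $1

# 2. Fix the escaped line breaks in the generated .txt files
python fix_lines.py $1

# 3. Assemble the metrics and upload everything to the Google Sheet
python transform_and_upload.py $1 $2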

I will talk about each command of this script in the following sections.

What we end up with

After running the shell script, the goal is to have a Google spreadsheet updated with the information we need.

I decided to divide the information into groups, one per worksheet:

  • ProfileInfo — profile name and Imgur URL for the profile pic
  • WordCloud — Imgur URL for the word cloud image
  • Top_Hash — top 10 hashtags, according to the average number of likes
  • Data — table of one Instagram post per row, with info about media type (image, video, or sidecar), plus the number of likes and comments
  • MediaMetrics — the average number of likes per media type
  • DayMetrics — the average number of likes and number of posts per weekday
  • MainMetrics — the overall average number of likes and comments

Extracting Information from your Insta Profile

For the first task, I decided to use this awesome library called Instaloader. As the library’s website puts it, Instaloader is a tool to download pictures or videos from Instagram, along with other metadata. For this project, I am mainly interested in info such as the number of likes, comments, captions, and hashtags.

Once you pip install instaloader and read the documentation, it turns out that, for this project, a single command is all it takes:

instaloader --no-pictures --no-videos --no-metadata-json --post-metadata-txt="date:\n{date_local}\ntypename:\n{typename}\nlikes:\n{likes}\ncomments:\n{comments}\ncaption:\n{caption}" $1;

That will create a folder named after your profile, and inside there will be a bunch of .txt files, one for each post. Inside each file, there will be information about:

  • date
  • media type (image, video, or sidecar)
  • number of likes
  • number of comments
  • caption

If you run this command, you’ll see that my attempt at breaking the lines with \n was not successful. The command automatically escapes the backslash, and I end up with a literal “\n” written in the file.

I am certain there is a smarter way to do this, but my workaround was to replace \\n with \n in every text file, which is what fix_lines.py does. A minimal sketch of that idea (the actual script in the repository may differ):
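import glob
import sys

# Rewrite each .txt file, turning the literal two-character sequence
# \n into an actual line break
profile = sys.argv[1]
for path in glob.glob(f"{profile}/*.txt"):
    with open(path, encoding="utf-8") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8") as f:
        f.write(text.replace("\\n", "\n"))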

Oh, and Instaloader will also download your profile pic, which I will use as the logo for the dashboard.

Transform and Upload — Preliminary Steps

For this step, I had to make sure I had some things beforehand:

  • a Google account to use Drive
  • an Imgur Account

I also had to follow some instructions to authenticate and authorize the application for both gspread and imgur.

For gspread, I followed the authentication instructions in the library’s documentation to, in the end, have a credentials.json file to put at ~/.config/gspread/credentials.json.

As for Imgur, I followed the instructions at the imgurpython repository, going through the steps in the registration quickstart section just up until I had the following: client_id, client_secret, access_token, and refresh_token. These tokens replace the placeholders in the imgur_credentials.json file, along with the username of your Imgur account. The file ends up looking something like this, with illustrative placeholder values:
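{
  "client_id": "YOUR_CLIENT_ID",
  "client_secret": "YOUR_CLIENT_SECRET",
  "access_token": "YOUR_ACCESS_TOKEN",
  "refresh_token": "YOUR_REFRESH_TOKEN",
  "username": "YOUR_IMGUR_USERNAME"
}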

The last thing is that I had to create a blank Google Sheet beforehand and get its key. If you open a Google Sheet, the link will be something like this:

https://docs.google.com/spreadsheets/d/1h093LCbdJtDCNcDUnln4Lco-RANtl6-_XVi49InZCBw/edit#gid=0

The key would be that sequence of letters and numbers in the middle:

1h093LCbdJtDCNcDUnln4Lco-RANtl6-_XVi49InZCBw

I will use it later to let gspread know which spreadsheet to write the information to.

Transform and Upload — Assemble, Generate and Upload

The script at transform_and_upload.py reads the .txt files created with Instaloader, assembles all the information, creates metrics and dataframes, and then updates the worksheets.

Creating the clients

First, we begin by setting the Google Sheet key we wish to update and creating a sheet object so we can later update its content. Something like this sketch, reusing the example key from above:
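# Replace with your own spreadsheet's key
SHEET_KEY = "1h093LCbdJtDCNcDUnln4Lco-RANtl6-_XVi49InZCBw"
g_sheet = gSheet(SHEET_KEY)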

g_sheet is an instance of the gSheet class, which contains the methods to authenticate and update the worksheets. Just to show you the beginning of it, here is a sketch assuming gspread’s OAuth flow with the credentials.json from earlier (you can check the real thing at the repository):
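import gspread

class gSheet:
    def __init__(self, key):
        # gspread.oauth() picks up ~/.config/gspread/credentials.json and
        # asks for authorization in the browser on the first run
        self.client = gspread.oauth()
        self.sheet = self.client.open_by_key(key)

    def update_main_metrics(self, metrics):
        # One of several update_* methods, each writing to its own worksheet
        ws = self.sheet.worksheet("MainMetrics")
        ws.update("A1", [["avg_likes", metrics["avg_likes"]],
                         ["avg_comments", metrics["avg_comments"]]])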

An Imgur client is also created, using the credentials to perform operations on your account, like removing images from past runs of your application and uploading the new images you want displayed on your dashboard. A sketch of that setup, reading the tokens from imgur_credentials.json:
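import json
from imgurpython import ImgurClient

with open("imgur_credentials.json") as f:
    creds = json.load(f)

imgur_client = ImgurClient(creds["client_id"], creds["client_secret"],
                           creds["access_token"], creds["refresh_token"])

# Remove images uploaded by previous runs before uploading fresh ones
for image in imgur_client.get_account_images(creds["username"]):
    imgur_client.delete_image(image.id)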

Generating dataframes and metrics

The assemble_info function is the one that actually reads the text files line by line and assembles the information into an initial dataframe called df_posts. A rough sketch of the idea (the real function parses line by line; here I use regexes over the whole file, and the media-type handling is simplified):
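import glob
import re
import pandas as pd

# Illustrative encoding; the repository may map the types differently
MEDIA_CODES = {"Image": 1, "Video": 2, "Sidecar": 3}

def assemble_info(profile):
    rows = []
    for path in glob.glob(f"{profile}/*.txt"):
        with open(path, encoding="utf-8") as f:
            text = f.read()

        # Each field was written as "label:\nvalue" by the metadata template
        def field(label):
            match = re.search(rf"^{label}:\n(.*)$", text, re.MULTILINE)
            return match.group(1) if match else ""

        caption = text.split("caption:\n", 1)[-1]
        media_type = field("typename").replace("Graph", "")  # GraphImage -> Image
        rows.append({
            "media_type": media_type,
            "media_code": MEDIA_CODES.get(media_type, 0),
            "likes": int(field("likes") or 0),
            "comments": int(field("comments") or 0),
            "date": field("date"),
            "hashed": re.findall(r"#(\w+)", caption),
        })
    return pd.DataFrame(rows)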

Each row of df_posts refers to a single post, with the following columns:

  • media_type: Video, Image or Sidecar
  • media_code: Just the above encoded as integers (1, 2, or 3)
  • likes: Number of likes the post currently has
  • comments: Number of comments the post currently has
  • date: Date of creation
  • hashed: List of hashtags used

In addition, assemble_info also uploads the profile image to Imgur, returning its URL, and concatenates the text of every caption (except hashtags), which is used later to generate the word cloud.

df_posts is then used to generate further specific dataframes and metrics. Sketched roughly, assuming the df_posts columns listed above:
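import pandas as pd

# Top hashtags: one row per (post, hashtag) pair, then average likes per tag
df_exploded = df_posts.explode("hashed").dropna(subset=["hashed"])
hash_stats = df_exploded.groupby("hashed")["likes"].agg(["mean", "count"])
df_hash = (hash_stats[hash_stats["count"] >= 5]  # ignore rarely used hashtags
           .sort_values("mean", ascending=False)
           .head(10))

# Per-weekday metrics: average likes and number of posts
df_posts["weekday"] = pd.to_datetime(df_posts["date"]).dt.day_name()
df_day = df_posts.groupby("weekday").agg(avg_likes=("likes", "mean"),
                                         posts=("likes", "size"))

# Subset that Data Studio uses for the media-type pie chart
df_data = df_posts[["media_type", "media_code", "likes", "comments"]]

# Overall averages, plus averages per media type
metrics = {
    "avg_likes": df_posts["likes"].mean(),
    "avg_comments": df_posts["comments"].mean(),
    "likes_per_media": df_posts.groupby("media_type")["likes"].mean().to_dict(),
}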

df_hash contains the hashtags in descending order of average number of likes (hashtags that appear fewer than 5 times are ignored).

Information per weekday is stored in df_day, while df_data is just a subset of df_posts that Data Studio will use to create the pie chart by media type.

As for metrics, it contains some overall averages of likes and comments, plus info per media type.

Uploading to Drive

With everything at hand, it is time to update our Google Sheet. The method names below are illustrative; check the gSheet class in the repository for the actual ones:
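# One call per worksheet, mirroring the list from "What we end up with"
g_sheet.update_profile_info(profile, profile_pic_url)
g_sheet.update_top_hash(df_hash)
g_sheet.update_data(df_data)
g_sheet.update_media_metrics(metrics["likes_per_media"])
g_sheet.update_day_metrics(df_day)
g_sheet.update_main_metrics(metrics)
g_sheet.update_wordcloud(caption_text)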

All these methods basically update each worksheet with the appropriate information. The only exception is update_wordcloud, which also generates the word cloud, using nltk to tokenize the text and remove stopwords, plus the wordcloud package. A sketch of that step, with English stopwords assumed (the script picks them per the language flag):
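import nltk
from nltk.corpus import stopwords
from wordcloud import WordCloud

nltk.download("punkt")
nltk.download("stopwords")

# caption_text is the concatenated captions returned by assemble_info
tokens = nltk.word_tokenize(caption_text)
words = [w for w in tokens
         if w.isalpha() and w.lower() not in stopwords.words("english")]

cloud = WordCloud(width=800, height=400, background_color="white")
cloud.generate(" ".join(words))
cloud.to_file("wordcloud.png")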

and then uploads it to Imgur before sending the URL to the worksheet. Roughly like this (upload_from_path is imgurpython’s method; the worksheet write is a sketch):
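# Upload the image to the authenticated account and write its link
# to the WordCloud worksheet
uploaded = imgur_client.upload_from_path("wordcloud.png", anon=False)
g_sheet.sheet.worksheet("WordCloud").update("A1", [[uploaded["link"]]])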

And that’s it!

For the sake of brevity, I didn’t show here every line of code used throughout the script, but if you are interested, you can check it all out at the Project’s Repository.

Next Steps

Now that we have an updated Google Sheet, we can use it as a data source to plug into Data Studio. The next step would be, of course, building the dashboard. As I said before, explaining that process here would make this story overly long, so I will leave it for some other time!

Thank you for letting me share the experience!
