Web Scraping Instagram to build your own profile’s dashboard — With Instaloader and Google Data Studio
Looking for an excuse to learn a bit more about web scraping and Google Data Studio, I decided to begin a project based on my wife’s commercial Instagram profile. The goal was to build an online, updatable dashboard with some useful metrics, such as top hashtags, frequently used words, and post distribution per weekday:
Requirements:
- Fast and easy to create/update
- Usable for any Instagram account, as long as it’s public
In this article, I want to share my approach with you.
The whole project can be found at the project’s GitHub repository, and if you are only interested in usage rather than knowing how it works, you might consider going straight to the project’s README.
So, in order to achieve my goal, I would need to do the following:
- Extract information from Instagram
- Transform this information into useful metrics
- Upload these metrics and information to a data source accessible to Google Data Studio
- Connect the data source and build the dashboard at Data Studio
It should be noted that I won’t cover step 4 here. In this article, I’ll limit myself to steps 1 to 3 and leave the explanation of how to actually build the dashboard for some other time.
To summarize, I used the Instaloader package for extracting information and then processed it in a Python script using Pandas.
As a data source, I decided to use a Google Sheet in my personal Drive account. To manipulate this spreadsheet, I used a Python library called `gspread`.
The dashboard also uses some images for the logo and for generating the word cloud (which I will discuss later). At the time I was building the dashboard, Data Studio didn’t recognize image URLs from my Drive account, so I created an Imgur profile and used the Python API `imgurpython`.
Let’s detail the whole process a bit more.
The Pipeline
I want to do all these tasks sequentially, so I wrote a shell script in order to generate/update the report in a single command:
./insta_pipe.sh <your-profile> <language>
in which `<your-profile>` is your Instagram profile, and `<language>` is the language you want the report to be generated in (currently `en` and `pt`).
The shell script looks something like this:
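The embedded gist contains the exact script; as a minimal sketch (step order and script names taken from the sections below, everything else an approximation), it can look like this:

```shell
#!/bin/bash
# Hypothetical sketch of insta_pipe.sh (the real script lives in the repo).
# $1 = Instagram profile, $2 = report language (en or pt)

run_pipeline() {
  profile="$1"
  language="$2"

  # 1. Download post metadata (no media) into a folder named after the profile
  instaloader --no-pictures --no-videos --no-metadata-json \
    --post-metadata-txt="date:\n{date_local}\ntypename:\n{typename}\nlikes:\n{likes}\ncomments:\n{comments}\ncaption:\n{caption}" \
    "$profile"

  # 2. Turn the escaped "\n" sequences in the .txt files into real line breaks
  python fix_lines.py "$profile"

  # 3. Assemble the metrics and push everything to Google Sheets and Imgur
  python transform_and_upload.py "$profile" "$language"
}

if [ "$#" -ge 2 ]; then
  run_pipeline "$1" "$2"
fi
```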
I will talk about each command of this script in the following sections.
What we end up with
After running the shell script, the goal is to have a Google spreadsheet updated with the information we need, much like the one below:
I decided to divide groups of information into different worksheets:
- ProfileInfo — profile name and Imgur URL for the profile pic
- WordCloud — Imgur URL for the word cloud image
- Top_Hash — top 10 hashtags, according to the average number of likes
- Data — table of one Instagram post per row, with info about media type (image, video, or sidecar), plus the number of likes and comments
- MediaMetrics — the average number of likes per media type
- DayMetrics — the average number of likes and number of posts per weekday
- MainMetrics — the overall average number of likes and comments
Extracting Information from your Insta Profile
For the first task, I decided to use this awesome library called Instaloader. As stated on the library’s website, Instaloader is a tool to download pictures or videos from Instagram, along with other metadata. For this project, I am mainly interested in info such as the number of likes, comments, captions, and hashtags.
Once you `pip install instaloader` and read the documentation, it turns out that for this project, a single command is all it takes:
instaloader --no-pictures --no-videos --no-metadata-json --post-metadata-txt="date:\n{date_local}\ntypename:\n{typename}\nlikes:\n{likes}\ncomments:\n{comments}\ncaption:\n{caption}" $1;
That will create a folder named after your profile, and inside there will be a bunch of .txt files, one for each post. Inside each file, there will be information about:
- date
- media type (image, video, or sidecar)
- number of likes
- number of comments
- caption
If you run this command, you’ll see that my attempt at breaking the lines with `\n` was not successful. The command automatically escapes the backslash, and I ended up with a literal “\n” written in the files.
I am certain there is a smarter way to do that, but my workaround was to replace `\\n` with `\n` in every text file, which is what `fix_lines.py` does.
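A minimal sketch of what that replacement script can look like (the real `fix_lines.py` is in the repository; the folder argument is an assumption here):

```python
import glob
import sys


def fix_lines(folder):
    """Replace the literal backslash-n sequences Instaloader wrote into
    each post's .txt file with real newlines."""
    for path in glob.glob(f"{folder}/*.txt"):
        with open(path, encoding="utf-8") as f:
            text = f.read()
        # The command escaped the backslash, so files contain "\n" as two chars
        with open(path, "w", encoding="utf-8") as f:
            f.write(text.replace("\\n", "\n"))


if __name__ == "__main__" and len(sys.argv) > 1:
    fix_lines(sys.argv[1])
```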
Oh, and Instaloader will also download your profile pic, which I will use as the logo for the dashboard.
Transform and Upload — Preliminary Steps
For this step, I had to make sure I had some things beforehand:
- a Google account to use Drive
- an Imgur Account
I also had to follow some instructions to authenticate and authorize the application for both `gspread` and Imgur.
For `gspread`, I followed these instructions to, in the end, have a `credentials.json` file to put at `~/.config/gspread/credentials.json`.
As for Imgur, I followed these instructions, going through the registration quickstart section just up until I had the following: `client_id`, `client_secret`, `access_token`, and `refresh_token`. These tokens replace the placeholders in the `imgur_credentials.json` file, along with the username of your Imgur account.
The last thing is that I had to create a blank Google Sheet beforehand and get its key. If you open a Google Sheet, the link will be something like this:
https://docs.google.com/spreadsheets/d/1h093LCbdJtDCNcDUnln4Lco-RANtl6-_XVi49InZCBw/edit#gid=0
The key would be that sequence of letters and numbers in the middle:
1h093LCbdJtDCNcDUnln4Lco-RANtl6-_XVi49InZCBw
I will use it later to let `gspread` know where to write the information.
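As a quick illustration (a hypothetical helper, not part of the original scripts), the key can be extracted from the URL with a one-liner:

```python
def sheet_key_from_url(url: str) -> str:
    """Extract the spreadsheet key: it sits between '/d/' and the next '/'."""
    return url.split("/d/")[1].split("/")[0]
```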
Transform and Upload — Assemble, Generate and Upload
The script at `transform_and_upload.py` reads the .txt files created with Instaloader, assembles all the information, creates metrics and dataframes, and then updates the worksheets:
Creating the clients
First, we begin by setting the Google Sheet key we wish to update and creating a sheet object so we can later update its content:
`g_sheet` is an instance of the `gSheet` class, which contains the methods to authenticate and update the worksheets. Just to show you the beginning of it (you can check the rest of it at the repository):
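A minimal sketch of what such a wrapper class can look like (method names here are assumptions; the real class is in the repository):

```python
class GSheet:
    """Hypothetical sketch of the gSheet wrapper around gspread."""

    def __init__(self, sheet_key):
        self.sheet_key = sheet_key
        self.sheet = None

    def authenticate(self):
        # gspread.oauth() reads ~/.config/gspread/credentials.json by default
        import gspread
        client = gspread.oauth()
        self.sheet = client.open_by_key(self.sheet_key)

    def update_worksheet(self, name, rows):
        """Overwrite a worksheet's content with a list of rows, from cell A1."""
        ws = self.sheet.worksheet(name)
        ws.clear()
        ws.update("A1", rows)
```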
An Imgur client is also created, using the credentials to perform operations on your account, like removing images from past runs of your application and uploading the new images you want to be displayed at your dashboard:
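Sketched under the assumption that `imgur_credentials.json` holds the four tokens mentioned earlier (the helper names are hypothetical), creating that client can look like this:

```python
import json


def load_imgur_credentials(path="imgur_credentials.json"):
    """Read the Imgur tokens (and username) from the credentials file."""
    with open(path) as f:
        return json.load(f)


def make_imgur_client(path="imgur_credentials.json"):
    # Lazy import so the sketch only needs imgurpython when actually used
    from imgurpython import ImgurClient
    creds = load_imgur_credentials(path)
    return ImgurClient(creds["client_id"], creds["client_secret"],
                       creds["access_token"], creds["refresh_token"])
```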
Generating dataframes and metrics
The `assemble_info` function is the one that actually reads the text files line by line and assembles the information into an initial dataframe called `df_posts`:
Each row of `df_posts` refers to a single post, with the following columns:
- media_type: Video, Image or Sidecar
- media_code: The above encoded as an integer (1, 2, or 3)
- likes: Number of likes the post currently has
- comments: Number of comments the post currently has
- date: Date of creation
- hashed: List of hashtags used
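The parsing behind a row like the one described above can be sketched as follows, assuming the key-on-one-line, value-on-the-next layout produced by the Instaloader command shown earlier (function and column names here are my own, not the repository's):

```python
def parse_post(text):
    """Parse one Instaloader .txt post file into a row dict."""
    # The caption is the free-form tail after the "caption:" marker
    head, _, caption = text.partition("caption:\n")
    lines = head.strip().splitlines()
    # Alternating lines: "date:", value, "typename:", value, ...
    fields = dict(zip(lines[0::2], lines[1::2]))
    return {
        "media_type": fields["typename:"],
        "likes": int(fields["likes:"]),
        "comments": int(fields["comments:"]),
        "date": fields["date:"],
        "hashtags": [w for w in caption.split() if w.startswith("#")],
    }
```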
In addition, `assemble_info` also uploads the profile picture to Imgur, returning its URL, and concatenates the text of every caption (except hashtags), which is used later to generate the word cloud.
`df_posts` is then used to generate further specific dataframes and metrics:
`df_hash` contains the hashtags in descending order of average number of likes (hashtags that appear fewer than 5 times are ignored).
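That ranking can be sketched with a Pandas group-by, assuming the hashtag lists live in a column I’ll call `hashtags` here:

```python
import pandas as pd


def top_hashtags(df_posts, min_count=5, top_n=10):
    """Average likes per hashtag, keeping tags used at least min_count times."""
    # One row per (post, hashtag) pair; posts without hashtags drop out
    df = df_posts.explode("hashtags").dropna(subset=["hashtags"])
    counts = df["hashtags"].value_counts()
    keep = counts[counts >= min_count].index
    return (df[df["hashtags"].isin(keep)]
            .groupby("hashtags")["likes"].mean()
            .sort_values(ascending=False)
            .head(top_n))
```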
Information according to weekday is stored in `df_day`, while `df_data` is just a subset of `df_posts` that Data Studio will use to create the pie chart by media type.
As for `metrics`, it contains some overall metrics of average likes and comments, besides info according to media type.
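A hedged sketch of those aggregations (one possible implementation, not the repository’s exact code):

```python
import pandas as pd


def build_metrics(df_posts):
    """Per-weekday table, per-media-type averages, and overall averages."""
    df = df_posts.copy()
    df["date"] = pd.to_datetime(df["date"])
    df["weekday"] = df["date"].dt.day_name()

    # Average likes and number of posts per weekday
    df_day = df.groupby("weekday").agg(avg_likes=("likes", "mean"),
                                       posts=("likes", "count"))
    # Average likes per media type
    media = df.groupby("media_type")["likes"].mean()
    # Overall averages
    metrics = {"avg_likes": df["likes"].mean(),
               "avg_comments": df["comments"].mean()}
    return df_day, media, metrics
```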
Uploading to Drive
With everything at hand, it is time to update our Google Sheet:
All these methods basically update each worksheet with the appropriate information. The only exception is `update_wordcloud`, which also generates the word cloud, using `nltk` to tokenize the text and remove stopwords, and the `wordcloud` package:
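The generation step can be sketched like this, assuming the `nltk` data (`punkt`, `stopwords`) has been downloaded (function names and dimensions are my own):

```python
def filter_words(tokens, stop_words):
    """Keep alphabetic tokens that are not stopwords, lowercased."""
    return [t.lower() for t in tokens
            if t.isalpha() and t.lower() not in stop_words]


def generate_wordcloud(text, out_path="wordcloud.png", language="english"):
    """Tokenize the concatenated captions and render a word cloud image."""
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize
    from wordcloud import WordCloud

    words = filter_words(word_tokenize(text), set(stopwords.words(language)))
    WordCloud(width=800, height=400).generate(" ".join(words)).to_file(out_path)
    return out_path
```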
and then uploads it to Imgur before sending the URL to the worksheet:
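That upload step, using the `imgurpython` client from earlier, can be as small as this (a sketch; `upload_from_path` returns the new image’s metadata, including its link):

```python
def upload_wordcloud(client, path="wordcloud.png"):
    """Upload the rendered word cloud to Imgur and return its URL."""
    result = client.upload_from_path(path, anon=False)
    return result["link"]
```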
And that’s it!
For the sake of brevity, I didn’t show here every line of code used throughout the script, but if you are interested, you can check it all out at the Project’s Repository.
Next Steps
Now that we have an updated Google Sheet, we can use it as a data source to plug into Data Studio. The next step would be, of course, building the dashboard. As I said before, explaining that process here would make this story rather long, so I will leave it for some other time!
Thank you for letting me share the experience!