
Building a Data Pipeline for Instagram Comments — Part 1

A simple example of building a Data Pipeline solution on AWS

Brief Introduction

One day I was wondering how a football team's bad results can affect the fans' comments on its players' Instagram posts. After thinking about it for a while, I decided to try doing NLP analysis on the comments of one of my favourite football players' posts. I managed to extract the comments data from Instagram manually, and I wanted to automate this process so it would keep updating me with new comments.

Extract data from Instagram

Since the summer of 2019, Instagram has drastically updated its anti-automation algorithm. For that reason, I was looking for a legal and safe way to extract comments data from Instagram. I found Phantombuster, a tool that can automate loading the comments in the web browser and provide them in a JSON file uploaded to its AWS S3 bucket.

How does the Data Pipeline work?

A typical data pipeline extracts and refines different types of data from different data sources and then loads the data into a data warehouse such as Redshift, Snowflake, or BigQuery. For example, in every company disparate sets of data exist in different places; a data pipeline can bring those data sets together in one place where people can query across the different sources.

According to the Cloud Data Management book, data pipelines consist of four main stages: source, lake, warehouse and mart.

The 4 Stages of Data Sophistication — Cloud Data Management

As this post is the first part of the Instagram Data Pipeline, I will be implementing only the Data Source and Data Lake; in the next part I will build the Data Warehouse using AWS Redshift and Redshift Spectrum.

Extract and Load to Data Lake

Data Lake: a storage repository (usually a file system) that contains structured and unstructured data from multiple sources. On AWS, a data lake is usually built using S3. As I mentioned before, we are getting the Instagram comments data as a JSON file uploaded to Phantombuster's S3 bucket; thus, we need to download those files and upload them to our own data lake on AWS S3.

Firstly, we need to create an Instagram Post Commenters workflow:

  • Connect to your Instagram account:
  • In order to extract comments from multiple posts, create and fill a Google spreadsheet with the posts to process; only the post URLs are required. However, we also include the usernames so we can use them to organise objects in S3 with prefixes:
  • Get a shareable link for your spreadsheet
  • You can copy the spreadsheet URL from the shareable link dialogue or directly from the address bar and paste it into the relevant field in the Phantom’s setup:
  • If you’re using a spreadsheet where your input data isn’t found in column A, you can simply tell your Phantom which column to find it in by filling in the “Column name from which to take profile URLs” section of your Phantom’s setup. This will take the column title, e.g. “profileUrl” in the example below:
  • I assumed that the comments extractor will be triggered once a day to update the comments data; fortunately, Phantombuster allows us to do so. Otherwise, I would schedule an AWS Lambda function using EventBridge to invoke the Phantombuster workflow on a daily basis (see the sketch after this list).
  • Lastly, we need to configure a custom webhook that will be called at the end of the workflow execution. It will POST a payload containing some useful information about your Phantom’s status and result to its associated webhook URL. Building a webhook on AWS is super easy using API Gateway, a fully managed service that can be integrated with AWS Lambda (another fully managed service that lets you run your code without creating or managing any servers).
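As an aside on the scheduling alternative mentioned above: the pipeline in this post relies on Phantombuster’s own scheduler, but if that weren’t available, a daily EventBridge rule could trigger a Lambda function instead. The sketch below is purely illustrative and not part of the pipeline built here; the rule name, function name, and ARN are placeholders, and the Lambda itself would have to call Phantombuster’s launch API.

import boto3

events_client = boto3.client('events')
lambda_client = boto3.client('lambda')

# Placeholder ARN of a Lambda function that would launch the Phantombuster workflow
lambda_arn = 'arn:aws:lambda:eu-west-1:123456789012:function:launch-ig-phantom'

# Create an EventBridge rule that fires once a day
rule = events_client.put_rule(
    Name='daily-ig-comments-extraction',
    ScheduleExpression='rate(1 day)'
)

# Allow EventBridge to invoke the Lambda function
lambda_client.add_permission(
    FunctionName='launch-ig-phantom',
    StatementId='allow-daily-ig-comments-rule',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule['RuleArn']
)

# Point the rule at the Lambda function
events_client.put_targets(
    Rule='daily-ig-comments-extraction',
    Targets=[{'Id': 'launch-ig-phantom', 'Arn': lambda_arn}]
)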

Building AWS Lambda and API Gateway using CloudFormation:

Note: You can find the complete source code in the IG-ETL repository on GitHub

Get the API Gateway URL:

Add API Gateway URL to Phantombuster:

Time to start writing the code that will receive and process the Phantombuster webhook payload.

Example of the payload:

{
  "agentId": "5027055349780535",
  "agentName": "Test Script",
  "script": "test_script_50516785467.js",
  "scriptOrg": "phantombuster",
  "branch": "test-branch-0545107204",
  "launchDuration": 121,
  "runDuration": 1850,
  "resultObject": {...},
  "exitMessage": "finished",
  "exitCode": 0
}

The comments data will be in the resultObject, so in principle our job is to aggregate this data and upload it to our S3 data lake. However, doing that inside the Lambda function would cost more work and time: the function would run longer and therefore be billed for more, while Phantombuster has already done this work by writing the data to a JSON file uploaded to its own S3 bucket. Thus, our job is simply to bring that file into our data lake.
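Before going through the individual steps, here is a minimal sketch of how the Lambda function receives that payload. It assumes the Lambda proxy integration with API Gateway, where the webhook body arrives as a JSON string in event['body']; the handler name and the return value follow the usual Python Lambda conventions rather than being taken verbatim from the repository.

import json

def lambda_handler(event, context):
    # With a Lambda proxy integration, API Gateway passes the webhook's
    # POST body to the function as a JSON string in event['body']
    data = json.loads(event['body'])

    # ... the processing steps described below go here ...

    # Return a 200 response so API Gateway reports a successful call
    return {
        'statusCode': 200,
        'body': json.dumps({'message': 'payload received'})
    }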

Firstly, we need to check whether we have new comments or not; this can be done with:

if data['resultObject'] is not None:

Then we need to download the spreadsheet of posts from Google Sheets as a CSV file:

download = requests.get(
    "https://docs.google.com/spreadsheet/ccc?key=1AHy_iqUlDdd7NOh20PyU5vutMnKTLIn_y95JKW2hyig&output=csv")
decoded_content = download.content.decode('utf-8')

cr = csv.DictReader(decoded_content.splitlines(), delimiter=',')

As I mentioned before, we need the post’s username to organise objects in S3 using prefixes:

object_name = ""
for row in cr:
    # Match the spreadsheet row against the post this run was launched for
    if row['postUrl'] == json.loads(data['resultObject'])[0]['query']:
        # current_date is the run date, e.g. datetime.today().strftime('%Y-%m-%d')
        object_name = "raw_files/{0}/{1}.json".format(row['username'], current_date)
        break

We need to run the workflow manually first in order to get the JSON file URL:

Download the JSON file from Phantombuster’s S3 bucket:

file_url = "https://phantombuster.s3.amazonaws.com/7lwPTCxbhH8/zDCVZvofS135Ke3mlHcJ4Q/result.json"

file_data = requests.get(file_url).json()

# Lambda functions can only write to the /tmp directory
file_path = '/tmp/' + "{0}-raw.json".format(file_data[0]['username'])

with open(file_path, 'w') as f:
    json.dump(file_data, f, ensure_ascii=False, indent=4)

Upload the JSON File to S3:

try:
    response = s3_client.upload_file(file_path, bucket_name, object_name)
except ClientError as e:
    logging.error(e)
    return False

Run the Instagram comments extractor workflow:

Check data in S3:
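If you prefer to check from code rather than the S3 console, a quick boto3 listing of the raw_files/ prefix is enough; the bucket name below is a placeholder, so replace it with your own data lake bucket.

import boto3

# List the JSON files uploaded by the Lambda function under the raw_files/ prefix
s3 = boto3.client('s3')
response = s3.list_objects_v2(Bucket='my-instagram-data-lake', Prefix='raw_files/')

for obj in response.get('Contents', []):
    print(obj['Key'], obj['Size'], obj['LastModified'])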

All of the AWS services and infrastructure used in this post can be re-deployed using the AWS SAM CLI. The source code and SAM template can be found on my GitHub account.

Summary

In this article, I showed how to use the Phantombuster API to extract data from Instagram; the same approach can be applied to other social media platforms such as LinkedIn and Twitter. The extracted data includes the comment text, like count, comment date, etc. The collected data is public and can be very helpful for research or marketing purposes. With the provided data pipeline, we can automate the data extraction for multiple profiles and posts and move the data to our S3 bucket.

I hope you find this helpful. If you have any suggestions or comments, just write them in the comments below.

See you in the next post!
