Targeting Influencers with Machine Learning— Part 1 — Scraping Instagram

Raul Incze
Published in Cognifeed
7 min read · Sep 25, 2019

We’ve talked about what we’re doing at Cognifeed in our article Cognifeed — a Foreword. That was meant as an abstract overview of what we set out to do. Starting today, we will offer more concrete and pragmatic examples of how you can use our machine teaching platform to create and deploy your own ML-powered intelligent systems.

In this two-part article we’re going to take a look at how you can use Cognifeed to predict which Instagram influencers align best with your brand values. In the first part we’ll gather the data we need with an easy-to-use scraper. In the following article we are going to show you how to set up a new project in Cognifeed and start teaching our AI algorithm to pick up on your brand values in Instagram posts.

A refresher on influencers

Influencer marketing seems to be all the rage in the age of social media and… well… online influencers. Brands were already spending about $2 billion in 2017 on Instagram models, tech thought leaders, reviewers and (b/v)loggers to get their products featured. By 2022 it’s estimated that this branch of the marketing industry will reach $15 billion.

We’re sure that a large portion of this sum is due to the unfocused nature of the campaigns. Brands tend to spend on a wide array of influencers and don’t necessarily concentrate on those that would actually drive their sales the best. We believe that predicting the brand alignment of these influencers can provide this needed focus and improve the return on investment of influencer campaigns.

We’re not alone in believing that! Various brands teamed up with IBM’s Watson and their AI research teams to accomplish this, including Mazda. Here’s an article with a bit more info on this:

Now you can do it yourself!

Introducing our (mock) brand — Soupstainable

In order for this guide to make any sense let’s make up a brand! After all, we need those influencers to promote… something. Given that food tech is all the rage nowadays, let’s join in on it!

Introducing Soupstainable. Soupstainable is a plant-based, sustainably sourced, instant super-soup containing all the nutrients you need in a day. Now… let’s see what influencers we could target!

The initial list

It would be insane to scrape the whole of Instagram (although it would be quite awesome), so the first thing we need to do is put together a list of influencers in our space: food. This shouldn’t be too hard. There are a number of lists out there of influencers in various domains, and at this stage you don’t need to worry whether every influencer on that list is a perfect match for you. That’s what we’re going to use Cognifeed for.

For our example we’ve concatenated a few food, vegan and sustainability influencer lists from the internet and we ended up with 37 possible targets. You probably want more, but this will do for our experiment.

Our list of influencers

Of course, even with only 37 Instagram accounts, going through all of their posts to see if they’re a good fit for your brand requires an insane amount of work. Needless to say, it doesn’t scale very well.

Save your list in a text file and remember the path! Now that we have the names, let’s get the data!

Setting up requirements

First you’ll need to download and install Chrome, if you don’t already have it. Next, we need an interface so that we can control Chrome from our Python code. This interface is provided by ChromeDriver. Simply download it from their website and extract it into a folder. Note down the path to the extracted file.

Next you’ll need Python 3.x. Simply download and install the latest version from their website. There’s a good chance you already have Python installed if you’re running a Linux-based operating system or macOS.

Next, clone this repository using git, or simply download the project from here:

And finally, run the following command:

pip install -r requirements.txt

This will install all the 3rd party libraries that our project needs.

Behind the scenes

We won’t go into too many details when it comes to the code, but we will provide an overview of what’s happening and how the scraper is working. If you want to follow along, open up the cloned git project. You can also skip over this part any time you feel this is getting too technical for you!

First, in main, we open up our influencers.txt and load the list. We also instantiate an InstagramScraper object which we’ll use to get the data off of Instagram.
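As a rough illustration of that first step, here is a minimal sketch of loading the username list. The function name load_usernames and the cleanup rules (skipping blank lines and duplicates, dropping a leading "@") are our assumptions for this example and may not match what the repository does.

```python
from pathlib import Path


def load_usernames(path):
    """Read one Instagram username per line, skipping blanks and duplicates."""
    usernames = []
    seen = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        # Tolerate stray whitespace and a leading "@" copied from a profile.
        name = line.strip().lstrip("@")
        if name and name not in seen:
            seen.add(name)
            usernames.append(name)
    return usernames
```

Keeping the original order of the file makes it easier to match the output rows back to your list later on.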

Then, we iterate through all of our influencers. We get the latest 10 posts for each one of them by calling scraper.get_posts_from_user(influencer, 10). This returns a list containing the links to their latest 10 posts. Next, we use the same scraper object to get the data we need from the posts behind those links by calling scraper.get_data(links).
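The loop described above could look roughly like the sketch below. To keep it runnable without a browser, it uses a hypothetical FakeScraper that only mimics the two method names mentioned in this article. The return types (a list of links, then a list of dicts) and the error handling are assumptions, not a copy of the project’s main.

```python
class FakeScraper:
    """Hypothetical stand-in for InstagramScraper, so the loop runs offline."""

    def get_posts_from_user(self, username, number_of_posts):
        # The real method returns links to the user's latest posts.
        return [
            f"https://www.instagram.com/p/{username}{i}/"
            for i in range(number_of_posts)
        ]

    def get_data(self, links):
        # The real method downloads each post and extracts its fields.
        return [{"shortcode": link.rstrip("/").rsplit("/", 1)[-1]} for link in links]


def collect_posts(scraper, influencers, posts_per_user=10):
    """Gather post data for every influencer, skipping accounts that fail."""
    rows = []
    for influencer in influencers:
        try:
            links = scraper.get_posts_from_user(influencer, posts_per_user)
            rows.extend(scraper.get_data(links))
        except Exception as error:  # one broken profile should not stop the run
            print(f"Skipping {influencer}: {error}")
    return rows


rows = collect_posts(FakeScraper(), ["alice", "bob"], posts_per_user=3)
print(len(rows))  # 6
```

Catching errors per influencer is a design choice for long runs: a private or renamed account then costs you one profile instead of the whole dataset.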

Behind the scenes, get_posts_from_user(username, number_of_posts, wait_for_scroll) navigates to username’s Instagram page using Selenium and a headless Chrome instance controlled through ChromeDriver. It then scrolls down the user’s page until number_of_posts posts have loaded. We save the links to these posts and return them.
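To make the scroll-and-collect idea more concrete, here is a simplified sketch. It is not the repository’s implementation. The function collect_post_links, the max_scrolls cap and the "/p/" filter for post links are assumptions. The driver argument is expected to be a Selenium WebDriver (only its get and execute_script methods are used), created elsewhere with your ChromeDriver path.

```python
import time


def collect_post_links(driver, username, number_of_posts,
                       wait_for_scroll=1.0, max_scrolls=20):
    """Scroll a profile page until enough post links have been seen."""
    driver.get(f"https://www.instagram.com/{username}/")
    links = []
    for _ in range(max_scrolls):
        # Ask the browser for every anchor's href on the page as it stands now.
        hrefs = driver.execute_script(
            "return Array.from(document.querySelectorAll('a')).map(a => a.href);"
        )
        for href in hrefs:
            # Assumption: post permalinks contain "/p/".
            if "/p/" in href and href not in links:
                links.append(href)
        if len(links) >= number_of_posts:
            break
        # Scrolling to the bottom triggers loading of the next batch of posts.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(wait_for_scroll)
    return links[:number_of_posts]
```

Links are accumulated across scrolls rather than read once at the end, in case the page unloads earlier rows as you move down. The max_scrolls cap keeps the loop from running forever on accounts with fewer posts than requested.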

The second task our scraper accomplishes is to take this list of posts and load each one of them with urlopen. It then parses the response and gets the data we need: media (link to the image/thumbnail for the post), comments (number of comments), likes (number of likes), username, shortcode (which will serve as an id for the post) and caption (the text component of the post, containing hashtags).
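The sketch below shows how one post’s parsed JSON could be flattened into those six fields. The nested key names (display_url, edge_media_to_comment, edge_media_preview_like, edge_media_to_caption, owner) are assumptions modelled on how Instagram’s embedded post data looked around the time of writing. They may differ from the keys used in the repository and from what Instagram serves today.

```python
def extract_post_fields(media):
    """Flatten one post's JSON (assumed layout) into the six fields we keep."""
    caption_edges = media.get("edge_media_to_caption", {}).get("edges", [])
    return {
        "media": media.get("display_url"),
        "comments": media.get("edge_media_to_comment", {}).get("count", 0),
        "likes": media.get("edge_media_preview_like", {}).get("count", 0),
        "username": media.get("owner", {}).get("username"),
        "shortcode": media.get("shortcode"),
        # Posts without a caption have an empty edge list.
        "caption": caption_edges[0]["node"]["text"] if caption_edges else "",
    }


# Made-up sample data, only to show the shape of the input.
sample = {
    "shortcode": "abc123",
    "display_url": "https://example.com/image.jpg",
    "owner": {"username": "alice"},
    "edge_media_to_comment": {"count": 3},
    "edge_media_preview_like": {"count": 10},
    "edge_media_to_caption": {"edges": [{"node": {"text": "Soup time #vegan"}}]},
}
post = extract_post_fields(sample)
print(post["caption"])  # Soup time #vegan
```

Using .get with defaults means a post that is missing a field produces an empty value instead of crashing the whole run.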

Part of the scraper’s code and the keys used to parse the post page are taken from Srujana Takkallapally’s post:

Make sure to check it out if you want to learn more!

Running the code

Alright, enough beating around the bush! Let’s run the scraper! All you have to do is navigate to the cloned project and run the following command:

python main.py --username_list USERNAME_LIST_PATH --chromedriver_path CHROMEDRIVER_PATH --out_file OUT_FILE

Where:

  • USERNAME_LIST_PATH is the path to the file you’ve created containing influencers’ usernames.
  • CHROMEDRIVER_PATH is the path to your ChromeDriver binary file.
  • OUT_FILE is the path to the file where you want to save the data scraped from Instagram.

Here’s an example:

python main.py --username_list ./lists/shortlist.txt --chromedriver_path ./chromedriver.exe --out_file dataset_v1.csv
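If you are curious how a script can accept these three options, here is a small argparse sketch. The flag names come from the command above. Marking them as required and the help texts are our assumptions, so the project’s own main.py may be set up differently.

```python
import argparse


def build_parser():
    """Build a parser for the three options used in the command above."""
    parser = argparse.ArgumentParser(
        description="Scrape recent posts for a list of Instagram users."
    )
    parser.add_argument("--username_list", required=True,
                        help="text file with one username per line")
    parser.add_argument("--chromedriver_path", required=True,
                        help="path to the ChromeDriver binary")
    parser.add_argument("--out_file", required=True,
                        help="CSV file to write the scraped posts to")
    return parser


# Parse the example command's arguments from a list instead of sys.argv.
args = build_parser().parse_args([
    "--username_list", "./lists/shortlist.txt",
    "--chromedriver_path", "./chromedriver.exe",
    "--out_file", "dataset_v1.csv",
])
print(args.out_file)  # dataset_v1.csv
```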

And that’s it! The scraper takes care of everything: from loading up your list, to scraping the post info for each influencer in it and finally saving the data in a format recognized by Cognifeed (csv). Here’s what yours should look like:

The gist was being displayed badly, so here’s a picture. And here’s the whole file as well.

Each row represents one post. Every column holds one piece of info we wanted to collect about that post, such as a link to the (static) media, the post’s description, the number of likes and so on. If you copy and paste the value of the media column into your browser, you’ll be able to load the image from the post!
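For a quick sanity check of such a file, the standard csv module is enough. The sketch below writes one made-up row and reads it back. The column order in COLUMNS is an assumption based on the fields listed earlier, so check the header of your own file before relying on it.

```python
import csv
import io

# Assumed column order; the real file's header may differ.
COLUMNS = ["media", "comments", "likes", "username", "shortcode", "caption"]


def write_posts(rows, handle):
    """Write post dicts as CSV with a header row; captions may span lines."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer stands in for the output file in this example.
buffer = io.StringIO()
write_posts([{
    "media": "https://example.com/image.jpg",
    "comments": 3,
    "likes": 10,
    "username": "alice",
    "shortcode": "abc123",
    "caption": "Lentil soup, again!\n#vegan",
}], buffer)

buffer.seek(0)
loaded = list(csv.DictReader(buffer))
print(loaded[0]["username"])  # alice
```

Captions often contain commas and line breaks, which is why it is safer to let the csv module handle quoting than to join the values by hand. Note that everything comes back as text, so numbers such as likes need converting before you do arithmetic on them.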

Wrapping up

Of course, the code is quite simplistic and hacked together. As soon as Instagram changes their front-end, the scraper might stop working. If you encounter any issues with the scraper, let us know in the comments below. If you feel you can improve something, leave us a comment, open an issue or submit a pull request.

For more general-purpose Instagram scraping, you can always use projects like these:

Now we have the data we need to do some machine learning. In the next part we’ll go through how to import this data into Cognifeed and how you can teach our machine learning algorithm to predict how well these Instagram posts align with your brand’s values!

Stay tuned for part two and don’t forget to follow us on Twitter, Facebook and LinkedIn!
