Tutorial: Web scraping Instagram’s most precious resource — corgis.

John Naujoks
May 14, 2019 · 10 min read

I love data. BUT…I do love my corgi Ellie more.

[Photo: Ellie (Instagram)]

If you have opened a web browser sometime in the last 10 years, there is a good chance you've seen a corgi before. For some people, they are pure joy-inducing creatures. I can't count how often people stop me to say hello when I'm taking Ellie for a walk. She's a bit like walking around with a Disney costumed character: people scream, she politely says hello, occasionally photos are taken, and then everyone goes about their business.

Instagram in particular is an active place to find corgis cashing in on their cuteness. I have seen several corgis post sponsored ads for a wide variety of brands looking to get in on that corgi clout. I've considered similarly using Ellie's talents for my benefit, but navigating social fame on Instagram isn't intuitive.

Looking to shortcut my success with the power of data, I decided to see if I could figure out if there are particular trends in how other popular corgis on Instagram function. My goals were to:

  • Gather links to the 25 most recent posts for some popular corgis
  • Gather the details of their posts (likes/views, hashtags used) to see if there are any helpful trends.

To help answer these questions, I decided to web scrape Instagram to find information on how the top dogs become… the top dogs.

Web scraping allows you to interact with information on a website and extract it for analysis and interpretation. Though many websites have publicly accessible APIs, many others do not. Reasonable, non-disruptive web scraping is a great way to gather information from those kinds of sources.

This tutorial will give you a hands-on example of web scraping with Python using Selenium, and go into a couple of the nice features that make it so convenient for the process. It requires a basic understanding of HTML and Python, but don't be scared to give it a try.

The full details for this tutorial can be found on my GitHub here.

1. Getting set up with Selenium

To get this project started, we are going to import Selenium and a few other tools to help: pandas to temporarily store and explore our results, and time to make our requests more natural and less taxing on the website.

import pandas as pd
import time
from selenium.webdriver import Chrome

The basic workflow for Selenium is:

  • Instantiate a web browser (in this example using Chrome)
  • Point the browser object at a specific webpage
  • Use the browser object to interact with the webpage

Here would be the basic setup to do these steps:

browser = Chrome()
url = "https://www.instagram.com/"
browser.get(url)

Doing these simple steps will launch a browser window that goes to the url specified. You may think “Oh man, who opened a new browser window?” The answer is you!


This simple setup instantiates a browser window that you now control. Once it's up, we can extract elements on the page with a range of commands. Just as an example, here are some of the ones we will use in our first function:

# Retrieve every link from a page
browser.find_elements_by_tag_name('a')

# Retrieve a link from a page that has "likes" in its text
browser.find_element_by_partial_link_text('likes')

From this point on, we are going to see how we start from this basic step and use our browser to extract details from the page.

2. Getting the 25 most recent post links.

def recent_25_posts(username):
    """With the input of an account page, scrape the urls of the 25 most recent posts"""
    url = "https://www.instagram.com/" + username + "/"
    browser = Chrome()
    browser.get(url)
    post = 'https://www.instagram.com/p/'
    post_links = []
    while len(post_links) < 25:
        links = [a.get_attribute('href') for a in browser.find_elements_by_tag_name('a')]
        for link in links:
            # Guard against anchors with no href, then keep new post links only
            if link and post in link and link not in post_links:
                post_links.append(link)
        scroll_down = "window.scrollTo(0, document.body.scrollHeight);"
        browser.execute_script(scroll_down)
        time.sleep(10)
    return post_links[:25]

Here is a little more explanation of what each part of this function is doing:

  • The function takes in a username, adds it to the basic Instagram url and opens that public page, and off we go!
  • post_links is the empty list that will hold our final links. post is the prefix we use to sort the links found on the page; every post link starts with this format.
  • While post_links holds fewer than 25 links, we retrieve every link on the page. To extract the actual url, we call get_attribute('href') on each link element to pull the href attribute. If the link matches our post url format and isn't already in our list, we add it to post_links.
  • The browser then scrolls down to the bottom of the page, waits ten seconds, and then grabs any new links it finds on the page, until we hit our limit.

Infinity corgs: dealing with infinite scroll

scroll_down = "window.scrollTo(0, document.body.scrollHeight);"
browser.execute_script(scroll_down)

So what’s going on?

If you open an Instagram account page, you may notice that it has infinite scroll. When we are trying to extract elements from our browser, it can only see and grab elements that are within the current HTML of the page; more content only loads once you scroll to the bottom. For our Instagram page, that means we need to keep scrolling down to surface additional post links. The important thing to remember about Selenium is that we control the browser, so anything a browser can do, we can script. The snippet tells our browser to scroll to the coordinates (0, document.body.scrollHeight), which is the very bottom of the page, so we can have infinite corgis.

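As an aside, if you ever want to scroll until the page simply runs out of new content (rather than until you hit a fixed count), one common pattern is to compare the page height before and after each scroll. A minimal sketch, not something our function above needs:

last_height = browser.execute_script("return document.body.scrollHeight")
while True:
    browser.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(10)
    new_height = browser.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        # No new content loaded, so we've hit the true bottom
        break
    last_height = new_height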

Time out!

time.sleep(10)

This creates a brief pause in the function. You may want all the corgis now, but you must be patient. This 10-second delay does two things for us: 1. the infinite scroll takes a few seconds to actually load the new content, and looping instantaneously would just be a glitchy mess; 2. when scraping, we are trying to keep our browsing as natural as possible. We may be browsing via a bot-controlled window, but we don't have to interact with the page like we are one. The less stress on a website, the more responsible the scrape (more on that at the end).
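If you want the pauses to feel a little less robotic, you could also randomize the delay slightly; this is an optional tweak, not something the functions in this tutorial depend on:

import random

# Pause for a random interval of roughly 8 to 12 seconds
time.sleep(random.uniform(8, 12))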

With all these pieces together, our function is complete. Let’s test the function out on Sneakers the Corgi:
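Here's a minimal test call; sneakersthecorgi is the handle I'm assuming based on the hashtag we'll see later, and the printed values are what we expect rather than guaranteed output:

sneakers_links = recent_25_posts('sneakersthecorgi')
print(len(sneakers_links))    # 25
print(sneakers_links[0])      # something like 'https://www.instagram.com/p/...'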


Looks good! Now that we have the links for the posts, let’s look at getting the details.

3. Extracting the post details from our links.

For each post link, we want to:

  • Go to each of these pages
  • Get the initial post comment
  • Get the number of likes or views for the image or video.
  • Get the time the post was made.

Here is our complete function, which we will walk through in more detail:

def insta_details(urls):
    """Take a list of post urls and return the details for each post"""
    browser = Chrome()
    post_details = []
    for link in urls:
        browser.get(link)
        try:
            # This captures the standard like count.
            likes = browser.find_element_by_partial_link_text(' likes').text
        except:
            # This captures the view count for videos, which is stored
            # in a span rather than a link.
            xpath_view = '//*[@id="react-root"]/section/main/div/div/article/div[2]/section[2]/div/span'
            likes = browser.find_element_by_xpath(xpath_view).text
        age = browser.find_element_by_css_selector('a time').text
        xpath_comment = '//*[@id="react-root"]/section/main/div/div/article/div[2]/div[1]/ul/li[1]/div/div/div'
        comment = browser.find_element_by_xpath(xpath_comment).text
        insta_link = link.replace('https://www.instagram.com/p', '')
        post_details.append({'link': insta_link, 'likes/views': likes, 'age': age, 'comment': comment})
        time.sleep(10)
    return post_details

Hopefully some of these pieces look familiar by now, but let's walk through it:

  • Start with an empty list of post details that we will return at the end. We make a dictionary of details for each post and then add it to this list.
  • Using a try statement, we check if there is a link with the word 'likes' in it, which is the most common place for the like count. We use .text to get the link's actual text, which we store as a variable for our details. Pages with videos display 'views' instead of 'likes', and this detail is not stored in a link, so we have to access it a little differently.

XPath Gon’ Give It To Ya

xpath_view = '//*[@id="react-root"]/section/main/div/div/article/div[2]/section[2]/div/span'

This is called an XPath, otherwise known as an XML path. It is syntax for finding any element on a web page using an XML path expression. In some webpage structures, we have easy hooks we can use to pinpoint where we would like to interact or extract; on many websites, things like CSS classes or tags are very common and can often be used to find what you need. However, that is not what we see when we inspect our Instagram page.


Though our page structure is standard, the attributes do not appear to have standard names. These classes are often dynamically generated by JavaScript and are used to offer greater flexibility to web elements. Since we can count on the page structure being the same, we can use the XPath to find the same section on all pages. To get the correct XPath, you should do the following:

  1. Find the element you are interested in, right-click it, and select "Inspect".
  2. In the Inspector view, right-click the element it points you to, select "Copy", and choose "Copy XPath".

XPath is extremely helpful when you know exactly where on the page you would like to extract.
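Once you have the copied path, using it in Selenium is a single call. You can also hand-write shorter relative XPaths; the second example below is illustrative, not copied from Instagram's markup:

# Absolute XPath copied straight from the browser's inspector
views = browser.find_element_by_xpath(xpath_view).text

# Relative XPath: find the <time> element inside any <a> tag
timestamp = browser.find_element_by_xpath('//a/time').text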

Our final details
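To pull everything together, we can run the details function on the links we scraped earlier and drop the result into pandas for exploration; a quick sketch, reusing the sneakers_links variable from our earlier test:

sneakers_details = insta_details(sneakers_links)
df = pd.DataFrame(sneakers_details)
df.head()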


Pretty nice! It looks like Sneakers the Corgi actually uses his own hashtag in comments often. With all of this in place, I ran our functions on a few other top dogs to look for insights: Tibby the Corgi (242k followers), Winston the White Corgi (232k followers), Ralph the Corgi (315k followers), and Geordi La Corgi (396k followers!).

Just as a quick example, here are the stats for Sneakers the Corgi:

[Screenshot: summary stats for Sneakers the Corgi's posts]

I am now well on my way to catching up with the big guys. Our data is a little messy, but after some exploration, here are a couple of quick observations on the data I collected:

  • The primary hashtags that appear are either for national events (#EarthDay, #MothersDay, #AprilFools) or custom ones specific to the dog (#sneakersthecorgi, #ralphandgeorge, #winstonsroadtofetch). The top hashtags found across all accounts are as follows (see the counting sketch after this list):
#sneakersthecorgi: 24
#shopsneakersthecorgi: 6
#TakeoverByGeorge: 6
#RalphandGeorge: 10
#winstonsroadtofetch: 10
  • Posting typically happens every 2 to 4 days, with some daily streaks.
  • The common approach seems to be a regular mix of photos and videos.
  • Videos are a bit complex in terms of engagement: they often appear to get 4x as much engagement, but those views probably include the same users watching multiple times.
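For reference, here is a sketch of how counts like the ones above can be produced from the scraped comments, assuming they live in the df DataFrame from earlier:

import re
from collections import Counter

# Tally every hashtag that appears across the scraped comments
hashtags = Counter()
for comment in df['comment']:
    hashtags.update(re.findall(r'#\w+', comment))

print(hashtags.most_common(5))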

A quick note on scraping, the law, and YOU

As promised earlier: scrape responsibly. Before you start, check the site's terms of service and robots.txt, keep your request rate low (hence our sleep timers), and only collect information that is already publicly visible. A scraper that hammers a server or hoovers up private data is a bad neighbor at best and a legal problem at worst.

Bonus: Extracting photos
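Here is a minimal sketch of how you might grab the photo from a post page with the same tools; the 'article img' selector is an assumption about Instagram's markup, so inspect the page and adjust if it doesn't match:

import urllib.request

def save_post_image(post_url, filename):
    """Download the main image from a post page."""
    browser = Chrome()
    browser.get(post_url)
    # Assumes the post's photo is the first <img> inside the <article> element
    img = browser.find_element_by_css_selector('article img')
    urllib.request.urlretrieve(img.get_attribute('src'), filename)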

I hope you all enjoyed looking at all of these corgis with me, thanks for reading!
