How to Scrape SocialBlade for YouTube Subscription Data

GDPR and YouTube regulations make it harder than you might think

Anjali Shrivastava
The Startup
Sep 14, 2020 · 6 min read


In my latest video, I analyzed the impact of the popular YouTube series Content Cop, and acquiring the data I needed took much longer than anticipated. In this post, I’ll explain how to scrape subscription and view data for a YouTube channel so you don’t have to go through the same frustrations that I did.

If you’ve ever used the YouTube API, you’ll know that it only returns current subscriber counts: you cannot go back in time or fetch historical subscription data with the API. Now, you could build your own database and update it daily with API calls. Or, you can take advantage of the fact that SocialBlade has already done this, and scrape your data from that third-party website.
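For context, here is roughly what a current-count lookup through the Data API looks like. The channels.list endpoint with part=statistics is the real endpoint, but the helper names and key handling below are just a sketch:

```python
import requests

API_KEY = "YOUR_API_KEY"  # placeholder: supply your own Data API key


def parse_subscriber_count(payload):
    """Pull the subscriber count out of a channels.list JSON response."""
    return int(payload["items"][0]["statistics"]["subscriberCount"])


def current_subscriber_count(channel_id):
    """Snapshot of a channel's subscriber count at call time.

    There is no parameter for historical values; the API only ever
    returns the present number.
    """
    resp = requests.get(
        "https://www.googleapis.com/youtube/v3/channels",
        params={"part": "statistics", "id": channel_id, "key": API_KEY},
    )
    resp.raise_for_status()
    return parse_subscriber_count(resp.json())
```

Run this daily and store the results, and you have the do-it-yourself database mentioned above; the rest of this post takes the SocialBlade shortcut instead.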

However, there are two pitfalls to watch out for when using SocialBlade. The first is that it only makes data from the last three years publicly available, in order to comply with GDPR. The second is that in late 2019, it changed which data is available, in response to YouTube changing what its API exposes.

Screenshot of SocialBlade for YouTube user ‘LeafyIsHere’ on 9/13/20, showing that only data from last three years is available. Image by author.

Now, I needed data from 2016, which falls well outside that three-year window. So in order to access this data, I simply plopped the URL I wanted to scrape into the Wayback Machine to get an earlier version of the page. Any page that was archived before May 15, 2018 will have data going back to the channel’s inception.
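A Wayback Machine snapshot URL is just the snapshot timestamp prepended to the live URL, so a small helper (my own, hypothetical function, matching the URL used later in this post) can build one:

```python
def wayback_url(timestamp, live_url):
    """Build a Wayback Machine snapshot URL.

    `timestamp` is in YYYYMMDDhhmmss form; per the note above, any
    SocialBlade snapshot taken before May 15, 2018 should still carry
    the channel's full history.
    """
    return f"https://web.archive.org/web/{timestamp}/{live_url}"


url = wayback_url(
    "20161218062757",
    "https://socialblade.com/youtube/user/leafyishere/monthly",
)
```

You can browse a page's available snapshot dates on web.archive.org itself and drop the timestamp of whichever one you pick into the helper.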

(Left) All available dates that SocialBlade page for YouTube user ‘LeafyIsHere’ was archived. (Right) Archived SocialBlade page for ‘LeafyIsHere’ from December 2016, showing that data is available going back to the channel’s inception, in 2012. Images by author.

Once you have the URL, you’re ready to start scraping! Here is a link to the GitHub repo with the scraping code and resulting datasets.

Note: The code I’m sharing here works for SocialBlade URLs that end in “/monthly” and were archived before 2019. I have not checked if it works for any other versions or URLs on the site.

I have intentionally written the code so that it can be reused for many projects.

I first defined a function to scrape the targeted data values from SocialBlade. The function, sub_scraper, takes two arguments: the URL you are trying to scrape, and the variable you want to extract. There are four options for the var input: ‘count’, the daily change in subscribers; ‘total’, the total number of subscribers; ‘views’, the daily change in channel views; and ‘views_tot’, the total number of channel views.

This function returns a list that contains the available dates and respective values. Here is the function:

import requests as req
from bs4 import BeautifulSoup as bs

def sub_scraper(url, var):
    r = req.get(url)
    print(r.status_code)  # 200 means the page was fetched successfully
    soup = bs(r.text, 'lxml')
    script_divs = soup.find_all('script', {'type': 'text/javascript'})
    res = 0
    for i in range(len(script_divs)):
        # The four data series sit in consecutive script tags, starting
        # at the first one that mentions "CSV".
        if "CSV" in str(script_divs[i]):
            if var == 'count':
                res = script_divs[i]
            elif var == 'total':
                res = script_divs[i + 1]
            elif var == 'views':
                res = script_divs[i + 2]
            elif var == 'views_tot':
                res = script_divs[i + 3]
            break
    # The data is embedded as concatenated JS string literals; split
    # them apart and strip the quoting.
    lst = str(res).split('+')
    lst = [piece.strip() for piece in lst]
    lst = [piece.replace('\\n"', '').replace('"', '') for piece in lst]
    return lst
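To see what those last three lines do, here they are in isolation on a mock script-tag body. The archived pages embed each series as concatenated JavaScript string literals, which is what the split on ‘+’ targets; the values below are invented for illustration:

```python
# Mock of the JS string concatenation found inside the chart <script> tag
# (in the page source, "\n" is a literal backslash-n, hence the doubling).
raw = '"Date,Subs\\n" + "2016-12-01,4500\\n" + "2016-12-02,4700\\n"'

lst = raw.split('+')                    # one string literal per entry
lst = [piece.strip() for piece in lst]  # drop the surrounding spaces
lst = [piece.replace('\\n"', '').replace('"', '') for piece in lst]  # strip quoting
# lst is now ['Date,Subs', '2016-12-01,4500', '2016-12-02,4700']
```

Each remaining entry is a "date,value" pair, which is exactly the shape the next function expects.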

I then defined a function, to_df, that parses the list sub_scraper returns and converts it into a dataframe. This function also takes the channel name as an argument.

import pandas as pd

def to_df(url, name, var):
    lst = sub_scraper(url, var)
    print(len(lst))  # number of raw entries scraped
    lst = lst[1:len(lst) - 1]  # drop the header entry and the trailing fragment
    df = pd.DataFrame()
    df['Date'] = [x.split(',')[0] for x in lst]
    df['Subs'] = [x.split(',')[1] for x in lst]
    df['Name'] = name
    return df

And finally, I created functions to filter the dataframe by date. You can skip this step if you want to use all of the data that is available on SocialBlade. The function filterdate takes a string in year-month-day format (e.g. ‘2016-05-19’) and returns a dataframe covering one month before the given date to one month after (e.g. ‘2016-04-19’ to ‘2016-06-19’).

from datetime import date
from dateutil.relativedelta import relativedelta

def checkmonth(check, year, month, day):
    # True if the date string `check` falls within one month of the
    # target date, on either side.
    target = date(year, month, day)
    check = date.fromisoformat(check)
    bounds = [target + relativedelta(months=-1), target + relativedelta(months=+1)]
    return bounds[0] <= check <= bounds[1]

def filterdate(date_str, df):
    target = date.fromisoformat(date_str)
    return df[df['Date'].apply(checkmonth, args=(target.year, target.month, target.day))]

This function can be easily modified by editing the bounds line in checkmonth if you want a wider or narrower range of values.
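For example, swapping the months arguments in that line for weeks narrows the window to two weeks on either side (the target date here is just for illustration):

```python
from datetime import date
from dateutil.relativedelta import relativedelta

target = date(2016, 9, 13)

# Default window in checkmonth: one month either side of the target.
monthly = [target + relativedelta(months=-1), target + relativedelta(months=+1)]
# -> [date(2016, 8, 13), date(2016, 10, 13)]

# Narrower alternative: two weeks either side.
biweekly = [target + relativedelta(weeks=-2), target + relativedelta(weeks=+2)]
# -> [date(2016, 8, 30), date(2016, 9, 27)]
```

relativedelta also accepts days= and years=, so the window can be tuned to whatever span your analysis needs.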

And a full call using all of these functions looks like this:

filterdate('2016-09-13', to_df('https://web.archive.org/web/20161218062757/https://socialblade.com/youtube/user/leafyishere/monthly', 'LeafyIsHere', 'count'))

And again, if you do not want to filter by date, you can simply call to_df.

In 2019, YouTube changed how subscriber counts are displayed on the website, which subsequently affected the numbers returned via API calls. Prior to 2019, the API returned the exact number of subscribers for each channel. Now, however, the API rounds the number down to 3 significant digits.

This means that a channel with 123 subscribers would display as 123; a channel with 51,734 subscribers would display as 51.7K; a channel with 349,999 subscribers would display as 349K subscribers; and a channel with 10,291,544 subscribers would display as 10.2M subscribers.
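Note that a channel with 349,999 subscribers displaying as 349K (not 350K) implies the count is truncated to three significant digits rather than rounded. Here is a sketch of that rule; display_count is my own name for it, not an API function:

```python
import math

def display_count(n):
    """Truncate a count to 3 significant digits with K/M/B suffixes,
    matching the pattern in the examples above."""
    if n < 1000:
        return str(n)
    for divisor, suffix in ((10**9, 'B'), (10**6, 'M'), (10**3, 'K')):
        if n >= divisor:
            value = n / divisor
            digits = int(math.log10(value)) + 1  # digits before the decimal point
            decimals = 3 - digits                # decimals left for 3 sig figs
            truncated = math.floor(value * 10**decimals) / 10**decimals
            return f"{truncated:.{decimals}f}{suffix}"
```

Running it on the four examples above reproduces 123, 51.7K, 349K, and 10.2M.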

This change naturally affected which data SocialBlade makes available. Prior to it, SocialBlade published exact subscriber and view counts for each day, along with the daily change in both. Now, SocialBlade only provides weekly numbers.

SocialBlade also redesigned their website. I have not checked if my scraping code works for the redesign, but my guess is that it does not. Please keep this in mind if you are intending to scrape data either in or after 2019.

Here is a link to my resulting dataset. And here are some of the plots I produced with this dataset:

(Left) The absolute change in subscribers for each of the channels whose data I scraped. (Right) The percent change in subscribers for each of the channels. Images by author.

Here is the code for producing a plot similar to the image on the left, if that is of interest:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# names, cc, and colors are defined in the accompanying notebook:
# the channel names, the combined dataframe, and one color per channel.
for i in range(len(names)):
    df = cc[cc.Name == names[i]]
    plt.plot(np.arange(len(df.index)), pd.to_numeric(df.Subs),
             label=df.Name.values[0], color=colors[i])
plt.axvline(x=30, color='red', linestyle='dashed', linewidth=0.5)
plt.ylabel('Change in Subscribers by Day')
plt.xlabel("Day (0 is 1 month before Content Cop's release)")
plt.legend()
plt.annotate('Content Cop released', xy=(30, 20000), ha='center', size=10)
plt.title('The Impact of Content Cop')

And if you’re interested in seeing more graph examples and the code to produce them, please see this Jupyter notebook.

And last, but not least, if you’re interested in the results of my analysis of Content Cop, check out my YouTube video!

https://www.youtube.com/watch?v=EJpJWYTdtPc
