Crawling a JSON API into Google DataStudio

3 steps to integrate external data into your company’s workflow

Recently I’ve been working a lot on gathering knowledge from the web and making it available for myself and others to analyse. In particular, I have been looking at the APIs of top websites and setting up processes for taking that information from an endpoint to a datastore that can be queried on demand.

Using a JSON API over the traditional approach of accessing the webpage and extracting the information from the HTML/CSS has many benefits. In particular, the data arrives in a neat, structured format, which means it’s easier to work with and explore, as well as being less susceptible to website changes. On top of this, there is often much more data available in the JSON API than is actually rendered on the page, which means you can access ‘secret’ information.

In this article, I’d like to outline the approach that I have used, to show you just how easy it can be. As a quick summary, we will:

  • start with an API endpoint
  • use Scrapy and Scraping Hub to schedule crawling jobs with Python
  • pull and process the data using Google Scripts
  • forward the data onto Google BigQuery where it will be stored
  • create a quick and easy dashboard using Google DataStudio

The best part: it’s all free (using Google’s 12 month trial)!

To pull off something like this requires knowledge of APIs, Python, Google Scripting and dashboard building. It reflects the multiple levels of knowledge that a good Data Analyst should have when working for an online company. If you don’t have all of these skills, not to worry, read on! If you get lost, pause and read a little more background on that area before continuing.

The API Endpoint

The API endpoint that we will experiment with is Instagram’s, since it is widely used and applicable across many industries. We will aim to monitor the posts of our top competitors and see which ones bring in the most followers.

The endpoint looks something like this:

https://www.instagram.com/yourHandle/?__a=1

which, in the case of the account “instagram”, will return JSON data looking like:

{
  "logging_page_id": "profilePage_25025320",
  "show_suggested_profiles": true,
  "graphql": {
    "user": {
      "biography": "Discovering and telling stories from around the world. Founded in 2010 by @kevin and @mikeyk.",
      "blocked_by_viewer": false,
      "country_block": false,
      "external_url": null,
      "external_url_linkshimmed": null,
      "edge_followed_by": {
        "count": 244279586
      },
      ...

Wow, that’s a lot of followers.

If you’re trying to follow along, you can get the JSON simply by putting the URL into your address bar and then using a tool such as jsonformatter.curiousconcept.com to parse the JSON into a more human-readable format.
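If you’d rather do this from a script, a minimal sketch in Python (assuming the requests library is installed; note that Instagram may rate-limit or redirect anonymous requests) looks something like:

import json

import requests

# Fetch the profile endpoint; this is just a quick way to inspect the response.
url = "https://www.instagram.com/instagram/?__a=1"
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})

data = response.json()
# Pretty-print so the structure is easier to read than the raw response.
print(json.dumps(data, indent=2))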

Delving into the JSON, we see that the particular information we care about, the posts, sits under:

response['graphql']['user']['edge_felix_combined_post_uploads']['edges']

And the data (showing just the relevant fields) for one of these posts looks like this:

"node":{  
"__typename":"GraphVideo",
"id":"1827796522823934046",
"edge_media_to_caption":{
"edges":[
{
"node":{
"text":"Hello, world! Meet four love-struck Pacific parrotlets named Winter, Sprig, River and Willow (@freyaeverafter_)."
}
}
]
},
"shortcode":"Bldo0DgnWBe",
"edge_media_to_comment":{
"count":526
},
"taken_at_timestamp":1532110486,
"dimensions":{
"height":1340,
"width":750
},
"display_url":"https://instagram.fsin2-1.fna.fbcdn.net/vp/f9785a26d7120833ba3b99ec3e5eb13a/5B5EB778/t51.2885-15/e15/37119199_422055764959713_4492943422667096064_n.jpg",
"edge_liked_by":{
"count":13720
},
"edge_media_preview_like":{
"count":13720
},
"owner":{
"id":"25025320"
},
"is_video":true,
"is_published":true,
"product_type":"igtv",
"title":"Birds of a Feather with @freyaeverafter_",
"video_duration":91.891
}

OK, so we have a starting point. In your case it may be a competitor’s API or any number of other things, but the approach is the same: identify how the information is returned and exactly which fields you would like to extract. The processes that follow will then apply in the same way.
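To make that exploration step concrete, here is a small Python sketch (reusing the data dictionary from the snippet above) that navigates to the posts and lists the fields available on the first one:

# Navigate to the list of posts identified above.
posts = data["graphql"]["user"]["edge_felix_combined_post_uploads"]["edges"]
print("Found", len(posts), "posts")

# Each entry wraps the actual post in a "node" key; listing its fields
# helps you decide which ones are worth keeping.
for field in sorted(posts[0]["node"]):
    print(field)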

Crawling with Scrapy

Now that we have looked at the JSON and know what data we want, we need to set up a procedure that will talk to the endpoint and pull the required information on a defined schedule.

The tool that we will use for this is www.scrapinghub.com, a platform built around Scrapy, the very popular Python library for scraping the web. Both Scrapy and Scraping Hub have many good tutorials on the basics of getting set up, which I will leave for you to read.

The one point that I want to make here is that most of the documentation on Scrapy talks about parsing a traditional HTML/CSS document using CSS Selectors or XPath to get the information you need, but neither of those applies here because we’re dealing directly with a JSON response. Instead, to parse the content, you can use something like:

import json

data = json.loads(response.text)

Here we take the text of the response from our API endpoint and use Python’s json library to parse it into a dictionary. (Recent versions of Scrapy also provide response.json(), which does this in a single step.)

The data we want can then be accessed at the path described above:

posts = data['graphql']['user']['edge_felix_combined_post_uploads']['edges']

which gives a Python list of posts. This list can be iterated over to build items to yield to Scrapy, just as in a traditional HTML/CSS scraper.

for post in posts:
    yield post

In many cases you don’t need, or want, to keep all the data returned in the JSON object, so at this point you should choose the fields you care about. For the Instagram API we already picked out the fields we’d like to keep above, so we would proceed as follows:

keep_fields = ["__typename", "id", "edge_media_to_caption", "shortcode", "edge_media_to_comment", "taken_at_timestamp", "display_url", "edge_liked_by", "edge_media_preview_like", "owner", "is_video", "is_published", "product_type", "title", "video_duration"]

for post in posts:
    # The fields we want sit under the "node" key of each edge; .get() avoids
    # KeyErrors for fields that only exist on some post types.
    yield {k: post["node"].get(k) for k in keep_fields}
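
Putting these pieces together, a minimal spider might look something like the sketch below. The spider name and the list of handles are placeholders for whichever accounts you want to monitor; the JSON path and field selection are exactly those identified above.

import json

import scrapy


class InstagramPostsSpider(scrapy.Spider):
    """Sketch of a spider that pulls recent posts for a list of handles."""

    name = "instagram_posts"

    # Hypothetical handles; replace with the accounts you want to monitor.
    handles = ["instagram"]

    keep_fields = [
        "__typename", "id", "edge_media_to_caption", "shortcode",
        "edge_media_to_comment", "taken_at_timestamp", "display_url",
        "edge_liked_by", "edge_media_preview_like", "owner", "is_video",
        "is_published", "product_type", "title", "video_duration",
    ]

    def start_requests(self):
        for handle in self.handles:
            yield scrapy.Request(
                "https://www.instagram.com/%s/?__a=1" % handle,
                callback=self.parse,
            )

    def parse(self, response):
        data = json.loads(response.text)
        posts = data["graphql"]["user"]["edge_felix_combined_post_uploads"]["edges"]
        for post in posts:
            # Fields live under the "node" key of each edge.
            yield {k: post["node"].get(k) for k in self.keep_fields}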

Once we have our job working on Scraping Hub and returning results, it’s time to set up scheduling for that job and move the data somewhere we can put it to work!

Note that it’s possible to schedule jobs within Scraping Hub if you pay the $9/month subscription fee, but we’re going to use Google Scripts to avoid that fee and continue using the free account for now.

Processing with Google Scripts

Before we go ahead and process the data with Google Scripts, let’s set up our scheduling from there too. This is possible using Scraping Hub’s run.json API; a simple example of how to call it from Google Scripts is below. This function should be set to run on a time-driven trigger (Google Scripts’ built-in triggering system) at whatever interval you would like to crawl, e.g. once per day or once per hour.

/**
 * @summary Trigger a Scrapy job to run on Scraping Hub.
 *
 * Equivalent curl is:
 * curl -u APIKEY: https://app.scrapinghub.com/api/run.json -d project=PROJECT -d spider=SPIDER
 *
 * @param {String} projectId
 * @param {String} spiderName
 *
 * @returns {Object} The HTTP response from the Scraping Hub API.
 */
function runJob(projectId, spiderName) {
  // globVar is a helper (defined elsewhere) that returns stored
  // configuration values such as the Scraping Hub API key.
  var url = 'https://app.scrapinghub.com/api/run.json' +
      '?apikey=' + globVar('apiKey');
  var options = {
    'method': 'post',
    'payload': {
      'project': projectId,
      'spider': spiderName
    }
  };
  return UrlFetchApp.fetch(url, options);
}
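
If you want to sanity-check the API call itself before wiring up the trigger, the equivalent request can be sketched in Python (the API key, project ID and spider name below are placeholders):

import requests

# Placeholders: substitute your own Scraping Hub API key, project ID and spider name.
API_KEY = "YOUR_API_KEY"
PROJECT_ID = "123456"
SPIDER_NAME = "instagram_posts"

# Same call as the curl above: basic auth with the API key and an empty password.
response = requests.post(
    "https://app.scrapinghub.com/api/run.json",
    auth=(API_KEY, ""),
    data={"project": PROJECT_ID, "spider": SPIDER_NAME},
)
print(response.json())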

With that out of the way, we might want to process the data to modify some of the entries according to our requirements. For example, in the case above the value returned for edge_media_to_caption is a nested dictionary, so we could process it to return just the text of the caption. In other cases we might want to map the naming conventions used on our competitors’ websites to those we use on our own.

Note that this can also be done directly in Scrapy, either before we yield the item or in an item pipeline/middleware. That is generally the better approach, but sometimes processing in Google Scripts after extraction can be easier.
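
As a rough sketch of the Scrapy-side option, the caption could be flattened before the item is yielded, reusing the posts list and keep_fields from the spider above (flatten_caption is just an illustrative helper):

def flatten_caption(node):
    """Pull just the caption text out of the nested edge_media_to_caption structure."""
    edges = node.get("edge_media_to_caption", {}).get("edges", [])
    return edges[0]["node"]["text"] if edges else None

# Inside the spider's parse method:
for post in posts:
    item = {k: post["node"].get(k) for k in keep_fields}
    item["edge_media_to_caption"] = flatten_caption(post["node"])
    yield item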

Storing in Google BigQuery

To store all the data we crawl, we will use Google BigQuery. The process for getting set up here is pretty simple. Assuming we already have an account set up, we can go ahead and create a new dataset and table with the fields and data types we picked out above. Once this is done, we note down the respective IDs and names and go back to Google Scripts.
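
The dataset and table can be created through the BigQuery web console; if you prefer to do it in code, a rough sketch with the google-cloud-bigquery Python client looks like the following (the project, dataset and table names, and the exact schema, are illustrative and should match the fields you chose):

from google.cloud import bigquery

# Illustrative project name; replace with your own.
client = bigquery.Client(project="my-project")

# A possible schema, assuming the nested edge_* counts are flattened to plain
# numbers and the caption to plain text during processing.
schema = [
    bigquery.SchemaField("id", "STRING"),
    bigquery.SchemaField("shortcode", "STRING"),
    bigquery.SchemaField("edge_media_to_caption", "STRING"),
    bigquery.SchemaField("taken_at_timestamp", "INTEGER"),  # Unix epoch seconds
    bigquery.SchemaField("edge_liked_by", "INTEGER"),
    bigquery.SchemaField("edge_media_to_comment", "INTEGER"),
    bigquery.SchemaField("is_video", "BOOLEAN"),
    bigquery.SchemaField("video_duration", "FLOAT"),
]

table = bigquery.Table("my-project.instagram.competitor_posts", schema=schema)
client.create_table(table)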

To send the data to Google BigQuery we must first format it as a CSV. This can be a little painful, especially when you’re dealing with user generated content that has unexpected lengths, formats and characters, but the following gets the job done.

/**
 * Converts a 2D array to a CSV string.
 *
 * @param {Array} data Data to be converted to CSV.
 *
 * @returns {String} CSV content generated from the array.
 */
function convertArrayToCsvFile(data) {
  var csvFile = undefined;
  // Loop through the data in the range and build a string with the csv data.
  if (data.length > 1) {
    var csv = "";
    for (var row = 0; row < data.length; row++) {
      // In each column, check if the data contains a comma and, if so,
      // wrap it in quotes. (Note: embedded quotes are not escaped here.)
      for (var col = 0; col < data[row].length; col++) {
        if (data[row][col] != undefined) {
          if (data[row][col].toString().indexOf(",") != -1) {
            data[row][col] = "\"" + data[row][col] + "\"";
          }
        }
      }
      // Join each row's columns, stripping stray newlines from user-generated
      // content, and add a line break after every row except the last.
      csv += data[row].join(",").replace(/\r/g, '').replace(/\n/g, '');
      if (row < data.length - 1) {
        csv += "\r\n";
      }
    }
    csvFile = csv;
  }
  return csvFile;
}

The resulting CSV can then be saved to Google Drive. With the data stored in the correct format, it can be sent on to BigQuery with the following script:

/**
 * Inserts data into a BigQuery table from a csv file.
 *
 * @param {Object} data CSV file to upload to BigQuery.
 * @param {String} projectId BigQuery project to insert into.
 * @param {String} datasetId BigQuery dataset to insert into.
 * @param {String} tableId BigQuery table to insert into.
 */
function sendToBigQuery(data, projectId, datasetId, tableId) {
  // Run this to get correct permissions for BQ.
  DriveApp.getRootFolder();
  // Define the insert job.
  var insertJob = {
    configuration: {
      load: {
        destinationTable: {
          projectId: projectId,
          datasetId: datasetId,
          tableId: tableId
        }
      }
    }
  };
  // Send the job to BigQuery.
  var job = BigQuery.Jobs.insert(insertJob, projectId,
      data.getBlob().setContentType('application/octet-stream'));
  Logger.log(job.status.state);
  var jobId = job.jobReference.jobId;
  // Poll until the job completes.
  var sleepTimeMs = 500;
  while (job.status.state !== 'DONE') {
    Utilities.sleep(sleepTimeMs);
    job = BigQuery.Jobs.get(projectId, jobId);
    Logger.log(job.status.state);
  }
}

Creating Beautiful Dashboards with Google DataStudio

I’m a big fan of Google DataStudio. It fits very nicely with BigQuery and lets you quickly generate nice-looking, shareable reports from your data.

In this case, we will create a new report in Google DataStudio and create a data source using the BigQuery connector. Since it’s all Google, very little needs to be done to the data types or formatting; this should all be worked out automatically.

An example of a beautiful Google DataStudio report

Conclusion

While this is just an overview and doesn’t contain the code or instructions for the entire process, I hope it gives you an idea of what is possible. To wrap up, we’ve:

  • identified data on an API endpoint
  • used Scrapy and Scraping Hub to crawl the data at regular intervals
  • scheduled crawls and processed the data using Google Scripts
  • forwarded the data onto Google BigQuery where it is stored
  • created a quick and easy dashboard using Google DataStudio

Love to scrape data? Check out this article about scraping Memrise to investigate user behaviour: Memrise Data Scrape & Analysis.

If you enjoyed reading this, please don’t forget to leave a clap or two, or three… 👏👏👏

If you have any questions about setting up something similar, please don’t hesitate to reach out.