Python & JSON: API Walkthrough

Strap yourselves in boys & girls & machines! It’s API tutorial time!

Today, I’ll be walking through how to use Python to scrape pages that serve JSON files or are formatted as JSON.

  • API: Application Programming Interface — the part of a server that receives requests and sends responses. We’ll be interacting with this to get the data we want.
  • JSON: JavaScript Object Notation — the telltale sign that you’re dealing with JSON when webscraping is that the data appears as dictionaries nested within dictionaries.

For this tutorial, we’ll be using IMDb’s data (basically just the title IDs) as the inputs for our requests to OMDb’s API.

First, we’ll need to bring in our imports:
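
Here’s a minimal sketch of what those imports might look like (the Key file and the api_key name are placeholders of my own for wherever you stash your API key):

```python
import pandas as pd
import json
import urllib
import requests
import datetime
from os.path import exists

# Optional: a local file that holds your API key, kept out of version control
from Key import api_key
```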

The above are some imports you’ll want to bring in. The “Key” import is optional: it points to a file that holds an API key, if the website requires one. Sometimes you’ll want to keep that out of the public eye.

  • Pandas is your basic data library; we’ll be using it to create a DataFrame from what we request from the API.
  • JSON is for parsing and working with JSON data.
  • URLlib is a package that collects several modules for working with URLs.
  • Requests is what we’ll use to interact with websites and make our requests.
  • Datetime is optional, but a good option to have available if you’ll be pulling any dates and want to make them into datetime objects.
  • OS.path’s “exists” is used for creating new files down the line and checking whether the file already exists so it can be appended to (rather than writing new headers or replacing the data each time).

JSON Responses

Below, I’ll showcase how one would build the infrastructure for a single request, based on the website’s API structure (check out the link from earlier for OMDb).
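
As a minimal sketch (the title ID here is just a placeholder; swap in any IMDb ID you like), a single request might look like this, with the response fields abridged:

```python
title_id = 'tt0000001'   # placeholder IMDb title ID
url = 'http://www.omdbapi.com/?i=' + title_id + '&apikey=' + api_key
response = requests.get(url)
data = json.loads(response.text)
print(data)

# {'Title': '...', 'Year': '...', 'Genre': '...', 'Director': '...',
#  'Actors': '...',
#  'Ratings': [{'Source': 'Internet Movie Database', 'Value': '...'},
#              {'Source': 'Rotten Tomatoes', 'Value': '...'}, ...],
#  'BoxOffice': '...', 'Type': 'movie', 'Response': 'True'}
```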

This is what a single request from OMDb looks like. You should take note of the braces: { }. In Python, this is how dictionaries are built, and once the response is parsed you can work with it just like a dictionary. (One small caveat: JSON itself doesn’t guarantee any particular ordering of its keys, but since we’ll be pulling values out by key, that won’t matter here.)

Anyway, our task here is to pull out each element that we want for our pandas DataFrame. A really nice feature of interacting with JSON is that you can easily enter the dictionary and pull exactly what you want from it.
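
For instance, assuming the parsed response is sitting in data as above, grabbing the actors is just a key lookup:

```python
# 'Actors' comes back as a single comma-separated string of the main cast
actors = data['Actors']
```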

Here you can see that from our request, we can access the actors in the dictionary. This can be assigned to a column in our dataframe. Similarly, if you see a nested list, it can still be accessed pretty easily, as shown below:
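
Here’s a sketch of that nested access. The index assumes Rotten Tomatoes is the second entry in the Ratings list, which you’d want to confirm against your own responses; matching on the 'Source' name is the safer route:

```python
# Ratings is a list of dictionaries, one per review source
rotten_tomatoes = data['Ratings'][1]['Value']

# Safer: find the entry whose Source is Rotten Tomatoes, regardless of position
rotten_tomatoes = next(
    (r['Value'] for r in data['Ratings'] if r['Source'] == 'Rotten Tomatoes'),
    None,
)
```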

This is the Rotten Tomatoes Score from the API request.

As you might notice above, we can just enter the nested dictionaries by specifying the keys (and indices) we want. Again, this allows us to get exactly what we need from the request.

Building an API Caller

So now, let’s get what we wanna get and put it into a dataframe! I recommend checking out the scraper I built for this purpose: Github. I’ll still pick out some portions that you can use as a reference for the walkthrough. It can be a little intimidating at first, but I promise it gets easier to put together your own scraper.
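
Below is a rough sketch of the request loop, numbered to match the walkthrough. The names here (title_basics, the 'tconst' column, api_key) and the exact conditions are placeholders and assumptions on my part; the real thing lives in the GitHub repo linked above.

```python
# title_basics is assumed to be a DataFrame of IMDb title IDs, held in a 'tconst' column
leftovers = []                                                # Line 1: a bucket for anything that falls through the cracks

for index, title_id in enumerate(title_basics['tconst']):    # Line 2: number the rows and feed in the title IDs
    url = 'http://www.omdbapi.com/?i=' + title_id + '&apikey=' + api_key   # Line 3: the url we send requests to

    response = requests.get(url)                              # Line 4: assign the url request as response

    if response.status_code == 200:                           # Line 5: 200 means the request succeeded
        try:                                                  # Line 6: try the block below
            data = json.loads(response.text)                  # Line 7: the response text, as a JSON, named data

            if data['Response'] == 'True':                    # Line 8: skip responses that came back empty
                if data['Type'] == 'movie':                   # Line 9: only keep movies
                    if data['BoxOffice'] != 'N/A':            # Line 10: only keep movies with box office returns

                        try:                                  # Line 11: we just keep trying
                            actors = data['Actors']           # Line 12: assign the Actors data to a variable
                        except:                               # Line 13: if the attempt fails...
                            actors = None                     # Line 14: ...input a null value

                        # Lines 11-14 repeat for every element we want to pull,
                        # and the dataframe assembly (next section) happens here
        except:
            leftovers.append(title_id)                        # my assumption: failed parses join the leftovers
    else:
        leftovers.append(title_id)                            # the pair to line 5's if -- see "The Leftovers" below
```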

Line 1: First, I personally like to create an empty list for all the things that fall through the cracks so that I can return to them later on to try them again.

Line 2: Next, for the purposes of keeping track of where we are in our requests, I use enumerate to number my rows, and I pass in the column holding the title IDs that will go into each API request.

Line 3: Then, we put in the URL we’ll send our requests to. Each API might have a different way to send requests. Fortunately, there will almost always be documentation for how to interact with the API.

OMDb’s documentation shows us how we should structure our API requests (it differs from what I have in my scraper because I was emailed a different structure), and it also lists the different parameters we can put into the url for our request.

Note: It’s important that you structure the url and anything you add into it as a string.
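
For example, one way to build it is with an f-string (the i and apikey parameters come from OMDb’s documentation; title_id and api_key are my placeholder variables):

```python
url = f'http://www.omdbapi.com/?i={title_id}&apikey={api_key}'
```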

Line 4: This is where we assign the result of the url request to response. We pass the url as the argument for “getting” the response from the webpage.

Line 5: If the response code == 200, this means that we’ve successfully made a request. Refer to this page to see common website response codes.

Line 6: We try! try, if you’re unfamiliar, literally tells your function to just attempt the things below it. It is often paired with an except statement that will do something else if the attempt fails.


The try-except statements end up being my webscraper’s closest friend throughout the process.

Line 7: We assign the text of our website response, as a JSON, to a variable (here it’s named data).

Line 8: For the purposes of my scraper, I give it some conditional statements based on what I learned from earlier runs. I noticed that requests that didn’t come back with a Response of “True” would considerably slow things down.

Line 9: I only wanted movies in my requests so I placed another condition to only give me the data from requests that were of the type: movie.

Line 10: Yet another conditional: I didn’t need movies that didn’t have any box office returns.

The point of highlighting these conditional statements is to showcase how flexible you can be when making API requests, especially when working with JSON returns.

Line 11: We just keep trying.

Line 12: Here, we assign the data from Actors to a variable aptly named actors.

Line 13: If our attempt to try fails, we will…

Line 14: Input a null value.

Lines 11–14 are repeated for every element of data that we want to pull.
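
For instance, the same pattern for the box office figure might look like this (box_office is just my placeholder name):

```python
try:
    box_office = data['BoxOffice']
except:
    box_office = None
```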

Putting it All Together

Next, we have to put all the elements we grabbed into a Pandas dataframe for us to interact with in Python. Below, I’ll show the basic infrastructure of how you’ll put your assigned element variables into a dataframe.
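
Here’s a minimal sketch of that infrastructure. The column names are placeholders for whichever elements you pulled, and the values are the variables we assigned above:

```python
# Build an empty dataframe with the column names we want
movie_df = pd.DataFrame(columns=['imdb_id', 'actors', 'rotten_tomatoes', 'box_office'])

# Assign the elements grabbed from this request to those columns (one row per request)
movie_df['imdb_id'] = [title_id]
movie_df['actors'] = [actors]
movie_df['rotten_tomatoes'] = [rotten_tomatoes]
movie_df['box_office'] = [box_office]
```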

The above is the basic infrastructure for building any empty Pandas dataframe with chosen column names. Afterward, the elements we grabbed are assigned to the columns we created. This process occurs for every request we make.

From this, you’ll have a nice dataframe that gets saved to a filename of your choosing. The os.path.exists comes into play here right by the end.
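
A sketch of what that last line might look like (the filename is just whatever you choose):

```python
movie_df.to_csv('omdb_movies.csv', mode='a', index=False,
                header=not exists('omdb_movies.csv'))
```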

In the final line, we save everything we collected to a .csv file. The first portion names the .csv file. “mode=’a’” sets the .csv to be appended to each time there’s a new observation, rather than having it be replaced. Finally, the last portion checks whether or not a file with the given filepath and name already exists, so the header is only written once instead of on every append.

You’ve done it! You’ve created a Pandas dataframe from API requests that return JSON objects.

However…there is one last piece to all this…

The Leftovers
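
Roughly, and again with placeholder names, the tail end of the scraper might look like this (it picks up from the else branch in the loop sketched earlier):

```python
# Inside the loop, the else that pairs with line 5's if simply does:
#     leftovers.append(title_id)

# Once the loop finishes, save everything that failed so it can be retried later
leftover_df = pd.DataFrame(leftovers, columns=['imdb_id'])
leftover_df.to_csv('omdb_leftovers.csv', index=False)
```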

This is the pair to the if statement from all the way back at line 5. This bit of code will create a dataframe of all the times a request failed. This is useful to have saved for trying again in the future.

That’s actually the end now. If you have any clarifying questions, feel free to leave a comment or send me a message! Good luck out there on the world wide web!