Using Google Cloud Natural Language API for SEO with Python

Benjamin Burkholder
Feb 23, 2019 · 14 min read

The purpose of this article is to provide you with a program I’ve built that combines all of the Google Natural Language API methods into one module. I’ll provide some basic background on what this API is, how it can be used for SEO and how to use the program. There are a lot of intricacies around the data these APIs return, so to keep things lightweight I’ll link to the official Google documentation where you can read the in-depth material for yourselves.

What is the Google Cloud Natural Language API?

In the simplest terms, Google’s Cloud Natural Language API uses pre-trained machine learning models to reveal the structure and meaning of text. In Google’s own words, with this API you can:

Use Cloud Natural Language to extract information about people, places, events, and much more mentioned in text documents, news articles, or blog posts. You can use it to understand sentiment about your product on social media or parse intent from customer conversations happening in a call center or a messaging app. You can analyze text uploaded in your request or integrate with your document storage on Google Cloud Storage.

From this short blurb alone, you should be able to glean how useful such a tool can be. The API contains five methods; each one, when called with text to analyze, returns data specific to that call. Below are these five methods:

  • Sentiment Analysis ~ Identifies the overarching opinion within the text, essentially determining the writer’s attitude as positive, negative or neutral.
  • Entity Analysis ~ Inspects the text for known entities, such as public figures, places, etc. It will return information on these entities, such as a Wikipedia URL among other data points.
  • Entity Sentiment Analysis ~ For entities detected in given text, the method will determine the overarching emotion of the writer toward the entity.
  • Syntactic Analysis ~ Breaks the text up into a series of tokens (words) and provides further data on each token and its relationships, such as its part of speech (e.g. verb, noun).
  • Content Classification ~ Analyzes the text and determines a ‘content category’ to return for the text.

This is a very high level summary of these methods. For more in-depth information, be sure to read the available Google documentation.
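For orientation, here is a minimal sketch of how the five methods map onto the official Python client. It assumes the google-cloud-language package (2.x series) is installed and credentials are configured. The short labels in METHODS and the helper name run_method are illustrative only and are not taken from the program.

```python
# Short labels (hypothetical) mapped to LanguageServiceClient method names.
METHODS = {
    "sentiment": "analyze_sentiment",
    "entities": "analyze_entities",
    "entity_sentiment": "analyze_entity_sentiment",
    "syntax": "analyze_syntax",
    "classification": "classify_text",
}


def run_method(label, text):
    """Send plain text to one of the five API methods and return the response."""
    # Imported here so the mapping above can be inspected without the package.
    from google.cloud import language_v1

    client = language_v1.LanguageServiceClient()
    document = language_v1.Document(
        content=text, type_=language_v1.Document.Type.PLAIN_TEXT
    )
    method = getattr(client, METHODS[label])
    return method(request={"document": document})
```

Calling run_method("entities", some_text) would return the entity response for that text. Older releases of the client library used a different import style, so adjust the imports if you are on a 1.x version.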

What does the program do?

While conducting my own tests using this API, I thought it would be much more efficient to combine all five of these methods into one main module. From there the user can analyze text in a variety of ways.

Using the basic API modules Google provides for the Python implementation, I’ve added functionality to make the tool more useful for non-technical users:

  • All five methods have been combined into a central Python module.
  • User is given the option to run tests via ‘bulk’ or ‘direct’. Bulk allows the user to fill a txt file with as many URLs as they want to analyze the text contained within. This is useful for gathering a large amount of data at once. The direct option allows the user to simply add text into a txt file and run the analyses that way. This option is better for more in-depth, targeted analyses.
  • When analyses are run, the results are saved to local CSV files automatically. This way the data can be analyzed offline.
  • The program is set to loop until the user states, when prompted, that they wish to terminate the program. This was included so multiple analyses can be made on the same text/URLs.

How is Google Cloud Natural Language Useful for SEO?

Now to the whole purpose of this article: how can this program be useful for SEO?

Out of the five methods available, I’ve found two to be the most useful for SEO purposes:

  • Entity Analysis
  • Content Classification

Why? Allow me to explain how both of these methods can be used for identifying opportunities, as well as taking a potential glimpse into how Google may be interpreting the content on your web pages.

Entity Analysis

As mentioned in the brief overview, the Entity Analysis method will analyze the given text and determine words and phrases which it ‘thinks’ are important. Each result returns the base fields shown below. It’s important to note that these fields will change slightly depending on whether the user selected ‘bulk’ or ‘direct’ upload:

  • name
  • type
  • salience
  • wikipedia_url
  • mid
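As a rough illustration, the fields above could be flattened into one CSV-ready row per entity. This sketch works on a plain dictionary shaped like an API entity; the function name, the input shape and the sample values are assumptions made for illustration. In the API response itself, wikipedia_url and mid live inside the entity’s metadata map and are simply absent when there is no match.

```python
def entity_to_row(entity):
    """Flatten one entity (a plain dict here) into the five base fields."""
    metadata = entity.get("metadata", {})
    return {
        "name": entity["name"],
        "type": entity["type"],
        "salience": round(entity["salience"], 4),
        # Missing metadata keys become empty strings so every row has the same columns.
        "wikipedia_url": metadata.get("wikipedia_url", ""),
        "mid": metadata.get("mid", ""),
    }


# Made-up entity for illustration; not real API output.
example = {
    "name": "Example Corp",
    "type": "ORGANIZATION",
    "salience": 0.5,
    "metadata": {},
}
row = entity_to_row(example)
print(row)
```

Keeping the two metadata columns even when they are empty makes it easy to spot entities without a Wikipedia match once the rows are in a spreadsheet.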

For this task, the most important fields from this analysis are name, type and wikipedia_url. The ‘name’ field displays the word or phrase the program has determined to be an entity, and the ‘type’ field displays what the program thinks that entity is. In the image below, you can see that the program has picked out “U.S. Cellular” from the text, and it has determined the type to be an ‘organization’.

Lastly, the Wikipedia URL for the entity is provided as well. This part is very useful because it indicates that Google is confident the entity in the text is the same as the one on the Wikipedia page. It matters because Google sources much of the content in a Knowledge Graph from Wikipedia, so having this strong connection is crucial. It is equally useful for identifying entities that don’t have Wikipedia pages, for current or prospective clients.

Why?

Let’s say we’re auditing a website and running some of the web page content through the program. You notice that the name of the company is not being ‘typed’ correctly and no Wikipedia page is being pulled in when you believe one should be. As you investigate further, you notice they do have a Knowledge Graph but it’s pretty much bare.

The reason?

Since Google tends to source much of the content in a Knowledge Graph from Wikipedia, if no Wikipedia page exists it is likely that little to no content is being sourced. This presents a good opportunity for a Wikipedia page to be created for the company, beefing up the Knowledge Graph while also strengthening Google’s understanding of the company being an organization type.

Let’s take a quick look at this process in action. But first, a shout out to my colleague Drew Schwanitz for providing the correlation for this particular example.

  1. While running some entity analyses on the content present for a random entity, ViSenze, we started to notice that the company name is not being ‘typed’ correctly. It is essentially being lumped into the “Other” bucket. This is an issue because clearly Google’s Natural Language API is not identifying the company name as being that of an organization, or an entity in its own right.

2. Next, you’ll notice that no Wikipedia page is listed in the results. Out of curiosity, we checked whether ViSenze had a Knowledge Graph present in Google search. They did; however, it is very bare bones, as you’ll see below.

3. Why is the Knowledge Graph so bare? The chief reason is that Google does not have a reliable ViSenze Wikipedia page to source valuable content from. As a result, only the most basic company information is included in the Knowledge Graph.

As you can see, these simple discoveries came from a chance curiosity after running web page content through the GNL program. With this data, better suggestions can be made to strengthen the correlation Google gives to a company and related content on the web.
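The check we did by eye could be sketched in code along these lines. This is only an illustration: the function name is hypothetical, the rows are plain dictionaries using the field names listed earlier, and the second sample row is made up.

```python
def flag_weak_entities(rows, names_of_interest):
    """Return the named entities that are typed OTHER or lack a Wikipedia URL."""
    wanted = {name.lower() for name in names_of_interest}
    flagged = []
    for row in rows:
        if row["name"].lower() not in wanted:
            continue
        reasons = []
        if row["type"] == "OTHER":
            reasons.append("typed as OTHER")
        if not row.get("wikipedia_url"):
            reasons.append("no Wikipedia URL")
        if reasons:
            flagged.append({"name": row["name"], "reasons": reasons})
    return flagged


rows = [
    # Mirrors the ViSenze finding described above.
    {"name": "ViSenze", "type": "OTHER", "wikipedia_url": ""},
    # Made-up row for contrast.
    {"name": "Example Corp", "type": "ORGANIZATION",
     "wikipedia_url": "https://en.wikipedia.org/wiki/Example"},
]
flagged = flag_weak_entities(rows, ["ViSenze", "Example Corp"])
print(flagged)
```

Passing in only the brand names you care about keeps the output short, since most entities on a page will have no Wikipedia match and that is usually fine.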

Content Classification

The other GNL method I’ve found to be most helpful for SEO is the content classification method. As mentioned earlier, this method classifies the given text into a category name the API ‘thinks’ it belongs in. It then assigns the text a confidence score, which shows how confident the API is that the assigned category is correct. The confidence score is a decimal (float) value between 0 and 1.

Why is this helpful?

Being able to analyze text on a web page, and understanding how Google may be interpreting it, is pertinent in ensuring on-page copy is relevant and focused. Take the results below for example:

*NOTE* This example is using the ‘bulk’ upload feature in the program.

For both blurbs of content analyzed, the API is placing them in the “Internet & Telecom” category. While correct, it is pretty generic, as there are many more specific categories available below this one. Also, the confidence score is only about 0.5–0.6 (50–60%) for both blurbs. This may explain the higher level categorization, as Google can’t form a solid enough connection between the text and a more specific category.

This method is helpful for SEO because it helps the user better understand how Google may be ‘seeing’ the content. The human writer of the content may think they’re being very explicit, but after analyzing the text with the program, it may turn out that Google isn’t very confident about what the content is really about. This can be used to help shape how content is written and the keywords used to ensure max confidence on the part of Google.
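To make that review repeatable, a result could be summarized with something like the sketch below. It relies on the API returning category names as slash-separated paths (for example “/Internet & Telecom”). The function name and the 0.7 cut-off are assumptions chosen for illustration, not values recommended by Google.

```python
def review_category(name, confidence, min_confidence=0.7):
    """Summarize how specific and how confident a classification result is."""
    # Split "/Top/Sub" into its levels, ignoring the empty piece before the first slash.
    levels = [part for part in name.split("/") if part]
    return {
        "top_level": levels[0],
        "depth": len(levels),
        "is_generic": len(levels) == 1,
        "is_low_confidence": confidence < min_confidence,
    }


# 0.55 sits inside the 50-60% range seen in the example above.
result = review_category("/Internet & Telecom", 0.55)
print(result)
```

A page that comes back both generic and low confidence is a good candidate for tightening the copy and re-running the analysis.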

Program Setup

Time to get down to the nuts and bolts of setting up and using the program I’ve created. In this section, I’ll walk you through the prerequisites you’ll need in order to use the program.

Here is the basic breakdown:

  • Install the latest Python release (3.7.2 at the time of writing).
  • Install git.
  • Download PyCharm or Visual Studio Code (or a suitable equivalent to run the code).
  • Clone the program from my Github repository to your machine.
  • Install the Beautiful Soup module.
  • Create a Google Cloud account, save the JSON credential file to a local directory and set the GOOGLE_APPLICATION_CREDENTIALS environment variable (the path to the JSON file).

I will not be going into minute detail on all aspects of peripheral configuration. If you hit a snag, a simple Google search should help you as most of these issues are heavily documented. Or ask a developer for assistance if you have access to one.

Installing Python

In order to install the latest Python release, simply navigate to Python.org and look under “Downloads”. There should be releases available for all major operating systems.

In this example, we’ll download for Windows and select one with an executable installer. Simply click the option you want and follow the installation prompts until it indicates Python has been installed onto your machine.

Installing Git

We’ll be using git to communicate with GitHub, so we first need to install it. You can find directions for installing git on different operating systems on the official git website.

If you aren’t very familiar with installing packages directly from the command line, they also provide downloadable versions for ease of use.

Downloading PyCharm / Visual Studio Code (or equivalent)

Next we need a platform to run the program in. I typically work in PyCharm or Visual Studio Code since they’re free and very robust. However, if you have an equivalent you prefer, feel free to use that. All you need is a platform in which to open and run the program.

Clone the Repository from GitHub

Now we need to clone the repository from my GitHub that contains the program. For this, we’ll be using git (which we installed earlier) as well as using the command terminal.

Here are the steps:

  1. On your machine, look for the command terminal native to your OS using the start menu or system search. On Windows it’s PowerShell, and on Mac it’s Terminal. A simple Google search should help you determine which one you have.
  2. Navigate to my GitHub repository containing the program. On the far right you’ll see a green button that says “Clone or download”; we’re going to be cloning. Now click the little clipboard image circled below, which will copy the path.

3. Open the command line terminal native to your machine. We’re going to clone the project to the desktop for ease of access. You can accomplish this by entering:

cd desktop

Then simply hit enter and you should see the desktop folder appended to the path.

This is how you navigate in and out of folders via command line. For our purposes however, this is as far as we need to go.

Next we will clone the Github project to our desktop. Simply enter this command into the terminal and hit enter:

git clone https://github.com/ibebeebz/google-natural-language-api.git

You should see some action taking place and a success message in your terminal. It worked! Now if you look at your machine’s desktop you should see a folder called ‘google-natural-language-api’, which contains the Python file.

Installing the ‘Beautiful Soup’ Module

In order for the Python program to run, all dependencies need to be installed. This program leverages the Beautiful Soup module which doesn’t come with the standard library, thus it needs to be installed separately.

In your command terminal enter this command and hit enter:

pip install beautifulsoup4

You should see activity and a success message. The Beautiful Soup dependency should now be installed on your machine.

NOTE ~ I believe this was the only third-party dependency, but if not, simply look at the “import” section at the top of the program and see if any errors (red underlines) are being called out on the imported modules. If so, search for “python install [name of dependency]” in search to find the terminal command. This should provide you with instructions on installing any other necessary dependencies.
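If you would rather check from Python than rely on red underlines in the editor, a small sketch like the following reports which modules cannot be found. The function name is hypothetical, and the list passed in is only an example; use the names from the program’s own import section.

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be found for import."""
    missing = []
    for name in module_names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised, for example, for dotted names whose parent package is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


# "bs4" is the import name of the beautifulsoup4 package.
print(missing_modules(["bs4", "csv"]))
```

An empty list means everything you named is importable. Note that the import name can differ from the name used with pip, as with bs4 and beautifulsoup4.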

Create Google Cloud Account/JSON Creds/Env Variable

This step is important, as you will be unable to run the program, or any other Google Cloud Machine Learning API, without it. I will not outline these steps here, as Google’s existing documentation explains them better.

Once both steps above are completed, the program should work.

NOTE ~ Make sure to keep your credentials private. Never publish them anywhere public where they can be exploited.
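Before running the program, a quick sanity check of the environment variable can save some head scratching. The sketch below only confirms that GOOGLE_APPLICATION_CREDENTIALS is set and points at an existing file; it does not validate the key itself. The function name is hypothetical.

```python
import os

ENV_VAR = "GOOGLE_APPLICATION_CREDENTIALS"


def credentials_problem(environ=None):
    """Return a description of what is wrong, or None if the setup looks fine."""
    environ = os.environ if environ is None else environ
    path = environ.get(ENV_VAR)
    if not path:
        return f"{ENV_VAR} is not set"
    if not os.path.isfile(path):
        return f"{ENV_VAR} points to a missing file: {path}"
    return None


problem = credentials_problem()
print(problem or "Credentials file found")
```

If the variable was set in a different terminal session, it may not be visible to your code editor, so run the check from the same place you run the program.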

Program Flow

Below is the general flow of the program.

  1. Looking in the project folder you cloned from my GitHub repository, you should see a few different files:
  • 2 txt files ~ ‘gnl-direct-check’ and ‘gnl-bulk-check’. These are the files where you will input either URLs (bulk check) or insert text directly (direct check).
  • 1 Python file ~ ‘gnl-main-module’. This is the program where you’ll run analyses from.
  • 1 folder ~ ‘gnl-separate-modules’. This is just a folder containing a separate module for each GNL method, included for reference. They do not interact with the program at all.

2. Once you’ve placed the URLs or text you want to analyze into their respective text files, open the program’s Python file in your code editor of choice. It’s important to note that the project folder should sit inside the directory the code editor is pointing to. You can typically find this path in the editor’s terminal. You can change this path in the editor, but it’s easier just to place the files in the directory it’s already pointing to.

3. Once you run the program, you’ll be given the option to run the analyses in bulk or direct. Simply type in the option of your choice.

4. After entering the option of your choice and hitting enter, you’ll be given a list of the GNL methods to choose from. Each analysis corresponds with a letter in parentheses, so just enter the letter of the analysis you want to run in the prompt.

NOTE ~ For the ‘bulk’ option, you’ll notice that there are only four options compared to the ‘direct’ option’s five. The reason is that I’ve removed the ‘Syntax Analysis’ option from bulk because it would create very unwieldy results when analyzing many URLs. I may add it back at some point if it proves valuable.

5. There are a few key differences in how the analyses are conducted depending on whether you run them in bulk or direct. With the bulk option, the program uses Beautiful Soup to scrape all available ‘p’ tags from the page. This is done because best practice is to place body copy within these tags. It does mean that headings, and any other text outside ‘p’ tags, are not included. Because web pages generally contain a lot of messy code that gets in the way, I opted to only collect ‘p’ tag content to make the program more scalable across different domains. From the tests I’ve run, this seems to collect most of the important content anyway. If the pages you’re analyzing have a different structure, use the ‘direct’ method instead to collect everything.

FUNCTIONALITY UPDATE ~ The bulk functionality of the program now gives the user the ability to specify which tag to target. This option, in conjunction with the regex that strips out all extraneous tag code, means the program can now run everything present on a page through the API. This means the user can gain a more holistic view of how the page is being “typed” by the GNL API.

When running the ‘Content Classification’ option for ‘bulk’, you may see a few instances of an error about too few tokens. This occurs when the content located within a given ‘p’ tag contains fewer than 20 words (tokens). If the page content you are analyzing is somewhat short, run it in the direct option to ensure it can be analyzed in full.
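To show the idea behind the tag scraping and the token limit, here is a dependency-free sketch. Note that it swaps in the standard library’s HTMLParser in place of Beautiful Soup, so it is not the program’s actual code. The class and function names are hypothetical, and counting whitespace-separated words is only an approximation of how the API counts tokens.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text inside every occurrence of one target tag."""

    def __init__(self, target_tag="p"):
        super().__init__()
        self.target_tag = target_tag
        self.depth = 0       # how many target tags are currently open
        self.blocks = []     # finished text blocks, one per outermost tag
        self._parts = []     # text pieces for the block being read

    def handle_starttag(self, tag, attrs):
        if tag == self.target_tag:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.target_tag and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                # Collapse runs of whitespace into single spaces.
                text = " ".join("".join(self._parts).split())
                if text:
                    self.blocks.append(text)
                self._parts = []

    def handle_data(self, data):
        # Text in nested tags (links, bold, etc.) is kept; the tags are dropped.
        if self.depth > 0:
            self._parts.append(data)


def extract_blocks(html, target_tag="p"):
    """Return the text of each target_tag element found in html."""
    parser = TagTextCollector(target_tag)
    parser.feed(html)
    parser.close()
    return parser.blocks


def split_by_length(blocks, min_tokens=20):
    """Separate blocks long enough to classify from those that are too short."""
    long_enough = [b for b in blocks if len(b.split()) >= min_tokens]
    too_short = [b for b in blocks if len(b.split()) < min_tokens]
    return long_enough, too_short
```

Blocks that land in the too-short list are the ones likely to trigger the token error, so they could be skipped or joined together before being sent for classification. This simple parser expects closing tags, so pages with unclosed ‘p’ tags would need a more forgiving parser such as Beautiful Soup.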

6. Once an analysis finishes executing, you’ll see a prompt asking if you want to run another analysis. Answer by simply entering either Y or N. If Y, you will be returned to the GNL method choice screen. If N, the program will finish executing.

NOTE ~ If you want to run analyses using a different check choice, you will need to end the program to get back to the screen to choose either ‘bulk’ or ‘direct’.
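The prompt flow described in steps 3 to 6 can be sketched as a small loop. This is an illustration of the flow only: the function name, the prompt wording and the letter assignments are assumptions, and the analysis itself is passed in as a callback so the loop can be tried without calling the API.

```python
BULK_METHODS = ["sentiment", "entities", "entity sentiment", "classification"]
DIRECT_METHODS = BULK_METHODS + ["syntax"]  # syntax is only offered in direct mode


def run_session(ask, run_analysis):
    """Pick a mode once, then loop over analyses until the user answers N."""
    mode = ""
    while mode not in ("bulk", "direct"):
        mode = ask("Run in bulk or direct? ").strip().lower()
    methods = BULK_METHODS if mode == "bulk" else DIRECT_METHODS
    # Letter each option: a, b, c, ...
    menu = {chr(ord("a") + i): name for i, name in enumerate(methods)}
    prompt = ", ".join(f"({key}) {name}" for key, name in menu.items()) + ": "
    completed = []
    while True:
        choice = ask(prompt).strip().lower()
        if choice not in menu:
            continue  # unknown letter, ask again
        run_analysis(mode, menu[choice])
        completed.append(menu[choice])
        if ask("Run another analysis? (Y/N) ").strip().upper() != "Y":
            return completed


# Canned answers stand in for keyboard input in this example.
answers = iter(["direct", "b", "y", "e", "n"])
completed = run_session(lambda prompt: next(answers), lambda mode, method: None)
print(completed)
```

For interactive use, pass the built-in input function as ask. Because the mode is chosen before the loop starts, switching between bulk and direct means starting a new session, which matches the note above.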

7. Each time you run an analysis, you should see a specific CSV file populate in the folder. This is where the results of the analysis have been written. Each analysis choice, for both bulk and direct, has its own CSV file.

It’s important to mention that the program is coded to overwrite the CSV if the same analysis is run again without saving the file under a new name. To avoid losing results, once you’ve run an analysis you want to save, simply rename the file. Then, when you run the analysis again, a new file will populate with the latest data.
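If renaming files by hand gets tedious, one alternative is to move any existing file aside automatically before writing. The sketch below shows that idea; it is not how the program currently behaves, and the function names and timestamp format are assumptions.

```python
import csv
from datetime import datetime
from pathlib import Path


def archive_name(csv_path, now=None):
    """Build a sibling file name that includes a timestamp."""
    now = now or datetime.now()
    path = Path(csv_path)
    return path.with_name(f"{path.stem}-{now:%Y%m%d-%H%M%S}{path.suffix}")


def write_results(csv_path, rows, fieldnames):
    """Write rows to csv_path, first renaming any existing file so it is kept."""
    path = Path(csv_path)
    if path.exists():
        path.rename(archive_name(path))
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return path
```

With this approach the plain file name always holds the latest run, while earlier runs sit alongside it with the time they were archived in their names.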

Conclusion

Hopefully this article and program prove useful for you. There’s a whole plethora of data to be extracted from these API methods, so definitely dig in. Google’s Natural Language API methods provide an interesting look into how Google is currently using machine learning, and may give us insight into how Google ‘sees’ the content on web pages.

I will be reviewing the code in this program on an ongoing basis to make revisions and enhancements, so be sure to clone the program again in the future if you haven’t used it for an extended period. If you have any questions or concerns, feel free to reach out.
