Google Scraping With Python

Ever want the data behind Google Search results? Keep reading…

Background

Ever needed to get your hands on the data from a Google Search results page? Did it seem impossible? Did you exhaust every scraping method, only to end up in a staring contest with reCAPTCHAs and expensive proxies? Same.

Enter the Google Custom Search API, a free-ish API from Google itself. This simple API provides just about everything you could need for your next project, and getting started is fast.

Getting Started

You will need to create a custom search engine on Google, which you will use to get your search results. Fill in the details and enter whatever your favorite site is under “Sites to Search”.

Next you should see a success screen. Click “Control Panel”.

The default settings are fine to start with. (Note: keep your Search Engine ID handy. It identifies your engine in every API request, where it is passed as the cx parameter.)

Keep scrolling on this page. There are a few important settings.

Make sure “Search the entire web” is enabled. This replicates the Google Search functionality you know and love, and it is the core of this whole project, right?

After you enable it, click “Get Started” under Programmatic Access. This is how you get an API key to make requests.

Congrats, you are doing great. A few more steps and we will get to the cool stuff, I promise. After clicking “Get Started”, click “Get A Key” on the next screen. A little popup will appear; give your project a name.

Here is the key, keep it safe! One IMPORTANT thing to highlight: you ONLY get 100 free API requests per day. Do not exceed your free limit or you will have to pay for additional requests.
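With the key and the Search Engine ID in hand, a quick smoke test could look something like the sketch below. It assumes the Custom Search JSON API endpoint at https://www.googleapis.com/customsearch/v1 and its key, cx, and q parameters. The function names build_url() and search_once() are made up for this example and are not part of any library.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

API_URL = "https://www.googleapis.com/customsearch/v1"


def build_url(query, api_key, engine_id, **params):
    """Return the request URL for one page of search results."""
    # key = your API key, cx = your Search Engine ID, q = the search terms.
    query_params = {"key": api_key, "cx": engine_id, "q": query, **params}
    return f"{API_URL}?{urlencode(query_params)}"


def search_once(query, api_key, engine_id, **params):
    """Make ONE API request and return the decoded JSON response."""
    url = build_url(query, api_key, engine_id, **params)
    with urlopen(url, timeout=10) as response:
        return json.load(response)


# Example (uses one request from the daily quota, so it is left commented out):
# data = search_once("python web scraping", "YOUR_API_KEY", "YOUR_ENGINE_ID")
# for item in data.get("items", []):
#     print(item["title"], item["link"])
```

Every call to search_once() counts as one of your 100 free daily requests, so it is worth getting one request working before looping over anything.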

Here is the API Console, which you reach by clicking the “API Console” link from above. You can also get there by searching for “Custom Search API” in the search bar. You can increase quotas and limits in here if you so desire. I personally set my limit to 100 per day to ensure I don’t owe a penny.

Here is the Python code to make this work. I won’t go into every line, but at a high level it does the following: it creates a GoogleSearch instance and opens a request session to the API, then gets back up to the Google limit of 100 results for a given search request. The API returns at most 10 results per request, so a full 100-result search uses 10 of your 100 free daily requests. It also collects metadata about each request, stored in the search_terms and search_stats attributes of the instance.
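For readers who want something concrete to start from, here is a minimal sketch of a class matching that description. It is a reconstruction under assumptions, not the original code: only the names GoogleSearch, get_results(), search_terms, and search_stats come from the description above, and everything inside them (the session argument, the layout of the stats dicts, the paging loop) is my own guess at a reasonable design. It relies on the third-party requests library unless you pass in your own session object.

```python
API_URL = "https://www.googleapis.com/customsearch/v1"
PAGE_SIZE = 10      # the API returns at most 10 results per request
MAX_RESULTS = 100   # the API does not serve results past the 100th


class GoogleSearch:
    """Hypothetical sketch of the class described above."""

    def __init__(self, api_key, engine_id, session=None):
        if session is None:
            # Imported lazily so a stub session can be injected for testing.
            import requests
            session = requests.Session()
        self.api_key = api_key
        self.engine_id = engine_id
        self.session = session
        self.search_terms = []  # every query string requested so far
        self.search_stats = []  # one dict of metadata per API request

    def get_results(self, query, max_results=MAX_RESULTS, **params):
        """Page through results for `query`; each page costs one request."""
        max_results = min(max_results, MAX_RESULTS)
        self.search_terms.append(query)
        results = []
        start = 1  # the API's `start` index is 1-based
        while len(results) < max_results:
            request_params = {
                **params,
                "key": self.api_key,
                "cx": self.engine_id,
                "q": query,
                "start": start,
                "num": min(PAGE_SIZE, max_results - len(results)),
            }
            response = self.session.get(API_URL, params=request_params, timeout=10)
            response.raise_for_status()
            data = response.json()
            items = data.get("items", [])
            info = data.get("searchInformation", {})
            self.search_stats.append({
                "query": query,
                "start": start,
                "returned": len(items),
                "total_results": info.get("totalResults"),
                "search_time": info.get("searchTime"),
            })
            results.extend(items)
            # Stop when the API reports no further page or returns nothing.
            if not items or "nextPage" not in data.get("queries", {}):
                break
            start += len(items)
        return results[:max_results]
```

Passing a smaller max_results, for example get_results("python", max_results=20), is an easy way to stretch the daily quota, since that search costs 2 requests instead of 10.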

Using the above code, you can pass any valid Google API parameter into the get_results() function to control the results returned.
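As a rough illustration of what passing parameters through can look like, the sketch below merges optional settings with the three required ones (key, cx, q). The helper name build_params() is hypothetical. The option names in the usage example (dateRestrict, gl, lr, siteSearch) are parameters of the Custom Search JSON API as far as I know, but check the API reference for the full list and accepted values.

```python
def build_params(query, api_key, engine_id, **extra):
    """Merge optional API parameters with the three required ones."""
    num = extra.get("num")
    if num is not None and not 1 <= num <= 10:
        # The API accepts between 1 and 10 results per request.
        raise ValueError("num must be between 1 and 10")
    # Drop options left as None so they are not sent as the string "None".
    params = {name: value for name, value in extra.items() if value is not None}
    # Required parameters go in last so they cannot be overridden by accident.
    params.update({"key": api_key, "cx": engine_id, "q": query})
    return params


example = build_params(
    "python web scraping",
    "YOUR_API_KEY",
    "YOUR_ENGINE_ID",
    dateRestrict="m6",               # only results from the past 6 months
    gl="us",                         # geolocation of the searcher
    lr="lang_en",                    # restrict results to English
    siteSearch="stackoverflow.com",  # restrict results to one site
)
```

The resulting dict can be handed to whatever makes the request, for example as the params argument of a requests session's get() call.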

Once you have the results, you can store them in a Pandas DataFrame or process them downstream in vanilla Python. The options are limitless.

If you are interested in the full code or in taking this further, check out these links for the full source code and notebooks.
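Here is one way that hand-off might look, as a sketch. It assumes each result is a dict with the title, link, snippet, and displayLink fields that the API returns in its items list. The helper names to_rows() and to_dataframe() and the snake_case column names are my own choices. pandas is only imported if you actually ask for a DataFrame.

```python
# Maps the API's field names to the column names used in the output.
FIELDS = {
    "title": "title",
    "link": "link",
    "snippet": "snippet",
    "displayLink": "display_link",
}


def to_rows(items):
    """Flatten raw result dicts into uniform rows, adding a 1-based rank."""
    rows = []
    for rank, item in enumerate(items, start=1):
        row = {"rank": rank}
        for api_name, column in FIELDS.items():
            # Missing fields become None instead of raising KeyError.
            row[column] = item.get(api_name)
        rows.append(row)
    return rows


def to_dataframe(items):
    """Build a pandas DataFrame from the flattened rows."""
    import pandas as pd  # third-party; only needed for this step
    return pd.DataFrame(to_rows(items))
```

The plain list from to_rows() also works on its own, for example with csv.DictWriter, if you would rather stay in vanilla Python.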

Get the Code

Thanks for reading. I hope you enjoyed this and can use it in your next project.
