Scraping News from Google

Rahman Taufik · Analytics Vidhya · Jan 4, 2021

Basically, we can use several web scraping tools (e.g. BeautifulSoup, Scrapy, Selenium) to extract information from Google. For this article, the author uses BeautifulSoup because it is easy to implement; ultimately, it depends on which tool you are comfortable with.

Furthermore, this article explains how to scrape Google and how to deal with Google's query settings and request limits. It includes Python code examples for Google scraping, covering both Google News and common Google Search.

Google News

Get Stock News from Google News

Google News Scraping Code
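Below is a minimal sketch of such a scraper, using requests and BeautifulSoup. The function name get_google_news, the example tickers, and the div[role="heading"] selector are illustrative assumptions; Google changes its result markup often, so inspect the live page and adjust the selector as needed.

import requests
from bs4 import BeautifulSoup

# A plausible browser header; see the request-limit section below.
random_header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

def get_google_news(tickers):
    """Return a dict mapping each ticker to a list of news headlines."""
    news = {}
    for t in tickers:
        keyword = '{} stock news'.format(t)
        url = f"https://www.google.com/search?q={keyword}&tbm=nws&lr=lang_en&hl=en&sort=date&num=5"
        res = requests.get(url, headers=random_header)
        soup = BeautifulSoup(res.text, 'html.parser')
        # The selector below is an assumption: Google News result
        # headlines are often marked with role="heading".
        news[t] = [h.get_text() for h in soup.select('div[role="heading"]')]
    return news

print(get_google_news(['AAPL', 'TSLA']))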

Google News is used to search for news from several publishers. For this example, we search for company news related to stocks.

In the code above, we have a function that fetches the news for every ticker we registered.

One of the important parts of Google scraping is the query settings. You can set and explore more query parameters, but in this code the query includes the search keywords, the language, date sorting, and the number of news results.

keyword = '{} stock news'.format(t)        
url = f"https://www.google.com/search?q={keyword}&tbm=nws&lr=lang_en&hl=en&sort=date&num=5"

Get Company Description from Google Search

Wikipedia Scraping Code
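A minimal sketch of this step is shown below, under the same assumptions as the Google News scraper. The function name get_company_description and the div.VwiC3b snippet selector are assumptions and may need adjusting as Google's markup changes.

import requests
from bs4 import BeautifulSoup

random_header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

def get_company_description(ticker_name):
    """Fetch a short company description from the Wikipedia result on Google."""
    url = f"https://www.google.com/search?q=wikipedia {ticker_name.lower()} company&lr=lang_en&hl=en"
    res = requests.get(url, headers=random_header)
    soup = BeautifulSoup(res.text, 'html.parser')
    # Take the first result snippet; for a Wikipedia hit this is usually
    # the lead sentence of the article. The class name is an assumption.
    snippet = soup.select_one('div.VwiC3b')
    return snippet.get_text() if snippet else None

print(get_company_description('Apple Inc'))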

Meanwhile, in a common Google search, we try to scrape a company description from Wikipedia. We use ‘wikipedia’ plus the ticker name as the keyword to get a description of the ticker, for example ‘wikipedia apple inc’.

url = f"https://www.google.com/search?q=wikipedia {ticker_name.lower()} company&lr=lang_en&hl=en"

The code can be almost the same as the Google News scraper; the difference is in the query, which does not include the tbm=nws parameter that selects Google News results.

Handle Request Limits

res = requests.get(url, headers=random_header)

We need headers for Google scraping because Google inspects information about each request. Setting headers is important to avoid hitting request limits (Google may block your IP when you exceed a certain number of requests).

We can use several methods to handle request limits when scraping Google, such as using a header-generating library or rotating user agents.

# random header library: fake_useragent generates realistic User-Agent strings
from fake_useragent import UserAgent

ua = UserAgent()
random_header = {'User-Agent': ua.random}  # a new random browser header
User-agent rotation code
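A minimal sketch of user-agent rotation is shown below; the user_agents pool and the get_with_rotation function are illustrative names, and the strings are just example browser identities.

import random
import requests

# A hand-made pool of User-Agent strings; extend it with any browsers
# you want to impersonate.
user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:115.0) Gecko/20100101 Firefox/115.0',
]

def get_with_rotation(url):
    # Pick a different User-Agent per request so successive requests
    # do not all look identical to Google.
    random_header = {'User-Agent': random.choice(user_agents)}
    return requests.get(url, headers=random_header)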

You can create your own header list and rotate the user-agent across requests, as the sketch above shows.

The most important things in Google scraping are how you set up the query URL and the headers in your code.

Once you can handle the query and the request limits, you can scrape news according to what you want.
