GoogleNews API—Live News from Google News using Python

Mansi Dhingra · Analytics Vidhya · Jun 17, 2020

Hello everyone. I’ve got something exciting: a much easier way to search news from all over the world. We know how to scrape sites with BeautifulSoup and Selenium to get news articles, but with this API you don’t need to restrict yourself to a couple of sources or publications.

Let me tell you about a rather silly way I used to work, so that you don’t make the same mistake. But they say being stupid can be good for your career. I was given a project to scrape news articles, apply some data visualization to them, and create a dashboard. If you want a story on how I created that dashboard, I would love to share it with you. So, back to the story: I didn’t know back then that there was such an easy way to fetch live news articles. I stuck to a specific set of publications, went to every publication’s site, applied BeautifulSoup and Selenium for every search query to get the links, and then used the newspaper module to parse the articles. Only after I was done with the project did I come across this API, which makes life much easier and is far less time-consuming than my earlier approach, while producing results that are just as accurate, if not more so. So, let’s go ahead without further wasting your precious time.

The first task is to install GoogleNews, along with newspaper3k for parsing the articles. Both can be installed with the following commands.

!pip install GoogleNews

!pip install newspaper3k

It’s time to import the libraries we need to fetch the list of articles.

from GoogleNews import GoogleNews
from newspaper import Article
import pandas as pd

The next task is to fetch the data. Given the time period and the query you want to search news articles for, we get back a list in which each entry contains the date, title, media (publication), description, the link where the article is published, and the link of its image under the img attribute.

You will see in the following code that googlenews.result() returns a list containing everything we discussed above.

googlenews = GoogleNews(start='05/01/2020', end='05/31/2020')
googlenews.search('Coronavirus')
result = googlenews.result()
df = pd.DataFrame(result)
print(df.head())
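To double-check what came back, you can print the dataframe’s columns. The field names in the comment below are what current versions of GoogleNews typically return; treat them as an assumption, since they may differ in your version.

# Inspect the fields returned by googlenews.result()
print(df.columns.tolist())
# Typically something like (an assumption; varies by version):
# ['title', 'media', 'date', 'datetime', 'desc', 'link', 'img']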

We have specified the time period through the start and end parameters, and we pass the query to the search function. This API limits how many results each request returns: a page holds at most 10 news items. To gather more than that, we fetch the subsequent pages.

for i in range(2, 20):
    googlenews.getpage(i)
result = googlenews.result()
df = pd.DataFrame(result)

You can extend the range in the for loop to request as many pages as you like; once every article matching the search in that time period has been collected, further pages simply return nothing new.
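If you would rather not guess a page count, here is a minimal sketch that keeps requesting pages until no new results arrive. It assumes, as the loop above does, that result() accumulates articles across getpage() calls; the cap of 50 pages is just an illustrative safety limit, not an API constant.

seen = 0
for i in range(2, 50):  # 50 is an arbitrary safety cap, not an API limit
    googlenews.getpage(i)
    result = googlenews.result()
    if len(result) == seen:  # no new articles came back, so stop paging
        break
    seen = len(result)
df = pd.DataFrame(result)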

[Image in the original post: a sample of the results for 20 articles from the fetched publications.]

Note that you are not getting the full article with this module: the description attribute from GoogleNews gives only a short snippet, not the full text. So we will extract and parse the full article with the newspaper3k package. We have already imported Article above.

articles = []  # the original code appended to an uninitialized name, so we create the list first
for ind in df.index:
    entry = {}
    article = Article(df['link'][ind])
    article.download()
    article.parse()
    article.nlp()  # builds the summary; needs nltk's punkt tokenizer (downloaded in the full code below)
    entry['Date'] = df['date'][ind]
    entry['Media'] = df['media'][ind]
    entry['Title'] = article.title
    entry['Article'] = article.text
    entry['Summary'] = article.summary
    articles.append(entry)
news_df = pd.DataFrame(articles)
news_df.to_excel("articles.xlsx")

We iterate through the dataframe created earlier and build a new one that also contains the full article text and a summary. The link column in the previous dataframe holds the URL where each article is published, and through the Article class we can fetch the full text and a summary. The title column is already there, so you can take it either from the previous dataframe or from the parsed article. Finally, we can write this dataframe out to an Excel sheet.
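In practice some links will still fail to download (403 errors, timeouts), and a single failure crashes the whole loop. Here is a minimal sketch of the same loop with basic error handling; the try/except wrapper is an addition of mine, not part of the original flow, and simply skips articles that cannot be fetched.

rows = []
for ind in df.index:
    try:
        article = Article(df['link'][ind])
        article.download()
        article.parse()
        article.nlp()
    except Exception as exc:  # e.g. newspaper's ArticleException on a 403
        print(f"Skipping {df['link'][ind]}: {exc}")
        continue
    rows.append({
        'Date': df['date'][ind],
        'Media': df['media'][ind],
        'Title': article.title,
        'Article': article.text,
        'Summary': article.summary,
    })
news_df = pd.DataFrame(rows)
news_df.to_excel("articles.xlsx")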

Here’s the full code:

from GoogleNews import GoogleNews
from newspaper import Article
from newspaper import Config
import pandas as pd
import nltk

nltk.download('punkt')  # required by article.nlp() to build summaries

# A browser user agent lets us access pages that would otherwise refuse us;
# without it, some sites return a 403 client error when we download the article.
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'
config = Config()
config.browser_user_agent = user_agent

googlenews = GoogleNews(start='05/01/2020', end='05/31/2020')
googlenews.search('Coronavirus')
result = googlenews.result()
df = pd.DataFrame(result)
print(df.head())

for i in range(2, 20):
    googlenews.getpage(i)
result = googlenews.result()
df = pd.DataFrame(result)

articles = []
for ind in df.index:
    entry = {}
    article = Article(df['link'][ind], config=config)
    article.download()
    article.parse()
    article.nlp()
    entry['Date'] = df['date'][ind]
    entry['Media'] = df['media'][ind]
    entry['Title'] = article.title
    entry['Article'] = article.text
    entry['Summary'] = article.summary
    articles.append(entry)
news_df = pd.DataFrame(articles)
news_df.to_excel("articles.xlsx")
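To pick the exported sheet up again later, pandas can read it straight back. Note that pandas relies on the openpyxl package for .xlsx files, so install it if it is missing.

# pip install openpyxl  # pandas' engine for .xlsx files
saved = pd.read_excel("articles.xlsx")
print(saved.head())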

Hope it helps. Happy coding!
