Summarizing Google News Headlines Using Vertex AI PaLM API & LangChain

Choirul Amri
Google Cloud - Community
6 min read · Oct 1, 2023

Google News aggregates articles from a large number of news outlets in many languages. Reading a concise summary of each article can save time compared with scanning every full article individually.

In this article, we will perform news summarization of content gathered from Google News using the following components:

  • GNews API: Collect news titles & metadata from Google News
  • LangChain’s UnstructuredURLLoader: Retrieve news content
  • Google Cloud’s Vertex AI PaLM API: Generate news summaries

Google News does not provide an official API, so we will use GNews, a third-party open-source package created by ranahaani.

The Vertex AI PaLM API provides access to large language models (LLMs) that can be used for a variety of tasks, including text summarization. In this tutorial, we will use the text-bison@001 model from the PaLM API to summarize news content.

Credit to the following resources:

Jump into the full code on GitHub:

How it works

The following drawing depicts the conceptual flow for news summarization. First, we use the GNews API to get the news metadata. The most important attribute is the news URL, which is the address used to fetch the full article.

Then, we use LangChain’s UnstructuredURLLoader to get the full article. Next, we pass the article to the PaLM model, a large language model that can understand and generate text. The model generates a summary of the news article, which we return to the user.

image by author
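
The flow described above can be sketched as a small function that wires the stages together. This is only an illustrative outline, not the code used later in this article: the names summarize_news, load_article and summarize are hypothetical, and the two callables stand in for UnstructuredURLLoader and the PaLM model so that the sketch runs without network access.

```python
from typing import Callable, Dict, Iterable, List


def summarize_news(
    items: Iterable[Dict[str, str]],
    load_article: Callable[[str], str],
    summarize: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Run each news item through load -> summarize and collect the results."""
    results = []
    for item in items:
        # Fetch the full article text from the URL found in the metadata
        article_text = load_article(item["url"])
        # Ask the model (here: any callable) for a summary
        results.append({"title": item["title"], "summary": summarize(article_text)})
    return results


# Stand-ins for the real loader and model, for demonstration only
fake_items = [{"title": "Example headline", "url": "https://example.com/a"}]
demo = summarize_news(
    fake_items,
    load_article=lambda url: f"Full text fetched from {url}",
    summarize=lambda text: text[:9],
)
print(demo)
```

Keeping the loader and the model behind plain callables makes it easy to test the flow before any cloud credentials are configured.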

Installation & Preparation

Install the following packages:

#install all required packages
!pip -q install langchain
!pip install google-cloud-aiplatform
!pip install gnews
!pip install unstructured

Once installation is complete, import the following components:

# import required packages
from langchain.llms import VertexAI
from langchain.chains.summarize import load_summarize_chain
from langchain.prompts import PromptTemplate
from langchain.document_loaders import UnstructuredURLLoader

Calling GNews API to Get News Metadata

I came across GNews, an open-source Python package for accessing Google News. It lets you search for articles by keyword, topic, location, or date range. Commercial alternatives such as serper and newsapi are also available.

A simple call to the get_news method retrieves news metadata from Google News. The most common use is searching by keyword. The following snippet searches for news using “MRT Jakarta” as the keyword.

#get by keyword
from gnews import GNews

google_news = GNews()
news_by_keyword = google_news.get_news('MRT Jakarta')

We can also set various attributes on the google_news object to filter which news is retrieved. For example, we can control the time period, language, and date range of the news.

from gnews import GNews
google_news = GNews()
google_news.period = '1d' # News from last 1 day
google_news.max_results = 5 # maximum number of results per keyword
google_news.country = 'ID' # News from a specific country = Indonesia
google_news.language = 'id' # News in a specific language = Bahasa Indonesia
google_news.exclude_websites = ['yahoo.com', 'cnn.com', 'msn.com'] # Exclude news from specific websites, e.g. yahoo.com, cnn.com and msn.com

#use date range if required
#google_news.start_date = (2023, 1, 1) # Search from 1st Jan 2023
#google_news.end_date = (2023, 4, 1) # Search until 1st April 2023

news_by_keyword = google_news.get_news('MRT Jakarta')

The news period can be limited using the following time units:

  • h = hours (eg: 12h)
  • d = days (eg: 7d)
  • m = months (eg: 6m)
  • y = years (eg: 1y)

Example:

google_news.period = '3d' # News from last 3 days
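
A typo in the period string is easy to miss, so it can be worth validating the value before assigning it. The helper below is a hypothetical sketch and not part of GNews; it only checks the number-plus-unit pattern listed above.

```python
import re

# Units listed above: hours, days, months, years
_UNITS = {"h": "hours", "d": "days", "m": "months", "y": "years"}
_PERIOD_RE = re.compile(r"(\d+)([hdmy])")


def parse_period(period: str):
    """Split a period string such as '7d' into (7, 'days'); raise ValueError otherwise."""
    match = _PERIOD_RE.fullmatch(period)
    if match is None:
        raise ValueError(f"Unsupported period: {period!r}")
    amount, unit = match.groups()
    return int(amount), _UNITS[unit]


print(parse_period("3d"))
print(parse_period("12h"))
```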

Retrieve top news

GNews provides the get_top_news() method to get the top news articles from Google News. We can pass optional arguments to GNews when initializing a new google_news object.

The following snippet shows how to get the top 10 news articles from Google News in Indonesian, from Indonesia, for the last 7 days:

# get top news from the last 7 days
google_news = GNews(language='id', country='ID', period='7d',
                    start_date=None, end_date=None, max_results=10)
top_news = google_news.get_top_news()

# check collected news metadata
top_news
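
Each collected entry is a dictionary. The fields used later in this article are title, url, publisher (with a nested title) and published date. The sketch below shows how such a record can be read; the title, publisher and date are copied from the sample output shown later in this article, while the URL is a placeholder and the helper name describe is hypothetical.

```python
from email.utils import parsedate_to_datetime

# Hand-written record for illustration; the URL is a placeholder, not a real link
sample_item = {
    "title": "Tarif Kereta Cepat Jakarta-Kota Bandung Rp 350 Ribu, Mahal? - CNBC Indonesia",
    "url": "https://example.com/placeholder-article",
    "publisher": {"title": "CNBC Indonesia"},
    "published date": "Sat, 30 Sep 2023 09:45:00 GMT",
}


def describe(item):
    """Return (publisher, publication datetime, url) from one metadata record."""
    # The date is an RFC 2822 style string, which the standard library can parse
    published = parsedate_to_datetime(item["published date"])
    return item["publisher"]["title"], published, item["url"]


publisher, published, url = describe(sample_item)
print(publisher, published.isoformat(), url)
```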

Retrieve news by topic

Finally, we can use the get_news_by_topic() method to get news metadata by topic. The following topics are available in Google News:

  • WORLD
  • NATION
  • BUSINESS
  • TECHNOLOGY
  • ENTERTAINMENT
  • SPORTS
  • SCIENCE
  • HEALTH

The following snippet shows how to get the top 5 news articles for the “NATION” topic in Indonesian, from Indonesia, for the last 7 days, excluding news articles from Yahoo.com and MSN.com:

#collect metadata by news topic
google_news = GNews(language='id', country='ID',
                    period='7d', start_date=None, end_date=None,
                    max_results=5, exclude_websites=['yahoo.com', 'msn.com'])
news_by_topic = google_news.get_news_by_topic('NATION')

#check collected news
news_by_topic

Extract news content

The next step is to get the full article content for each news URL. We will use UnstructuredURLLoader from the LangChain library to load the full content from a URL. This loader is a wrapper around the HTML partitioning functionality of the Unstructured library.

#test to extract content from url inside news_by_topic
urls = [news_by_topic[0]['url'],
        news_by_topic[1]['url'],
        ]
loader = UnstructuredURLLoader(urls=urls)
content = loader.load()
#check news content
content
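
The loader returns a list of LangChain Document objects, each with a page_content string and a metadata dictionary. Some pages may yield little or no text, for example when the site blocks automated access. The following sketch filters those out before summarization; Doc is a stand-in class so the example runs without LangChain, and both keep_usable and the 200-character threshold are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Doc:
    """Stand-in for a LangChain Document (same two attribute names)."""
    page_content: str
    metadata: Dict[str, str] = field(default_factory=dict)


def keep_usable(docs: List[Doc], min_chars: int = 200) -> List[Doc]:
    """Drop documents whose extracted text is shorter than min_chars."""
    return [d for d in docs if len(d.page_content.strip()) >= min_chars]


docs = [
    Doc("x" * 500, {"source": "https://example.com/ok"}),
    Doc("   ", {"source": "https://example.com/empty"}),
]
usable = keep_usable(docs)
print([d.metadata["source"] for d in usable])
```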

Summarize News with Vertex AI PaLM API

The last step is to call text-bison@001 to generate the news summary. We need to supply a prompt that tells the model how to summarize the text.

Prompting

Correct prompting is essential for getting accurate results from an LLM. Set prompt_template to a prompt that tells the model to generate a news summary following these rules:

  1. Summary consists of maximum 100 words
  2. If the text cannot be found or error, return: “Content empty”
  3. Use only materials from the text supplied
  4. Create summary in Bahasa Indonesia

#prompting to perform news summary
prompt_template = """Generate summary for the following text, using the following steps:
1. summary consists of maximum 100 words
2. If the text cannot be found or error, return: "Content empty"
3. Use only materials from the text supplied
4. Create summary in Bahasa Indonesia

"{text}"
SUMMARY:"""

prompt = PromptTemplate.from_template(prompt_template)

#declare LLM model
llm = VertexAI(temperature=0.1,
               model_name='text-bison@001',  # the LangChain wrapper expects model_name
               top_k=40,
               top_p=0.8,
               max_output_tokens=512)  # note the plural parameter name
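
Before sending anything to the model, it can help to look at the exact text the template produces. Since the template has a single {text} placeholder, plain str.format is enough for a preview, with no LangChain or Vertex AI call involved. The template is repeated here so the block runs on its own; preview_prompt is a hypothetical helper and the article sentence is a made-up sample.

```python
# Copy of the template above so this block runs on its own
prompt_template = """Generate summary for the following text, using the following steps:
1. summary consists of maximum 100 words
2. If the text cannot be found or error, return: "Content empty"
3. Use only materials from the text supplied
4. Create summary in Bahasa Indonesia

"{text}"
SUMMARY:"""


def preview_prompt(article_text: str) -> str:
    """Fill the {text} placeholder with the article text."""
    return prompt_template.format(text=article_text)


# Made-up sample sentence, used only to show where the article text lands
filled = preview_prompt("MRT Jakarta memperpanjang jam operasional.")
print(filled)
```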

Generate_Summary Function

Wrap the summarization process in a function that loops over a collection of news URLs. The generate_summary function performs the following:

  • Retrieve the news content from each URL
  • Generate a summary for each article
  • Print the output

# create function to generate news summary based on list of news urls
# Load URL, get news content and summarize
def generate_summary(docnews):
    for item in docnews:
        #extract news content
        loader = UnstructuredURLLoader(urls=[item['url']])
        data = loader.load()

        #summarize using stuff for easy processing
        chain = load_summarize_chain(llm,
                                     chain_type="stuff",
                                     prompt=prompt)
        summary = chain.run(data)

        #show summary for each news headlines
        print(item['title'])
        print(item['publisher']['title'], item['published date'])
        print(summary, '\n')
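
A failed download or model call for one article would stop the whole loop. One possible way to make the batch more tolerant is to catch errors per item and report them at the end. The sketch below assumes a hypothetical per-item function summarize_one that raises on failure; fake_summarize is a stand-in for the loader and chain so the example runs offline.

```python
from typing import Callable, Dict, List, Tuple


def summarize_all(
    items: List[Dict[str, str]],
    summarize_one: Callable[[Dict[str, str]], str],
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    """Return (successes, failures); each entry pairs a title with a summary or an error."""
    successes, failures = [], []
    for item in items:
        try:
            successes.append((item["title"], summarize_one(item)))
        except Exception as exc:  # broad on purpose: loaders and APIs raise many types
            failures.append((item["title"], str(exc)))
    return successes, failures


def fake_summarize(item):
    """Demo stand-in that fails for one URL."""
    if "broken" in item["url"]:
        raise RuntimeError("could not load page")
    return "ringkasan"


ok, failed = summarize_all(
    [
        {"title": "A", "url": "https://example.com/a"},
        {"title": "B", "url": "https://example.com/broken"},
    ],
    fake_summarize,
)
print(ok, failed)
```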

Let’s call the generate_summary function, passing news_by_keyword from the previous steps.

#call the function and generate summary for news by keyword
generate_summary(news_by_keyword)

Sample output of the news summarization is shown below:

Tarif Kereta Cepat Jakarta-Kota Bandung Rp 350 Ribu, Mahal? - CNBC Indonesia
CNBC Indonesia Sat, 30 Sep 2023 09:45:00 GMT
Kereta Cepat Jakarta-Bandung akan diresmikan pada 2 Oktober 2023. Tarifnya diperkirakan sekitar Rp300.000-Rp350.000 untuk kelas ekonomi. Setelah uji coba gratis, tiket akan dikenakan biaya. Presiden Jokowi ingin harga tiket terjangkau dan bisa didiskon untuk menarik minat masyarakat.
Kemenparekraf dikatakan Dessy senantiasa mendorong pelaku industri agar mulai membuat paket-paket perjalanan wisata dengan memasukkan kereta cepat sebagai salah satu daya tarik ataupun transportasi pilihan.

Kereta Cepat Diharapkan Bantu Geliatkan Kunjungan Wisatawan … - Republika Online
Republika Online Sat, 30 Sep 2023 14:40:05 GMT
Kereta cepat Jakarta-Bandung akan resmi beroperasi mulai 2 Oktober 2023. Kereta cepat ini diharapkan dapat memberi dampak pada peningkatan kualitas sektor pariwisata dan ekonomi kreatif di Tanah Air.
Khususnya, aksesibilitas wisatawan menuju berbagai destinasi dan sentra ekonomi kreatif di Jawa Barat.
Deputi Bidang Kebijakan Strategis Kemenparekraf, Dessy Ruhati meyakini kehadiran Whoosh dapat memperkuat capaian target wisatawan baik nusantara maupun mancanegara di tahun 2023
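
The prompt asks for at most 100 words and for the text "Content empty" when nothing usable is found, but the model is not guaranteed to follow either rule. A small post-check such as the following sketch can flag summaries that need a second look. The function name check_summary is hypothetical, and the test sentence is taken from the sample output above.

```python
def check_summary(summary: str, max_words: int = 100) -> dict:
    """Report whether a summary is the 'Content empty' marker and whether it fits the word limit."""
    text = summary.strip()
    word_count = len(text.split())
    return {
        # Tolerate surrounding quotes and case differences in the marker
        "is_empty": text.strip('"').lower() == "content empty",
        "word_count": word_count,
        "within_limit": word_count <= max_words,
    }


report = check_summary("Kereta cepat Jakarta-Bandung akan resmi beroperasi mulai 2 Oktober 2023.")
print(report)
print(check_summary('"Content empty"'))
```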

We can make the same call for news_by_topic as below:

#call the function and generate summary for news by topics
generate_summary(news_by_topic)

The source code is available on GitHub, or you can run it directly in Google Colab.

Choirul Amri
Google Cloud - Community

Cloud Customer Engineer @Google. Data Enthusiast. Stories are my own opinion