Extract information from thousands of articles: this time, no API

Python | Haystack | Web Scraping

bedy kharisma
Data And Beyond
6 min read · May 11, 2023


Photo by Christopher Gower on Unsplash

With the increasing amount of information on the internet, it can be challenging to extract specific information from articles, and manually searching through them is time-consuming and inefficient. Luckily, web scraping is easy with Power Automate Desktop; check this article if you want to see how:

But web scraping and extracting information from the scraped articles are two different things. Sure, the extraction can be done using OpenAI; check this if you want to know how:

But since that approach relies on the OpenAI API, it will be costly, which made me wonder whether there is a free alternative. In this article, I want to show you one.

In this article, we will explore how to extract information from articles using the Haystack framework in Python, together with a free LLM available on Hugging Face. Haystack is an open-source Python framework that provides tools for building scalable search platforms. It is built on popular deep learning frameworks such as PyTorch and TensorFlow, and in addition to text extraction it provides functionality for document storage, retrieval, and QA systems.

In this tutorial, we will walk through a script that extracts information from a collection of articles. We will use Haystack's PromptNode for question answering, and then use fuzzy string matching to standardize the specific information we extract from the articles.

First, we need to install the required packages using pip: pandas for data manipulation, numpy for numerical computation, farm-haystack for the Haystack framework, and fuzzywuzzy for fuzzy string matching.

!pip install -q pandas
!pip install -q numpy
!pip install -q farm-haystack
!pip install -q fuzzywuzzy

Next, we will import the required modules. In this example we import the FAISSDocumentStore and several other components provided by Haystack, although the final script only really uses the PromptNode. I lost track of which imports are strictly required and which are not, so, lazy as I am, I might as well leave them as they are. As we all know: if it works, don't touch it. Leave a comment if you know which ones are truly required and which are not.

import pandas as pd
import os
import re

# Haystack and supporting components (not all of these end up being used)
from haystack import Pipeline
from haystack.nodes import TextConverter, PreProcessor
from haystack.document_stores import FAISSDocumentStore
from haystack.pipelines.standard_pipelines import TextIndexingPipeline
from sentence_transformers import SentenceTransformer
from haystack.nodes import EmbeddingRetriever, FARMReader, PromptNode
from haystack.pipelines import ExtractiveQAPipeline

Next, we need to define which LLM we are going to use. Here we will use "google/flan-t5-base". There are other options, which you can check on this site.

prompt_node = PromptNode(model_name_or_path="google/flan-t5-base", use_gpu=True)
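
As a quick sanity check (my own addition, not part of the original script), you can send the node a throwaway prompt; PromptNode.prompt() returns a list of generated strings:

# quick sanity check: the node should return a list with one generated string
result = prompt_node.prompt("What is the capital of France? Answer:")
print(result)  # something like ['paris']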

Once that is all set, we need the data. The data I used is a bunch of articles I got from railwaypro, which carries a lot of news about the rolling stock market; the scraping itself is covered in the article linked above. A little trimming here and there was required to get a clean dataset aligned with the context of this tutorial. And to shorten the running time, I will only use the first 10 rows (there are 6K+ articles in this dataset).

# read the database and keep only the columns we need
df = pd.read_excel('railwaypro v1.xlsx')
filtered_df = df[['Title', 'URL', 'Date', 'Context']]
first_10_rows = filtered_df.head(10).copy()  # .copy() avoids SettingWithCopyWarning

# replace irregular characters
replace_lambda = lambda x: x.replace("Š", "S")
first_10_rows['Context'] = first_10_rows['Context'].apply(replace_lambda)

Once our data is clean and ready, we develop a way to read the context of each article, decide whether the article is about a rolling stock deal, and extract who will supply the rolling stock. This step then iterates through the rows of the dataset.

# loop through the dataframe
from haystack.nodes import PromptTemplate
from fuzzywuzzy import fuzz

# drop the result columns in case they are left over from a previous run
first_10_rows.drop(["answer", "RS Manufacturer"], axis=1, inplace=True, errors="ignore")

for index, row in first_10_rows.iterrows():
    # step 1: classify whether the article reports a rolling stock deal
    news = row['Context']
    prompt_text = "Does this news: {news} state that a manufacturer will supply rolling stock? Answer Yes or No; if you are not certain, answer No. Answer:"
    response = prompt_node.prompt(prompt_template=prompt_text, news=news)
    first_10_rows.loc[index, 'answer'] = str(response[0])

    # step 2: if it does, extract which manufacturer won the deal
    if str(response[0]).lower() == 'yes':
        options = ['Alstom','Altaivagon','American Railcar Industries','Amstead Maxion','Astra Vagoane Calatori','Baotou Beifang Chuangye','Belkommunmash','Bharat Earth Movers Limited','Bombardier','Bombardier Sifang (Qingdao) Transportation','Bozankaya','Brookville Equipment Corporation','Bryansk','Cad Railway Industries','CAF','Changchun Bombardier Railway Vehicles Company','China National Machinery Import & Export Co.','Chineese Civil Engineering and Construction Company','Chinese shipbuilding company Apshara','CKD Dopravni Systemy','Clayton Equipment','CRRC','CZ Loko','Demikhovsky Machine Building Plant','Dongfang Electric Corp','Durmazlar','Electro Motive Diesel','Faccns','General Electric','Golden Rock Workshop','Greenbrier','GWR','H. Cegielski','hai Phong Carriage Co','Harsco','HeiterBlick','Hitachi Rail','Hyundai rotem','ICF','Iranian Firm Wagon Pars','Jinan Railway Vehicles Equipment','JTREC','Kawasaki Heavy Industries','Kinki Sharyo','KVBZ','KVSZ','Legios Czech','Liaoning MEC','Luganks Locomotive Works','Marubeni','Metrovagonmash','Mitsubishi','Modertrans','Motive Power','National Steel Car','Newag','Nippon Sharyo','Novocherkassk Electric Locomotive Plant','OEVRZ','Oregon Iron Works','Osipovichsky plant of mechanical engineering','Pesa','Petersburg Tram Mechanical Factory','PK Transportnye Systemy','Progress Rail','Promtraktor-Vagon','PT INKA','RailConnect JV ','Rites','RJ Corman Railpower','RM Rail Russia','Serveis Ferroviaris de Mallorca','Siemens','Sinara group','Skoda','SMH Rail SDN BHD','Softronic Craiova','Solaris','Stadler','Sumitomo','Talgo','Talleres Alegria','Tatravagonka','Texmaco Rail & Engineering Ltd','TIG/m','TikhvinChemMash','TikhvinSpetsMash','Titagarh Firema','Toshiba','Transmashholding','Transnet','Tulomsas','Tver Car-Building Plant','Tver','TVZ Gredelj','UGL','United Wagon Company','Uraltransmash','Uralvagonzavod','Ust-Katav Wagon-Building Plant','Vagonmas','Vivarail','Vossloh Locomotives','Waggonbau Niesky','Wagon Pars Co','Wagony Świdnica','Woojin']
        query = "which manufacturer will supply the rolling stock?"
        best_match = None
        prompt_template = "Choose one manufacturer from the following options: {} to answer this query: {} based on the given news: {}. Answer:"
        max_prompt_length = 400

        # split the options into slices of three so each prompt stays short
        option_slices = [options[i:i + 3] for i in range(0, len(options), 3)]
        prompt_chunks = []
        for option_slice in option_slices:
            option_string = ', '.join(option_slice)
            prompt_chunk = prompt_template.format(option_string, query, news)
            if len(prompt_chunk) > max_prompt_length:
                prompt_chunk = prompt_chunk[:max_prompt_length]  # truncate to fit the model input
            prompt_chunks.append(prompt_chunk)

        # prompt for each chunk and keep the option with the best fuzzy match
        for prompt_chunk in prompt_chunks:
            answer = str(prompt_node.prompt(prompt_template=prompt_chunk)[0])
            for option in options:
                ratio = fuzz.token_set_ratio(answer, option)
                if best_match is None or ratio > fuzz.token_set_ratio(answer, best_match):
                    best_match = option

        first_10_rows.loc[index, 'RS Manufacturer'] = best_match

The loop basically has two steps. The first step is to decide whether the article contains the specific information we are looking for, i.e., a rolling stock deal.

news = row['Context']
prompt_text = "Does this news: {news} state that a manufacturer will supply rolling stock? Answer Yes or No; if you are not certain, answer No. Answer:"
response = prompt_node.prompt(prompt_template=prompt_text, news=news)
first_10_rows.loc[index, 'answer'] = str(response[0])
if str(response[0]).lower() == 'yes':

The second step is to extract who won the deal. To standardize the answer, we define the possible rolling stock manufacturers as a list (the full list is shown in the loop above); the prompt then reads who won the deal and picks a name from that list, so the answers stay consistent across articles.

# `options` is the full manufacturer list defined in the loop above
query = "which manufacturer will supply the rolling stock?"
best_match = None
prompt_template = "Choose one manufacturer from the following options: {} to answer this query: {} based on the given news: {}. Answer:"
max_prompt_length = 400

option_slices = [options[i:i + 3] for i in range(0, len(options), 3)]
prompt_chunks = []
for option_slice in option_slices:
    option_string = ', '.join(option_slice)
    prompt_chunk = prompt_template.format(option_string, query, news)
    if len(prompt_chunk) > max_prompt_length:
        prompt_chunk = prompt_chunk[:max_prompt_length]
    prompt_chunks.append(prompt_chunk)

# prompt for each chunk and find the best match
for prompt_chunk in prompt_chunks:
    answer = str(prompt_node.prompt(prompt_template=prompt_chunk)[0])
    for option in options:
        ratio = fuzz.token_set_ratio(answer, option)
        if best_match is None or ratio > fuzz.token_set_ratio(answer, best_match):
            best_match = option

first_10_rows.loc[index, 'RS Manufacturer'] = best_match
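
To see why token_set_ratio suits this job, here is a quick standalone illustration (the strings are made up for the example): it compares word sets, so even a verbose model answer still scores highest against the correct option.

from fuzzywuzzy import fuzz

answer = "The manufacturer Stadler will supply the trains"
# "Stadler" is fully contained in the answer's token set, so the score is 100
print(fuzz.token_set_ratio(answer, "Stadler"))
# an unrelated option shares almost no tokens, so it scores much lower
print(fuzz.token_set_ratio(answer, "Alstom"))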

The results are written back into the dataframe in the answer and RS Manufacturer columns.
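
If you want to inspect or keep the enriched dataframe, a couple of lines will do (the output filename below is arbitrary):

# inspect the classification and the extracted manufacturer per article
print(first_10_rows[['Title', 'answer', 'RS Manufacturer']])

# optionally save the enriched dataframe next to the source file
first_10_rows.to_excel('railwaypro_extracted.xlsx', index=False)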

The result is quite satisfying: the script is able to select which articles contain the information we need, with just a few drawbacks. Sometimes, when an article is not what we are looking for, the model returns a summary instead of a No answer. The manufacturer names, however, are answered correctly.
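
One cheap mitigation, which is my own addition rather than part of the original script, is to count only answers that clearly start with "yes", so summaries fall through to the negative branch:

def is_yes(response_text: str) -> bool:
    # only an answer that clearly starts with "yes" counts as a positive
    return response_text.strip().lower().startswith('yes')

print(is_yes('Yes'))                                  # True
print(is_yes('The article summarizes a tender...'))   # False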

In conclusion, this approach is a free alternative for extracting information from thousands of articles. The main drawback is that we have to define the prompts one by one, since a single prompt can only extract one piece of information at a time, as the sketch below illustrates. So although this alternative is costless, it is still time-consuming. In my personal opinion, the OpenAI API is still the best LLM available on the market.
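
For instance, pulling a second field such as the number of vehicles ordered would mean another full pass with its own prompt (the prompt wording and the Quantity column here are hypothetical, following the same pattern as above):

# a second extraction pass: one prompt per field (hypothetical wording)
qty_prompt = "Based on this news: {news}, how many vehicles will be supplied? Answer:"
for index, row in first_10_rows.iterrows():
    response = prompt_node.prompt(prompt_template=qty_prompt, news=row['Context'])
    first_10_rows.loc[index, 'Quantity'] = str(response[0])

Every extra field costs another pass over all the articles, which is exactly where the time goes.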
