Extract Information From Thousands of Articles using AI

Web Scraping | Open AI | Power Automate Desktop

bedy kharisma
Published in Coinmonks
5 min read · Jan 5, 2023


Imagine you are tasked with gathering information from thousands of articles. Traditional web scraping will not work: unlike the structured data on most e-commerce sites, the information here is unstructured and scattered across paragraphs. Usually this is done manually, reading each passage one by one to pull out the desired information. For a single article the task seems simple, and OpenAI can do just that. But what if there are thousands of them? That's where the combination of Power Automate, Excel, OpenAI, and Python + Streamlit comes in handy!

In this article, I would like to demonstrate how to extract specific information about rolling stock manufacturers' contracts published on railwaypro.com, focusing on rolling stock only. Our scraping target is the URL below; the site is free and contains a lot of information about rolling stock procurement contracts all over the world:

In this article I will guide you through building your own automatic machine for grabbing information from within the articles. Here's a peek at the end result. If you are interested, keep on reading.

Step One. Web Scraping

In short, read the following article:

Or in detail: we separate this scraping process into two parts. The first is to get the links to each piece of news and store them in an Excel sheet. We do this using Power Automate Desktop (PAD), a free, no-code tool from Microsoft for automating desktop tasks such as scraping websites. The flow is straightforward: launch a browser with a specific URL, get the data from the web page (here: the title, date, and URL), launch a specified Excel sheet, write the data into it, then save it.
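For readers who prefer code over PAD's visual flow, the same link-scraping step can be sketched in plain Python. The CSS classes and markup below are hypothetical stand-ins, not railwaypro.com's real structure; adapt the pattern to the actual page source:

```python
# A rough Python equivalent of the PAD link-scraping flow (hypothetical
# markup; the real listing page will differ). It pulls each article's
# title, date, and URL out of a listing page, producing rows that could
# then be written to an Excel sheet.
import re

def extract_articles(listing_html):
    """Return (title, date, url) tuples found in a news listing page."""
    pattern = re.compile(
        r'<a href="(?P<url>[^"]+)" class="article">(?P<title>[^<]+)</a>'
        r'\s*<span class="date">(?P<date>[^<]+)</span>'
    )
    return [(m.group("title"), m.group("date"), m.group("url"))
            for m in pattern.finditer(listing_html)]

# Tiny sample page standing in for the fetched listing HTML:
sample = (
    '<a href="https://www.railwaypro.com/example-contract" class="article">'
    'Example contract awarded</a> <span class="date">Jan 3, 2023</span>'
)
rows = extract_articles(sample)
```

In practice an HTML parser such as BeautifulSoup is more robust than a regex, but the shape of the loop is the same: find each article entry, collect its fields, append a row.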


The next step is to let PAD read each article from the previously scraped URLs. It is basically a loop: read each URL stored in the Excel sheet, open it in a new browser, scrape the whole text of the page, and store it back in the Excel sheet.
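The loop above can be sketched as a small function. The fetch step is injected as a parameter so the loop logic reads the same whether the real fetcher is requests, urllib, or a browser driver (the stub fetcher here is just for illustration):

```python
# Sketch of the second PAD loop in plain Python: visit each stored URL
# and keep the page text next to its URL, continuing past failures so
# one broken page does not stop the run.
def scrape_articles(urls, fetch):
    """Return {url: page_text} for every URL read from the sheet."""
    results = {}
    for url in urls:
        try:
            results[url] = fetch(url)
        except Exception as exc:  # record the error and keep going
            results[url] = f"ERROR: {exc}"
    return results

# Example with a stub fetcher standing in for the real HTTP call:
pages = scrape_articles(
    ["https://www.railwaypro.com/a", "https://www.railwaypro.com/b"],
    fetch=lambda url: f"article text from {url}",
)
```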

Step Two. BUILD YOUR OWN AI WEBPAGE

The second step is to build an OpenAI-backed web page that PAD can scrape the answers from. This step is necessary because OpenAI's own web UI is a chat-like conversation, which makes it difficult for PAD to distinguish which part of the page needs to be grabbed.

To do this, you will need an API key from OpenAI. Google it; there are tons of tutorials on how to get one. Once you have it, here's the script to copy and paste:

# Set environment
import requests
import streamlit as st

# Set the API endpoint and your API key
url = "https://api.openai.com/v1/completions"
api_key = "YOUR OPEN AI API KEY"

# Build the page: a title and a box for the question
st.markdown("# Extract Information from Article")
question = st.text_area("Input Question", "", height=100)

# Set the request headers
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}",
}

# Set the request data
data = {
    "model": "text-davinci-003",
    "prompt": question,
    "max_tokens": 1000,
    "temperature": 0.1,
}

# Send the request and store the response
response = requests.post(url, headers=headers, json=data)

# Parse the response
response_data = response.json()

# Extract the text from the response
text = response_data["choices"][0]["text"]

# Write the response to the page
st.text_area(label="Response", value=text, height=500)

Once run with the command “streamlit run chatgpt-module.py”, this script will open a web browser at http://localhost:8501/

The top text box lets you input your question; the bottom one displays the AI's response.

Step Three. Automate the Q&A Process

In this step we already have all the components; what is left is to turn the human-to-AI conversation into a machine-to-AI one. We need to tailor the question to make it easy for the AI to provide answers. This is done inside the Excel sheet: a simple formula like =CONCAT([@insert],[@Context]) will do just fine, where Context is the column holding the article text and insert is the column holding the question template. This is the template I use:

In list format. can you get the following attribute (leave N/A if not available)
[Country,
Year,
Date of Contract,
First Delivery Date,
Type,
Brand,
vehicle description,
Maximum Design Speed (km/hr),
Carbody material,
Power (kW),
Manufacturer,
Total Project Value Currency,
Total Project Value,
Trains Configuration,
Buyer,
Funder,
] from the following article :

You may customize this to get your desired information extracted from the article.
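The same prompt assembly the =CONCAT formula performs in Excel looks like this in Python, which can be handy if you want to tweak the template programmatically (the attribute list here is shortened to three fields for brevity):

```python
# Build one question per article by prepending the instruction template
# (the Excel equivalent of =CONCAT([@insert],[@Context])).
INSERT = (
    "In list format. can you get the following attribute "
    "(leave N/A if not available) [Country, Year, Manufacturer] "
    "from the following article : "
)

def build_question(article_text):
    """Prepend the instruction template to one scraped article."""
    return INSERT + article_text

question = build_question("Alstom signed a contract in 2022 with ...")
```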

Once finished, PAD it! The process is basically a back-and-forth between the Excel sheet and the browser: PAD gets the question from the Excel sheet, pastes it into the question box on the website, gives the AI some time to produce the answer, copies the answer, and writes it back to the Excel sheet.
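If you are already running the Python side, the browser round-trip can be skipped entirely by calling the completion endpoint directly in a loop over the sheet's questions. The API call is stubbed out below so the loop can be shown without a live key; swap in the real request from the Streamlit script above:

```python
# Answer every question in sheet order. `ask` is injected so this
# sketch runs without a live API key; replace the stub with the real
# requests.post(...) call from the Streamlit script.
def answer_all(questions, ask):
    """Return one answer per question, preserving sheet order."""
    return [ask(q) for q in questions]

answers = answer_all(
    ["Question for article 1", "Question for article 2"],
    ask=lambda q: f"AI answer to: {q}",  # stand-in for the API call
)
```

The answers come back in the same order as the questions, so writing them into the sheet row by row is a one-to-one mapping.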

If there is a website or information you would like extracted, DM me on LinkedIn: https://www.linkedin.com/in/bedy-kharisma/


Coinmonks is a non-profit Crypto Educational Publication.

Written by bedy kharisma

Indonesian strategist, analyst, researcher, and achievement addict who helps companies grow their business by applying data-driven management.