This is the first of four stories that aim to address the issue of identifying disease outbreaks by extracting news headlines from popular news sources.
News headlines were chosen as the basis of the unsupervised clustering algorithm because they present information concisely. Although sifting through the full news stories would yield more geographic locations, the additional computational processing and bandwidth outweigh the benefit. To find the most popular news sources, two Feedspot articles (Top 100 USA News Websites and Top 100 World News Websites) listing the top hundred news sources in the US and the world were chosen; these articles are updated regularly, ensuring that the lists stay current. This method does introduce bias, as some news websites, such as The Los Angeles Times, mostly report news about their own city and would create skewed clusters; news websites tied to a particular city were therefore removed.
A detailed explanation of the code is provided below.
Step 1: Install and Import the Relevant Libraries
!pip install bs4
!pip install django

from bs4 import BeautifulSoup as bs   # HTML parsing
import urllib.request                 # fetching web pages
from django.core.validators import URLValidator        # validating extracted links
from django.core.exceptions import ValidationError
import time
from datetime import datetime
import boto3                          # AWS S3 access
Step 2: Initialize and Empty the S3 Bucket (Optional)
Using the boto3 library and the AWS account credentials, an empty S3 object is written to the bucket, clearing out any headlines from a previous run. This step is not necessary if AWS integration is not required.
s3 = boto3.resource(
    's3',
    region_name='us-east-1',
    aws_access_key_id=*HIDDEN*,
    aws_secret_access_key=*HIDDEN*
)

# Overwrite the existing object with an empty string to clear old headlines.
content = ""
s3.Object('headlines', 'headline.txt').put(Body=content)
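To confirm the object was cleared, it can be read back; a minimal sanity check, assuming the same 'headlines' bucket and 'headline.txt' key used above:

# Optional check: read the object back and confirm it is now empty.
obj = s3.Object('headlines', 'headline.txt').get()
print(len(obj['Body'].read()))  # expected: 0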
Step 3: Create a Function that Extracts Websites
First, using the urllib library, the code opens the URL and reads the website’s content as HTML. Next, the HTML is parsed with the Beautiful Soup library, and hyperlinks are extracted by finding the “a” tags in the HTML. Using Django’s URL validator, each hyperlink is validated, filtered against the exception URL, and added to an array. This array of URLs is then returned.
def extractWebsites(url, exceptionURL):
    try:
        # Open the URL and read the raw HTML.
        webUrl = urllib.request.urlopen(url)
        data = webUrl.read()
        soup = bs(data, 'html.parser')
        arr = []
        validate = URLValidator()
        # Collect every valid hyperlink that does not contain the exception URL.
        for link in soup.find_all('a'):
            href = link.get('href')
            if href is not None:
                try:
                    validate(href)
                    if href.find(exceptionURL) == -1:
                        arr.append(href)
                except ValidationError:
                    continue
        return arr
    except Exception:
        return []
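As a quick illustration, the function can be called on any page; the URL below is only an example and requires network access:

# Illustrative call; any news-site homepage could be used here.
links = extractWebsites('https://www.reuters.com', 'interactive')
print(len(links))
print(links[:5])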
Step 4: Create a Function that Searches a URL for Specific Key Words
The function iterates through the array of keywords and returns True if any keyword appears in the URL, checking both capitalized and lowercase forms.
def search_array(url, arrOfPoss):
    # Return True if the URL contains any keyword, capitalized or lowercase.
    for index in arrOfPoss:
        if url.find(index.capitalize()) != -1 or url.find(index.lower()) != -1:
            return True
    return False
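For example, with a small keyword list (the URLs below are purely illustrative):

keywords = ['coronavirus', 'covid', 'pandemic']
print(search_array('https://example.com/news/coronavirus-update', keywords))  # True
print(search_array('https://example.com/sports/scores', keywords))            # False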
Step 5: Extract the Headlines
First, an array of news channels is selected from Feedspot’s top 100 US and world news websites. That array is then iterated through using the extractWebsites function to find websites that pertain to COVID-19. “Interactive” is chosen as the exceptionURL because interactive pages do not provide relevant headlines. The remaining URLs are filtered with the search_array function against a set of COVID-19 keywords, and the matches are stored in a new array called covid_urls.
Using the extractWebsites function again, the covid_urls are iterated through to find more COVID-related websites. To reduce runtime, a counter stops the loop once too many websites have been processed, and if the procedure takes longer than ten minutes, the inner loops terminate. Once again, “interactive” is chosen as the exceptionURL, and the URLs are filtered with the search_array function and the same set of COVID-19 keywords; the matches are stored in a new array called new_covid_urls.
This process is repeated one more time to maximize the number of relevant headlines extracted, with the results stored in another array called newest_covid_urls. To store the relevant headlines, the URLs in new_covid_urls and newest_covid_urls are iterated through: each URL is fetched and parsed to determine its page title. After a check for duplicates in the headlines array, the title is appended to the array and the updated contents are written to the S3 object.
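The code below assumes that data already holds the array of news-site URLs taken from the Feedspot lists; a minimal placeholder, with purely illustrative entries rather than the full top-100 lists, might look like this:

# Illustrative subset of the Feedspot US and world news sources;
# the project uses the full lists with city-specific outlets removed.
data = [
    'https://www.nytimes.com',
    'https://www.reuters.com',
    'https://www.bbc.com/news',
    'https://www.theguardian.com/world',
]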
headlines = []
arr_keywords = ['coronavirus', 'COVID', 'covid', 'pandemic', 'epidemic', 'disease', 'SARS', 'sars', 'virus']

for index in data:
    # First pass: pull every link from the news site and keep the COVID-related ones.
    new_data = extractWebsites(index, 'interactive')
    covid_urls = []
    for index1 in new_data:
        if search_array(index1, arr_keywords):
            covid_urls.append(index1)

    count = 0
    for val in covid_urls:
        # Stop after a fixed number of sites to keep the runtime manageable.
        if count > 13:
            break
        count += 1
        # Give each site at most ten minutes of crawling.
        future = time.time() + 600

        # Second pass: follow the COVID-related links and filter again.
        newer_data = extractWebsites(val, "interactive")
        new_covid_urls = []
        for index1 in newer_data:
            if search_array(index1, arr_keywords):
                new_covid_urls.append(index1)

        for value in new_covid_urls:
            if time.time() > future:
                break
            # Third pass: one more level of links, filtered the same way.
            newest_data = extractWebsites(value, "interactive")
            newest_covid_urls = []
            for index2 in newest_data:
                if time.time() > future:
                    break
                if search_array(index2, arr_keywords):
                    newest_covid_urls.append(index2)

            # Extract the page titles of the third-pass URLs as headlines.
            for urls1 in newest_covid_urls:
                if time.time() > future:
                    break
                try:
                    webUrl = urllib.request.urlopen(urls1)
                    page = webUrl.read()
                    soup = bs(page, 'html.parser')
                    try:
                        title = soup.find('title').string
                        if (title not in headlines) and search_array(title, arr_keywords):
                            headlines.append(title)
                            content = "\n".join(headlines)
                            s3.Object('headlines', 'headline.txt').put(Body=content)
                    except TypeError:
                        print("Exception occurred")
                        continue
                except Exception:
                    continue

        # Extract the page titles of the second-pass URLs as headlines.
        for urls in new_covid_urls:
            if time.time() > future:
                break
            try:
                webUrl = urllib.request.urlopen(urls)
                page = webUrl.read()
                soup = bs(page, 'html.parser')
                try:
                    title = soup.find('title').string
                    if (title not in headlines) and search_array(title, arr_keywords):
                        headlines.append(title)
                        content = "\n".join(headlines)
                        s3.Object('headlines', 'headline.txt').put(Body=content)
                except Exception:
                    continue
            except Exception:
                continue

# Write the final set of headlines to the S3 object.
content = "\n".join(headlines)
s3.Object('headlines', 'headline.txt').put(Body=content)
Depending on whether this code runs on your local machine or an AWS SageMaker instance, the runtime can vary from a few hours to a few days. The code currently runs on an ml.c5.xlarge instance with 20 GB of EBS storage; note that running on SageMaker incurs an extra cost, unlike running on your local machine.
The applications of this code are broad, as headlines can be extracted for any range of topics. For the purposes of this project, only COVID-19 keywords were used, but the topic can easily be changed by altering the keyword list. The code can also be repurposed to fit the number of headlines you wish to obtain: alternate websites can be added to expand the data, or the number of keywords can be adjusted accordingly.
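For instance, a hypothetical keyword list for tracking influenza headlines instead of COVID-19 could simply be swapped in for arr_keywords in Step 5:

# Hypothetical keyword list for a different topic (influenza).
arr_keywords = ['influenza', 'flu', 'H1N1', 'h1n1', 'outbreak', 'epidemic', 'virus']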
Click this link for access to the GitHub repository, which contains a detailed explanation of the code: GitHub.