
Getting Straight to the Point with Scraping and Natural Language Processing

Yuli Vasiliev · The Startup · May 22, 2020


Nowadays, the internet has become the main source of information for most of us. When we need to learn about something or master something, we typically go online, using a web search engine like Google to obtain the necessary information. Reviewing the retrieved results, however, may take considerable time, since you have to look into each link to see whether the information it contains really suits your needs. You can significantly shorten your research time when you know exactly what you want to find and can narrow down your search accordingly.

The problem is, though, that sometimes it’s hard to spell out all your requirements for a search engine. For example, you may need only the latest information about a business entity, restricting the results to resources published, say, within the last week. To address this problem, you can conduct an advanced search. To accomplish this programmatically, you can take advantage of a web scraping API, which lets you specify the necessary search parameters from within a script.

The results of an advanced search conducted via a scraping API may still contain a lot of links that will not be helpful. To choose the most useful links automatically, you can apply NLP techniques to the snippet attached to each link, keeping only those links whose snippets contain certain types of phrases, such as expressions of monetary values, percentages, and so on.
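
To get a feel for how this works, here is a minimal sketch of the idea. The snippet text below is made up, and the example assumes the en_core_web_sm spaCy model is installed; it simply prints any monetary or percentage expressions spaCy finds in the snippet:

import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

# A made-up snippet, as it might appear in a search result
snippet = "Tesla stock jumped 8% on Tuesday, closing at $815 per share."

doc = nlp(snippet)
for ent in doc.ents:
    # MONEY and PERCENT are built-in spaCy entity labels
    if ent.label_ in ('MONEY', 'PERCENT'):
        print(ent.text, ent.label_)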

Getting the Most Relevant Articles with Scraping

Let’s start with how you can scrape Google search results. There are several Python libraries that allow you to conduct a web search programmatically. Some of them can be used for free, while others, which provide richer result sets, are paid alternatives. The code snippet below illustrates how you can conduct a web search from within your Python script using SerpApi (https://serpapi.com/):

from serpapi.google_search_results import GoogleSearchResults

# Set your SerpApi key and the search phrase
GoogleSearchResults.SERP_API_KEY = "your_serp_api_key_here"
phrase = 'Tesla stock'

# Run the search and retrieve the results as a Python dictionary
client = GoogleSearchResults({"q": phrase})
rslt = client.get_dict()

If you now print out the rslt dictionary, you’ll see that it contains a JSON document. Looking through it, you may notice that the organic_results list holds the retrieved links. Each link in the list is represented by a dictionary that includes the link URL, the date, and the snippet, among other fields:

{
  "organic_results": [
    {
      "position": 1,
      "title": "Judge rules Musk's 'Tesla stock too high imo' tweet …",
      "link": "https://thenextweb.com/hardfork/2020/05/20/elon-musk-tesla-shareholders-lawsuit-tweets-sec-settlement-on-hold/",
      "displayed_link": "thenextweb.com › Hard Fork",
      "thumbnail": null,
      "date": "21 hours ago",
      "snippet": "Remember when Tesla shares tanked by 10% moments after its CEO Elon Musk tweeted: “Tesla stock price is too high imo? Turns out it was …",
      "cached_page_link": "https://webcache.googleusercontent.com/search?q=cache:zERThwUN9PQJ:https://thenextweb.com/hardfork/2020/05/20/elon-musk-tesla-shareholders-lawsuit-tweets-sec-settlement-on-hold/+&cd=20&hl=en&ct=clnk&gl=us"
    },
    …
  ]
}

As mentioned, the snippets are of most interest when you want to use NLP to further narrow down your result set. For now, let’s just look at how you can get to each snippet:

for article in rslt['organic_results']:
    print(article['snippet'])
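
Since each dictionary in organic_results also includes the link and date fields shown in the JSON excerpt above, you could, for example, print a compact summary of each result (the date field may be missing for some results, hence the .get() call):

for article in rslt['organic_results']:
    print(article['link'], '|', article.get('date', 'n/a'), '|', article['snippet'])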

Another API that you might use to conduct an advanced search programmatically is News API (https://newsapi.org/). Below is a code snippet that gives you an idea of how this API works:

from newsapi import NewsApiClient
from datetime import date, timedelta

# Initialize the client with your News API key
newsapi = NewsApiClient(api_key='your_news_api_key_here')

# Search for English-language articles about the phrase
# published within the last week
phrase = 'Tesla stock'
my_date = date.today() - timedelta(days=7)
articles = newsapi.get_everything(q=phrase,
                                  from_param=my_date.isoformat(),
                                  language='en',
                                  sort_by='relevancy',
                                  page_size=100)
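
The get_everything() call returns the parsed JSON response as a dictionary, which, besides the list of articles, also contains status and totalResults fields, so you might, for example, check them before going any further:

# Make sure the request succeeded and see how many articles matched
if articles['status'] == 'ok':
    print('Articles found:', articles['totalResults'])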

The structure of a result set returned by News API differs from that of a SerpApi result set. You can obtain the description (snippet) of each article as follows:

for article in articles['articles']:
    print(article['description'])

Using NLP to Narrow Down Your Search Results

Perhaps the most interesting part is using NLP techniques to filter the links in your result set based on their descriptions (snippets), keeping only the most relevant ones. The following code illustrates how this concept might be implemented with the help of spaCy, a leading Python natural language processing library:

import spacy

phrase = 'Tesla stock'

# Load the English model and merge noun chunks into single tokens,
# so that a phrase like 'Tesla stock' can be matched as one token
nlp = spacy.load('en')
nlp.add_pipe(nlp.create_pipe('merge_noun_chunks'))

answers = []
for article in articles['articles']:
    flg = 1
    article_content = str(article['description'])
    doc = nlp(article_content)
    for sent in doc.sents:
        for token in sent:
            if phrase.lower() in token.text.lower():
                # The sentence mentions the search phrase; now check
                # whether it also contains a monetary value
                doc2 = nlp(sent.text)
                for ent in doc2.ents:
                    if ent.label_ == 'MONEY':
                        answers.append(sent.text.strip() + '| ' + article['publishedAt'] + '| ' + article['url'])
                        flg = 0
                        break
                break
        if flg == 0:
            break
print(answers)
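
Since percentages were mentioned earlier as another sign of a potentially relevant snippet, you could broaden the filter by replacing the MONEY check in the code above with a test for both entity labels, for example:

# Accept sentences that mention either a monetary value or a percentage
if ent.label_ in ('MONEY', 'PERCENT'):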

Where to See It Working

The idea discussed in this article has been implemented in the stocknewstip bot, which is available at https://t.me/stocknewstip_bot. The bot can find and bring you the latest information about a company’s stock, as well as other interesting information related to the company. All you need to do is type in the name of a company or asset, say, Apple, Google, Tesla, Gold, or Bitcoin.

The same results go to the stocknewstip channel available at https://t.me/stocknewstip. You can preview the channel without having a Telegram account at https://t.me/s/stocknewstip.
