Web Scraping The New York Times Articles with Python : Part I

Sanela I.
Published in CodeX · 5 min read · Apr 5, 2022

In the digital era, you can find tons of public information on the internet to use to your benefit. You could retrieve it by manually copying and pasting data into a new document, but that is very time-consuming; web scraping is the ideal way to extract large amounts of data in a relatively short time.


Web scraping is the automated process of obtaining data from a website using software that replicates human web navigation, either by issuing HTTP requests directly or by embedding a browser in an application.

Web scraping is widely utilized and has a variety of applications:

· Marketing and sales firms can obtain lead-related information through web scraping, targeting their audience or keeping track of industry trends.

· Web scraping is helpful for real estate companies looking for information on new projects, resale properties, etc.

· Price comparison companies, such as Trivago, rely heavily on web scraping to gather product and price information from various e-commerce sites to deliver the best hotel prices for their customers.

· Web scraping Twitter data is commonly used to generate insightful information around elections.

· Job seekers can automatically scrape job postings from various websites and aggregate them into one data source.

Isn’t It Amazing?

This post is the first part of a three-part series. I'll walk you through how to collect articles from The New York Times, preprocess them (removing punctuation and stop words, and applying tokenization, lemmatization, and collocation), and, in the end, perform topic discovery using LDA and sentiment analysis using TextBlob.

In this article, I will provide instructions on how you can web scrape The New York Times articles. To be able to use their API, you will need to create a developer account following the instructions below:

https://developer.nytimes.com/get-started

When you sign up for an account, you must register an application and choose which APIs to activate. Options include the Article Search API, Top Stories API, and Archive API, as well as the Movie Reviews and Books APIs; you can enable more than one. Once you create the application, you will receive an API key for interacting with the APIs you selected.

To understand how you can request Article Search API, you can check the overview:

https://developer.nytimes.com/docs/articlesearch-product/1/overview

Let’s start!

import requests as req
import time

API_KEY = 'YOUR_API_KEY'  # your API key
TOPIC = 'Technology'      # search keyword

In the Article Search overview, you can find a list of available topics.
Some of them include Art, Business, Education, Science, Fashion, Health, Politics, and more.

Example URL call:

https://api.nytimes.com/svc/search/v2/articlesearch.json?q=technology&api-key=yourkey
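The same URL can be built in Python. Here is a minimal sketch using the standard library's urlencode; the build_search_url helper is my own convenience function, not part of the API:

```python
from urllib.parse import urlencode

BASE = 'https://api.nytimes.com/svc/search/v2/articlesearch.json'

def build_search_url(topic, api_key, page=0):
    # Encode query parameters instead of concatenating strings by hand,
    # so spaces and special characters are escaped correctly.
    params = {'q': topic, 'api-key': api_key, 'page': page}
    return BASE + '?' + urlencode(params)

print(build_search_url('technology', 'yourkey'))
# https://api.nytimes.com/svc/search/v2/articlesearch.json?q=technology&api-key=yourkey&page=0
```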

If you want to search more than one news desk, you can use the news_desk filter:
fq=news_desk:("Technology" "Science") returns Technology and Science articles.
In addition, you can restrict your articles using fields such as pub_date, pub_year, source, type_of_material, and day_of_week.
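Such a filter is attached as the fq parameter of the request URL. A sketch with illustrative filter values (the specific desks and year are examples, not requirements):

```python
from urllib.parse import urlencode

# Filter query combining two news desks and a publication year (illustrative values).
params = {
    'q': 'technology',
    'fq': 'news_desk:("Technology" "Science") AND pub_year:2022',
    'api-key': 'yourkey',
}
url = 'https://api.nytimes.com/svc/search/v2/articlesearch.json?' + urlencode(params)
print(url)  # quotes and parentheses in fq are percent-encoded automatically
```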

The Article Search API returns a maximum of 10 results at a time. Keep in mind that there is a limit on the requests you can make per API: 10 requests per minute and 4,000 requests per day.
Here we are collecting 1,200 articles (120 pages of 10) and will use time.sleep() to avoid hitting the API's per-minute request limit. The sleep function suspends execution for the given number of seconds.

docs = []  # accumulate results from every page
for i in range(120):
    url = 'https://api.nytimes.com/svc/search/v2/articlesearch.json?q=' + TOPIC + '&api-key=' + API_KEY + '&page=' + str(i)
    response = req.get(url).json()
    docs.extend(response['response']['docs'])
    time.sleep(6)  # stay under the 10 requests/minute limit

Each response is in the form of JSON and looks like this:

{'status': 'OK',
'copyright': 'Copyright (c) 2022 The New York Times Company. All Rights Reserved.',
'response': {'docs': [{'abstract': 'Concerns that popular social media platforms can expose children to posts that are sexualized, hurt their body image or are violent have escalated in recent years.',
'web_url': 'https://www.nytimes.com/2022/03/29/technology/snapchat-tiktok-parental-control.html',
'snippet': 'Concerns that popular social media platforms can expose children to posts that are sexualized, hurt their body image or are violent have escalated in recent years.',
'lead_paragraph': 'A group of attorneys general on Tuesday asked Snap and TikTok to work more closely with parental control apps and to apply more scrutiny to inappropriate content on their platforms, the latest salvo in a growing fight over child protection between governments and social media companies.',
'source': 'The New York Times',
'multimedia': [],
'headline': {'main': 'State attorneys general ask Snap and TikTok to give parents more control over apps.',
'kicker': None,
'content_kicker': None
...

Taking a closer look at the response, we are going to extract the abstract, headline, and lead_paragraph fields and create a list of articles.

# Extract the necessary fields from each document.
articles = []
for doc in docs:
    filteredDoc = {}
    filteredDoc['title'] = doc['headline']['main']
    filteredDoc['abstract'] = doc['abstract']
    filteredDoc['paragraph'] = doc['lead_paragraph']
    articles.append(filteredDoc)
articles[:10]

Here we can see how the output looks:

[{'title': 'State attorneys general ask Snap and TikTok to give parents more control over apps.',
'abstract': 'Concerns that popular social media platforms can expose children to posts that are sexualized, hurt their body image or are violent have escalated in recent years.',
'paragraph': 'A group of attorneys general on Tuesday asked Snap and TikTok to work more closely with parental control apps and to apply more scrutiny to inappropriate content on their platforms, the latest salvo in a growing fight over child protection between governments and social media companies.'},
{'title': 'Ben McKenzie Would Like a Word With the Crypto Bros',
'abstract': 'The actor, best known for his starring role in “The O.C.,” has become an outspoken critic of a volatile market driven by speculation. Who’s listening?',
'paragraph': 'ROCKDALE, Texas — Ben McKenzie was driving his father’s silver Subaru through Texas farmland, talking in breathless bursts about money: who has it, who needs it, what makes it real or fake. He detailed the perils of cryptocurrency exchanges, the online brokers that sell Bitcoin and Ether to speculators, then delivered a glowing endorsement of “Capital in the Twenty-First Century,” a 700-page book by the economist Thomas Piketty about income inequality and the power of wealthy capitalists.'},
{'title': 'How Robots Can Assist Students With Disabilities',
'abstract': 'New tools use artificial intelligence to assist students with autism and dyslexia and address accessibility for those who are blind or deaf.',
'paragraph': 'This article is part of a limited series on how artificial intelligence has the potential to solve everyday problems.'},

Voilà! Now that we have collected articles using The New York Times API, we will create a Pandas DataFrame and save it as a CSV file for later use.

import pandas as pd
df = pd.DataFrame(data=articles)
df.to_csv('TechArticles.csv')
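To confirm the file was written correctly, you can load it back. A small round-trip check with sample data; note that passing index=False keeps pandas from writing the row index as an extra column:

```python
import pandas as pd

# Write a sample list of articles, then read the CSV back.
sample = [{'title': 'Example headline',
           'abstract': 'Example abstract',
           'paragraph': 'Example lead paragraph'}]
pd.DataFrame(data=sample).to_csv('SampleArticles.csv', index=False)

df2 = pd.read_csv('SampleArticles.csv')
print(df2.shape)          # (1, 3)
print(list(df2.columns))  # ['title', 'abstract', 'paragraph']
```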

In the next article, we will go through data preprocessing. We will remove stop words and apply tokenization, lemmatization, and collocation to prepare for topic discovery and sentiment analysis.

Conclusion

Web scraping is a valuable method for collecting data from websites automatically. It can track prices, generate leads, improve marketing techniques, and automate image extraction, among other things. However, you must tread cautiously when carrying it out, respecting a site's terms of service and rate limits, to avoid unethical practices.

References:

The New York Times Developer Network

Jupyter Notebook with detailed code can be found here:

https://github.com/sivosevic/NYTimesNLP/blob/main/NYTimesTechDataCollection.ipynb

If you found this article informative and engaging, be sure to follow me for notifications on future articles and tutorials.

You can check my other writings at: https://medium.com/@eellaaivo

Thanks for reading, and stay tuned for the next article!
