Using New York Times API and jq to collect news data

Dana Lindquist
5 min read · Feb 26, 2019

Are you looking for news data? The New York Times API (Application Programming Interface) is the place to go. The API can be accessed for free for non-commercial uses. In this article I will explain how to get this data.

Create an account and get an API key

The first thing to do is go to the NY Times Developer web site (developer.nytimes.com) and create an account. They have a great section called “Get Started” which will walk you through this process. Using their API is free for non-commercial use.

You will need an API key which they use to monitor usage levels. There is a limit to the number of requests you can perform in a day but I never ran into the limit.

As described in the “Get Started” section, create an App and choose which APIs you would like to access with that App. I’m going to describe how to access the Archive API. Once you choose CREATE for the App you will be given an API key.

Use requests in Python to pull the json files

An Archive API request will pull all documents from a given month. The years can be from 1851 to 2019 and the months from 1 to 12. In the example shown here the Archive is pulled for December (month 12) in 2018.
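Here is a minimal sketch of that request, assuming the requests library; `API_KEY` is a placeholder you replace with the key from your own App, and `archive_url`/`fetch_archive` are illustrative helper names, not part of the API:

```python
import requests

API_KEY = "your-api-key"  # placeholder: paste the key from your NY Times App

def archive_url(year, month):
    # The Archive endpoint takes the year (1851-2019) and month (1-12) in the path.
    return f"https://api.nytimes.com/svc/archive/v1/{year}/{month}.json"

def fetch_archive(year, month, api_key=API_KEY):
    """Pull one month of NY Times articles as parsed JSON."""
    response = requests.get(archive_url(year, month), params={"api-key": api_key})
    response.raise_for_status()  # surface HTTP errors such as an invalid key
    return response.json()

# December (month 12) of 2018, as in this article:
# json_data = fetch_archive(2018, 12)
```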

Here’s the head of the json file that results from this query.
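Abridged as a Python dict, the shape is roughly the following (the field names match the Archive response; the doc values here are illustrative stand-ins, and the hit count is the December 2018 figure discussed below):

```python
# Sketch of the top of the Archive response, not a verbatim dump.
json_data = {
    "copyright": "Copyright © 2019 The New York Times Company. All Rights Reserved.",
    "response": {
        "meta": {"hits": 6802},
        "docs": [
            {
                "snippet": "A sample snippet.",          # illustrative value
                "headline": {"main": "A sample headline"},
                "pub_date": "2018-12-01T00:05:12+0000",
                "news_desk": "Sports",
                # ...many more fields per document...
            },
        ],
    },
}
```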

Extract data from json files using jq

There is a lot of information in this json file and we only want a small portion of what is there.

pyjq is a great library for extracting data from json files. It’s an implementation of jq for Python. I used pyjq to slice and filter the json files. For more information check out the jq web site, stedolan.github.io/jq. The site jqplay.org lets you test your jq queries and is a great place to learn how to use jq.

Let’s use pyjq to pull the first line of the json_data file that was generated above. This line has the copyright information.

When we look at the variable copyright we see [‘Copyright © 2019 The New York Times Company. All Rights Reserved.’] which is what we expected from the head of our json file.

Now let’s extract the archive data. Under the response level in the json file is the docs level. There are several docs in this file. How many? We pull all the docs, pipe the result of our query to length, and take the first element (element 0) to get the count.

This gives us 6802 documents for December 2018. That’s a lot!

I was only interested in the snippet, headline, publication date and news desk for the documents. We will create the query as jq_query and then use that to filter our json file, json_data.

Let’s break this query apart.

  • Under each doc we want four items, which we have called the_snippet, the_headline, the_date and the_news_desk. Each item is followed by a colon (:) and its query.
  • snippet is right under docs in the json hierarchy so it is called as .snippet (the_snippet: .snippet).
  • The headline that I want is in main under headline so we need to add another level in our query (the_headline: .headline .main).
  • The publication date and news desk are under docs so just like snippet they are called as .pub_date (the_date: .pub_date) and .news_desk (the_news_desk: .news_desk).

The variable output is a Python list of ordered dictionaries. If we look at the first 3 items in this list we get:

This is what I wanted. Just the snippet, headline, date and news desk for each article in the json file.

Since the API only provides one month of data I needed to do some more work to pull several months at one time. I did this by creating a list of year/month pairs and then looping over the list to pull the Archive json files.
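One way to sketch that loop; `month_pairs` and `pull_archives` are my own illustrative helper names, and the six-second pause is an arbitrary buffer for the daily/per-minute limits mentioned above, not an official number:

```python
import time
import requests

API_KEY = "your-api-key"  # placeholder: your NY Times key

def month_pairs(start, end):
    """All (year, month) pairs from start through end, inclusive."""
    pairs, (y, m) = [], start
    while (y, m) <= end:
        pairs.append((y, m))
        y, m = (y + 1, 1) if m == 12 else (y, m + 1)
    return pairs

def pull_archives(start, end, pause=6):
    """Fetch several months of Archive docs into one flat list."""
    docs = []
    for year, month in month_pairs(start, end):
        url = f"https://api.nytimes.com/svc/archive/v1/{year}/{month}.json"
        docs.extend(requests.get(url, params={"api-key": API_KEY})
                    .json()["response"]["docs"])
        time.sleep(pause)  # wait between requests to stay inside the rate limit
    return docs

# docs = pull_archives((2018, 10), (2018, 12))  # October through December 2018
```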

From here I could perform an analysis on this data.

Other New York Times APIs

The above example was using the Archive API. The list of New York Times APIs includes

  • Archive
  • Article Search
  • Books (NY Times best seller lists and book reviews)
  • Community (user comments)
  • Geo (geographic linked data)
  • Most Popular
  • Movie Reviews
  • Semantic (people, places, organizations and locations)
  • Times Tags (NY Times controlled vocabulary)
  • Times Wire (real-time feed of newly published NY Times articles)
  • Top Stories (for specific sections such as home, arts, etc.)

Conclusion

I hope you enjoy having a straightforward way to get New York Times data. I used my data for a Natural Language Processing project looking at news trends from before Donald Trump became president through today. Leave a comment to share your own project using New York Times data.
