News API — exploratory analysis

James van Doorn
INST414: Data Science Techniques
3 min read · Sep 17, 2023

Non-Obvious Insight

From my data, I want to find out which news sources are most relevant for finding climate news. This insight might inform which news sources I subscribe to in the future. There are countless websites to get news from, but we have limited time and attention, so this information could help with that decision. It could also help others decide which resources to use for climate information and news. News subscriptions can be quite costly, so before choosing one it would be good to know which service provides the greatest number of articles about a topic you are interested in.

The Data

The data that could answer this question can be found using the News API; it lives on the internet across multiple websites. The relevant fields include source names, article titles, and keywords, along with site traffic data where a site makes it available, and the number of times a given source appears. These fields matter because they indicate how well an article fits what I am searching for. Articles may also be easier to find in the News API if more people are accessing them, so their position in the results is relevant as well.

Data Collection

To collect the data, I used Python’s requests library, making a ‘get’ request to pull data from the News API.
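A minimal sketch of such a request, assuming the public News API `everything` endpoint and a `q=climate` query (the `"YOUR_API_KEY"` value is a placeholder, and the exact parameters I used may have differed):

```python
import requests

# Hypothetical sketch: parameters for a News API query about climate.
# "YOUR_API_KEY" is a placeholder -- substitute a real key before sending.
params = {"q": "climate", "pageSize": 100, "apiKey": "YOUR_API_KEY"}

# Build the 'get' request without sending it, to show the final URL:
prepared = requests.Request(
    "GET", "https://newsapi.org/v2/everything", params=params
).prepare()
print(prepared.url)

# Actually sending it is one call; .json() parses the JSON body:
# response = requests.get("https://newsapi.org/v2/everything", params=params)
# data = response.json()
```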

Data Cleaning and Bugs

Initially, this data was messy and needed to be organized and cleaned before any insights could be gathered from it. I also needed to work out which items in the output were useful and which were not. The data arrived in JSON format, which required some changes to make it readable. Someone unfamiliar with the News API might have difficulty parsing its output: there are many fields, and which ones are useful depends on your use case. Bugs or other issues can also appear if you pass a term in ‘category’, ‘query’, or ‘endpoint’ that the API can’t process. To clean the data, I examined the output and first identified the items within the JSON that seemed useful; in my case, these were the ‘title’ and ‘source’ keys. Then I used indexing to isolate these items and stored them as variables.
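The cleaning step can be sketched on a miniature sample of the API’s JSON output. The field names (`articles`, `title`, `source`/`name`) follow the News API response format; the article contents below are made up for illustration:

```python
# Tiny made-up sample shaped like a News API JSON response.
sample = {
    "status": "ok",
    "totalResults": 2,
    "articles": [
        {"source": {"id": None, "name": "Example Times"},
         "title": "Climate report released",
         "url": "https://example.com/a"},
        {"source": {"id": None, "name": "Sample Post"},
         "title": "Heatwave breaks records",
         "url": "https://example.com/b"},
    ],
}

# Keep only the useful fields: each article's title and source name.
titles = [article["title"] for article in sample["articles"]]
sources = [article["source"]["name"] for article in sample["articles"]]

print(titles)   # ['Climate report released', 'Heatwave breaks records']
print(sources)  # ['Example Times', 'Sample Post']
```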

When using a get request on the News API for articles on climate, how many times did each source appear in the output?
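The tally behind that question is a simple frequency count over the extracted source names; a sketch using `collections.Counter` (the source names here are illustrative, not my actual results):

```python
from collections import Counter

# Hypothetical list of source names extracted from the API output.
sources = ["Example Times", "Sample Post", "Example Times", "Daily Sample"]

# Count how many times each source appears, most frequent first.
counts = Counter(sources)
for source, n in counts.most_common():
    print(source, n)
# Example Times 2
# Sample Post 1
# Daily Sample 1
```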

Limitations and Bias

One limitation of this analysis is that only ~100 articles were gathered. While it can give some idea of which sources are most relevant for climate news, a more precise analysis would be needed to reach any firm conclusions. Additionally, this analysis only counted how many times each source appeared in the API output and excluded other measures, such as the appearance of keywords, figures, and other items that could help determine relevance.

The News API itself could have biases, depending on how it aggregates, scrapes, and otherwise collects news articles. Due to these biases, certain sources that post often about science and the climate may not have been included, which could have skewed the results.

My GitHub repository: jvand0/inst414_work, containing work for INST414 (Data Science Techniques) (github.com)
