Turning News Into Data — Day 2

Norman Benbrahim
2 min read · Aug 11, 2022

Aug 10 2022

Let’s play with the API a little more to learn about the SDK and see if it is stable enough to start using in an application. Say I want to search for articles related to oil.

There are quite a number of ways to gather data. I will try my hand with clusters. Specifically, you can gather clusters of stories related to a topic (or a trend), for real-time monitoring of events about a specific topic or entity.

I am noticing that the docs use the Python SDK for their examples, so that may explain why the JS one wasn’t very stable. They probably code things mainly in Python.

After fiddling with the code in Node for a bit, I’ve decided it just isn’t worth doing this with JavaScript. It’s like stabbing in the dark without proper documentation, guessing the parameter names for functions and whatnot. Switching to Python.

And we have liftoff. The docs are badly written in general; I needed to fiddle with the code to get it working. Notice that it searches topics by something called an IPTC subject code, which is a code that specifies the topic of a news article. Oil has two codes, 04005003 and 04005004. One is upstream and one is downstream; no idea what that means yet. Full list here
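For what it’s worth, IPTC subject codes are eight digits read as three nested levels: the first two digits give the top-level subject, the first five the subject matter, and all eight the subject detail. Here is a minimal sketch of that split. `split_subject_code` is a hypothetical helper name of my own, not something from the Aylien SDK:

```python
def split_subject_code(code):
    """Split an 8-digit IPTC subject code into its three hierarchy levels.

    Hypothetical helper for illustration; not part of the Aylien SDK.
    """
    if len(code) != 8 or not code.isdigit():
        raise ValueError("expected an 8-digit subject code, got {!r}".format(code))
    return {
        "subject": code[:2] + "000000",       # top level: first 2 digits
        "subject_matter": code[:5] + "000",   # second level: first 5 digits
        "subject_detail": code,               # third level: all 8 digits
    }


for oil_code in ("04005003", "04005004"):
    print(split_subject_code(oil_code))
```

Both oil codes share their first five digits, so they sit under the same parent code, 04005000.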

src/app.py:

#!/usr/bin/env python
import os

if os.environ['ENVIRONMENT'] == 'dev':
    from dotenv import load_dotenv
    load_dotenv()

import time
import aylien_news_api
from aylien_news_api.rest import ApiException
from pprint import pprint

configuration = aylien_news_api.Configuration()
# Configure API key authorization: app_id
configuration.api_key['X-AYLIEN-NewsAPI-Application-ID'] = os.environ['AYLIEN_API_ID']
# configuration = aylien_news_api.Configuration()
# Configure API key authorization: app_key
configuration.api_key['X-AYLIEN-NewsAPI-Application-Key'] = os.environ['AYLIEN_API_KEY']
configuration.host = "https://api.aylien.com/news"
# Create an instance of the API class
api_instance = aylien_news_api.DefaultApi(aylien_news_api.ApiClient(configuration))


def get_cluster_from_trends():
    """
    Returns a list of up to 100 clusters that meet the parameters set out.
    """
    response = api_instance.list_trends(
        field='clusters',
        categories_taxonomy='iptc-subjectcode',
        categories_id=['04005003'],
        published_at_end='NOW-12HOURS'
    )
    return [item.value for item in response.trends]


def get_cluster_metadata(cluster_id):
    """
    Returns the representative story, number of stories, and time value for a given cluster
    """
    response = api_instance.list_clusters(
        id=[cluster_id]
    )
    clusters = response.clusters
    if clusters is None or len(clusters) == 0:
        return None
    first_cluster = clusters[0]
    return {
        "cluster": first_cluster.id,
        "representative_story": first_cluster.representative_story,
        "story_count": first_cluster.story_count,
        "time": first_cluster.time
    }


def get_top_stories(cluster_id):
    """
    Returns 3 stories associated with the cluster from the highest-ranking publishers
    """
    response = api_instance.list_stories(
        clusters=[cluster_id],
        sort_by="source.rankings.alexa.rank.US",
        per_page=3
    )
    return response.stories


clusters = {}
cluster_ids = get_cluster_from_trends()
for cluster_id in cluster_ids:
    metadata = get_cluster_metadata(cluster_id)
    if metadata is not None:
        stories = get_top_stories(cluster_id)
        metadata["stories"] = stories
        pprint(metadata)
    else:
        print("{} empty".format(cluster_id))
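The script above only queries 04005003. Since oil has a second code, one possible next step is to run the same trends query once per code and merge the resulting cluster IDs, in case a cluster shows up under both. A rough sketch, assuming a per-code fetch function exists (for example, a variant of get_cluster_from_trends that takes the subject code as an argument, which the script does not have yet). `collect_cluster_ids` and `fake_fetch` are hypothetical names, and the IDs are made up so this runs without API keys:

```python
def collect_cluster_ids(subject_codes, fetch_ids):
    """Gather cluster IDs for several subject codes, dropping duplicates.

    fetch_ids is any callable that takes one subject code and returns a
    list of cluster IDs. Passing it in keeps the sketch independent of the SDK.
    """
    seen = set()
    ordered = []
    for code in subject_codes:
        for cluster_id in fetch_ids(code):
            if cluster_id not in seen:
                seen.add(cluster_id)
                ordered.append(cluster_id)  # keep first-seen order
    return ordered


def fake_fetch(code):
    """Stand-in for a real API call; these cluster IDs are invented."""
    canned = {"04005003": [101, 102], "04005004": [102, 103]}
    return canned.get(code, [])


print(collect_cluster_ids(["04005003", "04005004"], fake_fetch))
```

Swapping `fake_fetch` for a real per-code query would leave the merging logic untouched.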

Will update the GitLab CI file and stuff tomorrow.
