Turning News Into Data — Day 3

Norman Benbrahim
Aug 13, 2022

Aug 12 2022

Let’s make sense of this response object. Instead of printing the stories, I’ll write them out to a JSON file so I can look at them outside of my terminal window. I get the feeling this will be a pretty large file, but in any case it’s a good idea to start storing this data, since my trial ends in 12 days.

What makes the most sense is storing by IPTC_CODE, one file per code.

And now I ran into one of the annoying parts of Python on the web: dict objects are not directly translatable to JSON objects. Dicts can hold arbitrary Python objects as well as primitive data, which makes serializing them to JSON a pain.

for cluster_id in cluster_ids:
    metadata = get_cluster_metadata(cluster_id)
    if metadata is not None:
        stories = get_top_stories(cluster_id)
        metadata["stories"] = stories
        with open(f'{IPTC_CODE}.json', 'w+') as f:
            json.dump(metadata, f)
    else:
        print("{} empty".format(cluster_id))
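
For reference, the failure is easy to reproduce without the API. This is a minimal sketch, assuming a made-up Story class standing in for the SDK object (it is not part of aylien_news_api):

```python
import json
from datetime import datetime


class Story:
    """Hypothetical stand-in for an SDK model object such as RepresentativeStory."""

    def __init__(self, title):
        self.title = title


# Primitive values (str, int, float, bool, None, list, dict) serialize fine.
plain = {"id": 42, "tags": ["politics", "economy"]}
plain_json = json.dumps(plain)

# Arbitrary Python objects do not: the json module raises TypeError.
rich = {"id": 42, "story": Story("Example headline"), "seen": datetime(2022, 8, 12)}
try:
    json.dumps(rich)
    error = None
except TypeError as exc:
    error = str(exc)

print(plain_json)
print(error)
```

The second print shows the TypeError message that json produces for the first value it cannot encode.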

Now I need to convert that RepresentativeStory object into a string, which hopefully is possible. This will slow down my application a bit, but it’s only one pass over the dict’s keys with an O(1) type check per key, so O(n) in the number of keys, which is ok I suppose.

There’s also the possibility of just converting the entire dict to a string, but that touches every nested value (roughly O(m*n) for m keys with n nested values each), so no thanks.

I got past serializing this object with this:


from aylien_news_api import RepresentativeStory
...
for key, val in metadata.items():
    if isinstance(val, RepresentativeStory):
        metadata[key] = val.to_str()

But now there are nested objects such as datetime to deal with. I wish the JS SDK was more functional…

There’s a quick fix that converts everything to a string, but I don’t like it:

json.dumps(my_dictionary, indent=4, sort_keys=True, default=str)

Deserializing will be a huge pain, especially when moving this data over to a front end, which I most definitely will not be writing in Python.
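
To make the downside concrete, here is a small sketch of a round trip through default=str. The record and its field names are made up for illustration:

```python
import json
from datetime import datetime

# Hypothetical record; the keys are placeholders, not real API fields.
record = {"published_at": datetime(2022, 8, 12, 9, 30), "score": 0.8}

# default=str is called for every value json cannot encode natively.
text = json.dumps(record, default=str)

restored = json.loads(text)

# The datetime comes back as a plain string; the type information is gone.
print(type(restored["published_at"]).__name__)  # prints "str"
print(restored["published_at"])  # prints "2022-08-12 09:30:00"
```

Whoever reads the file has to know which strings used to be dates, and in which format, to turn them back into anything useful.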

Anyway, I like this solution, where I extend the JSON encoder class (json.JSONEncoder) with a method of my own. This way, should more annoying objects need to be serialized, I can just add them to the extended class and use it.

There’s also the added benefit of not looping over anything extra, so it saves a bit of time.

That worked. So I made a src/util/json_serializer.py file with:

#!/usr/bin/env python
import json
from datetime import datetime

from aylien_news_api import RepresentativeStory


class CustomSerializer(json.JSONEncoder):
    """
    Extends json encoder to serialize custom aylien api objects, as well as datetime and
    other pesky non-dict objects
    """

    def default(self, obj):
        if isinstance(obj, RepresentativeStory):
            return obj.to_dict()
        elif isinstance(obj, datetime):
            return obj.isoformat()
        # Unknown types raise TypeError instead of silently becoming null
        return super().default(obj)
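
The same idea can be tried without an Aylien key. The sketch below assumes a made-up FakeStory class in place of RepresentativeStory (only a to_dict() method is assumed) and a separate SketchSerializer name so it does not clash with the real file:

```python
import json
from datetime import datetime


class FakeStory:
    """Hypothetical stand-in for RepresentativeStory; only to_dict() is assumed."""

    def __init__(self, title, published_at):
        self.title = title
        self.published_at = published_at

    def to_dict(self):
        return {"title": self.title, "published_at": self.published_at}


class SketchSerializer(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, FakeStory):
            # The returned dict is encoded again, so the nested datetime
            # is passed back through default() on its own.
            return obj.to_dict()
        if isinstance(obj, datetime):
            return obj.isoformat()
        # Defer to the base class, which raises TypeError for unknown types.
        return super().default(obj)


metadata = {"id": 7, "stories": [FakeStory("Example", datetime(2022, 8, 12, 9, 30))]}
encoded = json.dumps(metadata, cls=SketchSerializer)

# ISO 8601 strings can be parsed back explicitly on the reading side.
decoded = json.loads(encoded)
published = datetime.fromisoformat(decoded["stories"][0]["published_at"])
print(encoded)
print(published.year)
```

Because isoformat() emits ISO 8601, a non-Python front end should be able to parse the date strings too, which is less clear-cut with the output of str().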

And the loop in app.py becomes:

...
for cluster_id in cluster_ids:
    metadata = get_cluster_metadata(cluster_id)
    if metadata is not None:
        stories = get_top_stories(cluster_id)
        metadata["stories"] = stories

        with open(f'data/{IPTC_CODE}.json', 'w+') as f:
            json.dump(metadata, f, cls=CustomSerializer)
    else:
        print("{} empty".format(cluster_id))

The file gets written now, but there is a new problem: I got rate limited. All good though, I’ll fix up the GitLab CI a bit now.
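
On the rate limiting: one common way to cope is to retry with exponential backoff. This is only a sketch under assumptions. RateLimited is a made-up exception standing in for whatever the SDK raises on HTTP 429, and call_with_backoff and fake_fetch are hypothetical helpers, not part of app.py:

```python
import time


class RateLimited(Exception):
    """Hypothetical error; stands in for whatever the SDK raises on HTTP 429."""


def call_with_backoff(fn, *args, retries=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(*args), waiting base_delay * 2**attempt after each rate-limit error."""
    for attempt in range(retries):
        try:
            return fn(*args)
        except RateLimited:
            if attempt == retries - 1:
                raise  # out of attempts, let the caller see the error
            sleep(base_delay * 2 ** attempt)


# Demo with a fake fetcher that is rate limited twice before succeeding.
calls = {"count": 0}


def fake_fetch(cluster_id):
    calls["count"] += 1
    if calls["count"] < 3:
        raise RateLimited()
    return {"id": cluster_id}


# Record the waits instead of actually sleeping, so the demo runs instantly.
waits = []
result = call_with_backoff(fake_fetch, 123, sleep=waits.append)
print(result, waits)  # prints {'id': 123} [1.0, 2.0]
```

In the real loop the wrapped call would be something like get_cluster_metadata, with the except clause pointed at the SDK's actual exception type.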

Done. Later on I’ll make a Dockerfile and store the images separately, once there’s a clearer vision. Right now it’s just quick & dirty:

stages:
  - test aylien api

test_api:
  image: python:3.9-alpine
  stage: test aylien api
  before_script:
    - pip install -r requirements.txt
  script:
    - ls
    - python src/app.py
