Turning News Into Data — Day 3

3 min read · Aug 13, 2022

Aug 12 2022

Let’s make sense of this response object. Instead of printing the stories, I’ll write them out to a JSON file so I can take a look at it outside of my terminal window. I get the feeling this will be a pretty large file, but in any case it’s a good idea to start storing this data since my trial ends in 12 days

What makes most sense is storing by IPTC_CODE

And now I ran into one of the annoying parts of Python on the web: dict objects are not directly translatable to JSON objects. Dicts can hold arbitrary Python objects as well as primitive data, which makes serializing them to JSON a pain
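
To see the problem in isolation, here is a minimal sketch using only the standard library. The payload and its `published_at` key are made up for illustration; they stand in for whatever non-primitive values the API response holds.

```python
import json
from datetime import datetime

# Hypothetical payload: the string serializes fine, the datetime does not
payload = {"title": "Some story", "published_at": datetime(2022, 8, 12, 9, 30)}

try:
    json.dumps(payload)
    error = None
except TypeError as exc:
    # Out of the box json only handles dict, list, tuple, str, int, float, bool and None
    error = str(exc)

print(error)
```

Any value outside that short list of types triggers the same `TypeError`, which is what happens with the SDK objects here.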

for cluster_id in cluster_ids:
    metadata = get_cluster_metadata(cluster_id)
    if metadata is not None:
        stories = get_top_stories(cluster_id)
        metadata["stories"] = stories
        with open(f'{IPTC_CODE}.json', 'w+') as f:
            json.dump(metadata, f)
    else:
        print("{} empty".format(cluster_id))

Now I need to convert that RepresentativeStory object into a string, which hopefully is possible. This will slow down my application, but it only takes one pass over the dict’s keys, so the extra cost is O(n) in the number of keys, which is ok I suppose

There’s also the possibility of just converting the entire dict to a string but that’s O(m*n) so no thanks

I got over serializing this object with this


from aylien_news_api import RepresentativeStory
...
# Swap any SDK story object for its string form so json can handle it
for key, val in metadata.items():
    if isinstance(val, RepresentativeStory):
        metadata[key] = val.to_str()

But now there are nested objects such as datetime to deal with. I wish the JS SDK was more functional…

There’s a quick fix to convert everything to string but I don’t like it

json.dumps(my_dictionary, indent=4, sort_keys=True, default=str)

Deserializing will be a huge pain, especially if moving this data across a front end, which I most definitely will not be writing in Python
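
A small sketch of why that is, using only the standard library and a made-up `published_at` key: after a round trip through `default=str`, the value comes back as a plain string and the reader has to know which keys to parse.

```python
import json
from datetime import datetime

original = {"published_at": datetime(2022, 8, 12, 9, 30)}

# default=str is called for every value json cannot encode natively
encoded = json.dumps(original, default=str)

decoded = json.loads(encoded)
# The type information is gone: the value is now just a str
came_back_as = type(decoded["published_at"]).__name__

# Whoever reads the file has to know this key holds a date and parse it by hand
restored = datetime.fromisoformat(decoded["published_at"])

print(encoded, came_back_as, restored)
```

For SDK objects it is worse, since `str()` gives whatever representation the class happens to define, which may not be parseable at all.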

Anyway, I like this solution, where I extend the JSON encoder class with a method of my own. This way, should more annoying objects need to be serialized, I can just add them to the extended class and use it

There’s also an added benefit here: no extra loop over the dict before dumping, so a bit of time gets saved

That worked. So I made a src/util/json_serializer.py file with:

#!/usr/bin/env python
import json
from datetime import datetime

from aylien_news_api import RepresentativeStory


class CustomSerializer(json.JSONEncoder):
    """
    Extends json encoder to serialize custom aylien api objects, as well as datetime and
    other pesky non-dict objects
    """
    def default(self, obj):
        if isinstance(obj, RepresentativeStory):
            return obj.to_dict()
        elif isinstance(obj, datetime):
            return obj.isoformat()
        # Let the base class raise TypeError for anything not handled above
        return super().default(obj)
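
To try the encoder pattern without the SDK installed, here is a self-contained sketch. `FakeStory` and `SketchSerializer` are hypothetical names invented for this example; `FakeStory` only assumes the real object exposes a `to_dict()` method, as used above.

```python
import json
from datetime import datetime


class FakeStory:
    """Hypothetical stand-in for an SDK object that exposes to_dict()."""

    def __init__(self, title, published_at):
        self.title = title
        self.published_at = published_at

    def to_dict(self):
        return {"title": self.title, "published_at": self.published_at}


class SketchSerializer(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, FakeStory):
            # The returned dict is encoded again, so the nested datetime is handled too
            return obj.to_dict()
        if isinstance(obj, datetime):
            return obj.isoformat()
        # Unknown types raise TypeError instead of being written as null
        return super().default(obj)


metadata = {"stories": [FakeStory("Some story", datetime(2022, 8, 12, 9, 30))]}
encoded = json.dumps(metadata, cls=SketchSerializer)
print(encoded)
```

The `default` hook is only called for values the encoder cannot handle itself, so plain strings, numbers and dicts never pass through it.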

And the loop in app.py becomes

...
for cluster_id in cluster_ids:
    metadata = get_cluster_metadata(cluster_id)
    if metadata is not None:
        stories = get_top_stories(cluster_id)
        metadata["stories"] = stories
        
        with open(f'data/{IPTC_CODE}.json', 'w+') as f:
            json.dump(metadata, f, cls=CustomSerializer)
    else:
        print("{} empty".format(cluster_id))

The file gets written now, but there is a new problem:

I got rate limited. All good though, I’ll fix up the GitLab CI a bit now

Done. Later on I’ll make a Dockerfile and store the images separately when there’s a clearer vision. Right now it’s just quick & dirty

stages:
  - test aylien api

test_api:
  image: python:3.9-alpine
  stage: test aylien api
  before_script:
    - pip install -r requirements.txt
  script:
    - ls
    - python src/app.py

Written by Norman Benbrahim