Turning News Into Data — Day 3
Aug 12 2022
Let’s make sense of this response object. Instead of printing the stories, I’ll write them out to a JSON file so I can take a look at it outside of my terminal window. I get the feeling this will be a pretty large file, but in any case it’s a good idea to start storing this data since my trial will end in 12 days.
What makes most sense is storing by IPTC_CODE
And now I ran into one of the annoying parts of Python on the web: dict objects are not directly translatable to JSON objects. Dicts can hold arbitrary Python objects as well as primitive data, which makes serializing them to JSON a pain
for cluster_id in cluster_ids:
    metadata = get_cluster_metadata(cluster_id)
    if metadata is not None:
        stories = get_top_stories(cluster_id)
        metadata["stories"] = stories
        with open(f'{IPTC_CODE}.json', 'w+') as f:
            json.dump(metadata, f)
    else:
        print("{} empty".format(cluster_id))
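To see the problem in isolation, here is a minimal sketch that does not touch the Aylien SDK at all. The Story class is a made-up stand-in for an SDK object like RepresentativeStory, and try_dump is a hypothetical helper I only use here to catch the error:

```python
import json
from datetime import datetime


class Story:
    """Hypothetical stand-in for an SDK model object such as RepresentativeStory."""

    def __init__(self, title):
        self.title = title


def try_dump(payload):
    """Return the JSON string, or the TypeError message if serialization fails."""
    try:
        return json.dumps(payload)
    except TypeError as exc:
        return f"TypeError: {exc}"


# Only primitives, lists and dicts: serializes fine
plain = {"id": 1, "tags": ["news", "sports"]}
# A custom object inside the dict: json does not know what to do with it
with_object = {"id": 1, "story": Story("Headline")}
# Same story for datetime, even though it is part of the standard library
with_datetime = {"id": 1, "time": datetime(2022, 8, 12)}

print(try_dump(plain))
print(try_dump(with_object))
print(try_dump(with_datetime))
```

Only the first dict makes it through; the other two fail with a TypeError, which is the same failure the loop above runs into.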
Now I need to convert that RepresentativeStory object into a string, which hopefully is possible. This will slow down my application, but since we’re only making one pass over the dict’s keys it will be an O(n) slowdown in the number of keys, which is ok I suppose
There’s also the possibility of just converting the entire dict to a string, but that’s O(m*n), so no thanks
I got past serializing this object with this:
from aylien_news_api import RepresentativeStory
...
for key, val in metadata.items():
    # Replace SDK objects with their string form so json can handle them
    if isinstance(val, RepresentativeStory):
        metadata[key] = val.to_str()
But now there are nested objects such as datetime to deal with. I wish the JS SDK was more functional…
There’s a quick fix to convert everything to string but I don’t like it
json.dumps(my_dictionary, indent=4, sort_keys=True, default=str)
Deserializing will be a huge pain, especially when moving this data over to a front end, which I most definitely will not be writing in Python
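A small sketch of why I don’t like it. The record dict and its field names are made up for illustration; the only real API here is json with default=str:

```python
import json
from datetime import datetime

# Made-up record, just for illustration
record = {"title": "Headline", "published_at": datetime(2022, 8, 12, 9, 30)}

# default=str is called for anything json cannot encode natively
encoded = json.dumps(record, default=str)
decoded = json.loads(encoded)

print(encoded)
# The datetime comes back as a plain string; its type is lost
print(type(decoded["published_at"]).__name__)
```

The write succeeds, but every non-primitive value is flattened to whatever str() gives, so whoever reads the file has to guess which strings used to be dates or objects.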
Anyway, I like this solution, where I extend the json encoder class (json.JSONEncoder) with a method of my own. This way, should more annoying objects need to be serialized, I can just add them to the extended class and use it
There’s also an added benefit here: no extra loop over the dict, since the encoder only calls my method for objects it can’t serialize on its own
That worked. So I made a src/util/json_serializer.py file with:
#!/usr/bin/env python
import json
from datetime import datetime

from aylien_news_api import RepresentativeStory


class CustomSerializer(json.JSONEncoder):
    """
    Extends json encoder to serialize custom aylien api objects, as well as datetime and
    other pesky non-dict objects
    """
    def default(self, obj):
        if isinstance(obj, RepresentativeStory):
            return obj.to_dict()
        elif isinstance(obj, datetime):
            return obj.isoformat()
        # Let the base class raise TypeError for anything still unsupported,
        # instead of silently writing null
        return super().default(obj)
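To try the idea without hitting the API, here is a self-contained sketch of the same pattern. FakeStory is a hypothetical stand-in for RepresentativeStory (I only assume it has a to_dict() method), and SketchSerializer is a throwaway name, not the class in my repo:

```python
import json
from datetime import datetime


class FakeStory:
    """Hypothetical stand-in for RepresentativeStory; only to_dict() is assumed."""

    def __init__(self, title, published_at):
        self.title = title
        self.published_at = published_at

    def to_dict(self):
        return {"title": self.title, "published_at": self.published_at}


class SketchSerializer(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, FakeStory):
            return obj.to_dict()
        if isinstance(obj, datetime):
            return obj.isoformat()
        # Anything else still raises TypeError via the base class
        return super().default(obj)


metadata = {"stories": [FakeStory("Headline", datetime(2022, 8, 12, 9, 30))]}

# The dict returned by to_dict() contains a datetime, so default() runs again for it
encoded = json.dumps(metadata, cls=SketchSerializer)
decoded = json.loads(encoded)

# ISO 8601 strings can be turned back into datetime objects when reading
restored = datetime.fromisoformat(decoded["stories"][0]["published_at"])

print(encoded)
print(restored)
```

Using isoformat() instead of str() means the date is written in a standard format that datetime.fromisoformat (Python 3.7+) and most front ends can parse back.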
And the loop in app.py becomes
...
for cluster_id in cluster_ids:
    metadata = get_cluster_metadata(cluster_id)
    if metadata is not None:
        stories = get_top_stories(cluster_id)
        metadata["stories"] = stories
        with open(f'data/{IPTC_CODE}.json', 'w+') as f:
            json.dump(metadata, f, cls=CustomSerializer)
    else:
        print("{} empty".format(cluster_id))
The file gets written now, but there is a new problem: I got rate limited. All good though, I’ll fix up the GitLab CI a bit now
Done. Later on I’ll make a Dockerfile and store the images separately when there’s a clearer vision. Right now it’s just quick & dirty
stages:
  - test aylien api

test_api:
  image: python:3.9-alpine
  stage: test aylien api
  before_script:
    - pip install -r requirements.txt
  script:
    - ls
    - python src/app.py