JSON Data Capture Using pymongo & pydash

DailyPriyab · Published in gopy · 3 min read · Sep 15, 2019

I rarely get a chance to work with a NoSQL data source. It so happened that for one of our business processes I had to extract data from MongoDB and analyze it. The requirement was to create a CSV of the information from the JSON I extracted from MongoDB. Since I mostly work in Python, the first choice for extracting data from MongoDB was pymongo.

While exploring the various wrappers and libraries available for MongoDB and Python, I found mongoengine and monary (there are a few others, which I mention in the appendix); these two caught my attention because I wanted to convert the JSON data into a pandas DataFrame. Monary's development seemed to have stalled in 2012, and mongoengine, an ODM (Object Document Mapper), seemed to be overkill: it is useful when developing a complete application that needs a data model, but it was not suitable for my one-time requirement.

Hence I went back to pymongo and started with the find() method to query documents based on search criteria. This covered my simple requirement of searching documents with a depth of two levels. For more dynamic searches, $regex can be used inside find(), which made searching with wildcards very easy. The find() method has really good search capabilities, and given that this was my first time using it, I have only scratched the surface and will explore more in the future.

from pymongo import MongoClient

# Connect to the local MongoDB instance and select the collection.
client = MongoClient('localhost', 27017)
db = client["filedb"]
collection = db["fileAttachments"]

# Match fileIds that start with "2019_08" using $regex on the nested field.
fid = "^2019_08"
query = {"attachements.fileId": {'$regex': fid}}
qd = collection.find(query)

data = []
for op in qd:
    print(op)
    data.append(op)

After extracting the data from MongoDB, I wrote it to a JSON file. For the second stage of my analysis pipeline, I had to read the JSON data back and filter the required fields into a CSV. Each record had multiple levels of nesting, and I had to extract data from inner dictionaries and lists, so I needed strong search capabilities as well as transformation and flattening. For queries, objectpath and jsonpath looked like great options for exploring a nested dictionary. But for the transformation part I needed a package similar to jQuery or underscore.js that could handle traversal and transformation of JSON or complex dictionaries with ease. And that is where I found pydash.
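The dump-and-reload step between the two stages is not shown above, so here is a minimal sketch of how it might look. The filename and the "rec" wrapper around each document are my assumptions, chosen so the structure lines up with the pydash code further below, and default=str is used because BSON ObjectIds are not JSON serializable.

import json

# Hypothetical filename and "rec" wrapper (assumptions, not from the
# original post). default=str converts ObjectIds and datetimes to
# strings so json.dump does not fail on them.
filename = "attachments_2019_08.json"
with open(filename, "w") as ofp:
    json.dump([{"rec": op} for op in data], ofp, default=str)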

Pydash is a Python implementation of the lodash.js library, with many useful functions for dictionary traversal, transformation and aggregation. I used its map_, flatten and flat_map functions to extract data, flatten lists and convert the data into the required format. In particular I liked flat_map, which iterates over a list and lets you apply further transformations to each item. Here too I only used a few specific functions, and I suggest exploring its aggregation and filtering functions as well.

import json, pydash

# Reload the JSON dump produced in the extraction stage.
with open(filename) as ifp:
    jdata = json.load(ifp)

# Pluck the "rec" entry from each record and flatten the nested lists.
val = pydash.map_(jdata, "rec")
val = pydash.flatten(val)

# Pull out the attachments, then map each one to the flat structure needed for the CSV.
val = pydash.map_(val, "attachments")
val = pydash.flat_map(val, lambda i: {"fileId": i["fileId"],
                                      "filename": i["attachment"][0]["fname"],
                                      "key": i["attachment"][0]["path"]})

Finally, I used pandas to write the data out to a CSV. Here again I could have used json_normalize to flatten the data into the required CSV format, but I wanted to explore options beyond pandas, and pydash can be used in many scenarios where pandas may not be an option.
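For completeness, a minimal sketch of that last step, assuming val is the flat list of dicts produced by the pydash pipeline and using a placeholder output filename:

import pandas as pd

# val comes from the pydash pipeline above; the filename is a placeholder.
df = pd.DataFrame(val)
df.to_csv("attachments_2019_08.csv", index=False)

# The json_normalize alternative mentioned above would flatten the raw
# records directly with pd.json_normalize, skipping the pydash steps.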

Note: This blog is more of a journal of my learning journey than a reflection of my expertise, so any constructive comments and feedback are welcome.

Appendix:
