Ivan Begtin
4 min read · Sep 15, 2022

For a long time, I have wanted to write about NoSQL data wrangling. Most of the open and corporate datasets I work with are JSON or NDJSON (JSON Lines) files.

Often, these datasets require complex data wrangling: cleaning, enriching, restructuring, and other types of data processing.

These files can be quite big, up to 50GB. For example, the OpenOwnership dataset is about 2GB compressed (22GB uncompressed).

In a “normal table” world, you have quite a lot of data-wrangling options:

  • DataFrames like pandas — a trendy way to manipulate flat data files such as CSV
  • OpenRefine — an open-source tool for manual data wrangling, derived from Google Refine
  • SQL database operations — load data into SQL and use SQL commands, or tools like dbt, to manipulate data inside the database

But JSON/NDJSON data isn't easy to process this way. Most data-wrangling approaches assume flat tables: they can't represent sub-documents, arrays, and other JSON specifics such as missing values and flexible schemas.
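To see the mismatch, consider a nested record and the usual workaround of flattening it into dotted column names. A minimal sketch in plain Python (the record and field names are illustrative):

```python
def flatten(record, prefix=""):
    """Recursively flatten nested dicts into dotted column names.
    Arrays are left as-is: a flat table has no natural place for them."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat

record = {
    "name": "ACME Ltd",
    "address": {"city": "London", "country": "GB"},
    "owners": [{"name": "J. Doe"}],  # the array survives unflattened
}
print(flatten(record))
# {'name': 'ACME Ltd', 'address.city': 'London', 'address.country': 'GB', 'owners': [{'name': 'J. Doe'}]}
```

The dict flattens cleanly, but the `owners` array does not: that is exactly the kind of structure flat-table tools struggle with.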

If I need to process big JSON Lines files, I have to write some code, for example Python code that reads the file line by line, converts each line to a dictionary object, and processes it as needed.

import json

with open(myfilename, 'r', encoding='utf-8') as infile:
    for line in infile:
        record = json.loads(line)
        # do some data processing stuff here

As you may notice, this way of processing data is far from ideal. Even with parallel processing and multithreading, it may not be fast enough, because data wrangling is rarely a single run that applies all changes at once. Most commonly you do it step by step, and sometimes you need to get back to a previous stage. OpenRefine supports this, but sometimes it's unnecessary and you can apply changes immediately, using dataframes for example.
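One pragmatic way to keep that step-by-step, revertible workflow with plain files is to write each stage to its own NDJSON file, so earlier stages stay on disk and any step can be re-run. A minimal sketch (file names, data, and the transform are all illustrative):

```python
import json

# toy input: two NDJSON records (illustrative data)
with open("stage0.jsonl", "w", encoding="utf-8") as f:
    f.write('{"name": "acme ltd"}\n{"name": "globex"}\n')

def apply_step(in_path, out_path, transform):
    """Read NDJSON from in_path, apply transform to each record,
    and write the result to out_path. Each stage is a separate file,
    so you can always go back to a previous one."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            record = transform(json.loads(line))
            dst.write(json.dumps(record, ensure_ascii=False) + "\n")

def uppercase_name(rec):  # an example wrangling step
    rec["name"] = rec["name"].upper()
    return rec

apply_step("stage0.jsonl", "stage1.jsonl", uppercase_name)
```

This is crude compared to OpenRefine's undo history, but it gives the same safety net without loading the whole dataset into memory.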

Is it possible to do the same for non-flat datasets?

Could we use one of the document-oriented NoSQL databases to do this job?

Yes, but with many limitations. Let's take MongoDB, the most popular NoSQL document database.

MongoDB

MongoDB supports flexible data structures. Instead of tables, it uses the concept of collections with JSON objects.

It provides many benefits if you need data flexibility, but using it for data wrangling is much more complicated.

A task that is trivial with a typical SQL query, converting the values of a specific column to upper case, is effortless: UPDATE MyTable SET MyColumn = UPPER(MyColumn).
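For illustration, here is that one-statement transformation in SQLite, via Python's built-in sqlite3 module (the table and column names are made up):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE MyTable (MyColumn TEXT)")
conn.executemany("INSERT INTO MyTable VALUES (?)", [("alice",), ("bob",)])

# the whole column is transformed in one declarative statement
conn.execute("UPDATE MyTable SET MyColumn = UPPER(MyColumn)")

print([row[0] for row in conn.execute("SELECT MyColumn FROM MyTable")])
# ['ALICE', 'BOB']
```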

But to do the same in MongoDB, you have to iterate over all documents and apply code to each one. It looks like this:

db.MyTable.find(find_criteria).forEach(function(doc) {
    db.MyTable.updateOne(
        { _id: doc._id },
        { $set: { 'MyColumn': doc.MyColumn.toUpperCase() } }
    );
});

Or you could use a programming language like Python to iterate, read, and update each document externally, but that should be slower, since network and connection latency is added to execution time.

It's not just upper-case string transformation: the same applies to most string manipulation, data type conversion, and other common operations.

ArangoDB

Other document databases are quite similar. The next most popular is ArangoDB. The same string-manipulation task can be solved with less code, though less code doesn't necessarily mean better performance.

FOR u IN MyTable
    UPDATE u WITH {
        MyColumn: UPPER(u.MyColumn)
    } IN MyTable

This doesn't look like the best possible solution either. It would certainly be worth comparing the performance of the two approaches.

Apache Cassandra

MongoDB and ArangoDB are similar in their data schema flexibility. Other NoSQL databases, like Apache Cassandra, have strict schemas: they can be used for data wrangling only after the schema is identified, and the schema has to be updated during most wrangling operations.

Cassandra doesn't have upper-case and other string-manipulation functions built in, but it allows user-defined functions (UDFs), which can fill the gap.

Other tools

An exciting data-wrangling tool is glom, a Python library for querying Python dictionaries.

It supports things like querying nested structures; an example from the tool's documentation:

>>> target = {
...     'galaxy': {
...         'system': {
...             'planet': 'jupiter'
...         }
...     }
... }
>>> spec = 'galaxy.system.planet'
>>> glom(target, spec)
'jupiter'

And it supports data mutations too. Could it serve as the foundation of a pandas-dataframe alternative for NoSQL data? Maybe: glom offers mutation operations as well as queries, and it could probably handle big datasets.
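To picture the mutation side, here is a plain-Python sketch of a path-based setter in the same spirit as the dotted-path query above (this illustrates the idea only, it is not glom's actual API):

```python
def set_path(target, path, value):
    """Walk dotted path segments and set the final key,
    mirroring the 'galaxy.system.planet' addressing style."""
    *parents, last = path.split(".")
    node = target
    for key in parents:
        node = node[key]
    node[last] = value

target = {"galaxy": {"system": {"planet": "jupiter"}}}
set_path(target, "galaxy.system.planet", "saturn")
print(target["galaxy"]["system"]["planet"])  # saturn
```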

Final thoughts

For now, there is no easy-to-use data-wrangling solution for NoSQL documents. String manipulation is just one feature NoSQL databases lack that requires special attention. Others are:

  • adding columns based on other columns
  • data reconciliation
  • field rename
  • field removal
  • data validation
  • advanced data transformation

This is at least most of what OpenRefine lets you do manually through its UI. I don't think the UI is a critical element of data wrangling; it could be implemented as a headless solution too.
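Several of the per-record operations listed above reduce to small functions over dictionaries. A plain-Python sketch of a few of them (the field names and records are illustrative):

```python
def add_derived(rec, new_field, func):
    """Add a field computed from other fields."""
    rec[new_field] = func(rec)
    return rec

def rename_field(rec, old, new):
    """Rename a field, keeping its value."""
    rec[new] = rec.pop(old)
    return rec

def remove_field(rec, field):
    """Remove a field if present."""
    rec.pop(field, None)
    return rec

rec = {"name": "acme", "country": "GB"}
rec = add_derived(rec, "name_upper", lambda r: r["name"].upper())
rec = rename_field(rec, "country", "country_code")
rec = remove_field(rec, "name")
print(rec)  # {'name_upper': 'ACME', 'country_code': 'GB'}
```

The hard part a real tool would add is applying such steps at scale, with validation, reconciliation, and the ability to roll back.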

I keep looking for the best possible solution for processing JSON Lines data. If you know one, please share your thoughts.

Ivan Begtin

I am the founder of APICrafter. I write about Data Engineering, Open Data, the Modern Data Stack, and Open Government. Join my Telegram channel https://t.me/begtin