JQ, Grep for JSON: Cookbook I

This tries to fill in the gaps in the jq documentation for intermediate to advanced users using some recipes. I literally went through the reference and picked up stuff that was new to me.

Debanjan Basu
Data Science Deep Dive
5 min read · Mar 12, 2022


JQ (installation, getting started tutorial) is a great tool for filtering, transforming and aggregating data in JSON form. There is also a related project, YQ, that reads and writes JSON, XML and YAML.

The getting started tutorial is a prerequisite for working through this article. I have identified some recipes, and variations thereof, where I needed to delve deeper into the tool and its DSL than the tutorial does.

Getting started

While I would still encourage new users to glance at the tutorial, I will summarize the relevant parts of the jq DSL as needed before each recipe. Let’s download some nested JSON data and then invoke JQ with jq '.' <filename.json>

Whoa that is a lot of data! Can we paginate this?

Recipe: Paginate preserving color

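When one naively pipes the output into a pager with jq '.' donut.json | less, the result is not colored anymore, because jq turns color off when its output is not a terminal.

But with the -C option, jq -C '.' donut.json | less -R, you get the colored output back. (The -R flag tells less to render the color codes instead of printing them as raw escape sequences.)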

The part within quotes, like '.', is called a filter expression. The simplest filter is '.', the identity: it passes its input through unchanged. By default jq pretty-prints the result, with color when writing to a terminal (use -c for compact output and -M to turn color off; -r prints string results raw, without JSON quotes).
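As a rough sketch of how these output flags behave, the snippet below runs the identity filter over a tiny inline document. The sample object is a made-up stand-in for donut.json, not the real file.

```shell
# Hypothetical sample standing in for donut.json.
sample='{"name":"Cake","type":"donut"}'

# -c prints each result on a single line.
compact=$(printf '%s' "$sample" | jq -c '.')
echo "$compact"   # {"name":"Cake","type":"donut"}

# -r prints string results without the surrounding JSON quotes.
raw=$(printf '%s' "$sample" | jq -r '.name')
echo "$raw"       # Cake

# -C forces ANSI color codes even when stdout is a pipe; -M turns them off.
colored=$(printf '%s' "$sample" | jq -C '.')
plain=$(printf '%s' "$sample" | jq -M '.')

# Interactive use (not run here): jq -C '.' donut.json | less -R
```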

Recipe: String formatting

Say we want the output to just spell out the name of each product in the bakery, i.e., {"Product": "Cake donut"}

JQ offers string interpolation in the style of Python f-strings, and the interpolated parts can be arbitrarily complex jq filters: "\(.expression1) \(.expression2)". Also note that we refer to the value of a key as .key and iterate (broadcast) over the elements of an array with .[].
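A minimal sketch of this recipe, assuming the data is a top-level array of objects with name and type keys (the two-item sample below is invented for illustration):

```shell
# Hypothetical sample: a top-level array of products.
donuts='[{"name":"Cake","type":"donut"},{"name":"Raised","type":"donut"}]'

# .[] iterates over the array; "\(...)" interpolates filters into a string.
products=$(printf '%s' "$donuts" | jq -c '.[] | {Product: "\(.name) \(.type)"}')
echo "$products"
# {"Product":"Cake donut"}
# {"Product":"Raised donut"}
```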

Recipe: Flatten nested data

flatten builtin for nested arrays
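For nested arrays, the builtin can be sketched like this; the input array is an arbitrary example, and the optional argument limits how many levels are flattened.

```shell
# flatten with no argument removes every level of nesting.
all=$(jq -c -n '[1, [2, [3, 4]]] | flatten')
echo "$all"   # [1,2,3,4]

# flatten(1) removes only one level.
one=$(jq -c -n '[1, [2, [3, 4]]] | flatten(1)')
echo "$one"   # [1,2,[3,4]]
```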

While there is a builtin jq filter 'flatten' for arrays, there is nothing like that for nested JSON. But, as long as the schema is consistent within the dataset, we can make do with something like this —

Flattening json data
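One possible one-liner along these lines is sketched below. The schema (a batters.batter array and a topping array, each holding objects with a type key) is an assumption about the data, and the sample is a shortened, made-up stand-in.

```shell
# Hypothetical one-item stand-in for donut.json.
donuts='[{"name":"Cake","ppu":0.55,"batters":{"batter":[{"type":"Regular"},{"type":"Chocolate"}]},"topping":[{"type":"None"},{"type":"Glazed"}]}]'

# Collapse each nested list into a single comma-separated string.
flat=$(printf '%s' "$donuts" | jq -c '.[] | {name, ppu, batters: ([.batters.batter[].type] | join(", ")), toppings: ([.topping[].type] | join(", "))}')
echo "$flat"
```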

Wait … that is a long one-liner! Is there a better way?

Of course … write the filter into a file cookbook-1.jq, use your favorite editor from the above SO answer, and invoke the file with jq -f cookbook-1.jq donut.json . This is how it looks —
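A sketch of the file-based workflow, run in a temporary directory. Both file contents are illustrative assumptions: the data is a one-item stand-in for donut.json, and the filter is one plausible version of cookbook-1.jq rather than the original.

```shell
# Work in a temporary directory so nothing is overwritten.
workdir=$(mktemp -d)

# Hypothetical stand-in for donut.json.
cat > "$workdir/donut.json" <<'EOF'
[
  {"name": "Cake", "type": "donut", "ppu": 0.55,
   "batters": {"batter": [{"type": "Regular"}, {"type": "Chocolate"}]},
   "topping": [{"type": "None"}, {"type": "Glazed"}]}
]
EOF

# In a file, the filter can span several lines and carry comments.
cat > "$workdir/cookbook-1.jq" <<'EOF'
# One output object per donut.
.[]
| {
    name,
    type,
    ppu,
    batters:  ([.batters.batter[].type] | join(", ")),
    toppings: ([.topping[].type] | join(", "))
  }
EOF

# -f reads the filter from the file instead of the command line.
flat=$(jq -c -f "$workdir/cookbook-1.jq" "$workdir/donut.json")
echo "$flat"
rm -r "$workdir"
```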

In the top-left window, the code for generating the output on the right is shown. There are some subtleties however —

  1. Broadcasting with [] : .[] emits every element of the array as a separate output, and the filter after the | operator runs once per element (an implicit loop, like broadcasting or a functional map). Only then can we access the data with expressions like .name or .type.
  2. Keeping a key-value pair unchanged: JQ lets one pass a key-value pair through unchanged by naming the key without the . (dot) inside the object construction. Both ppu and "ppu" would have worked.
  3. Inverse of broadcasting using [] : Since join accepts an array, not a JSON object, wrapping the expression in [ ] collects the values it produces into an array. This illustrates the declarative style of JQ, since the last filter makes it very clear what the output looks like.
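The three points above can be tried in isolation. This is a small sketch on an invented one-item sample, with each command isolating one of the subtleties:

```shell
# Hypothetical sample with one product and two toppings.
donuts='[{"name":"Cake","topping":[{"type":"None"},{"type":"Glazed"}]}]'

# 1. .[] emits each element on its own, so .name is applied to an object.
names=$(printf '%s' "$donuts" | jq -r '.[] | .name')
echo "$names"      # Cake

# 2. {name} is shorthand for {name: .name}.
short=$(printf '%s' "$donuts" | jq -c '.[] | {name}')
long=$(printf '%s' "$donuts" | jq -c '.[] | {name: .name}')
echo "$short"      # {"name":"Cake"}

# 3. [ ... ] collects the stream of topping types into one array for join.
toppings=$(printf '%s' "$donuts" | jq -r '.[] | [.topping[].type] | join(", ")')
echo "$toppings"   # None, Glazed
```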

Recipe: Really Flatten nested data

I cheated in the above example, since I merely gathered the nested structures into a string. What if I wanted to have as many rows as there are leaves in this dataset?

A simple example can be tackled here with two loops on the same level, producing a multiplicative effect —
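A minimal sketch of the multiplicative effect, assuming a made-up product with two batters and two toppings stored as plain string arrays: iterating both arrays inside one object construction yields one row per combination.

```shell
# Hypothetical product with two arrays at the same level.
donut='{"name":"Cake","batters":["Regular","Chocolate"],"toppings":["None","Glazed"]}'

# Two iterators in one object: 2 batters x 2 toppings = 4 rows.
rows=$(printf '%s' "$donut" | jq -c '{name, batter: .batters[], topping: .toppings[]}')
echo "$rows"
```

Each output line is one object with the keys name, batter and topping, so the four lines cover every batter and topping pair.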

What about more intense loops within loops?

Note that “c” is always broadcast over 0,1,2,3, while “d” and “e” are broadcast together over 4,5,6.
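One way to get a pattern like this is sketched below. It is my own illustration under an assumption, not the original snippet: the values 4,5,6 are bound to a variable ($x, a name chosen here), so that "d" and "e" share a single loop while "c" loops independently.

```shell
# $x is bound once per value of 4,5,6, so d and e always agree,
# while c independently runs over 0,1,2,3.
nested=$(jq -c -n '(4,5,6) as $x | {c: (0,1,2,3), d: $x, e: $x}')
echo "$nested"

# 3 values for d/e times 4 values for c gives 12 rows.
printf '%s\n' "$nested" | wc -l
```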

We will tackle a related case, where the arrays are at different levels, in the next installment, but for arrays at the same level this works!

Recipe: Get a CSV output off the flattened list

getting a csv from a list of json objects

Using the function called tocsv (source), it is possible to define variables for the columns (i.e., … as $cols) and the rows. The @csv operation can only handle arrays, which is why the columns and the rows are each provided as arrays. Also note the [$array1[], $array2[]] syntax, which works like [*args1, *args2] in Python and is arguably more readable than $array1 + $array2, since it makes it obvious that the output is an array. (Without the trailing [], [$array1, $array2] would instead build an array of two arrays.)
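A sketch of such a helper follows. It is a commonly seen variant written under the assumption that the input is an array of flat objects, and it is not necessarily identical to the linked source; the two sample rows are invented.

```shell
# Hypothetical flattened rows.
rows='[{"name":"Cake","batter":"Regular"},{"name":"Cake","batter":"Chocolate"}]'

csv=$(printf '%s' "$rows" | jq -r '
  def tocsv:
    (map(keys) | add | unique) as $cols                 # header: sorted union of all keys
    | map(. as $row | $cols | map($row[.])) as $rows    # one array of values per object
    | $cols, $rows[]                                    # emit the header, then each row
    | @csv;                                             # format every array as a CSV line
  tocsv
')
echo "$csv"
# "batter","name"
# "Regular","Cake"
# "Chocolate","Cake"
```

The columns come out in sorted key order because keys and unique both sort; a key missing from a row would show up as an empty field.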

Thanks for reading this far and I hope you have a great time using these tools. I want to introduce some obscure builtins and those with confusing documentation in the next installment.

Debanjan Basu
Data Science Deep Dive

Ex-Physicist. Data scientist. Python developer. Dad in waiting.