Exploring Large Data Files with pxi

Philipp Wille
5 min read · Jan 27, 2020

Software developers, data engineers, data scientists, statisticians, project managers, and people of many other professions work with data sets every day. A recurring task shared by all these professions is to explore new data sets. The pxi (pixie) command-line processor supports them in this task.

$ npm i -g pxi                                            # synopsis

Data Science at the Command Line

As Jeroen Janssens told us in Data Science at the Command Line in 2014, the command-line is of immense help when getting to grips with new data sets. It comes with an abundance of flexible, yet powerful tools that can be combined and extended very freely. Writing pipelines of command-line programs should definitely be in your tool belt if you work with data.

Following the UNIX philosophy, today’s command-line programs are often built to handle streams of strings as a universal interface. This interface is deliberately low-level to further program composability, which is, in principle, a fantastic idea. On the downside, however, it often leads to "stringly-typed" interfaces. Let me elaborate on what I mean:

Stringly-typed Interfaces

Most data sets in data science are not plain, unstructured text. Rather, they are encoded in formats like JSON or CSV. Command-line tools for processing data, like awk, neither know nor respect these formats: they read them as strings, process fragments, and output new strings. Users may do their best to format the output correctly, but parsing and writing JSON and CSV is astonishingly complex, and awk does nothing to safeguard against errors. Correctly parsing CSV with awk, for example, is much more involved than this:

$ echo '1,"2,3",4' | awk -F, '{print $2}'

In this example, awk is used with a comma as the field separator. But since awk does not understand the concept of quoted fields in CSV, it splits its input string at every comma and outputs "2 instead of "2,3". There are of course many ways to cope with this problem in awk, but they all fight symptoms instead of the cause of the problem: awk’s stringly-typed interface.
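For comparison, here is how a format-aware tool can handle the same value. The snippet below is only a sketch: I add a made-up header row (a,b,c) so that the CSV parser has column names to work with, and the exact JSON output may look slightly different. The point is that the quoted field should come through as the single value 2,3 instead of being split at its inner comma:

$ printf 'a,b,c\n1,"2,3",4\n' | pxi --from csv --to json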

Format-Aware Data Processing

Don’t get me wrong: awk is a great command-line tool for text stream processing. It is just not built for transforming formatted data, and it does not pretend to be. In fact, GNU awk describes itself as follows:

If you are like many computer users, you would frequently like to make changes in various text files wherever certain patterns appear, or extract data from parts of certain lines while discarding the rest. […] The awk utility interprets a special-purpose programming language that makes it possible to handle simple data-reformatting jobs with just a few lines of code. (https://www.gnu.org/software/gawk/)

When working with JSON or CSV data sets, a different kind of processing tool is needed: one that is aware of data formats, ensures the correct structure of the data it processes, and fails with meaningful errors if the data is malformed. pxi is exactly that kind of tool:

$ pxi --from json --to csv '({time}) => [time * 1000]' < 2019.jsonl

Like other command-line tools, pxi reads and writes text streams. In contrast to awk, however, it is aware of its input’s data format, retains that format while processing the data, and writes its output according to a (possibly different) data format. The example reads JSON data, transforms it with a function, and writes the results back as CSV.
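Applied to the 2019.jsonl sample file introduced below, the command multiplies each Unix timestamp by 1000 to get milliseconds. Assuming pxi emits no header row when the function returns plain arrays, the first output lines would look roughly like this:

1546300800000
1546300801000
1546300802000
1546300803000
1546300804000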

At the same time, pxi is carefully engineered to be as fast as awk!
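If you want to check that claim on your own machine, a rough comparison is straightforward to set up. The awk invocation below is only an illustrative stand-in that extracts the timestamp by splitting on colons and commas; the numbers you get will of course depend on your hardware and input file:

$ time awk -F '[:,]' '{print $2}' < 2019.jsonl > /dev/null
$ time pxi 'json => json.time' < 2019.jsonl > /dev/null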

Processing Large Data Files with pxi

I will be straight with you: This post is to advertise pxi. I wrote it, and I am proud of it. So let me spend the rest of the post demonstrating what pxi does in a few examples.

Let us start with a large sample data file:

$ head -5 2019.jsonl                       # 2.6GB, 31,536,000 lines
{"time":1546300800,"year":2019,"month":1,"day":1,"seconds":0}
{"time":1546300801,"year":2019,"month":1,"day":1,"seconds":1}
{"time":1546300802,"year":2019,"month":1,"day":1,"seconds":2}
{"time":1546300803,"year":2019,"month":1,"day":1,"seconds":3}
{"time":1546300804,"year":2019,"month":1,"day":1,"seconds":4}

Select the time:

$ pxi 'json => json.time' < 2019.jsonl
1546300800
1546300801
1546300802
1546300803
1546300804

Select month and day:

$ pxi '({month, day}) => ({month, day})' < 2019.jsonl
{"month":1,"day":1}
{"month":1,"day":1}
{"month":1,"day":1}
{"month":1,"day":1}
{"month":1,"day":1}

Convert JSON to CSV:

$ pxi --from json --to csv < 2019.jsonl > 2019.csv
time,year,month,day,hours,minutes,seconds
1546300800,2019,1,1,0,0,0
1546300801,2019,1,1,0,0,1
1546300802,2019,1,1,0,0,2
1546300803,2019,1,1,0,0,3
1546300804,2019,1,1,0,0,4

Convert JSON to CSV, but keep only time and month:

$ pxi '({time, month}) => [time, month]' --to csv < 2019.jsonl
1546300800,1
1546300801,1
1546300802,1
1546300803,1
1546300804,1

Rename time to timestamp and convert CSV to TSV:

$ pxi '({time, ...rest}) => ({timestamp: time, ...rest})' \
--from csv \
--to tsv \
< 2019.csv
timestamp year month day hours minutes seconds
1546300800 2019 1 1 0 0 0
1546300801 2019 1 1 0 0 1
1546300802 2019 1 1 0 0 2
1546300803 2019 1 1 0 0 3
1546300804 2019 1 1 0 0 4

Convert CSV to JSON:

$ pxi --from csv --to json < 2019.csv
{"time":"1546300800","year":"2019","month":"1","day":"1",...
{"time":"1546300801","year":"2019","month":"1","day":"1",...
{"time":"1546300802","year":"2019","month":"1","day":"1",...
{"time":"1546300803","year":"2019","month":"1","day":"1",...
{"time":"1546300804","year":"2019","month":"1","day":"1",...

Use JavaScript libraries for processing, like Ramda (or Lodash):

$ pxi 'evolve({year: parseInt, month: parseInt, day: parseInt})' \
--from csv \
< 2019.csv
{"time":"1546300800","year":2019,"month":1,"day":1,...
{"time":"1546300801","year":2019,"month":1,"day":1,...
{"time":"1546300802","year":2019,"month":1,"day":1,...
{"time":"1546300803","year":2019,"month":1,"day":1,...
{"time":"1546300804","year":2019,"month":1,"day":1,...

Select only May the 4th:

$ pxi '({month, day}) => month == 5 && day == 4' \
--with filter \
< 2019.jsonl
{"time":1556928000,"year":2019,"month":5,"day":4,...
{"time":1556928001,"year":2019,"month":5,"day":4,...
{"time":1556928002,"year":2019,"month":5,"day":4,...
{"time":1556928003,"year":2019,"month":5,"day":4,...
{"time":1556928004,"year":2019,"month":5,"day":4,...

Keep only certain keys and pretty-print JSON:

$ pxi --keep '["time"]' --spaces 2 < 2019.jsonl > pretty.jsonl
{
  "time": 1546300800
}
{
  "time": 1546300801
}
{
  "time": 1546300802
}
{
  "time": 1546300803
}
{
  "time": 1546300804
}

Deserialize JSON that is not given line by line:

$ pxi --by jsonObj < pretty.jsonl
{"time":1546300800}
{"time":1546300801}
{"time":1546300802}
{"time":1546300803}
{"time":1546300804}

Process data streams from REST APIs and other sources, and pipe pixie’s output to other commands:

$ curl -s 'https://swapi.co/api/films/' |
pxi 'json => json.results' \
--with flatMap \
--keep '["episode_id", "title"]' |
sort
{"episode_id":1,"title":"The Phantom Menace"}
{"episode_id":2,"title":"Attack of the Clones"}
{"episode_id":3,"title":"Revenge of the Sith"}
{"episode_id":4,"title":"A New Hope"}
{"episode_id":5,"title":"The Empire Strikes Back"}
{"episode_id":6,"title":"Return of the Jedi"}
{"episode_id":7,"title":"The Force Awakens"}

Select file size and filename from ls:

$ ls -hl / | pxi '([,,,,size,,,,file]) => ({size, file})' --from ssv
{"size":"1.2K","file":"bin"}
{"size":"4.4K","file":"dev"}
{"size":"11B","file":"etc"}
{"size":"25B","file":"home"}
{"size":"64B","file":"opt"}
{"size":"192B","file":"private"}
{"size":"2.0K","file":"sbin"}
{"size":"11B","file":"tmp"}
{"size":"352B","file":"usr"}
{"size":"11B","file":"var"}

Conclusions

The command-line has many great tools that make the life of data scientists easier. However, processing data with tools like awk or sed can be complex because of their stringly-typed nature. pxi is a great alternative that helps you get things done without having to take care of data formats yourself. It has a focus on performance and is versatile and extensible. You should give it a try:

$ npm i -g pxi
