Data Engineering With Rust And Apache Arrow DataFusion 1/4 — Introduction

MatthieuL
5 min read · Jul 24, 2022


Welcome to the introduction of my article series “Data Engineering With Rust And Apache Arrow DataFusion.” You can access the next part here.

I love playing with the Rust language to create simple and efficient command-line tools. A CLI program should be simple and composable, with a “good enough” set of features, in keeping with the Unix philosophy.

But simplicity is not the primary concern of classical data processing tools. Instead, these frameworks are complex, pull in many dependencies, and are designed around multiple network services. As a result, using them frequently involves a costly infrastructure.

This series of articles shows how to leverage the Apache Arrow ecosystem and the Apache Arrow DataFusion framework to build a simple data processing command-line tool. This framework is based on the Arrow format and enables quick and easy data transformation of CSV or Parquet files.

Learning Path and Objectives

My objective is to build a simple command-line tool with the following features:

  • First, read and process end-user command-line arguments.
  • Next, read a Parquet or CSV file using the Apache Arrow DataFusion crate.
  • Then, apply a simple transformation to a dataset.
  • Finally, write the results as a Parquet or CSV file.

A simple app to load, transform and write CSV or Parquet files — Image by Author

I describe the implementation of these features in Rust in a three-part series of articles.

Note: the code sections of this series use the syntax of Literate Programming [1]: each double angle bracket block (e.g. <<code-block>>) refers to a code definition expanded in place in the final program source code.
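
As a tiny illustration, a top-level block could reference other blocks like this (the block names are invented for this note, not taken from the series):

// A definition such as <<main-program>> contains placeholders that the
// literate-programming tool expands in place:
fn main() {
    // <<parse-arguments>>  -> replaced by the Clap argument-parsing code
    // <<run-pipeline>>     -> replaced by the data-processing code
}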

Part 1 describes how the application can read and validate user options based on the Clap crate.

You can see below the command-line interface of our final application with the defined options.

Note that cargo run -- runs the produced binary with all the arguments that follow the two dashes.

$ cargo run -- --help

mdata_app 0.1.0

USAGE:
    mdata_app [OPTIONS] --input <INPUT> --output <OUTPUT> [SUBCOMMAND]

OPTIONS:
    -f, --format <FORMAT>    Output format [default: undefined] [possible values: undefined, csv, parquet]
    -h, --help               Print help information
    -i, --input <INPUT>      Input path
    -l, --limit <LIMIT>      Limit the result to the first <limit> rows [default: 0]
    -o, --output <OUTPUT>    Output path
    -s, --schema             Display the schema
    -v, --verbose            Verbose level
    -V, --version            Print version information

SUBCOMMANDS:
    eq      Add equality test in the where clause
    help    Print this message or the help of the given subcommand(s)

End users can specify input and output paths with an associated format. Additional options control the log verbosity (--verbose) and display the inferred schema (--schema).
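
As a preview of Part 1, these options could be declared with the Clap derive API roughly as follows (a minimal sketch assuming Clap 3; field and type names are my own guesses, not necessarily those used in the final application):

use clap::{ArgEnum, Parser};

/// Supported output formats; `Undefined` lets the application reuse the input format.
#[derive(ArgEnum, Clone, Debug)]
enum Format {
    Undefined,
    Csv,
    Parquet,
}

/// Command-line options mirroring the help output above.
#[derive(Parser, Debug)]
#[clap(name = "mdata_app", version)]
struct Opts {
    /// Input path
    #[clap(short, long)]
    input: String,
    /// Output path
    #[clap(short, long)]
    output: String,
    /// Output format
    #[clap(short, long, arg_enum, default_value = "undefined")]
    format: Format,
    /// Limit the result to the first <limit> rows
    #[clap(short, long, default_value_t = 0)]
    limit: usize,
    /// Display the schema
    #[clap(short, long)]
    schema: bool,
    /// Verbose level
    #[clap(short, long, parse(from_occurrences))]
    verbose: usize,
}

fn main() {
    let opts = Opts::parse();
    println!("{:?}", opts);
}

From this declaration, Clap generates the --help and --version output shown above.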

I define a subcommand mechanism that enables custom help messages and validation for each data transformation. For example, I add a simple filtering operation with the eq subcommand, which takes two arguments:

  • <COLUMN>: the filter column name.
  • <VALUE>: the filter value.

$ cargo run -- eq --help

mdata_app-eq 0.1.0
Add equality test in the where clause

USAGE:
    mdata_app --input <INPUT> --output <OUTPUT> eq <COLUMN> <VALUE>

ARGS:
    <COLUMN>
    <VALUE>

OPTIONS:
    -h, --help       Print help information
    -V, --version    Print version information
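
The eq subcommand can then be modeled as a Clap Subcommand enum plugged into the options struct sketched earlier (again a hedged sketch, not the exact code from Part 1):

use clap::{Parser, Subcommand};

/// Data transformations exposed as subcommands.
#[derive(Subcommand, Debug)]
enum Command {
    /// Add equality test in the where clause
    Eq {
        /// The filter column name
        column: String,
        /// The filter value
        value: String,
    },
}

/// The options struct from the previous sketch, extended with the subcommand.
#[derive(Parser, Debug)]
#[clap(name = "mdata_app", version)]
struct Opts {
    // ... input, output, format, limit, schema, verbose (see above) ...
    /// Optional data transformation
    #[clap(subcommand)]
    command: Option<Command>,
}

fn main() {
    match Opts::parse().command {
        Some(Command::Eq { column, value }) => println!("where {} = {}", column, value),
        None => println!("no transformation requested"),
    }
}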

In Part 2, I describe the read/write operations and the implementation of the filter transformation using the Apache Arrow DataFusion framework.

In this process, I show how to manage user input validation and set up custom transformations.
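
As a preview, the heart of this pipeline can be sketched with the DataFusion DataFrame API (a minimal sketch assuming a 2022-era DataFusion release and the tokio runtime; the paths and the filter match the examples below, and exact method signatures vary between versions):

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Read the input CSV file; the schema is inferred from the data.
    let df = ctx
        .read_csv("tests/inputs/test.csv", CsvReadOptions::new())
        .await?;

    // The eq subcommand becomes a filter expression on the logical plan.
    let df = df.filter(col("col_bool").eq(lit(true)))?;

    // Write the (possibly partitioned) result in the Parquet format.
    df.write_parquet("tests/outputs/out_test", None).await?;

    Ok(())
}

The --limit and --schema options would map naturally onto the DataFrame limit and schema inspection APIs.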

In the final Part 3, I show test strategies for data applications. In particular, I highlight two concerns:

  • Unit testing of small, individual pieces of code with Rust's built-in testing framework.
  • Acceptance testing of the data features and the end-user command-line interface with custom, fake generated datasets (see the sketch after this list).
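
To give a flavor of these two levels, a test file could look roughly like the following sketch (I assume the assert_cmd crate to drive the compiled binary, and is_supported_extension is a hypothetical helper, both invented for this example):

use assert_cmd::Command;

// Hypothetical helper used to illustrate a plain unit test.
fn is_supported_extension(path: &str) -> bool {
    path.ends_with(".csv") || path.ends_with(".parquet")
}

// Unit test: a small, isolated piece of logic, run by `cargo test`.
#[test]
fn detects_supported_extensions() {
    assert!(is_supported_extension("tests/inputs/test.csv"));
    assert!(!is_supported_extension("notes.txt"));
}

// Acceptance test: run the binary end-to-end on a generated dataset.
#[test]
fn converts_csv_to_parquet() {
    Command::cargo_bin("mdata_app")
        .unwrap()
        .args([
            "--input", "tests/inputs/test.csv",
            "--output", "tests/outputs/out_test",
            "--format", "parquet",
        ])
        .assert()
        .success();
}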

For example, I have generated a simple CSV file using the fake-rs crate to show the application’s capabilities.

This sample dataset has four columns: id, col_key, col_bool, and col_value. Each column is randomly sampled from a specification written with the fake-rs crate.
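
Such a generator could look like the following sketch with the fake crate (the value ranges and string lengths are my own guesses, not the exact specification used to produce the test file):

use fake::{Fake, Faker};

fn main() {
    // Same header as the sample dataset below.
    println!("id,col_key,col_bool,col_value");
    for _ in 0..10 {
        let id: u32 = (1000..2000).fake();          // random integer in a range
        let col_key: String = (7..18).fake();       // random alphanumeric string
        let col_bool: bool = Faker.fake();          // random boolean
        let col_value: f32 = (0.0f32..1.0).fake();  // random float in [0, 1)
        println!("{},{},{},{}", id, col_key, col_bool, col_value);
    }
}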

$ head tests/inputs/test.csv

id,col_key,col_bool,col_value
1627,A971hmwwXKSMCUh,true,0.51049227
1525,0C8NOuocYR,false,0.92079586
1506,EfUQ6DIOpbkhI2dvW,true,0.6692676
1883,HzVbQk7gWmH6,false,0.13097137
1759,pRqw8E0qK4,true,0.7207628
1192,1Ueyu3oV6XLL0,true,0.6408737
1699,8ipERPr9HpwT,true,0.3659184
1283,v8YjQXj,false,0.5097597
1208,UameWzlzuUSMWMG0V,false,0.1739344

With the command-line application, we can read this CSV file and rewrite it in the Parquet format, limiting the output to the first five rows.

$ cargo run -- --input tests/inputs/test.csv \
--output tests/outputs/out_test \
--format parquet --limit 5

This command produces the following result.

$ tree tests/outputs/out_test

tests/outputs/out_test
└── part-0.parquet

0 directories, 1 file

And finally, we use our filter operator to transform the dataset. In this case, we keep only the rows with a true value in the “col_bool” column.

$ cargo run -- -i tests/inputs/test.csv \
-o tests/outputs/test_csv_filtered \
eq col_bool true

This filter operator produces the following result.

$ head -q tests/outputs/test_csv_filtered/part-*

id,col_key,col_bool,col_value
1627,A971hmwwXKSMCUh,true,0.51049227
1506,EfUQ6DIOpbkhI2dvW,true,0.6692676
1759,pRqw8E0qK4,true,0.7207628
1192,1Ueyu3oV6XLL0,true,0.6408737
1699,8ipERPr9HpwT,true,0.3659184
1388,n36xdSEM4Rkks,true,0.18349582
1461,FQlJEx4fSdZYW2,true,0.855797
1916,JplFowr,true,0.103708684
1215,GayUaRMjjwbfGTllC,true,0.2439853

Note: the output is a directory containing one or more “part-x.csv” files. Why? The Apache Arrow DataFusion engine is designed to process large volumes of data and uses partitioning to scale its operations.

Wrap-up

Three-part series — Image by Author

To wrap up, we start from a simple cargo binary template and improve it iteratively through the following steps:

Part 1 — A CLI Application with Clap

  • Step 1 — Define the project directories and the dependencies.
  • Step 2 — Define the program structure and describe the main code blocks.
  • Step 3 — Parse the command-line arguments using the Clap crate.

Part 2 — Load, Transform & Write with Apache Arrow DataFusion

  • Step 4 — Define some utilities and structs to manage errors.
  • Step 5 — Load, transform, and write data.

And, coming in the next few weeks :)

Part 3 — Test Your Data App

  • Step 6 — Unit testing and temporary file/directory generation.
  • Step 7 — Acceptance testing of a command-line program with fake generated data.

Enjoy!

References

[1] D. E. Knuth, “Literate programming,” The computer journal, vol. 27, no. 2, pp. 97–111, 1984.
