Grokking Qri

Chris Whong
Jan 21 · 5 min read

grok /ɡräk/ verb understand (something) intuitively or by empathy


Qri is a new way to work with unruly datasets. We’re trying to solve several of the problems that plague data scientists and analysts, from versioning to documentation to publishing and sharing.

But what is qri even? Is it a database? A data format? A sharing platform? The answer to all of these is yes, and in this post I’ll try to clarify what qri is, how it works, and how you can use it today in your data workflows.


Raw Data + Context = a better “dataset”

“Send me that dataset” usually means “email me that CSV/XLSX”. When the file arrives on your computer, whether emailed or downloaded, it’s usually detached from all of the goodies that you need to make use of it.

Metadata? It’s on the website/in the share drive. Context? Go read the slack thread from last week to get a better idea of why we collected the data. Find Bob and ask him why he encoded column X without zero-padding. Karen made a readme.txt, but didn’t copy her updated version to dropbox, so there are 20 more lines that you don’t have. 🤯

What if you could package up a table of data with structured metadata and a readme, and they always lived together? With qri, that’s what a dataset is. We call these discrete parts of a dataset components.

Qri’s data format bundles tabular data, metadata, scripting, schema rules, readmes, and commit logs. A change to any component constitutes a new immutable version of the dataset.

In addition to the raw table of data (the body), the metadata (the meta), and a readme (the readme), we also have a few more components you might not have known you needed:

  • the transform — a python-like script file that can update the body automagically.
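To make the "components travel together" idea concrete, here is a minimal Python sketch. It is not qri's actual data format or API: the `Components` class, its field types, and the `fingerprint` helper are all hypothetical, invented for illustration. It only shows how bundling the body, meta, readme, and transform into one unit means that a change to any one of them produces a different version.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field, replace
from typing import Dict, List


@dataclass(frozen=True)
class Components:
    """Hypothetical bundle of the components named above (not qri's real model)."""
    body: List[List[str]]                               # the raw table of data
    meta: Dict[str, str] = field(default_factory=dict)  # structured metadata
    readme: str = ""                                    # human-readable notes
    transform: str = ""                                 # script that can update the body

    def fingerprint(self) -> str:
        # Hash every component together, so changing any one changes the result.
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


v1 = Components(
    body=[["site", "borough"], ["Example Plaza", "Bronx"]],  # made-up rows
    meta={"title": "NYC Plaza Program Sites"},
)
# Editing only the readme still yields a different fingerprint, i.e. a new version.
v2 = replace(v1, readme="Notes on how the sites were collected.")
print(v1.fingerprint() != v2.fingerprint())
```

The frozen dataclass echoes the idea that a saved version is not edited in place: `replace` builds a new bundle instead of mutating the old one.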

So that’s a qri dataset. But where do datasets live? They aren’t files, so you can’t see them in a folder on your computer like you can with CSVs. They live in qri’s internal datastore, so you need to use qri tools to interact with them. That’s super-easy using Qri Desktop, our graphical interface. If you’re comfortable with the terminal you can use the Command Line Interface.

Qri Desktop’s collection view shows all of your local qri datasets, and allows for quick editing and versioning of components. The command `qri list` shows the same information on the command line:
$ qri list
1   chriswhong/nyc_plaza_program_sites
    NYC Plaza Program Sites
    linked: /Users/chriswhong/datasets/nyc_public_plazas
    16 kB, 78 entries, 13 errors

2   chriswhong/only_two_rowscsv
    My cool dataset
    linked: /Users/chriswhong/datasets/only_two_rowscsv
    36 B, 2 entries, 4 errors

3   chriswhong/pluto_19v1_10kcsv_1
    My Dataset
    /ipfs/QmXciFZ7CZj3PfauaXSpKd6amyUpWh4qiPhPGywFbzjhWa
    6.9 MB, 9999 entries, 5392 errors

4   chriswhong/simple_csv
    /ipfs/Qme1KedPB7assGKfznH5jQb1vGsZyNkcwVkkGpuhrq2XYp
    47 B, 2 entries, 2 errors

5   chriswhong/world_bank_population
    County Total Population - World Bank
    ...

Dataset Versioning & Naming

Me a few seconds ago: “So that’s a qri dataset”. Me right now: “Actually, that’s not a dataset, it’s a version of a dataset”. 😇 What we call a dataset at qri is a combination of one or more versions under a single dataset name.

Think about how you might do version control with a folder and several files on your computer. You could name the folder for the dataset, then append version numbers to the CSVs:

The filesystem-version-control-system. We want to make sure you never type _FINAL into a filename again.

With qri, you name the dataset and then start making versions under that name. Each version is created at a specific time, so it’s always crystal clear what the latest version is. All of the older versions are immutable; they can live forever under the dataset name as a record of changes over time.

Make changes, commit. Make changes, commit. Any change to a component constitutes a new version of the overall dataset!

To recap, a qri dataset has a name, consists of one or more chronological versions (user x saved at time y), and each version includes components (body, meta, etc).
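The recap above can be sketched in a few lines of Python. This is an illustrative model only, not how qri stores anything internally: `Dataset`, `Version`, and `commit` are hypothetical names, and the read-only mapping is just one simple way to mimic immutable versions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from types import MappingProxyType
from typing import List, Mapping


@dataclass(frozen=True)
class Version:
    """One saved version: user x saved at time y, plus its components."""
    author: str
    saved_at: datetime
    components: Mapping[str, object]  # e.g. {"body": ..., "meta": ..., "readme": ...}


@dataclass
class Dataset:
    """A dataset name with one or more chronological versions (hypothetical model)."""
    name: str
    versions: List[Version] = field(default_factory=list)

    def commit(self, author: str, **changes: object) -> Version:
        # Start from the latest version's components and apply the changes.
        base = dict(self.versions[-1].components) if self.versions else {}
        merged = {**base, **changes}
        if merged == base:
            raise ValueError("nothing changed, so there is no new version")
        # MappingProxyType gives a read-only view; older versions are never edited.
        # (Shallow only: values such as lists are shared between versions.)
        version = Version(author, datetime.now(timezone.utc), MappingProxyType(merged))
        self.versions.append(version)
        return version

    @property
    def latest(self) -> Version:
        return self.versions[-1]


ds = Dataset("world_bank_population")
ds.commit("b5", body=[["country", "population"]], meta={"title": "Population"})
ds.commit("b5", readme="# Notes on sources")  # changing one component is enough
print(len(ds.versions), sorted(ds.latest.components))
```

Each commit appends to the history rather than overwriting it, so the first version still has no readme even after the second one adds it.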

Let’s go one level higher. All qri datasets are associated with exactly one user. To use qri, you must establish your identity. This is what allows the system to keep track of who made changes and when. We use username/dataset_name notation to refer to datasets (e.g. chriswhong/usgs_earthquakes or b5/world_bank_population).
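If you ever need to handle these references in your own scripts, splitting them is simple. The helper below is a hypothetical sketch, not part of any qri library, and it assumes a reference is exactly one username and one dataset name separated by a single slash.

```python
from typing import NamedTuple


class DatasetRef(NamedTuple):
    username: str
    name: str


def parse_ref(ref: str) -> DatasetRef:
    """Split 'username/dataset_name' into its two parts (hypothetical helper)."""
    username, sep, name = ref.partition("/")
    # Require both halves and exactly one slash.
    if not sep or not username or not name or "/" in name:
        raise ValueError(f"expected username/dataset_name, got {ref!r}")
    return DatasetRef(username, name)


ref = parse_ref("chriswhong/usgs_earthquakes")
print(ref.username, ref.name)
```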

GitHub users may find our dataset naming similar to the username/repository_name notation used to version code. Just like with code, the latest version of a dataset is probably most important to most users most of the time, but keeping the full history around for reference is extremely valuable.

So, let’s recap again. Qri users have datasets. Datasets are made of versions. Versions are made of components. Components each have a specific model and purpose for storing either the data itself, or some other useful information about the data.

🤓 The qri user namespace is filling up quickly. You’d better sign up now and get your short, numberless username while it’s still available!


Publishing and Sharing Datasets

Here’s the best part about qri: everything we’ve detailed above is easily portable without the use of files. Read that again. Qri datasets can be moved around quickly and effortlessly, all without using files.

Qri.cloud is your dataset archive in the sky. Your datasets live on your computer, but when you’re ready to share them with the world, you can publish them to qri.cloud. Everything goes together. The data, the metadata, the schema, etc. You can push one version or the entire version history.

Once your dataset is published to qri.cloud, it has its own discoverable and shareable web page that shows a preview of the dataset. You, the user, also get a public profile listing all of your published datasets. It’s like your own personal open data portal! (Here’s mine)

The qri cloud preview page for the dataset b5/world_bank_population

Instead of sending a CSV file, share a link to a published qri dataset! Other Qri users can clone your dataset from qri.cloud just as easily as you published it. When it arrives, it has everything you published it with: the readme, the metadata, the data itself. It’s all there!


Give it a try!

You can try out qri right now. Download Qri Desktop, fire it up, and drag in a CSV file to create your first dataset. Add some metadata, add a readme, add some column validation. Take a look at our desktop quickstart tutorial, or read the rest of our docs. Jump into our Discord server or give us a yell on Twitter. We’d love to hear how your first experience with qri goes.

qri.io

Qri (pronounced “query”) is a distributed version control system for datasets

Written by Chris Whong

Urbanist, Mapmaker, & Data Junkie. Outreach Engineer at Qri.io
