Qri is a new way to work with unruly datasets. We’re trying to solve several of the problems that plague data scientists and analysts, from versioning to documentation to publishing and sharing.
But what is qri even? Is it a database? A data format? A sharing platform? The answer to all of these is yes, and in this post I’ll try to clarify what qri is, how it works, and how you can use it today in your data workflows.
Raw Data + Context = a better “dataset”
“Send me that dataset” usually means “email me that CSV/XLSX”. When the file arrives on your computer, whether emailed or downloaded, it’s usually detached from all of the goodies that you need to make use of it.
Metadata? It’s on the website or in the share drive. Context? Go read the Slack thread from last week to get a better idea of why we collected the data. Find Bob and ask him why he encoded column X without zero-padding. Karen made a readme.txt, but didn’t copy her updated version to Dropbox, so there are 20 more lines that you don’t have. 🤯
What if you could package up a table of data with structured metadata and a readme, and they always lived together? With qri, that’s what a dataset is. We call these discrete parts of a dataset components.
- the transform — a Python-like script file that can update the body automagically.
- the commit — information about the latest version of the dataset, including who modified the data.
- the structure — CSV data is highly portable and interoperable, but can’t easily store things like column types and validation rules. The structure component does just that.
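To make the structure component a little more concrete, here is a rough Python sketch of the idea: column types and validation rules that travel with a CSV body and can be checked against it. The `COLUMNS` layout, the `count_errors` helper, and the sample rows are all made up for illustration; this is not qri's actual structure format or validation logic.

```python
import csv
import io

# Hypothetical structure: one entry per column, with a type and a simple rule.
COLUMNS = [
    {"title": "borough", "type": str, "rule": lambda v: len(v) > 0},
    {"title": "plazas", "type": int, "rule": lambda v: v >= 0},
]


def count_errors(csv_text, columns):
    """Count cells that are missing, fail type conversion, or break their rule."""
    errors = 0
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        for col in columns:
            raw = row.get(col["title"])
            if raw is None:  # short row: the cell is missing entirely
                errors += 1
                continue
            try:
                value = col["type"](raw)
            except (TypeError, ValueError):
                errors += 1
                continue
            if not col["rule"](value):
                errors += 1
    return errors


# Made-up sample body: the second row has a non-numeric count,
# the third has an empty borough and a negative count.
BODY = "borough,plazas\nBrooklyn,12\nQueens,unknown\n,-3\n"
print(count_errors(BODY, COLUMNS))  # prints 3
```

A plain CSV has nowhere to keep rules like these, which is the gap the structure component is meant to fill.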
So that’s a qri dataset. But where do datasets live? They aren’t files, so you can’t see them in a folder on your computer like you can with CSVs. They live in qri’s internal datastore, so you need to use qri tools to interact with them. That’s super-easy using Qri Desktop, our graphical interface. If you’re comfortable with the terminal, you can use the Command Line Interface.
$ qri list
1   chriswhong/nyc_plaza_program_sites
    NYC Plaza Program Sites
    16 kB, 78 entries, 13 errors

2   chriswhong/only_two_rowscsv
    My cool dataset
    36 B, 2 entries, 4 errors

3   chriswhong/pluto_19v1_10kcsv_1
    6.9 MB, 9999 entries, 5392 errors

4   chriswhong/simple_csv
    47 B, 2 entries, 2 errors

5   chriswhong/world_bank_population
    County Total Population - World Bank
Dataset Versioning & Naming
Me a few seconds ago: “So that’s a qri dataset”. Me right now: “Actually, that’s not a dataset, it’s a version of a dataset”. 😇 What we call a dataset at qri is a combination of one or more versions under a single dataset name.
Think about how you might do version control with a folder and several files on your computer. You could name the folder for the dataset, then append version numbers to the CSV filenames.
With qri, you name the dataset and then start making versions under that name. Each version is created at a specific time, so it’s always crystal clear what the latest version is. All of the older versions are immutable; they can live forever under the dataset name as a record of changes over time.
To recap, a qri dataset has a name, consists of one or more chronological versions (user x saved at time y), and each version includes components (body, meta, etc).
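If it helps to see that recap as code, here is a minimal Python sketch of the model: a named dataset holding a chronological list of immutable versions, each carrying a few components. The `Version` and `Dataset` classes, their fields, and the sample values are hypothetical and only illustrate the concept; qri's internal representation is not shown in this post and may look quite different.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)  # frozen: a saved version cannot be reassigned afterwards
class Version:
    """One snapshot of a dataset: who saved it, when, and its components."""
    author: str
    saved_at: datetime
    body: str                                  # the data itself, e.g. CSV text
    meta: dict = field(default_factory=dict)   # structured metadata
    readme: str = ""


@dataclass
class Dataset:
    name: str
    versions: list = field(default_factory=list)

    def save(self, version):
        """Append a new version; older versions are left untouched."""
        if self.versions and version.saved_at <= self.versions[-1].saved_at:
            raise ValueError("versions must be saved in chronological order")
        self.versions.append(version)

    def latest(self):
        """Return the most recent version, or None if nothing is saved yet."""
        return self.versions[-1] if self.versions else None


# Made-up sample data, for illustration only.
ds = Dataset("usgs_earthquakes")
ds.save(Version("chriswhong", datetime(2020, 1, 1, tzinfo=timezone.utc),
                "id,mag\n1,4.5\n"))
ds.save(Version("chriswhong", datetime(2020, 2, 1, tzinfo=timezone.utc),
                "id,mag\n1,4.5\n2,5.1\n"))
print(len(ds.versions), ds.latest().saved_at.date())
```

Because every version records its own timestamp, "which one is the latest?" is answered by the data rather than by a filename convention.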
Let’s go one level higher. All qri datasets are associated with exactly one user. To use qri, you must establish your identity. This is what allows the system to keep track of who made changes and when. We use username/dataset_name notation to refer to datasets. (e.g. chriswhong/usgs_earthquakes, or b5/world_bank_population)
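As a small illustration of the username/dataset_name notation, here is a Python sketch that splits a reference into its two parts. `DatasetRef` and `parse_ref` are hypothetical helpers written for this post, not part of any qri tool, and they only handle the two-part form shown above.

```python
from typing import NamedTuple


class DatasetRef(NamedTuple):
    username: str
    name: str


def parse_ref(ref: str) -> DatasetRef:
    """Split 'username/dataset_name' into its username and dataset name."""
    username, sep, name = ref.partition("/")
    if not sep or not username or not name or "/" in name:
        raise ValueError(f"expected username/dataset_name, got {ref!r}")
    return DatasetRef(username, name)


ref = parse_ref("chriswhong/usgs_earthquakes")
print(ref.username, ref.name)  # prints: chriswhong usgs_earthquakes
```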
GitHub users may find our dataset naming similar to the username/repository_name notation used to version code. Just like with code, the latest version of a dataset is probably most important to most users most of the time, but keeping the full history around for reference is extremely valuable.
So, let’s recap again. Qri users have datasets. Datasets are made of versions. Versions are made of components. Components each have a specific model and purpose for storing either the data itself, or some other useful information about the data.
🤓 The qri user namespace is filling up quickly. You’d better sign up now and get your short, numberless username while it’s still available!
Publishing and Sharing Datasets
Here’s the best part about qri: everything we’ve detailed above is easily portable without the use of files. Read that again. Qri datasets can be moved around quickly and effortlessly, all without using files.
Qri.cloud is your dataset archive in the sky. Your datasets live on your computer, but when you’re ready to share them with the world, you can publish them to qri.cloud. Everything goes together. The data, the metadata, the schema, etc. You can push one version or the entire version history.
Once your dataset is published to qri.cloud, it has its own discoverable and shareable web page that shows a preview of the dataset. You, the user, also get a public profile listing all of your published datasets. It’s like your own personal open data portal! (Here’s mine)
Instead of sending a CSV file, share a link to a published qri dataset! Other Qri users can clone your dataset from qri.cloud just as easily as you published it! When it arrives, it has everything you published it with… the readme, the metadata, the data itself. It’s all there!
Give it a try!
You can try out qri right now. Download Qri Desktop, fire it up, and drag in a CSV file to create your first dataset. Add some metadata, add a readme, add some column validation. Take a look at our desktop quickstart tutorial, or read the rest of our docs. Jump into our Discord server or give us a yell on Twitter. We’d love to hear how your first experience with qri goes.