Smarter Data Publishing with Qri

Chris Whong
Oct 14, 2020

Data scientists, data analysts, data enthusiasts, data junkies… there are many titles we can apply to ourselves as people who work with data, and finding, getting, transforming, and munging data are daily tasks we all share. Data Publisher, however, is a title that's harder to come by in the data world. Consumers are many, but publishers are few. At Qri, we are trying to level that playing field, making it as easy to publish and share your data as it is to find and download data published by others.

In short, we want you to add publisher to your list of data practitioner adjectives.

Open Data as an Operational Process, not a Separate Program

Everyone’s got data, but not everyone has a data catalog. Why? Data catalogs take time and money to build and maintain. They’re a big lift. Whether you build it or buy it, it’s a project that needs staffing, design, stakeholder buy-in, and all the rest. Once it’s live, you need to maintain it, keep your content fresh, publicize new datasets, etc. Open Data/data publishing becomes a program, a process removed from the actual business activities that produce and manage data at an organization.

A helpful analogy comes from the world of open source software. Many organizations share code openly, Qri included. But the source code we share on the web isn’t the output of an established “Open Code Program” that takes source code from code owners and prepares it for public consumption. We simply allow anyone to see our source code in the same place we are actively working on it.

What if we could do the same with datasets? What if published datasets lived closer to the source? What if the team that manages a dataset could publish it on their own, pour all of their institutional knowledge into a readme file, write their own column descriptions, manage their own metadata, and control when new versions are made public?

We built qri to make working with datasets more like working with code. Sharing doesn’t have to be a secondary process that requires a lot of time and energy. Instead, it’s a simple command just like git push that you can do effortlessly when the time is right.

Data, Metadata, and Schema Ride Together

In the qri universe, we’ve defined “dataset” as not just the rows, columns, and values themselves, but as the complete set of metadata, readme, structure, and schema. Unlike most data catalogs where the title, description, keywords, and other important metadata live in a separate database, qri datasets have them baked-in.

Qri’s data format bundles tabular data, metadata, scripting, schema rules, readmes, and commit logs. A change to any component constitutes a new immutable version of the dataset.

Any change to a component, no matter how small, is cryptographically signed and constitutes a new immutable version of the overall dataset. Downstream this means users of a dataset can know who made changes and when they were made, giving them a higher degree of confidence in a dataset’s provenance.
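To make the versioning model concrete, here is a toy sketch in Python. This is an illustration of content addressing in general, not Qri's actual implementation (which also cryptographically signs each version); the component names simply mirror the ones described above.

```python
import hashlib
import json

def version_id(components: dict) -> str:
    """Derive a version id from the full set of dataset components.

    Toy stand-in for content addressing: any change to any
    component changes the hash, so every edit, however small,
    yields a new, distinct version.
    """
    canonical = json.dumps(components, sort_keys=True).encode()
    return hashlib.sha256(canonical).hexdigest()[:12]

v1 = version_id({
    "body": "station,count\nTimes Sq,1024\n",
    "meta": {"title": "Turnstile Counts"},
    "readme": "Raw counts per station.",
})

# Editing only the readme -- not the data itself -- still
# produces a new version of the overall dataset.
v2 = version_id({
    "body": "station,count\nTimes Sq,1024\n",
    "meta": {"title": "Turnstile Counts"},
    "readme": "Daily raw counts per station.",
})

assert v1 != v2
```

Because the id is derived from the content, two collaborators holding the same version id are guaranteed to be looking at byte-identical data, metadata, and readme.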

Craft Datasets Locally, Share When Ready

Just like with version control for code, you can make all the changes you want locally. Add some rows, fix a typo, add a section to the readme, rename a column, change a column type… you can make these small changes locally to get your dataset looking right before you hit the publish button. Or, if you went down the wrong path, you can throw out your changes and start over again with an earlier version.

Make changes, commit. Make changes, commit. Any change to a component constitutes a new version of the overall dataset!

The decentralized nature of qri means you, the data owner, can manage all the things normally handled in a data catalog’s admin UI locally, whether you’re planning to share the data with your team or with the public. Your local Qri collection is part of a distributed network of dataset collections. Each dataset version is portable, so you can share it across the office or across the world with ease and know that all of your collaborators are looking at the same thing.

How to Get Started

This all sounds great in theory, but what can you do with it right now? You’ve got data to share, right? Export that Google Sheet to a file, download your data from that fitness app, or query a table in your database. All you need to get started is a CSV.

Qri Desktop is home base for managing your local qri datasets with a graphical interface.

Drag your CSV into Qri Desktop, fill in a title, and… BAM! You just created a new qri dataset. Add a readme, add a description, enter column descriptions (you know, all the things your CSV is missing that help make the data useful). Once you’re ready, make a commit, and… BAM! Your dataset now has two versions in its history! When you’re ready, hit that publish button, and your dataset will show up with a public preview page for the world to find and use.

The preview page for nyc-transit-data/turnstile_daily_counts_2020 displays the readme, dataset history, and code snippets for integrating into other data tools

On top of dataset previews, we also provide a listing page for each user. It’s like your own personal data catalog, similar to your list of public projects on GitHub. Yours is available at a URL based on your username. Here’s mine!

Once you’ve gotten the hang of crafting data with Qri Desktop, you can graduate to the qri CLI and automate your data flows. Tack a qri save onto the end of your ETL workflow to create immutable snapshots, then publish the ones that pass QA/QC. Use our Python library to archive your work-in-progress dataframes in Jupyter instead of cluttering your working directory with CSVs. The integration possibilities are endless!
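As a sketch of that ETL pattern: the snippet below writes a cleaned CSV, then shells out to `qri save` if the CLI happens to be installed. The dataset name `me/turnstile_counts` is hypothetical, and the exact `qri save` flags are an assumption here; check `qri save --help` for your installed version.

```python
import csv
import shutil
import subprocess

# Hypothetical final step of an ETL job: write the cleaned
# rows out to a CSV file.
rows = [("station", "count"), ("Times Sq", 1024), ("Union Sq", 768)]
with open("turnstile_counts.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# If the qri CLI is on the PATH, snapshot the result as a new
# immutable dataset version. Flags are an assumption; see
# `qri save --help` on your machine.
if shutil.which("qri"):
    subprocess.run(
        ["qri", "save", "--body", "turnstile_counts.csv",
         "me/turnstile_counts"],
        check=True,
    )
```

Run after each ETL pass, this gives you a commit per run; publishing the versions that pass QA/QC remains a separate, deliberate step.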

Qri (pronounced “query”) is a distributed version control…