Basis: The Data Protocol for the Next 70,000 Years

*DRAFT — Work in progress*

We are entering the next era of knowledge. The first 70,000 years were based on human language; the next 70,000 will be based on data. Information, knowledge, and experience will increasingly be shared not with letters, words, and stories, but with bits, bytes, datasets, and models. Experience will come not through the human senses, shared with stories, but through device sensors, shared via the internet.

There are two options for this future of knowledge: it can be private, siloed, and exploited — to the narrow benefit of Google, Facebook, and the CCP — or, it can be open, shared, and collaborated on, owned collectively, as a global ecosystem of knowledge, for the benefit of all.

Basis is the quest for the second of these possible futures.

Basis is the world’s first open and collaborative data ecosystem. Our mission is to build an open ecosystem of datasets, data sources, analyses, models, and visualizations, each of which can be shared, connected, and forked by anyone. This ecosystem is powered under the hood by our open source framework, which you can think of as “bash for data”. It makes building and connecting modular data components simple and safe with a functional reactive approach.

At the heart of Basis are data functions: pure operations on data, written in Python, SQL, R, or JavaScript, that operate on datablocks. These datablocks have optional schemas (lightweight definitions of data structure and semantics) that can be shared across functions and projects. Schemas allow a community of analysts, data scientists, and data engineers to collaborate with a common language of data.

These components come together to power the full stack of data work: ingestion, ETL, analysis, modeling, visualization, dashboards, and export to destinations. When every component speaks the same language, any one of them can be connected to any other.

To illustrate this, say we want to build a Customer Lifetime Value (CLV) model based on a history of customer transactions. I could build this data function in Python, like so:

<python clv model>

This data function takes as input a datablock with a schema of type Transaction. That schema defines a number of columns and other metadata.

<transaction schema yaml>
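As a rough illustration of the shape of such a function and schema, here is a minimal sketch in plain Python. It is not the Basis API: the field names (`customer_id`, `amount`, `transacted_on`), the `validate` helper, and the list-of-dicts stand-in for a datablock are all assumptions made for this example, and the "model" is simply total historical spend per customer.

```python
from collections import defaultdict
from datetime import date

# Hypothetical Transaction schema: these field names and types are assumptions
# made for this sketch, not the real Basis schema definition.
TRANSACTION_FIELDS = {"customer_id": str, "amount": float, "transacted_on": date}


def validate(records, fields):
    """Check that every record has the expected fields with the expected types."""
    for record in records:
        for name, expected_type in fields.items():
            if not isinstance(record.get(name), expected_type):
                raise TypeError(f"field {name!r} should be {expected_type.__name__}")


def customer_lifetime_value(transactions):
    """Naive historical CLV: total spend per customer.

    A pure function: it reads its input and returns a new dict,
    without modifying the input records.
    """
    validate(transactions, TRANSACTION_FIELDS)
    totals = defaultdict(float)
    for txn in transactions:
        totals[txn["customer_id"]] += txn["amount"]
    return dict(totals)


transactions = [
    {"customer_id": "a", "amount": 10.0, "transacted_on": date(2020, 1, 1)},
    {"customer_id": "a", "amount": 20.0, "transacted_on": date(2020, 2, 1)},
    {"customer_id": "b", "amount": 5.0, "transacted_on": date(2020, 1, 15)},
]
clv = customer_lifetime_value(transactions)
```

The point of the sketch is that the function only depends on the schema's fields, so any data that can present itself as a Transaction can be passed in unchanged.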

The power of schemas in a collaborative ecosystem like Basis is that I can now use this same CLV model with many different data sources. For example, the Stripe charges data that I ingested with my Stripe connector has a StripeCharge schema that implements the common Transaction schema above:

<stripe charge schema yaml>

Now I can run my CLV model data function on my StripeCharge data with no extra work — the components know how to talk to each other already!
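One way to picture what "implements" means here is a mapping from the fields of the general schema to the fields of the specific one. The sketch below assumes that reading. The mapping, the `conform` helper, and all field names are hypothetical, chosen for illustration rather than taken from Basis or its Stripe connector.

```python
# Hypothetical mapping: Transaction field name -> StripeCharge field name.
STRIPE_CHARGE_AS_TRANSACTION = {
    "customer_id": "customer",
    "amount": "amount",
    "transacted_on": "created",
}


def conform(records, field_map):
    """Present records under the target schema's field names.

    Fields that the target schema does not mention are dropped, and the
    input records are left untouched.
    """
    return [
        {target: record[source] for target, source in field_map.items()}
        for record in records
    ]


stripe_charges = [
    {"id": "ch_1", "customer": "cus_A", "amount": 1200, "created": "2021-01-04"},
    {"id": "ch_2", "customer": "cus_B", "amount": 500, "created": "2021-01-05"},
]
as_transactions = conform(stripe_charges, STRIPE_CHARGE_AS_TRANSACTION)
```

With a mapping like this declared once on the schema, the renaming step can happen automatically whenever a function that expects Transaction data is connected to StripeCharge data.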

But of course, it’s not just Stripe data that knows how to talk this common language; the whole Basis ecosystem does. So now I can also plug my ShopifyOrder data into the same CLV model, or my Patreon recurring payments, or my own business application data.

<collage of shopify patreon schema yaml>

Now we can all collaborate, as a whole ecosystem, on building the best CLV model we can, and we all benefit. The alternative is re-inventing the wheel inside every organization, making the same gotcha bugs over and over again, and never advancing the state of the art.

The benefits of having a common data language, in one ecosystem, go beyond the ability to re-use components. Schemas also provide semantic metadata that any component can use to do useful work for you automatically. For instance, Basis lets you build visualization and dashboard data functions in JS / React. These React functions operate on datablocks just like a Python or SQL function, but output user interfaces instead:

<example of js function>

This means we can use schemas to automatically build visualizations that match the data we have. Our StripeCharge data implements the common Transaction schema, and our Transaction schema implements the even more basic Timeseries schema:

<timeseries yaml>

So now we can, in one click, plug our Stripe data into any of the charting components that work with Timeseries data:

<visualization of timeseries data>
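The matching step behind that one click can be sketched as a walk up the chain of schemas. This is a minimal illustration, assuming each schema names at most one more general schema that it implements. The `IMPLEMENTS` registry and both helper functions are hypothetical, not part of Basis.

```python
# Hypothetical registry: each schema points to the more general schema it
# implements, or None if it is the most general one.
IMPLEMENTS = {
    "StripeCharge": "Transaction",
    "ShopifyOrder": "Transaction",
    "Transaction": "Timeseries",
    "Timeseries": None,
}


def schema_chain(name):
    """Return the schema followed by every schema it transitively implements."""
    chain = []
    while name is not None:
        chain.append(name)
        name = IMPLEMENTS[name]
    return chain


def accepts(component_schema, data_schema):
    """True if data with data_schema can feed a component expecting component_schema."""
    return component_schema in schema_chain(data_schema)


stripe_chain = schema_chain("StripeCharge")
can_chart_stripe = accepts("Timeseries", "StripeCharge")
```

Compatibility runs in one direction only: a Timeseries chart can take StripeCharge data, but a function that needs StripeCharge fields cannot take arbitrary Timeseries data.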

Even more powerful, we can build documentation right into all the tools and user experience, up and down the data stack. Because it’s all in one unified ecosystem, the end user visualizing the metrics dashboard can, in two clicks, see exactly how the data was produced, what each column means, and any errors or gotchas along the way.

The benefits of connected data go beyond structural and semantic similarities as well. Much of the most important data in the world is public and open — government, macroeconomic, financial, sports, and healthcare. That means we can all build not just the tools and analyses for data, but the actual cleaned, documented datasets as well.

Going back to our Stripe data, a common problem is converting transaction amounts between currencies, sometimes for accounting purposes and sometimes for customer understanding. Normally this would mean subscribing to a foreign exchange API, ingesting the data, and doing the conversions yourself, which is an onerous process. In Basis, forex data is already open, public, and cleaned, ready for you to use. Even better, others have built functions to handle the nuances of converting currencies for you, so it is as simple as connecting an existing function into your flow:

<currency conversion function py>

(and/or <screenshot of app UI for building node from existing function>)
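To give a feel for what such a shared function does, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the `convert_to_usd` name, the record fields, and especially the exchange rates, which are made-up numbers and not real market data.

```python
from datetime import date
from decimal import Decimal

# Made-up exchange rates for illustration only. In practice these would come
# from an open forex dataset. Keys are (currency, day); values are USD per unit.
EXAMPLE_RATES_TO_USD = {
    ("EUR", date(2021, 1, 4)): Decimal("1.25"),
    ("GBP", date(2021, 1, 4)): Decimal("1.50"),
}


def convert_to_usd(transactions, rates):
    """Return new records with amounts converted to USD at that day's rate."""
    converted = []
    for txn in transactions:
        if txn["currency"] == "USD":
            rate = Decimal("1")
        else:
            # A missing rate raises KeyError rather than silently guessing.
            rate = rates[(txn["currency"], txn["transacted_on"])]
        converted.append({**txn, "amount": txn["amount"] * rate, "currency": "USD"})
    return converted


charges = [
    {"customer_id": "c1", "amount": Decimal("10.00"), "currency": "EUR",
     "transacted_on": date(2021, 1, 4)},
    {"customer_id": "c2", "amount": Decimal("7.00"), "currency": "USD",
     "transacted_on": date(2021, 1, 4)},
]
usd_charges = convert_to_usd(charges, EXAMPLE_RATES_TO_USD)
```

The sketch uses `Decimal` rather than floats so that monetary amounts are not subject to binary rounding, and it looks up the rate for the day of each transaction, which is one of the nuances a shared function can get right once for everyone.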

The goal is a global community that is building the future of knowledge, in the open. We believe access to the world’s data will be a fundamental right for future generations. To that end, we’ve pledged to make basic analysis on Basis free forever: we will never charge for basic access to open data on our platform. In addition, Basis Pro is available to students, educators, and academics free of charge. We believe it’s a win for society, the advancement of knowledge, and the ecosystem for everyone to have access to the world’s data.

Request early access to the Basis platform here. Priority goes to open source contributors and researchers.

Building knowledge tools. Founder @ Twitter: @squaredloss