Flipside’s Chainwalkers and The Redshift COPY Command

Dan Kleiman
Flipside Crypto
Jul 30, 2019

At Flipside Crypto, we recently launched our Chainwalkers program, a collaborative blockchain parsing project that allows us to easily decode and extract on-chain data for any blockchain.

We are actively working with dozens of projects to directly ingest their on-chain data and transform it into a standardized series of events that do 2 things:

  1. more accurately inform our ranking algorithms
  2. form a foundation for more advanced customer analytics

In a previous post, we talked about the different types of customer analytics we can derive from transforming blockchain data. Now that we’ve ramped up ingestion from many more chains, we wanted to explore what a slightly more scaled out process looks like.

Baseline Event Ingestion: Millions Per Day

Currently we are ingesting millions of events PER DAY, which is not a crazy throughput yet, but just looking at a daily average is misleading.

The actual process of on-boarding and integrating projects means that the ingestion loads are much more varied. When we integrate with a new project and ingest their whole chain from the beginning, we get spikes that are easily millions of events per minute.

On the one hand, we don’t want to have to do anything special for high-throughput spikes like this. On the other, we’re also trying to be cost-conscious as a young startup, and we don’t want to over-provision our ETL tools just for intermittent spikes. What kind of system can we deploy to handle these workloads?

The answer is, we cheat! We use a robust ETL tool that barely breaks a sweat with these spiky workloads, and it happens to be the native data load operation in Redshift: COPY.
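For reference, a COPY that pulls a batch of gzipped JSON for one chain straight out of S3 looks roughly like this. The table name, bucket, prefix, and IAM role here are placeholders, not our actual setup:

    COPY ethereum_events
    FROM 's3://example-bucket/chainwalkers/ethereum/batch-000123/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-load'
    FORMAT AS JSON 'auto'
    GZIP;

Redshift fans the files under that prefix out across its slices and loads them in parallel, which is a big part of why spiky backfills barely register.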

First Pass at COPY

Early on, we were writing custom parsers for a few chains and ingesting that data into tables that mapped 1:1 to each chain, e.g. an ethereum_events table or a bitcoin_transactions table. Naturally, our first thought was to scale out this process to additional chains. For each new chain, we would point our parsers at a node, decode the results of a block of data from an RPC call, and copy the data into the special table for that chain.

The expanded process looks like this:

“One Chain, One Table” Chainwalker Ingestion

You are batching each chain’s events and can pretty easily find a cadence where you can parse all the blocks that have been generated in whatever batching interval you choose.

Here, you would be tuning the ETL process to the block generation times of each chain, and balancing that against roughly how many events per block they generate. This all starts to get very specific and varied across chains, though.

The biggest dealbreaker for us was how we would calculate the stats about each chain to drive our analytics insights. Either you are repeating queries across chains and varying table names, which becomes a pain to maintain, or you are writing custom stats calculations on a chain by chain basis.
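To make that concrete, even a simple stat like daily active addresses ends up copy-pasted once per chain under this model, with only the table name changing. The tables and columns here are illustrative:

    -- repeated for every chain; only the table (and often the columns) change
    SELECT DATE_TRUNC('day', block_timestamp) AS day,
           COUNT(DISTINCT from_address)       AS active_addresses
    FROM ethereum_events
    GROUP BY 1;

    -- ... and again for bitcoin_transactions, and for every other chain table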

At the heart of Chainwalkers is the idea that we can work with projects directly to translate their individual use cases and quirks into a more universal event model that captures on-chain activity.

If we can do that — normalize on-chain events into a single standard model — we can build a much more efficient analytics generation process and a much simpler ETL process.
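As a rough sketch, that standard model boils down to one wide table keyed by chain. These columns are illustrative rather than our exact schema:

    CREATE TABLE events (
        chain           VARCHAR(64)   NOT NULL,   -- e.g. 'ethereum', 'bitcoin'
        block_height    BIGINT        NOT NULL,
        block_timestamp TIMESTAMP     NOT NULL,
        tx_id           VARCHAR(256)  NOT NULL,
        event_index     INT           NOT NULL,   -- position of the event within its transaction
        event_type      VARCHAR(64),
        payload         VARCHAR(65535)            -- decoded, chain-specific details as JSON text
    )
    DISTKEY (tx_id)
    SORTKEY (chain, block_timestamp);

Every stats query can now filter or group by chain instead of being rewritten per table.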

“All Chains, One Table”

As it turns out, the initial “one chain, one table” model also wasn’t really leveraging any of Redshift’s native data loading superpowers, which is what we’re going to focus on in the rest of this post.

Parallelism and Failed Deduping

Initially when we started experimenting with “all chains, one table” for Chainwalkers, Redshift started blowing up.

Battling for Inserts

When we let each Chainwalker run on its own, we also let it handle its own INSERTs into the centralized table. The flow would go like this (there’s a SQL sketch after the list):

  • parse a single batch of events
  • create a batch-specific staging table
  • COPY the data for the batch into staging
  • DELETE any overlap between staging and the final target table, i.e. The Big Central Events Table
  • INSERT staging data into the main events table
  • DROP the staging table
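
In SQL terms, each walker’s batch job ran something roughly like the following. The table names, batch id, bucket, and IAM role are all placeholders, and the schema mirrors the illustrative unified model above rather than our exact one:

    -- per-batch staging table for one walker's batch
    CREATE TABLE batch_12345_staging (LIKE events);

    COPY batch_12345_staging
    FROM 's3://example-bucket/chainwalkers/ethereum/batch-12345/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/example-redshift-load'
    FORMAT AS JSON 'auto'
    GZIP;

    -- dedupe: drop any rows the target table already has for this batch
    DELETE FROM events
    USING batch_12345_staging
    WHERE events.chain = batch_12345_staging.chain
      AND events.tx_id = batch_12345_staging.tx_id
      AND events.event_index = batch_12345_staging.event_index;

    INSERT INTO events SELECT * FROM batch_12345_staging;

    DROP TABLE batch_12345_staging;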

While this pattern of staging-delete-insert-drop is an important one to use to ensure data deduplication and replay-ability, with dozens of Chainwalkers running in parallel it was causing problems.

A column-oriented database like Redshift is not really designed for rapid, concurrent updates like this. Uniqueness constraints aren’t enforced, and there is no native fallback logic to handle duplicate data. We were trying to enforce that in our ETL jobs, but we were specifically breaking on the last two steps above — we were attempting to delete and insert too frequently, breaking the serializability of the operations in the database engine.

Some applications require not only concurrent querying and loading, but also the ability to write to multiple tables or the same table concurrently. In this context, concurrently means overlapping, not scheduled to run at precisely the same time. Two transactions are considered to be concurrent if the second one starts before the first commits. Concurrent operations can originate from different sessions that are controlled either by the same user or by different users.

(from the Redshift docs)

Again, the throughput itself was not a problem at all. We’re still pretty light when it comes to records per minute of data. But we needed a new ETL pattern to handle the collision of dozens of jobs all trying to dedupe their writes to the same table at the same time.

Flexible Parallelism + Staging as a Buffer

We needed a way to avoid multiple jobs attempting to dedupe at any given time. In a sense, we went back to “easy mode” from the first stage of “one chain, one table”, where there was only ever one job deduping and inserting data into one table at a time.

We introduced a unified staging table that all Chainwalkers will COPY into at whatever cadence they are processing on-chain events:

Unified Staging Table, Batch Updates
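
The staging table itself can look almost identical to the events table. Here is one sketch, assuming we stamp each row with a load time so the downstream job has a window to cut batches on (again, illustrative names and columns):

    CREATE TABLE events_staging (
        chain           VARCHAR(64),
        block_height    BIGINT,
        block_timestamp TIMESTAMP,
        tx_id           VARCHAR(256),
        event_index     INT,
        event_type      VARCHAR(64),
        payload         VARCHAR(65535),
        loaded_at       TIMESTAMP DEFAULT GETDATE()  -- assumed load-time stamp for batching
    );

Each Chainwalker just COPYs into events_staging and moves on; nothing downstream blocks it.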

We’ve stopped the independent jobs from fighting to cleanly insert their data and let Redshift chug along with parallel COPYs into a staging table. That gives us the flexibility to scale out more chains running at the same time, and it also makes running a high-throughput backfill for any new chain almost unnoticeable.

There’s a second phase to “staging”, though.

We use a time-based batching strategy to isolate a set of events from staging, dedupe them, and do a single insert into the main events table. Now that there is only one job doing batching, deduping, and inserting, the target events table never ends up in that confused state where jobs are asking for records that were just deleted or fighting for memory.
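A minimal sketch of that single batch job, assuming the staging table carries a load timestamp as above; the cutoff literal is just a placeholder for the end of the current batching window:

    BEGIN;

    -- dedupe against the target table for this window
    DELETE FROM events
    USING events_staging
    WHERE events.chain = events_staging.chain
      AND events.tx_id = events_staging.tx_id
      AND events.event_index = events_staging.event_index
      AND events_staging.loaded_at < '2019-07-30 12:00:00';

    -- move the deduped window into the main events table
    INSERT INTO events (chain, block_height, block_timestamp, tx_id, event_index, event_type, payload)
    SELECT DISTINCT chain, block_height, block_timestamp, tx_id, event_index, event_type, payload
    FROM events_staging
    WHERE loaded_at < '2019-07-30 12:00:00';

    -- clear the processed window out of staging
    DELETE FROM events_staging
    WHERE loaded_at < '2019-07-30 12:00:00';

    COMMIT;

Because exactly one job runs this at a time, the delete-then-insert on the events table never collides with another writer.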

Lessons for Future ETL

Our end-to-end flow now supports any chain, with its own parser, feeding into a single staging table and being deduped into a unified events table, like this:

Scalable Event Ingestion with Chainwalkers

Now, as we expand the Chainwalkers program even further, we can focus on collaborating with projects to decode and analyze their data, instead of wondering if our ETL infrastructure will hold up.

In hindsight, as we tuned the Chainwalkers ETL flow, the question was always: how does this tool match the workload we want to perform? As workloads changed, we understood our tools better based on whether or not they held up to new performance demands.

If you’re passionate about challenges like this and diving into blockchain data, we’d love for you to explore our open roles!
