Introducing the Civil Events Crawler

Peter Ng · Civil · Aug 6, 2018

This is part one of a two-part series. Read part two here.

Photo courtesy of Hajor

The core of Civil’s journalism protocol is built upon smart contracts written for the Ethereum blockchain. These contracts provide interfaces to access the token curated registry, newsroom data, and other components of the marketplace. Once deployed, they are publicly accessible to anyone interested in interacting with them. You can read about them in our whitepaper or look at the contract code in Civil’s GitHub repository.

While the smart contracts provide a secure, decentralized place for this data to live, there are still a few challenges to overcome:

  1. Reading this data directly from a contract on the blockchain can be cumbersome, especially when trying to aggregate it.
  2. Many developers don’t yet know how to interact with the Ethereum blockchain. (Where are the contracts? How do I connect to an Ethereum node?) This is a problem for our goal of accessibility and transparency.
  3. It can be very expensive to run a job and store data in a smart contract.

To solve some of these issues, we decided to build an off-chain system that will watch for interactions on our contracts, run jobs to process that data, and wrap it in a GraphQL/REST API for internal and external use. This solves challenges #1 and #2 by providing open access to a web standard API. Issue #3 will be solved by processing and storing data off-chain.

As we’ll discuss in the “next steps” section below, a near-term goal is to make this more generic so that other developers in the community can capture, process, and/or serve their own contract data.

A High Level View

The following diagram is a high-level overview of the system as it exists today. The system is broken down into three main components:

  1. The crawler component watches for interactions on the smart contracts. The metadata of these interactions is stored for our records and for processing.
  2. The processor component retrieves data from these interactions and performs some processing on it, such as aggregating stats, updating newsroom state, or recording published articles. It persists the results of that work for later retrieval.
  3. The API component surfaces the processed data to the outside world.
D1. Overview of Crawler

This system is written and developed in Go version 1.10. Aside from its well-documented pros (and cons), we chose Go because of staff familiarity, cross-compilation of binaries, and the tooling available via the Ethereum library go-ethereum/Geth. We will talk a bit more about how we are using Go in a future post.

Now let’s talk about each major component and how it functions. The next few sections are a bit more technical, so feel free to reference the diagram above as we step through each one.

Watching for Interactions

To watch for contract interactions, we leverage Ethereum event logs. In short, events can be added to a smart contract and emitted to represent particular interactions with it. Emitted event logs are recorded to the blockchain. There are various uses for events, such as inter-contract communication or basic data storage, but we will use them as a means to trigger jobs. These events can carry some data along with them, like addresses, numbers, and strings.

The system will watch for Civil-specific events. An example of a Civil event is when a Newsroom applies to Civil’s Token Curated Registry (TCR). When the code to apply to the TCR completes, an _Application event is emitted by the TCR contract. The _Application event contains data on the newsroom address, application deposit, applicant address, and application end date. We capture this event to create or update the target newsroom.

First we will use the abigen command from the ethereum/go-ethereum project to generate code wrappers around our Civil Solidity smart contracts. This allows us to access the calls and events defined in the contracts via Go. These code wrappers contain methods that wrap around each event type in the contract. The Filterer<EventName> methods in the wrapper retrieve existing event logs for a particular event. Then, the Watch<EventName> methods in the wrapper start up a process that watches for new event logs for a particular event.
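
To make that flow concrete, here is a rough sketch of how the generated bindings get used. The node URL, the import path, and the CivilTCRContract/FilterApplication/WatchApplication names are illustrative assumptions; abigen derives the actual names from the contract and event definitions, so they may differ in the real crawler.

```go
// Generating the bindings (illustrative file names):
//   abigen --abi CivilTCR.abi --pkg contract --type CivilTCRContract --out civil_tcr.go

package main

import (
	"context"
	"log"

	"github.com/ethereum/go-ethereum/accounts/abi/bind"
	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/ethclient"

	"github.com/example/civil-crawler/generated/contract" // hypothetical path to the abigen output
)

func main() {
	client, err := ethclient.Dial("wss://ethereum-node.example:8546") // node URL is an assumption
	if err != nil {
		log.Fatal(err)
	}

	// The generated Filterer wraps both the Filter<EventName> and
	// Watch<EventName> methods for the contract's events.
	filterer, err := contract.NewCivilTCRContractFilterer(
		common.HexToAddress("0x0000000000000000000000000000000000000000"), client)
	if err != nil {
		log.Fatal(err)
	}

	// Backfill: retrieve existing _Application event logs from a starting block.
	// If the event has indexed parameters, this call takes extra filter arguments.
	iter, err := filterer.FilterApplication(&bind.FilterOpts{Start: 0})
	if err != nil {
		log.Fatal(err)
	}
	defer iter.Close()
	for iter.Next() {
		log.Printf("past application: %+v", iter.Event)
	}

	// Watch: subscribe to new _Application event logs as they are emitted.
	sink := make(chan *contract.CivilTCRContractApplication)
	sub, err := filterer.WatchApplication(&bind.WatchOpts{Context: context.Background()}, sink)
	if err != nil {
		log.Fatal(err)
	}
	defer sub.Unsubscribe()

	for ev := range sink {
		log.Printf("new application: %+v", ev)
	}
}
```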

D2. Closeup of the crawler component

Given that each event type has its own wrapper for both Watch and Filterer, we wrote a code generator that takes all the events in each contract type and creates a multiplexer from the many specific event types to a single type called Event. Event encapsulates all the data from the specific event type plus any crawler-specific metadata.

Normalizing to Event allows the consumer to handle a single event type for any contract rather than manage multiple types with different parameters. Persisting to a single table, collection, or queue is easier with a single type.
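
A minimal sketch of what such a normalized type could look like; the field names here are assumptions rather than the crawler’s actual definition.

```go
package crawler

import "time"

// Event is a hypothetical flattened representation of any contract event,
// carrying the decoded payload plus crawler-specific metadata.
type Event struct {
	EventType       string                 // e.g. "_Application", "RevisionUpdated"
	ContractName    string                 // which contract type emitted it (TCR, Newsroom, ...)
	ContractAddress string                 // address of the emitting contract
	Payload         map[string]interface{} // decoded event parameters, keyed by name
	BlockNumber     uint64                 // block in which the log was recorded
	TxHash          string                 // transaction that emitted the log
	LogIndex        uint                   // position of the log within the block
	Timestamp       time.Time              // when the crawler captured the event
}
```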

The code generator generates the Filterers and Watchers as separate modules in the crawler component. Since each event is handled individually in its own goroutine, the set of events to track can be easily controlled via configuration.

When the crawler component starts, the Filterers first retrieve and persist existing events, then the Watchers start up to listen for new incoming events. As events come in on the Watchers, they are also persisted. The last block number seen by both modules is recorded. On failure or the next startup, the Filterers use this last block number to determine where to resume retrieving events.
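
Sketched in Go, under the assumption of the Event type above and hypothetical Filterer/Watcher interfaces, the startup order looks roughly like this:

```go
package crawler

import "context"

// Hypothetical stand-ins for the generated modules and a minimal store; the
// real crawler's names and signatures may differ.
type filterer interface {
	// CollectEvents retrieves past events starting at the given block number.
	CollectEvents(ctx context.Context, fromBlock uint64) ([]*Event, error)
}

type watcher interface {
	// StartWatchers spins up one goroutine per watched event and pushes
	// normalized Events onto the sink channel.
	StartWatchers(ctx context.Context, sink chan<- *Event) error
}

type store interface {
	LastBlockNumber(ctx context.Context) (uint64, error)
	SaveEvents(ctx context.Context, events []*Event) error
}

// startCollection backfills via the Filterers from the last persisted block,
// then hands off to the Watchers and persists live events as they arrive.
func startCollection(ctx context.Context, s store, f filterer, w watcher) error {
	lastBlock, err := s.LastBlockNumber(ctx)
	if err != nil {
		return err
	}

	// 1. Backfill any events emitted while the crawler was down.
	past, err := f.CollectEvents(ctx, lastBlock)
	if err != nil {
		return err
	}
	if err := s.SaveEvents(ctx, past); err != nil {
		return err
	}

	// 2. Switch to watching for new events.
	sink := make(chan *Event)
	if err := w.StartWatchers(ctx, sink); err != nil {
		return err
	}
	for ev := range sink {
		if err := s.SaveEvents(ctx, []*Event{ev}); err != nil {
			return err
		}
	}
	return nil
}
```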

Existing Event data is stored to a persistence layer abstracted by a set of Go interfaces. This means we can implement these interfaces any way we wish: store to memory, write to a database, push onto a pipeline (e.g. NSQ, SQS, RabbitMQ, Kafka), or any combination of these. Our reference implementation is backed by PostgreSQL, which is our v1 store, but as outlined later in this post, we are looking to migrate to a different implementation in the future.
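
As a rough illustration (the method names are assumptions, not the repository’s actual interface), the abstraction might look like the following, with the PostgreSQL-backed implementation being just one of several possible backends:

```go
package crawler

import "context"

// EventPersister is a hypothetical, fuller version of the minimal store used
// in the startup sketch above. Anything that satisfies it (an in-memory map,
// a PostgreSQL table, or a producer that pushes onto a queue) can be plugged
// into the crawler without changing the collection code.
type EventPersister interface {
	// SaveEvents stores a batch of normalized Events.
	SaveEvents(ctx context.Context, events []*Event) error
	// RetrieveEvents returns up to count Events starting at offset.
	RetrieveEvents(ctx context.Context, offset, count int) ([]*Event, error)
	// LastBlockNumber returns the highest block seen for an event type on a
	// contract, so the Filterers know where to resume after a restart.
	LastBlockNumber(ctx context.Context, eventType, contractAddress string) (uint64, error)
}
```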

Massaging the Events

Once the Events are persisted and available for retrieval, we have a record of all the events that have been emitted by our smart contracts. That is valuable in itself, but now we can do some work on those events to update the state of our newsrooms and content. One of the purposes of this system is to allow us at Civil to easily retrieve data about the state of our newsrooms and their published content.

The state of the newsroom reflects where the newsroom is in the Civil community’s governance process. Is the newsroom whitelisted? Is it being challenged? Is it waiting for appeal? Content published by a newsroom includes new published content as well as revisions to existing content.

The events processor handles this work. When a governance-specific event comes in for a listing, the processor determines whether the newsroom needs to be created or updated based on the current state of the newsroom and the event.

For example, if a listing is in a challenged state, the next event for the listing can be ChallengeFailed or ChallengeSucceeded. If it is either of those events and the newsroom is in a Challenged state, the newsroom is updated to the state defined in the event. There is also logic to ensure that the events are handled in the correct order.
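
A minimal sketch of that transition logic, with hypothetical state names and without the ordering checks mentioned above:

```go
package processor

import "fmt"

// ListingState is a hypothetical enumeration of where a newsroom listing sits
// in the governance process; the contracts define the authoritative states.
type ListingState int

const (
	StateApplied ListingState = iota
	StateChallenged
	StateWhitelisted
	StateRemoved
)

// applyChallengeResult advances a listing's state when a challenge resolves.
func applyChallengeResult(current ListingState, eventType string) (ListingState, error) {
	if current != StateChallenged {
		return current, fmt.Errorf("unexpected %s while in state %d", eventType, current)
	}
	switch eventType {
	case "ChallengeSucceeded":
		// The challenge against the newsroom succeeded: it leaves the registry.
		return StateRemoved, nil
	case "ChallengeFailed":
		// The challenge failed: the newsroom remains on (or joins) the whitelist.
		return StateWhitelisted, nil
	default:
		return current, fmt.Errorf("unhandled event type %s", eventType)
	}
}
```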

Handling published content events differs in that we persist all the revisions for each content item published by a newsroom. This allows us to retain a history of each content item and aligns with how newsroom contracts handle and store revisions.

Whenever a content item is revised or published anew, a RevisionUpdated event is emitted by the newsroom contract. Each of these revisions is added to the content persistence and is referenced by the contract address. This allows us to pull out all the content for a newsroom as well as retrieve the revisions to every one of those content items.
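
A sketch of how those revisions might be modeled and stored; the field and method names are assumptions, not the crawler’s actual schema:

```go
package processor

import "time"

// ContentRevision is a hypothetical record of one revision of one content item.
type ContentRevision struct {
	NewsroomAddress string    // newsroom contract that emitted RevisionUpdated
	ContentID       uint64    // identifier of the content item within the newsroom
	RevisionID      uint64    // monotonically increasing revision number
	RevisionURI     string    // where the revision payload can be retrieved
	Timestamp       time.Time // block time of the revision
}

// ContentPersister keeps every revision rather than only the latest one, so a
// full history can be reconstructed per newsroom and per content item.
type ContentPersister interface {
	SaveRevision(rev *ContentRevision) error
	RevisionsForNewsroom(newsroomAddress string) ([]*ContentRevision, error)
	RevisionsForContent(newsroomAddress string, contentID uint64) ([]*ContentRevision, error)
}
```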

Currently, the events processor runs periodically via a cron job to process new events and store the results to the processed data persistence. The time between runs aligns with how often new blocks are created in Ethereum, as that is when events are emitted. There may be runs where there are no new events, or runs that just miss a new block containing events. This can be remedied by replacing the cron with a producer/consumer model in a later iteration of the system. By consuming events, the process will only be triggered when an actual event comes in.
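
In Go, the periodic trigger amounts to little more than the loop below; the processNew callback is a hypothetical hook into the processor, and the interval would be chosen to roughly match Ethereum block times.

```go
package processor

import (
	"context"
	"log"
	"time"
)

// runPeriodically is a sketch of the cron-style trigger: on every tick, fetch
// any events persisted since the last run and hand them to the processor.
func runPeriodically(ctx context.Context, interval time.Duration, processNew func(context.Context) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()

	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := processNew(ctx); err != nil {
				log.Printf("processor run failed: %v", err)
			}
		}
	}
}
```

Replacing this loop with a consumer reading from a queue removes the polling (and the missed-block window) entirely, which is the producer/consumer change discussed later in this post.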

APIs

Once the processed data is persisted, the data can then be accessed by clients via a web standard REST/GraphQL API.

The APIs can service internal applications like the Civil governance Dapp and a suite of analytics tools, and they can also be useful for those outside of Civil who want to build on top of Civil data. The public API endpoints and “how to access” information will be documented in a future post.

In further iterations of the system, the plan is to provide endpoints to allow real-time streaming of these events to consumers.

Things to Figure Out

As we are working on this, we have a few things in the “let’s run this thing and figure these out later” bucket.

  1. Handling blockchain re-organizations gracefully. Right now, we look at any emitted events without checking whether they were later forked out of the canonical chain. Some systems have a “waiting period” where events are only considered final after X blocks have passed (the number of confirmations) since the event appeared. This, or something similar, can be added if we see noticeable anomalies in the data; a minimal sketch of such a check follows this list.
  2. Handling failure of the watcher connections (websocket connections), where we should check whether any new blocks were mined since the failure. There is currently reconnect logic, but no call to pull missed events via the Filterers yet. We need to test the connection on live systems to determine whether reconnect times are within tolerance.
  3. This is a broader problem for the Civil ecosystem, but if gas costs increase, we need to figure out ways to mitigate the cost of maintaining state in this system. This primarily affects newsroom content, as newsrooms currently decide whether they want to publish content to their smart contract. If gas costs are high, they may choose not to do so, limiting the data the crawler receives.
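
Here is the minimal confirmation-depth sketch referenced in item 1; the confirmation threshold itself is an assumption that would be tuned against observed re-orgs.

```go
package crawler

import (
	"context"

	"github.com/ethereum/go-ethereum/ethclient"
)

// isFinal reports whether an event's block is buried under at least
// `confirmations` newer blocks and can therefore be treated as final.
func isFinal(ctx context.Context, client *ethclient.Client, eventBlock, confirmations uint64) (bool, error) {
	head, err := client.HeaderByNumber(ctx, nil) // nil asks for the latest header
	if err != nil {
		return false, err
	}
	return head.Number.Uint64() >= eventBlock+confirmations, nil
}
```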

Up Next

So now that we have the foundation of capturing, processing and serving a specific set of Civil events, what are we thinking about for the next iteration or two of the crawler?

Generic Crawler Library

As mentioned in the introduction, we want to extract a generic library that lets any developer easily create a crawler from their smart contracts. Currently, the code is somewhat coupled to the Civil smart contracts, but with a little work it can be refactored into a separate library that is easy to use with smart contracts outside of Civil. This could be a great tool for smart contract developers to capture and process contract data needed for their own Dapps.

Opening Up the Public API

We plan to open up the Civil smart contracts API to the public for the community to build and hack on. It will take some work to set up the public authentication along with access key distribution.

Queues / Streaming

As mentioned above, we can switch from persisting events into a database to a publisher/consumer model, where we can queue up jobs for each incoming event with workers that process and persist the results of the event.

Making this update benefits us in a few ways. Firstly, this will allow us to scale up each layer according to any increasing capacity needs. Secondly, we can decouple the layers and perhaps open up per-layer services depending on where we go in the future. Lastly, alongside the APIs, we can create streams of processed data to which public and internal consumers can subscribe.
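
As a rough shape of that change, here is the consumer side sketched with plain Go channels as the queue, reusing the hypothetical Event type from the crawler sketches above; a real deployment would more likely sit behind NSQ, SQS, RabbitMQ, or Kafka, as noted earlier.

```go
package crawler

import (
	"context"
	"log"
	"sync"
)

// consumeEvents sketches the consumer side: watchers act as producers pushing
// Events onto the queue channel, while a pool of workers processes and
// persists each one as it arrives, instead of polling on a schedule.
func consumeEvents(ctx context.Context, queue <-chan *Event, workers int, process func(*Event) error) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				select {
				case <-ctx.Done():
					return
				case ev, ok := <-queue:
					if !ok {
						return
					}
					if err := process(ev); err != nil {
						log.Printf("failed to process event: %v", err)
					}
				}
			}
		}()
	}
	wg.Wait()
}
```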

Beyond the Horizon

There are a few long term goals for the project, but we will touch on one of them here.

Civil Archiver

This would be an extension of the crawler to watch for valid content published by newsrooms and “pin” (or make permanent) the data into a node for a decentralized storage system like IPFS or Swarm. The idea is to ensure that Civil newsroom content will be decentralized and always available. Ideally, multiple parties besides Civil will run the archiver to ensure both that the content is available and to add additional storage nodes to the network.

When the data is pinned, we can build a layer to view this decentralized data via an HTTP gateway. This can be really valuable to centralized publications that want to ensure their content remains available to the public no matter what happens to the publication itself. If a centralized publication no longer exists, its content can still be viewed via the archiver from the decentralized storage network.
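
As a rough sketch of the pinning step, assuming an IPFS backend and the community go-ipfs-api client (Swarm or another decentralized store could back the same idea):

```go
package archiver

import (
	"strings"

	shell "github.com/ipfs/go-ipfs-api"
)

// archiveRevision adds one content revision to a local IPFS node and pins it
// so the node keeps the data instead of garbage-collecting it. The node
// address is an assumption.
func archiveRevision(body string) (string, error) {
	sh := shell.NewShell("localhost:5001") // local IPFS API endpoint

	// Add the content and get back its content identifier (hash).
	cid, err := sh.Add(strings.NewReader(body))
	if err != nil {
		return "", err
	}

	// Pin it explicitly so it persists on this node.
	if err := sh.Pin(cid); err != nil {
		return "", err
	}
	return cid, nil
}
```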

How You Can Help

We are going to open source some of this code, mainly the crawler component for filtering and watching for events. The other components may be opened up at a later date.

We are actively working on the code, but feel free to read it, hack on it, or use it. There are a few items in this post that could use some help. Please submit any feedback, issues, or contributions at our GitHub repository.

If you are interested in our smart contracts, Dapp, and other tools we have built, check out our main monorepo.

You can also reach out to us on Telegram, or send an email to developers@civil.co.

VP, Engineering at Civil (http://civil.co), a decentralized marketplace for sustainable journalism. Digg, NYTimes alumni. Based in NYC.