Designing a Data Warehouse for the Enterprise Graph

Sarah Edge Mann · Published in Inside Rimeto · Aug 5, 2019

At Rimeto, our mission is to help organizations be more effective through better collaboration and knowledge sharing. The company was founded in late 2016 and today our customers range from large educational institutions to Fortune 500 companies. Accordingly, the expectations around reliability and customer service are high. In this blog post, we’ll discuss how we set up a data warehouse and analytics system that helps us meet those demands, how we made decisions about that tech stack, and what we learned along the way about the benefits of a data warehouse for an early stage enterprise startup.

Use Cases and Goals

When we started this project, we had a few key use cases in mind that we hoped our new data system would facilitate:

  1. Customer reporting: Help our customers understand their return-on-investment from Rimeto and meet contractual reporting obligations. In the past, this required significant time investment from engineers and customer representatives.
  2. Product analytics: Help our product team understand how customers use our product and ground product decisions in data versus a “hunch”.
  3. System analysis: Help our engineering team understand and improve our systems and respond to issues more quickly.

Each of these use cases was compelling independently, and together they made a strong case for prioritizing investment in our data infrastructure. Importantly, these use cases all serve different audiences within Rimeto with varying levels of technical ability. This means that logging into the production database to dash off a SQL query is not an option. Neither is reading code to understand what a particular piece of data means. So in addition to serving the use cases above, we also require that our data is:

  • Well-Organized: The data must be discoverable even without understanding the technical details of the underlying systems.
  • Accessible: The data must be available via a straightforward, easy to learn tool that doesn’t require an engineering background to use.
  • Reliable: The data must be correct and have clear SLAs for availability.

Technical Requirements

Given our use cases, goals, and existing technology stack, we compiled a list of technical requirements for our data warehouse and analytics system.

Secure By Design

Our number one technical requirement was to follow best-in-class, standards-based practices to ensure the security of our data warehouse architecture. Security has been baked into Rimeto’s technology stack from the very beginning, and so any data warehouse solution needed to meet our high standards.

Unified Home for All Our Data

Second, we required a unified home for all of our data where it is easy to access and join together various data elements across several different subsystems. At a high level, there are two main sources of our data: production data and event logs.

Production Data: These data sets contain all of the data that powers the product experience. A previous post discusses in depth how we are building the enterprise graph at Rimeto but, in short, we model all of our production data as a graph, which we represent using only two tables (nodes and edges) and shard across many database instances to support our scale. We rely heavily on a JSON properties field in each table to store arbitrary information germane to each node or edge type. This imposed a strict constraint on our data warehouse: we needed a system that works well with semi-structured data (in particular, JSON blobs) of arbitrary size. Our flexible schema and distributed architecture have historically made certain kinds of queries challenging to perform, so a major goal of our data warehouse was to simplify these analyses.
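To make that model concrete, here is a rough sketch of what a row in each of the two tables might look like. The field names and property keys are hypothetical, chosen only to illustrate the shape of the data, and are not our actual schema.

```python
import json

# Illustrative only: the column names and property keys below are hypothetical.
# Every entity lives in a single "nodes" table, with type-specific attributes
# packed into a JSON properties blob.
example_node = {
    "id": "person:1234",
    "type": "person",
    "properties": json.dumps({
        "full_name": "Jane Doe",
        "title": "Account Executive",
        "office": "San Francisco",
    }),
}

# Relationships live in a single "edges" table keyed by the two node ids,
# again with a JSON properties blob for edge-specific details.
example_edge = {
    "source_id": "person:1234",
    "target_id": "team:42",
    "type": "member_of",
    "properties": json.dumps({"since": "2018-03-01"}),
}
```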

Event Logs: We log lots of data from our many production systems. It’s incredibly powerful to be able to connect our log data to our production data. This enables important analytics for our customers (“When we started using Rimeto, how many people in each of our offices adopted it in the first week?”) as well as an enhanced ability for us to understand and improve the reliability and performance of our product.

Cost

Finally, we wanted to minimize cost. For us, cost is not just about dollars and cents: we also wanted to keep development and maintenance costs low and ease of use high. The system should be easy to set up and maintain so that we can keep most of our engineering effort focused on our core product, and non-engineers should be able to query our data sets and get answers to their data questions in seconds.

Final Architecture

With these requirements and use cases in mind, we were ready to make hard decisions about how to implement our data warehouse and analytics systems. There are many offerings in this space, and we spent a few weeks demoing and comparing products. Here’s the tech stack we chose, which has been working well for us for the past six months:

Data Warehouse (Storage + Processing)

Most importantly, we needed a physical storage space for our data and a processing engine that works well with that storage. We chose Snowflake as a unified data storage and data processing solution. Because Snowflake is fully managed, we knew that we wouldn’t need to spend time configuring or monitoring compute resources and that it would easily scale with us as our data sets and demand for interactive analytics grew. In addition, Snowflake has good native support for semi-structured data: the syntax for working with JSON data is easy and intuitive, and it gracefully handles arbitrarily large JSON blobs. Snowflake’s query engine is fast, enabling real-time analytics on our data sets. Snowflake also meets our stringent security requirements.
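As an illustration of how natural the semi-structured support feels, the sketch below uses the snowflake-connector-python package and Snowflake’s colon-path syntax to pull individual fields out of a JSON properties column. The connection parameters, table name, and property names are placeholders, not our real configuration.

```python
import snowflake.connector

# Connection parameters are placeholders; in practice these would come from a
# secrets manager rather than being hard-coded.
conn = snowflake.connector.connect(
    account="YOUR_ACCOUNT",
    user="YOUR_USER",
    password="YOUR_PASSWORD",
    warehouse="ANALYTICS_WH",
    database="ANALYTICS",
    schema="PUBLIC",
)

# Snowflake lets you reach into a VARIANT (JSON) column with a colon path and
# cast the result, so semi-structured fields read almost like ordinary columns.
# The table and property names here are illustrative.
query = """
    SELECT id,
           properties:title::string  AS title,
           properties:office::string AS office
    FROM nodes
    WHERE type = 'person'
    LIMIT 10
"""

cur = conn.cursor()
try:
    cur.execute(query)
    for row in cur.fetchall():
        print(row)
finally:
    cur.close()
    conn.close()
```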

Orchestration

Having chosen a data warehouse, we needed a way to orchestrate data flow into that warehouse and the post-processing that makes the data easy to analyze. We chose Airflow, a scheduling tool designed around data processing needs and open sourced by Airbnb, as our scheduler. Airflow makes it easy to define and schedule DAGs (sets of related tasks) through Python scripts. Given that several members of our team had prior experience with Airflow, its active developer and user communities, and its status as a Top-Level Project at the Apache Software Foundation, we felt confident going in this direction. We contracted with Astronomer for additional Airflow operational tooling and support.
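For readers who haven’t used Airflow, a DAG is declared in a short Python module. The sketch below is a minimal, hypothetical example of the pattern (Airflow 1.10-style imports; the task names and callables are placeholders), not one of our production DAGs.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python_operator import PythonOperator

default_args = {
    "owner": "data-eng",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

def extract():
    # Placeholder: pull data from a source system.
    pass

def load():
    # Placeholder: push the extracted data into the warehouse.
    pass

with DAG(
    dag_id="example_daily_ingest",
    default_args=default_args,
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Downstream tasks run only after their upstream dependencies succeed.
    extract_task >> load_task
```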

To ingest production data, we wrote custom Airflow operators that query our production MySQL instances, stream the data through the Airflow server, and push it up to AWS S3 daily. We then use Snowflake’s secure data ingestion capabilities to pull that data from S3 into Snowflake. We took a similar approach to ingest our event logs: custom Airflow operators pull logs from multiple CloudWatch streams into S3, then securely push them to Snowflake throughout the day.
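A stripped-down sketch of the two halves of that flow is shown below. The bucket, stage, table, and file-format names are invented for illustration, and the real operators wrap this kind of logic with retries, templating, and credentials handling.

```python
import json

import boto3

def extract_shard_to_s3(mysql_cursor, shard_id: int, ds: str) -> None:
    """Dump one shard's nodes table to S3 as newline-delimited JSON.

    The cursor is any DB-API cursor connected to that shard; the bucket and
    key layout are hypothetical.
    """
    mysql_cursor.execute("SELECT id, type, properties FROM nodes")
    lines = [
        json.dumps({"id": r[0], "type": r[1], "properties": r[2], "shard_id": shard_id})
        for r in mysql_cursor.fetchall()
    ]
    boto3.client("s3").put_object(
        Bucket="example-ingest-bucket",
        Key=f"nodes/{ds}/shard_{shard_id}.json",
        Body="\n".join(lines).encode("utf-8"),
    )

def load_nodes(snowflake_cursor, ds: str) -> None:
    """Load the day's extracts into Snowflake from an external stage over S3."""
    # Stage, table, and file-format names are made up for illustration.
    snowflake_cursor.execute(f"""
        COPY INTO raw.nodes
        FROM @raw.s3_ingest_stage/nodes/{ds}/
        FILE_FORMAT = (TYPE = 'JSON')
    """)
```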

Data Visualization

Once we had data flowing, we naturally wanted an easy way to visualize it. We connected Mode Analytics to our Snowflake instance as our data visualization layer. This integration was simple and took under 30 minutes to set up. Mode allows us to issue SQL queries to Snowflake and turn the results of those queries into interactive, attractive graphs with a few mouse clicks. It also helps our team share common queries and collaborate on reports. We have a number of reports set up in Mode that visualize our core metrics, help us track our business and operations, and provide visibility into our engineering systems.

Keys to Success

As we were building this system, there were a few important principles we adhered to that we think are key to a successful implementation. We hope to expand on these further in upcoming blog posts.

Productionalize Your Data Warehouse

We believe that our data and analytics systems should be treated as first-class citizens and given the same level of monitoring and visibility as our production systems.

To ensure proper visibility, we wrote custom Airflow callbacks that push metrics about task success and failure to CloudWatch and configured alarms that notify our on-call when tasks fail or scheduling gets stuck. We also make extensive use of data quality checks to alert us if the data in our tables do not meet some baseline set of assumptions that should hold for that data.
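As a rough sketch of the alerting piece, the callback below pushes a failure metric to CloudWatch using boto3; a CloudWatch alarm on that metric then pages our on-call. The namespace, metric, and dimension names are illustrative rather than our exact configuration.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def task_failure_callback(context):
    """Push a failure metric to CloudWatch when an Airflow task fails.

    An alarm on this metric notifies on-call. Namespace, metric, and
    dimension names are illustrative.
    """
    task_instance = context["task_instance"]
    cloudwatch.put_metric_data(
        Namespace="DataWarehouse/Airflow",
        MetricData=[{
            "MetricName": "TaskFailure",
            "Value": 1.0,
            "Unit": "Count",
            "Dimensions": [
                {"Name": "dag_id", "Value": task_instance.dag_id},
                {"Name": "task_id", "Value": task_instance.task_id},
            ],
        }],
    )

# Attached through the DAG's default_args, e.g.:
# default_args = {"on_failure_callback": task_failure_callback, ...}
```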

For example, after we ingest our production data sets we run a data quality check that makes sure we imported data from each MySQL shard. This ensures there are no “silent” failures in our ingestion process. If that data quality check fails, it (1) fires an alert to our on-call system so an engineer can triage and fix the issue promptly and (2) blocks any downstream tasks that would read that data, so that we stop bad data from propagating as early in the flow as possible.
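A simplified version of that check might look like the following; the table, column, and shard count are hypothetical. Because the check raises on failure, the Airflow task fails, which both alerts on-call and keeps downstream tasks from running.

```python
EXPECTED_SHARD_COUNT = 16  # Hypothetical; the real number comes from configuration.

def check_all_shards_present(snowflake_cursor, ds: str) -> None:
    """Fail loudly if the day's ingest is missing data from any MySQL shard.

    Table and column names are illustrative. Raising here fails the Airflow
    task, which alerts on-call and blocks downstream consumers of the data.
    """
    snowflake_cursor.execute(
        "SELECT COUNT(DISTINCT shard_id) FROM raw.nodes WHERE ingest_date = %s",
        (ds,),
    )
    shard_count = snowflake_cursor.fetchone()[0]
    if shard_count != EXPECTED_SHARD_COUNT:
        raise ValueError(
            f"Expected data from {EXPECTED_SHARD_COUNT} shards on {ds}, "
            f"found {shard_count}"
        )
```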

The Schema Is the Product

Once data lands in Snowflake, we launch a number of jobs that reshape and standardize that data to make it easier to work with. Designing an accessible schema is perhaps the most under-appreciated component of setting up a data analytics system. Well-designed table schemas and curated metrics make the difference between answering common questions in 5 lines of easy SQL versus 50 lines of complex logic. Thinking carefully about common use cases for your data and designing tables that standardize metrics and simplify common queries is critical to creating a data warehouse that is usable and provides consistent answers to data questions. We invested early in thoughtfully designing core data sets that would be easy to work with and would standardize our metrics.
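As a hypothetical example of the kind of curated table we mean, the job below rolls event logs and production data up into a standardized daily-active-users table, so that a common question becomes a one-line query. All table and column names are illustrative.

```python
# Hypothetical example of a curated, analysis-friendly table built in a
# post-processing job: event logs joined to production data and rolled up so
# that "daily active users per customer" is a single simple query.
# All table and column names are illustrative.
BUILD_DAILY_ACTIVE_USERS_SQL = """
    CREATE OR REPLACE TABLE analytics.daily_active_users AS
    SELECT e.event_date,
           n.properties:customer::string AS customer,
           COUNT(DISTINCT e.user_id)     AS active_users
    FROM raw.events e
    JOIN raw.nodes  n ON n.id = e.user_id AND n.type = 'person'
    GROUP BY 1, 2
"""

def build_daily_active_users(snowflake_cursor) -> None:
    """Rebuild the curated metrics table after the day's raw data has landed."""
    snowflake_cursor.execute(BUILD_DAILY_ACTIVE_USERS_SQL)
```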

Payoff

We quickly started seeing dividends from this work and now view this as an essential tool for daily operation. Below are a few highlights.

Launch dashboards (Customer Reporting)

After we’ve launched a new customer, their anticipation quickly builds and everyone wants to know how users are engaging with Rimeto. In the past, all of the engagement and adoption data was buried deep within our system logs, which would require an engineer to query and analyze. Today, we’ve coupled our data warehouse solution with a data visualization platform that allows our business teams to not only create more customizable reports, but also save a significant amount of time. In addition, data visualizations have really resonated with our customers and have been a powerful way for us to focus everyone on the insight within the numbers.

Roadmap planning (Product Analytics)

As we set out to prioritize our work for Q2 of this year, we had a laundry list of features and improvements to the search experience. As a first step we improved instrumentation and then we were able to quickly analyze actual usage to corroborate our hunches around priority. As it turned out, our initial prioritization was over-weighted towards edge cases around filter usage reported by power users and was missing some more broadly impactful improvements to how we matched name variations. We were able to quickly pivot our roadmap to focus on building out these improvements.

Async job optimization (System Analysis)

As part of our core service, we run a number of “async” jobs that pull customer data into our enterprise graph and maintain the integrity of that graph. These jobs produce a heavy load on our MySQL instances and other infrastructure, so we are always looking for ways to make them more efficient. While we already had good visibility into the status of recently run jobs, we didn’t historically have an easy way to aggregate job data to understand long-term trends. Co-locating production data and logs allowed us to quickly do just that. We split our jobs into three primary types and plotted the cumulative compute time spent on each type daily, shown below. We quickly noticed that the run time of “Type B” jobs had been steadily growing for months and had reached 70% of our total compute time. We focused our efforts on making these jobs more performant and were able to cut their total compute time in half!

Cumulative compute time spent on async jobs at Rimeto by job type.
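The rollup behind that chart is straightforward once job logs and production data live side by side; a query along these lines (table and column names are hypothetical) is all it takes.

```python
# Hypothetical sketch of the aggregation behind the chart above: with job logs
# in the warehouse, the long-term trend is a single GROUP BY.
ASYNC_JOB_TREND_SQL = """
    SELECT run_date,
           job_type,
           SUM(compute_seconds) AS total_compute_seconds
    FROM analytics.async_job_runs
    GROUP BY run_date, job_type
    ORDER BY run_date
"""

def fetch_async_job_trend(snowflake_cursor):
    """Return daily compute time per job type for plotting."""
    snowflake_cursor.execute(ASYNC_JOB_TREND_SQL)
    return snowflake_cursor.fetchall()
```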

Conclusion

Building a data warehouse and analytics system is not always top-of-mind for startups. It’s a non-trivial undertaking in time and money but we have already found the investment paying dividends in terms of operational efficiency and product development. Furthermore, by building this system early in the life of our company we have ensured that we have one “source of truth” for data which helps create a common understanding of our business as the company grows. One of our company values is to “play the long game” and this is a good example where up front engineering investment has set us up for long term success.

We’re hiring and looking for talented engineers to help us build the world’s best workplace tools.
