At Alan, we make data a core piece of our decision processes.
Data is imported and stored in our data warehouse (Postgres) from many different sources: production data from our applications, third parties (intercom, trello, activity tracking, Hubspot, …), and data from a few tightly integrated actors from the health ecosystem.
Yet, gathering all this data into one place is not enough. Users of the data tend to run overlapping analyses separately. Work is duplicated, errors are more likely, and every user creates "local metrics" that are neither generalizable nor consistent with each other. With only raw data available, making data easily accessible turns into a nightmare.
Prepared data should be centralized
Raw data, even “clean”, is not ready for use. To make it actionable, it requires treatment, and this treatment needs to be harmonized:
- Denormalization: scattered in different tables and schemas, imported from different sources, the data needs to be joined together to be useful.
- Semantics: names need to be adapted to best practices, and key concepts defined elsewhere should be ready to use.
- Computed fields & aggregations: KPIs and statistics are built from raw data.
To this end, a layer that abstracts knowledge about data is needed. This layer has two missions: exposing a data API to the Alan team, and creating the data corresponding to this API. This layer should be readily usable by all data users, transparently defined, easy to maintain, and come with embedded documentation.
Abstracting knowledge in code
We decided to use database views to solve this problem. These views were initially a quick solution to the repeated need of JOINing tables and renaming fields. In a few months, they grew organically into an easily extensible framework that serves all the data needs at Alan. We call it the "Synced Views".
The principle is simple: we use SQL to define this data layer, in views directly stored in our data warehouse. The code defining these views is treated like any code produced by and for a team: using version control, reviews, and tests.
We built the synced_views module, which takes the code defining the views from a GitHub repository and "syncs" them to a remote database as a continuous integration step (using dear CircleCI). These views are the basis for all our analyses and applications down the road.
Views are defined as SQL queries in individual files. The name and schema of a view are determined by the file's name and path.
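As a sketch of that convention (the helper and the stage-suffix handling are our assumptions for illustration, not the actual module code):

```python
from pathlib import Path

def view_identity(sql_path):
    """Derive (schema, view_name) from a view file's path.

    Hypothetical layout: templates/<schema>/<view_name>.sql,
    where stage folders such as insurance_1 and insurance_2
    both map to the insurance schema.
    """
    path = Path(sql_path)
    schema = path.parent.name
    # Strip a numeric stage suffix so insurance_1 -> insurance.
    if "_" in schema and schema.rsplit("_", 1)[1].isdigit():
        schema = schema.rsplit("_", 1)[0]
    return schema, path.stem

# e.g. templates/core/hubspot_deals.sql -> ("core", "hubspot_deals")
```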
When a branch is merged to master on GitHub, our continuous integration provider automatically syncs the views to the remote data warehouse.
When a new commit is pushed to our GitHub repository, we trigger a "dry run" sync that tests the syntax of the views that have been changed or added since the last master commit.
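The dry run can be sketched as creating each view inside a transaction that is rolled back, so the SQL syntax is validated without persisting anything (a minimal illustration with SQLite standing in for Postgres; the real module targets the warehouse):

```python
import sqlite3

def dry_run_sync(conn, view_name, query):
    """Try to create the view, then roll back: the syntax is
    checked by the database, but nothing is persisted."""
    try:
        conn.execute("BEGIN")
        conn.execute(f"CREATE VIEW {view_name} AS {query}")
        return True
    except sqlite3.OperationalError:
        return False
    finally:
        conn.rollback()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deals (id INTEGER, amount REAL)")
ok = dry_run_sync(conn, "big_deals", "SELECT id FROM deals WHERE amount > 100")
bad = dry_run_sync(conn, "broken", "SELEC id FROM deals")  # typo in SQL
```

After either call, no view exists in the database: only the syntax check remains.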
We identified five key areas of application for the data at Alan: Growth, Product, Insurance, Operations, and Business Metrics (more on them in a later post). Each has its own "application" database schema, which hosts the views for that area.
To support dependency between views in a clean way, we have the following rules:
- No view in a schema can depend on a view from another schema.
- An additional schema, named core, gathers views that can be used in any application schema.
- Views within a single schema depend on each other following a model of computation stages. For instance, views stored in the folder insurance_1 are synced before views in the folder insurance_2, so that the latter can depend on the former. They all end up in the same insurance schema.
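The stage ordering can be sketched as grouping view folders by schema and sorting them on their numeric suffix (the folder names and helper are illustrative, not the actual module):

```python
import re
from collections import defaultdict

def sync_order(folders):
    """Group stage folders by schema and order them by stage number,
    so that e.g. insurance_1 is synced before insurance_2."""
    stages = defaultdict(list)
    for folder in folders:
        match = re.fullmatch(r"(.+)_(\d+)", folder)
        if match:
            schema, stage = match.group(1), int(match.group(2))
        else:
            schema, stage = folder, 0  # unstaged folder: single stage
        stages[schema].append((stage, folder))
    return {schema: [f for _, f in sorted(items)]
            for schema, items in stages.items()}

order = sync_order(["insurance_2", "insurance_1", "core"])
# order["insurance"] lists insurance_1 before insurance_2
```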
These are the building blocks of the framework, but we expect it to evolve as our needs expand.
Example & Code
To reward the readers who have made it this far (thank you!), we share a few snippets from our current synced views module for illustration.
We use Hubspot as the CRM that powers our amazing sales team. The data from Hubspot is imported from their APIs in raw form by ETLs managed in Airflow, before being prepared by the synced views. The file templates/core/hubspot_deals.sql defines the synced view that presents the information about the deals managed by our sales team. Here is an excerpt of the file that demonstrates typical usage of the synced views:
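As a purely hypothetical sketch of such a view file (every table and column name below is invented for illustration; the actual file differs), it would contain just the query, combining denormalization, renaming, and computed fields:

```sql
-- Hypothetical sketch of a synced view file
-- (table and column names invented for illustration).
SELECT
    deal.deal_id,
    deal.dealname AS deal_name,     -- rename to our naming conventions
    owner.email AS owner_email,     -- denormalize: join in owner info
    deal.amount::numeric AS deal_amount
FROM hubspot.deals AS deal
LEFT JOIN hubspot.owners AS owner
    ON owner.owner_id = deal.hubspot_owner_id
```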
The file above directly defines the query that will be synced as a view. We use Jinja templating to do the heavy lifting in a simple way. Here is a Jinja template that syncs the views:
The main module renders it with code that looks like:
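As a minimal sketch of that mechanism (using Python's built-in string formatting in place of Jinja for brevity, with invented names), the template and its rendering could look like:

```python
# Stand-in for the Jinja template that wraps a view's query.
# The real module uses Jinja; names and SQL here are illustrative.
SYNC_TEMPLATE = """
DROP VIEW IF EXISTS {schema}.{view_name} CASCADE;
CREATE VIEW {schema}.{view_name} AS
{query}
"""

def render_sync(schema, view_name, query):
    """Render the SQL statement that (re)creates one synced view."""
    return SYNC_TEMPLATE.format(schema=schema, view_name=view_name, query=query)

statement = render_sync("core", "hubspot_deals",
                        "SELECT deal_id FROM hubspot.deals")
```

The rendered statement is then executed against the warehouse for each view file, in stage order.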
Key Features & Next steps
Here are the key features of the running version of our synced_views module:
- Dependent syncing: The ability to declare dependencies between the views, through stages (discussed above).
- Historized views: We didn't talk about it here, but we support a simple syntax in the definition of a view that allows us to retrieve, as a function, a version of the view that emulates past data from current data.
- Selective testing: The ability to not test all the views all the time, but only the ones that have been modified in the sister branch when comparing to master.
- Materialization: All views are now materialized and refreshed every day for better performance. A user can still access the real-time version of a view by adding __rt as a suffix to the view name.
- Missing-only: Detects and syncs again only the views that are missing. This is useful when a dropped table or view has removed synced views that depend on it.
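The materialization scheme can be sketched as follows: the view under the __rt suffix stays real-time, while the bare name points to a table rebuilt daily (SQLite stands in for Postgres here, and the refresh helper is our illustration, not the actual module):

```python
import sqlite3

def refresh_materialized(conn, view_name):
    """Rebuild the materialized table <view_name> from the
    real-time view <view_name>__rt."""
    conn.execute(f"DROP TABLE IF EXISTS {view_name}")
    conn.execute(f"CREATE TABLE {view_name} AS SELECT * FROM {view_name}__rt")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE deals (amount REAL)")
conn.execute("INSERT INTO deals VALUES (10.0), (20.0)")
# The real-time view keeps the __rt suffix.
conn.execute("CREATE VIEW deal_totals__rt AS "
             "SELECT SUM(amount) AS total FROM deals")
refresh_materialized(conn, "deal_totals")
```

Until the next refresh, deal_totals serves the faster snapshot while deal_totals__rt reflects new rows immediately.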
On the roadmap:
- Versioned syncing on master: Currently, when merging to master, all the views are re-synced. This process is long and wasteful. We could leverage version control (git) to detect which view definitions have changed since the last successful sync, by storing the corresponding commit hash in the target database.
- Dependent testing: Testing currently runs against the views already synced to the database. That prevents us from testing changes across several dependent views (since testing doesn't actually persist the views).
- Embedded documentation: This is one of the key requirements for the framework to truly fulfill its mission as a global data API for any user in the company. We have not yet tackled this point, but the vision is to collocate the documentation and the code in the files that define the views.
- Open sourcing: The syncing module could easily be reused by teams that don't have a solution to prepare and share their data seamlessly, and would like to buy into this framework. However, we believe the project is still missing a few key features (above) before we start the open-sourcing discussion.
Beyond the roadmap of the module itself, the data community is working hard to make data usable and used by the Alan team. This goes beyond the data layer that creates and publishes this data, and requires appropriate tools and training — and we’re far from done!
Taking a step back, the framework of synced views allowed us to quickly build on our existing infrastructure to serve the data needs of Alan. Yet, it ends up converging with a more traditional ETL framework that would build summary tables in succession, using a tool like Airflow to orchestrate them.
Compared to a classic ETL tool, the current framework shows one main limitation: the inability to have arbitrary code create the data. This framework is designed as a step to prepare data that already exists; we actually use Airflow to create the data upstream.
We believe our synced views framework has a few benefits that make it worth continuing to invest in:
- Simplicity & Maintainability: No DAG to explicitly manage, no python code to write. Adding a new view is as simple as creating a new SQL file with a query written in it.
- Declarativeness: The structure doesn’t reflect “how” it’s done, but “what” is done. Anyone, regardless of technical ability, can easily find the definition of a given view and field.
- Light & Portable: The module responsible for operating the sync is only a few hundred lines of code, and doesn't require knowledge of heavy libraries (like Airflow or pandas). All the computation is done in the database.
- Standard: Any prepared data is in a view created with the same tool, the same way. We can focus on data best practices (cardinality, naming, concept definition, documentation), not engineering ones (architecture of the code, pipeline management, migrations).
- Dependency management: Views have dependency management embedded. This allows us to easily know what downstream views should be recomputed without explicitly maintaining the dependency graph.
- Realtime: A user can always use the non-materialized views to access the most recent version of the prepared data. For instance, this is super useful today for the dashboards of our sales community: they want to see their names on the wall quickly!
If you’ve read that far, you’re passionate about enabling a team with data they can use. We are too! We believe it’s core to a company’s success, that’s why we’re hiring to develop our data community. Drop us a line at email@example.com