Alan’s solution to sharing data knowledge: “Synced Views”

Antoine Lizée
Sep 25, 2019

At Alan, we make data a core piece of our decision processes.

Data is imported and stored in our data warehouse (Postgres) from many different sources: production data from our applications, third parties (Intercom, Trello, activity tracking, HubSpot, …), and data from a few tightly integrated actors in the health ecosystem.

Yet, gathering all this data into one place is not enough. Users of the data tend to run overlapping analyses separately. Work is duplicated, errors are more likely, and every user creates “local metrics” that are neither generalizable nor consistent with each other. When only raw data is available, making data easily accessible turns into a nightmare.

[Figure: schema of our main data architecture, with an accurate depiction of the value created when raw data is used directly.]

Prepared data should be centralized

Preparing the data involves three kinds of operations:

  • Denormalization: data scattered across different tables and schemas, imported from different sources, needs to be joined together to be useful.
  • Semantics: names need to be adapted to best practices, and key concepts defined elsewhere should be ready to use.
  • Computed fields & aggregations: KPIs and statistics are built from the raw data.

To this end, we need a layer that abstracts knowledge about the data. This layer has two missions: exposing a data API to the Alan team, and creating the data behind that API. It should be readily usable by all data users, transparently defined, easy to maintain, and come with embedded documentation.

Abstracting knowledge in code

The principle is simple: we use SQL to define this data layer, in views stored directly in our data warehouse. The code defining these views is treated like any code produced by and for a team: version controlled, reviewed, and tested.

We built the synced_views module, which takes the code defining the views from a GitHub repository and “syncs” it to a remote database as a continuous integration step (using our dear CircleCI). These views are the basis for all our analyses and applications down the road.

Description

When you treat SQL like any other production code, you can leverage the existing ecosystems of testing, review, and continuous integration to create reliable processes and long-term value.

Views are defined as SQL queries in individual files. The name and schema of a view are determined by the file’s name and path: synced_views/templates/<schema_name>/<view_name>.sql

When a branch is merged to master on GitHub, our continuous integration provider automatically syncs the views to the remote data warehouse.

When a new commit is pushed to our GitHub repository, we trigger a “dry run” sync that tests the syntax of the views that have been changed or added since the last commit on master.
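As an illustration, this setup could be wired together with a CircleCI configuration along these lines. This is only a sketch: the job names and the synced_views command-line entry point with its --dry-run flag are assumptions, not our actual configuration.

```yaml
# .circleci/config.yml (illustrative sketch only)
version: 2
jobs:
  dry_run_sync:
    docker:
      - image: circleci/python:3.7
    steps:
      - checkout
      # Hypothetical entry point: test the changed views without persisting them.
      - run: python -m synced_views.sync --dry-run
  sync_views:
    docker:
      - image: circleci/python:3.7
    steps:
      - checkout
      # Hypothetical entry point: create or replace the views in the warehouse.
      - run: python -m synced_views.sync
workflows:
  version: 2
  sync:
    jobs:
      - dry_run_sync        # runs on every push
      - sync_views:
          filters:
            branches:
              only: master  # the real sync happens only after merging to master
```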

Structure

To support dependency between views in a clean way, we have the following rules:

  • No view in a schema can depend on a view from another schema.
  • An additional schema named core gathers views that can be used in any application schema (the one exception to the rule above).
  • Views within a single schema depend on each other following a model of computation stages. For instance, views stored in the folder insurance_1 are synced before views in the folder insurance_2, so that the latter can depend on the former. They all end up in the insurance schema.
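Concretely, a repository following these rules could be laid out as follows (the insurance file names are made up for illustration):

```
synced_views/templates/
├── core/                  # views usable from any application schema
│   └── hubspot_deals.sql
├── insurance_1/           # stage 1: synced first, lands in the "insurance" schema
│   └── contracts.sql
└── insurance_2/           # stage 2: synced after, may depend on stage-1 views
    └── contract_kpis.sql
```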

These are the building blocks of the framework, but we expect it to evolve as our needs expand.

Example & Code

We use HubSpot as the CRM that powers our amazing sales team. The data from HubSpot is imported from their APIs in raw form by ETLs managed on Airflow, before being prepared by the synced views. The file templates/core/hubspot_deals.sql defines the synced view that presents the information about the deals managed by our sales team, and demonstrates typical usage of the synced views.
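A minimal sketch of what such a file could contain (the raw_hubspot table and column names are hypothetical, not the actual HubSpot schema):

```sql
-- templates/core/hubspot_deals.sql (illustrative sketch only)
-- Joins the raw HubSpot tables into one ready-to-use view of deals.
SELECT
    deal.deal_id,
    deal.created_at::date                 AS deal_created_date,  -- normalized naming
    stage.label                           AS deal_stage,
    owner.email                           AS owner_email,
    (deal.amount / 100.0)::numeric(12,2)  AS amount_eur          -- computed field
FROM raw_hubspot.deals AS deal
LEFT JOIN raw_hubspot.deal_stages AS stage
    ON stage.stage_id = deal.stage_id
LEFT JOIN raw_hubspot.owners AS owner
    ON owner.owner_id = deal.owner_id
WHERE NOT deal.is_deleted
```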

The file above directly defines the query that will be synced as a view. We use Jinja templating to do the heavy lifting in a simple way. Here is a Jinja template that syncs the views:
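A plausible sketch of that template, assuming schema, view_name and query are the variables passed in:

```sql
-- sync_view.sql.j2 (illustrative sketch of the syncing template)
DROP VIEW IF EXISTS {{ schema }}.{{ view_name }} CASCADE;
CREATE VIEW {{ schema }}.{{ view_name }} AS (
    {{ query }}
);
```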

The main module renders it with code that looks like:
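A self-contained sketch of that step, using jinja2 and psycopg2; the connection details and file layout are assumptions built on the conventions above:

```python
# Illustrative sketch of the sync loop, not the actual synced_views module.
import re
from pathlib import Path

import jinja2
import psycopg2

TEMPLATES_DIR = Path("synced_views/templates")
SYNC_TEMPLATE = jinja2.Template("""
DROP VIEW IF EXISTS {{ schema }}.{{ view_name }} CASCADE;
CREATE VIEW {{ schema }}.{{ view_name }} AS (
    {{ query }}
);
""")


def sync_views(dsn: str) -> None:
    """Render and execute a CREATE VIEW statement for every template file."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cursor:
        # Sorting makes "insurance_1" sync before "insurance_2" (computation stages).
        for sql_file in sorted(TEMPLATES_DIR.glob("*/*.sql")):
            # Strip the stage suffix: "insurance_1" and "insurance_2"
            # both land in the "insurance" schema.
            schema = re.sub(r"_\d+$", "", sql_file.parent.name)
            cursor.execute(f"CREATE SCHEMA IF NOT EXISTS {schema}")
            cursor.execute(SYNC_TEMPLATE.render(
                schema=schema,
                view_name=sql_file.stem,
                query=sql_file.read_text(),
            ))


if __name__ == "__main__":
    sync_views("postgresql://user:password@warehouse:5432/dwh")
```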

Key Features & Next steps

Dependent syncing: The ability to declare dependencies between views, through stages (discussed above).

Historized views: We didn’t cover it here, but we support a simple syntax in the definition of a view that allows us to retrieve, as a function, a version of the view that emulates past data from current data.

Selective testing: The ability to not test all the views all the time, but only the ones that have been modified in the branch under test, compared to master.

Materialization: All views are now materialized and refreshed every day for better performance. A user can still access the real-time version of a view by adding the __rt suffix to its name.
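For instance, assuming the hubspot_deals view sketched earlier:

```sql
-- Daily-refreshed, materialized version: fast, but up to a day stale.
SELECT count(*) FROM core.hubspot_deals;

-- Real-time version of the same view: always fresh, computed on the fly.
SELECT count(*) FROM core.hubspot_deals__rt;
```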

Missing-only: Detects and re-syncs only the views that are missing. This is useful when dropping a table or view has also removed the synced views that depend on it.

On the roadmap:

Versioned syncing on master: Currently, when merging to master, all the views are re-synced. This process is long and wasteful. We can leverage version control (git) to detect which view definitions have changed since the last successful sync, by storing the corresponding commit hash in the target DB.
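A sketch of how that detection could work, assuming the hash of the last synced commit has been read back from the target DB:

```python
# Illustrative sketch: list the view templates changed since the last synced commit.
import subprocess


def changed_templates(last_synced_commit: str) -> list[str]:
    """Return the template paths modified between the last synced commit and HEAD."""
    output = subprocess.check_output(
        ["git", "diff", "--name-only", f"{last_synced_commit}..HEAD",
         "--", "synced_views/templates"],
        text=True,
    )
    return [path for path in output.splitlines() if path.endswith(".sql")]
```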

Dependent testing: Currently, testing relies on the views already synced to the database. That prevents us from testing changes across several dependent views (since testing doesn’t actually persist the views).

Embedded documentation: This is one of the key requirements for the framework to truly fulfill its mission of a global data API for any user in the company. We have not tackled this point yet, but the vision is to colocate the documentation with the code in the files that define the views.

Open sourcing: The syncing module would easily be re-usable by teams that don’t have a solution to prepare and share their data seamlessly and would like to buy into this framework. We believe, however, that the project is missing a few key features (above) to be really useful and to start the open-sourcing discussion.

Beyond the roadmap of the module itself, the data community is working hard to make data usable and used by the Alan team. This goes beyond the data layer that creates and publishes this data, and requires appropriate tools and training — and we’re far from done!

Conclusion

[Figure: our architecture with the Synced Views piece in place. Ouf!]

Taking a step back, the synced views framework allowed us to quickly build on our existing infrastructure to serve Alan’s data needs. Yet it ends up converging with a more traditional ETL framework that would build summary tables in succession, using a tool like Airflow for orchestration.

Compared to a classic ETL tool, the current framework has one main limitation: arbitrary code cannot create the data. This framework is designed as a step that prepares data that already exists; we actually use Airflow to create the data upstream.

We believe our synced views framework has a few benefits that make it worth continuing to invest in:

  • Simplicity & Maintainability: No DAG to explicitly manage, no Python code to write. Adding a new view is as simple as creating a new SQL file containing a query.
  • Declarativeness: The structure doesn’t reflect “how” it’s done, but “what” is done. Anyone, regardless of technical ability, can easily find the definition of a given view and field.
  • Light & Portable: The module responsible for operating the sync is only a few hundred lines of code, and doesn’t require knowledge of heavy libraries (like Airflow or pandas). All the computation is done in the database.
  • Standard: Any prepared data is in a view created with the same tool, in the same way. We can focus on data best practices (cardinality, naming, concept definition, documentation), not engineering ones (code architecture, pipeline management, migrations).
  • Dependency management: Views have dependency management embedded. This allows us to easily know which downstream views should be recomputed, without explicitly maintaining the dependency graph.
  • Realtime: A user can always use the non-materialized views to access the most recent version of the prepared data. This is super useful today for the dashboards of our sales community: they want to see their names on the wall quickly!

If you’ve read this far, you’re passionate about enabling a team with data they can use. We are too! We believe it’s core to a company’s success, which is why we’re hiring to develop our data community. Drop us a line at jobs@alan.eu
