Controlling data warehouse consumers via DWH API

Uzi Blum
Engineering @ Applift
4 min readNov 28, 2019

In this post I describe the challenge of Data Warehouse consumption by users and systems and how we have solved it in Applift. In case you cannot wait until the end, spoiler we have developed our own homegrown application that is now open source for you: https://github.com/applift/dwh_api

Background

As part of a data strategy of an organization, one of the critical activities is to harmonize a variety of data sources into a single database (i.e. sources from system such as production, marketing, sales, etc.). This database is called DWH (data warehouse) and allows one to easily access all data in one centralized location. It is fast in data retrieval, contain meaningful business terms and is a one stop shop for all of the answers. To build a DWH, data is extracted, loaded and transformed from APIs, databases, files and other data sources.

Accessing a DWH is usually done via query language (SQL) or BI tools such as Tableau, and is performed by analysts and business users. As the DWH becomes a more centralized tool and the single source of truth in an organization, there is a growing need to access it, not only by Analysts and business users, but also by engineering and app developers. The increased usage brings new challenges and a question whether traditional solutions such as SQL, BI tools and standard API are the right interfaces to access the DWH.

Problem

Consuming data from the DWH with traditional interfaces such as SQL, BI tools and API development is problematic for several reasons.

Consumption via SQL -

  1. Authorization — providing access for each user and controlling which table/ column schema can be accessed, can be challenging
  2. Authentication — accessing a database may require exposing private IPs and networks
  3. Technology gap — Not all users want/ know how to use SQL
  4. Backward compatibility — DWH technology can change (i.e. moving from Postgres to MySQL) and may require different SQL syntax.

Consumption via a BI tool -

  1. Additional limited storage — Many BI tool use In memory solution to store prepared data. This limits the amount of historical data and granularity level a user can access
  2. Licensing — BI tools require reader/ viewer license for each user
  3. Interface — Automated system cannot access reports as a data source and prefer to have a direct DB access or API access

Consumption via API

Traditional API solutions require engineering effort to develop (i.e. Java development) which leads to slow development. An endpoint needs to be well defined (by the product owner) and a series of tests need to be conducted before the end point can be consumed.

DWH Layer diagram: Sources and Consumptions view

Solution

To solve these problems we need a way to give back the power to data owners in the organization, so they can decide how exactly the data needs to be consumed. We need a simple way to do so, in order to allow the team of data owners (BI developers and Data engineers, etc.) to focus on their main task: reliable, harmonized and centralized data.

DWH API is a product that allows a user to create an API endpoint (exposing a data set) via web interface using SQL only. This end point can be consumed like any other REST API solution and requests history can be tracked.

Creating an endpoint: Configuring a new endpoint is done by writing SQL. A developer can do that purely via web interface. First the developer configures the database access by providing the database credentials (currently support Postgres). Then the user define the endpoint by creating a SQL. Later the user can add additional configurations such as parameters to filters, time-out as well as who can access it (via authorization layer)

Screenshot DWH API: Defining endpoints via DWH API interface

Consuming an endpoint: Consuming the end point is done via REST API — GET method. The consumer passes the endpoint uri, token, filters, paging and other parameters. The results are either CSV or JSON format

Screenshot DWH API:Consuming endpoints via DWH API swagger

Tracking: The system tracks every API call request. The log information containing which endpoint was accessed, by which token, from which IP, the parameters requested, duration of execution etc.

Benefits

The DWH API became a centralized solution in Applift with more than 50 endpoints serving internal and external systems and users. It is used by internal system who need DWH data for exposing performance and grading, analysts who access deep historical data, and external partners who want to see online feeds.

Within a few minutes/ hours, an analyst or a developer can create a new endpoint. While not only reducing development time dramatically, the DWH API has increased flexibility of making small modifications.

As a result of the DWH API implementation, more questions can be answered, integration can be done faster, and better decisions are taken.

What is next?

We have been using DWH API for the last 2 years in Applift and decide that it is time for other data owners in organization to benefit from it.

Download it here: https://github.com/applift/dwh_api

Install it on the cloud or on your premise and connect to your own database or data warehouse to easily expose your data via API to your consumers.

Give us feedback or contribute to improve this solution.

--

--

Uzi Blum
Engineering @ Applift

Enthusiast about innovation and great ideas. Data driven everything