“TheReporter” — An extensible reporting API framework

By: Chen-Yu Li, Rob Tan, Yi-Ting Li

Rob Tan
6 min readJan 2, 2020
TheReporter

The problem

  • Data and business insights are a competitive edge for most organizations. Hence they need to deliver relevant and up-to-date information to users quickly and efficiently.
  • But this information is spread across many different data repositories. For example, our company’s data is stored in data warehouses such as Redshift and Snowflake, data lakes in S3, NoSQL databases, and in our internal accounting and ERP systems such as Salesforce and Intacct. We also need to be access data by calling out to our partners’ external APIs.
  • This information needs to be accurate, be as up-to-date as possible and be easily presented for friendly consumption. Having to build reports that span these disparate systems is a daunting task. Moreover as our business evolves, the demand for new types of reports and data repositories will continue to grow. So our engineering team stepped back and asked, “what can we do to ease the development pain of surfacing useful data to our users?” And in so doing, could we also accelerate the time to deliver new types of reports? The term “report” is used loosely here. It encompasses any information (fine-grained or aggregated, transactional or statistical, real-time or historical) that may be of interest a user or a system.
  • Another important concern is that of data security and privacy. We need to ensure that we have proper access controls so that users can see information only they are allowed to see. How can we come up with a unified way of expressing these access control rules and enforcing them?

The requirements

We quickly realized we had to build a system that would meet the following requirements:

  • Allow developers to easily contribute new reporting endpoints (minimal friction in deploying)
  • Make it easy to expose data from new data sources and repositories
  • Have a common language to create and update access controls to data
  • Have a unified way for the user to express filter criteria (regardless of data SQL or NoSQL data stores)

The system we built is called “TheReporter”. It is an all-encompassing service to enable the creation, management and execution of information retrieval (reporting) endpoints. It is important to note that TheReporter is not a data processing platform because it assumes the data to be retrieved has a priori been transformed and curated by other systems (such as ETL pipelines and data-processing applications).

Architecture

When we designed TheReporter, we analyzed the anatomy of a reporting API endpoint and looked carefully at the metadata surrounding an endpoint. Was there a way to formalize the description of an endpoint so that we can automate its creation and maintenance and therefore relieve developer pain? The answer was yes — an endpoint always has the following characteristics about it:

  • A URL path
  • The set of all possible columns and their metadata (including name, type and access control)
  • The default set of columns to be returned
  • The maximum number of rows which can be returned
  • Access control (who is permitted to call this endpoint or see a column?)
  • The retrieval logic for the data to be returned (i.e. the data getter)
  • The default filter to apply if none is given by the user

We came up with a JSON syntax to define the metadata for a reporting endpoint and called it a Reporting Endpoint Package (REP). Basically it looks something like this:

TheReporter’s foundation rests on the concepts of its DSL, the Reporting Endpoint Package and Serverless (AWS Lambda). The following diagram shows the high level architecture:

ACL (Access Control)

The ACL scheme used by TheReporter is based on a home-grown authorization system called Janus. It allows the user to express access level controls at the user, department and organizational level. Our intention is to make ACL providers pluggable (in a future release) so that the ACL scheme itself is flexible and can be adapted to a specific provider.

Filters

The concept of TheReporter filter is to provide a uniform language to query different data sources, for example, Redshift, Snowflake, Athena and others.To extend the flexibility, we came up with a protocol to translate a nested JSON structure into SQL or to the query language of the underlying datastore. TheReporter filter syntax allows users to customize queries on-demand. Under the covers, the filter parser uses the Depth First Search algorithm to parse the filter’s JSON structure. It has 2 key definitions:

  1. Condition keyword
    “single” means only a single boolean expression
    “or” / “and” if the condition has more than one boolean expression

2. Condition expression
{ “column”: “xxx”, “operator”: “=”, “value” :”xxx”}

For example, the filter can be as simple as:

which translates to:

A more complex filter would look like:

which translates to:

AWS Lambda and API gateway

We opted for using AWS Lambda and API gateway because it was the most pain-free way to get TheReporter up and running in a short time. Another reason is the horizontal-scaling benefits of lambdas. Since our users do not hit our report endpoints continuously and traffic is often spiky, lambdas provide us with unlimited scalability and elasticity (subject to the constraints of the underlying data store).

So what happens when a client makes a request to a reporting endpoint? The following diagram demonstrates the workflow of TheReporter handling an API request.

  1. A user makes a call to the Reporting endpoint.
  2. The REP package associated with the endpoint is looked up from the database.
  3. A request is sent to the Authentication/Authorization service. If not authorized, a negative response is sent back to the user.
  4. The filter parser parses the filter and forwards the query to the Data Getter.
  5. The Data Getter calls to underlying data store to execute the query. The resulting data is then sent back to the user.

Conclusion

The goal of TheReporter is to empower developers and business analysts to create new reporting endpoints with minimal development and deployment overhead. We have seen the benefits of its use in many of the reporting products in the company. As we add new data getters, its adoption will continue to expand across the company.

--

--