ELT, Data Platforms And SQL: Extracting and Transforming Data From RESTful Services
How to build a reliable ELT solution for RESTful services, taking into account everything from RESTful API throttling to maintaining support for foreign key constraints.
Introduction
We’ve been working on a fully functional ELT solution on top of a data platform in PostgreSQL for the past few months. A 100% PostgreSQL-based data platform could be an ideal choice if you’re seeking a cost-effective, streamlined solution for your data needs. PostgreSQL provides several advantages in terms of flexibility and ease of use, particularly if you don’t require endless linear scalability (Big Data Is Dead).
There’s a lot to cover, so we’ve split this article into two parts. The first part will cover the key design decisions involved in building a reliable ELT solution and data platform, while the second part will focus on implementing the solution using PostgreSQL.
The focus of this article is to synchronize existing third-party data that is only available through RESTful APIs with an internal data platform. It’s worth noting that while the solutions outlined in this article are effective in this specific context, they may not be applicable in other situations, such as real-time data consumption.
The Opinionated TLDR
The Opinionated Architect’s TLDR:
- This article focuses on PostgreSQL.
- Prefer to synchronize data at the database layer. Leverage tools like Prequel, foreign data wrappers, and replication before resorting to RESTful APIs to synchronize databases.
- Create data platforms around Idempotent Software Architecting. It is imperative to be able to handle potential duplicate data, including accidental duplicate RESTful API calls (see the idempotent-load sketch after this list).
- A data platform should be able to handle extracting data ‘out of order’. If one entity relies on another entity (see Foreign Key Constraints), design a system that can transform the dependent entity before the entity it depends upon.
- Favor ELT over ETL. Simplify transformation logic by extracting and loading data into your data platform before performing transformations (ELT) rather than transforming and then loading the data (ETL).
- When designing a solution, architects should consider SQL instances as potential compute. SQL is a complete language that can be used like any other programming language. Transformations can be performed on a SQL instance, so keep this in mind when architecting a data platform solution.
- Side note. Data Science should be split into two parts: ETL and Analytics. Data Science should focus on analytics, while engineers handle the complexities of ETL. Note: In some cases, the job of Analytics is to analyze the quality of data. In these cases, Analytics should be involved in the ETL process.
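As a rough illustration of the idempotency point above, here is a minimal sketch assuming a hypothetical raw_invoices staging table keyed on the source system's entity ID; loading the same payload twice leaves the table in the same state.

```sql
-- Hypothetical staging table for raw invoice payloads pulled from a REST API.
CREATE TABLE IF NOT EXISTS raw_invoices (
    source_id  text PRIMARY KEY,                     -- entity id from the source system
    payload    jsonb NOT NULL,                       -- raw JSON exactly as extracted
    loaded_at  timestamptz NOT NULL DEFAULT now()
);

-- Idempotent load: a duplicate extraction simply overwrites the same row,
-- so accidental duplicate API calls do not create duplicate data.
INSERT INTO raw_invoices (source_id, payload)
VALUES ('inv_123', '{"amount": 100, "user_id": "u_42"}')
ON CONFLICT (source_id)
DO UPDATE SET payload = EXCLUDED.payload, loaded_at = now();
```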
RESTful APIs and Providing Access to ALL Your Customers’ Data
Summary: Use solutions like Prequel to give access to all of your customers’ data. Only resort to REST as an offering if this isn’t possible.
This section is for people providing data to their customers. Let’s quickly discuss how we believe you should expose all of your customers’ data before we delve into the details of ELT from RESTful APIs.
REST is designed to handle resources in a standardized way. Resources typically represent a single entity (which may contain other entities). The core principle of REST is that everything is a resource, identified by its unique URI, and the entity’s state is manipulated with GET, POST, PUT, PATCH, and DELETE operations (see CRUD).
The intention of REST is to CRUD a single entity (which may contain other entities). Using REST as a data source for extracting and loading all entities from a third party is not really its intent. While this is technically possible, it is not a recommended approach: REST handles resources at a granular level, so using it for bulk data transfer can be inefficient, unreliable, complex to implement, and may not scale well.
Even Worse: We Lose Data Integrity in Transit
It is likely that the source entities are stored in a solution that constrains the data to ensure data integrity. The destination of the data is also most likely constrained to ensure data integrity.
Using RESTful APIs to keep two data sets in sync requires the following steps:
- Pull the data from the storage solution and transform it into a format useful for data transit, such as JSON. At this point, all of the constraints on the data and its types are lost!
- The data is then pulled into a transformation process where the data most likely has to be re-verified before it is loaded into storage.
While this process of syncing is happening, we also need to consider network outages, throttling, and other issues that may arise.
Now, let’s dive into the architectural details of loading data into a data platform using REST.
ELT vs ETL
Summary: In our data platform solution, we opted to use ELT instead of ETL.
Traditionally, the Extract, Transform, and Load (ETL) approach is used to bring data into a data platform. This involves using compute to:
- Extract data from a data source (which is often placed in a temporary location like a Message Queue or Databricks ETL).
- Transform that data using compute that monitors the location where data has been extracted. This step may require querying other data sources, which adds overhead.
- Load the results of the transformation into one or more locations (like Elasticsearch and a relational database).
However, in cases where transforming and loading can occur within the data platform itself (using a PostgreSQL instance in our case), a simpler approach is possible using Extract, Load, and then Transform (ELT):
- Extract and immediately load the raw data into a relational database instance.
- Perform transformation logic within the database instance (using PostgreSQL procedures, for example). This allows for easier, and faster, access to all existing data within the data warehouse, making transformation logic simpler. Need to retrieve additional information about an Invoice entity from a User entity? Simply use an INNER JOIN on the User table during transformation (see the sketch below).
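For example, a minimal sketch of an in-database transformation, assuming the hypothetical raw_invoices staging table from earlier plus illustrative users and invoices tables:

```sql
-- Hypothetical transformation procedure: enrich raw invoice payloads with
-- user data that already lives in the platform, then upsert the result.
CREATE OR REPLACE PROCEDURE transform_invoices()
LANGUAGE sql
AS $$
    INSERT INTO invoices (source_id, amount, user_email)
    SELECT r.source_id,
           (r.payload ->> 'amount')::numeric,
           u.email
    FROM raw_invoices AS r
    INNER JOIN users AS u
            ON u.source_id = r.payload ->> 'user_id'
    ON CONFLICT (source_id)
    DO UPDATE SET amount     = EXCLUDED.amount,
                  user_email = EXCLUDED.user_email;
$$;

CALL transform_invoices();
```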
Viewing SQL Instances As Compute
Summary: We view SQL as a complete programming language. Transformations, along with associated business logic as needed, are done within SQL functions and procedures; sometimes on SQL instances dedicated to compute.
Historically (see Multitier Architecture), a SQL instance running any SQL flavor (PostgreSQL, MySQL, SQL Server, Oracle, etc.) is used solely as a database abstraction layer for persisting and querying data. Business logic compute is confined to the Business Layer, traditionally written in languages like Node, Python, C++, Ruby, etc. However, given that SQL is a complete language, business logic could also be written in SQL. A SQL instance can be used as the Business Layer, and is viewed as compute.
One advantage of this approach is that a single instance can contain both persistence and compute, and a dedicated compute instance can later be spun up, leveraging features like foreign data wrappers to reach the persisted data.
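As a rough sketch of that split, assuming the postgres_fdw extension and hypothetical connection details, a dedicated compute instance can reach the persistence instance’s tables like this:

```sql
-- On the dedicated "compute" instance: expose tables that live on the
-- persistence instance via the postgres_fdw foreign data wrapper.
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

CREATE SERVER persistence_db
    FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'persistence.internal', port '5432', dbname 'platform');  -- hypothetical host

CREATE USER MAPPING FOR CURRENT_USER
    SERVER persistence_db
    OPTIONS (user 'etl_user', password 'etl_password');  -- placeholder credentials

-- Make the remote raw tables queryable locally, then run transformations here.
IMPORT FOREIGN SCHEMA raw            -- hypothetical schema holding the raw data
    FROM SERVER persistence_db
    INTO public;
```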
Note on Scaling SQL Instances
In situations where scalability is not a deciding factor in a data platform, we highly recommend considering the approach presented in this article. Note that although PostgreSQL instances are currently not as easy to scale as other forms of compute (such as serverless computing), that doesn’t mean PostgreSQL can’t be scaled.
Message Queuing, Scaling and SQL
Summary: For ELT, we can use PostgreSQL as our message queue, handling the heavy lifting of persisting the data before it is transformed.
Replacing Message Queues
We do not recommend replacing message queues with SQL instances for all aspects of an architecture. Message queues can be used for processing data beyond data warehouses. For example, a message queue can be used to publish events, and different business departments can listen to a channel on a message queue, triggering on the event as needed.
SQL as a Persisted Message Queue
Let’s consider a message as a blob of data that contains information about one or more entities. A message queue service (Apache Kafka, RabbitMQ, ActiveMQ, AWS SQS, AWS Kinesis, Google Cloud Pub/Sub, Microsoft Azure Service Bus, etc.) in the context of ETL can provide us with:
- Durability: Messages are stored and maintained in the message queue even if the producer or consumer goes offline.
- Scalability: Message queues can handle a large number of messages, allowing for horizontal scalability to handle increased traffic.
- Asynchrony: Message queues allow applications to communicate asynchronously, which can improve performance and efficiency.
- Decoupled Communication: Applications can communicate through message queues without the need for a direct connection, which decouples the applications and makes them more flexible and independent.
- Reliable Delivery: Message queues provide mechanisms for ensuring the delivery of messages, such as acknowledgment mechanisms and retry policies.
- Filtering: Message queues can filter messages based on specific criteria, allowing consumers to receive only the messages that are relevant to them.
- Transformation: Message queues can transform messages before delivering them, enabling applications to integrate with each other even if they have different data formats.
- Publish-Subscribe Model: A message queue enables a publish-subscribe pattern where messages are published to a queue by a producer and then delivered to one or many consumers.
Other than scalability, a PostgreSQL instance can support all of these features. For example, the Publish-Subscribe Model is supported by PostgreSQL’s LISTEN/NOTIFY.
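A minimal sketch of that pattern using LISTEN/NOTIFY (the channel name and payload are illustrative):

```sql
-- Consumer session: subscribe to a channel.
LISTEN entity_loaded;

-- Producer session: publish an event after loading a batch of raw data.
NOTIFY entity_loaded, 'raw_invoices batch 42 loaded';

-- Or publish from within a function or trigger:
SELECT pg_notify('entity_loaded', 'raw_invoices batch 42 loaded');
```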
SQL Can Be Efficient
Although SQL does not scale linearly (contrast this with Apache Spark, which can process 60 terabytes of data), SQL can initially scale quickly and cheaply. SQL instances are always running, making compute immediately available. It is not unheard of to transform 40,000 entities in under one second on a small SQL instance.
RESTful Endpoints Are Throttled
RESTful endpoints are commonly subjected to throttling, where the number of API calls a system can make is limited. Services often limit API calls (five per second is not unheard of), and in some cases, they throttle based on the compute required to service the API request.
As a result, throttling can become a bottleneck in data synchronization, making it a crucial factor to consider when selecting tools and frameworks for ELT (or ETL) processes.
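As one hedged illustration of respecting a per-second limit inside the database itself, the sketch below logs each outbound call and sleeps when the last second is already full. The table, function, and limit are hypothetical, and the actual HTTP call (for example via an HTTP extension) is out of scope here.

```sql
-- Hypothetical bookkeeping table: one row per outbound API call.
CREATE TABLE IF NOT EXISTS api_call_log (
    called_at  timestamptz NOT NULL DEFAULT clock_timestamp(),
    endpoint   text NOT NULL
);

-- Wait until fewer than max_per_second calls were made in the last second,
-- then record the call we are about to make.
CREATE OR REPLACE FUNCTION acquire_call_slot(p_endpoint text, max_per_second int DEFAULT 5)
RETURNS void
LANGUAGE plpgsql
AS $$
BEGIN
    WHILE (SELECT count(*)
           FROM api_call_log
           WHERE called_at > clock_timestamp() - interval '1 second') >= max_per_second
    LOOP
        PERFORM pg_sleep(0.2);  -- back off briefly before re-checking
    END LOOP;
    INSERT INTO api_call_log (endpoint) VALUES (p_endpoint);
END;
$$;
```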
Considerations
When designing a robust ELT system using a RESTful API, several requirements must be considered:
- Almost every endpoint is going to require pagination. Keep this in mind when designing a robust ELT system.
- Use RESTful API endpoints that can retrieve multiple entities with a single call. If this is not possible, the extraction time will increase significantly. In the case where a single entity is pulled per call, a pre-query within the data warehouse may be required to get the list of entity IDs needed to make the API calls.
- Retrieve only data that has not been previously extracted whenever possible. Ideally, the API should support a “pull all data that has been updated after this date” feature, but some APIs do not.
- Pagination is likely required for an endpoint, and a failure in the middle of running through the pages would otherwise require a full restart. However, if the API supports ordering the data by the last-updated timestamp, it’s possible to write logic that resumes pulling data from the oldest date of the last successfully pulled page (a sketch of this bookkeeping follows this list).
- Throttling needs to be integrated into a robust extraction solution, as many services rate limit API calls.
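One way to keep that bookkeeping in the platform itself is a per-endpoint watermark table (table and column names are hypothetical): record the newest updated-at value that has been fully loaded, and resume from there.

```sql
-- Hypothetical watermark table: one row per REST endpoint being extracted.
CREATE TABLE IF NOT EXISTS extraction_state (
    endpoint         text PRIMARY KEY,
    last_updated_at  timestamptz NOT NULL DEFAULT '-infinity'
);

-- Before extracting: fetch the watermark to build an
-- "updated_after=<last_updated_at>" query parameter.
SELECT last_updated_at FROM extraction_state WHERE endpoint = '/v1/invoices';

-- After each successfully loaded page: advance the watermark, so a failure
-- mid-pagination restarts from that page rather than from page one.
INSERT INTO extraction_state (endpoint, last_updated_at)
VALUES ('/v1/invoices', '2024-01-31T12:00:00Z')
ON CONFLICT (endpoint)
DO UPDATE SET last_updated_at = GREATEST(extraction_state.last_updated_at,
                                         EXCLUDED.last_updated_at);
```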
The Final Requirements
To build a robust RESTful API extraction system and data platform, we need to consider several aspects:
- Rate Limiting: A robust extraction layer must limit the number of calls made based on the calls per second and the processing overhead of the data source.
- Retry With Throttling: A robust extraction layer should support multiple retries when reading from a RESTful endpoint. However, each retry must be throttled: waiting an increasing amount of time between each retry.
- Pagination: A robust extraction layer must support pagination, especially for RESTful API endpoints that may contain millions of entities.
- Single Entity API Calls: In cases where we can only retrieve one entity at a time, such as a user, we need the ability to construct a URL that includes the entity ID of each user. The entity ID for each user can be obtained from a database query, view, or table.
- Multiple RESTful Sources: Our extraction layer needs to support extracting data from multiple data sources such as Google, Meta, Active Campaign, etc.
- Data Driven: Our extraction layer leverages the features of SQL by encoding all data required to make a RESTful API call in tables (a sketch of such tables follows this list).
- Source of Truth: For a given entity, the data may come from multiple sources. A robust data platform should support the merging of entity information from multiple sources: noting that for each piece of information there is a single source of truth.
- Credential Encryption: Often, access credentials such as API keys and passwords are required to make API calls. These must be encrypted at rest.
- Logging: Robust extraction and load layers need effective logging of what has occurred during the extraction and load process.
- Transactions: A robust extraction and load layer needs to support the idea of a “transaction” where multiple RESTful API endpoints must be called successfully before transformations can be applied.
- Notification: When there are failures that cannot be resolved by retrying, the extraction and load layers should be designed to notify one or more key personnel of the issue. This notification can be sent via email, SMS, or other methods, and should include details about the error, such as the endpoint, the time of failure, and any relevant error messages.
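A hedged sketch of what “data driven” can look like: every source, endpoint, and credential is a row, and per-entity URLs are generated with a query. The table names, columns, and key handling below are all illustrative; the encryption functions come from the pgcrypto extension, and in practice the symmetric key would come from a vault rather than a literal.

```sql
CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- One row per data source (Google, Meta, Active Campaign, ...).
CREATE TABLE IF NOT EXISTS rest_sources (
    source_id  text PRIMARY KEY,
    base_url   text NOT NULL,
    api_key    bytea NOT NULL              -- encrypted at rest, see below
);

-- One row per endpoint we extract from, including pagination hints.
CREATE TABLE IF NOT EXISTS rest_endpoints (
    source_id      text NOT NULL REFERENCES rest_sources (source_id),
    path_template  text NOT NULL,          -- e.g. '/v1/users/{id}'
    page_size      int  NOT NULL DEFAULT 100,
    PRIMARY KEY (source_id, path_template)
);

-- Store credentials encrypted at rest.
INSERT INTO rest_sources (source_id, base_url, api_key)
VALUES ('active_campaign', 'https://api.example.com',
        pgp_sym_encrypt('secret-api-key', 'key-from-vault'));

-- Single-entity endpoints: build one URL per entity ID already known to the platform.
SELECT s.base_url || replace(e.path_template, '{id}', u.source_id) AS url
FROM rest_endpoints AS e
JOIN rest_sources   AS s USING (source_id)
JOIN users          AS u ON e.path_template = '/v1/users/{id}';
```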
Implementation
Implementation details are provided in the second article (which we are working on now).
Conclusion
In conclusion, a 100% PostgreSQL-based ELT solution and data platform can provide a cost-effective and streamlined solution for data needs. The first part of this article covered key design decisions in building a reliable ELT solution and data platform, including the use of PostgreSQL as a complete programming language for business logic, and favoring ELT over ETL for simpler transformation logic. The article also provided recommendations for providing access to all customers’ data through solutions like Prequel, and using RESTful APIs as a last resort for synchronization.
Additionally, the article discussed the advantages of viewing SQL instances as potential compute and using PostgreSQL as a persisted message queue for ELT. It also illustrated the complexities of extracting and loading information from RESTful endpoints, and the importance of considering aspects such as rate limiting, pagination, and logging when building a robust RESTful API extraction system.
Overall, a robust ELT system should support multiple data sources, handle duplicate data, extract data out of order, and provide effective logging and notification mechanisms for failures. Implementation details for the solution using PostgreSQL are provided in the second part of the article.