Large Results from a GraphQL API and Google BigQuery

Alexander Goida · Published in Nerd For Tech · Aug 14, 2024 · 4 min read

In today’s data-driven world, APIs need to return large datasets efficiently without overwhelming the client or network. This is especially important for serverless data warehouses like Google BigQuery, which can handle massive datasets but may struggle with returning large results through an API, such as a GraphQL API.


In this article, we’ll explore a strategy for returning large results from a GraphQL API using Python, the Strawberry framework, and BigQuery. We’ll discuss how to improve performance, reduce latency, and maintain consistency by using temporary tables and caching.

Problem with Pagination

When a GraphQL API query returns a large dataset, such as millions of records, transmitting this data can cause problems. It might be too much for the client to handle, or it could lead to timeout errors, especially with services like API Gateways.

Traditional pagination is a common approach to handle large datasets in APIs by breaking them into smaller parts based on page number and page size. However, with very large datasets, like those in BigQuery, this method can become inefficient due to repetitive query execution. Each page request might re-run the same complex query, which can be costly and slow when dealing with large datasets.

To solve these problems, we can use BigQuery to store query results in a temporary table and cache the table’s name for later requests.

Using Temporary Tables and Caching

To handle large datasets in your GraphQL API, you can use temporary tables and caching. This approach processes the data once and reuses it for subsequent requests, improving efficiency and reducing latency.

Step 1: Execute the Query and Store Results in a Temporary Table

When the API receives a query, the first step is to execute the full SQL statement in BigQuery. BigQuery automatically writes the results of every query to a temporary (anonymous) table, which it retains for roughly 24 hours, so no extra setup is needed to create it.
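
A minimal sketch of this step, assuming the official google-cloud-bigquery client (the run_query helper and its names are illustrative, not from a specific codebase):

from google.cloud import bigquery

client = bigquery.Client()

def run_query(sql: str) -> str:
    """Run the full query and return the fully qualified name of the
    temporary table that BigQuery created for its results."""
    job = client.query(sql)  # start the query job
    job.result()             # block until the job finishes
    dest = job.destination   # reference to the anonymous result table
    return f"{dest.project}.{dest.dataset_id}.{dest.table_id}"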

To paginate through results efficiently without re-running the query for every page request, you can add columns for total_pages, page_number, and total_records. These totals can be sent to the client as part of the response. The API should allow the client to set the page size, making pagination details a part of the API query input and results.

You can add pagination columns to the SQL query using this method:

SELECT
  ...,
  -- page_size is the client-supplied page size, inlined as a literal
  -- or bound as a query parameter; add an ORDER BY inside OVER () if
  -- a stable row order across pages is required
  CEIL(ROW_NUMBER() OVER () / page_size) AS page_number,
  CEIL(COUNT(1) OVER () / page_size) AS total_pages,
  COUNT(1) OVER () AS total_records
FROM ...

Step 2: Cache the Temporary Table Name for Reuse

Instead of running the complex SQL query every time, you cache the name of the temporary table. This can be done by hashing the SQL text and using the hash as a key in an in-memory cache (such as Redis), with the temporary table's name as the value. This way, the API can quickly look up the table when the same query is requested again. In Python, the API can read the result table's details from the destination property of the BigQuery job object.

This caching method lets the API quickly access the precomputed results stored in BigQuery, fetching only the requested page and avoiding repeated execution of the original query.
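
As a sketch, assuming redis-py and reusing the run_query helper from the earlier snippet (the key prefix and TTL are illustrative choices):

import hashlib
import redis

cache = redis.Redis(host="localhost", port=6379, decode_responses=True)
CACHE_TTL_SECONDS = 6 * 60 * 60  # keep well below BigQuery's ~24h table lifetime

def get_result_table(sql: str) -> str:
    """Return the cached result-table name for this SQL, executing the
    query only on a cache miss."""
    key = "bq:" + hashlib.sha256(sql.encode("utf-8")).hexdigest()
    table_name = cache.get(key)
    if table_name is None:
        table_name = run_query(sql)  # executes the full query exactly once
        cache.set(key, table_name, ex=CACHE_TTL_SECONDS)
    return table_name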

Step 3: Paginate Results Using the Cached Table Name

Once the result table name is cached, future requests for other pages of the same SQL command can simply query the result table in BigQuery. The API fetches the requested page from this temporary table without re-running the original query.

Consider setting a TTL (Time-To-Live) for the cached name so it is retained only as long as needed. Keep in mind that BigQuery retains anonymous result tables for roughly 24 hours, so the TTL should stay well below that limit to ensure the client always receives reliable results.
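
Putting it together, serving one page might look like the following sketch (it assumes the page_number column added in Step 1 and uses a parameterized query to avoid string injection; fetch_page is an illustrative name):

from google.cloud import bigquery

client = bigquery.Client()

def fetch_page(table_name: str, page_number: int) -> list[dict]:
    """Read a single page from the cached result table."""
    sql = f"SELECT * FROM `{table_name}` WHERE page_number = @page"
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("page", "INT64", page_number)
        ]
    )
    rows = client.query(sql, job_config=job_config).result()
    return [dict(row) for row in rows]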

Details of Technical Implementation

From an implementation perspective, two details are worth a closer look: designing the types of your GraphQL API and calculating a hash for the SQL command.

Designing the API

First, let’s look at how the types for the GraphQL API might be designed. The input and return types should include pagination information.

Input Type:

import strawberry
from typing import Annotated, Optional

@strawberry.input(description="Input parameter to specify pagination")
class PaginationInput:
    page_size: int = strawberry.field(description="Size of data page (default: 10000)", default=10000)
    page_number: int = strawberry.field(description="Page number (default: 1)", default=1)

@strawberry.input(description="Data object filter")
class DataFilterInput:
    pagination: Annotated[Optional[PaginationInput], strawberry.argument(description="Pagination settings. Optional.")] = None
    codes: Annotated[Optional[list[str]], strawberry.argument(description="List of codes to filter by. Optional.")] = None
    is_flag: Annotated[Optional[bool], strawberry.argument(description="Filter by flag status. Optional.")] = None

Return Type:

@strawberry.type
class PaginatedResult:
    items: list[DataObject] = strawberry.field(description="The list of items in this pagination window.")
    total_pages: int = strawberry.field(description="Total number of pages.")
    total_records: int = strawberry.field(description="Total number of records.")
    page_number: int = strawberry.field(description="The current page number.")
    page_size: int = strawberry.field(description="The number of items in each page.")

Query Resolver:

def data_objects(info: Info, search_filter: DataFilterInput) -> PaginatedResult:
    ...

You may also want to use generic types, as sketched below. This approach lets your GraphQL API maintain a uniform contract across all input and return types.
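
For illustration, a generic wrapper built on Strawberry's generics support might look like this (Paginated is an illustrative name; the PaginatedResult above is its concrete equivalent):

from typing import Generic, TypeVar
import strawberry

T = TypeVar("T")

@strawberry.type
class Paginated(Generic[T]):
    items: list[T] = strawberry.field(description="The list of items in this pagination window.")
    total_pages: int = strawberry.field(description="Total number of pages.")
    total_records: int = strawberry.field(description="Total number of records.")
    page_number: int = strawberry.field(description="The current page number.")
    page_size: int = strawberry.field(description="The number of items in each page.")

A resolver can then return Paginated[DataObject], and the same wrapper can be reused for any other paginated query.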

Calculating the SQL Query Hash

The SQL hash can be calculated with the following method:

import hashlib

def calculate_checksum(input_string):
    input_bytes = input_string.encode('utf-8')
    sha256_hash = hashlib.sha256()
    sha256_hash.update(input_bytes)
    checksum = sha256_hash.hexdigest()
    return checksum
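
For example, the checksum can serve as the cache key that maps the SQL text to the temporary table name (the key prefix is an illustrative convention):

sql = "SELECT ... FROM ..."  # the full, hypothetical SQL command
cache_key = f"bq:{calculate_checksum(sql)}"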

Conclusion

Whether you’re building data-heavy applications or looking to optimize your API’s performance, integrating this approach into your workflow can enhance both user experience and system efficiency. While this technique is technology-agnostic, the specific implementation may vary depending on the technology stack you are using. For example, you might need to develop custom logic for storing the result in a temporary table or caching, but the core algorithm remains conceptually the same.
