
You envision BigQuery, but AlloyDB is probably what you need


TL;DR: Think BigQuery is the only answer for your data needs? Think again! This post dives into AlloyDB, Google Cloud’s HTAP powerhouse that blends transactional speed, analytical insights, and AI magic. Discover why AlloyDB might be the perfect fit for your hybrid workloads, even if your boss just wants something “future-proof.”

During my daily job as a Cloud & Data Architect, I frequently evaluate Google BigQuery’s suitability as a replacement for clients’ existing data warehouse solutions. This process usually involves a deep dive into their current infrastructure, assessment of data volumes, performance requirements, and future scalability needs. It is a complex journey, where collaboration and open-mindedness are key to the success of such initiatives.

This story is about a real-world project where we guided a customer through the complexities of migrating their data to Google Cloud, despite facing some… enthusiastic decisions from a particular C-suite executive, whose amusing pronouncements have been used as titles for each section.

Despite their initial lean toward BigQuery, we worked together to understand the real needs and limitations they were facing, and in the end, we were able to guide them towards a more tailored (spoiler: hybrid) solution that truly met their organization’s unique needs.

“Listen, I don’t want anyone messing with our… uh… data thing. If it ain’t broke, don’t fix it, am I right? Just plug this BigQuery in and make it work, no fuss!”

While clients are consistently impressed by BigQuery’s exceptional performance and scalability, they often express concerns about the potential need for schema modifications to fully leverage its capabilities. This hesitation stems from a desire to minimize disruption to existing data pipelines and ETL processes, as well as concerns about the complexity and effort involved in schema migrations, which is fully understandable!

While BigQuery initially gained renown for its prowess with flat, denormalized tables, the performance gap between various schema designs has narrowed considerably in recent years. This evolution comes from continuous advancements in BigQuery’s query engine and storage optimizations, enabling it to efficiently handle a wider range of data schemas. Consequently, the decision to denormalize or maintain a more normalized schema is less critical than it used to be.
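
To make this concrete, here is a minimal sketch using the google-cloud-bigquery Python client and hypothetical `orders`/`order_items` tables: the same question is expressed once against a normalized schema (a classic join) and once against a nested, denormalized one (a repeated STRUCT column), and both shapes are handled by the same query engine.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses Application Default Credentials

# Normalized schema: classic join between two tables (hypothetical names).
normalized_sql = """
SELECT o.customer_id, SUM(i.quantity * i.unit_price) AS revenue
FROM `my_project.shop.orders` AS o
JOIN `my_project.shop.order_items` AS i USING (order_id)
GROUP BY o.customer_id
"""

# Denormalized schema: items nested as a REPEATED STRUCT inside each order row.
nested_sql = """
SELECT o.customer_id, SUM(i.quantity * i.unit_price) AS revenue
FROM `my_project.shop.orders_nested` AS o, UNNEST(o.items) AS i
GROUP BY o.customer_id
"""

for label, sql in [("normalized", normalized_sql), ("nested", nested_sql)]:
    rows = client.query(sql).result()
    print(label, [dict(row) for row in rows][:3])
```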

“I want us to be… you know, cutting-edge, top of the line! Like those guys in Silicon Valley… future-proof, that’s the ticket!”

BigQuery is undergoing a significant transformation, fueled by Google’s integration of Gemini, its flagship large language model, almost everywhere. To name just a few benefits of this integration, Gemini enhances BigQuery’s ability to understand natural language, automatically generating optimized queries and translating them across different SQL dialects. Furthermore, it enables users to quickly extract insights from their data (and metadata) through automated analysis and visualization. And don’t forget BigQuery ML!
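
As a small illustration of the BigQuery ML side (the dataset, table, and column names below are made up), training and querying a time-series forecasting model is just a pair of SQL statements, shown here through the Python client:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Train a forecasting model in place with BigQuery ML.
# `my_project.sales.daily_revenue` is a hypothetical table with (day, revenue).
client.query("""
CREATE OR REPLACE MODEL `my_project.sales.revenue_forecast`
OPTIONS (
  model_type = 'ARIMA_PLUS',
  time_series_timestamp_col = 'day',
  time_series_data_col = 'revenue'
) AS
SELECT day, revenue FROM `my_project.sales.daily_revenue`
""").result()

# Produce a 30-day forecast from the trained model.
forecast = client.query("""
SELECT forecast_timestamp, forecast_value
FROM ML.FORECAST(MODEL `my_project.sales.revenue_forecast`,
                 STRUCT(30 AS horizon))
""").result()

for row in forecast:
    print(row.forecast_timestamp, row.forecast_value)
```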

This perfectly aligns with clients’ desire for a future-proof solution that can adapt to evolving data landscapes and analytical needs.

Our clients frequently seek products that go beyond the traditional data warehouse, offering capabilities that extend into data science, machine learning, and advanced analytics. This is precisely what BigQuery embodies today.

So why don’t we recommend BigQuery by default? What are the key factors to consider when determining whether BigQuery is the right tool for the job?

“This BigQuery thing… it seems pretty impressive. Our people tell me it’s the way to go — aligns perfectly with all our, uh, data stuff. Sounds like a winner to me!”

BigQuery, with its serverless architecture and impressive scalability, is particularly useful for organizations dealing with massive datasets and complex analytical workloads.

These two aspects are central to my initial assessment when evaluating BigQuery’s suitability for a client and, based on my experience, can be evaluated by prioritizing three key areas of interest:

  • Data Volumes: Questions about the scale of data involved, current volumes, and the data growth rate usually arise during the first interviews.
  • Data Change Management: I usually analyze how frequently data is updated, the complexity of these changes, and the mechanisms already in place for managing schema evolution and data versioning.
  • Data Access Patterns (my favorite one): This phase includes an examination of how data is queried, including the types of queries, frequency, concurrency, and performance expectations.

These points of discussion often provide sufficient insight to determine whether BigQuery will simplify or complicate the IT department’s operations (because it can, trust me). Other times, we start with proofs of concept to further validate our initial assessments and provide more concrete evidence.

Interestingly, in approximately half of the cases, after a thorough evaluation with the customer, we collectively decide on alternative solutions. This decision is typically influenced by the following findings:

  • Low Data Volume: If the client’s data volume is relatively small, the advantages of BigQuery’s massive scalability might not be fully realized, making other solutions potentially more cost-effective.
  • Transactional Query Patterns: If the data warehouse is also queried using transactional-style queries (frequent, small, targeted inserts, updates and retrievals), a traditional relational database might be a more suitable choice.
  • Batch Deletes and Re-inserts: The practice of batch deleting and re-inserting entire tables (not just partitions) can be resource-intensive in BigQuery, and it is often a symptom of a big data culture that is not yet well established. Moreover, with this pattern clients cannot benefit from the storage cost optimizations built into BigQuery’s storage pricing model, such as the lower long-term storage rate for data that hasn’t been modified for 90 days.
  • Frequent Use of LIMIT Clause: BigQuery’s pricing model is based on the amount of data processed by each query. While a LIMIT clause restricts the number of rows returned (say it loud, RETURNED!), BigQuery, like most engines out there, still processes the entire dataset before applying the limit. Since many customers don’t want to modify their queries during the first iteration, a solution offering more predictable and stable cost management might be a more suitable option in this case (see the dry-run sketch right after this list).
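
A quick way to show this to customers is a dry run: the sketch below uses the google-cloud-bigquery Python client against a public sample table and prints the estimated bytes processed with and without a LIMIT, which come out the same.

```python
from google.cloud import bigquery

client = bigquery.Client()
dry_run = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

base_sql = "SELECT word, word_count FROM `bigquery-public-data.samples.shakespeare`"

for sql in (base_sql, base_sql + " LIMIT 10"):
    job = client.query(sql, job_config=dry_run)
    # A dry run estimates the bytes that would be scanned without running the query.
    print(f"{sql!r} -> {job.total_bytes_processed / 1e6:.2f} MB scanned")
```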

Essentially, we discover that their requirements transcend the conventional boundaries of On-Line Analytical Processing (OLAP) and On-Line Transaction Processing (OLTP), necessitating a solution that can handle high-volume transactions with the analytical capabilities of a data warehouse: a hybrid approach, also called Hybrid Transactional/Analytical Processing (HTAP).

“HTAP? Sounds like something my doctor would say! Alright, you’ve got my attention. Explain this… HTAP thingamajig. And keep it simple, I’m not a tech whiz!”

An HTAP database is like a Swiss Army knife for clients’ data. It’s designed to handle both high-frequency transactions and the big-picture analysis that helps clients make smart business decisions. This dual functionality, under certain conditions, eliminates the need for separate transactional and analytical databases, simplifying data management and reducing infrastructure complexity.

According to Gartner,

“HTAP architectures support the needs of many new application use cases in the digital business era, which require scalability and real-time performance.”

Traditionally, bridging the gap between operational data and analytical insights involved a cumbersome process. Data from OLTP systems, responsible for handling day-to-day transactions, had to be extracted, transformed, and loaded (ETL) into separate OLAP systems designed for analytical processing. An ETL process is often composed of complex data pipelines and intricate transformations, which delays the ability to gain timely insights from operational data.

By enabling both transactional and analytical processing within a single unified system, HTAP eliminates the need to create such pipelines, allowing clients to centralize their data and providing a single source of truth for both operational and analytical workloads.
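
As a conceptual sketch (the connection string and table are hypothetical, and any PostgreSQL-compatible HTAP database could sit behind the connection), the same endpoint serves both sides of the workload, with no pipeline in between:

```python
import psycopg2

# One connection string, one system of record: the same database serves
# the transactional write and the analytical read (hypothetical schema).
conn = psycopg2.connect("host=10.0.0.5 dbname=shop user=app password=secret")

with conn, conn.cursor() as cur:
    # OLTP side: a small, targeted transactional write.
    cur.execute(
        "INSERT INTO orders (customer_id, region, amount) VALUES (%s, %s, %s)",
        (42, "EMEA", 99.90),
    )

with conn, conn.cursor() as cur:
    # OLAP side: an aggregate over the very same, always-fresh data.
    cur.execute(
        "SELECT region, COUNT(*) AS orders, SUM(amount) AS revenue "
        "FROM orders GROUP BY region ORDER BY revenue DESC"
    )
    for region, orders, revenue in cur.fetchall():
        print(region, orders, revenue)

conn.close()
```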

You may think: “Yes, but what about performance?”

In recent years, HTAP systems have undergone significant advancements, achieving remarkable efficiency and performance levels. This surge in capability has fueled their growing popularity, driven in part by valuable contributions from the open-source community, such as PingCAP’s paper, which gave a big boost to this kind of solution.

And, you know what? There is an amazing HTAP database in Google Cloud too: AlloyDB.

“Alloy? Now that’s something I know a little about! Pretty sure my car’s made of that stuff, right? Makes it nice and sturdy. But hold on… you’re talking about something else, aren’t you?”

(No, that’s just me exaggerating things)

AlloyDB is a fully managed database service from Google Cloud that provides a compelling option for organizations seeking a high-performance, 100% PostgreSQL-compatible database that can be deployed anywhere, whether the infrastructure resides in the cloud or on-premises (with AlloyDB Omni).

Google recognized the limitations of traditional databases in handling the scale, performance, and analytical demands of today’s applications. Leveraging its deep expertise in distributed systems, high-performance computing, and artificial intelligence, Google engineered AlloyDB to address these challenges by combining the familiarity and ecosystem of PostgreSQL with the best of Google’s technology.

In terms of performance, Google achieved impressive results by decoupling the storage and compute layers and leveraging Google’s foundational storage system, Colossus (yes, the same storage layer behind BigQuery). But AlloyDB is not merely a “Postgres Enterprise” solution; it’s more akin to “Postgres on steroids” (a term playfully borrowed from Airflow’s definition of “CRON on steroids”, and yes, I’m an Airflow fan too).

Indeed, AlloyDB includes a columnar engine that dynamically adapts to different workload characteristics. And the results, according to Google’s benchmarks, are stunning: AlloyDB is more than 4x faster than standard PostgreSQL for transactional workloads and up to 100x faster for analytical ones, which is perfect for hybrid scenarios like the one we were discussing!
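
To give a feel for how this looks in practice, here is a minimal sketch based on my reading of the AlloyDB documentation: the columnar engine is switched on at the instance level, and individual tables (hypothetical names below) can then be added to the column store, after which analytical scans are accelerated transparently. Treat the flag and function names as assumptions to verify against the current docs for your version.

```python
import psycopg2

# Connect to an AlloyDB instance over the standard PostgreSQL wire protocol.
conn = psycopg2.connect("host=10.0.0.5 dbname=shop user=app password=secret")

with conn, conn.cursor() as cur:
    # ASSUMPTION: the columnar engine is enabled on the instance via the
    # `google_columnar_engine.enabled` database flag (set in the instance
    # configuration, not in SQL). Here we ask for a table to be populated
    # into the column store; the function name is taken from the AlloyDB
    # docs as I recall them and should be double-checked.
    cur.execute("SELECT google_columnar_engine_add('orders')")

with conn, conn.cursor() as cur:
    # The same analytical query as before: when the relevant columns live in
    # the columnar store, AlloyDB can serve the scan from it automatically.
    cur.execute("EXPLAIN SELECT region, SUM(amount) FROM orders GROUP BY region")
    print("\n".join(row[0] for row in cur.fetchall()))

conn.close()
```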

“That’s great and all. But is this thing actually future-proof or are we going to be stuck with a pile of junk in a year? I don’t want to waste my time and money on something that’s going to be obsolete before I even figure out how to use it.”

AlloyDB has already embraced the GenAI era, simplifying and enhancing many aspects of database interaction for both database users and administrators.

Gemini, for instance, empowers users to write SQL queries using natural language, making data accessible to a wider audience. Database administrators, on the other hand, can leverage AI-powered features that provide insights into daily operations, such as index recommendations, suggesting new indexes that could potentially enhance overall performance.

In general, with its foundation in PostgreSQL, AlloyDB benefits from a unique advantage: it rapidly integrates innovations coming from the vibrant PostgreSQL community, further enhanced by Google’s own advancements.

A recent example of this collaborative innovation is the enhancement of PostgreSQL with the ability to efficiently manage vectors, enabling storage, querying, and indexing using Hierarchical Navigable Small World (HNSW) indexes.

In the GenAI era, capturing the meaning of data like images, text, or audio involves transforming that information into numerical “vectors.” These vectors represent the data in a high-dimensional space, effectively capturing its essence. Databases capable of performing vector similarity search can then identify related items, enabling applications to provide recommendations from unstructured data and build Retrieval Augmented Generation (RAG) systems.
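
Here is a minimal sketch of what this looks like with the pgvector extension on a PostgreSQL-compatible database such as AlloyDB; the table name and the tiny 3-dimensional embeddings are made up for brevity (real embeddings typically have hundreds of dimensions produced by an embedding model).

```python
import psycopg2

conn = psycopg2.connect("host=10.0.0.5 dbname=shop user=app password=secret")

with conn, conn.cursor() as cur:
    # Enable vector support and create a table with an embedding column.
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
    cur.execute(
        "CREATE TABLE IF NOT EXISTS documents ("
        " id BIGSERIAL PRIMARY KEY, body TEXT, embedding vector(3))"
    )
    # HNSW index for fast approximate nearest-neighbour search (L2 distance).
    cur.execute(
        "CREATE INDEX IF NOT EXISTS documents_embedding_hnsw "
        "ON documents USING hnsw (embedding vector_l2_ops)"
    )
    cur.execute(
        "INSERT INTO documents (body, embedding) VALUES "
        "('hi-fi speakers', '[0.9, 0.1, 0.0]'), "
        "('vinyl records',  '[0.8, 0.2, 0.1]'), "
        "('cloud invoices', '[0.0, 0.1, 0.9]')"
    )

with conn, conn.cursor() as cur:
    # Retrieve the documents closest to a (hypothetical) query embedding,
    # e.g. as the retrieval step of a RAG system.
    cur.execute(
        "SELECT body FROM documents "
        "ORDER BY embedding <-> '[0.85, 0.15, 0.05]' LIMIT 2"
    )
    print([row[0] for row in cur.fetchall()])

conn.close()
```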

Google engineers have significantly optimized this functionality within AlloyDB, achieving remarkable performance gains for vector similarity search. The result is that, currently, AlloyDB delivers up to 4x faster vector queries, accelerates index creation by up to 8x, and significantly reduces memory consumption compared to standard PostgreSQL. Again, that’s great from a user perspective.

Overall, Google is heavily investing in AlloyDB, continuously improving its features and capabilities.

Notably, this powerful engine is also available for deployment in private environments through AlloyDB Omni. Omni even allows organizations to connect their own locally-hosted LLMs, enabling the creation of truly private AI solutions. But that’s another story…

…Stay tuned for more
