GOOGLE CLOUD PLATFORM — A NEXTGEN CLOUD TECH STACK

PRABHUKKARTHI STB
Google Cloud - Community
7 min read · Jul 22, 2021

Today’s Google Cloud Platform (GCP) did not merely evolve on the fly as just another cloud platform on the market!

It embodies the same world-class, in-house technology that powers Google’s widely used global products (Search, YouTube, Gmail, etc.), technology with high-end capabilities that has been battle-tested for over a decade. Google entered the cloud market relatively recently by making these in-house technologies available to the public as GCP, but it arrived with many advanced capabilities that one could never have imagined!

To date, more than 228 GCP services/Google APIs are part of the GCP technology stack provided by Google, and new services keep emerging.


GOOGLE BIGQUERY

Of the various GCP services available, the most distinctive is Google BigQuery, considered to be the next-gen data warehouse giant evolving in the market. Let’s look at the features of Google BigQuery and see how it earns that reputation.

Google BigQuery is a cloud-based analytical data warehouse solution offering the following advanced capabilities:

Serverless Service

To use BigQuery, there is no need to procure or install any storage servers. It is a serverless architecture, fully managed by GCP in terms of allocating storage and computing power. The only user action required is to enable the BigQuery API in an existing GCP project; for new projects, it is enabled automatically.

Auto-scalable Feature

BigQuery can scale to petabytes of storage and matching computation. Based on the inbound data volume, BigQuery automatically allocates the necessary storage, and based on the amount of data processed, it allocates computational power as slots (units of CPU and memory).

No-Ops Overhead

There is no downtime for operational activities such as software patch upgrades, cleanup of stale data storage, etc. BigQuery carries out all such activities without affecting the workloads running at the time.

High Availability

BigQuery replicates the data stored in Colossus across regional data centers, so that even if the primary data center goes down completely, another data center acts as a failover replica and serves client requests without any disruption.

Pay As You Go Model

Google adheres to a pay-as-you-go pricing strategy: you pay only for what you use, with fine-grained (seconds-level) billing. For BigQuery pricing, storage and processing are dealt with separately: storage is charged by the volume of data stored, and computation by the volume of data processed. This model is more customer-centric than vendor pricing strategies that require procuring resources up front, because it avoids ultimately paying for unused capacity.

Batch Ingestion

BigQuery allows batch ingestion, which brings data into BigQuery in batches free of charge. This can be done using SQL queries or bq CLI commands, making it economical for clients to perform migrations or to load daily incremental batch data for analytical purposes. In such cases, the charge applies only to storage and processing.
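As a minimal sketch, a batch load can be expressed with the LOAD DATA SQL statement (the dataset, table, and bucket names below are hypothetical):

```sql
-- Load CSV files from Cloud Storage into a BigQuery table in one batch job.
-- The batch load itself is free; only the resulting storage is billed.
LOAD DATA INTO mydataset.sales
FROM FILES (
  format = 'CSV',
  skip_leading_rows = 1,
  uris = ['gs://my-bucket/sales/*.csv']
);
```

The bq CLI equivalent would be along the lines of `bq load --source_format=CSV --skip_leading_rows=1 mydataset.sales 'gs://my-bucket/sales/*.csv'`.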

Streaming Ingestion

BigQuery also allows streaming ingestion, which brings data into BigQuery continuously but is chargeable. This can be done through the BigQuery API; in other scenarios, data can flow from Pub/Sub through a Dataflow streaming pipeline into BigQuery as the data sink. Streaming inserts support real-time data processing and ML forecasting. In these cases, charges apply for streaming, storage, and processing.

Data Transfers

BigQuery supports scheduled data transfers into BigQuery from various sources like Cloud Storage and YouTube, and even from other cloud resources like Amazon S3, Teradata, Redshift, etc. It also supports data backfills to cover outages or gaps during data migration.

CAPACITOR — Self-Optimizing Storage Technology

BigQuery has an optimized storage technology, Capacitor, a columnar format with a self-optimizing feature that is tuned to the incoming data and stores it in Colossus, Google’s compressed, distributed file system. It uses a variety of encoding techniques as well as encryption to achieve this. Capacitor continually re-tunes itself to new data, updating the storage mechanism and realigning the data stored in Colossus accordingly.

Structs & Arrays

Leveraging BigQuery’s storage technology, it introduces special data types, namely structs and arrays. Structs are containers, effectively tables nested within the main table, which make it possible to avoid joins and achieve better performance than traditional data warehouses. Arrays are repeated fields that store multiple values within a single column, optimizing storage by preventing static column values from being duplicated. Using UNNEST in queries, we can flatten the nested data and retrieve it.
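A small illustrative query (with made-up data) showing an ARRAY of STRUCTs and how UNNEST flattens it:

```sql
-- One order row holds its line items as an ARRAY of STRUCTs (no join needed).
WITH orders AS (
  SELECT 1 AS order_id,
         [STRUCT('widget' AS item, 2 AS qty),
          STRUCT('gadget' AS item, 1 AS qty)] AS line_items
)
-- UNNEST flattens the repeated field back into one row per line item.
SELECT order_id, li.item, li.qty
FROM orders, UNNEST(line_items) AS li;
```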

Geographical Data

BigQuery supports geolocation data, offers various geography functions to operate on it, and even has a dedicated data type, GEOGRAPHY, to store it.
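For instance, the built-in geography functions can compute the distance between two points (coordinates below are illustrative):

```sql
-- ST_GEOGPOINT takes (longitude, latitude); ST_DISTANCE returns meters.
SELECT ST_DISTANCE(
  ST_GEOGPOINT(-0.1276, 51.5072),  -- London
  ST_GEOGPOINT(2.3522, 48.8566)    -- Paris
) AS distance_meters;
```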

Scheduled Queries

BigQuery has its own scheduler for scheduling queries and workloads. Scheduled queries use standard SQL and support a minimum scheduling frequency of 15 minutes.

Federated Queries

BigQuery allows federated queries against external sources, i.e., parts of a request can be executed outside BigQuery, whether against Cloud Storage or traditional databases, by establishing a connection and using the EXTERNAL_QUERY function.
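As a sketch, a federated query against a Cloud SQL database might look like this (the project and connection names are hypothetical, and the connection resource must be created beforehand):

```sql
-- Run a query in an external Cloud SQL database through a connection
-- resource, then use the result inside BigQuery like any other table.
SELECT *
FROM EXTERNAL_QUERY(
  'projects/my-project/locations/us/connections/my-cloudsql-conn',
  'SELECT customer_id, created_at FROM customers;'
);
```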

Scripting

Like other traditional data warehouses, BigQuery also supports scripting, including user-defined functions (UDFs) and procedures. A UDF can be written in SQL or JavaScript; it takes input, performs some operation, and returns output, and it can be either temporary or persistent. Procedures are containers for multiple SQL statements, including DML, run sequentially; procedures are always persistent.
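A minimal sketch of a temporary SQL UDF (the function and column names are made up):

```sql
-- A temporary SQL UDF: it exists only for the duration of the session.
CREATE TEMP FUNCTION add_vat(price FLOAT64)
RETURNS FLOAT64
AS (price * 1.2);

SELECT add_vat(100.0) AS gross_price;

-- A persistent JavaScript UDF would instead be created with
-- CREATE FUNCTION mydataset.my_udf(x FLOAT64) RETURNS FLOAT64
--   LANGUAGE js AS "return x * 2;";
```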

BI Engine For Reporting

Reporting tools such as Data Studio, Looker, Tableau, etc., can query their BigQuery data sources through a special reporting feature called BI Engine, which uses in-memory storage to execute queries and return results in seconds. To use this feature, you must reserve the in-memory capacity in advance, which is chargeable.

Query Cache

BigQuery writes query output to a cache table. The result remains available for 24 hours, and if the same query is executed again within that window, the result is served from the cache free of processing charges, provided certain conditions are met (the query text is identical, the data in the underlying table has not changed in the meantime, etc.). This avoids repeat computation charges for the same result set.

Secured Access Control

BigQuery supports secure access controls ranging from coarse to very fine grained: at the organization, folder, project, dataset, and table levels, and more recently at the column and row level. For column- and row-level access controls, even users with higher-level permissions always require special permission to access the protected data.
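Row-level security can be sketched with a row access policy like the following (the table, policy, and group names are hypothetical):

```sql
-- Only members of the granted group see rows where region = 'US';
-- for everyone else, those rows are silently filtered out of queries.
CREATE ROW ACCESS POLICY us_analysts_only
ON mydataset.orders
GRANT TO ('group:us-analysts@example.com')
FILTER USING (region = 'US');
```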

BigQuery ML

BigQuery is unique among data warehouses in that it lets you build and train machine learning (ML) models inside the warehouse instead of switching to a separate ML stack. Models are trained where the data is actually stored, eliminating the data transfer and latency involved. BigQuery currently supports the common ML model types and offers user-friendly SQL to construct models in a matter of hours.
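As a sketch of the SQL-only workflow (dataset, table, and column names are hypothetical):

```sql
-- Train a linear regression model directly on warehouse data;
-- 'sales' is the label column the model learns to predict.
CREATE OR REPLACE MODEL mydataset.sales_forecast
OPTIONS (model_type = 'linear_reg', input_label_cols = ['sales']) AS
SELECT ad_spend, store_id, sales
FROM mydataset.sales_history;

-- Once training completes, predict with ML.PREDICT:
SELECT *
FROM ML.PREDICT(
  MODEL mydataset.sales_forecast,
  (SELECT ad_spend, store_id FROM mydataset.new_periods)
);
```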

BigQuery Omni

A key hallmark is BigQuery Omni, which supports a hybrid, multi-cloud model. Leveraging BigQuery’s separation of compute and storage, it can analyze data residing in other cloud platforms such as AWS and Azure, with the help of Anthos, without carrying out any data transfer. Only processing costs are incurred under this approach, which opens the door to multi-cloud solutions!

Jupiter Network

Google’s backbone, Jupiter, is a petabit-scale network that helps make BigQuery a unique data warehouse. Thanks to this scale, data transfer and retrieval, whether between GCP services or across hybrid cloud setups, are faster than one might imagine, a significant advancement over other cloud platforms!

BigQuery Reservations

BigQuery Reservations help in switching pricing strategies. By default, pricing is on-demand; this feature lets customers buy slots at a flat rate instead, making their billing predictable and held at an expected level. Customers may switch between the two pricing models over time and have the flexibility to apply them only to specific projects or workloads as required.

Admin Resource Charts

BigQuery provides a new monitoring feature, Admin Resource Charts, which shows administrators slot usage, job concurrency, and job performance within a given time window at various levels: organization, folder, sub-folder, project, user, or job. Normally, Cloud Logging/Monitoring only visualizes project-level statistics in the Cloud console; this feature removes that barrier and provides a holistic view, helping BigQuery administrators plan capacity economically. The statistics are displayed in real time, with 14 days of history retained. This functionality is available only to BigQuery Reservations customers.

Conclusion

As we have seen, BigQuery offers several advanced capabilities aimed at cost-effective solutions. For each feature, Google has been deliberate and reasonable about what can be offered economically and what must be charged for. On the whole, BigQuery is a cost-effective data warehouse with advanced features that other clouds have yet to match, and it is always evolving; in particular, BigQuery Omni and BigQuery ML, supporting hybrid cloud over Jupiter’s premium network, prove it to be the next-gen data warehouse. This unique Google BigQuery service is a testament to the power of the GCP platform!

In my opinion, as the marketplace moves toward hybrid cloud solutions, these advances and this modernization position GCP to grow strongly, overtake the market, and dominate much of the cloud business in the near future!

Skilled GCP Engineers will remain in demand for at least a few decades in the marketplace!



Senior AVP - GCP Business Consulting at HSBC || Google Cloud Certified Professional Data Engineer || Master of Technology — Data Analytics From BITS Pilani