Part 4 – The technology behind modern data analytical platforms

Nikhil (Srikrishna) Challa
Published in Data Knowledge Hub · Jul 25, 2020

This is part 4 of the series, “Modernising a Data Platform and BigQuery mastery”. In this part, we will discuss the underlying technology of Google’s BigQuery and a few important concepts that contribute immensely to categorising BigQuery as a modern-day data warehouse.

A quick recap: in the first three parts of the series, we focussed on the concept of modernisation, data warehouse modelling and fundamentals, and the characteristics of a modernised data platform.

Distributed computing is a powerful mechanism that improves the efficiency and performance of a job. It reminds me of a maths problem from primary school: if one person takes 10 days to complete a job, how much time will 10 people of equal capacity and capability take to complete the same job? That is exactly how distributed computing harnesses the power of multiple computational units.

These computational units may share the same geography or may be spread across different geographical areas depending on the requirements, which means my servers can be spread across the UK, the USA and Asia Pacific.

The concepts of parallel processing and distributed computing can be applied both to “storage”, where files of humongous size are stored across multiple servers, and to “compute”, where processing loads are distributed across multiple servers.

In a modern data platform, the most important characteristic is agility. In previous articles we briefly discussed how the scale of data has increased enormously over time. Organisational data ranging from hundreds of terabytes to petabytes has become the new normal. We need massive, distributed storage mechanisms as well as equally efficient, high-performing processing engines to deal with such volumes.

Modern analytical platforms bring these two together: data is stored on distributed file systems while distributed processing engines process the queries, helping us analyse the data quickly.

Data platforms like Google BigQuery, AWS Athena and Microsoft Synapse Analytics offer rich features such as interactive SQL, the ability to store petabytes of data and very fast analytical processing.

A little maths around this:

If the requirement for an on-premise data platform is to process 4 TB of data, which is around 100 billion rows, and return results within seconds, it requires setting up a distributed computing architecture with parallel processing capabilities. The composition would look like a minimum of 300+ hard drives at 100 MB/s disk speed, a 300-gigabit network and 3,000+ cores (which is like 750 four-core laptops), PLUS the operational overhead: availability, failure recovery, replication of data and so on.
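To make the shape of that calculation concrete, here is a tiny back-of-envelope sketch in Python. The time budget and per-drive throughput are illustrative assumptions, not a sizing guide; the point is that the drive count scales linearly with data size and inversely with the time budget.

```python
# Back-of-envelope sizing sketch. All inputs are illustrative assumptions.
DATA_TB = 4                # data to scan
TIME_BUDGET_S = 120        # assumed scan-time budget
DRIVE_MB_PER_S = 100       # assumed per-drive sequential throughput

required_mb_per_s = DATA_TB * 1_000_000 / TIME_BUDGET_S
drives_needed = required_mb_per_s / DRIVE_MB_PER_S

print(f"Aggregate throughput: {required_mb_per_s:,.0f} MB/s")
print(f"Drives needed at {DRIVE_MB_PER_S} MB/s each: ~{drives_needed:,.0f}")
```

With these assumed inputs the answer lands near the 300-drive figure above; shrink the time budget towards seconds and the hardware bill grows accordingly.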

Cloud service providers like Google, AWS and Microsoft take care of all of the above; users like us just need to write the SQL query and click the “Run Query” button on the console. Isn’t that modern? Serverless, no ops!
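The same experience from code is just as thin. A minimal sketch, assuming the google-cloud-bigquery client library is installed and application-default credentials point at a GCP project; the public usa_names dataset is used purely as an example:

```python
from google.cloud import bigquery

# All the distributed machinery is invisible behind one call.
client = bigquery.Client()  # uses application-default credentials

sql = """
    SELECT name, SUM(number) AS total
    FROM `bigquery-public-data.usa_names.usa_1910_2013`
    GROUP BY name
    ORDER BY total DESC
    LIMIT 5
"""

# No cluster to provision, size or patch: submit SQL, iterate over rows.
for row in client.query(sql).result():
    print(row.name, row.total)
```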

Google, Amazon and Microsoft let us use all the resources needed for our analytical workloads, for a price (pricing aspects are out of the scope of this article).

Now we know how resource-intensive big data analytical processing is.

The top CSPs deal with it in different ways, but the core principles have always remained the same. AWS Athena uses a technology called Presto, which was developed and open-sourced by Facebook. Similarly, Google externalised its internal query processing engine, Dremel, with a user interface in the form of BigQuery, and the same is the case with Microsoft Synapse Analytics, which runs on parallel processing and distributed technology.

What’s common between all three is that they are serverless and help avoid operational overhead.

Let us focus on Google’s BigQuery to retain the context!

Google uses four different technologies to achieve this performance and display the characteristics of a modern data platform.

1. Colossus — a distributed file system to store massive amounts of data. The hierarchy goes like this: a server is a node/machine that has multiple cores and CPUs; many servers together form a rack, many racks together form a cluster, and many clusters together form a data centre. Google has multiple data centres across the globe.

Google uses all of these to store massive data in a distributed fashion, typically replicated 3x to avoid any data loss due to a node, rack or cluster failure. This ensures that the data has 99.999% availability; in other words, the chance of losing our data is 0.001%.
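A toy probability model shows why replication helps so much. Assuming, purely for illustration, that each replica is lost independently with some small probability, data is gone only if all three copies fail together; the 2% figure below is an invented input, not a Google number.

```python
# Toy model: with 3x replication, data is lost only if every copy is lost.
# Assumes independent failures; the per-copy probability is invented.
p_copy_lost = 0.02

p_data_lost = p_copy_lost ** 3
print(f"Chance of losing all three copies: {p_data_lost:.4%}")
# -> 0.0008%, the same ballpark as the 0.001% quoted above
```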

2. Jupiter — architectures like Hadoop rely on “data locality”, which means having data storage and processing on the same node (in simple words, I store the data and process it on the same computer). This is normally done to reduce latency, because the data is co-located on the same machine. But BigQuery does the opposite: it decouples storage and compute. The data is stored on one node while compute happens on another. This is mainly to increase availability and to ensure that the data is not lost even if the machine where the compute happens crashes or becomes non-functional. However, it means the data needs to be brought to the machine where processing happens every time we perform an operation. That calls for a massive transfer of data across machines.

To achieve this, the network has to be super-efficient. Google’s network (a.k.a. Jupiter) is capable of exactly that: to put things in perspective, the Jupiter network can easily let 100,000 machines transfer data at 10 Gbps each at a given point in time, an aggregate bisection bandwidth of roughly a petabit per second.

So the data can always stay in storage, and BigQuery’s analytical processing engine can read it directly from storage every time a user runs a SQL query.

3. Borg — in a distributed computing architecture, when we are dealing with thousands of cores and CPUs, someone/something needs to manage them (a team leader). That’s Borg. It allocates resources to a job, and not only that, it also takes care of failovers, power failures, network drops and machine crashes, and ensures nothing impacts the user who clicked “Run Query”.

In Google’s own words, the user who initiated a job or submitted a query will never get to know what happened in the background, even if someone pulled out a plug and took down an entire cluster.

All of this efficient resource management is made possible by Borg.
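To get a feel for what that transparency means, here is a toy sketch (invented shard counts and failure odds; this is not how Borg is implemented) of a scheduler that reruns failed work so the caller never sees the crash.

```python
import random

def run_on_worker(shard: int) -> int:
    """Pretend to process a shard; crash ~20% of the time like a flaky machine."""
    if random.random() < 0.2:
        raise RuntimeError(f"worker for shard {shard} crashed")
    return shard * 10  # dummy partial result

def schedule(shards, max_retries: int = 5):
    """Run every shard, transparently rescheduling failures."""
    results = {}
    for shard in shards:
        for _ in range(max_retries):
            try:
                results[shard] = run_on_worker(shard)
                break  # shard done; the user never learns about the crashes
            except RuntimeError:
                continue  # "reschedule on another machine" and try again
        else:
            raise RuntimeError(f"shard {shard} failed {max_retries} times")
    return results

print(schedule(range(8)))
```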

4. Dremel — BigQuery is Dremel and Dremel is BigQuery. BigQuery is just an externalisation of Dremel, which Google’s internal users have been operating on, querying petabytes of data and running thousands of queries every day.

What makes technologies like Dremel so fast?

They use columnar storage, which is the complete opposite of transactional processing systems, which use row-based storage.

Let us see how row-based storage and column-based storage are fundamentally different.

Data is stored in blocks on a disk. When an operation is performed, the operating system (OS) reads the data from these blocks and loads it into memory, which makes any subsequent operation on that block very fast and easy.

Just like our brains take a bit more time processing this article on a first read, a second read is much easier to interpret because all the content is already in your subconscious memory; hence I encourage you to read it as many times as you can ;)

In row-based storage, the block of data will be a row; however, in analytical systems like Dremel (BigQuery), which use columnar storage, the block of data will be a column.

When the OS loads data into memory, unlike in an RDBMS where a row is loaded, in a columnar storage model the entire column is loaded, making it more efficient for analytical processing. Hence row-based storage is good for CRUD operations in relational databases, while columnar storage is good for analytics, aggregations and so on.

With columnar storage, only the necessary columns are processed, not the entire list of attributes, as the sketch below illustrates.
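A minimal sketch of the difference, using an invented two-column table: in a row store a record’s attributes live together, while a column store keeps each attribute contiguous, so an aggregate touches only the column it needs.

```python
# Invented table: (region, amount) rows.
rows = [("uk", 120), ("us", 300), ("uk", 90), ("apac", 40)]

# Row store: whole records together. Summing 'amount' still drags
# 'region' values through memory with every record.
row_store = rows

# Column store: each attribute contiguous. Summing 'amount' reads
# one array and never touches 'region'.
col_store = {
    "region": [region for region, _ in rows],
    "amount": [amount for _, amount in rows],
}

print(sum(col_store["amount"]))  # 550
```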

(Never use SELECT * if you want to save money on cloud-based analytical processing engines.)
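One way to see that cost difference before spending anything is BigQuery’s dry-run mode, which reports the bytes a query would scan (the basis of on-demand pricing). This sketch again assumes the google-cloud-bigquery library; `project.dataset.table` is a hypothetical placeholder.

```python
from google.cloud import bigquery

client = bigquery.Client()
cfg = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)

# Hypothetical table: SELECT * scans every column's storage,
# while selecting one column scans only that column.
wide = client.query("SELECT * FROM `project.dataset.table`", job_config=cfg)
narrow = client.query("SELECT a FROM `project.dataset.table`", job_config=cfg)

print("SELECT * would scan:", wide.total_bytes_processed, "bytes")
print("SELECT a would scan:", narrow.total_bytes_processed, "bytes")
```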

Another advantage is that compression is more efficient compared to row-based storage. Anybody who is aware of data warehouse designs will know that the underlying design of a DWH is typically column-oriented; what makes Dremel super-efficient, however, is that it uses distributed computing on top of column-based storage.
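A toy illustration of why columns compress well: values of one attribute sit next to each other and tend to repeat, so even naive run-length encoding shrinks them (invented data; real engines use far more sophisticated codecs).

```python
from itertools import groupby

# A column of repeated values, as a column store would lay them out.
country = ["uk", "uk", "uk", "us", "us", "apac"]

# Naive run-length encoding: (value, run length) pairs.
encoded = [(value, sum(1 for _ in run)) for value, run in groupby(country)]
print(encoded)  # [('uk', 3), ('us', 2), ('apac', 1)]
```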

The magic is in the way it distributes the query across multiple computational units (thousands of them!) and collects the results to display. This is called the tree architecture.

It uses a master/slave mechanism, where a root server receives the query the user submits. The root server reads the metadata of the tables and passes the query on to the next layer.

Ex: Select A, Count(B) from Table T Group by A is the query submitted by the user. The root server receives the query and sends it to ’n’ nodes. Let us say the query has been sent to 1,000 computational units. All of them access the data on Google’s file system, which is common across the nodes. The layer that connects to the storage file system is the “leaf nodes”, and that connectivity happens via the Jupiter network.

Each node runs the query on its own against an independent partition of table T, as assigned by the root server.

Each node produces an output, and all the outputs are aggregated on the way up.

The output of each node would look like:

Output(Node1) = Select A, Count(B) as c from Partition P1 Group by A

Output(Node2) = Select A, Count(B) as c from Partition P2 Group by A

…

Output(Node N) = Select A, Count(B) as c from Partition Pn Group by A

The aggregation query looks like the one below. It just merges the output from each node and displays the result to the user. (Recall the primary-school maths problem quoted at the start of this article: a job done by 1 person takes 10 days, and if it is done by 10 people, it finishes in just 1 day.)

Select A, Sum(c) from UnionAll (Output(Node1) … Output(Node N)) Group by A.
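Putting the whole tree together in a toy Python simulation (invented partitions; `Counter` stands in for the leaf-level Group by): each leaf computes a partial count over its partition and the root merges the partials, exactly the rewrite shown above.

```python
from collections import Counter

# Invented partitions of table T, each a list of (A, B) pairs.
partitions = [
    [("x", 1), ("y", 2), ("x", 3)],   # partition P1
    [("y", 4), ("y", 5)],             # partition P2
    [("x", 6), ("z", 7)],             # partition Pn
]

def leaf(partition):
    """Select A, Count(B) as c from Partition Group by A"""
    return Counter(a for a, _ in partition)

def root(partials):
    """Select A, Sum(c) from UnionAll(partials) Group by A"""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)

print(root(leaf(p) for p in partitions))
# {'x': 3, 'y': 3, 'z': 1}
```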

Hope you enjoyed learning about the underlying technology of modernised data warehouses like Google BigQuery. We will learn more about the concepts of Google BigQuery in the next parts of this series.

There is a lot of similarity between how Dremel operates and how MapReduce, a renowned big data framework for large batch processing, works on Hadoop. But that’s for another article.

Happy Learning!

References: Melnik et al., “Dremel: Interactive Analysis of Web-Scale Datasets”, http://research.google.com/pubs/pub36632.html
