GCP Crash Course: Big Data

Basic Concepts

Big Data may be summarized as:

  • Petabyte-scale [semi-]structured storage and organization
  • Fast query retrieval (in [milli]seconds)
  • Pipeline Processing → Batch and Streaming

Pipeline !?! Streaming !?!

If you have to perform complex processing on data, you may write one big procedure (the old-fashioned way) and run it. The problems with this approach are:

  • You probably have to run it on a single system → it may last hours or days
  • If something goes wrong you have to start it all over again…and cry out loud!
  • The longer the procedure, the more likely it is to contain errors

So, just use the classical Map-Reduce paradigm: lots of VMs, distributed storage, and small tasks that can be computed one by one…in a parallel pipeline!
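Just to fix the idea, here is a toy Python sketch of the map/reduce pattern (purely illustrative and not GCP-specific: the input lines and the three "VMs" are made up):

    # Toy MapReduce: split the input into small chunks, count words in each chunk
    # in parallel (map), then merge the partial counts (reduce).
    from collections import Counter
    from functools import reduce
    from multiprocessing import Pool

    def map_count(chunk):
        # Map step: each worker counts the words in its own small chunk of lines.
        return Counter(word for line in chunk for word in line.split())

    if __name__ == "__main__":
        lines = ["big data is big", "data pipelines love parallel work", "big parallel pipelines"]
        chunks = [lines[i::3] for i in range(3)]  # pretend each chunk lives on a different VM
        with Pool(3) as pool:
            partial_counts = pool.map(map_count, chunks)
        # Reduce step: merge the partial results into one word count.
        totals = reduce(lambda a, b: a + b, partial_counts, Counter())
        print(totals.most_common(3))

That is the whole point of the paradigm: when a small task fails, a real framework retries just that task instead of the entire job.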
Usually, all computations on data are performed using databases, which means → historical data, even if only a couple of hours old.
But now it is possible to catch data in flight, that is, to surf the tide from its very beginning!
For example, streaming systems can notice an event from its first signs: tsunamis, traffic jams, the commercial success (or flop) of a product/event, and so on.
Google is simply the best at handling DATA; it is part of its declared mission and scope.

The technologies to focus on are:

BigQuery:

Bigtable:

Dataflow + Pub/Sub:

Ask yourself

Think about the possible applications of real-time analytics:

  • What if systems became capable of detecting events just before, or soon after, they happen?
  • Why would you need both historical and real-time data?
  • And what is the role of ML in all of this?

Cheatsheet

For most doubts you may refer to the Building Blocks doc.

Officially, the Big Data products are:

BigQuery: Analytics, or Data Warehouse, or OLAP, as you prefer. It is a DB that is not really performant for writes (usually made in batches, though it supports streaming, too), but it is a champion at querying over lots of data. It doesn't use indexes, but it can activate fleets of servers to select the data.

Storage and storing are inexpensive…queries are paid by weight (TB scanned).
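To give a taste of how this looks from code, here is a minimal sketch using the google-cloud-bigquery Python client against one of Google's public sample tables (the query itself is only an example, and it assumes your default project and credentials are already configured):

    # Minimal BigQuery query with the Python client library.
    from google.cloud import bigquery

    client = bigquery.Client()  # picks up default project and credentials

    query = """
        SELECT corpus, SUM(word_count) AS total_words
        FROM `bigquery-public-data.samples.shakespeare`
        GROUP BY corpus
        ORDER BY total_words DESC
        LIMIT 5
    """

    # No indexes to tune: BigQuery throws its own fleet of workers at the scan.
    for row in client.query(query).result():
        print(row.corpus, row.total_words)

Since you pay for the bytes scanned, select only the columns you actually need.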

Dataflow + Pub/Sub: two different products that often work together. Pub/Sub is serverless, large-scale, real-time messaging, used to catch data at the very moment it is generated. Dataflow is based on Apache Beam and it is a powerful way to process big data in steps. It doesn't store data (it uses Cloud Storage or BigQuery for that).
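Here is a minimal sketch of how the two fit together, using the Apache Beam Python SDK (the project, topic, and table names are placeholders, and the destination table is assumed to exist already):

    # Streaming pipeline: read messages from Pub/Sub, parse them, write to BigQuery.
    # Runs locally with the DirectRunner; add runner='DataflowRunner' to run on Dataflow.
    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (p
         | "ReadFromPubSub"  >> beam.io.ReadFromPubSub(topic="projects/my-project/topics/events")
         | "ParseJson"       >> beam.Map(lambda msg: json.loads(msg.decode("utf-8")))
         | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
               "my-project:my_dataset.events",
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))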

Cloud Dataproc: the Hadoop ecosystem in the cloud. If you are already using Hadoop, it is a lift-and-shift solution. Otherwise, Google advises using Dataflow and BigQuery.

Datalab, Dataprep, and Data Studio are utilities, more or less. Be aware of the names: it is legitimate to think that someone is really lacking in imagination…..

Bigtable: officially listed as a NoSQL database, but it is the product to use for IoT data ingestion and for large quantities of semi-structured time-series data. It is a DB because it is fast both at writing and at reading. It is the Dr. Strange of DBs. Used by Google for products like Maps and Gmail.
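For a feel of the programming model, here is a minimal write/read sketch with the google-cloud-bigtable Python client (the project, instance, table, and column family names are made up for the example):

    # Bigtable stores everything under a single row key: design the key around your reads.
    from google.cloud import bigtable

    client = bigtable.Client(project="my-project")
    table = client.instance("my-instance").table("sensor-readings")

    # Write one cell: the row key encodes sensor id + timestamp (a typical time-series key).
    row_key = b"sensor#42#2021-01-01T12:00:00"
    row = table.direct_row(row_key)
    row.set_cell("readings", "temperature", "21.5")
    row.commit()

    # Read it back by key: single-row lookups are what Bigtable is fast at.
    data = table.read_row(row_key)
    cell = data.cells["readings"][b"temperature"][0]
    print(cell.value.decode("utf-8"))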

At the very base of Google Big Data there is COLOSSUS, Google's distributed file system. If it's unknown to you, go and take a look at it.

Demos / Videos

BigQuery and Dataflow
Cloud Dataproc
Bigtable

Labs

Batch Load Data Into BigQuery

Analyzing Natality Data Using Datalab and BigQuery

Run a Big Data Text Processing Pipeline in Cloud Dataflow

Working with Google Cloud Dataprep

Predict Taxi Fare with a BigQuery ML Forecasting Model

--

Antonella Blasetti
GDG Google Developer Group & WTM Rome

Google Developer Expert, Cloud. blasetti.cloud and Information Design. 40+ years of experience in Information Technology and a strong affinity with young people.