GCP Crash Course: Big Data
Basic Concepts
Big Data may be summarized as:
- Petabyte-scale storage and organization of [semi-]structured data
- Fast ([milli]second) query retrieval
- Pipeline processing → batch and streaming
Pipeline !?! Streaming !?!
If you have to perform complex processing on data, you could write one big procedure (old fashioned) and run it. The problems with this approach are:
- You probably have to run it on a single system → it may take hours or days
- If something goes wrong you have to start all over again… and cry out loud!
- The longer the procedure, the more likely it contains errors
So, just use the classic MapReduce paradigm: lots of VMs, distributed storage, and small tasks that can be computed one by one… in a parallel pipeline!
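To make the idea concrete, here is a minimal, single-machine sketch of the MapReduce word count in plain Python. It only illustrates the three phases (map, shuffle, reduce); in a real cluster each map call would run on a different worker, and the document list is an invented example.

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one document.
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(mapped_pairs):
    # Shuffle: group all emitted values by key, as the framework
    # does between the map and reduce stages.
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: collapse each key's values into a single result.
    return {key: sum(values) for key, values in groups.items()}

documents = ["big data big pipelines", "big queries"]
# Each document could be mapped on a different VM, in parallel.
mapped = chain.from_iterable(map_phase(doc) for doc in documents)
counts = reduce_phase(shuffle_phase(mapped))
print(counts)  # {'big': 3, 'data': 1, 'pipelines': 1, 'queries': 1}
```

Because each map call touches only its own document, the map phase parallelizes trivially, which is exactly what makes the "lots of VMs" approach work.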
Usually all computations are performed on data stored in databases, which means → historical data, even if only a couple of hours old.
But now it is possible to catch data in flight, that is, to surf the tide from the very beginning!
For example, systems can notice an event from its first signs: tsunamis, traffic jams, the commercial success (or flop) of a product or event, and so on.
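A toy sketch of what "noticing an event from its first signs" can mean: a rolling average computed over a live stream fires an alert while the spike is still unfolding, instead of being found later in a batch report. The readings, window size, and threshold are all invented for illustration.

```python
from collections import deque

def rolling_alerts(stream, window=3, threshold=100.0):
    """Return the indexes at which the rolling mean exceeds the
    threshold -- i.e. react while the event is still happening."""
    recent = deque(maxlen=window)  # only the last `window` values
    alerts = []
    for i, value in enumerate(stream):
        recent.append(value)
        if len(recent) == window and sum(recent) / window > threshold:
            alerts.append(i)
    return alerts

# Hypothetical sensor readings: traffic flow suddenly spiking.
readings = [90, 95, 92, 100, 150, 160]
print(rolling_alerts(readings))  # [4, 5]
```

A batch job over yesterday's data would report the same spike, but only after the fact; the streaming version reacts within one window of the first anomalous readings.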
Google is simply the best at handling DATA; it is part of its declared mission and scope.
The technologies to focus on are listed in the Cheatsheet below.
Ask yourself
Think about the possible applications of real-time analytics:
- What if systems became capable of detecting events just before, or soon after, they happen?
- Why would you need both historical and real-time data?
- And what is the role of ML in all this?
Cheatsheet
For most doubts you may refer to the Building Blocks doc.
Officially, the Big Data products are:
BigQuery: Analytics, Data Warehouse, or OLAP, as you prefer. That means a DB not really performant for writes (usually done in batches, though it supports streaming too), but a champion at querying over lots of data. It doesn't use indexes; instead it can activate fleets of servers to scan the data.
Storage is inexpensive… queries are paid by weight (TB scanned).
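The "paid by weight" model is worth internalizing: on-demand queries are billed by bytes scanned, not by rows returned. A rough sketch below; the $5-per-TiB rate is an assumption for illustration only (check current BigQuery pricing, which also includes a monthly free tier).

```python
def bq_on_demand_cost(bytes_scanned, usd_per_tib=5.00):
    # Illustrative only: BigQuery on-demand queries are billed by the
    # amount of data scanned, not by rows returned. The rate is an
    # assumed figure -- check the current price list.
    TIB = 1024 ** 4
    return bytes_scanned / TIB * usd_per_tib

# A query that scans a 2 TiB column costs the same whether it
# returns a billion rows or a single aggregated row.
print(round(bq_on_demand_cost(2 * 1024 ** 4), 2))  # 10.0
```

Practical consequence: selecting only the columns you need (and partitioning tables) cuts cost directly, because fewer bytes are scanned.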
Dataflow + Pub/Sub: two different products that often work together. Pub/Sub is a serverless, large-scale, real-time messaging service used to catch data the very moment it is generated. Dataflow is based on Apache Beam and is a powerful way to process big data in steps. It doesn't store data (it uses Cloud Storage or BigQuery for that).
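To show what "processing in steps" looks like, here is a conceptual sketch of a Beam-style pipeline using plain Python generators. This is not the Apache Beam SDK; the source, steps, and sink are stand-ins for what would be Pub/Sub, ParDo transforms, and a BigQuery write in a real Dataflow job.

```python
def read_from_source(lines):
    # Stand-in for a source such as Pub/Sub or a Cloud Storage file.
    yield from lines

def parse(records):
    # Step 1: parse "name,amount" records, dropping malformed ones.
    for record in records:
        parts = record.split(",")
        if len(parts) == 2:
            yield parts[0], float(parts[1])

def keep_large(pairs, minimum=10.0):
    # Step 2: filter, keeping only sizeable amounts.
    for name, amount in pairs:
        if amount >= minimum:
            yield name, amount

def write_to_sink(pairs):
    # Stand-in for a sink such as BigQuery; here we just collect.
    return list(pairs)

raw = ["alice,12.5", "bob,3.0", "broken-line", "carol,40.0"]
# Chaining generators mirrors Beam's pipe syntax: source | step | step | sink.
result = write_to_sink(keep_large(parse(read_from_source(raw))))
print(result)  # [('alice', 12.5), ('carol', 40.0)]
```

Because each step consumes and produces a stream of elements, the same pipeline shape works for both batch and streaming input, which is precisely Beam's unification pitch.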
Cloud Dataproc: the Hadoop ecosystem in the cloud. If you are already using Hadoop, it is a lift-and-shift solution. Otherwise, Google advises using Dataflow and BigQuery.
Datalab, Dataprep, and Data Studio are utilities, more or less. Be aware of the names: it is legitimate to think that someone is really lacking in imagination…
Bigtable: officially listed as a NoSQL database, but it is the product to use for IoT data ingestion and large quantities of semi-structured time-series data. It is a DB because it is fast at both writing and reading. It is the Dr. Strange of DBs. Used by Google for products like Maps and Gmail.
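Bigtable stores rows sorted by a single row key, so time-series performance hinges on key design. Below is a sketch of one common pattern (sensor id prefix plus reversed timestamp, so the newest reading sorts first and writes spread across sensors rather than hotspotting on the current time). The sensor names, epoch cap, and key format are invented for illustration.

```python
MAX_EPOCH = 9_999_999_999  # illustrative cap for 10-digit epoch seconds

def row_key(sensor_id, epoch_seconds):
    # Sketch of a common Bigtable time-series pattern: prefix with the
    # sensor id (spreads writes across tablets), then a reversed,
    # zero-padded timestamp so the newest reading sorts first.
    reversed_ts = MAX_EPOCH - epoch_seconds
    return f"{sensor_id}#{reversed_ts:010d}"

keys = sorted(
    row_key(sensor, ts)
    for sensor, ts in [("temp-berlin", 1700000000),
                       ("temp-berlin", 1700000060),
                       ("temp-oslo", 1700000000)]
)
# Within each sensor prefix, the most recent reading comes first.
print(keys)
```

Starting the key with a raw timestamp instead would send every concurrent write to the same end of the keyspace, which is the classic Bigtable hotspotting mistake.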
At the very base of Google Big Data there is COLOSSUS, Google's distributed file system. If it's unknown to you, go take a look at it.
Demos / Videos
BigQuery and Dataflow
Cloud Dataproc
Bigtable
Labs
Batch Load Data Into BigQuery
Analyzing Natality Data Using Datalab and BigQuery
Run a Big Data Text Processing Pipeline in Cloud Dataflow
Working with Google Cloud Dataprep
Predict Taxi Fare with a BigQuery ML Forecasting Model