Cloud, Big Data and Data Science Platforms
Hello there! Today I will talk about Cloud, Big Data and Data Science, and how Google Cloud can be the one platform that you will ever need. In this post, I will focus on the infrastructure needs of “data companies”, the companies that have data as their core asset. What is a data company? Any company that is driven by data, makes decisions based on data, and improves its products based on data is a data company. Examples of data companies include Google, Netflix, Uber, Airbnb, and Spotify. They all share the same DNA: collecting large amounts of data, processing it, finding patterns in it, and delivering value through services that make money. This is even more true for modern data companies like Spotify, Uber and Airbnb. Spotify, the biggest music company, does not own a single album; Uber, the biggest transportation company, does not own a single vehicle; Airbnb, the biggest lodging company, does not own a single lodge. Data is their core asset.
In order to work with data effectively, you need several platforms.
- Cloud platform to provision resources at scale.
- Big Data platform to process the data.
- Machine Learning platform for learning patterns from data.
- Streaming platform for delivering value in real-time.
When you operate with Cloud providers like AWS / IBM / Azure, you are working with a multitude of systems. Your Cloud provider (AWS / IBM / Azure) provides the hardware resources, and your Big Data platform (Hadoop / Spark) processes the data. You also need a Streaming platform (Kafka / AWS Kinesis), a Machine Learning platform (IBM Watson / Nervana) and a Container platform or Cluster manager (Cloud Foundry / Mesos). Having so many platforms is a maintenance headache. Not only that, these platforms were not designed together, so they don’t integrate well with each other.
When you use Google Cloud, here is how all your platform needs are taken care of.
- Cloud: Google Cloud provides the resources you need (VMs, Storage buckets, Databases, IAM, Monitoring tools, etc.). On top of these, Google Cloud provides several services like App Engine and Cloud Shell (Bastion as a Service), and takes care of nitty-gritty details like SSH key management, which reduces your DevOps needs. VMs boot 5 times faster, the networking fabric is an order of magnitude faster, Cloud Storage and Disks are about 2 times faster, and local SSDs are about 5 times faster compared to AWS and Microsoft offerings. Innovative sustained-use discounts give you the flexibility to procure resources as per your needs, without going for long-term lock-in.
- Big Data: The biggest advantage of Google Cloud is its Big Data stack. Dataflow (crunch petabytes of data with ease), Pub/Sub (send the “Internet” through it 10 times in a day), Bigtable (Internet-scale NoSQL data store for storing petabytes of data) and BigQuery (the equivalent of Parquet columnar storage + Presto query engine running on custom hardware + Airpal for Web UI + NoOps) are 5 years to a decade ahead of other industry-leading tools.
- Streaming: Cloud Pub/Sub is an Internet-scale messaging system with flexible push- or pull-style subscriptions. It's a global service, meaning that you can send messages to it in one zone/region and receive them in another zone/region, without any extra cost. It can scale to millions of messages instantly. It also provides data security and protection by encrypting data on the wire and at rest by default. It can work with a variety of sinks and sources, including Dataflow (for stream processing), Cloud Storage (for persistent storage), Cloud Logs, Cloud Monitoring, and more.
- Data Science: Google Cloud offers a spectrum of services on the data science side. It provides Machine Learning APIs like Vision API, Speech API and Translate API, which make it easy for developers to get started. Cloud ML, a hosted version of TensorFlow, enables data scientists to train ML models without worrying about infrastructure. If you would like to develop new models, you can do so using TensorFlow, which is open source. If you feel like exploring data in an ad hoc way, you can try Cloud Datalab, a hosted version of IPython Notebook. Datalab integrates really well with BigQuery and Cloud Storage.
- Containers: Google Cloud Container Engine (GKE) is a hosted version of Kubernetes, the most popular open source management solution for containerized applications. Kubernetes has nearly 1 K developers, 30 K commits and 15 K stars on GitHub. It comes with automatic bin packing, self-healing, horizontal scaling, seamless service discovery, a built-in load balancer, automatic rollouts and rollbacks, and more. Kubernetes makes large-scale container management easy.
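To make the push and pull subscription styles from the Streaming bullet a bit more concrete, here is a minimal in-memory sketch in Python. It is only a toy model of the idea and does not use the Cloud Pub/Sub client library. The names `ToyTopic`, `subscribe_pull`, `subscribe_push` and `pull` are hypothetical and chosen just for this illustration.

```python
from collections import deque
from typing import Callable, Deque, List


class ToyTopic:
    """In-memory stand-in for a topic; NOT the Cloud Pub/Sub client API."""

    def __init__(self) -> None:
        self._pull_queues: List[Deque[str]] = []
        self._push_handlers: List[Callable[[str], None]] = []

    def subscribe_pull(self) -> Deque[str]:
        """Pull style: the subscriber gets a backlog and fetches when ready."""
        queue: Deque[str] = deque()
        self._pull_queues.append(queue)
        return queue

    def subscribe_push(self, handler: Callable[[str], None]) -> None:
        """Push style: the topic calls the subscriber's handler on publish."""
        self._push_handlers.append(handler)

    def publish(self, message: str) -> None:
        # Every subscription receives its own copy of each message.
        for queue in self._pull_queues:
            queue.append(message)
        for handler in self._push_handlers:
            handler(message)


def pull(queue: Deque[str], max_messages: int) -> List[str]:
    """Fetch up to max_messages from a pull subscription's backlog."""
    batch: List[str] = []
    while queue and len(batch) < max_messages:
        batch.append(queue.popleft())
    return batch


if __name__ == "__main__":
    topic = ToyTopic()
    backlog = topic.subscribe_pull()
    topic.subscribe_push(lambda msg: print("pushed:", msg))
    topic.publish("song-played")
    print("pulled:", pull(backlog, max_messages=10))
```

In this sketch the push subscriber reacts as soon as a message is published, while the pull subscriber decides when and how much to read. That is roughly the trade-off between the two styles: push keeps latency low, pull lets the consumer control its own pace.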
When you work with Google Cloud, you are using one single coherent platform that serves all your needs for Cloud, Big Data, Streaming, Data Science and Container systems. You don’t need to fight with a multitude of systems that don’t work well with each other. Imagine the number of hours saved and the productivity unleashed!