Kubernetes and Big Data: A Gentle Introduction

Authors: Max Ou, Kenneth Lau, Juan Ospina, and Sina Balkhi

This blog is written and maintained by students in the Professional Master’s Program in the School of Computing Science at Simon Fraser University as part of their course credit. To learn more about this unique program, please visit sfu.ca/computing/pmp.

Kubernetes, what is that?

Kubernetes has been an exciting topic within the DevOps and Data Science communities for the last couple of years. It has steadily grown into one of the go-to platforms for developing cloud-native applications. Originally built by Google and released as an open-source platform, Kubernetes handles the work of scheduling containers onto a compute cluster and manages the workloads to ensure they run as intended.

However, there is a catch: what does all that mean? Sure, it is possible to conduct additional research on Kubernetes, but many articles on the Internet are high-level overviews crammed with jargon and complex terminology, assuming that most readers already understand the technical foundations.

In this post, we attempt to provide an easy-to-understand explanation of the Kubernetes architecture and its application in Big Data while clarifying the cumbersome terminology. We do assume, however, that our readers already have some exposure to the world of application development and programming. We hope that, by the end of the article, you will have developed a deeper understanding of the topic and feel prepared to conduct more in-depth research on your own.

What are microservices?

A fictional Buy-a-Book online store with three microservices: Login, Buy and Return. Each microservice is decoupled from the rest of the app and is responsible for one specific task. The services interact with each other through APIs. (source)

To gain an understanding of how Kubernetes works and why we even need it, we need to look at microservices. There isn’t an agreed-upon definition of microservices, but simply put, microservices are small, decoupled components of a bigger app, each responsible for a specific task. These components communicate with each other through REST APIs. This kind of architecture makes apps extensible and maintainable. It also makes development teams more productive because each team can focus on its own component without interfering with other parts of the app.

Since each component operates more or less independently from other parts of the app, it becomes necessary to have an infrastructure in place that can manage and integrate all these components. This infrastructure will need to guarantee that all components work properly when deployed in production.

Containers vs. Virtual Machines (VMs)

Left: A containerized application. Each app/service runs on a separate container on Docker, currently the most popular and widely-adopted container technology. Right: Each app/service is running on a separate virtual machine placed on top of a physical machine. (source)

Each microservice has its own dependencies and requires its own environment, or virtual machine (VM), to host it. You can think of a VM as one “giant” process on your computer that has its own storage volumes, processes and networking capabilities, separate from the rest of your computer. In other words, a VM is a software-plus-hardware abstraction layer on top of the physical hardware emulating a fully-fledged operating system.

As you can imagine, a VM is a resource-consuming process, eating up the machine’s CPU, memory and storage. If your component is small (which is common), you are left with large underutilized resources in your VM. This makes most microservices-based apps that are hosted on VMs time-consuming to maintain and costly to extend.

A Docker Host can handle multiple containers, with each container defining a detached microservice. For example, one container holds the application files, another runs the MySQL database, the PHP backend is defined in yet another container and so forth. Extending the app (e.g. adding a Python-based machine learning model) is simply a matter of creating another container inside the Docker Host without affecting the other components. (source)

A container, much like a real-life container, holds things inside. A container packages the code, system libraries and settings required to run a microservice, making it easier for developers to know that their application will run, no matter where it is deployed. Most production-ready applications are made up of multiple containers, each running a separate part of the app while sharing the operating system (OS) kernel. Unlike a VM, a container can run reliably in production with only the minimum required resources. Therefore, compared to VMs, containers are considered lightweight, standalone and portable.

Diving into Kubernetes

We hope you are still on board the ride! Having gone through what containers and microservices are, understanding Kubernetes should be easier. In a production environment, you have to manage the lifecycle of containerized applications, ensuring that there is no downtime and that system resources are efficiently utilized. Kubernetes provides a framework to automatically manage all these operations in a distributed system resiliently. In a nutshell, it is an operating system for the cluster. A cluster consists of multiple virtual or real machines connected together in a network. Formally though, here’s how Kubernetes is defined on the official website:

Kubernetes is a portable, extensible, open-source platform for managing containerized workloads and services, that facilitates both declarative configuration and automation. It has a large, rapidly growing ecosystem. Kubernetes services, support, and tools are widely available.

Kubernetes is a scalable system. It achieves scalability by leveraging a modular architecture: each service of your app is separated from the others by well-defined APIs and load balancers. A load balancer distributes incoming requests across the replicas of a component (be it a server or a service) so that no single instance is overloaded and the available capacity is used efficiently. Scaling up the app is merely a matter of changing the number of replicated containers in a configuration file, or you could simply enable autoscaling. This is particularly convenient because the complexity of scaling up the system is delegated to Kubernetes. Autoscaling is driven by real-time metrics such as memory consumption, CPU load, etc. On the user side, Kubernetes automatically distributes traffic evenly across the replicated containers in the cluster and therefore keeps the deployment stable.
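
To make the replica idea concrete, here is a minimal sketch using the official Kubernetes Python client (the kubernetes package). The deployment name, namespace and target replica count are hypothetical placeholders; in practice you would more often edit the replicas field of your configuration file or turn on autoscaling.

```python
# Minimal sketch: scale a hypothetical Deployment through the kube-apiserver
# using the official Kubernetes Python client. The name, namespace and
# replica count are illustrative placeholders, not values from this article.
from kubernetes import client, config

config.load_kube_config()            # load cluster credentials from ~/.kube/config
apps_v1 = client.AppsV1Api()

# Declare that 5 replicas of the "web-frontend" Deployment should be running;
# Kubernetes then creates or removes pods until reality matches that number.
apps_v1.patch_namespaced_deployment_scale(
    name="web-frontend",
    namespace="default",
    body={"spec": {"replicas": 5}},
)
```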

Kubernetes also enables better hardware utilization. Production-ready applications usually rely on a large number of components that must be deployed, configured and managed across several servers. As described above, Kubernetes greatly simplifies the task of determining the server (or servers) where a certain component must be deployed, based on resource-availability criteria (processor, memory, etc.).

Another awesome feature of Kubernetes is its ability to self-heal, meaning it can recover from failure automatically, such as by respawning a crashed container. For example, if a container fails for some reason, Kubernetes will automatically compare the number of running containers with the number defined in the configuration file and start new ones as needed, ensuring minimum downtime.
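
The self-healing behaviour follows from this declarative model: you state the desired number of replicas once, and Kubernetes keeps reconciling the actual state against it. Below is a hedged sketch of such a declaration, expressed as a Python dict with the same client library; the names and image are hypothetical placeholders.

```python
# Hedged sketch of a declarative Deployment: "replicas: 3" is the desired
# state that Kubernetes keeps enforcing. Names and image are hypothetical.
from kubernetes import client, config

config.load_kube_config()
apps_v1 = client.AppsV1Api()

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "web-frontend"},
    "spec": {
        "replicas": 3,  # if a pod crashes, Kubernetes restores the count to 3
        "selector": {"matchLabels": {"app": "web-frontend"}},
        "template": {
            "metadata": {"labels": {"app": "web-frontend"}},
            "spec": {
                "containers": [
                    {"name": "web", "image": "example/web-frontend:1.0"},
                ],
            },
        },
    },
}

apps_v1.create_namespaced_deployment(namespace="default", body=deployment)
```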

Now that we have that out of the way, it’s time to look at the main elements that make up Kubernetes. We will first explain the lower-level Kubernetes Worker Node, then the top-level Kubernetes Master. Worker Nodes are the minions that run the containers, and the Master is the headquarters that oversees the system.

Kubernetes Worker Nodes Components

Kubernetes Worker Nodes, also known as Kubernetes Minions, contain all the necessary components to communicate with the Kubernetes Master (mainly the kube-apiserver) and to run containerized applications.

Docker Container Runtime
Kubernetes needs a container runtime in order to orchestrate containers. Docker is a common choice, but other alternatives such as CRI-O and Frakti are also available. Docker is a platform to build, ship and run containerized applications. Docker runs on each worker node and is responsible for running containers, downloading container images and managing container environments.

Pod
A pod contains one or more tightly coupled containers (e.g. one container for the backend server and others for helper services such as uploading files, generating analytics reports, collecting data, etc.). These containers share the same network IP address and port space, and can even share volumes (storage). A shared volume has the same lifecycle as the pod, which means the volume is gone if the pod is removed. However, Kubernetes users can set up persistent volumes to decouple storage from the pod; the mounted volumes then still exist after the pod is removed.
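
To make this more tangible, here is a hedged sketch of a pod holding a backend container and a helper container that share an ephemeral volume; all names, images and paths are hypothetical placeholders.

```python
# Hedged sketch: one pod, two tightly coupled containers sharing an emptyDir
# volume that lives and dies with the pod. For data that must outlive the
# pod, a PersistentVolumeClaim would be mounted instead.
from kubernetes import client, config

config.load_kube_config()
core_v1 = client.CoreV1Api()

pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "backend-with-helper"},
    "spec": {
        "volumes": [{"name": "shared-data", "emptyDir": {}}],  # removed with the pod
        "containers": [
            {
                "name": "backend",
                "image": "example/backend:1.0",
                "volumeMounts": [{"name": "shared-data", "mountPath": "/data"}],
            },
            {
                "name": "report-helper",  # e.g. generates analytics reports from /data
                "image": "example/report-generator:1.0",
                "volumeMounts": [{"name": "shared-data", "mountPath": "/data"}],
            },
        ],
    },
}

core_v1.create_namespaced_pod(namespace="default", body=pod)
```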

kube-proxy
The kube-proxy is responsible for routing incoming and outgoing network traffic on each node. It also acts as a load balancer that distributes incoming network traffic across containers.

kubelet
The kubelet gets a set of pod configurations from the kube-apiserver and ensures that the defined containers are healthy and running.

Kubernetes Master Components

The Kubernetes Master manages the Kubernetes cluster and coordinates the worker nodes. This is the main entry point for most administrative tasks.

etcd
etcd is an essential component of the Kubernetes cluster. It is a key-value store for sharing and replicating all configurations, states and other cluster data.

kube-apiserver
Almost all communication between the Kubernetes components, as well as the user commands controlling the cluster, is done using REST API calls. The kube-apiserver is responsible for handling all of these API calls.

kube-scheduler
The kube-scheduler is the default scheduler in Kubernetes; it finds the optimal worker node for each newly created pod to run on. You could also create your own custom scheduling component if needed.
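
For instance, a pod can opt out of the default scheduler simply by naming an alternative in its spec. A minimal sketch, with a hypothetical scheduler name:

```python
# Hedged sketch: a pod spec fragment that asks a custom scheduler, rather
# than the default kube-scheduler, to place the pod. The scheduler name
# and image are hypothetical placeholders.
pod_spec = {
    "schedulerName": "my-custom-scheduler",  # defaults to "default-scheduler" when omitted
    "containers": [
        {"name": "app", "image": "example/app:1.0"},
    ],
}
```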

kubectl
kubectl is a client-side command-line tool for communicating with and controlling Kubernetes clusters through the kube-apiserver.

kube-controller-manager
The kube-controller-manager is a daemon (background process) that embeds a set of Kubernetes core feature controllers, such as endpoints, namespace, replication, service accounts and others.

cloud-controller-manager
The cloud-controller-manager runs controllers that interact with the underlying cloud service providers. This enables cloud providers to integrate Kubernetes into their evolving cloud infrastructure. Cloud providers such as Google Cloud, AWS and Azure already offer their own managed Kubernetes services.

Kubernetes for Big Data

One of the main challenges in developing big data solutions is defining the right architecture to deploy big data software in production systems. Big data systems, by definition, are large-scale applications that handle online and batch data that is growing exponentially. For that reason, a reliable, scalable, secure and easy-to-administer platform is needed to bridge the gap between the massive volumes of data to be processed, the software applications and the low-level infrastructure (on-premise or cloud-based).

Kubernetes is one of the best options available to deploy applications in large-scale infrastructures. Using Kubernetes, it is possible to handle all the online and batch workloads required to feed, for example, analytics and machine learning applications.

In the world of big data, Apache Hadoop has been the reigning framework for deploying scalable and distributed applications. However, the rise of cloud computing and cloud-native applications has diminished Hadoop’s popularity (although cloud vendors such as AWS and companies such as Cloudera still provide Hadoop services). Hadoop basically provides three main functionalities: a resource manager (YARN), a data storage layer (HDFS) and a compute paradigm (MapReduce). All three of these components are being replaced by more modern technologies such as Kubernetes for resource management, Amazon S3 for storage and Spark/Flink/Dask for distributed computation. In addition, most cloud vendors offer their own proprietary computing solutions.

Google Trends comparison of Apache Hadoop and Kubernetes.

We first need to clarify that there isn’t a “one versus the other” relationship between Hadoop (or most other big data stacks) and Kubernetes. In fact, one can deploy Hadoop on Kubernetes. However, Hadoop was built and matured in a landscape far different from today’s. It was built during an era when network latency was a major issue, and enterprises were forced to have in-house data centers to avoid having to move large amounts of data around for data science and analytics purposes. That being said, large enterprises that want to run their own data centers will continue to use Hadoop, but adoption will probably remain low because of better alternatives.

Today, the landscape is dominated by cloud storage providers and cloud-native solutions for doing massive compute operations off-premise. In addition, many companies choose to have their own private clouds on-premise. For these reasons, Hadoop, HDFS and other similar products have lost major traction to newer, more flexible and ultimately more cutting-edge technologies such as Kubernetes.

Big data applications are good candidates for the Kubernetes architecture because of the scalability and extensibility of Kubernetes clusters. There have been some recent major moves to utilize Kubernetes for big data. For example, Apache Spark, the “poster child” of compute-heavy operations on large amounts of data, is working on adding a native Kubernetes scheduler to run Spark jobs. Google recently announced that they are replacing YARN with Kubernetes to schedule their Spark jobs. The e-commerce giant eBay has deployed thousands of Kubernetes clusters for managing their Hadoop AI/ML pipelines.

So why is Kubernetes a good candidate for big data applications? Take, for example, two Apache Spark jobs, A and B, doing some data aggregation on a machine, and say a shared dependency is updated from version X to Y, but job A requires version X while job B requires version Y. In such a scenario, job A would fail to run.

Each Spark job is run on its own isolated pods distributed over nodes. (source)

In a Kubernetes cluster, each Spark job instead runs in its own isolated set of driver and executor pods distributed across the nodes. This setup prevents dependencies from interfering with each other while still maintaining parallelization.
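
From the Spark side, this is roughly what that isolation looks like: each job is submitted to the cluster inside its own container image, so its dependency versions never leak into other jobs. A hedged sketch; the master URL, registry and image names are hypothetical placeholders.

```python
# Hedged sketch: a PySpark job targeting Kubernetes, packaged in its own
# container image that pins dependency version X. Job B would be a separate
# application launched the same way with its own image (e.g. ...job-b:dep-y).
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("k8s://https://kubernetes.example.com:6443")  # the cluster's API server
    .appName("job-a")
    .config("spark.kubernetes.container.image",
            "registry.example.com/spark-job-a:dep-x")      # image carrying version X
    .getOrCreate()
)
```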

Kubernetes still has some major pain points when it comes to deploying big data stacks. For example, because containers were designed for short-lived, stateless applications, the lack of persistent storage that can be shared between different jobs is a major issue for big data applications running on Kubernetes. Other major issues are scheduling (Spark’s above-mentioned implementation is still in its experimental stages), security and networking.

Consider the situation where node A is running a job that needs to read data stored in HDFS on a data node that sits on node B in the cluster. This greatly increases network latency because, unlike in a locality-aware YARN setup, the data must now be sent over the network of this isolated system for compute purposes. While there are attempts to fix these data locality problems, Kubernetes still has a long way to go to really become a viable and realistic option for deploying big data applications.

Nonetheless, the open-source community is relentlessly working on addressing these issues to make Kubernetes a practical option for deploying big data applications. Every year, Kubernetes gets closer to becoming the de facto platform for distributed, big data applications because of its inherent advantages like resilience, scalability and resource utilization.

So long, my friend.

In this article, we have only scratched the surface of what Kubernetes is, its capabilities and its applications in big data. As a continually developing platform, Kubernetes will continue to grow and evolve into a technology that is applied in numerous tech domains, especially in big data and machine learning. If you find yourself wanting to learn more about Kubernetes, here are some suggestions on topics to explore under the “External links” section. We hope you enjoyed our article about Kubernetes and that it was a fun read.

External links:

Official Kubernetes documentation
https://kubernetes.io/docs/home/

Official Docker documentation
https://docs.docker.com/

Cloud Computing — Containers vs VMs, by IBM
https://www.ibm.com/blogs/cloud-computing/2018/10/31/containers-vs-vms-difference/

Kubernetes in Big Data Applications, by Goodworklabs
https://www.goodworklabs.com/kubernetes-in-big-data-applications/

Should you use Kubernetes and Docker in your next project? Daniele Polencic at Junior Developers Singapore 2019
https://www.youtube.com/watch?v=u8dW8DrcSmo

Kubernetes in Action, 1st Edition, by Marko Luksa
https://www.amazon.com/Kubernetes-Action-Marko-Luksa/dp/1617293725/ref=sr_1_1?keywords=kubernetes+in+action&qid=1580788013&sr=8-1

Kubernetes: Up and Running, 2nd Edition, Brendan Burns, Joe Beda, Kelsey Hightower
https://www.amazon.com/Kubernetes-Running-Dive-Future-Infrastructure/dp/1492046531/ref=sr_1_1?keywords=kubernetes+up+and+running&qid=1580788067&sr=8-1
