A method of configuring a local storage solution for a legacy data warehouse

Brajesh Pandey
Modern data warehouse
7 min read · Jun 9, 2022

Traditional legacy analytic MPP data warehouse applications were optimized to discard unnecessary rows before sending them over the network to perform work, but modern cloud storage solutions negate this optimization by performing their own distribution of the data. This limitation, plus the inability to schedule data partitions where the data resides (in case of failures), is a big factor when an organization decides whether to move its legacy data warehouse workloads to a Cloud Native Platform. One possible solution, in my opinion, is described below.

This document mainly focuses on:

  • A high-performance storage solution that can be deployed on a container orchestration platform (such as OpenShift) running on-prem with locally attached drives, and that provides performance close to that of bare-metal drives.
  • A solution that supports resiliency and redundancy, and provides an interface to select a supported redundancy profile.

A legacy analytic data warehouse mainly cares about performance and wants to run workloads as close to the bare-metal layer as possible, with full control over and knowledge of the I/O stack. The solution described here provides a deployable, high-performance storage layout tailored for legacy databases on a Cloud Native Platform, delivering near bare-metal performance while still offering enough configuration flexibility to the data warehouse application.

We call this solution Horcrux going forward in the document.

  • The solution uses LVM, RAID1, DM, MD and iSCSI technologies, arranged in an optimized storage topology, to create a fast and resilient storage layout over local storage and remote iSCSI devices.
  • The solution is deployed onto a Cloud Native Platform such as Kubernetes or OpenShift using a Kubernetes CSI driver.
  • The storage topology is integrated with the Kubernetes scheduler so that it can make an optimized decision when placing a data partition application pod onto a node.
High Level block diagram

Storage layout ensures resiliency and redundancy.

Storage layout

HA LVM Cluster (availability zone)

  • HA-LVM chosen over CLVMD because of active/passive nature and performance
  • Depending on total deployable nodes, there could be more than one HA LVM Cluster.
  • By default, 7 nodes per HA LVM cluster. There will be an interface provided to change it.
  • Each HA LVM cluster can have maximum 16 nodes.
  • One storage controller pod manages all HA LVM clusters.
  • Pods are deployed as a DaemonSet across the cluster.
  • Pods expose all local drives as iSCSI targets.
  • Pods mount all remote drives as iSCSI initiators (a setup sketch follows below).
Storage cluster using HA LVM
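
To make the iSCSI wiring concrete, here is a minimal sketch of how a storage node pod could export a local drive as an iSCSI target and attach a peer's drives as an initiator by shelling out to targetcli and iscsiadm. The device path, IQN, and peer address are hypothetical values for illustration, not the actual Horcrux implementation.

package main

import (
	"fmt"
	"log"
	"os/exec"
)

// run executes a command on the node and aborts the sketch on failure.
func run(name string, args ...string) {
	if out, err := exec.Command(name, args...).CombinedOutput(); err != nil {
		log.Fatalf("%s %v failed: %v\n%s", name, args, err, out)
	}
}

func main() {
	// Hypothetical values: a local drive to export and a peer storage node to import from.
	localDev := "/dev/sdb"
	localIQN := "iqn.2022-06.example.horcrux:node1-sdb"
	peerIP := "10.0.0.2"

	// Export the local drive as an iSCSI target (LIO via targetcli).
	run("targetcli", "/backstores/block", "create", "name=node1-sdb", "dev="+localDev)
	run("targetcli", "/iscsi", "create", localIQN)
	run("targetcli", "/iscsi/"+localIQN+"/tpg1/luns", "create", "/backstores/block/node1-sdb")

	// Discover and log in to the peer's targets as an iSCSI initiator.
	run("iscsiadm", "-m", "discovery", "-t", "sendtargets", "-p", peerIP)
	run("iscsiadm", "-m", "node", "-p", peerIP, "--login")

	fmt.Println("local target exported; remote devices attached")
}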

Kubernetes CSI controller

  • It is responsible for handling a PVC create request (storage allocation request).
  • It extends the Kubernetes-provided CSI APIs to support an external storage provisioner, implementing the following calls:

CreateVolume(ctx context.Context, req *csi.CreateVolumeRequest)
DeleteVolume(ctx context.Context, req *csi.DeleteVolumeRequest)
ControllerExpandVolume(ctx context.Context, req *csi.ControllerExpandVolumeRequest)
CreateSnapshot(ctx context.Context, req *csi.CreateSnapshotRequest)
DeleteSnapshot(ctx context.Context, req *csi.DeleteSnapshotRequest)

  • It’s running as a “deployment” (a K8S term) on one of the OpenShift master node.
  • It will be deployed by Openshift operator.
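
As a rough illustration (not the actual Horcrux code), a CreateVolume handler built on the CSI spec library might look like the sketch below; the availabilityZone parameter and the provisionLV helper are hypothetical stand-ins for the storage-controller REST call described later.

package horcrux

import (
	"context"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// controllerServer implements the CSI controller service.
type controllerServer struct{}

func (cs *controllerServer) CreateVolume(ctx context.Context, req *csi.CreateVolumeRequest) (*csi.CreateVolumeResponse, error) {
	if req.GetName() == "" {
		return nil, status.Error(codes.InvalidArgument, "volume name is required")
	}

	// Hypothetical parameter carried by the availability-zone storage class.
	zone := req.GetParameters()["availabilityZone"]
	size := req.GetCapacityRange().GetRequiredBytes()

	// Ask the storage layer to carve an LV out of the zone's volume group.
	volID, err := provisionLV(ctx, zone, size)
	if err != nil {
		return nil, status.Error(codes.Internal, err.Error())
	}

	return &csi.CreateVolumeResponse{
		Volume: &csi.Volume{VolumeId: volID, CapacityBytes: size},
	}, nil
}

// provisionLV is a hypothetical helper standing in for the REST call to the
// storage controller pod, which would create a mirrored LV in the zone's VG.
func provisionLV(ctx context.Context, zone string, sizeBytes int64) (string, error) {
	return "horcrux-" + zone + "-vol", nil // stubbed out for this sketch
}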

CSI Node plugin

  • It is responsible for attaching / mounting / binding a PVC to an application.
  • It extends the Kubernetes-provided CSI APIs to support an external storage provisioner, implementing the following calls:

NodeStageVolume(ctx context.Context, req *csi.NodeStageVolumeRequest)
NodeUnstageVolume(ctx context.Context, req *csi.NodeUnstageVolumeRequest)
NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest)

  • It runs as a “daemonset” (a K8s term) on all Horcrux-labelled worker nodes (by default, all worker nodes).
  • It will be deployed by the OpenShift operator (a node-side sketch follows below).
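
Again as an illustration only, the node-side staging and publishing could be sketched roughly as follows, assuming the controller passes the LV's device path in the publish context and the sketch simply shells out to mkfs and mount:

package horcrux

import (
	"context"
	"os/exec"

	csi "github.com/container-storage-interface/spec/lib/go/csi"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// nodeServer implements the CSI node service.
type nodeServer struct{}

func (ns *nodeServer) NodeStageVolume(ctx context.Context, req *csi.NodeStageVolumeRequest) (*csi.NodeStageVolumeResponse, error) {
	// Hypothetical: the LV device path published by the controller, e.g. /dev/horcruxvg/pvc-1234.
	device := req.GetPublishContext()["devicePath"]
	staging := req.GetStagingTargetPath()

	// Lay down a filesystem if the device has none yet, then mount it at the staging path.
	if err := exec.Command("blkid", device).Run(); err != nil {
		if out, err := exec.Command("mkfs.ext4", device).CombinedOutput(); err != nil {
			return nil, status.Errorf(codes.Internal, "mkfs failed: %v: %s", err, out)
		}
	}
	if out, err := exec.Command("mount", device, staging).CombinedOutput(); err != nil {
		return nil, status.Errorf(codes.Internal, "mount failed: %v: %s", err, out)
	}
	return &csi.NodeStageVolumeResponse{}, nil
}

func (ns *nodeServer) NodePublishVolume(ctx context.Context, req *csi.NodePublishVolumeRequest) (*csi.NodePublishVolumeResponse, error) {
	// Bind-mount the staged volume into the pod's target path.
	if out, err := exec.Command("mount", "--bind", req.GetStagingTargetPath(), req.GetTargetPath()).CombinedOutput(); err != nil {
		return nil, status.Errorf(codes.Internal, "bind mount failed: %v: %s", err, out)
	}
	return &csi.NodePublishVolumeResponse{}, nil
}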

Storage Node pod

  • Runs on all Horcrux labelled worker nodes (by default all worker nodes).
  • Runs as a “statefulset” (a K8s term).
  • Responsible for setting up iSCSI initiators and targets.
  • Runs a lightweight REST server, which handles communication between the storage nodes, the CSI pods, and the storage controller.
  • On demand, provides provisioning status and local volume status to the storage controller pod (a REST sketch follows below).
  • It will be deployed by the OpenShift operator.
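
A stripped-down version of such a REST endpoint is sketched below; the route, port, and response fields are assumptions for illustration, not the actual Horcrux API.

package main

import (
	"encoding/json"
	"log"
	"net/http"
	"os"
	"os/exec"
)

// VolumeStatus is a hypothetical payload reported back to the storage controller pod.
type VolumeStatus struct {
	Node    string `json:"node"`
	Healthy bool   `json:"healthy"`
	Detail  string `json:"detail"`
}

func main() {
	// Report local LVM health on demand; here we simply check that `vgs` succeeds.
	http.HandleFunc("/v1/status", func(w http.ResponseWriter, r *http.Request) {
		host, _ := os.Hostname()
		out, err := exec.Command("vgs", "--noheadings").CombinedOutput()
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(VolumeStatus{Node: host, Healthy: err == nil, Detail: string(out)})
	})

	log.Fatal(http.ListenAndServe(":8080", nil))
}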

Storage controller pod

  • Runs on one of the OpenShift master nodes.
  • Handles storage layout provisioning by keeping track of each storage node pod's provisioning status.
  • Initiates a REST request to one of the storage nodes to create the VG for that cluster. If that storage node is not the quorum leader (pacemaker), it forwards the request to the storage node pod that is the pacemaker leader (a sketch of this request follows below).
  • Monitors the health of the storage HA cluster(s) and can initiate a resync operation if spare capacity is allocated for a cluster.
  • It will be deployed by the OpenShift operator as a Kubernetes Deployment object.
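
The create-VG interaction could look roughly like the sketch below; the service name, port, endpoint, and payload shape are assumptions for illustration.

package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// CreateVGRequest is a hypothetical payload asking a storage node to build the
// volume group for its HA LVM cluster out of local and remote iSCSI devices.
type CreateVGRequest struct {
	VGName  string   `json:"vgName"`
	Devices []string `json:"devices"`
}

func main() {
	req := CreateVGRequest{
		VGName:  "horcrux_vg_az1",
		Devices: []string{"/dev/sdb", "/dev/sdc"}, // local drive plus a drive imported over iSCSI
	}
	body, _ := json.Marshal(req)

	// If the chosen storage node is not the pacemaker quorum leader, it forwards
	// this request to the storage node pod that is.
	resp, err := http.Post("http://horcrux-storage-node-0:8080/v1/vg", "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatalf("create VG request failed: %v", err)
	}
	defer resp.Body.Close()
	fmt.Println("create VG request returned", resp.Status)
}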

Scheduler

  • A custom Kubernetes scheduler will be installed, which deploys pods to the bare-metal node that the primary mirror volume lives on.
  • If the node/disk where the primary mirror volume lives is inaccessible, the scheduler will deploy the pod on the node containing the secondary mirror volume.
  • The scheduler application needs to communicate with the storage REST server in order to determine the correct node to deploy a pod onto (a sketch of this lookup follows below).
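
The placement decision can be sketched as a simple lookup against the storage REST server; the endpoint and response fields below are hypothetical, and a real implementation would wire this into a scheduler extender or a scheduler-framework plugin.

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// mirrorLocation is a hypothetical response from the storage REST server
// describing where a volume's mirrors live.
type mirrorLocation struct {
	Primary   string `json:"primary"`
	Secondary string `json:"secondary"`
	PrimaryUp bool   `json:"primaryUp"`
}

// pickNode returns the node a data-partition pod should land on: the node
// holding the primary mirror, or the secondary-mirror node if the primary is down.
func pickNode(volumeID string) (string, error) {
	resp, err := http.Get("http://horcrux-storage:8080/v1/volumes/" + volumeID + "/location")
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()

	var loc mirrorLocation
	if err := json.NewDecoder(resp.Body).Decode(&loc); err != nil {
		return "", err
	}
	if loc.PrimaryUp {
		return loc.Primary, nil
	}
	return loc.Secondary, nil
}

func main() {
	node, err := pickNode("pvc-1234")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("schedule pod onto", node)
}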

Operator

  • Horcrux uses an OpenShift operator to deploy all of its elements: the storage node pods, the storage controller pod, the node plugin pods, and the controller plugin pod.
  • It is written in Go using the operator-sdk.
  • It uses the OpenShift internal registry to pull and push Horcrux images.
  • The Horcrux CRD (Custom Resource Definition) interface provides the following configurable options (sketched as a Go spec struct below):
  • NUM_NODES_PER_CLUSTER (by default: 7)
  • SPARE_CAPACITY_PER_CLUSTER (by default: 0)
  • PROFILE (RAID1, RAID5) (by default: RAID1)
  • HORCRUX_SCHEDULER (by default: enabled)
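
In operator-sdk terms, these options would surface roughly as the spec of the Horcrux custom resource. The Go field names and kubebuilder defaults below are an assumed shape, not the published CRD:

package v1alpha1

// HorcruxSpec mirrors the configurable options exposed by the Horcrux CRD.
type HorcruxSpec struct {
	// Number of nodes per HA LVM cluster (availability zone); 16 at most.
	// +kubebuilder:default=7
	NumNodesPerCluster int `json:"numNodesPerCluster,omitempty"`

	// Spare capacity reserved per cluster for resyncing after a drive failure.
	// +kubebuilder:default=0
	SpareCapacityPerCluster int `json:"spareCapacityPerCluster,omitempty"`

	// Redundancy profile for the storage layout: RAID1 or RAID5.
	// +kubebuilder:default=RAID1
	Profile string `json:"profile,omitempty"`

	// Whether the custom Horcrux scheduler is enabled.
	// +kubebuilder:default=true
	HorcruxScheduler bool `json:"horcruxScheduler,omitempty"`
}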

Operator deployment flow

The Horcrux operator controls deployment. Once the Horcrux operator and CRD are deployed, its reconcile loop kicks in and deploys the other pods.

Cluster Provisioning flow

Once all storage node pods are deployed, the storage controller pod takes over to set up the Horcrux cluster with the storage layout. In this example, let's say there are 3 storage nodes. Provisioning goes through the following states.

state:NodeDiscovering

  • Storage controller pod tries to discover all storage node pods.

state:NodeReady

  • The storage controller pod has discovered the storage node pods, and they are up and running.
  • At this point in time, the storage node pods are ready to accept REST requests.

state:DevicesDiscovered

  • Local and remote iSCSI devices have been discovered.
  • The pacemaker cluster is set up.

state:Running

  • The HA LVM storage layout is set up.

There is an intermediate ‘ING’ state for the above two states as well.
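
For illustration, the provisioning flow could be tracked with a small set of typed state constants mirroring the states above (this is a sketch, not the actual controller code):

package main

import "fmt"

// ClusterState tracks where the storage controller pod is in the provisioning flow.
type ClusterState string

const (
	StateNodeDiscovering   ClusterState = "NodeDiscovering"
	StateNodeReady         ClusterState = "NodeReady"
	StateDevicesDiscovered ClusterState = "DevicesDiscovered"
	StateRunning           ClusterState = "Running"
	// Intermediate "ING" variants (e.g. DevicesDiscovering) sit between these
	// steady states and are omitted here for brevity.
)

func main() {
	// The controller walks the states in order as provisioning progresses.
	for _, s := range []ClusterState{StateNodeDiscovering, StateNodeReady, StateDevicesDiscovered, StateRunning} {
		fmt.Println("cluster state:", s)
	}
}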

Storage Volume Claim

Availability-zone

Defines a boundary for the number of nodes and persistent volumes.

There will be a storage class for each availability zone, and the appropriate one should be used when creating a PVC.

A PVC will be provisioned from a given availability zone (an example claim is sketched below).
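
As an example of what a claim against a zone-specific storage class could look like, the sketch below builds a PVC object with a hypothetical storage class name horcrux-az1 and prints it as YAML; the names and size are illustrative, not part of the actual Horcrux deliverable.

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	// Hypothetical storage class for availability zone 1.
	sc := "horcrux-az1"

	pvc := corev1.PersistentVolumeClaim{
		TypeMeta:   metav1.TypeMeta{APIVersion: "v1", Kind: "PersistentVolumeClaim"},
		ObjectMeta: metav1.ObjectMeta{Name: "dwh-partition-0-data"},
		Spec: corev1.PersistentVolumeClaimSpec{
			AccessModes:      []corev1.PersistentVolumeAccessMode{corev1.ReadWriteOnce},
			StorageClassName: &sc,
			Resources: corev1.ResourceRequirements{
				Requests: corev1.ResourceList{
					corev1.ResourceStorage: resource.MustParse("100Gi"),
				},
			},
		},
	}

	out, _ := yaml.Marshal(pvc)
	fmt.Print(string(out))
}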

Drive failure

In the event of a drive failure, the pods running on the affected physical node will not be evicted. They'll continue to run; however, the storage volumes attached to these pods will now be read from their mirrors stored on a remote node. In the background, the storage system should begin replicating the volumes from the failed drive evenly across the cluster to protect against data loss. A disk replacement procedure will then be followed, and the primary mirror volumes will be restored to their original drive. At that point performance returns to normal for the user, with zero downtime. In the event of a second drive failure, the behaviour is the same as for a single disk failure, as long as there are replicas of all the failed volumes. If a second drive fails and contains the only complete replica of a volume (for example, if replication of a failed volume has not yet completed), the storage system won't be able to provide access to the storage, and the pods will need to be evicted. Data may be lost, and an admin may need to manually try to recover it.

Node failure

In the event of a node failure, pods which run on that node will be forcibly evicted. The custom scheduler will deploy those pods on the nodes where their volumes' secondary mirror partitions live. Once the pods are moved, the system will run with degraded performance, since some physical nodes will now be oversubscribed with work. A node replacement procedure will be followed, and once the new node is added, the data will be replicated back to it. At this point, even though the pods are running in a degraded state, we don't want to evict them onto the new node, since this would cause a service interruption. Instead, the admin will be able to run a command during scheduled maintenance which evicts the necessary pods and schedules them back onto the new node.

Co-authors: Mike DeRoy, Aniket Kulkarni
