
Deploying LinkedIn DataHub on AWS

Optimizing Data Management for Scalability and Reliability

Giuseppe Brescia
10 min read · Feb 21, 2024

In today’s data-driven landscape, effective data management solutions are essential for businesses striving to maintain a competitive edge. LinkedIn DataHub has emerged as a powerful tool for organizing and leveraging vast amounts of data efficiently. However, deploying such a solution, especially ensuring scalability and reliability, poses challenges.

This article will explore the intricacies of deploying LinkedIn DataHub on Amazon Web Services (AWS), providing insights into the deployment process and best practices for seamless integration. It will address common issues encountered during deployment and provide specific solutions to overcome them, along with detailing the configurations employed.

By harnessing AWS’s robust infrastructure and LinkedIn DataHub’s capabilities, companies can optimize their data management processes, enabling effective scaling of operations.

From resource provisioning to cloud architecture configuration, this article will delve into step-by-step deployment procedures, providing specific details on the configurations used. Whether for a small startup or a large enterprise, understanding how to deploy LinkedIn DataHub on AWS is crucial for unlocking its full potential and maximizing the efficiency of your data management efforts.

In the deployment process of LinkedIn DataHub on AWS, a shell script (.sh) was used to simplify the execution of all necessary commands. This article examines the script in detail, elucidating the purpose and functionality of each command.

It is important to highlight that this guide is specifically crafted for the scenario of deploying LinkedIn DataHub on a Linux machine hosted on Amazon Web Services (AWS). While the principles discussed may be applicable to other environments, the focus remains on providing actionable insights and configurations relevant to the setup.

deploy_datahub_v3.sh

At the beginning of the shell script, the name of the cluster was set as a variable.

clusterName="DataHub-Cluster-Test"

The clusterName variable holds the identifier for the DataHub cluster, allowing it to be referenced consistently throughout the script.
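
Storing the name once means every later command can interpolate it; when a suffix follows the variable, braces make the boundary explicit. A minimal illustrative sketch (the echo line is not part of the original script):

# Braces delimit the variable name when other characters follow it.
echo "Node group will be named: ${clusterName}-NodeGroup"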

Kubectl

The utility known as kubectl, part of Kubernetes, provides a command-line interface for interacting with Kubernetes clusters. With kubectl, it is possible to execute various commands to deploy software applications, monitor and administer cluster resources, as well as access and analyze log data.

# kubectl 
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
curl -LO "https://dl.k8s.io/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl.sha256"
echo "$(cat kubectl.sha256)  kubectl" | sha256sum --check
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
kubectl version --output=yaml

The first command uses curl to download the kubectl executable from the latest stable version of Kubernetes. It utilizes two nested curl commands: the first one retrieves the latest stable version of Kubernetes, while the second one actually downloads the kubectl executable for the Linux operating system and amd64 architecture.

The second command downloads the SHA-256 checksum file associated with the kubectl executable from the same stable version of Kubernetes used in the first command. This file will be used to verify the integrity of the downloaded kubectl executable.

The third command computes the SHA-256 checksum of the downloaded kubectl executable and checks it against the checksum provided in the previously downloaded kubectl.sha256 file. This step ensures that the downloaded kubectl file has been transferred correctly without alterations.
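
If the verification succeeds, sha256sum prints a confirmation line; on a mismatch it reports a failure and exits non-zero, in which case the download should be repeated. Per the upstream Kubernetes installation docs, the expected output is:

$ echo "$(cat kubectl.sha256)  kubectl" | sha256sum --check
kubectl: OK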

The fourth command installs the kubectl executable into the /usr/local/bin/ directory with appropriate permissions. It is executed with administrator privileges (sudo) to ensure proper installation and accessibility of the executable to all users on the system.

Finally, the last command prints the installed kubectl client version (and, once a cluster is reachable, the Kubernetes server version as well). The --output=yaml option formats the output as YAML, making it easier to read and interpret.

By following these steps, the kubectl executable can be downloaded, verified, and installed correctly on a Linux system, ready to interact with Kubernetes clusters.

Eksctl

eksctl is a tool designed to simplify the creation, management, and maintenance of Kubernetes clusters on Amazon Web Services Elastic Kubernetes Service (EKS). With eksctl, it is possible to automate the provisioning and management of Kubernetes clusters, significantly reducing operational complexity.

# eksctl  
sudo curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | sudo tar xz -C /usr/local/bin

# cluster
eksctl create cluster --name $clusterName --version 1.27 --region eu-west-1 --nodegroup-name ${clusterName}-NodeGroup --node-type t3.medium --nodes 3 --with-oidc

The first command uses curl to download the eksctl binary file from the official Weaveworks repository on GitHub. The downloaded file is then extracted (using tar) into the /usr/local/bin directory, which is commonly used for system executable programs on Linux.

The second command creates an Amazon EKS cluster using the eksctl tool. It specifies the cluster name, Kubernetes version (1.27), region (eu-west-1), node group name, node type (t3.medium), and number of nodes (3), and enables OIDC integration for IAM.
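
Cluster creation typically takes 10 to 20 minutes. Since eksctl writes the new cluster's credentials to the local kubeconfig by default, a quick sanity check can confirm that the control plane responds and the three worker nodes registered; a minimal sketch:

# Verify the cluster exists and the nodes joined successfully.
eksctl get cluster --region eu-west-1
kubectl get nodes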

OIDC — OpenID Connect

OIDC (OpenID Connect) is an authentication layer built on top of the OAuth 2.0 protocol. It provides a simple identity layer that allows clients to verify the identity of end-users based on the authentication performed by an authorization server, as well as to obtain basic profile information about the end-users. In essence, OIDC enables single sign-on (SSO) and access control for web and mobile applications.

These commands facilitate the integration of IAM (Identity and Access Management) OIDC (OpenID Connect) provider with an Amazon EKS (Elastic Kubernetes Service) cluster.

# OIDC
oidc_id=$(aws eks describe-cluster --name $clusterName --query "cluster.identity.oidc.issuer" --output text | cut -d '/' -f 5)
aws iam list-open-id-connect-providers | grep $oidc_id
eksctl utils associate-iam-oidc-provider --cluster $clusterName --approve

The first command retrieves the OIDC issuer URL for the specified EKS cluster using the AWS CLI (Command Line Interface). The trailing ID of the issuer URL is then extracted with cut and stored in the oidc_id variable.

The second command lists all the IAM OIDC identity providers in the AWS account and filters the output to find the one that matches the OIDC issuer ID obtained in the previous step.

The last command associates the IAM OIDC provider identified in the previous step with the specified EKS cluster. This association allows Kubernetes to authenticate using IAM roles, enabling fine-grained access control within the cluster. The --approve flag confirms the association without prompting for user confirmation.

Service account EBS and IAM

Before deploying applications within an Amazon EKS (Elastic Kubernetes Service) cluster, it is crucial to establish the necessary configurations and permissions. This ensures secure access and efficient management of resources within the Kubernetes environment. The following commands are essential steps in this preparatory process.

# SERVICE ACCOUNT EBS AND IAM
kubectl apply -f my-service-account.yaml
account_id=$(aws sts get-caller-identity --query "Account" --output text)
AWS_REGION=eu-west-1  # must match the region the cluster was created in
oidc_provider=$(aws eks describe-cluster --name $clusterName --region $AWS_REGION --query "cluster.identity.oidc.issuer" --output text | sed -e "s/^https:\/\///")

export namespace=kube-system
export service_account=ebs-csi-controller-sa

sed -i "s|\$namespace|$namespace|g; s|\$service_account|$service_account|g; s|\$oidc_provider|$oidc_provider|g" aws-ebs-csi-driver-trust-policy.json

aws iam create-role --role-name AmazonEKS_EBS_CSI_DriverRole_$clusterName --assume-role-policy-document file://aws-ebs-csi-driver-trust-policy.json --description "my-role-description"


aws iam attach-role-policy \
--policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
--role-name AmazonEKS_EBS_CSI_DriverRole_$clusterName

kubectl annotate serviceaccount -n $namespace $service_account eks.amazonaws.com/role-arn=arn:aws:iam::$account_id:role/AmazonEKS_EBS_CSI_DriverRole_$clusterName

eksctl create addon --name aws-ebs-csi-driver --cluster $clusterName --service-account-role-arn arn:aws:iam::$account_id:role/AmazonEKS_EBS_CSI_DriverRole_$clusterName --force

service_role_name=$(aws eks describe-cluster --name $clusterName --query "cluster.roleArn" --output text | cut -d'/' -f2)
worker_role_name=$(eksctl get nodegroup --cluster $clusterName -o json | jq -r '.[].StackName' | xargs -I {} aws cloudformation describe-stack-resources --stack-name {} --query 'StackResources[?ResourceType==`AWS::IAM::Role`].PhysicalResourceId' --output text)
aws iam attach-role-policy --role-name $service_role_name --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy
aws iam attach-role-policy --role-name $worker_role_name --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy

kubectl create -f storage-class.yaml

The series of commands outlined above is pivotal in establishing a robust environment for deploying applications within an Amazon EKS (Elastic Kubernetes Service) cluster. Each step is examined in more detail below.

Initially, the script creates the Kubernetes service account defined in my-service-account.yaml and collects essential information regarding the AWS account and OIDC provider. This information serves as the foundation for subsequent configuration steps, ensuring seamless integration with AWS services.

Following this, two critical environment variables, namely namespace and service_account, are defined and initialized. These variables play a crucial role in configuring the operational context within the Kubernetes cluster, facilitating streamlined interactions with various resources.

The sed command is then employed to dynamically modify a JSON file (aws-ebs-csi-driver-trust-policy.json) based on the values of the previously defined environment variables. Note that | is used as the sed delimiter because the value of oidc_provider contains slashes, which would prematurely terminate a /-delimited expression. This substitution ensures that the trust policy aligns precisely with the deployment environment.
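
For illustration, with the hypothetical values account_id=123456789012 and oidc_provider=oidc.eks.eu-west-1.amazonaws.com/id/EXAMPLE0123456789 (neither value appears in the original script), the Federated principal in the trust policy would be rewritten as:

"Federated": "arn:aws:iam::123456789012:oidc-provider/oidc.eks.eu-west-1.amazonaws.com/id/EXAMPLE0123456789"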

Moving forward, an IAM role, AmazonEKS_EBS_CSI_DriverRole_$clusterName, is created with a meticulously crafted trust policy. This role delineates the permissible interactions with AWS resources, essential for maintaining security and integrity within the cluster.

Subsequently, the IAM policy AmazonEBSCSIDriverPolicy is attached to the IAM role, endowing it with the requisite permissions for the EBS CSI driver to seamlessly interact with AWS resources.

An annotation is then added to the Kubernetes service account, embedding the IAM role ARN within its metadata. This annotation serves as a crucial link, enabling the Kubernetes service account to leverage the permissions associated with the IAM role for efficient access control within the cluster.
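
The link can be verified by inspecting the service account; assuming the kube-system namespace and service account name defined earlier, the Annotations field should show the role ARN:

# The output should include eks.amazonaws.com/role-arn with the new role's ARN.
kubectl describe serviceaccount ebs-csi-controller-sa -n kube-system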

Further bolstering the setup, an addon named aws-ebs-csi-driver is instantiated for the EKS cluster. This addon leverages the IAM role ARN to grant the necessary permissions to the EBS CSI driver, empowering it to efficiently manage EBS volumes within the cluster.

Then, IAM policies are attached to both the EKS cluster’s role and the worker nodes’ role. This ensures comprehensive access control, safeguarding the cluster against unauthorized access and ensuring seamless operation of the EBS CSI driver across all cluster components.

Finally, the script creates a storage class configuration defined in the storage-class.yaml file, completing the setup for EBS volumes within the Kubernetes cluster.
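
At this point the storage classes can be listed to confirm that the new class was registered and marked as the default:

# aws-pg-sc should appear in the list, flagged as (default).
kubectl get storageclass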

In summary, these steps collectively lay the groundwork for a robust, secure, and efficiently managed environment, ready to support the deployment and operation of applications within the Amazon EKS cluster.

Creating Secrets

# CREATE SECRET
kubectl create secret generic mysql-secrets --from-literal=mysql-root-password=datahub
kubectl create secret generic neo4j-secrets --from-literal=neo4j-password=datahub

In this section, two Kubernetes secrets are being created using the kubectl command. These secrets, mysql-secrets and neo4j-secrets, are essential for securely storing sensitive information, particularly passwords, required for accessing MySQL and Neo4j databases respectively.

The first command kubectl create secret generic mysql-secrets --from-literal=mysql-root-password=datahub initializes a secret named mysql-secrets and incorporates a key-value pair. Here, the key is mysql-root-password, while the value is datahub, representing the password required for MySQL root access.

Similarly, the second command kubectl create secret generic neo4j-secrets --from-literal=neo4j-password=datahub creates a secret labeled neo4j-secrets. It also includes a key-value pair, where the key neo4j-password holds the password datahub, utilized for accessing the Neo4j database.

These commands serve to securely store sensitive data within the Kubernetes cluster, enhancing the overall security posture by preventing the exposure of passwords in plaintext format. This approach ensures that sensitive information is safeguarded and accessible only to authorized applications within the cluster.
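
Consuming such a secret follows the standard Kubernetes pattern: a container maps a secret key into an environment variable. A minimal, hypothetical pod spec fragment (not part of the original script) illustrating how the mysql-secrets value could be injected:

env:
  - name: MYSQL_ROOT_PASSWORD
    valueFrom:
      secretKeyRef:
        name: mysql-secrets        # the secret created above
        key: mysql-root-password   # the key holding the password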

Helm

Helm is a powerful tool for simplifying the management of Kubernetes applications through package management. It allows users to define, install, and upgrade complex Kubernetes applications with ease using pre-configured packages called charts.

# HELM
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh

The first command utilizes curl to download the Helm installation script named get_helm.sh from the official Helm GitHub repository. The script is responsible for setting up Helm on your system.

The second command grants execute permissions (700) to the downloaded installation script get_helm.sh. This step is necessary to ensure that the script can be executed.

The last command executes the installation script get_helm.sh, which installs Helm on your system. It sets up all necessary configurations and dependencies to enable Helm for managing Kubernetes applications.

Upon completion of these commands, Helm will be successfully installed on your system, empowering you to effectively manage Kubernetes applications using Helm charts.
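
A quick check confirms that the client is available on the PATH:

# Prints the installed Helm client version.
helm version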

Installing DataHub using Helm

# DATAHUB (prerequisites + datahub)
helm repo add datahub https://helm.datahubproject.io/
helm install prerequisites datahub/datahub-prerequisites
helm install datahub datahub/datahub

In this section, the DataHub application, along with its prerequisites, is installed using Helm. DataHub is a data discovery and metadata management platform.

The first command adds the DataHub Helm repository to the local Helm configuration. This repository contains the necessary Helm charts for installing DataHub.

Next, the command helm install prerequisites datahub/datahub-prerequisites initiates the installation process. It installs the prerequisites for DataHub, such as databases, caches, and other dependencies, using the datahub-prerequisites chart from the DataHub repository.

Once the prerequisites are in place, the command helm install datahub datahub/datahub deploys DataHub itself from the same repository. The prerequisites chart provides only the supporting services; DataHub's own components (such as the frontend and the metadata service) are installed by the separate datahub chart.

After the installation, the following commands are typically executed to verify the status of the deployed resources:

# CHECK STATUS
kubectl get pods
kubectl get pvc
kubectl get pv

These commands retrieve information about the pods, persistent volume claims (PVCs), and persistent volumes (PVs) currently present in the Kubernetes cluster. They help administrators monitor the status and availability of these resources, facilitating efficient management of the cluster.
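
Once all pods report Running, the DataHub web interface can be reached by forwarding the frontend service to the local machine. A minimal sketch, assuming the release name datahub used above (the exact service name follows the chart's naming convention and may differ):

# Expose the DataHub frontend at http://localhost:9002
kubectl port-forward svc/datahub-datahub-frontend 9002:9002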

Configuration Files

In this section, the necessary configuration files required by the script discussed in the preceding section are showcased.

aws-ebs-csi-driver-trust-policy.json

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::$account_id:oidc-provider/$oidc_provider"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "$oidc_provider:aud": "sts.amazonaws.com",
          "$oidc_provider:sub": "system:serviceaccount:$namespace:$service_account"
        }
      }
    }
  ]
}

This JSON policy grants permission for a federated identity provider (specified by its ARN) to assume a role using Web Identity Federation (STS:AssumeRoleWithWebIdentity) within the AWS environment. The conditions specify that the audience (aud) and subject (sub) claims of the OIDC provider’s token must match specific values for the operation to be allowed.

storage-class.yaml

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: aws-pg-sc
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
  fsType: ext4

This YAML file defines a Kubernetes StorageClass named “aws-pg-sc” with metadata including annotations designating it as the default class. It specifies the provisioner as “kubernetes.io/aws-ebs” and sets parameters for the type of storage (“gp2”) and filesystem type (“ext4”).
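
Note that kubernetes.io/aws-ebs is the legacy in-tree provisioner. Since the script installs the AWS EBS CSI driver, a hypothetical alternative (not part of the original deployment) is to provision through the driver itself, which also enables newer volume types such as gp3:

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: aws-gp3-sc            # hypothetical name
provisioner: ebs.csi.aws.com  # the EBS CSI driver installed earlier
parameters:
  type: gp3
  csi.storage.k8s.io/fstype: ext4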

my-service-account.yaml

apiVersion: v1
kind: ServiceAccount
metadata:
  name: ebs-csi-controller-sa
  namespace: kube-system

This YAML file creates a Kubernetes ServiceAccount named “ebs-csi-controller-sa” within the “kube-system” namespace.

Conclusion

Overall, the article thoroughly examined the process of deploying LinkedIn DataHub on Amazon Web Services (AWS), illustrating the challenges and best practices to ensure scalability and reliability. For simplicity, some steps were not covered in detail, such as modifying the security groups associated with the EKS cluster to allow inbound traffic, a crucial aspect of ensuring a secure and functional environment.

In conclusion, implementing data management solutions like LinkedIn DataHub on AWS provides businesses with the opportunity to optimize data management processes and improve business operations. By harnessing the robust infrastructure of AWS and the capabilities of LinkedIn DataHub, companies can gain a competitive edge in today’s data-driven landscape. The combination of these platforms offers a powerful tool for organizing and optimizing data resources, enabling businesses to adapt and thrive in an increasingly competitive and data-oriented environment.

Giuseppe Brescia, Antonio La Macchia
