Externalize Data and (Hive) Metadata to Make the Compute Engine Stateless

Mrudula Madiraju · The Startup · Feb 9, 2020

In this blog you will learn how and why to make your IBM Analytics Engine (1.2) cluster stateless by keeping your data and Hive metadata outside of the cluster, using IBM Cloud Object Storage and IBM Cloud Databases for PostgreSQL.

Overview

Separating storage from compute is a recommended paradigm that brings flexibility and better utilization of resources. The decoupling allows you to scale either of the two up or down independently, without impacting the other. Specifically in the case of IBM Analytics Engine, it lets you adopt the cattle-vs-pets way of thinking about clusters. Need to move from a cluster where a small POC was done to one with more capacity? Need a cluster with enhanced software or fixes? Messed up a cluster's configuration and need to start all over? Need an HA or DR setup? Move on! Just get an additional or new compute cluster!

Externalize Data & Metadata

[Image: Stateless clusters with data and metadata residing outside]

In the above diagram, two instances of Analytics Engine have been configured to work with the same instance of IBM COS and the same PostgreSQL database, for storing data and Hive metadata respectively. The Analytics Engine instances are used only for running Hive jobs against the data in COS. Since the clusters share the same data and metadata, data created from one Analytics Engine cluster can be read from the other.

Demo

FROM THE FIRST ANALYTICS ENGINE CLUSTER:

1. On Analytics Engine instance baa-337, create an external table "employees" located on COS.

[Image: Create table from the first cluster]
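For reference, the DDL from this step looks roughly like the sketch below. This is a minimal sketch, not the exact command from the screenshot: the column layout, the bucket name "mybucket" and the COS service name "myservice" are placeholder assumptions, and the beeline URL must be adjusted for your cluster.

# Sketch: create a Hive external table whose data lives in COS (run on baa-337).
# "mybucket" and "myservice" are placeholders; "myservice" must match the
# fs.cos.<servicename>.* credentials configured in core-site.xml.
# Adjust the JDBC URL for your cluster; $CLUSTER_PASSWORD is a placeholder.
beeline -u 'jdbc:hive2://localhost:10000' -n clsadmin -p "$CLUSTER_PASSWORD" -e "
CREATE EXTERNAL TABLE employees (
  id   INT,
  name STRING,
  dept STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'cos://mybucket.myservice/employees/';"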

2. Load the employees table with some data.

[Image: Load data into the table from the first cluster]
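The load can be as simple as a Hive insert, which writes the data files straight to the COS location. A sketch with made-up sample rows:

# Sketch: insert a few sample rows; Hive materializes them as files under
# cos://mybucket.myservice/employees/ ($CLUSTER_PASSWORD is a placeholder).
beeline -u 'jdbc:hive2://localhost:10000' -n clsadmin -p "$CLUSTER_PASSWORD" -e "
INSERT INTO employees VALUES
  (1, 'Alice', 'Engineering'),
  (2, 'Bob',   'Sales');"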

3. Select from the employees table.

[Image: Select from the table on the first cluster]
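And the verification query, using the same placeholder connection details as above:

# Sketch: read the rows back on the first cluster.
beeline -u 'jdbc:hive2://localhost:10000' -n clsadmin -p "$CLUSTER_PASSWORD" -e "
SELECT * FROM employees;"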

FROM THE SECOND ANALYTICS ENGINE CLUSTER:

1. On the second Analytics Engine cluster, hzy-700, the table employees is already listed and you can run the select directly.

[Image: Show tables from the second cluster]
[Image: Select from the table on the second cluster]
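No DDL is needed on the second cluster: the table definition comes from the shared PostgreSQL metastore and the data from COS. A sketch, again with placeholder connection details:

# Sketch: from the second cluster (hzy-700) the table is already visible.
beeline -u 'jdbc:hive2://localhost:10000' -n clsadmin -p "$CLUSTER_PASSWORD" -e "SHOW TABLES;"
beeline -u 'jdbc:hive2://localhost:10000' -n clsadmin -p "$CLUSTER_PASSWORD" -e "SELECT * FROM employees;"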

Behind the scenes

Configuring the IBM Analytics Engine cluster to work with IBM Cloud Object Storage can be done in multiple ways. If you choose the static configuration and put the COS credentials in "core-site.xml", they will show up like this in the Ambari UI.

[Image: Cluster configured to work with COS]
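If you prefer the command line to the Ambari UI, one way to sanity-check the result is to read the connector properties back out of the Hadoop configuration and list the bucket. This sketch assumes the Stocator-style fs.cos.<servicename>.* property names; "myservice" and "mybucket" are placeholders:

# Sketch: confirm the COS credentials are wired into the cluster configuration...
hdfs getconf -confKey fs.cos.myservice.endpoint
hdfs getconf -confKey fs.cos.myservice.access.key

# ...and that the cluster can actually reach the bucket.
hadoop fs -ls cos://mybucket.myservice/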

Configuring the cluster to work with PostgreSQL can likewise be done via the UI or using "ad hoc customization". After configuration, your cluster's Hive configs will look like this.

[Image: Cluster configured to work with PostgreSQL as the Hive metastore DB]
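Here too you can verify from the shell instead of the UI; a sketch assuming the usual HDP location of hive-site.xml:

# Sketch: confirm Hive's metastore now points at the external PostgreSQL database.
grep -A1 'javax.jdo.option.ConnectionURL' /etc/hive/conf/hive-site.xml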

The same configuration is repeated for the second cluster as well.

Ad Hoc Customization Script Example for PostgreSQL Configuration

In the demo above, two ways to configure the PostgreSQL DB were mentioned: (a) via the UI and (b) using the ad hoc customization script. Read more about it here.

If you need to create clusters frequently, configuring via the UI can get tedious. The ad hoc customization feature comes in handy in such a scenario: all you need to do is create a shell script that gets executed against the cluster. Automated along with the rest of your cluster-creation scripts, it makes life easy for a cluster administrator.

Specifically for the PostgreSQL customization, we use a ready-made shell script available in the IBM Analytics Engine GitHub repo, invoked with the following customization payload.

To understand what the different fields of the payload are, look here.

  • The first four fields relate to your PostgreSQL DB instance. Make sure you replace the "<CHANGEME>" placeholders with the appropriate values: the user, password, hostname and port.
  • The script also expects the password of the Analytics Engine cluster instance ("clsadminpassword").
  • The last four "<CHANGEME>" fields locate the self-signed certificate of the PostgreSQL instance. The script expects you to place the certificate in an object storage bucket and to specify the COS endpoint, the certificate's location in the bucket, the access key and the secret.
"script": {
"source_type": "http",
"script_path": "https://raw.githubusercontent.com/IBM-Cloud/IBM-Analytics-Engine/master/customization-examples/associate-external-metastore-postgresql.sh"
},
"script_params":
[
"<CHANGEME>ibm_cloud_user_name<CHANGEME>",
"<CHANGEME>postgresdbpassword<CHANGEME>",
"ibmclouddb",
"jdbc:postgresql://<CHANGEME>db_host_name<CHANGEME>.databases.appdomain.cloud:<CHANGEME>db_port<CHANGEME>/ibmclouddb?sslmode=verify-ca",
"<CHANGEME>clsadminpassword<CHANGEME>",
"<CHANGEME>s3.us-south.cloud-object-storage.appdomain.cloud<CHANGEME>",
"/<CHANGEME>mrmadirabucket/postgres-feb4.cert<CHANGEME>",
"<CHANGEME>myCOSaccesskey<CHANGEME>",
"<CHANGEME>myCOSsecret<CHANGEME>"
]
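The snippet above is only the script portion of a customization request. For context, a full request POSTed to the Analytics Engine REST API might look roughly like the sketch below; the wrapper schema, regional API host, instance GUID and IAM token are assumptions on my part, so check the service docs for the exact endpoint:

# Sketch: submit the customization to an existing cluster. customization.json
# wraps the "script" and "script_params" shown above, roughly as:
#   { "target": "all",
#     "custom_actions": [ { "name": "add-postgres-metastore",
#                           "script": { ... }, "script_params": [ ... ] } ] }
# <instance_guid>, the region host and $IAM_TOKEN are placeholders.
curl -X POST \
  "https://api.us-south.ae.cloud.ibm.com/v2/analytics_engines/<instance_guid>/customization_requests" \
  -H "Authorization: Bearer $IAM_TOKEN" \
  -H "Content-Type: application/json" \
  -d @customization.json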

Note the format of the certificate location: it is "/<bucket>/location-to-cert". Also note that the script expects the decoded plain-text certificate in the bucket.

Place the decoded plaintext cert into a bucket of your choice. In this case I have placed the cert with the name "postgres-feb4.cert" in a bucket called "mrmadirabucket".
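A sketch of that step, assuming you saved the base64-encoded CA certificate from your Databases for PostgreSQL connection information locally, and that the AWS-compatible CLI is configured with COS HMAC credentials:

# Sketch: decode the base64 CA certificate ("postgres-ca.b64" is a placeholder
# for the file holding the encoded cert from your connection info)...
base64 -d postgres-ca.b64 > postgres-feb4.cert

# ...and upload the plaintext cert to the bucket referenced in the payload.
aws --endpoint-url "https://s3.us-south.cloud-object-storage.appdomain.cloud" \
  s3 cp postgres-feb4.cert s3://mrmadirabucket/postgres-feb4.cert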

Conclusion

This story demonstrated how you can make clusters stateless and how you can use the customization feature to automate repetitive configuration tasks. In conclusion, make sure to follow the prescribed best practices for using IBM Analytics Engine clusters, in particular those for the PostgreSQL configuration.
