Connect your Scala Application from HBase to Bigtable

Anjali Chimnani
Google Cloud - Community
7 min read · Dec 15, 2022

Migrate your Scala code alongside your data migration from HBase to Google Cloud Bigtable

Migrating on-prem infrastructure to Google Cloud is often followed by migrating the application code that connects to that infrastructure and consumes its data. That code can be written in different languages, each requiring its own changes depending on the infrastructure and the language. Here we walk through one such common requirement: migrating a Scala application that connects to Apache HBase so that it connects to, and consumes data from, Google Cloud Bigtable.

Let’s say we have an HBase cluster with a table named “employee” and a Bigtable instance with a table of the same name, as shown below. Both tables have 2 rows, with row keys 1 and 2.

HBase:

scan 'employee'
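
For reference, a table like this could be created and populated from the HBase shell. The column family name personal and the sample values below are assumptions for illustration, not part of the original setup:

create 'employee', 'personal'
put 'employee', '1', 'personal:name', 'John'
put 'employee', '2', 'personal:name', 'Jane'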

Google Cloud Bigtable:

cbt read employee
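
Similarly, an equivalent table could be created and populated on the Bigtable side with cbt (same illustrative assumptions about the column family and values):

cbt createtable employee
cbt createfamily employee personal
cbt set employee 1 personal:name=John
cbt set employee 2 personal:name=Jane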

The Scala Application connecting to HBase

We have a Scala application that connects to the HBase cluster above and reads the row with row key 1 from the employee table. The application contains the following files:

  • App.scala (the application code, which reads the data from the HBase cluster)
  • Connection.scala (the code that establishes the connection to the HBase cluster, subsequently used by App.scala)
  • pom.xml (configuration for project dependencies, plugins, versions, etc., including the HBase connection configuration)

Note: The application code is simplified to focus on the changes required for migration from HBase to Bigtable.
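
Since pom.xml points sourceDirectory at src/main/scala and the classes live in the com.example package, the project follows the standard Maven/Scala layout:

employee/
├── pom.xml
└── src/
    └── main/
        └── scala/
            └── com/
                └── example/
                    ├── App.scala
                    └── Connection.scala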

App.scala

package com.example

import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration, TableName}
import org.apache.hadoop.conf.Configuration
import scala.collection.JavaConverters._

/**
 * @author anjalichimnani
 */
object App {

  /**
   * Reads the row with the requested row key from the specified table,
   * using the cluster behind the given connection object.
   */
  def readTable(connection: Connection, tableName: String, rowKey: String): Unit = {
    println("Within readTable:")
    val table = connection.getTable(TableName.valueOf(Bytes.toBytes(tableName)))

    val get = new Get(Bytes.toBytes(rowKey))
    val result = table.get(get)

    println(result)

    table.close()
  }

  /**
   * Main function where the code execution initiates.
   * It sets the variable values for the cluster, table name and row key to be read,
   * requests a connection to the cluster,
   * and calls readTable to get data from the cluster.
   *
   * To run the code, specify the HBase cluster host/hosts.
   */
  def main(args: Array[String]): Unit = {
    println("Within Main Method:")
    val zookeeper_quorum = "<HBase-cluster>"
    val tableName = "employee"
    val rowKey = "1"

    val connection = Connection.connect(zookeeper_quorum)
    readTable(connection, tableName, rowKey)

    connection.close()
  }

}
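
println(result) prints the Result’s compact toString. To decode individual cell values, readTable could call a small helper like the sketch below instead. This helper is illustrative, not part of the original application; it uses CellUtil and Bytes, which are already imported above:

// Illustrative helper: decode and print every cell in a Result
def printCells(result: Result): Unit = {
  result.rawCells().foreach { cell =>
    val family    = Bytes.toString(CellUtil.cloneFamily(cell))
    val qualifier = Bytes.toString(CellUtil.cloneQualifier(cell))
    val value     = Bytes.toString(CellUtil.cloneValue(cell))
    println(s"$family:$qualifier = $value")
  }
}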

Connection.scala

package com.example

import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration, TableName}
import org.apache.hadoop.conf.Configuration
import scala.collection.JavaConverters._

object Connection {

  /**
   * Takes the cluster details, creates an HBase configuration and from it a connection.
   * Returns a Connection object that can be used to read/write/update data
   * in the cluster it is connected to.
   */
  def connect(zookeeper_quorum: String): Connection = {
    println("Within Connection:")
    val conf: Configuration = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", zookeeper_quorum)

    val connection = ConnectionFactory.createConnection(conf)

    connection
  }

}
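
Note that hbase.zookeeper.quorum accepts a comma-separated list of hosts. If ZooKeeper listens on a non-default port, that can be set on the same configuration; the line below is an optional addition (2181 is ZooKeeper’s default):

// Optional: only needed when ZooKeeper is not on the default port 2181
conf.set("hbase.zookeeper.property.clientPort", "2181")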

pom.xml

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.example</groupId>
  <artifactId>employee</artifactId>
  <version>1.0-SNAPSHOT</version>
  <name>${project.artifactId}</name>
  <description>My wonderful scala app</description>
  <inceptionYear>2018</inceptionYear>
  <licenses>
    <license>
      <name>My License</name>
      <url>http://....</url>
      <distribution>repo</distribution>
    </license>
  </licenses>

  <properties>
    <maven.compiler.source>1.8</maven.compiler.source>
    <maven.compiler.target>1.8</maven.compiler.target>
    <encoding>UTF-8</encoding>
    <scala.version>2.12.6</scala.version>
    <spark.version>2.4.8</spark.version>
    <hbase.version>2.4.9</hbase.version>
    <scala.compat.version>2.12</scala.compat.version>
    <spec2.version>4.2.0</spec2.version>
  </properties>

  <dependencies>

    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>

    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-server</artifactId>
      <version>${hbase.version}</version>
    </dependency>

    <dependency>
      <groupId>org.apache.spark</groupId>
      <artifactId>spark-core_2.12</artifactId>
      <version>${spark.version}</version>
      <scope>provided</scope>
    </dependency>

    <!-- Test -->
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.12</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.scalatest</groupId>
      <artifactId>scalatest_${scala.compat.version}</artifactId>
      <version>3.0.5</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.specs2</groupId>
      <artifactId>specs2-core_${scala.compat.version}</artifactId>
      <version>${spec2.version}</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.specs2</groupId>
      <artifactId>specs2-junit_${scala.compat.version}</artifactId>
      <version>${spec2.version}</version>
      <scope>test</scope>
    </dependency>
  </dependencies>

  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <!-- see http://davidb.github.com/scala-maven-plugin -->
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.3.2</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
            <configuration>
              <args>
                <arg>-dependencyfile</arg>
                <arg>${project.build.directory}/.scala_dependencies</arg>
              </args>
            </configuration>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-surefire-plugin</artifactId>
        <version>2.21.0</version>
        <configuration>
          <!-- Tests will be run with scalatest-maven-plugin instead -->
          <skipTests>true</skipTests>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.scalatest</groupId>
        <artifactId>scalatest-maven-plugin</artifactId>
        <version>2.0.0</version>
        <configuration>
          <reportsDirectory>${project.build.directory}/surefire-reports</reportsDirectory>
          <junitxml>.</junitxml>
          <filereports>TestSuiteReport.txt</filereports>
          <!-- Comma separated list of JUnit test class names to execute -->
          <jUnitClasses>samples.AppTest</jUnitClasses>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-assembly-plugin</artifactId>
        <version>2.4</version>
        <configuration>
          <descriptorRefs>
            <descriptorRef>jar-with-dependencies</descriptorRef>
          </descriptorRefs>
          <archive>
            <manifest>
              <mainClass>com.example.App</mainClass>
            </manifest>
          </archive>
        </configuration>
        <executions>
          <execution>
            <phase>package</phase>
            <goals>
              <goal>single</goal>
            </goals>
          </execution>
        </executions>
      </plugin>

    </plugins>
  </build>
</project>

Executing the application prints the trace messages from main, connect and readTable, followed by the HBase Result for row key 1.

Migrating the Application to connect to Google Cloud Bigtable

To migrate the application, we only need to change the connect method to use the Bigtable configuration instead of the HBase configuration, pass in the appropriate connection parameters, and add the Bigtable dependency to pom.xml with an appropriate version.

First, changes to Connection.scala

Change the connect method to create the connection through BigtableConfiguration, removing the earlier HBase configuration and connection.

To use the BigtableConfiguration class, import com.google.cloud.bigtable.hbase.BigtableConfiguration.

package com.example

import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration, TableName}
import org.apache.hadoop.conf.Configuration
import scala.collection.JavaConverters._
import com.google.cloud.bigtable.hbase.BigtableConfiguration

object Connection {

  /**
   * Takes the project and instance IDs, creates a Bigtable configuration and from it a connection.
   * Returns a Connection object that can be used to read/write/update data
   * in the instance it is connected to.
   */
  def connect(project_id: String, instance_id: String): Connection = {
    println("Within Connection:")

    val config = BigtableConfiguration.configure(project_id, instance_id)
    val connection = BigtableConfiguration.connect(config)

    connection
  }

}
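
BigtableConfiguration also offers an overload of connect that takes the IDs directly, so the configure step can be folded into a single call:

// Equivalent one-liner using the (projectId, instanceId) overload
val connection = BigtableConfiguration.connect(project_id, instance_id)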

Changes in App.scala

Since the connection parameters have changed from a ZooKeeper quorum in HBase to a project_id and instance_id in Bigtable, change the call to the connect method accordingly.

No other changes are required to read the data, as long as the same table name and row keys exist in Bigtable as in HBase.

package com.example

import org.apache.hadoop.hbase.client._
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{CellUtil, HBaseConfiguration, TableName}
import org.apache.hadoop.conf.Configuration
import scala.collection.JavaConverters._

/**
 * @author anjalichimnani
 */
object App {

  /**
   * Reads the row with the requested row key from the specified table,
   * using the cluster behind the given connection object.
   */
  def readTable(connection: Connection, tableName: String, rowKey: String): Unit = {
    println("Within readTable:")
    val table = connection.getTable(TableName.valueOf(Bytes.toBytes(tableName)))

    val get = new Get(Bytes.toBytes(rowKey))
    val result = table.get(get)

    println(result)

    table.close()
  }

  /**
   * Main function where the code execution initiates.
   * It sets the variable values for the table name and row key to be read,
   * requests a connection to the Bigtable instance,
   * and calls readTable to get data from it.
   *
   * To run the code, specify the Bigtable project ID and instance ID.
   */
  def main(args: Array[String]): Unit = {
    println("Within Main Method:")
    val tableName = "employee"
    val rowKey = "1"
    val project_id = "<google-cloud-project-name>"
    val instance_id = "<bigtable-instance-id>"

    val connection = Connection.connect(project_id, instance_id)
    readTable(connection, tableName, rowKey)

    connection.close()
  }

}
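
Because the Bigtable HBase client implements the standard HBase client API, writes carry over unchanged as well. As a sketch, a method like the one below would work against both HBase and Bigtable; it is illustrative, not part of the original application, and assumes a column family named personal exists:

// Illustrative write using the same HBase API; works against HBase and Bigtable alike
def writeTable(connection: Connection, tableName: String, rowKey: String): Unit = {
  val table = connection.getTable(TableName.valueOf(Bytes.toBytes(tableName)))
  val put = new Put(Bytes.toBytes(rowKey))
  put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("Jane"))
  table.put(put)
  table.close()
}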

Final changes in pom.xml

As we need the class com.google.cloud.bigtable.hbase.BigtableConfiguration, we have to declare the corresponding dependency and its version. Only the additions are shown below; the rest of pom.xml is unchanged from the HBase version. (The bigtable-hbase-2.x-hadoop artifact is intended for Hadoop environments; for a standalone application, Google also publishes a bigtable-hbase-2.x artifact.)

<properties>
  ...
  <bigtable.version>1.26.3</bigtable.version>
</properties>

<dependencies>

  <dependency>
    <groupId>com.google.cloud.bigtable</groupId>
    <artifactId>bigtable-hbase-2.x-hadoop</artifactId>
    <version>${bigtable.version}</version>
  </dependency>

  ...
</dependencies>

After all the changes, the migrated code runs successfully, printing the trace messages followed by the Result for row key 1, now read from the employee table in Bigtable.

Note: To run the code, you must provide your own HBase cluster and Bigtable instance details in place of the “<>” placeholders in the code, i.e. for the following:

//For HBase
val zookeeper_quorum = "<HBase-cluster>"

//For Google Cloud Bigtable
val project_id = "<google-cloud-project-name>"
val instance_id = "<bigtable-instance-id>"
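
One operational difference worth noting: the Bigtable client authenticates with Google Cloud Application Default Credentials rather than connecting through ZooKeeper. One way to supply credentials locally, assuming the gcloud CLI is installed:

gcloud auth application-default login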

To build and execute the jar, run the following in the directory where pom.xml exists:

mvn package

java -jar target/employee-1.0-SNAPSHOT-jar-with-dependencies.jar

That’s all folks!! You have successfully migrated your Scala code from HBase to Google Cloud Bigtable and read data from it.


Thank you for reading. Feel free to share your comments and any further interest in the topic!
