☁️Huawei Cloud HBase Solution With MRS

Hatice Duyar Keskin · Huawei Developers · Jan 25, 2024

Huawei Cloud HBase

Introduction

Hello everyone!

Today, I will tell you about HBase, one of the systems that lets us store big data and run analytics on it. These systems, which used to run on premises, have been integrated into the cloud world over time. Let's learn together how such systems are used on Huawei Cloud.

Enjoyable reading 😊

What is HBase?

As an important member of the Hadoop ecosystem, HBase is a column-oriented distributed database that is suitable for storing sparse, unstructured, and semi-structured data. It features high reliability, high performance, and flexible scalability, and supports real-time read/write. HBase can process very large tables, with billions of rows and millions of columns.

HBase Architecture

HBase Architecture

HBase architecture components:

Master: Master is also called HMaster. In HA mode, HMaster consists of an active HMaster and a standby HMaster.

Active Master: Manages the RegionServers in HBase, including the creation, deletion, modification, and query of tables; balances the load across RegionServers; adjusts the distribution of Regions; splits Regions and distributes them after the split; and migrates Regions after a RegionServer expires.

Standby Master: Takes over services when the active HMaster is faulty. The original active HMaster demotes to the standby HMaster after the fault is rectified.

Client: The client communicates with the Master for management operations and with RegionServers for data read and write operations by using the Remote Procedure Call (RPC) mechanism of HBase.

RegionServer: RegionServer provides read and write services for table data and is the data processing and computing unit in HBase. RegionServers are deployed together with the DataNodes of the HDFS cluster to store data.

ZooKeeper Cluster: ZooKeeper provides distributed coordination services for processes in HBase clusters. Each RegionServer is registered with ZooKeeper so that the active Master can obtain the health status of each RegionServer.

HDFS cluster: HDFS provides highly reliable file storage services for HBase. All HBase data is stored in HDFS.

So, what kind of table does HBase use to store this huge amount of data? Let's examine the HBase table structure together.

HBase Table Features

HBase Table Features

HBase table components:

RowKey: Similar to the primary key in a relational database table, the RowKey is the unique ID of each row of data. A RowKey can be a string, an integer, or a binary string. All records are stored sorted by RowKey.

Timestamp: The timestamp of a data operation. Data can be given different versions by timestamp. Within each cell, data of different versions is stored in descending order of time.

Cell: The minimum storage unit of HBase, consisting of a key and a value. A key consists of six fields: row, column family, column qualifier, timestamp, type, and MVCC version. The value is the binary data object.

Column Family: A table consists of one or more column families in the horizontal direction. A column family can contain any number of columns. A column is a label under a column family and can be added as required when data is written; because a column family supports dynamic expansion, the number and type of columns do not need to be predefined. Columns of an HBase table are sparsely distributed, so the number and type of columns can differ from row to row. Each column family has its own independent time to live (TTL). Only rows can be locked; operations on a row in one column family behave the same as operations on that row in other column families.

Column: Similar to traditional databases, HBase tables also use columns to store data of the same type.
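
To make the RowKey / column family / column qualifier / timestamp structure concrete, here is a minimal Java sketch using the standard HBase 2.x client API. The "row001" RowKey is a hypothetical placeholder, and obtaining the Table handle from a Connection is shown later in this article.

import java.io.IOException;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class CellStructureExample {

    // Print every cell of one row, showing how each cell key carries the row,
    // column family, column qualifier, and timestamp described above.
    static void printCells(Table table) throws IOException {
        Get get = new Get(Bytes.toBytes("row001")); // hypothetical RowKey
        get.readVersions(3); // HBase 2.x API: return up to 3 versions per cell

        Result result = table.get(get);
        for (Cell cell : result.rawCells()) {
            System.out.printf("row=%s family=%s qualifier=%s ts=%d value=%s%n",
                    Bytes.toString(CellUtil.cloneRow(cell)),
                    Bytes.toString(CellUtil.cloneFamily(cell)),
                    Bytes.toString(CellUtil.cloneQualifier(cell)),
                    cell.getTimestamp(),
                    Bytes.toString(CellUtil.cloneValue(cell)));
        }
    }
}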

How do you think HBase protects itself in a possible worst-case scenario while storing such large amounts of data?

HBase HA Solution

HMaster in HBase allocates Regions. When a RegionServer stops, HMaster migrates its Regions to other RegionServers. The HMaster HA feature is introduced to prevent the HMaster single point of failure (SPOF) from affecting HBase functions.

HBase HA Solution

The HMaster HA architecture is built based on Ephemeral nodes (temporary nodes) created in the ZooKeeper cluster.

Upon startup, HMaster nodes try to create a master znode in the ZooKeeper cluster. The HMaster node that creates the master znode first becomes the active HMaster, and the other is the standby HMaster.

The standby HMaster adds a watch event on the master znode. If the service on the active HMaster stops, the active HMaster disconnects from the ZooKeeper cluster, and after its session expires the master znode it created disappears. The standby HMaster detects this through the watch event, creates a new master znode, and becomes the active HMaster; the active/standby switchover is then complete. If the failed node detects that a master znode already exists after it restarts, it enters the standby state and adds a watch event on the master znode.

When a client accesses HBase, it first obtains the address of the active HMaster from the master znode information in ZooKeeper and then establishes a connection to the active HMaster.
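
As a minimal sketch of this behaviour, the Java client below connects to HBase through the ZooKeeper quorum only; the host names, port, and table name are hypothetical placeholders. The resulting Table handle is the `table` object assumed by the code snippets in the next section.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;

public class HBaseConnectionExample {
    public static void main(String[] args) throws Exception {
        // The client does not need the HMaster address: it only knows the
        // ZooKeeper quorum and discovers the active HMaster from the master znode.
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3");       // hypothetical ZooKeeper hosts
        conf.set("hbase.zookeeper.property.clientPort", "2181"); // default ZooKeeper client port

        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("table"))) {
            // 'table' is the handle used by the Put/Get/Scan snippets below.
            System.out.println("Connected to " + table.getName());
        }
    }
}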

Let's take a look together at the commands used to read and write data in HBase.

HIndex and HBase Commands:

HBase is a distributed key-value store. Data in a table is sorted in lexicographic (alphabetical) order by row key. If you query data by a specific row key or scan a specified row key range, HBase can quickly locate the target data, which makes such queries efficient.

However, in many real scenarios you need to query rows whose column value matches a specific value. HBase provides the Filter feature for this: all data is scanned in row key order and matched against the specified column value until the required data is found. Because the Filter feature has to scan a lot of unnecessary data to find the few rows it needs, it cannot meet the requirements of frequent queries with high performance standards.
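
To illustrate the difference, here is a hedged Java sketch using the HBase 2.x client API, with hypothetical column names and values: a direct Get by RowKey versus a full scan with SingleColumnValueFilter that matches a column value.

import java.io.IOException;

import org.apache.hadoop.hbase.CompareOperator;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class RowKeyVsFilterExample {

    // Fast path: locate a single row directly by its RowKey.
    static Result getByRowKey(Table table, String rowKey) throws IOException {
        return table.get(new Get(Bytes.toBytes(rowKey)));
    }

    // Slow path: find rows whose column cf1:q1 equals "valueA". Without an index,
    // the filter is evaluated while scanning the table in RowKey order.
    static void scanByColumnValue(Table table) throws IOException {
        Scan scan = new Scan();
        scan.setFilter(new SingleColumnValueFilter(
                Bytes.toBytes("cf1"), Bytes.toBytes("q1"),
                CompareOperator.EQUAL, Bytes.toBytes("valueA")));
        try (ResultScanner scanner = table.getScanner(scan)) {
            for (Result result : scanner) {
                System.out.println("matched row: " + Bytes.toString(result.getRow()));
            }
        }
    }
}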

HBase HIndex is designed to address these issues. HBase HIndex enables HBase to query data based on specific column values.

Write:

// Insert one row with RowKey "row": two columns in column family cf1 and one in cf2.
Put put = new Put(Bytes.toBytes("row"));
put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("q1"), Bytes.toBytes("valueA"));
put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("q2"), Bytes.toBytes("valueB"));
put.addColumn(Bytes.toBytes("cf2"), Bytes.toBytes("q2"), Bytes.toBytes("valueC"));
table.put(put);

Query:

scan 'table', {FILTER=>"SingleColumnValueFilter('cf1','q1',>=,'binary:valueA',true,true) AND SingleColumnValueFilter('cf1','q2',>=,'binary:valueB',true,true) AND SingleColumnValueFilter('cf2','q1',>=,'binary:valueC',true,true) "}

scan 'table', {FILTER=>"SingleColumnValueFilter('cf1','q1',=,'binary:valueA',true,true) AND SingleColumnValueFilter('cf1','q2',>=,'binary:valueB',true,true)" }

scan 'table', {FILTER=>"SingleColumnValueFilter('cf1','q1',>=,'binary:valueA',true,true) AND SingleColumnValueFilter('cf1','q2',>=,'binary:valueB',true,true) AND SingleColumnValueFilter('cf2','q1',>=,'binary:valueC',true,true)",STARTROW=>'row001',STOPROW=>'row100'}

Update:

// Original row: four columns across two column families.
Put put1 = new Put(Bytes.toBytes("row"));
put1.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("q1"), Bytes.toBytes("valueA"));
put1.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("q2"), Bytes.toBytes("valueB"));
put1.addColumn(Bytes.toBytes("cf2"), Bytes.toBytes("q1"), Bytes.toBytes("valueC"));
put1.addColumn(Bytes.toBytes("cf2"), Bytes.toBytes("q2"), Bytes.toBytes("valueD"));
table.put(put1);

// "Update": another Put to the same RowKey adds new columns (or new versions of existing ones).
Put put2 = new Put(Bytes.toBytes("row"));
put2.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("q3"), Bytes.toBytes("valueE"));
put2.addColumn(Bytes.toBytes("cf2"), Bytes.toBytes("q3"), Bytes.toBytes("valueF"));
table.put(put2);
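
Note that an "update" in HBase is simply another Put to the same RowKey: the version with the newest timestamp is returned when you read the row back. A small sketch, assuming the same `table` handle and names used above:

import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

// Read the row back after the second Put; the newest version of each cell is returned.
Result result = table.get(new Get(Bytes.toBytes("row")));
byte[] latest = result.getValue(Bytes.toBytes("cf1"), Bytes.toBytes("q3"));
System.out.println(Bytes.toString(latest)); // expected: "valueE"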

HIndex has the following constraints:

• Rolling upgrade is not supported for index data.

• Do not configure any split policy for tables that contain index data.

• Other mutation operations, such as increment and append, are not supported.

• Indexing a column whose maxVersions is greater than 1 is not supported.

• The indexed column in a data row cannot be updated.

• A table to which an index is added cannot contain any value larger than 32 KB.

• If user data is deleted because the column-level TTL expires, the corresponding index data is not deleted immediately; it is deleted during the next major compaction.

• The TTL of the user column family cannot be modified after the index is created.

If the TTL of a column family is increased after an index is created, delete the index and re-create it. Otherwise, some generated index data will be deleted before the user data is deleted.

If the TTL of a column family is decreased after an index is created, the index data will be deleted after the user data is deleted.

• Index queries do not support reverse scans, and query results are unordered.

• Indexes do not support the clone snapshot operation.

• Index tables must use HIndexWALPlayer to replay logs; WALPlayer cannot be used.

• Running the deleteall command on an index table has low performance.

• Index tables do not support HBCK. To use HBCK to repair an index table, delete the index data first.

Now let’s see how to integrate this technology into Huawei Cloud.

Huawei Cloud HBase Solution with MRS

Log in to the Huawei Cloud console.

Huawei Cloud Login Page

Select the “MRS” option from the Service List.

Services List

Click the “Buy Cluster” button on the MRS dashboard.

MRS Buy Cluster

Select the HBase component for the cluster.

Cluster Component

Conclusion

I used Huawei Cloud services in this article. I recommend that you pay attention to billing when using these services, and turn them off after you are done.

You can provide feedback through comments and reactions. Thanks!
