Data Analysis with Huawei Cloud GaussDB (Data Warehouse Service)

Elif Meriç
Huawei Developers
Published in
10 min readJun 8, 2023
GaussDB (Data Warehouse Service)

Introduction

Hi, everyone! 😊 In this article, you will learn more about data warehouses and Huawei Cloud Data Warehouse Service GaussDB. To know more about GaussDB and data warehouses, we will make a data analysis practice. Let’s get started, enjoyable readings! ☕

What is Data Warehouse?

Data Warehouse is a system used to store and report data. Data from various sources is stored in a data warehouse for long-term storage and analysis after undergoing an extraction and transformation process (ETL). The data kept in the data warehouse can be accessed when necessary, and analysis and reporting processes can be performed. That is, data warehouses are a helpful structure for companies in decision making. Data warehouses, unlike databases, store historical information. In data warehouses, data can only be read, deleted or updated. In other words, only data uploading and accessing operations can be done in the data warehouse.

What are the Advantages of a Data Warehouse?

✔Users can access all data from a single system.

✔All collected data is stored in a uniform format.

✔In this way, it provides consistency and a more accurate interpretation.

✔Stores historical data, so it can give us the information about historical data

✔Stores structured, semi-structured, relational data

In the figure below, the differences between data lakes and data warehouses can be seen.

The Differences Between Data Lakes and Data Warehouses

In the figure below, a general data warehouse architecture can be seen.

Data Warehouse Architecture

What is Huawei Cloud GaussDB (Data Warehouse Service)?

GaussDB (DWS) is an online data processing database that runs on the Huawei Cloud infrastructure to provide scalable, fully-managed, and out-of-the-box analytic database service. There is no need to perform complex database management operations and monitoring. GaussDB (DWS) provides several types:

1️⃣ Standard Data Warehouse:

Standard Data Warehouses can analyze both hot and cold data, and it is a highly cost-effective solution. There is no limit for its storage and computing resources, and it can be elastically scaled. It is suitable for the converged analysis that requires integrated databases, warehouses, lakes, and marts. It is most suitable for OLAP.

2️⃣ Stream Data Warehouse:

It provides efficient time series computing and IoT real-time analysis capabilities based on the standard data warehouse and supports correlation between real-time and historical data.

3️⃣ Hybrid Data Warehouse:

It provides high-concurrency, high performance and low-latency transaction processing capabilities at low costs based on large-scale data query and analysis capabilities. It can be used to process HTAP hybrid loads. Additionally, it can be deployed in both standalone, or cluster mode.

What are the Differences Between OLAP and OLTP?

OLTP (On Line Transaction Processing)

The main aim of the OLTP systems is to process data, not analyze the data. In the OLTP systems, to perform several operations at the same time, these operations should be consistent, insulated and stable. Most of the data storage sytems used in everyday life are OLTP systems such as ATMs etc.

OLAP (On Line Analytical Processing)

OLAP is a system that provides data analysis for decision support systems and reporting. The main purpose of these systems, unlike OLTP, is not to process data, but to analyze it. Compared to OLTP, it performs analysis and reporting with much better performance.

Advantages of GaussDB (DWS)

✨ High performance

✨ High scalability

✨ Robust reliability

✨ Low cost

✨ Ease of use

Application Scenario

🎯 Enhanced ETL and Real-Time BI Analysis

Data Warehouse Service provides a platform for a BI system for collecting, storing, and analyzing massive volumes of data. We can perform real-time, data migration etc. scenarios with a cost-effectively high performance.

Enhanced ETL and Real-Time BI Analysis Scenario

Application Scenario: Supply Chain Requirement Analysis of a Company with GaussDB (DWS)

The purpose of this scenario: Analyzing the the order data of a company and its suppliers.

1️⃣ Analyze the revenue brought by suppliers in a region to the company. The results can be used to determine whether a local allocation center needs to be established in a given region.

2️⃣ Analyze the relationship between parts and suppliers to obtain the number of suppliers for parts based on the specified contribution conditions. The results can be used to determine whether suppliers are sufficient for large order quantities when the task is urgent.

3️⃣ Analyze the revenue loss of small orders.

Preparations

  1. Create a GaussDB (DWS) cluster.
  2. Obtain AK & SK of the account
  3. A cluster has been created and connected using Data Studio

Step 1: Create a cluster

Click on the Create Cluster button to create a cluster.

Note: Select the cluster region as Singapore for this scenario.

Creating a Cluster

Select the specifications as below.

Creating a Cluster

In the network configuration step, select the VPC, Subnet, Security Group. In the EIP part, select “Buy Now’, and for ELB, selec “Do not use”.

Creating a Cluster

Fill the required informations such as cluster name, administrator account and administrator password, etc.

Creating a Cluster
Creating a Cluster

Step 2: Install Data Studio Client to Connect to a Cluster

In this step, we will install Data Studio client. Navigate to the Connections tab in the DWS console, then click Download.

Download Data Studio Client

After downloading the client, decompress the file and double click on the Data studio.exe file and execute it. After that, create a New Database Connection and fill the required informations as below:

Note: Enter your Public Network Domain Name in the Host parameter. To check this value, go to the DWS console, click on the DWS cluster you have created, and then check the Public Network Domain Name parameter in the Connections part in the Cluster Information tab.

Creating a Database Connection in Data Studio Client

After connecting to the DWS cluster on Data Studio, we need to create schema and tables and then import the data into these tables from specified OBS bucket.

We will create a database table as indicated in the schema below:

TPC-H data tables

To create these tables, execute the following query in the gaussdb database. The query below will be create the schema named tpch and the tables.

CREATE schema tpch; 
set current_schema = tpch;

drop table if exists region;
CREATE TABLE REGION
(
R_REGIONKEY INT NOT NULL ,
R_NAME CHAR(25) NOT NULL ,
R_COMMENT VARCHAR(152)
)
with (orientation = column, COMPRESSION=MIDDLE)
distribute by replication;

drop table if exists nation;
CREATE TABLE NATION
(
N_NATIONKEY INT NOT NULL,
N_NAME CHAR(25) NOT NULL,
N_REGIONKEY INT NOT NULL,
N_COMMENT VARCHAR(152)
)
with (orientation = column,COMPRESSION=MIDDLE)
distribute by replication;

drop table if exists supplier;
CREATE TABLE SUPPLIER
(
S_SUPPKEY BIGINT NOT NULL,
S_NAME CHAR(25) NOT NULL,
S_ADDRESS VARCHAR(40) NOT NULL,
S_NATIONKEY INT NOT NULL,
S_PHONE CHAR(15) NOT NULL,
S_ACCTBAL DECIMAL(15,2) NOT NULL,
S_COMMENT VARCHAR(101) NOT NULL
)
with (orientation = column,COMPRESSION=MIDDLE)
distribute by hash(S_SUPPKEY);

drop table if exists customer;
CREATE TABLE CUSTOMER
(
C_CUSTKEY BIGINT NOT NULL,
C_NAME VARCHAR(25) NOT NULL,
C_ADDRESS VARCHAR(40) NOT NULL,
C_NATIONKEY INT NOT NULL,
C_PHONE CHAR(15) NOT NULL,
C_ACCTBAL DECIMAL(15,2) NOT NULL,
C_MKTSEGMENT CHAR(10) NOT NULL,
C_COMMENT VARCHAR(117) NOT NULL
)
with (orientation = column,COMPRESSION=MIDDLE)
distribute by hash(C_CUSTKEY);

drop table if exists part;
CREATE TABLE PART
(
P_PARTKEY BIGINT NOT NULL,
P_NAME VARCHAR(55) NOT NULL,
P_MFGR CHAR(25) NOT NULL,
P_BRAND CHAR(10) NOT NULL,
P_TYPE VARCHAR(25) NOT NULL,
P_SIZE BIGINT NOT NULL,
P_CONTAINER CHAR(10) NOT NULL,
P_RETAILPRICE DECIMAL(15,2) NOT NULL,
P_COMMENT VARCHAR(23) NOT NULL
)
with (orientation = column,COMPRESSION=MIDDLE)
distribute by hash(P_PARTKEY);

drop table if exists partsupp;
CREATE TABLE PARTSUPP
(
PS_PARTKEY BIGINT NOT NULL,
PS_SUPPKEY BIGINT NOT NULL,
PS_AVAILQTY BIGINT NOT NULL,
PS_SUPPLYCOST DECIMAL(15,2) NOT NULL,
PS_COMMENT VARCHAR(199) NOT NULL
)
with (orientation = column,COMPRESSION=MIDDLE)
distribute by hash(PS_PARTKEY);

drop table if exists orders;
CREATE TABLE ORDERS
(
O_ORDERKEY BIGINT NOT NULL,
O_CUSTKEY BIGINT NOT NULL,
O_ORDERSTATUS CHAR(1) NOT NULL,
O_TOTALPRICE DECIMAL(15,2) NOT NULL,
O_ORDERDATE DATE NOT NULL ,
O_ORDERPRIORITY CHAR(15) NOT NULL,
O_CLERK CHAR(15) NOT NULL ,
O_SHIPPRIORITY BIGINT NOT NULL,
O_COMMENT VARCHAR(79) NOT NULL
)
with (orientation = column,COMPRESSION=MIDDLE)
distribute by hash(O_ORDERKEY);

drop table if exists lineitem;
CREATE TABLE LINEITEM
(
L_ORDERKEY BIGINT NOT NULL,
L_PARTKEY BIGINT NOT NULL,
L_SUPPKEY BIGINT NOT NULL,
L_LINENUMBER BIGINT NOT NULL,
L_QUANTITY DECIMAL(15,2) NOT NULL,
L_EXTENDEDPRICE DECIMAL(15,2) NOT NULL,
L_DISCOUNT DECIMAL(15,2) NOT NULL,
L_TAX DECIMAL(15,2) NOT NULL,
L_RETURNFLAG CHAR(1) NOT NULL,
L_LINESTATUS CHAR(1) NOT NULL,
L_SHIPDATE DATE NOT NULL,
L_COMMITDATE DATE NOT NULL ,
L_RECEIPTDATE DATE NOT NULL,
L_SHIPINSTRUCT CHAR(25) NOT NULL,
L_SHIPMODE CHAR(10) NOT NULL,
L_COMMENT VARCHAR(44) NOT NULL
)
with (orientation = column,COMPRESSION=MIDDLE)
distribute by hash(L_ORDERKEY);

After that, create a foreign table to identify and associate the source data on OBS. Change the <Acces Key> and <Secret Key> sections with yours.

CREATE schema tpchobs;
set current_schema='tpchobs';
drop FOREIGN table if exists region;
CREATE FOREIGN TABLE REGION
(
like tpch.region
)
SERVER gsmpp_server
OPTIONS (
encoding 'utf8',
location 'obs://dws-demo-ap-southeast-3/tpch/region.tbl',
format 'text',
delimiter '|',
access_key '<Access Key>',
secret_access_key '<Secret Key>',
chunksize '64',
IGNORE_EXTRA_DATA 'on'
);

drop FOREIGN table if exists nation;
CREATE FOREIGN TABLE NATION
(
like tpch.nation
)
SERVER gsmpp_server
OPTIONS (
encoding 'utf8',
location 'obs://dws-demo-ap-southeast-3/tpch/nation.tbl',
format 'text',
delimiter '|',
access_key '<Access Key>',
secret_access_key '<Secret Key>',
chunksize '64',
IGNORE_EXTRA_DATA 'on'
);

drop FOREIGN table if exists supplier;
CREATE FOREIGN TABLE SUPPLIER
(
like tpch.supplier
)
SERVER gsmpp_server
OPTIONS (
encoding 'utf8',
location 'obs://dws-demo-ap-southeast-3/tpch/supplier.tbl',
format 'text',
delimiter '|',
access_key '<Access Key>',
secret_access_key '<Secret Key>',
chunksize '64',
IGNORE_EXTRA_DATA 'on'
);

drop FOREIGN table if exists customer;
CREATE FOREIGN TABLE CUSTOMER
(
like tpch.customer
)
SERVER gsmpp_server
OPTIONS (
encoding 'utf8',
location 'obs://dws-demo-ap-southeast-3/tpch/customer.tbl',
format 'text',
delimiter '|',
access_key '<Access Key>',
secret_access_key '<Secret Key>',
chunksize '64',
IGNORE_EXTRA_DATA 'on'
);
drop FOREIGN table if exists part;
CREATE FOREIGN TABLE PART
(
like tpch.part

)
SERVER gsmpp_server
OPTIONS (
encoding 'utf8',
location 'obs://dws-demo-ap-southeast-3/tpch/part.tbl',
format 'text',
delimiter '|',
access_key '<Access Key>',
secret_access_key '<Secret Key>',
chunksize '64',
IGNORE_EXTRA_DATA 'on'
);
drop FOREIGN table if exists partsupp;
CREATE FOREIGN TABLE PARTSUPP
(
like tpch.partsupp
)
SERVER gsmpp_server
OPTIONS (
encoding 'utf8',
location 'obs://dws-demo-ap-southeast-3/tpch/partsupp.tbl',
format 'text',
delimiter '|',
access_key '<Access Key>',
secret_access_key '<Secret Key>',
chunksize '64',
IGNORE_EXTRA_DATA 'on'
);
drop FOREIGN table if exists orders;
CREATE FOREIGN TABLE ORDERS
(
like tpch.orders
)
SERVER gsmpp_server
OPTIONS (
encoding 'utf8',
location 'obs://dws-demo-ap-southeast-3/tpch/orders.tbl',
format 'text',
delimiter '|',
access_key '<Access Key>',
secret_access_key '<Secret Key>',
chunksize '64',
IGNORE_EXTRA_DATA 'on'
);
drop FOREIGN table if exists lineitem;
CREATE FOREIGN TABLE LINEITEM
(
like tpch.lineitem
)
SERVER gsmpp_server
OPTIONS (
encoding 'utf8',
location 'obs://dws-demo-ap-southeast-3/tpch/lineitem.tbl',
format 'text',
delimiter '|',
access_key '<Access Key>',
secret_access_key '<Secret Key>',
chunksize '64',
IGNORE_EXTRA_DATA 'on'
);

Then, execute the following query to import the foreign table data to the corresponding database table. Run the insert command to import the data in the OBS foreign table to the GaussDB(DWS) database table.

insert into tpch.lineitem select * from tpchobs.lineitem;
insert into tpch.part select * from tpchobs.part;
insert into tpch.partsupp select * from tpchobs.partsupp;
insert into tpch.customer select * from tpchobs.customer;
insert into tpch.supplier select * from tpchobs.supplier;
insert into tpch.nation select * from tpchobs.nation;
insert into tpch.region select * from tpchobs.region;
insert into tpch.orders select * from tpchobs.orders;

Performing Multi-table and Theme Analysis

In this part, we will analyze the revenues by querying the revenue statistics of a spare parts supplier in a region. The revenue is calculated based on sum( l_extendedprice * (1 — l_discount)). The statistics can be used to determine whether a local allocation center needs to be established in a given region.

set current_schema='tpch';
Select
n_name,
sum(l_extendedprice * (1 - l_discount)) as revenue
from
customer,
orders,
lineitem,
supplier,
nation,
region
where
c_custkey = o_custkey
and l_orderkey = o_orderkey
and l_suppkey = s_suppkey
and c_nationkey = s_nationkey
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'ASIA'
and o_orderdate >= '1994-01-01'::date
and o_orderdate < '1994-01-01'::date + interval '1 year'
group by
n_name
order by
revenue desc;

After that, we will query the relationship between spare parts and suppliers. This information can be used to determine whether there are sufficient suppliers when the order quantity is large and the task is urgent.

set current_schema='tpch';
select
p_brand,
p_type,
p_size,
count(distinct ps_suppkey) as supplier_cnt
from
partsupp,
part
where
p_partkey = ps_partkey
and p_brand <> 'Brand#45'
and p_type not like 'MEDIUM POLISHED%'
and p_size in (49, 14, 23, 45, 19, 3, 36, 9)
and ps_suppkey not in (
select
s_suppkey
from
supplier
where
s_comment like '%Customer%Complaints%'
)
group by
p_brand,
p_type,
p_size
order by
supplier_cnt desc,
p_brand,
p_type,
p_size
limit 100;

Finally, we will query the revenue loss of small orders. We will filter out small orders that are lower than the 20% of the average supply volume, and calculate the total amount of those small orders to figure out the average annual revenue loss.

set current_schema='tpch';
select
sum(l_extendedprice) / 7.0 as avg_yearly
from
lineitem,
part
where
p_partkey = l_partkey
and p_brand = 'Brand#23'
and p_container = 'MED BOX'
and l_quantity < (
select 0.2 * avg(l_quantity)
from lineitem
where l_partkey = p_partkey
);

Conclusion📚

In this article, we had knowledge about data warehouses, the differences between data warehouses and data lakes, Huawei Cloud GaussDB (Data Warehouse Service), differences between OLAP and OLTP systems, GaussDB data warehouse types and its advantages, Data Warehouse application scenarios, and finally we performed a supply chain requirement data analysis of a company on Huawei Cloud GaussDB(DWS) service.

References 📊

--

--