Data Analysis with Huawei Cloud GaussDB (Data Warehouse Service)

Published in

Huawei Developers

10 min readJun 8, 2023

Introduction

Hi, everyone! 😊 In this article, you will learn more about data warehouses and Huawei Cloud Data Warehouse Service GaussDB. To know more about GaussDB and data warehouses, we will make a data analysis practice. Let’s get started, enjoyable readings! ☕

What is Data Warehouse?

Data Warehouse is a system used to store and report data. Data from various sources is stored in a data warehouse for long-term storage and analysis after undergoing an extraction and transformation process (ETL). The data kept in the data warehouse can be accessed when necessary, and analysis and reporting processes can be performed. That is, data warehouses are a helpful structure for companies in decision making. Data warehouses, unlike databases, store historical information. In data warehouses, data can only be read, deleted or updated. In other words, only data uploading and accessing operations can be done in the data warehouse.

What are the Advantages of a Data Warehouse?

✔Users can access all data from a single system.

✔All collected data is stored in a uniform format.

✔In this way, it provides consistency and a more accurate interpretation.

✔Stores historical data, so it can give us the information about historical data

✔Stores structured, semi-structured, relational data

In the figure below, the differences between data lakes and data warehouses can be seen.

In the figure below, a general data warehouse architecture can be seen.

What is Huawei Cloud GaussDB (Data Warehouse Service)?

GaussDB (DWS) is an online data processing database that runs on the Huawei Cloud infrastructure to provide scalable, fully-managed, and out-of-the-box analytic database service. There is no need to perform complex database management operations and monitoring. GaussDB (DWS) provides several types:

1️⃣ Standard Data Warehouse:

Standard Data Warehouses can analyze both hot and cold data, and it is a highly cost-effective solution. There is no limit for its storage and computing resources, and it can be elastically scaled. It is suitable for the converged analysis that requires integrated databases, warehouses, lakes, and marts. It is most suitable for OLAP.

2️⃣ Stream Data Warehouse:

It provides efficient time series computing and IoT real-time analysis capabilities based on the standard data warehouse and supports correlation between real-time and historical data.

3️⃣ Hybrid Data Warehouse:

It provides high-concurrency, high performance and low-latency transaction processing capabilities at low costs based on large-scale data query and analysis capabilities. It can be used to process HTAP hybrid loads. Additionally, it can be deployed in both standalone, or cluster mode.

What are the Differences Between OLAP and OLTP?

OLTP (On Line Transaction Processing)

The main aim of the OLTP systems is to process data, not analyze the data. In the OLTP systems, to perform several operations at the same time, these operations should be consistent, insulated and stable. Most of the data storage sytems used in everyday life are OLTP systems such as ATMs etc.

OLAP (On Line Analytical Processing)

OLAP is a system that provides data analysis for decision support systems and reporting. The main purpose of these systems, unlike OLTP, is not to process data, but to analyze it. Compared to OLTP, it performs analysis and reporting with much better performance.

Advantages of GaussDB (DWS)

✨ High performance

✨ High scalability

✨ Robust reliability

✨ Low cost

✨ Ease of use

Application Scenario

🎯 Enhanced ETL and Real-Time BI Analysis

Data Warehouse Service provides a platform for a BI system for collecting, storing, and analyzing massive volumes of data. We can perform real-time, data migration etc. scenarios with a cost-effectively high performance.

Application Scenario: Supply Chain Requirement Analysis of a Company with GaussDB (DWS)

The purpose of this scenario: Analyzing the the order data of a company and its suppliers.

1️⃣ Analyze the revenue brought by suppliers in a region to the company. The results can be used to determine whether a local allocation center needs to be established in a given region.

2️⃣ Analyze the relationship between parts and suppliers to obtain the number of suppliers for parts based on the specified contribution conditions. The results can be used to determine whether suppliers are sufficient for large order quantities when the task is urgent.

3️⃣ Analyze the revenue loss of small orders.

Preparations

Create a GaussDB (DWS) cluster.
Obtain AK & SK of the account
A cluster has been created and connected using Data Studio

Step 1: Create a cluster

Click on the Create Cluster button to create a cluster.

Note: Select the cluster region as Singapore for this scenario.

Select the specifications as below.

In the network configuration step, select the VPC, Subnet, Security Group. In the EIP part, select “Buy Now’, and for ELB, selec “Do not use”.

Fill the required informations such as cluster name, administrator account and administrator password, etc.

Step 2: Install Data Studio Client to Connect to a Cluster

In this step, we will install Data Studio client. Navigate to the Connections tab in the DWS console, then click Download.

After downloading the client, decompress the file and double click on the Data studio.exe file and execute it. After that, create a New Database Connection and fill the required informations as below:

Note: Enter your Public Network Domain Name in the Host parameter. To check this value, go to the DWS console, click on the DWS cluster you have created, and then check the Public Network Domain Name parameter in the Connections part in the Cluster Information tab.

Creating a Database Connection in Data Studio Client

After connecting to the DWS cluster on Data Studio, we need to create schema and tables and then import the data into these tables from specified OBS bucket.

We will create a database table as indicated in the schema below:

To create these tables, execute the following query in the gaussdb database. The query below will be create the schema named tpch and the tables.

CREATE schema tpch; 
set current_schema = tpch;

drop table if exists region;
CREATE TABLE REGION
(
        R_REGIONKEY  INT NOT NULL , 
        R_NAME       CHAR(25) NOT NULL ,
        R_COMMENT    VARCHAR(152)
)
with (orientation = column, COMPRESSION=MIDDLE)
distribute by replication;

drop table if exists nation;
CREATE TABLE NATION
(
        N_NATIONKEY  INT NOT NULL, 
        N_NAME       CHAR(25) NOT NULL,
        N_REGIONKEY  INT NOT NULL,
        N_COMMENT    VARCHAR(152)
)
with (orientation = column,COMPRESSION=MIDDLE)
distribute by replication;

drop table if exists supplier;
CREATE TABLE SUPPLIER
(
        S_SUPPKEY     BIGINT NOT NULL,
        S_NAME        CHAR(25) NOT NULL,
        S_ADDRESS     VARCHAR(40) NOT NULL,
        S_NATIONKEY   INT NOT NULL,
        S_PHONE       CHAR(15) NOT NULL,
        S_ACCTBAL     DECIMAL(15,2) NOT NULL,
        S_COMMENT     VARCHAR(101) NOT NULL
)
with (orientation = column,COMPRESSION=MIDDLE)
distribute by hash(S_SUPPKEY);

drop table if exists customer;
CREATE TABLE CUSTOMER
(
        C_CUSTKEY     BIGINT NOT NULL,
        C_NAME        VARCHAR(25) NOT NULL,
        C_ADDRESS     VARCHAR(40) NOT NULL, 
        C_NATIONKEY   INT NOT NULL, 
        C_PHONE       CHAR(15) NOT NULL, 
        C_ACCTBAL     DECIMAL(15,2)   NOT NULL,
        C_MKTSEGMENT  CHAR(10) NOT NULL, 
        C_COMMENT     VARCHAR(117) NOT NULL
)
with (orientation = column,COMPRESSION=MIDDLE)
distribute by hash(C_CUSTKEY);

drop table if exists part;
CREATE TABLE PART
(
        P_PARTKEY     BIGINT NOT NULL, 
        P_NAME        VARCHAR(55) NOT NULL, 
        P_MFGR        CHAR(25) NOT NULL, 
        P_BRAND       CHAR(10) NOT NULL, 
        P_TYPE        VARCHAR(25) NOT NULL,
        P_SIZE        BIGINT NOT NULL,
        P_CONTAINER   CHAR(10) NOT NULL,
        P_RETAILPRICE DECIMAL(15,2) NOT NULL,
        P_COMMENT     VARCHAR(23) NOT NULL
)
with (orientation = column,COMPRESSION=MIDDLE)
distribute by hash(P_PARTKEY);

drop table if exists partsupp;
CREATE TABLE PARTSUPP
(
        PS_PARTKEY     BIGINT NOT NULL,
        PS_SUPPKEY     BIGINT NOT NULL, 
        PS_AVAILQTY    BIGINT NOT NULL,
        PS_SUPPLYCOST  DECIMAL(15,2)  NOT NULL, 
        PS_COMMENT     VARCHAR(199) NOT NULL
)
with (orientation = column,COMPRESSION=MIDDLE)
distribute by hash(PS_PARTKEY);

drop table if exists orders;
CREATE TABLE ORDERS
(
        O_ORDERKEY       BIGINT NOT NULL,
        O_CUSTKEY        BIGINT NOT NULL, 
        O_ORDERSTATUS    CHAR(1) NOT NULL, 
        O_TOTALPRICE     DECIMAL(15,2) NOT NULL,
        O_ORDERDATE      DATE NOT NULL , 
        O_ORDERPRIORITY  CHAR(15) NOT NULL, 
        O_CLERK          CHAR(15) NOT NULL , 
        O_SHIPPRIORITY   BIGINT NOT NULL,
        O_COMMENT        VARCHAR(79) NOT NULL
)
with (orientation = column,COMPRESSION=MIDDLE)
distribute by hash(O_ORDERKEY);

drop table if exists lineitem;
CREATE TABLE LINEITEM
(
        L_ORDERKEY    BIGINT NOT NULL,
        L_PARTKEY     BIGINT NOT NULL, 
        L_SUPPKEY     BIGINT NOT NULL,
        L_LINENUMBER  BIGINT NOT NULL,
        L_QUANTITY    DECIMAL(15,2) NOT NULL, 
        L_EXTENDEDPRICE  DECIMAL(15,2) NOT NULL,
        L_DISCOUNT    DECIMAL(15,2) NOT NULL,
        L_TAX         DECIMAL(15,2) NOT NULL, 
        L_RETURNFLAG  CHAR(1) NOT NULL,
        L_LINESTATUS  CHAR(1) NOT NULL,
        L_SHIPDATE    DATE NOT NULL, 
        L_COMMITDATE  DATE NOT NULL ,
        L_RECEIPTDATE DATE NOT NULL, 
        L_SHIPINSTRUCT CHAR(25) NOT NULL, 
        L_SHIPMODE     CHAR(10) NOT NULL, 
        L_COMMENT      VARCHAR(44) NOT NULL
)
with (orientation = column,COMPRESSION=MIDDLE)
distribute by hash(L_ORDERKEY);

After that, create a foreign table to identify and associate the source data on OBS. Change the <Acces Key> and <Secret Key> sections with yours.

CREATE schema tpchobs;
set current_schema='tpchobs';
drop FOREIGN table if exists region;
CREATE FOREIGN TABLE REGION
(
        like tpch.region
)                    
SERVER gsmpp_server 
OPTIONS (
        encoding 'utf8',
        location 'obs://dws-demo-ap-southeast-3/tpch/region.tbl',
        format 'text',
        delimiter '|',
        access_key '<Access Key>',
        secret_access_key '<Secret Key>',
        chunksize '64',
        IGNORE_EXTRA_DATA 'on'
);
 
drop FOREIGN table if exists nation;
CREATE FOREIGN TABLE NATION
(
        like tpch.nation
)
SERVER gsmpp_server 
OPTIONS (
         encoding 'utf8',
         location 'obs://dws-demo-ap-southeast-3/tpch/nation.tbl',
         format 'text',
         delimiter '|',
         access_key '<Access Key>',
         secret_access_key '<Secret Key>',
         chunksize '64',
         IGNORE_EXTRA_DATA 'on'
);
 
drop FOREIGN table if exists supplier;
CREATE FOREIGN TABLE SUPPLIER
(
        like tpch.supplier
)
SERVER gsmpp_server 
OPTIONS (
        encoding 'utf8',
        location 'obs://dws-demo-ap-southeast-3/tpch/supplier.tbl',
        format 'text',
        delimiter '|',
        access_key '<Access Key>',
        secret_access_key '<Secret Key>',
        chunksize '64',
        IGNORE_EXTRA_DATA 'on'
);
 
drop FOREIGN table if exists customer;
CREATE FOREIGN TABLE CUSTOMER
(
        like tpch.customer
)
SERVER gsmpp_server 
OPTIONS (
        encoding 'utf8',
        location 'obs://dws-demo-ap-southeast-3/tpch/customer.tbl',
        format 'text',
        delimiter '|',
        access_key '<Access Key>',
        secret_access_key '<Secret Key>',
        chunksize '64',
        IGNORE_EXTRA_DATA 'on'
);
drop FOREIGN table if exists part;
CREATE FOREIGN TABLE PART
(
        like tpch.part

)
SERVER gsmpp_server 
OPTIONS (
        encoding 'utf8',
        location 'obs://dws-demo-ap-southeast-3/tpch/part.tbl',
        format 'text',
        delimiter '|',
        access_key '<Access Key>',
        secret_access_key '<Secret Key>',
        chunksize '64',
        IGNORE_EXTRA_DATA 'on'
);
drop FOREIGN table if exists partsupp;
CREATE FOREIGN TABLE PARTSUPP
(
        like tpch.partsupp
)
SERVER gsmpp_server 
OPTIONS (
        encoding 'utf8',
        location 'obs://dws-demo-ap-southeast-3/tpch/partsupp.tbl',
        format 'text',
        delimiter '|',
        access_key '<Access Key>',
        secret_access_key '<Secret Key>',
        chunksize '64',
        IGNORE_EXTRA_DATA 'on'
);
drop FOREIGN table if exists orders;
CREATE FOREIGN TABLE ORDERS
(
        like tpch.orders
)
SERVER gsmpp_server 
OPTIONS (
        encoding 'utf8',
        location 'obs://dws-demo-ap-southeast-3/tpch/orders.tbl',
        format 'text',
        delimiter '|',
        access_key '<Access Key>',
        secret_access_key '<Secret Key>',
        chunksize '64',
        IGNORE_EXTRA_DATA 'on'
);
drop FOREIGN table if exists lineitem;
CREATE FOREIGN TABLE LINEITEM
(
        like tpch.lineitem
)
SERVER gsmpp_server 
OPTIONS (
        encoding 'utf8',
        location 'obs://dws-demo-ap-southeast-3/tpch/lineitem.tbl',
        format 'text',
        delimiter '|',
        access_key '<Access Key>',
        secret_access_key '<Secret Key>',
        chunksize '64',
        IGNORE_EXTRA_DATA 'on'
);

Then, execute the following query to import the foreign table data to the corresponding database table. Run the insert command to import the data in the OBS foreign table to the GaussDB(DWS) database table.

insert into tpch.lineitem select * from tpchobs.lineitem;
insert into tpch.part select * from tpchobs.part;
insert into tpch.partsupp select * from tpchobs.partsupp;
insert into tpch.customer select * from tpchobs.customer;
insert into tpch.supplier select * from tpchobs.supplier;
insert into tpch.nation select * from tpchobs.nation;
insert into tpch.region select * from tpchobs.region;
insert into tpch.orders select * from tpchobs.orders;

Performing Multi-table and Theme Analysis

In this part, we will analyze the revenues by querying the revenue statistics of a spare parts supplier in a region. The revenue is calculated based on sum( l_extendedprice * (1 — l_discount)). The statistics can be used to determine whether a local allocation center needs to be established in a given region.

set current_schema='tpch';
Select
n_name,
sum(l_extendedprice * (1 - l_discount)) as revenue
from
customer,
orders,
lineitem,
supplier,
nation,
region
where
c_custkey = o_custkey
and l_orderkey = o_orderkey
and l_suppkey = s_suppkey
and c_nationkey = s_nationkey
and s_nationkey = n_nationkey
and n_regionkey = r_regionkey
and r_name = 'ASIA'
and o_orderdate >= '1994-01-01'::date
and o_orderdate < '1994-01-01'::date + interval '1 year'
group by
n_name
order by
revenue desc;

After that, we will query the relationship between spare parts and suppliers. This information can be used to determine whether there are sufficient suppliers when the order quantity is large and the task is urgent.

set current_schema='tpch';
select
p_brand,
p_type,
p_size,
count(distinct ps_suppkey) as supplier_cnt
from
partsupp,
part
where
p_partkey = ps_partkey
and p_brand <> 'Brand#45'
and p_type not like 'MEDIUM POLISHED%'
and p_size in (49, 14, 23, 45, 19, 3, 36, 9)
and ps_suppkey not in (
        select
        s_suppkey
        from
        supplier
        where
        s_comment like '%Customer%Complaints%'
)
group by
p_brand,
p_type,
p_size
order by
supplier_cnt desc,
p_brand,
p_type,
p_size
limit 100;

Finally, we will query the revenue loss of small orders. We will filter out small orders that are lower than the 20% of the average supply volume, and calculate the total amount of those small orders to figure out the average annual revenue loss.

set current_schema='tpch';
select
sum(l_extendedprice) / 7.0 as avg_yearly
from
lineitem,
part
where
p_partkey = l_partkey
and p_brand = 'Brand#23'
and p_container = 'MED BOX'
and l_quantity < (
        select 0.2 * avg(l_quantity)
        from lineitem
        where l_partkey = p_partkey
);

Conclusion📚

In this article, we had knowledge about data warehouses, the differences between data warehouses and data lakes, Huawei Cloud GaussDB (Data Warehouse Service), differences between OLAP and OLTP systems, GaussDB data warehouse types and its advantages, Data Warehouse application scenarios, and finally we performed a supply chain requirement data analysis of a company on Huawei Cloud GaussDB(DWS) service.

References 📊

Progressive Knowledge_GaussDB(DWS)_Huawei Cloud

Data Warehouse Service (DWS) is a real-time, simple, secure, and reliable enterprise-class converged data warehouse…

support.huaweicloud.com