Huawei Cloud Data Lake Factory (DLF)

Elif Meriç
Huawei Developers
Apr 4, 2023

Introduction

Hi everyone! 🤗👋 In this article, we will talk about the Huawei Cloud Data Lake Factory (DLF) service. We will learn how to manage big data services and how to create and schedule job pipelines on Huawei Cloud with DLF. In this way, DLF makes our big data development processes easier ☕. Let’s start, and enjoy the read 😉

Data Lake Factory is a one-stop big data development platform that lets us manage multiple big data services from a single place. DLF offers several capabilities, such as:

1️⃣ Job scheduling,

2️⃣ Monitoring,

3️⃣ Script development,

4️⃣ Data management,

5️⃣ Data integration.

Data Lake Factory provides many advantages, such as data lake development and one-stop data warehouse building.

✅ Advantages of Data Lake Factory

Data Lake Factory (DLF) provides many advantages. The most important of these is that DLF provides a one-stop big data development environment: it enables us to perform operations such as script development, job scheduling, data integration, data management, and monitoring on a single platform. Additionally, it enables us to manage different big data services like DLI and DWS, so DLF makes it possible to schedule and orchestrate these services.

DLF supports diverse data types, allows online editing of SQL and Shell scripts, and supports real-time script queries. Additionally, it allows job development for node types such as SQL, MR, Machine Learning, Spark, Shell, and Data Migration, and it provides job scheduling for these jobs. In short, we can manage, schedule, and monitor big data components from one place.

Advantages of DLF

Data Lake Factory (DLF) supports the services below:

  • MRS
  • DLI
  • CDM
  • RDS
  • DWS
  • CloudTable

Data Lake Factory enables us to manage multiple data warehouses like MRS Hive, DLI, and DWS, and to manage data tables using either the interface or Data Definition Language (DDL). Data Lake Factory can work with Cloud Data Migration (CDM) to provide reliable data transmission between various data sources and to integrate those sources into data warehouses.
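As a small, hedged illustration of the DDL-based approach, the Scala sketch below runs the kind of CREATE TABLE statement that a Spark SQL compatible engine accepts. The table name, columns, and session setup are made up for this example; in DLF you would normally run the same DDL from the SQL script editor rather than from code.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: a tiny program that issues DDL through Spark SQL.
// The table and column names are hypothetical.
object TableDdlSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("table-ddl-sketch")
      .master("local[*]") // local session for the sketch; on DLI/MRS the service provides the cluster
      .getOrCreate()

    // Create a managed Parquet table if it does not exist yet.
    spark.sql(
      """CREATE TABLE IF NOT EXISTS sales_orders (
        |  order_id    BIGINT,
        |  customer_id BIGINT,
        |  amount      DOUBLE,
        |  order_date  DATE
        |) USING parquet""".stripMargin)

    // Verify that the table is registered in the catalog.
    spark.sql("SHOW TABLES").show()

    spark.stop()
  }
}
```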

DLF provides an online script editor to develop SQL and Shell scripts on the console. It also provides a visual interface to build data processing workflows with a drag-and-drop mechanism. Lastly, it supports monitoring, which allows us to view the operation details of each job and each node, and it supports various alert methods. Additionally, you can check the Huawei Cloud DLF documentation [1] to learn more about DLF ❗ 🤗

Data Lake Factory Console Overview

Various functions that DLF provides, such as Data Development and Monitoring, can be accessed from the navigation pane on the left. A data connection can be created from the console, as shown in the screenshot below.

Creating a Data Connection

The following Data Connection types are supported in DLF:

  • MRS SparkSQL
  • DWS
  • MRS PrestoSQL
  • DLI
  • MRS Hive
  • MRS Kafka
  • RDS

DLF Solution Architecture Example: Building Cloud Data Warehouses

The Huawei Cloud Data Lake Factory (DLF) service allows us to migrate offline data to Huawei Cloud more quickly and easily. After data has been migrated from on-premises data sources to the cloud, DLF makes it possible to integrate it into cloud data warehouses such as DWS, DLI, and HBase.

Building Cloud Data Warehouses

How to Develop a Spark Job on DLF?

The general process to develop a Spark job on DLF

Prerequisites

1️⃣ An OBS bucket should be created to store the JAR package.

2️⃣ A Data Lake Insight (DLI) queue should be created to provide physical resources for the Spark job.

Step 1: Creating a DLI Queue

Firstly, go to the Data Lake Insight console, click Resources in the navigation pane, and then select Queue Management. After that, click the Buy Queue button to create a DLI queue. As the DLI type, select For General Purpose, and choose Dedicated Resource Mode for the Spark cluster.

Creating a DLI Queue on DLI Console
Steps for Buying a Queue

Step 2: Preparing Spark Job Codes

After preparing the JAR package of the Spark job code, upload it to the OBS bucket created earlier. Then, go to Configuration > Manage Resource and click the Create Resource button to create a resource that points to the uploaded JAR package.

Managing Resources
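To make Step 2 more concrete, here is a minimal sketch of the kind of Spark job code that could be packaged into that JAR. The object name, OBS paths, and argument layout are assumptions for illustration only; the fully qualified name of this object is what would later be entered as the Major Job Class in Step 3.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical batch job: counts words in text files and writes the result as Parquet.
// Package this into a JAR (for example with `sbt package` or `mvn package`) and upload it to OBS.
object WordCountJob {
  def main(args: Array[String]): Unit = {
    // No master URL is hard-coded: on DLI, the queue created in Step 1 supplies the cluster resources.
    val spark = SparkSession.builder()
      .appName("dlf-demo-word-count")
      .getOrCreate()

    // Input and output locations are assumed to arrive as program arguments,
    // e.g. obs://your-bucket/input/ and obs://your-bucket/output/ (placeholder paths).
    val inputPath  = args(0)
    val outputPath = args(1)

    import spark.implicits._
    spark.read.textFile(inputPath)            // Dataset[String] with a single "value" column
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .groupBy("value")
      .count()
      .write.mode("overwrite").parquet(outputPath)

    spark.stop()
  }
}
```

With this layout, the JAR built from the project is the artifact uploaded to the OBS bucket and registered on the Manage Resource page.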

Step 3: Creating a Spark Job

To create a Spark job on DLF, navigate to Develop Job on the console. Then, right-click in the Jobs pane and click Create Job. After that, provide the required parameters such as Job Name, Agency, Log Path in OBS, etc.

Create a Spark Job

After creating the job, go to the Develop Job page and click the job you have created. On this page, there are various node types for different purposes, such as Data Integration, Compute & Analytics, Manage Resource, and Data Monitoring, as well as other nodes such as SMN. The related nodes can be used by dragging and dropping them onto the canvas. In this scenario, the DLI Spark node will be used. After selecting the DLI Spark node, parameters such as Node Name, Job Name, DLI Queue (the DLI queue created in the first step, which provides the physical resources), Major Job Class (the Java/Scala main class of the batch processing job), and Spark program resource package (the resource JAR package created on the Manage Resource page) should be defined.

Developing a Job

After defining the required parameters, click the Test button to test the job, and then click Save if there is no problem in the job logs.

💠 In other cases, different pipelines can be created with the nodes that DLF provides. For example, in the architecture below, an MRS Kafka node is set up to collect messages, MRS Spark SQL and MRS Flink nodes are set up to process the collected data, and an OBS Manager node is set up to store files.

Application Scenario
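As a rough sketch of what the processing part of such a pipeline might run, the Scala example below uses Spark Structured Streaming to consume messages from Kafka, aggregate them per minute, and write the results to storage. The broker address, topic name, and output paths are placeholders, and the spark-sql-kafka connector is assumed to be on the classpath; the actual nodes in the DLF workflow would be configured through the console rather than written by hand like this.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

// Hypothetical streaming job for the "collect with Kafka, process with Spark" scenario above.
object KafkaEventCounter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-event-counter")
      .getOrCreate()

    // Read the raw messages collected by the Kafka node (placeholder broker and topic).
    val messages = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker-1:9092")
      .option("subscribe", "collected-events")
      .load()
      .selectExpr("CAST(value AS STRING) AS body", "timestamp")

    // Count events per one-minute window as a simple stand-in for real processing logic.
    val counts = messages
      .withWatermark("timestamp", "5 minutes")
      .groupBy(window(col("timestamp"), "1 minute"))
      .count()

    // Write the aggregates out; in the architecture above this could be OBS or a warehouse table.
    val query = counts.writeStream
      .outputMode("append")
      .format("parquet")
      .option("path", "obs://your-bucket/event-counts/")                      // placeholder path
      .option("checkpointLocation", "obs://your-bucket/checkpoints/counts/")  // placeholder path
      .start()

    query.awaitTermination()
  }
}
```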

Besides that, DLF provides job scheduling capabilities, allowing us to schedule our jobs. As can be seen in the figure below, DLF provides scheduling types such as Run once, Run periodically, and Event-based, and we can then set up the scheduling properties.

Conclusion

Taking into consideration all of these advantages provided by the Huawei Cloud Data Lake Factory (DLF) service, big data processes, which can sometimes be quite complex, can be handled easily. Jobs can be scheduled, and the scheduling capability provided by DLF brings automation: jobs run automatically at the times specified in DLF. Besides scheduling, DLF provides various functions such as monitoring, script development, data integration, and data management.

📚References

  1. Huawei Cloud Data Lake Factory Documentation
