A Data Engineer's Journey to the Cloud — Data Lake

Arvind Kumar · Published in The Startup · Aug 2, 2020 · 9 min read

A generic onboarding flow of data engineers to the cloud.

As data engineers, we first think about where to land data in the cloud, which means thinking about and understanding storage before processing.

In this regard, the crucial thing is to choose storage that is limitless, auto-scalable, Hadoop-compatible, cost-effective, and high-throughput, with easy data upload, data access with authentication and authorization, audit logs, network isolation, data lifecycle management, a choice of replication methods, data security, etc. More info

Next, we (data engineers) think about how to upload data to the cloud: once or repeatedly, pull or push.

Oftentimes, a one-time data transfer is an initial load of 500 TB or even more than 1 PB of data, which should be done through offline transfer devices such as Azure Data Box, Azure Data Box Disk, Azure Import/Export, or Azure partner tools for data migration like WANdisco.

Another scenario is uploading data once during development, such as sample data or production data that is not in terabytes; this can be done through an online network transfer. For example, consider transferring 100 GB once over a 500 Mbps connection.
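As a rough back-of-the-envelope check (a sketch only; real tools such as AzCopy or Storage Explorer add protocol overhead on top of this ideal figure):

```python
# Rough estimate of a one-time online transfer: 100 GB over a 500 Mbps link.
size_gb = 100
link_mbps = 500

size_megabits = size_gb * 1000 * 8      # 1 GB = 1000 MB = 8000 megabits
seconds = size_megabits / link_mbps     # ideal line rate, no overhead
print(f"~{seconds / 60:.0f} minutes at full line rate")  # ~27 minutes
```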

In the case of repeated data uploads pushed to the cloud, this can be achieved by triggering time- or event-based scripts on-premises or in other clouds, written using AzCopy, Azure PowerShell, Azure CLI, or the Azure Storage REST API.
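A minimal sketch of such a push script using the Python SDK (azure-storage-file-datalake); this is just one of the options above, and the account name, container, and file paths are placeholders.

```python
# pip install azure-storage-file-datalake azure-identity
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT_URL = "https://<storage-account>.dfs.core.windows.net"  # placeholder
CONTAINER = "bronze"                                            # placeholder

service = DataLakeServiceClient(account_url=ACCOUNT_URL,
                                credential=DefaultAzureCredential())
fs = service.get_file_system_client(CONTAINER)

# Push one local file; a scheduler (cron, Task Scheduler) or a file-watcher
# event on-premises would call this repeatedly.
def push_file(local_path: str, remote_path: str) -> None:
    file_client = fs.get_file_client(remote_path)
    with open(local_path, "rb") as data:
        file_client.upload_data(data, overwrite=True)

push_file("daily_extract.csv", "sales/2020/08/02/daily_extract.csv")
```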

In the case of repeated data uploads pulled from on-premises or another cloud, this can be achieved using Azure Data Factory, a managed ETL data pipeline service that provides 90+ data source connectors such as SAP, Salesforce, S3, etc.

Once we know how to upload data, we also think about how to upload it securely and make sure it is stored in the cloud securely.

Data Encryption in Transit

We should always use SSL/TLS protocols to exchange data across different locations.

When huge volumes of data are transferred to the cloud, we need to make sure the data goes through a dedicated network line with guaranteed bandwidth from the organization's data center to the cloud data center. Azure ExpressRoute is one solution for this.

If you are transferring over a public network (no dedicated line), consider Azure VPN Gateway and HTTPS.

We can also make HTTPS mandatory for all interactions with ADLS Gen2 via the Azure portal (the "Secure transfer required" setting).
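A sketch of toggling the same setting programmatically, assuming the azure-mgmt-storage management SDK instead of the portal; subscription, resource group, and account names are placeholders.

```python
# pip install azure-mgmt-storage azure-identity
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import StorageAccountUpdateParameters

SUBSCRIPTION_ID = "<subscription-id>"  # placeholder
RESOURCE_GROUP = "<resource-group>"    # placeholder
ACCOUNT_NAME = "<storage-account>"     # placeholder

client = StorageManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Require HTTPS ("secure transfer required") on the ADLS Gen2 account.
client.storage_accounts.update(
    RESOURCE_GROUP,
    ACCOUNT_NAME,
    StorageAccountUpdateParameters(enable_https_traffic_only=True),
)
```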

Data Encryption at Rest

Once data arrives, we need to make sure it is encrypted before it is saved to disk in the Azure data center (ADLS Gen2) and decrypted when it is read back; this is called server-side encryption. By default, ADLS Gen2 encrypts data at rest using Microsoft-managed encryption keys. Customers can also use their own key (customer-managed key) and store it in Azure Key Vault.

Data can also be encrypted before it is uploaded to the cloud and decrypted after it is downloaded; this is called client-side encryption.
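A minimal illustration of the idea using the cryptography package before an upload (a sketch, not the Azure SDK's built-in client-side encryption feature); key handling is deliberately simplified and the file names are hypothetical.

```python
# pip install cryptography azure-storage-file-datalake azure-identity
from cryptography.fernet import Fernet
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

key = Fernet.generate_key()   # in practice, keep this key in Azure Key Vault
fernet = Fernet(key)

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",  # placeholder
    credential=DefaultAzureCredential(),
)
file_client = service.get_file_system_client("bronze").get_file_client("secret.bin")

# Encrypt locally and upload only the ciphertext; the server never sees plaintext.
ciphertext = fernet.encrypt(open("secret.csv", "rb").read())
file_client.upload_data(ciphertext, overwrite=True)

# Download and decrypt back on the client.
plaintext = fernet.decrypt(file_client.download_file().readall())
```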

More info about data security best practices in Azure.

After storing data in ADLS Gen2 with encryption, how do we control access to the data for internal organization users?

A security principal is an object defined in Azure Active Directory (AAD) that represents a user, group, service principal (for an application), or managed identity requesting access to Azure resources.

RBAC

Here, internal users (security principals) are part of my organization's Azure Active Directory. Built-in "Storage Blob Data" roles of ADLS Gen2 storage can be assigned to internal users: Storage Blob Data Reader, Storage Blob Data Contributor (writer), and Storage Blob Data Owner. By assigning different roles to different internal users, data access can be controlled.

POSIX-like access control lists (ACLs)

Roles are assigned to a security principal at the ADLS Gen2 account level, whereas ACLs are assigned to a security principal at the container, folder, or file level. ACLs do not inherit. However, default ACLs can be used to set the ACLs of child subdirectories and files created under the parent directory.
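A minimal sketch of setting access and default ACLs on a directory with the azure-storage-file-datalake SDK; the AAD object ID, account, container, and directory names are placeholders.

```python
# pip install azure-storage-file-datalake azure-identity
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

service = DataLakeServiceClient(
    account_url="https://<storage-account>.dfs.core.windows.net",  # placeholder
    credential=DefaultAzureCredential(),
)
directory = service.get_file_system_client("bronze").get_directory_client("sales")

OBJECT_ID = "<aad-object-id-of-user-or-group>"  # placeholder

# Grant read+execute on this directory, plus default entries so that new
# children created under it get the same ACL.
directory.set_access_control(
    acl=(
        f"user::rwx,group::r-x,other::---,mask::r-x,"
        f"user:{OBJECT_ID}:r-x,"
        f"default:user::rwx,default:group::r-x,default:other::---,default:mask::r-x,"
        f"default:user:{OBJECT_ID}:r-x"
    )
)
```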

The following table lists some common scenarios to help you understand which permissions are needed to perform certain operations on a storage account.

If the requested operation is authorized by the security principal’s RBAC assignments then authorization is immediately resolved and no additional ACL checks are performed. Alternatively, if the security principal does not have an RBAC assignment or the request’s operation does not match the assigned permission, then ACL checks are performed to determine if the security principal is authorized to perform the requested operation.
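In pseudocode, the evaluation order looks roughly like this (a sketch; rbac_allows and acl_allows are hypothetical helpers standing in for Azure's internal checks):

```python
def is_authorized(principal, operation, path) -> bool:
    # 1. RBAC is checked first; a matching role assignment resolves immediately.
    if rbac_allows(principal, operation):          # hypothetical helper
        return True
    # 2. Otherwise, fall back to the POSIX-like ACL checks on the path.
    return acl_allows(principal, operation, path)  # hypothetical helper
```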

More info

How do we provide access to ADLS Gen2 for external users, apps, or groups?

If both the data sender and receiver have access to the Azure portal, data can be exchanged and tracked using the Azure Data Share service. This exchange process can be automated as well.

Storage Account Access Key

An external user, such as an application or a person accessing ADLS Gen2 without an Azure portal or Azure Active Directory login, can access ADLS Gen2 via the account access key (a kind of password). However, sharing the account access key with external users is not recommended, because with this key an external user gets full control of the storage account, including delete/create and delegating access.

Shared Access Signature

For an external user/app/group, a token with a time-based expiry should be generated for activities like uploading a file to ADLS Gen2. This token can be generated using either the account access key or a user delegation key (which requires Azure AD credentials). The recommended approach is a user delegation SAS token, which at least helps to track the caller because it involves AAD. More info

The generated token or access key can then be used in client-side code to interact with the ADLS Gen2 REST APIs over HTTPS.
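A sketch of generating a user delegation SAS with the azure-storage-blob SDK and then using it over HTTPS; the account, container, and blob names are placeholders.

```python
# pip install azure-storage-blob azure-identity requests
from datetime import datetime, timedelta
import requests
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient, BlobSasPermissions, generate_blob_sas

ACCOUNT = "<storage-account>"  # placeholder
CONTAINER = "bronze"           # placeholder
BLOB = "sales/report.csv"      # placeholder

service = BlobServiceClient(
    account_url=f"https://{ACCOUNT}.blob.core.windows.net",
    credential=DefaultAzureCredential(),
)

# The user delegation key is backed by AAD, so the caller can be tracked.
start = datetime.utcnow()
expiry = start + timedelta(hours=1)
delegation_key = service.get_user_delegation_key(start, expiry)

sas = generate_blob_sas(
    account_name=ACCOUNT,
    container_name=CONTAINER,
    blob_name=BLOB,
    user_delegation_key=delegation_key,
    permission=BlobSasPermissions(read=True),
    expiry=expiry,
)

# The external caller only needs the URL + SAS, sent over HTTPS.
url = f"https://{ACCOUNT}.blob.core.windows.net/{CONTAINER}/{BLOB}?{sas}"
print(requests.get(url).status_code)
```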

In short: SAS tokens for external users and RBAC/ACLs for internal users.

Once I allow certain users or applications access to ADLS Gen2, how do I restrict their access to requests coming from a certain network?

First, enable the firewall, i.e., switch ADLS Gen2 from "All networks" to "Selected networks" access.

If the objective is to allow access requests coming from an Azure virtual network, then that particular virtual network is added/whitelisted in the ADLS Gen2 firewall.

If the objective is to allow requests coming from a particular IP (e.g., your machine's IP) or an IP range (e.g., the organization's network), then that IP or IP range is added to the ADLS Gen2 firewall.
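A hedged sketch of configuring these firewall rules with the azure-mgmt-storage SDK (the portal or CLI work equally well); the subscription, resource group, account, subnet ID, and IP range are placeholders, and the subnet is assumed to already have the Microsoft.Storage service endpoint enabled.

```python
# pip install azure-mgmt-storage azure-identity
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient
from azure.mgmt.storage.models import (
    StorageAccountUpdateParameters, NetworkRuleSet, IPRule, VirtualNetworkRule,
)

client = StorageManagementClient(DefaultAzureCredential(), "<subscription-id>")  # placeholder

SUBNET_ID = (
    "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Network"
    "/virtualNetworks/<vnet>/subnets/<subnet>"  # placeholder
)

client.storage_accounts.update(
    "<resource-group>",   # placeholder
    "<storage-account>",  # placeholder
    StorageAccountUpdateParameters(
        network_rule_set=NetworkRuleSet(
            default_action="Deny",  # "Selected networks" only
            ip_rules=[IPRule(ip_address_or_range="203.0.113.0/24")],  # org IP range
            virtual_network_rules=[
                VirtualNetworkRule(virtual_network_resource_id=SUBNET_ID)
            ],
        )
    ),
)
```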

How can I explore new data quickly without provisioning any infrastructure, paying only for what I use?

Azure Synapse SQL on-demand is a serverless distributed query engine that auto-generates code for you to explore data. You can also modify the generated code, for example the GROUP BY section, and visualize the result in different chart types. It supports file formats such as CSV, Parquet (including nested), and JSON. More info
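As an illustration (not the auto-generated Synapse script), here is a hedged sketch of querying Parquet files in the lake from Python via the serverless SQL endpoint; the workspace name, account, path, ODBC driver, and authentication settings are all assumptions about your environment.

```python
# pip install pyodbc
import pyodbc

# Serverless ("on-demand") SQL endpoint of the Synapse workspace; placeholder names.
conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=<workspace>-ondemand.sql.azuresynapse.net;"
    "Database=master;"
    "Authentication=ActiveDirectoryInteractive;UID=<user>@<tenant>;"
)

# Query Parquet files in the lake directly; no cluster to provision.
sql = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://<storage-account>.dfs.core.windows.net/bronze/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS rows;
"""
for row in conn.cursor().execute(sql):
    print(row)
```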

How should I organize data in ADLS Gen2?

You can create three containers: Bronze (raw), Silver (cleansed), and Gold (analytics). In each container, you store data from different departments or different data sources after the corresponding stage of processing.

In each container you can create a folder hierarchy; it might look like <data source name>/year=<year>/month=<month>/day=<day>/hour=<hour>. The most frequently queried column should be used in the partition strategy for better query performance and less data scanning.
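A tiny sketch of building such a partition path; the source name and timestamp are illustrative.

```python
from datetime import datetime

def partition_path(source: str, ts: datetime) -> str:
    """Build a lake path partitioned by the most frequently queried time columns."""
    return f"{source}/year={ts:%Y}/month={ts:%m}/day={ts:%d}/hour={ts:%H}"

print(partition_path("sales", datetime(2020, 8, 2, 13)))
# sales/year=2020/month=08/day=02/hour=13
```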

If you are storing huge volumes of data for multiple departments, you might create a container for each department and, within each department's container, create folders like Bronze, Silver, and Gold.

While deciding on the folder structure, we should consider the expected I/O operations, file sizes, and user access control at the folder level. More info on the data lake performance tuning guidelines.

Let us discuss what these containers or folders are — Bronze, Silver, and Gold.

Bronze container: the landing zone where raw data arrives from different data sources located in different places, such as on-premises, other clouds, or the same cloud. The recommendation is to keep this data immutable and allow only read access to any compute, data engineer, or data scientist after landing. If possible, ingest data into this container in a compressed format to reduce storage cost. Apply an automatic lifecycle management policy to move data from the hot tier to cool and from cool to archive for cost optimization. In general, this raw data is accessed only by data engineers and data scientists. We also keep historical data here for compliance purposes at the lowest cost.
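A sketch of such a lifecycle policy, written as the JSON schema the storage management API expects; the rule name, prefix, and day thresholds are illustrative, and applying it with the Azure CLI is shown only in a comment.

```python
import json

# Tier Bronze data from hot -> cool -> archive based on age (thresholds are illustrative).
policy = {
    "rules": [
        {
            "enabled": True,
            "name": "bronze-tiering",
            "type": "Lifecycle",
            "definition": {
                "filters": {"blobTypes": ["blockBlob"], "prefixMatch": ["bronze/"]},
                "actions": {
                    "baseBlob": {
                        "tierToCool": {"daysAfterModificationGreaterThan": 30},
                        "tierToArchive": {"daysAfterModificationGreaterThan": 180},
                    }
                },
            },
        }
    ]
}

with open("policy.json", "w") as f:
    json.dump(policy, f, indent=2)

# Apply with, for example:
#   az storage account management-policy create \
#       --account-name <storage-account> --resource-group <rg> --policy @policy.json
```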

Silver container: data is read from the Bronze container and then goes through a quality-check process. The data is curated and put into a more presentable form that data or business analysts can also understand. It is essentially a cleaning or data prep step, e.g., replacing null values, removing extra spaces, converting multiple spellings of the same word into a single one, etc. Once this process is finished, the curated data is stored in the Silver container. In the spirit of data democratization, it can be exposed to multiple stakeholders through Azure Data Catalog to help them understand what is available inside the data lake. This is also known as the data preparation step in the context of machine learning model training.
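A minimal PySpark sketch of this Bronze-to-Silver cleanup, assuming the medallion paths above; the column names and cleanup rules are hypothetical.

```python
# Runs on any Spark environment wired to ADLS Gen2 (e.g. Databricks, Synapse Spark).
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("bronze-to-silver").getOrCreate()

bronze = spark.read.parquet(
    "abfss://bronze@<storage-account>.dfs.core.windows.net/sales/"  # placeholder
)

silver = (
    bronze
    .withColumn("country", F.trim(F.col("country")))                  # remove extra spaces
    .withColumn("country", F.when(F.col("country").isin("USA", "U.S.A."), "US")
                             .otherwise(F.col("country")))            # unify spellings
    .na.fill({"discount": 0.0})                                       # replace null values
    .dropDuplicates()
)

silver.write.mode("overwrite").parquet(
    "abfss://silver@<storage-account>.dfs.core.windows.net/sales/"    # placeholder
)
```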

Gold container: data is read from the Silver container, where it is already curated (no duplicates, a single consistent value/spelling for each term), and can now be enriched, for example by joining it with other reference data for analytics purposes, and written into the Gold container. Gold container data can be used for timely reporting and for data mart creation after loading it into a DWH.
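A companion sketch of the Silver-to-Gold step, again in PySpark; the reference dataset, join key, and aggregation are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("silver-to-gold").getOrCreate()

BASE = "abfss://{layer}@<storage-account>.dfs.core.windows.net"  # placeholder account

sales = spark.read.parquet(f"{BASE.format(layer='silver')}/sales/")
regions = spark.read.parquet(f"{BASE.format(layer='silver')}/reference/regions/")  # reference data

# Enrich curated sales with region attributes, then aggregate for reporting.
gold = (
    sales.join(regions, on="region_id", how="left")
         .groupBy("region_name", "year", "month")
         .sum("revenue")
)

gold.write.mode("overwrite").parquet(f"{BASE.format(layer='gold')}/sales_by_region/")
```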

This complete pipeline of data movement and processing can be orchestrated using Azure Data Factory (low-code/no-code, UI-based drag-and-drop for pipeline activities). For data processing, we can use serverless ADF Data Flows for simple transformations/data preparation and/or provision compute such as Azure Databricks, Azure HDInsight, or Synapse SQL/Spark for complex data transformations and enrichment.

How do we plan the development of an end-to-end data pipeline?

It is very important for data engineers to spend enough time understanding data sources along the 5 Vs (volume, velocity, variety, veracity, value) before starting to ingest into the cloud. The outcome of this 5 Vs analysis helps to design the data pipeline better. For example, if a source generates data once a day, you don't need streaming; you can go with a batch processing architecture.

The architecture of any data platform/pipeline evolves as we onboard different data sources and serve new business requirements. The Kappa architecture is always recommended — More info.

From the data engineer's point of view, a single integrated development environment (IDE) or workspace is needed to develop the end-to-end pipeline, one that helps build data pipeline activities in a single window with security, access control, single sign-on, and monitoring. Such a workspace should let us:

  • Ingest data in high volume (size) and with velocity (real time) from different locations (e.g., on-premises) into ADLS Gen2.
  • Analyze and process data in different structures and formats, such as JSON, Parquet, XML, CSV, Excel, etc., using the language of choice, such as Spark, SQL, Java, Scala, R, Python, or .NET.
  • Visualize results immediately after processing.
  • Orchestrate and schedule the pipeline with monitoring and alerts.

Azure Synapse Analytics not only lets data engineers develop an end-to-end data pipeline in a single IDE but also provides an analytics platform for data scientists, data analysts, and BI dashboard developers.
