Part 2 — Roadmap to Become a Data Engineer for ETL/Data Warehouse Developers

Durga Gadiraju
Data Engineering on Cloud
9 min read · Aug 7, 2022


Are you a traditional ETL Developer who wants to become a Data Engineer but is not sure how? This is Part 2 of the 2-part series where you will learn the details about transitioning from a traditional ETL Developer to a Data Engineer on the Cloud using AWS, Python, SQL, Spark, etc.

As part of this 2-part video series, we will cover how to become a Data Engineer if you are an experienced ETL, PL/SQL, Data Warehouse, or Mainframes Developer. If you are an experienced Oracle PL/SQL Developer, Informatica Developer, Talend Developer, Ab Initio Developer, Microsoft SSIS/SSRS Developer, or DataStage Developer, then it is inevitable for you to transition to Data Engineer.

In these sessions, most of your questions related to why and how you need to transition to Data Engineering will be answered, with examples based on our vast experience.

Here is the program link related to Data Engineering using AWS Analytics — https://itversity.com/bundle/data-engineering-using-aws-analytics

For sales inquiries: support@itversity.com

Part 2 Video — Roadmap to become a Data Engineer (Sample Project)

Here is the Video related to Part 2, which is primarily focused on the Sample Project. Do not forget to watch the Part 1 Video before going through the Part 2 Video.

Agenda for the workshop

Here is the agenda for this detailed workshop, which covers the Roadmap to become a Data Engineer for ETL/Data Warehouse Developers.

  • Introduction about ITVersity
  • Durga’s Expertise and Journey
  • What is Data Engineering, and why should ETL, PL/SQL, Data Warehouse, and Mainframes Developers take it seriously?
  • What are the key skills, and to what level should ETL, PL/SQL, Data Warehouse, and Mainframes Developers learn them?
  • Details about our Guided Program on AWS (others in the future).
  • More details about the Guided Program for Data Engineering using AWS Data Analytics

A Few Important Links

Here are a few important links to stay connected with us.

Special Note to Udemy Customers

  • Thank you for being our esteemed customer.
  • Make sure to rate us and also provide feedback as demonstrated. Your rating and feedback are very important for our community's success.
  • If you are an existing Udemy Customer and not familiar with ITVersity Courses on Udemy, feel free to visit this page.

Data Engineering Sample Project — Problem Statement

Here are the details about the workshop or demos. The project is primarily based on running analytics on top of Salesforce leads/sales data using a Data Lake and Data Engineering Pipelines.

  • Get the data ingested from Salesforce into the bronze layer. The data will be in a semi-structured format.
  • Apply the required transformations and store the data in the Data Warehouse.
  • After applying the required transformations, the data should be well structured and ready for reporting.

Data Engineering Sample Project — Design using AWS Native

Here is the design using AWS Native Data Analytics Services: Salesforce, AWS S3, AWS Lambda or AWS EMR with Spark SQL, Redshift, and a reporting tool such as AWS QuickSight.

  • Salesforce — Source which generates the leads data
  • Appflow — AWS Fully Managed Service which can be used to ingest Salesforce Data periodically into the Data Lake (which is nothing but AWS S3).
  • S3 — AWS Cloud Storage which can act as the Data Lake for Data Analytics Applications.
  • Lambda — Used to build lightweight Data Processing applications. Lambda functions can be deployed using programming languages such as Python.
  • EMR with Spark SQL — Used to build large-scale Data Processing applications. Alternatively, we can also process data using Spark Data Frame APIs on EMR. EMR stands for Elastic MapReduce.
  • Redshift Serverless — Can be used for Ad-hoc Analysis of Data and also to build the required Data Marts for reporting.

Data Engineering Sample Project — Design using Databricks on AWS

Here is the design leveraging Cloud Agnostic Data Lake Platforms such as Databricks. The sample project is built using Salesforce, DBFS on AWS S3, AWS Lambda or Databricks Runtime, Databricks SQL based on Spark SQL, and a reporting tool such as AWS QuickSight.

We can also use Databricks on other prominent cloud platforms such as Azure and GCP.

  • Salesforce — Source which generates the leads data
  • Appflow — AWS Fully Managed Service which can be used to ingest Salesforce Data periodically into the Data Lake (which is nothing but AWS S3).
  • DBFS — Databricks Abstraction on top of Cloud Storage which can act as the Data Lake for Data Analytics Applications. DBFS stands for Databricks File System.
  • Lambda — Used to build lightweight Data Processing applications. Lambda functions can be deployed using programming languages such as Python.
  • Databricks Runtime with Spark SQL — Used to build large-scale Data Processing applications using the Distributed Engine built on Spark.
  • Databricks SQL — Can be used for Ad-hoc Analysis of Data and also to build the required Data Marts for reporting.

The above design is for Databricks on AWS. We can also build a similar solution on Azure, where AWS S3 can be replaced with ADLS, Appflow can be replaced with ADF, and AWS Lambda can be replaced with Azure Functions.

Data Engineering Sample Project — Implementation Plan

Here are the high-level details related to the implementation of Data Engineering using Data from Salesforce.

  • Overview of Salesforce and AWS Appflow
  • Review CSV Data with semi-structured or JSON Fields
  • Upload files to the Bronze Layer of the Data Lake (AWS S3), simulating Data Ingestion from Salesforce via AWS Appflow
  • Overview of Approach — AWS Lambda vs. AWS EMR using Spark SQL
  • Implement the solution to get the data from Bronze Layer to Gold Layer
  • Ad-hoc Analysis using Redshift Spectrum and Load Data into Redshift reporting tables
  • Overview of Reporting using Redshift Tables

Overview of Salesforce and AWS Appflow

Here are some of the details related to Salesforce and AWS Appflow.

  • Salesforce is a globally renowned Cloud-based CRM Application. It is primarily used to streamline leads, sales, and Customer Relationship Management.
  • AWS Appflow is a fully managed Data Flow Service that ingests data from standard applications like Salesforce. It can be used to get data not only from Salesforce but also from other platforms such as ServiceNow.

Typically we can ingest data from Salesforce using AWS Appflow in the form of CSV Files.
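As a reference, here is a minimal sketch of how an already-configured AppFlow flow could be triggered and monitored from Python using boto3. The flow name salesforce_leads_to_s3 is a hypothetical placeholder; the flow itself (Salesforce connection, S3 destination, CSV output) is assumed to have been configured in the AWS console.

```python
import boto3

# A minimal sketch, assuming an on-demand AppFlow flow named
# "salesforce_leads_to_s3" has already been configured with Salesforce
# as the source and an S3 bucket as the destination (CSV output).
appflow = boto3.client("appflow", region_name="us-east-1")

# Kick off one run of the flow (equivalent to "Run flow" in the console).
response = appflow.start_flow(flowName="salesforce_leads_to_s3")
print("Started execution:", response.get("executionId"))

# Check the most recent execution records to confirm the run succeeded.
records = appflow.describe_flow_execution_records(
    flowName="salesforce_leads_to_s3", maxResults=5
)
for record in records["flowExecutions"]:
    print(record["executionId"], record["executionStatus"])
```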

Review CSV Data with semi-structured or JSON Fields

Before we start building the applications on the CSV Data, we need to understand the characteristics of the data.

  • Whether the CSV Files which contain the data have a header or not.
  • Delimiter or Separator. Even though CSV stands for Comma-Separated Values, we might end up using another Separator or Delimiter as well.
  • Enclosing Character
  • Escape Character

Details about Enclosing Character and Escape Character are covered as part of the workshop or lecture in detail.
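To make these characteristics concrete, here is a small sketch using Python's standard csv module. The file name and column names are illustrative assumptions; the point is that the header, delimiter, enclosing (quote) character, and escape character are all explicit options of the reader.

```python
import csv
import json

# A minimal sketch with an illustrative file name and layout:
# leads.csv has a header row, uses "," as the delimiter, '"' as the
# enclosing character, and "\" as the escape character. One of the
# columns ("lead_details") holds a JSON document, which makes the file
# semi-structured even though it is technically a CSV.
with open("leads.csv", newline="") as csv_file:
    reader = csv.DictReader(
        csv_file,
        delimiter=",",    # separator between fields
        quotechar='"',    # enclosing character around fields containing the delimiter
        escapechar="\\",  # escape character for embedded special characters
    )
    for row in reader:
        # Parse the embedded JSON field into a Python dictionary.
        details = json.loads(row["lead_details"])
        print(row["lead_id"], details.get("status"))
```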

Data Ingestion from Salesforce into S3 using Appflow

As we do not have an active Salesforce Account and an integration with Appflow, for the sake of the Sample Project or Workshop the data is already uploaded to AWS S3.

  • Many times as part of projects, the integration might take time. As long as we know the structure of the data, we can generate sample data and build the rest of the data pipeline.
  • This approach might result in some rework, but it will definitely increase the productivity of the team.
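Here is a minimal sketch of how the sample files could be uploaded to the bronze layer of the Data Lake to simulate the AppFlow ingestion. The bucket name, prefix, and local folder are hypothetical placeholders.

```python
from pathlib import Path

import boto3

# A minimal sketch: upload local sample CSV files into the bronze layer
# of the Data Lake. The bucket name, prefix, and local folder are placeholders.
BUCKET = "itv-data-lake-demo"
BRONZE_PREFIX = "bronze/salesforce/leads"

s3 = boto3.client("s3")

for local_file in Path("data/salesforce_leads").glob("*.csv"):
    key = f"{BRONZE_PREFIX}/{local_file.name}"
    s3.upload_file(str(local_file), BUCKET, key)
    print(f"Uploaded {local_file} to s3://{BUCKET}/{key}")
```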

AWS Lambda vs. AWS EMR with Spark SQL

Even though AWS EMR with Spark SQL is effective for large-scale data sets, it might not be a viable solution for small to medium-sized data sets.

  • Instead of using AWS EMR with Apache Spark, it is better to use AWS Lambda Functions for lightweight Data Processing.
  • AWS Lambda Functions take significantly less time to start and are also faster when it comes to lightweight Data Processing.

Solution using AWS Lambda

Here is the solution to transform semi-structured data to structured data using AWS Lambda Functions.
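The actual code is walked through in the workshop; the sketch below only illustrates the general shape such a Lambda Function could take. It reads a CSV file from the bronze prefix, flattens the embedded JSON field into regular attributes, and writes JSON Lines output to the gold prefix. The bucket name, prefixes, and field names are assumptions for illustration.

```python
import csv
import io
import json

import boto3

# A minimal sketch of a lightweight transformation Lambda.
# Bucket, prefixes, and field names are illustrative assumptions.
s3 = boto3.client("s3")
BUCKET = "itv-data-lake-demo"
GOLD_PREFIX = "gold/salesforce/leads"


def lambda_handler(event, context):
    # Expect the object key of the bronze CSV file in the event payload,
    # e.g. {"key": "bronze/salesforce/leads/leads_2022_08_07.csv"}.
    bronze_key = event["key"]
    body = s3.get_object(Bucket=BUCKET, Key=bronze_key)["Body"].read().decode("utf-8")

    output_lines = []
    for row in csv.DictReader(io.StringIO(body)):
        # Flatten the embedded JSON column into top-level attributes.
        details = json.loads(row.pop("lead_details", "{}"))
        output_lines.append(json.dumps({**row, **details}))

    gold_key = f"{GOLD_PREFIX}/{bronze_key.rsplit('/', 1)[-1].replace('.csv', '.json')}"
    s3.put_object(Bucket=BUCKET, Key=gold_key, Body="\n".join(output_lines).encode("utf-8"))
    return {"records": len(output_lines), "gold_key": gold_key}
```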

Solution using Spark SQL on EMR

Here is the solution to process Semi-Structured Data from Salesforce using Spark SQL on EMR.
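Again, the exact queries are covered in the workshop; the sketch below shows the general shape of a Spark SQL job on EMR that reads the bronze CSV data, extracts attributes from the JSON column, and writes Parquet output to the gold layer. The S3 paths, column names, and JSON attributes are assumptions for illustration.

```python
from pyspark.sql import SparkSession

# A minimal sketch of a Spark SQL job that could run on EMR.
# Paths, column names, and JSON attributes are illustrative assumptions.
spark = SparkSession.builder.appName("salesforce-leads-bronze-to-gold").getOrCreate()

bronze_df = (
    spark.read
    .option("header", "true")
    .option("escape", '"')
    .csv("s3://itv-data-lake-demo/bronze/salesforce/leads/")
)
bronze_df.createOrReplaceTempView("leads_bronze")

# Use Spark SQL to pull attributes out of the semi-structured JSON column.
gold_df = spark.sql("""
    SELECT lead_id,
           created_date,
           get_json_object(lead_details, '$.status') AS lead_status,
           get_json_object(lead_details, '$.source') AS lead_source
    FROM leads_bronze
""")

gold_df.write.mode("overwrite").parquet("s3://itv-data-lake-demo/gold/salesforce/leads/")
```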

We can build an Orchestrated Pipeline using either approach by leveraging AWS Step Functions.
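As one possible orchestration sketch, the transformation step could be wrapped in a Step Functions state machine and triggered programmatically or on a schedule. The state machine ARN and input payload below are hypothetical.

```python
import json

import boto3

# A minimal orchestration sketch: trigger an existing Step Functions
# state machine that wraps the Lambda (or EMR) transformation step.
# The state machine ARN and input payload are hypothetical.
sfn = boto3.client("stepfunctions")

execution = sfn.start_execution(
    stateMachineArn="arn:aws:states:us-east-1:123456789012:stateMachine:salesforce-leads-pipeline",
    input=json.dumps({"key": "bronze/salesforce/leads/leads_2022_08_07.csv"}),
)
print("Started pipeline execution:", execution["executionArn"])
```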

Analysis or Reporting using Amazon Redshift

Here are the Redshift Queries used as part of the Sample Project to take care of Analysis as well as Reporting of the data using Amazon Redshift.

Here are the steps involved to run queries against the data in AWS S3.

  • Create a Glue Crawler pointing to the S3 locations
  • Create the Database and Tables by running the Glue Crawler
  • Create an External Schema in Redshift pointing to the Glue Catalog Database. Querying S3 data through external tables this way is known as Redshift Spectrum.
  • Run relevant queries to perform Ad-hoc Analysis. For reporting purposes, it is recommended to ensure the data is copied into Redshift tables.

Reports built against Redshift internal tables perform better compared to Redshift External Tables.
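As a hedged sketch of the Redshift side, the statements below create an external schema over the Glue Catalog database, run an ad-hoc query through Redshift Spectrum, and copy the data into an internal table for reporting. The schema, database, table, IAM role, and workgroup names are placeholders; the statements are submitted here through the boto3 Redshift Data API, but they could just as well be run from the Redshift query editor.

```python
import boto3

# A hedged sketch: all schema, table, role, and workgroup names are placeholders.
redshift_data = boto3.client("redshift-data")

STATEMENTS = [
    # External schema over the Glue Catalog database created by the crawler.
    """
    CREATE EXTERNAL SCHEMA IF NOT EXISTS leads_spectrum
    FROM DATA CATALOG
    DATABASE 'salesforce_leads_db'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
    """,
    # Ad-hoc analysis directly against the data in S3 (Redshift Spectrum).
    """
    SELECT lead_status, count(*) AS lead_count
    FROM leads_spectrum.leads_gold
    GROUP BY lead_status
    ORDER BY lead_count DESC
    """,
    # Copy the data into an internal table so reports run against Redshift storage.
    """
    CREATE TABLE leads_reporting AS
    SELECT * FROM leads_spectrum.leads_gold
    """,
]

for sql in STATEMENTS:
    response = redshift_data.execute_statement(
        WorkgroupName="itv-reporting",  # Redshift Serverless workgroup (placeholder)
        Database="dev",
        Sql=sql,
    )
    print("Submitted statement:", response["Id"])
```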

Path for ETL Developers to be a Data Engineer

Here are the details for seasoned ETL Developers to transition to being a Data Engineer. If you want to enroll in our Data Engineering courses, feel free to reach out to support@itversity.com.

  • Revise your existing SQL, Data Modeling and ETL Skills
  • Enhance your SQL Skills by adding Serverless query engines and the ability to process JSON
  • Learn Programming using Python leveraging your existing ETL Abilities
  • Master Cloud Essentials — Data Lake, IAM, Serverless Functions, Orchestration, etc
  • Ability to Process large volumes of data using Distributed Computing Frameworks such as Spark
  • Understand Architectural Patterns based on Cloud
  • Ability to troubleshoot the issues and tune the performance at different levels.

Guided Program to be a Data Engineer — AWS Data Analytics

Here is the agenda of our Guided Program to be a Data Engineer using AWS Data Analytics Services. If you want to enroll in our Data Engineering courses, feel free to reach out to support@itversity.com.

  • Python and SQL
  • AWS Essentials for Data Engineers for Data Lake, Distributed Compute, Data Warehouse, and other purpose-built services
  • Mastering AWS Lambda Functions for Data Engineers — to build or enhance data pipelines
  • Mastering AWS Elastic Map Reduce for Data Engineers — to build data pipelines to process large-scale data using distributed computing
  • Mastering AWS Redshift as Data Warehouse for Data Engineers — to build Data Marts or Data Warehouse to serve enterprise reports or dashboards
  • Mastering AWS Athena and Glue Data Catalog — for ad-hoc analysis of Data as well as to build data pipelines for large-scale data
  • Mastering Amazon Managed Streaming for Apache Kafka (MSK) — to build streaming data pipelines integrating with Spark and other purpose-built AWS Services
  • Performance Tuning Guide for Data Engineers on AWS — Data Ingestion, Data Lake, Data Processing, Loading Data

Details about Data Engineering using AWS Data Analytics

Here are the details about our Guided Program to be a Data Engineer using AWS Data Analytics Services. If you want to enroll in our Data Engineering courses, feel free to reach out to support@itversity.com.

Cost of the Course

Our standard price for the course starts at USD 400. There are a lot of variations related to pricing, and we would suggest reaching out to our sales team by sending an email to support@itversity.com.

Timelines for the Course

It will take anywhere between 4 and 12 months to complete the program, depending on Full Time vs. Part Time, prior expertise in SQL and Python, etc.

It is recommended to reach out to us at support@itversity.com to understand how long it will take you to complete the program. We give access for up to 14 months and also provide free coupons for the courses on Udemy.

Delivery Mode (Hybrid)

You will get access to all the material on the first day itself (except for those courses which are not ready yet). On top of the Self-Paced content, here are the additional details.

  • Continuous technical support
  • Expert Live Sessions

Labs and Additional Costs

On top of content, you will also get labs on Python, SQL, Hadoop, Spark, and Kafka for 14 Months.

However, you need to take care of the costs associated with AWS Services. It might cost you an additional USD 100 to go through all the content.

Lifetime Access to the content

As part of our Platform, one will get access for 14 months. However, we offer free Udemy Coupons, so you will end up getting lifetime access to the content.

Refund Policy — 14 Days

If you think the course is not for you, the money will be fully refunded within 14 days.

Placement Assistance or Support

We offer limited Placement Assistance or Support. If you need extensive support, it will come at an additional cost.

Alumni Club

We are also working on building an elite Alumni Club to build a stronger Data Engineering community globally. You will have lifetime access to this Club.
