Part 1 — Roadmap to Become a Data Engineer for ETL/Data Warehouse Developers

Durga Gadiraju
Data Engineering on Cloud
8 min read · Aug 6, 2022


Are you a traditional ETL Developer who wants to become a Data Engineer but is not sure how? Here is part 1 of a 2-part series where you will learn the details of transitioning from traditional ETL Developer to Data Engineer on the Cloud using AWS, Python, SQL, Spark, etc.

As part of this 2-part video series, we will cover how to become a Data Engineer if you are an experienced ETL, PL/SQL, Data Warehouse, or Mainframes Developer. If you are an experienced Oracle PL/SQL, Informatica, Talend, Ab Initio, Microsoft SSIS/SSRS, or DataStage Developer, then transitioning to Data Engineering is inevitable for you.

In these sessions we answer most of your questions about why and how you need to transition to Data Engineering, with examples based on our vast experience.

Here is the program link related to Data Engineering using AWS Analytics — https://itversity.com/bundle/data-engineering-using-aws-analytics

For sales inquiries: support@itversity.com

Part 1 Video — Roadmap to become a Data Engineer

Agenda for the workshop

Here is the agenda for this detailed workshop, which covers the roadmap to become a Data Engineer for ETL/Data Warehouse Developers.

  • Introduction about ITVersity
  • Durga’s Expertise and Journey
  • What is Data Engineering, and why should ETL, PL/SQL, Data Warehouse, and Mainframes Developers take it seriously?
  • What are the key skills, and to what level should ETL, PL/SQL, Data Warehouse, and Mainframes Developers learn them?
  • Details about our Guided Program on AWS (others in the future).
  • More details about the Guided Program for Data Engineering using AWS Data Analytics

A Few Important Links

Here are a few important links to stay connected with us.

Special Note to Udemy Customers

  • Thank you for being our esteemed customer.
  • Make sure to rate us and also provide feedback as demonstrated. Your rating and feedback are very important for our community's success.
  • If you are an existing Udemy customer and are not familiar with ITVersity courses on Udemy, feel free to visit this page.

Introduction about ITVersity

  • Started as a YouTube channel in 2014; it now has 60,000+ subscribers.
  • Started with Hadoop and evolved with the industry. Expertise in the following areas:

Hadoop and its ecosystem of tools such as Sqoop, Hive, Flume, etc.

Apache Spark and Apache Kafka

Cloud Platforms — AWS, Azure, GCP

Docker and Kubernetes

  • Launched our first course on Udemy in 2016. We now have more than 200K customers.
  • Focus Areas: Data Engineering on Cloud using AWS, Azure, GCP, Databricks, Snowflake, etc

Durga’s Expertise and Journey

Here are the details about Durga’s Expertise in the areas of Databases, Data Warehousing, Big Data, Data Engineering, etc.

  • 20+ Years of rich IT Experience in building Large Scale applications.
  • Started Career as Trainee/Intern in 2002.
  • Started as Employee with Tavant Technologies as Java Developer in 2004.
  • Started solving complex data problems and moved to the US in 2006.
  • Transitioned into PL/SQL, Data Warehousing, and GoldenGate in 2007.
  • Implemented quite a few zero-downtime, large-scale migrations.
  • Transitioned into Big Data in 2012 as a Technical Manager/Consultant for Cognizant.
  • As the industry evolved, I transitioned myself and consulted on complex, large-scale projects using technologies such as Hadoop, Spark, Kafka, AWS Data Analytics, Snowflake, Databricks, Cloudera, Hortonworks, MapR, etc.
  • Trained thousands of Professionals across the globe leveraging state-of-the-art labs.

Recap of Conventional Data Warehousing

Let us recap details related to conventional or legacy Data Warehousing.

Responsibilities of a Conventional ETL Developer

Here are the typical responsibilities of a conventional ETL Developer in the Data Warehouse world.

  • Understand the Reporting Requirements and collaborate with BI Developers/Architects to come up with Data Model.
  • Ingest data from files, databases, etc. into the ODS.

Perform ETL — Extract from the ODS, apply the relevant transformation rules, and then Load into the Data Marts (as part of the Data Warehouse).
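
To make the Extract-Transform-Load flow concrete, here is a minimal sketch in Python using pandas, with SQLite standing in for the ODS and the Data Mart; the table and column names (ods_orders, dm_daily_revenue, amount) are hypothetical.

```python
# A minimal ETL sketch: SQLite stands in for the ODS and the Data Mart,
# and the table/column names are hypothetical.
import sqlite3

import pandas as pd

ods = sqlite3.connect("ods.db")    # stand-in for the ODS database
mart = sqlite3.connect("mart.db")  # stand-in for the Data Mart

# Extract: pull raw order rows from the ODS
orders = pd.read_sql("SELECT order_id, order_date, amount FROM ods_orders", ods)

# Transform: apply a simple business rule (aggregate revenue per day)
daily_revenue = (
    orders.assign(order_date=pd.to_datetime(orders["order_date"]).dt.date)
    .groupby("order_date", as_index=False)["amount"]
    .sum()
    .rename(columns={"amount": "total_revenue"})
)

# Load: write the aggregated fact into the Data Mart
daily_revenue.to_sql("dm_daily_revenue", mart, if_exists="replace", index=False)
```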

Day to Day Tasks of a Conventional ETL Developer

Here are the typical tasks of a conventional ETL Developer in the Data Warehouse world.

  • Develop ETL Applications based on the requirements using tools such as Informatica, Talend, Ab Initio, etc.
  • Develop frameworks to get the data from source systems into the ODS, depending on the ODS requirements.
  • Use tools to export and import data for development and unit testing.
  • Run Ad hoc queries against all tables across different layers to ensure data quality.
  • Use tools to load data from CSV files into tables (example: Loader); a minimal sketch follows this list.
  • Real-time Data Replication or CDC using tools such as Informatica CDC, Oracle GoldenGate, etc.
  • Deploy and Schedule Workflows using tools such as Control-M, Appworx, etc.
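
Here is a minimal sketch of the CSV-loading and ad hoc quality-check tasks mentioned above, using pandas with SQLite as the target database; the file name (customers.csv) and table name (stg_customers) are hypothetical.

```python
# A minimal sketch of loading a CSV file into a staging table and running a
# quick ad hoc check. File, database, and table names are hypothetical.
import sqlite3

import pandas as pd

conn = sqlite3.connect("staging.db")

# Load the CSV file into a staging table, creating it if it does not exist
customers = pd.read_csv("customers.csv")
customers.to_sql("stg_customers", conn, if_exists="append", index=False)

# Ad hoc query to verify the load (a simple data quality check)
print(pd.read_sql("SELECT COUNT(*) AS row_count FROM stg_customers", conn))
```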

Limitations of Conventional Data Warehouses

Here are the limitations of Conventional Data Warehouses.

  • Underused Infrastructure
  • High Maintenance Costs
  • Restrictions on Scaling
  • Designed to deal with structured data
  • Limitations related to Metadata Development

Modern Data Engineering on top of Cloud-based Data Lakes solves the challenges faced by Conventional Data Warehouses in a cost-effective manner.

What is Data Engineering?

  • Conventional Data Warehousing + Modern Analytics

It is becoming more Cloud-oriented, which reduces infrastructure and maintenance costs while offering virtually unlimited scalability.

  • Data Engineering on Cloud Platforms — AWS, GCP, Azure, Databricks, Snowflake, CDP, etc

Data Engineering on Cloud — Decoupled Architecture

Systems — Data Engineering

Here are the different Systems we typically use for Data Engineering.

  • A variety of source or upstream systems — Purpose-Built Databases, Files, REST APIs
  • Data Lake
  • Downstream systems such as Data Warehouses or MPP databases, NoSQL stores, and External Systems

Data Lake — Downstream Applications

Here are the broader-level applications that are typically built on top of a Data Lake.

  • Executive Reports and Dashboards (Data Warehouse or MPP)
  • Customer Analytics (NoSQL or RDBMS)
  • Recommendation Engines

Other Categories

  • Data Science or Machine Learning Applications
  • Integration with External Applications via REST APIs
  • Publishing Data to External Vendors or Customers

Data Engineering — Typical Data Pipelines

Here are the two categories of Data Pipelines that we typically build, with a minimal sketch of each after the list.

  • Batch Data Pipelines
  • Streaming Data Pipelines
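
Here is a minimal PySpark sketch contrasting the two categories, assuming hypothetical S3 paths, a Kafka bootstrap server, and a topic name.

```python
# A minimal sketch of a batch pipeline vs. a streaming pipeline in PySpark.
# Paths, the Kafka broker address, and the topic name are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

# Batch pipeline: read a finite dataset, transform it, and write it out once
batch_df = spark.read.parquet("s3a://my-data-lake/raw/orders/")
batch_df.filter("amount > 0") \
    .write.mode("overwrite").parquet("s3a://my-data-lake/curated/orders/")

# Streaming pipeline: continuously consume from Kafka and append to the lake
stream_df = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "orders")
    .load()
)
query = (
    stream_df.selectExpr("CAST(value AS STRING) AS payload")
    .writeStream.format("parquet")
    .option("path", "s3a://my-data-lake/raw/orders_stream/")
    .option("checkpointLocation", "s3a://my-data-lake/checkpoints/orders_stream/")
    .start()
)
```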

Data Engineering on Cloud — Different Platforms

Data Engineering — General Reference Architecture

Here is the General Reference Architecture of Data Lake and Data Engineering Pipelines.

  • Multi-layered Architecture
  • Data Ingestion from the source, Data Processing in the Data Lake, and Data Loading into the target should be decoupled (see the sketch after this list).
  • Low Code or No Code Approach with extensive usage of SQL and connectors.
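
To illustrate the decoupling, here is a minimal PySpark sketch in which each stage only reads from and writes to the Data Lake, so the stages can be scheduled and re-run independently; the layer paths and column names are hypothetical.

```python
# A minimal sketch of a decoupled, multi-layered pipeline. Each stage reads
# from and writes to the lake (hypothetical paths), never to another stage.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("layered-pipeline").getOrCreate()

RAW = "s3a://my-data-lake/raw/orders/"
CURATED = "s3a://my-data-lake/curated/orders/"
MART = "s3a://my-data-lake/mart/daily_revenue/"

def ingest():
    # Ingestion: land source extracts into the raw layer as-is
    spark.read.option("header", True).csv("s3a://landing-bucket/orders/") \
        .write.mode("overwrite").parquet(RAW)

def process():
    # Processing: clean and conform raw data into the curated layer
    spark.read.parquet(RAW).dropDuplicates(["order_id"]) \
        .write.mode("overwrite").parquet(CURATED)

def load():
    # Loading: aggregate curated data into a mart-level dataset for the target
    spark.read.parquet(CURATED).groupBy("order_date").sum("amount") \
        .write.mode("overwrite").parquet(MART)
```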

Data Engineering using AWS Data Analytics (Cloud Native)

Here is the reference architecture to build a Data Lake and Data Engineering Pipelines using AWS Data Analytics Services.

  • Storage or Data Lake — Amazon S3
  • Compute or Processing — AWS Lambda Functions or AWS EMR using Apache Spark. We can also use Glue in place of Lambda Functions or AWS EMR.
  • Batch Data Ingestion from Relational Databases — AWS DMS
  • Streaming Data Ingestion from Log Files or Databases — Amazon Managed Streaming for Apache Kafka (MSK) or Amazon Kinesis
  • Data Warehouse for conventional reports and dashboards — Amazon Redshift Serverless or Amazon Athena (serverless)
  • NoSQL Data Stores for Real-time Data Applications — Amazon DynamoDB
  • Data Science and Machine Learning — Amazon SageMaker
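
As a small illustration of the compute layer, here is a sketch of a Lambda-style handler that processes a file landing in the S3 Data Lake and writes a cleaned copy to a curated prefix; the bucket layout, prefixes, and field names are hypothetical, and in a real setup the function would be wired to an S3 event notification.

```python
# A minimal Lambda-style sketch: read a raw JSON-lines object from S3, keep
# only valid rows, and write the result to a curated prefix. Bucket, prefix,
# and field names are hypothetical.
import json

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        # Read the raw JSON-lines object from the data lake
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        rows = [json.loads(line) for line in body.splitlines() if line.strip()]

        # Apply a trivial transformation rule (drop invalid rows)
        valid = [row for row in rows if row.get("amount", 0) > 0]

        # Write the processed output to the curated layer
        out_key = key.replace("raw/", "curated/", 1)
        s3.put_object(
            Bucket=bucket,
            Key=out_key,
            Body="\n".join(json.dumps(row) for row in valid).encode("utf-8"),
        )
```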

Data Engineering using Databricks (Cloud Agnostic)

Here is the reference architecture to build a Data Lake and Data Engineering Pipelines using Databricks on AWS, Azure, or GCP.

  • Storage or Data Lake — DBFS on top of Amazon S3, ADLS, or GCS
  • Compute or Processing — Databricks Runtime using Apache Spark
  • Batch Data Ingestion from Relational Databases — Underlying Cloud Solution or Apache Spark over JDBC
  • Streaming Data Ingestion from Log Files or Databases — Amazon Managed Streaming for Apache Kafka (MSK) or Amazon Kinesis
  • Data Warehouse for conventional reports and dashboards — Databricks SQL
  • NoSQL Data Stores for Real-time Data Applications — third-party NoSQL data stores such as MongoDB, Amazon DynamoDB, or Azure Cosmos DB.
  • Data Science and Machine Learning — Databricks ML
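
Here is a minimal sketch of batch ingestion on Databricks: pull a table from a relational database over JDBC with Apache Spark and persist it as a Delta table so Databricks SQL can query it; the JDBC URL, credentials, and table names are hypothetical.

```python
# A minimal Databricks sketch: batch ingestion over JDBC into a Delta table.
# JDBC URL, credentials, paths, and table names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # provided by the Databricks runtime

orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://source-db:5432/sales")
    .option("dbtable", "public.orders")
    .option("user", "etl_user")
    .option("password", "********")
    .load()
)

# Persist the extract as a Delta table in the data lake
orders.write.format("delta").mode("overwrite") \
    .save("dbfs:/mnt/datalake/curated/orders")

# Register it so downstream reports can query it with SQL
spark.sql(
    "CREATE TABLE IF NOT EXISTS curated_orders USING DELTA "
    "LOCATION 'dbfs:/mnt/datalake/curated/orders'"
)
spark.sql(
    "SELECT order_date, SUM(amount) AS revenue FROM curated_orders GROUP BY order_date"
).show()
```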

Skills to be a Data Engineer

Here are the key skills for someone to be a Data Engineer.

  • Extensive Knowledge of SQL
  • Programming Abilities — preferably Python
  • Data Processing Frameworks — Pandas, Spark (a short comparison sketch follows this list)
  • Key Data Engineering/Data Lake/Data Warehouse Services on Cloud

AWS — S3, Glue Catalog, EMR, Redshift, Step Functions

Azure — ADLS/Blob, Azure Databricks, Synapse, ADF

GCP — GCS, Dataproc, BigQuery, Airflow

  • Ability to build orchestrated pipelines
  • Ad-hoc Analysis of Data using relevant tools
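
To show how the Pandas and Spark skills relate, here is a minimal sketch performing the same aggregation in both frameworks; the file path and column names are hypothetical.

```python
# The same aggregation in pandas (single machine) and PySpark (distributed).
# File path and column names are hypothetical.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# pandas: well suited for data that fits on a single machine
pdf = pd.read_csv("orders.csv")
pandas_result = pdf.groupby("order_date", as_index=False)["amount"].sum()

# PySpark: the same logic, distributed across a cluster
spark = SparkSession.builder.appName("skills-sketch").getOrCreate()
sdf = spark.read.option("header", True).option("inferSchema", True).csv("orders.csv")
spark_result = sdf.groupBy("order_date").agg(F.sum("amount").alias("amount"))
spark_result.show()
```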

Trends in Big Data and Data Engineering Skills

Here are the details related to trends in Big Data and Data Engineering Skills.

  • Apache Hadoop and its ecosystem of tools such as Apache Sqoop, Apache Flume, Apache Pig, etc. are almost dead. Also, Apache Hive, which is essentially a SQL engine, has taken different forms such as Spark SQL, Presto, Athena, etc.
  • Apache Spark has become one of the essential skills in Data Engineering, but learning Apache Spark alone is not enough to excel as a Data Engineer.

Trends in Enterprise Data Lake Platforms

The following Cloud-based platforms have created healthy competition for customers looking to adopt them, build Enterprise Data Lakes, and build applications around those lakes.

  • Cloud-Native Services on AWS (aka Data Analytics Services), Azure, and GCP
  • Cloud Agnostic Platforms such as Databricks, Snowflake, CDP, etc

Roadmap to be a Data Engineer

Here is the roadmap to be a Data Engineer.

  • Start with SQL. One needs to have extensive knowledge of SQL.
  • Make sure to gain enough proficiency in Python at a broader level (especially Data Processing frameworks such as Pandas and Apache Spark). If you are an experienced professional, focus on breadth rather than depth of Python.
  • Cloud Essentials based on the Cloud Platform of choice.
  • Understand key Data Warehousing Concepts.
  • Be familiar with BI terms such as reports, dashboards, etc.
  • Learn how to build end-to-end Data Pipelines and focus on integrating services and applications using REST APIs as well as SDKs or CDKs built on top of REST APIs (a minimal SDK sketch follows this list).
  • Build the ability to troubleshoot issues related to Job Failures and Data Quality.
  • Practice Performance Tuning at both the Design and the Implementation level.
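
As an example of using an SDK built on top of REST APIs, here is a minimal boto3 sketch that checks whether new files have landed in the raw layer of an S3 Data Lake and, if so, triggers a Glue job; the bucket, prefix, and job name are hypothetical.

```python
# A minimal SDK sketch using boto3 (built on AWS REST APIs): check for newly
# landed files in S3 and trigger a Glue job. Bucket, prefix, and job name are
# hypothetical.
import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")

# List objects in the raw layer of the data lake
response = s3.list_objects_v2(Bucket="my-data-lake", Prefix="raw/orders/")
new_files = [obj["Key"] for obj in response.get("Contents", [])]

if new_files:
    # Kick off the processing job only when there is something to process
    run = glue.start_job_run(JobName="process-orders")
    print(f"Started Glue job run {run['JobRunId']} for {len(new_files)} file(s)")
else:
    print("No new files to process")
```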

Conclusion of Part 1

Let us conclude Part 1 here; next, we will understand how to build Pipelines using AWS Data Analytics Services to get a taste of Data Engineering.

  • As part of this article, you have learned about the key trends in Data Engineering on the Cloud and also understood the roadmap to become a Data Engineer.

If you like the content, please subscribe to our Medium Publication. Here are the other important links you can follow.

Visit our website — https://www.itversity.com

Subscribe to our YouTube Channel — https://www.youtube.com/itversityin?sub_confirmation=1

Follow our LinkedIn Page — https://www.linkedin.com/company/itversity

Sign up for our Newsletter — https://forms.gle/mwVYMRzAdv89rxRf8

Udemy Profile — https://www.udemy.com/user/it-versity/

For Enquiries: support@itversity.com

  • In Part 2, we will come up with the Problem Statement, Design, and Solution of a hypothetical scenario where data from Salesforce is ingested and processed using a Data Lake.
