The Ultimate Guide to Becoming a Data Engineer / Big Data Engineer: Top N Strategies

MohamedKashifuddin
11 min read · Apr 26, 2023


“Climbing to the top demands strength, whether it is to the top of Mount Everest or to the top of your career” - A. P. J. Abdul Kalam


The Core of Data Engineering:

The most fundamental thing always stays at the top: the programming language you choose to begin with.
This might be surprising if you come from a background where you used an ETL tool such as Informatica PowerCenter, Talend, SSIS, etc.

This will serve as a guide for the below categories:

  • Beginners: yet to start a career in DE
  • Intermediate: already working in this domain
  • Expert: people who have worked in DE but come from a Python background
    Note: this guide doesn't focus on cloud, Docker, and Kubernetes.
    It's a practical guide focused on becoming a better DE.

The guide is divided into three parts, and a few more may be added in future:
1. The first section gives you an overview of legacy tools
2. The second section gives you the reasons to adopt the upcoming tech stack
3. The third section provides you with a learning path

Drill Down to the Core and Get an Overall Idea: Part 1

  • These tools are not bad options. Much like vendor lock-in on a public cloud, you really can't move away from them because of the features and support they provide. So does that mean they aren't powerful enough to support the modern era of huge data?
  • The honest opinion is that they are still powerful enough to handle tons of data. Remember that these are enterprise (and some open-source) products with dedicated teams of engineers continuously improving them; the base model stays the same, but it keeps up with the modern era
  • These tools come with no-code and fewer changes in methodology when you want to process and transform data
  • There are ready-to-use connectors; you will find a connector for pretty much every source system, so adding a new source into the system becomes a seamless effort
  • Processing data from downstream systems is mostly drag-and-drop or adding a SQL query, which is one of the quickest ways to bring processed data into the data warehouse in near real time
  • The majority of these are batch jobs that run at a certain time once a day, but they also support recurring jobs that run continuously, which serves near-real-time use cases
  • The systems run on the usual underlying layers: memory, storage, a scheduling tool, the network, the operating system, administration, etc. These are general things you would have with any ETL tool
  • These are very powerful systems, so the doubt arises: why are Spark and Flink getting popular, why are people moving away from these enterprise systems, and why would anyone risk putting their business in such a situation?
  • Whenever you want to adopt an enterprise tool as part of the infrastructure, there is also a cost evaluation, so the term "shared systems" is very common; sharing is a cost-effective way to fulfil business needs as quickly as possible
  • The generic problem is that the cost might look feasible at the initial time, but when other stakeholders move away for business reasons, the cost of using these tools becomes much higher in the long term

“You’ve got to make tough decisions, sometimes unpopular decisions… Whatever it is, if it’s the right move at the right time, you’ve got to be also willing to make mistakes” - Sean McDermott

The New Era of Revolutionary Modern Data: Part 2

  • The major change, and challenge, that comes into play is the licensing cost; the only way out is to adopt open-source software
  • This comes with a risk: you need a dedicated team of individuals who, whenever a patch is required, can apply it quickly to reduce the probability of being affected by a vulnerability
  • The underlying layer stays the same as in the previous architecture, but the foundation changes. That brings us back to the first item on our list: the programming language. Let us get into the granularity of the details
  • It is better to adopt an open-source tech stack and have an administration + DevOps team to manage the services

Whenever you work on a project, you will see one of the below environments:

1. A huge open-source stack with self-hosting, on-prem
2. Open-source, but through a vendor, e.g. Cloudera or Hortonworks
These are much better in terms of support, etc.
3. Open-source on the cloud, e.g. Azure HDInsight, EMR, Dataproc
These are the same open-source services, but available on the cloud and maintained by the cloud vendor

Why is this better than enterprise tools?

  • Easy to ship anywhere; as long as you have a good administration team, the product can run for the long term
  • Even if you opt for a vendor like Cloudera, you still need to pay, but in return the support and security pay you back
  • Plus, the advantage is that you get a bunch of open-source software together, e.g. you don't need a separate setup for Hive, HBase, etc. It's an all-in-one setup, and you are just one step away from extending the platform with any required software
  • When it comes to clouds like Azure, GCP, and AWS, if the security requirements allow the project to use them, then hosting the open-source software on the cloud is a much better option
  • It comes with a pay-as-you-use model; you still require a team to manage these on the cloud, but you can save a lot of cost, and development is much faster

“All birds find shelter during a rain.
But the eagle avoids rain by flying above the clouds.

Problems are common, but attitude makes the difference.”

An Interesting Area of Bread and Butter: Part 3

  1. The programming language is the most important area of learning
    I would encourage you to learn Scala
    If you want to be a better data engineer, then you should learn Java or Scala
    Practically, you will see a lot of Scala Spark projects in real projects
    If you limit yourself, you will have very little chance of getting a good understanding of the whole business model
Rock the JVM — Daniel Ciocîrlan

When it comes to learning these, I suggest starting with Rock the JVM by Daniel Ciocîrlan.
* He is one of those folks from whom you can learn Scala in depth.
* You really can't go wrong when you decide to invest time in learning from Daniel.

Let me help you figure out how to invest your time:
1. Start learning Scala from his YouTube channel, rockthejvm
2. Then take his beginner Scala course on Udemy
3. At this point you will already have gained a good amount of knowledge. Now you can go to his official site, Rock the JVM
There are a couple of options:
I) Purchase the Scala bundle with lifetime access
II) Take a monthly membership, just like Amazon Prime / Netflix
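To give you a first feel for the language, here is a minimal sketch, assuming nothing beyond the Scala standard library. The Order type and the numbers are made up for illustration; the point is the shape, case classes plus immutable collection pipelines, which is exactly the shape your Spark code will take later.

```scala
// A first taste of Scala for data work: case classes + immutable collection pipelines.
final case class Order(id: Int, customer: String, amount: Double)

object ScalaTaste extends App {
  val orders = List(
    Order(1, "asha", 120.0),
    Order(2, "ravi", 40.0),
    Order(3, "asha", 60.0)
  )

  // Total spend per customer, computed without mutating anything.
  val spendByCustomer: Map[String, Double] =
    orders
      .groupBy(_.customer)
      .map { case (customer, os) => customer -> os.map(_.amount).sum }

  spendByCustomer.foreach { case (customer, total) =>
    println(s"$customer -> $total")
  }
}
```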

2. Spark is one of the major tech stacks; it appears in pretty much every project you will come across. One of the differences you will see is the environment: on-prem with Cloudera, or cloud on Azure, GCP, and AWS. If the project components are in Azure, then you will use Databricks + HDInsight or Databricks + Snowflake. A minimal Spark sketch follows the links below.

When it comes to learning Spark, I again suggest starting with Rock the JVM by Daniel Ciocîrlan.

Links :

  1. https://www.youtube.com/c/RocktheJVM
  2. https://www.udemy.com/user/daniel-ciocirlan/
  3. https://rockthejvm.com/
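As a flavour of what a typical Spark batch job looks like, here is a minimal sketch in Scala. The input path and column names are hypothetical, and local[*] is only for practising on a laptop; a real cluster supplies the master differently.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

// Minimal batch job: read a CSV, aggregate, write Parquet.
object SalesBatchJob extends App {
  val spark = SparkSession.builder()
    .appName("sales-batch-job")
    .master("local[*]") // local mode for practice only
    .getOrCreate()

  // Hypothetical input: a CSV with a header and columns (region, amount).
  val sales = spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv("data/sales.csv")

  // Classic transform step: total amount per region.
  val totals = sales
    .groupBy("region")
    .agg(sum("amount").as("total_amount"))

  // Load step: write the result as Parquet for downstream consumers.
  totals.write.mode("overwrite").parquet("data/output/sales_totals")

  spark.stop()
}
```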

3. Learn Hadoop

Apache Hadoop
  • People have a wrong assumption that Hadoop is deprecated and no longer used in the industry. Just to correct you: people still prefer Hadoop to store data, and in most cases you will be working on a shared cluster spread across a few (or many) teams, which minimizes the cost of maintenance
  • Coming to learning Hadoop, I suggest learning from YouTube, Medium blogs, and public blog sites
  • The key to learning Hadoop is to combine it with Spark + Scala and practice capstone projects on top of Hadoop
  • The minimum system requirement: on Windows you can use WSL, winutils, or VMware. In my opinion, WSL is the simplest and easiest, and you can install and practice even on an older system. A sketch of Spark on HDFS follows below
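To make the Hadoop + Spark combination concrete, here is a small sketch: the same kind of Spark job, but reading from and writing to HDFS. The hdfs:// paths and the namenode host/port are assumptions for illustration; substitute your own cluster's values.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: a Spark job with HDFS as both source and sink.
object HdfsRoundTrip extends App {
  val spark = SparkSession.builder()
    .appName("hdfs-round-trip")
    .getOrCreate()

  // Hypothetical HDFS locations; replace host/port with your cluster's namenode.
  val input  = "hdfs://namenode:9000/user/practice/raw/events"
  val output = "hdfs://namenode:9000/user/practice/curated/events"

  val events = spark.read.parquet(input)

  // Drop malformed rows before publishing to the curated zone.
  events
    .filter("event_type IS NOT NULL")
    .write
    .mode("overwrite")
    .parquet(output)

  spark.stop()
}
```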

4. SQL

Mohammad Tanweer Khan
  • This is one of the essential skills, even though this blog keeps telling you: “Never limit yourself to SQL and never become just a SQL person”
  • But you will need to learn SQL and become genuinely good at this skill
  • There are two different angles to learn it from. For interviews, know more about window functions, subqueries, joins, and correlated subqueries, and also be good at CTEs, recursive CTEs, aggregates, and merges. At work, you need to know more about the ETL/ELT way of using SQL. A window-function example follows the links below
  • When it comes to learning SQL, I suggest starting with Tanweer Khan. Get started with his YouTube channel, Tech Nest Academy, and if his way of teaching fits you, enquire about and enrol in the SQL course

Links :

  1. https://www.youtube.com/@technestacademy490/videos
  2. https://www.linkedin.com/in/mohammad-tanweer-khan-6ba27953/
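Since window functions and CTEs dominate interviews, here is a hedged sketch running one through Spark SQL, which you already have set up if you followed the Spark step. The orders view and its columns are invented for the example.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: a CTE plus a window function, run through Spark SQL.
object TopOrderPerCustomer extends App {
  val spark = SparkSession.builder()
    .appName("sql-window-demo")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // Invented data registered as a temp view, standing in for a warehouse table.
  Seq(("asha", 120.0), ("asha", 60.0), ("ravi", 40.0))
    .toDF("customer", "amount")
    .createOrReplaceTempView("orders")

  // CTE + ROW_NUMBER(): pick each customer's single largest order.
  spark.sql(
    """
      |WITH ranked AS (
      |  SELECT customer, amount,
      |         ROW_NUMBER() OVER (PARTITION BY customer ORDER BY amount DESC) AS rn
      |  FROM orders
      |)
      |SELECT customer, amount FROM ranked WHERE rn = 1
      |""".stripMargin
  ).show()

  spark.stop()
}
```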

5. Learning a Data Warehouse

  • This is a major component of the whole of DE, and learning it is very important
  • There are N flavours of data warehouse to learn about and get hands-on with
  • These are a few I suggest; start with any one of them: Hive, BigQuery, or Snowflake. Never try to learn all three when you're starting from the bottom
  • Your initial goal shouldn't be to master one but to get started
    1. J Garg — Hive — Udemy
    2. Sandeep Kumar — Bigquery — Udemy
    3. Nikolai Schuler — Snowflake — Udemy
  • The main goal is to learn the techniques and the ways you use a DW to develop your project. As you know SQL very well at this point in time, it should take very little time to adapt to the syntax. Additionally, learn a few extra topics and work on partitioning/repartitioning and similar optimization areas. A partitioned-write sketch follows the links below
  • Links :
1. Hive to ADVANCE Hive (Real time usage): Hadoop querying tool | Udemy
    2. https://www.udemy.com/course/google-biqquery/
    3. https://www.udemy.com/course/snowflake-masterclass/
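To show the partitioning idea from the last bullet, here is a small sketch, assuming a Hive-style layout: writing a table partitioned by date so that queries filtering on the partition column scan only the folders they need. Table and column names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: writing a table partitioned by date, the common Hive-style layout.
object PartitionedLoad extends App {
  val spark = SparkSession.builder()
    .appName("partitioned-load")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  val sales = Seq(
    ("2023-04-01", "asha", 120.0),
    ("2023-04-01", "ravi", 40.0),
    ("2023-04-02", "asha", 60.0)
  ).toDF("sale_date", "customer", "amount")

  // Each sale_date becomes its own directory (warehouse/sales/sale_date=2023-04-01/...),
  // so queries filtering on the partition column read only the folders they need.
  sales.write
    .mode("overwrite")
    .partitionBy("sale_date")
    .parquet("warehouse/sales")

  spark.stop()
}
```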

6. Data modelling

  • This aligns with all the major components and plays a vital role in every one of the areas above
  • The primary goal is to bring/receive structured and unstructured data into the system, but that data has no value unless it's properly segregated into reusable data
  • Generally, there are scenarios where a Spark job picks up the data in batch, with further Spark jobs running in sequence
  • It will either load the data directly into a fact/dimension table, or write the processed data into a different (or single) folder in an object store, whether in Hadoop, GCS, or S3; then, using an orchestration tool, the data is loaded into the respective table for further reporting. A star-schema sketch follows the links below
  • These terms might look new to you, but trust me, once you learn and implement them it's pretty easy to get on the path; the more you practice, the more comfortable you will get with the terms
  • Learning can be started using the below resources
1. Analytics Engineering Bootcamp — Rahul Prasad / David Badovinac
    2. Data Warehouse — The Ultimate Guide —Nikolai Schuler
    3. Data Warehouse Fundamentals for Beginners — Alan Simon
  • These are some great resources where you can do self-learning and start with hands-on stuff

Links :

  1. https://www.udemy.com/course/analytics-engineering-bootcamp/
  2. https://www.udemy.com/course/data-warehouse-the-ultimate-guide/
  3. https://www.udemy.com/course/data-warehouse-fundamentals-for-beginners/
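To make fact/dimension concrete, here is a tiny star-schema load sketched in Spark; the feed, keys, and output paths are all hypothetical. The dimension holds the descriptive attributes once, while the fact keeps only measures and the key back to the dimension.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: deriving a dimension table and a fact table from one raw feed.
object StarSchemaLoad extends App {
  val spark = SparkSession.builder()
    .appName("star-schema-load")
    .master("local[*]")
    .getOrCreate()
  import spark.implicits._

  // Raw feed: one row per sale, with customer attributes repeated on every row.
  val raw = Seq(
    (1, "asha", "IN", 120.0),
    (2, "ravi", "IN", 40.0),
    (3, "asha", "IN", 60.0)
  ).toDF("sale_id", "customer", "country", "amount")

  // Dimension: one row per customer, deduplicated from the feed.
  val dimCustomer = raw.select("customer", "country").distinct()

  // Fact: measures plus the key back to the dimension; attributes live there, not here.
  val factSales = raw.select("sale_id", "customer", "amount")

  dimCustomer.write.mode("overwrite").parquet("warehouse/dim_customer")
  factSales.write.mode("overwrite").parquet("warehouse/fact_sales")

  spark.stop()
}
```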

7. Orchestration tools

  • In every project, you will definitely see an orchestration tool, used to automate all the tasks you need to perform during the batch
  • Basically, you need to create certain pipelines and add dependencies in sequence or in parallel. In production, you will definitely run all jobs through an automation tool. A dependency sketch follows the links below
  • In real projects, there are two types of tool used most of the time:
    1. Write a few simple lines of code and automate
    2. A no-code/low-code approach: most of the time you drag and drop, but you will need to add a few lines of code whenever you have a custom scenario
  • The learning path can be started using the below options
    1. Azure Data Factory — Ramesh Retnasamy — Udemy
2. Apache Airflow — Marc Lamberti — Udemy
    3. Apache Oozie — Loony Corn — Udemy
    4. Azure Data Factory (ADF) — Deepak Goyal — Udemy
  • In my opinion, I suggest gaining more knowledge of ADF and Airflow, since these cover both use cases
    Note: Don't be surprised if you see an internal or enterprise tool on a project that is very similar to what you learned from the above-suggested options

Links :

  1. https://www.udemy.com/course/learn-azure-data-factory-from-scratch/
  2. https://www.udemy.com/course/the-ultimate-hands-on-course-to-master-apache-airflow/
  3. https://www.udemy.com/course/from-0-to-1-the-oozie-orchestration-framework/
  4. https://www.udemy.com/course/two-real-world-azure-data-engineer-projects-end-to-end/
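Orchestration tools express exactly this idea: steps with dependencies, some in sequence and some in parallel. As a conceptual sketch only (this is not how Airflow or ADF pipelines are actually written), here is the same dependency shape in plain Scala Futures:

```scala
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.concurrent.{Await, Future}

// Conceptual sketch of a tiny DAG: extract runs first, two transforms run in
// parallel, and load waits for both before it starts.
object MiniPipeline extends App {
  def step(name: String): Future[String] =
    Future { println(s"running $name"); name }

  val run = for {
    _      <- step("extract")
    // Both futures are created here, so the two transforms run concurrently.
    _      <- Future.sequence(Seq(step("transform_a"), step("transform_b")))
    loaded <- step("load")
  } yield loaded

  Await.result(run, 30.seconds)
}
```

A real orchestrator adds what this sketch lacks: scheduling, retries, alerting, and a UI over the same dependency graph.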

End of the blog: farewell and a few takeaways below


“When we tackle obstacles, we find hidden reserves of courage and resilience we did not know we had. And it is only when we are faced with failure do we realize that these resources were always there within us. We only need to find them and move on with our lives.”

  • The blog comes to an end. I would like to convey the same message: there are really knowledgeable people out there teaching exactly what is required to improve your knowledge
  • I'm a true believer that self-learning is the best way of learning, and you will always need to learn new things as the field evolves
  • I will definitely suggest being a better software engineer at work, even though your work is around DE; always think outside the box and never limit yourself to a tool or a library
  • If you would like to chat more with me, do connect on LinkedIn:
  • Mohamed Kashifuddin

Links :
https://www.linkedin.com/in/mohamedkashifuddin

Thanks for reading till the END ….
Hit the Like button and share it with someone in need of this guide

See you in the next blog ….


MohamedKashifuddin

Data Engineer. Passionate about building big data systems; Certified Google Cloud Professional Data Engineer and Azure Data Engineer.