Parquet vs Delta table
A look at Parquet and Delta tables in a fun and easy-to-understand way. Please note that these examples are meant to be illustrative and might not… (Aug 29, 2023)
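As a minimal sketch of the difference the article explores, the snippet below writes the same DataFrame as plain Parquet and as Delta, assuming the delta-spark package is installed; the paths and app name are hypothetical:

```python
from pyspark.sql import SparkSession

# Assumes delta-spark is available (pip install delta-spark).
spark = (
    SparkSession.builder
    .appName("parquet-vs-delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.range(5).withColumnRenamed("id", "value")

# Plain Parquet: just columnar files, no transaction log.
df.write.mode("overwrite").format("parquet").save("/tmp/demo_parquet")

# Delta: Parquet files plus a _delta_log, giving ACID transactions,
# time travel, and safe concurrent writes.
df.write.mode("overwrite").format("delta").save("/tmp/demo_delta")

# Time travel is Delta-only: read the table as of an earlier version.
old = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/demo_delta")
```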
APACHE SPARK: The Incredible Incremental Load Journey
Once upon a time in the bustling city of Dataville, there lived a talented data engineer named Alex. Alex worked for a prominent tech… (May 19, 2023)
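As a hedged sketch of the incremental-load pattern the story builds toward, the snippet below uses a high-water-mark timestamp to append only new rows; the paths and the updated_at column are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("incremental-load-demo").getOrCreate()

# Hypothetical source table with an updated_at timestamp, and a target
# directory holding previously loaded data.
source = spark.read.parquet("/data/source/orders")
target_path = "/data/target/orders"

# High-water-mark pattern: find the latest timestamp already loaded,
# then pull only newer source rows.
try:
    last_loaded = (
        spark.read.parquet(target_path)
        .agg(F.max("updated_at"))
        .collect()[0][0]
    )
except Exception:  # first run: nothing loaded yet
    last_loaded = None

delta = (
    source if last_loaded is None
    else source.filter(F.col("updated_at") > F.lit(last_loaded))
)
delta.write.mode("append").parquet(target_path)
```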
“Overcoming Struggles to Become a Knowledgeable and Successful Big Data Engineer: A Motivational…
As a data engineer, you may have faced struggles and challenges in your journey towards becoming knowledgeable and successful in the field… (May 14, 2023)
From Big Data to Small Solutions: Navigating PySpark’s .agg() Method and Overcoming Join Challenges
Interviewer: Can you explain the difference between using .agg() in PySpark and not using it, and give a real-world scenario for each? (May 6, 2023)
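A minimal sketch of the contrast the interviewer is asking about, using hypothetical region/amount columns: shortcut methods compute one aggregate per call, while .agg() names several in a single pass:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("agg-demo").getOrCreate()

sales = spark.createDataFrame(
    [("east", 100.0), ("east", 250.0), ("west", 75.0)],
    ["region", "amount"],
)

# Without .agg(): shortcut methods compute one aggregate at a time.
sales.groupBy("region").sum("amount").show()

# With .agg(): several named aggregates in a single pass.
sales.groupBy("region").agg(
    F.sum("amount").alias("total_sales"),
    F.avg("amount").alias("avg_sale"),
    F.count("*").alias("num_orders"),
).show()
```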
Salting technique in Spark: A Thrilling Interview Experience
John was a data engineer who had been studying PySpark for weeks in preparation for an upcoming interview. He had learned about many… (May 4, 2023)
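For reference, a minimal sketch of the salting technique the title refers to, under assumed names (NUM_SALTS, key, value): the skewed side gets a random salt, and the small side is replicated once per salt so hot keys spread across partitions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("salting-demo").getOrCreate()

NUM_SALTS = 8  # assumption: tune to the observed skew

# Skewed side: a large table where a few join keys dominate.
skewed = spark.range(1_000_000).select(
    (F.col("id") % 3).cast("string").alias("key"),  # heavy skew on 3 keys
    F.col("id").alias("value"),
)
small = spark.createDataFrame([("0", "a"), ("1", "b"), ("2", "c")], ["key", "info"])

# Add a random salt to the skewed side so each hot key spreads over NUM_SALTS partitions.
salted = skewed.withColumn("salt", (F.rand() * NUM_SALTS).cast("long"))

# Replicate the small side once per salt value so every salted key finds a match.
salts = spark.range(NUM_SALTS).withColumnRenamed("id", "salt")
small_exploded = small.crossJoin(salts)

joined = salted.join(small_exploded, ["key", "salt"]).drop("salt")
```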
Spark: Adaptive Query Execution
Adaptive Query Execution (AQE) is a feature introduced in Spark 3.0 to improve the performance of Spark SQL queries. (May 3, 2023)
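A short sketch of enabling AQE and two of its runtime optimizations via standard Spark SQL configuration keys (the app name is illustrative):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("aqe-demo")
    # AQE re-optimizes the physical plan at runtime using shuffle statistics.
    .config("spark.sql.adaptive.enabled", "true")
    # Coalesce small post-shuffle partitions into fewer, larger ones.
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # Split skewed shuffle partitions during sort-merge joins.
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .getOrCreate()
)
```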
Evicting old data when the cache is full in Spark
In Apache Spark, the cache is used to store frequently accessed data in memory to improve query performance. When the cache becomes full… (May 2, 2023)
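A minimal sketch of how eviction interacts with storage levels, with illustrative sizes: MEMORY_AND_DISK spills evicted blocks to disk, MEMORY_ONLY drops them LRU-style and recomputes from lineage, and unpersist() frees memory explicitly:

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

df = spark.range(10_000_000)

# Default cache() for DataFrames keeps blocks in memory and spills to disk
# when memory fills, instead of dropping them outright.
df.cache()
df.count()  # materialize the cache

# An RDD cached MEMORY_ONLY is evicted LRU-style under memory pressure
# and recomputed from lineage when accessed again.
rdd = spark.sparkContext.range(10_000_000).persist(StorageLevel.MEMORY_ONLY)
rdd.count()

# Free memory proactively rather than waiting for LRU eviction.
df.unpersist()
rdd.unpersist()
```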
Avoid unnecessary shuffling when joining :)
One optimization technique for cogroup() is to ensure that the RDDs being cogrouped have the same partitioner to avoid unnecessary… (May 2, 2023)
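A minimal sketch of that cogroup() optimization, with toy RDDs: partitioning both sides with the same partitioner and partition count lets cogroup() run without an extra shuffle:

```python
from pyspark import SparkContext

sc = SparkContext(appName="cogroup-demo")

left = sc.parallelize([(1, "a"), (2, "b"), (3, "c")])
right = sc.parallelize([(1, "x"), (2, "y")])

# Pre-partition both RDDs with the same partitioner (same hash function
# and partition count) so cogroup() can avoid a shuffle.
left_part = left.partitionBy(4).cache()
right_part = right.partitionBy(4).cache()

# Because the partitioners match, co-located keys are cogrouped in place.
grouped = left_part.cogroup(right_part)
print(grouped.mapValues(lambda v: (list(v[0]), list(v[1]))).collect())
```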
When performing a join operation on RDDs, there are a few options available to optimize the process
Partition the datasets properly… (May 2, 2023)
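A hedged sketch of two of those options, with toy orders/users RDDs: pre-partitioning both sides identically, and broadcasting the small side to skip the shuffle entirely:

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-join-demo")

orders = sc.parallelize([(1, 99.0), (2, 15.5), (1, 42.0)])
users = sc.parallelize([(1, "alice"), (2, "bob")])

# Option 1: partition both RDDs identically so the join avoids a full shuffle.
orders_p = orders.partitionBy(8).cache()
users_p = users.partitionBy(8).cache()
joined = orders_p.join(users_p)

# Option 2: broadcast the small side and map over the large one (no shuffle at all).
users_map = sc.broadcast(dict(users.collect()))
joined_bcast = orders.map(lambda kv: (kv[0], (kv[1], users_map.value.get(kv[0]))))

print(joined.collect())
print(joined_bcast.collect())
```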
PySpark | map(func): everything you need to know
In PySpark, map(func) is a transformation operation that applies the given function to each element of the RDD and returns a new RDD with… (May 2, 2023)
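A minimal runnable sketch of map(func) on toy data, showing that it is lazy until an action runs:

```python
from pyspark import SparkContext

sc = SparkContext(appName="map-demo")

nums = sc.parallelize([1, 2, 3, 4])

# map(func) applies func to every element, producing a new RDD lazily;
# nothing executes until an action such as collect() is called.
squares = nums.map(lambda x: x * x)
print(squares.collect())  # [1, 4, 9, 16]

# map() over key-value pairs: transform each tuple as a whole.
pairs = sc.parallelize([("a", 1), ("b", 2)])
print(pairs.map(lambda kv: (kv[0].upper(), kv[1] * 10)).collect())  # [('A', 10), ('B', 20)]
```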