Parquet vs Delta table
A look at Parquet and Delta tables in a fun and easy-to-understand way. Please note that these examples are meant to be illustrative and might not… (Aug 29, 2023)
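As a minimal sketch of the difference the article explores, the snippet below writes the same DataFrame as plain Parquet and as Delta, assuming the delta-spark package is installed; the paths and app name are hypothetical:

```python
from pyspark.sql import SparkSession

# Assumes delta-spark is available (pip install delta-spark).
spark = (
    SparkSession.builder
    .appName("parquet-vs-delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.range(5).withColumnRenamed("id", "value")

# Plain Parquet: just columnar files, no transaction log.
df.write.mode("overwrite").format("parquet").save("/tmp/demo_parquet")

# Delta: Parquet files plus a _delta_log, giving ACID transactions,
# time travel, and safe concurrent writes.
df.write.mode("overwrite").format("delta").save("/tmp/demo_delta")

# Time travel is Delta-only: read the table as of an earlier version.
old = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/demo_delta")
```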
APACHE SPARK: The Incredible Incremental Load Journey
Once upon a time in the bustling city of Dataville, there lived a talented data engineer named Alex. Alex worked for a prominent tech… (May 19, 2023)
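As a hedged sketch of the incremental-load pattern the story builds toward, the snippet below uses a high-water-mark timestamp to append only new rows; the paths and the updated_at column are hypothetical:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("incremental-load-demo").getOrCreate()

# Hypothetical source table with an updated_at timestamp, and a target
# directory holding previously loaded data.
source = spark.read.parquet("/data/source/orders")
target_path = "/data/target/orders"

# High-water-mark pattern: find the latest timestamp already loaded,
# then pull only newer source rows.
try:
    last_loaded = (
        spark.read.parquet(target_path)
        .agg(F.max("updated_at"))
        .collect()[0][0]
    )
except Exception:  # first run: nothing loaded yet
    last_loaded = None

delta = (
    source if last_loaded is None
    else source.filter(F.col("updated_at") > F.lit(last_loaded))
)
delta.write.mode("append").parquet(target_path)
```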
“Overcoming Struggles to Become a Knowledgeable and Successful Big Data Engineer: A Motivational…
As a data engineer, you may have faced struggles and challenges in your journey towards becoming knowledgeable and successful in the field… (May 14, 2023)
From Big Data to Small Solutions: Navigating PySpark’s .agg() Method and Overcoming Join Challenges
Interviewer: Can you explain the difference between using .agg() in PySpark and not using it, and give a real-world scenario for each? (May 6, 2023)
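A minimal sketch of the contrast the interviewer is asking about, using hypothetical region/amount columns: shortcut methods compute one aggregate per call, while .agg() names several in a single pass:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("agg-demo").getOrCreate()

sales = spark.createDataFrame(
    [("east", 100.0), ("east", 250.0), ("west", 75.0)],
    ["region", "amount"],
)

# Without .agg(): shortcut methods compute one aggregate at a time.
sales.groupBy("region").sum("amount").show()

# With .agg(): several named aggregates in a single pass.
sales.groupBy("region").agg(
    F.sum("amount").alias("total_sales"),
    F.avg("amount").alias("avg_sale"),
    F.count("*").alias("num_orders"),
).show()
```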
Salting technique in Spark: A Thrilling Interview Experience
John was a data engineer who had been studying PySpark for weeks in preparation for an upcoming interview. He had learned about many… (May 4, 2023)
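For reference, a minimal sketch of the salting technique the title refers to, under assumed names (NUM_SALTS, key, value): the skewed side gets a random salt, and the small side is replicated once per salt so hot keys spread across partitions:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("salting-demo").getOrCreate()

NUM_SALTS = 8  # assumption: tune to the observed skew

# Skewed side: a large table where a few join keys dominate.
skewed = spark.range(1_000_000).select(
    (F.col("id") % 3).cast("string").alias("key"),  # heavy skew on 3 keys
    F.col("id").alias("value"),
)
small = spark.createDataFrame([("0", "a"), ("1", "b"), ("2", "c")], ["key", "info"])

# Add a random salt to the skewed side so each hot key spreads over NUM_SALTS partitions.
salted = skewed.withColumn("salt", (F.rand() * NUM_SALTS).cast("long"))

# Replicate the small side once per salt value so every salted key finds a match.
salts = spark.range(NUM_SALTS).withColumnRenamed("id", "salt")
small_exploded = small.crossJoin(salts)

joined = salted.join(small_exploded, ["key", "salt"]).drop("salt")
```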
Spark: Adaptive Query Execution
Adaptive Query Execution (AQE) is a feature introduced in Spark 3.0 to improve the performance of Spark SQL queries. (May 3, 2023)
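A short sketch of enabling AQE and two of its runtime optimizations via standard Spark SQL configuration keys (the app name is illustrative):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("aqe-demo")
    # AQE re-optimizes the physical plan at runtime using shuffle statistics.
    .config("spark.sql.adaptive.enabled", "true")
    # Coalesce small post-shuffle partitions into fewer, larger ones.
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
    # Split skewed shuffle partitions during sort-merge joins.
    .config("spark.sql.adaptive.skewJoin.enabled", "true")
    .getOrCreate()
)
```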
Evicting old data when the cache is full in Spark
In Apache Spark, the cache is used to store frequently accessed data in memory to improve query performance. When the cache becomes full… (May 2, 2023)
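A minimal sketch of how eviction interacts with storage levels, with illustrative sizes: MEMORY_AND_DISK spills evicted blocks to disk, MEMORY_ONLY drops them LRU-style and recomputes from lineage, and unpersist() frees memory explicitly:

```python
from pyspark.sql import SparkSession
from pyspark import StorageLevel

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

df = spark.range(10_000_000)

# Default cache() for DataFrames keeps blocks in memory and spills to disk
# when memory fills, instead of dropping them outright.
df.cache()
df.count()  # materialize the cache

# An RDD cached MEMORY_ONLY is evicted LRU-style under memory pressure
# and recomputed from lineage when accessed again.
rdd = spark.sparkContext.range(10_000_000).persist(StorageLevel.MEMORY_ONLY)
rdd.count()

# Free memory proactively rather than waiting for LRU eviction.
df.unpersist()
rdd.unpersist()
```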
Avoid unnecessary shuffling when joining :)
One optimization technique for cogroup() is to ensure that the RDDs being cogrouped have the same partitioner to avoid unnecessary… (May 2, 2023)
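A minimal sketch of that cogroup() optimization, with toy RDDs: partitioning both sides with the same partitioner and partition count lets cogroup() run without an extra shuffle:

```python
from pyspark import SparkContext

sc = SparkContext(appName="cogroup-demo")

left = sc.parallelize([(1, "a"), (2, "b"), (3, "c")])
right = sc.parallelize([(1, "x"), (2, "y")])

# Pre-partition both RDDs with the same partitioner (same hash function
# and partition count) so cogroup() can avoid a shuffle.
left_part = left.partitionBy(4).cache()
right_part = right.partitionBy(4).cache()

# Because the partitioners match, co-located keys are cogrouped in place.
grouped = left_part.cogroup(right_part)
print(grouped.mapValues(lambda v: (list(v[0]), list(v[1]))).collect())
```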
When performing a join operation on RDDs, there are a few options available to optimize the process
Partition the datasets properly… (May 2, 2023)
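A hedged sketch of two of those options, with toy orders/users RDDs: pre-partitioning both sides identically, and broadcasting the small side to skip the shuffle entirely:

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-join-demo")

orders = sc.parallelize([(1, 99.0), (2, 15.5), (1, 42.0)])
users = sc.parallelize([(1, "alice"), (2, "bob")])

# Option 1: partition both RDDs identically so the join avoids a full shuffle.
orders_p = orders.partitionBy(8).cache()
users_p = users.partitionBy(8).cache()
joined = orders_p.join(users_p)

# Option 2: broadcast the small side and map over the large one (no shuffle at all).
users_map = sc.broadcast(dict(users.collect()))
joined_bcast = orders.map(lambda kv: (kv[0], (kv[1], users_map.value.get(kv[0]))))

print(joined.collect())
print(joined_bcast.collect())
```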
PySpark | map(func): everything you need to know
In PySpark, map(func) is a transformation operation that applies the given function to each element of the RDD and returns a new RDD with… (May 2, 2023)
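A minimal runnable sketch of map(func) on toy data, showing that it is lazy until an action runs:

```python
from pyspark import SparkContext

sc = SparkContext(appName="map-demo")

nums = sc.parallelize([1, 2, 3, 4])

# map(func) applies func to every element, producing a new RDD lazily;
# nothing executes until an action such as collect() is called.
squares = nums.map(lambda x: x * x)
print(squares.collect())  # [1, 4, 9, 16]

# map() over key-value pairs: transform each tuple as a whole.
pairs = sc.parallelize([("a", 1), ("b", 2)])
print(pairs.map(lambda kv: (kv[0].upper(), kv[1] * 10)).collect())  # [('A', 10), ('B', 20)]
```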