Polymath Data Lab
‘For Loop’ in Python is Easy
A for loop is a control flow statement that lets you execute a block of code multiple times, iterating over a sequence.
Christopher Chung
Oct 14
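As the teaser says, a for loop runs a block once per item in a sequence. A minimal sketch:

```python
# A for loop repeats a block of code once per item in a sequence.
fruits = ["apple", "banana", "cherry"]

lengths = []
for fruit in fruits:
    lengths.append(len(fruit))

print(lengths)  # -> [5, 6, 6]

# range() produces another common sequence to iterate over.
total = 0
for i in range(1, 5):
    total += i
print(total)  # -> 10
```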
Mastering API Calls in Python: A Step-by-Step Guide for Beginners
APIs, or Application Programming Interfaces, serve as the bridges that connect software, allowing them to communicate and share data.
Christopher Chung
Oct 13
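A rough sketch of the request/parse pattern using only the standard library; the endpoint URL and the response payload here are hypothetical, and the network call is shown only in a comment so the example runs offline:

```python
import json
from urllib.request import urlopen  # stdlib alternative to the popular 'requests' library

def parse_user(payload: str) -> dict:
    """Decode a JSON API response body into a Python dict."""
    return json.loads(payload)

# In a real call you would fetch the body from an endpoint, e.g.:
#   with urlopen("https://api.example.com/users/1") as resp:  # hypothetical URL
#       body = resp.read().decode("utf-8")
# Here we use a canned response so the sketch runs without a network.
body = '{"id": 1, "name": "Ada"}'
user = parse_user(body)
print(user["name"])  # -> Ada
```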
Repartition vs Coalesce in PySpark: Key Differences and Performance Implications
Two commonly used methods for adjusting the number of partitions are repartition() and coalesce(). Let’s dive into when to use each function.
Christopher Chung
Aug 13
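The PySpark calls are `df.repartition(n)` (full shuffle) and `df.coalesce(n)` (merge existing partitions). As a rough plain-Python model of the difference — an illustration of the idea, not Spark's actual implementation:

```python
def coalesce(partitions, n):
    """Model of coalesce(): merge whole existing partitions down to n,
    avoiding a full shuffle -- no individual row moves on its own."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)
    return merged

def repartition(partitions, n):
    """Model of repartition(): a full shuffle that redistributes
    every individual row across n new partitions."""
    rows = [row for part in partitions for row in part]
    out = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out

parts = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(coalesce(parts, 2))     # -> [[1, 2, 5, 6], [3, 4, 7, 8]]  (whole partitions merged)
print(repartition(parts, 2))  # -> [[1, 3, 5, 7], [2, 4, 6, 8]]  (rows redistributed)
```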
Ordering Rows by Columns in PySpark — PySpark Tutorial
Sorting or ordering records by columns can help in pre-processing data, ensuring that your data is organised and ready for analysis.
Christopher Chung
Aug 11
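In PySpark this is `df.orderBy(...)`; the same multi-column ordering in plain Python, as an illustrative stand-in:

```python
# Order records by country ascending, then salary descending --
# the same pattern as df.orderBy(col("country"), col("salary").desc()) in PySpark.
records = [
    {"name": "Ann", "country": "UK", "salary": 50},
    {"name": "Bob", "country": "UK", "salary": 70},
    {"name": "Cy",  "country": "FR", "salary": 60},
]
ordered = sorted(records, key=lambda r: (r["country"], -r["salary"]))
print([r["name"] for r in ordered])  # -> ['Cy', 'Bob', 'Ann']
```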
Selecting Top N records by Group — PySpark Tutorial: GroupBy
We often encounter scenarios where we need to select the top N records within each group of a dataset in PySpark.
Christopher Chung
Jul 26
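In PySpark this is usually done with a window function (`row_number()` over a partition ordered by the ranking column); the equivalent logic in plain Python, for illustration only:

```python
from collections import defaultdict

def top_n_per_group(rows, group_key, sort_key, n):
    """Select the top-n rows within each group, mirroring the
    Window.partitionBy(...).orderBy(...) + row_number() pattern in PySpark."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    result = []
    for members in groups.values():
        members.sort(key=lambda r: r[sort_key], reverse=True)
        result.extend(members[:n])
    return result

sales = [
    {"dept": "A", "rep": "x", "amount": 10},
    {"dept": "A", "rep": "y", "amount": 30},
    {"dept": "A", "rep": "z", "amount": 20},
    {"dept": "B", "rep": "w", "amount": 5},
]
top = top_n_per_group(sales, "dept", "amount", 2)
print([(r["dept"], r["rep"]) for r in top])  # -> [('A', 'y'), ('A', 'z'), ('B', 'w')]
```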
PySpark Tutorial — Mastering PySpark GroupBy: Unleashing the Power of Data Aggregation
PySpark GroupBy is a method that allows you to group DataFrame rows based on specific columns and perform aggregations on those groups.
Christopher Chung
Jul 24
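The shape of a group-then-aggregate, mirrored in plain Python (in PySpark this would be `df.groupBy("city").agg(F.sum("sales"))`; the sample rows here are made up):

```python
from collections import defaultdict

# Group rows by "city" and sum "sales" within each group --
# the plain-Python equivalent of df.groupBy("city").agg(F.sum("sales")).
rows = [
    {"city": "Oslo",  "sales": 100},
    {"city": "Oslo",  "sales": 50},
    {"city": "Paris", "sales": 200},
]
totals = defaultdict(int)
for row in rows:
    totals[row["city"]] += row["sales"]
print(dict(totals))  # -> {'Oslo': 150, 'Paris': 200}
```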
Sharing Data Efficiently Across Your Cluster — Apache Spark Broadcasting
Broadcasting in Apache Spark allows you to share a read-only variable across all worker nodes (executors) in your Spark cluster.
Christopher Chung
Jun 3
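In Spark the small side is wrapped with `spark.sparkContext.broadcast(...)` and each executor reads `.value`. A plain-Python model of the idea — one small read-only lookup shared by every "task" (the data here is invented):

```python
# In Spark: bc = spark.sparkContext.broadcast(country_names); tasks read bc.value.
# Plain-Python model: a small read-only lookup table shared by every "task".
country_names = {"NO": "Norway", "FR": "France"}  # the small broadcast side

partitions = [["NO", "FR"], ["FR", "NO"]]  # rows spread across workers

def task(partition, lookup):
    # Every task reads the same shared lookup; none of them modify it.
    return [lookup.get(code, "unknown") for code in partition]

results = [task(p, country_names) for p in partitions]
print(results)  # -> [['Norway', 'France'], ['France', 'Norway']]
```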
Apache Spark Performance Tuning: Repartition
While Spark can handle partitions efficiently, there are situations where manually repartitioning your data can greatly improve…
Christopher Chung
Jun 1
Apache Spark Lazy Evaluation: Transformations vs. Actions
One crucial aspect of using Spark effectively is understanding the distinction between transformations and actions.
Christopher Chung
Feb 3
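An analogy only, not Spark itself: Python generators are lazy in much the same way Spark transformations are, with iteration playing the role of an action:

```python
# Analogy: generator pipelines defer work like Spark transformations;
# consuming the pipeline plays the role of an action.
log = []

def numbers():
    for i in range(3):
        log.append(f"produced {i}")  # side effect records when work happens
        yield i

# "Transformation": building the pipeline does no work yet.
doubled = (n * 2 for n in numbers())
assert log == []  # nothing produced so far -- evaluation is deferred

# "Action": collecting the results triggers the whole pipeline.
result = list(doubled)
print(result)    # -> [0, 2, 4]
print(len(log))  # -> 3 (work happened only at the "action")
```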
Exploding Array Columns in PySpark: explode() vs. explode_outer()
Splitting nested data structures is a common task in data analysis, and PySpark offers two powerful functions for handling arrays…
Christopher Chung
Jan 30
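A plain-Python model of the two functions' semantics (the sample rows are invented): `explode()` drops rows whose array is empty or null, while `explode_outer()` keeps them with a null in the exploded column.

```python
# Model explode() vs explode_outer() on an array column in plain Python.
rows = [
    {"id": 1, "tags": ["a", "b"]},
    {"id": 2, "tags": []},    # empty array
    {"id": 3, "tags": None},  # null array
]

def explode(rows):
    # explode(): rows with empty or null arrays disappear from the output.
    return [(r["id"], t) for r in rows if r["tags"] for t in r["tags"]]

def explode_outer(rows):
    # explode_outer(): such rows are kept, with None in the exploded column.
    out = []
    for r in rows:
        if r["tags"]:
            out.extend((r["id"], t) for t in r["tags"])
        else:
            out.append((r["id"], None))
    return out

print(explode(rows))        # -> [(1, 'a'), (1, 'b')]
print(explode_outer(rows))  # -> [(1, 'a'), (1, 'b'), (2, None), (3, None)]
```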