Polymath Data Lab
‘For Loop’ in Python is Easy
A for loop is a control flow statement that lets you execute a block of code multiple times, iterating over a sequence.
Christopher Chung
Oct 14
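As the teaser says, a for loop runs a block once per item in a sequence. A minimal sketch:

```python
# A for loop repeats a block of code once per item in a sequence.
fruits = ["apple", "banana", "cherry"]

lengths = []
for fruit in fruits:
    lengths.append(len(fruit))

print(lengths)  # -> [5, 6, 6]

# range() produces another common sequence to iterate over.
total = 0
for i in range(1, 5):
    total += i
print(total)  # -> 10
```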
Mastering API Calls in Python: A Step-by-Step Guide for Beginners
APIs, or Application Programming Interfaces, serve as the bridges that connect software, allowing them to communicate and share data.
Christopher Chung
Oct 13
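A rough sketch of the request/parse pattern using only the standard library; the endpoint URL and the response payload here are hypothetical, and the network call is shown only in a comment so the example runs offline:

```python
import json
from urllib.request import urlopen  # stdlib alternative to the popular 'requests' library

def parse_user(payload: str) -> dict:
    """Decode a JSON API response body into a Python dict."""
    return json.loads(payload)

# In a real call you would fetch the body from an endpoint, e.g.:
#   with urlopen("https://api.example.com/users/1") as resp:  # hypothetical URL
#       body = resp.read().decode("utf-8")
# Here we use a canned response so the sketch runs without a network.
body = '{"id": 1, "name": "Ada"}'
user = parse_user(body)
print(user["name"])  # -> Ada
```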
Repartition vs Coalesce in PySpark: Key Differences and Performance Implications
Two commonly used methods for adjusting the number of partitions are repartition() and coalesce(). Let’s dive into when to use each function.
Christopher Chung
Aug 13
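The PySpark calls are `df.repartition(n)` (full shuffle) and `df.coalesce(n)` (merge existing partitions). As a rough plain-Python model of the difference — an illustration of the idea, not Spark's actual implementation:

```python
def coalesce(partitions, n):
    """Model of coalesce(): merge whole existing partitions down to n,
    avoiding a full shuffle -- no individual row moves on its own."""
    merged = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        merged[i % n].extend(part)
    return merged

def repartition(partitions, n):
    """Model of repartition(): a full shuffle that redistributes
    every individual row across n new partitions."""
    rows = [row for part in partitions for row in part]
    out = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        out[i % n].append(row)
    return out

parts = [[1, 2], [3, 4], [5, 6], [7, 8]]
print(coalesce(parts, 2))     # -> [[1, 2, 5, 6], [3, 4, 7, 8]]  (whole partitions merged)
print(repartition(parts, 2))  # -> [[1, 3, 5, 7], [2, 4, 6, 8]]  (rows redistributed)
```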
Ordering Rows by Columns in PySpark — PySpark Tutorial
Sorting or ordering records by columns can help in pre-processing data, ensuring that your data is organised and ready for analysis.
Christopher Chung
Aug 11
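In PySpark this is `df.orderBy(...)`; the same multi-column ordering in plain Python, as an illustrative stand-in:

```python
# Order records by country ascending, then salary descending --
# the same pattern as df.orderBy(col("country"), col("salary").desc()) in PySpark.
records = [
    {"name": "Ann", "country": "UK", "salary": 50},
    {"name": "Bob", "country": "UK", "salary": 70},
    {"name": "Cy",  "country": "FR", "salary": 60},
]
ordered = sorted(records, key=lambda r: (r["country"], -r["salary"]))
print([r["name"] for r in ordered])  # -> ['Cy', 'Bob', 'Ann']
```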
Selecting Top N records by Group — PySpark Tutorial: GroupBy
We often encounter scenarios where we need to select the top N records within each group of a dataset in PySpark.
Christopher Chung
Jul 26
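In PySpark this is usually done with a window function (`row_number()` over a partition ordered by the ranking column); the equivalent logic in plain Python, for illustration only:

```python
from collections import defaultdict

def top_n_per_group(rows, group_key, sort_key, n):
    """Select the top-n rows within each group, mirroring the
    Window.partitionBy(...).orderBy(...) + row_number() pattern in PySpark."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)
    result = []
    for members in groups.values():
        members.sort(key=lambda r: r[sort_key], reverse=True)
        result.extend(members[:n])
    return result

sales = [
    {"dept": "A", "rep": "x", "amount": 10},
    {"dept": "A", "rep": "y", "amount": 30},
    {"dept": "A", "rep": "z", "amount": 20},
    {"dept": "B", "rep": "w", "amount": 5},
]
top = top_n_per_group(sales, "dept", "amount", 2)
print([(r["dept"], r["rep"]) for r in top])  # -> [('A', 'y'), ('A', 'z'), ('B', 'w')]
```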
PySpark Tutorial — Mastering PySpark GroupBy: Unleashing the Power of Data Aggregation
PySpark GroupBy is a method that allows you to group DataFrame rows based on specific columns and perform aggregations on those groups.
Christopher Chung
Jul 24
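The shape of a group-then-aggregate, mirrored in plain Python (in PySpark this would be `df.groupBy("city").agg(F.sum("sales"))`; the sample rows here are made up):

```python
from collections import defaultdict

# Group rows by "city" and sum "sales" within each group --
# the plain-Python equivalent of df.groupBy("city").agg(F.sum("sales")).
rows = [
    {"city": "Oslo",  "sales": 100},
    {"city": "Oslo",  "sales": 50},
    {"city": "Paris", "sales": 200},
]
totals = defaultdict(int)
for row in rows:
    totals[row["city"]] += row["sales"]
print(dict(totals))  # -> {'Oslo': 150, 'Paris': 200}
```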
Sharing Data Efficiently Across Your Cluster — Apache Spark Broadcasting
Broadcasting in Apache Spark allows you to share a read-only variable across all worker nodes (executors) in your Spark cluster.
Christopher Chung
Jun 3
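In Spark the small side is wrapped with `spark.sparkContext.broadcast(...)` and each executor reads `.value`. A plain-Python model of the idea — one small read-only lookup shared by every "task" (the data here is invented):

```python
# In Spark: bc = spark.sparkContext.broadcast(country_names); tasks read bc.value.
# Plain-Python model: a small read-only lookup table shared by every "task".
country_names = {"NO": "Norway", "FR": "France"}  # the small broadcast side

partitions = [["NO", "FR"], ["FR", "NO"]]  # rows spread across workers

def task(partition, lookup):
    # Every task reads the same shared lookup; none of them modify it.
    return [lookup.get(code, "unknown") for code in partition]

results = [task(p, country_names) for p in partitions]
print(results)  # -> [['Norway', 'France'], ['France', 'Norway']]
```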
Apache Spark Performance Tuning: Repartition
While Spark can handle partitions efficiently, there are situations where manually repartitioning your data can greatly improve…
Christopher Chung
Jun 1
Apache Spark Lazy Evaluation: Transformations vs. Actions
One crucial aspect of using Spark effectively is understanding the distinction between transformations and actions.
Christopher Chung
Feb 3
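An analogy only, not Spark itself: Python generators are lazy in much the same way Spark transformations are, with iteration playing the role of an action:

```python
# Analogy: generator pipelines defer work like Spark transformations;
# consuming the pipeline plays the role of an action.
log = []

def numbers():
    for i in range(3):
        log.append(f"produced {i}")  # side effect records when work happens
        yield i

# "Transformation": building the pipeline does no work yet.
doubled = (n * 2 for n in numbers())
assert log == []  # nothing produced so far -- evaluation is deferred

# "Action": collecting the results triggers the whole pipeline.
result = list(doubled)
print(result)    # -> [0, 2, 4]
print(len(log))  # -> 3 (work happened only at the "action")
```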
Exploding Array Columns in PySpark: explode() vs. explode_outer()
Splitting nested data structures is a common task in data analysis, and PySpark offers two powerful functions for handling arrays…
Christopher Chung
Jan 30
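A plain-Python model of the two functions' semantics (the sample rows are invented): `explode()` drops rows whose array is empty or null, while `explode_outer()` keeps them with a null in the exploded column.

```python
# Model explode() vs explode_outer() on an array column in plain Python.
rows = [
    {"id": 1, "tags": ["a", "b"]},
    {"id": 2, "tags": []},    # empty array
    {"id": 3, "tags": None},  # null array
]

def explode(rows):
    # explode(): rows with empty or null arrays disappear from the output.
    return [(r["id"], t) for r in rows if r["tags"] for t in r["tags"]]

def explode_outer(rows):
    # explode_outer(): such rows are kept, with None in the exploded column.
    out = []
    for r in rows:
        if r["tags"]:
            out.extend((r["id"], t) for t in r["tags"])
        else:
            out.append((r["id"], None))
    return out

print(explode(rows))        # -> [(1, 'a'), (1, 'b')]
print(explode_outer(rows))  # -> [(1, 'a'), (1, 'b'), (2, None), (3, None)]
```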