Data Storage in PySpark: save vs saveAsTable

Strategies for Storing DataFrames and Leveraging Spark Tables

Tom Corbin
6 min read · Aug 31, 2023

When it comes to saving DataFrames in PySpark, the choice between ‘save’ and ‘saveAsTable’ is more significant than it might initially appear. Although they perform similar tasks—saving your DataFrame to a location—these methods are quite different. This article dives into their differences, the scenarios where each is most effective, and the implications for data storage and retrieval.

To understand the difference between these two methods, it’s helpful to first understand the concept of a table in Spark.

Understanding Tables in Spark

In traditional databases, tables are physical objects. They are structured data containers with predefined schemas that hold your data, stored as physical files on disk storage. But in Spark, the concept of a table is slightly different.

Spark tables are logical constructs rather than physical entities: they define a schema and point to a data location, rather than being a single physical structure stored on disk. When you create a table in Spark, it stores the data as a collection of files in a distributed file system (more on this later). It also saves some metadata describing…
