PySpark DataFrame

Muttineni Sai Rohith
Published in CodeX
6 min read · Aug 13, 2022

DataFrame has become an industry buzzword, and people use the term in many contexts. In this article, we will learn about the DataFrame in PySpark: its features, importance, creation, and use in exploratory data analysis. Later on, we will work through a use case.

To learn more about PySpark and become a certified PySpark developer, follow Part-1: Pyspark For Beginners.

PySpark DataFrame

A DataFrame is a data structure similar to an Excel sheet or SQL table, where data is organized into rows and columns. Usually, each row represents one observation. A row can hold values of different data types (heterogeneous), whereas a column holds values of a single data type (homogeneous). In addition to data, DataFrames usually carry some metadata, for example, column and row names.
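The row/column distinction above can be illustrated with a small sketch in plain Python. This is only a toy model, not the PySpark API: the `Row` type, the column names, and the sample values are all made up for illustration.

```python
from collections import namedtuple

# Hypothetical schema and sample data, invented for this illustration.
Row = namedtuple("Row", ["name", "age", "salary"])
rows = [
    Row("Asha", 31, 5200.0),
    Row("Ben", 45, 6100.5),
    Row("Chen", 28, 4800.0),
]

# A single row mixes several data types (heterogeneous).
row_types = [type(value).__name__ for value in rows[0]]

# Each column, read top to bottom, holds exactly one data type (homogeneous).
column_types = {
    field: {type(getattr(row, field)).__name__ for row in rows}
    for field in Row._fields
}

print(row_types)
print(column_types)
```

Here the field names play the role of the column metadata, and every column maps to a set containing a single type name.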

Features of PySpark DataFrame

  • DataFrames are distributed in nature, which makes them fault-tolerant and highly available data structures.
  • Lazy evaluation is an evaluation strategy that delays evaluating an expression until its value is needed, which also avoids repeated evaluation. In Spark, this means that execution does not start until an action is triggered. In Spark, the picture of lazy…
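The idea of lazy evaluation can be sketched without a cluster. The toy class below is plain Python and is not how Spark is implemented; `LazyDataset`, `double`, and the sample numbers are hypothetical names chosen for this illustration. Transformations (`map`, `filter`) only record a step, and nothing runs until the action (`collect`) is called.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated dataset (not Spark's implementation)."""

    def __init__(self, data, steps=None):
        self._data = list(data)
        self._steps = steps or []  # recorded transformations, not yet executed

    def map(self, fn):
        # Transformation: record the step and return a new dataset.
        return LazyDataset(self._data, self._steps + [("map", fn)])

    def filter(self, predicate):
        # Transformation: record the step and return a new dataset.
        return LazyDataset(self._data, self._steps + [("filter", predicate)])

    def collect(self):
        # Action: only now are the recorded steps applied to the data.
        results = []
        for item in self._data:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                results.append(item)
        return results


calls = []  # records every time the mapping function actually runs


def double(x):
    calls.append(x)
    return x * 2


pipeline = LazyDataset([1, 2, 3, 4]).map(double).filter(lambda x: x > 4)
calls_before_action = len(calls)  # no work has been done yet
result = pipeline.collect()       # the action triggers execution
calls_after_action = len(calls)

print(calls_before_action, result, calls_after_action)
```

Building the pipeline costs nothing up front; `double` is invoked only once `collect` is called, which mirrors the "execution will not start until an action is triggered" behavior described above.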

Senior Data Engineer with experience in Python, Pyspark and SQL! Reach me at sairohith.muttineni@gmail.com