How to Read Excel files using pyspark in Databricks?

Amit Joshi
4 min readSep 2, 2023

--

In one of my recent requirements, I encountered the need to read Excel files using PySpark in Databricks. While reading CSV files is relatively straightforward with ample resources available, dealing with Excel files can be a bit trickier. In this article, we’ll dive into the process of reading Excel files using PySpark and explore various options and parameters to tailor the reading process to your specific requirements.

I am using the Databricks community edition. Here you can see a sample Excel file In which I have used sample1 file below.

Let's get started.

The first step is to upload your Excel file to Databricks. You can do this through several methods, including using Databricks File System (DBFS), external storage like Amazon S3, or other available data sources. I’ll walk you through the process of uploading the file to DBFS

The file has been successfully uploaded. The next step is to open the cluster on which you will run the notebook. Then, navigate to the cluster configurations, select “Libraries,” and click on “Install New” as shown below.

Select “Maven” as the Library source. In the “Coordinates” field, copy and paste the following: “com.crealytics:spark-excel_2.12:0.13.5”. Alternatively, you can choose the latest version by clicking on “Search Packages,” and then proceed to click “Install.”

You will see The library that we installed below which is used to read excel files in pyspark.

The Actual Pyspark code to read the Excel file:

from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
.appName("ReadExcelWithHeader") \
.config("spark.jars.packages", "com.crealytics:spark-excel_2.12:0.13.5") \
.getOrCreate()

# Define the path to your Excel file
excel_file_path = "/FileStore/tables/sample1.xlsx"
df = spark.read \
.format("com.crealytics.spark.excel") \
.option("header", "true") \
.option("inferSchema", "true") \
.load(excel_file_path)

# Show the DataFrame
df.show()

Initially, we established a Spark session where we included package details in the configuration section. While it’s possible to read Excel files without specifying these package details, it is considered a best practice to include them. Following that, we designated the path to our Excel file and created a DataFrame for reading the Excel data.

Now that we’ve successfully read the file, you might encounter situations where additional parameters are necessary to fine-tune the reading process. For example, you may want to read specific cells only from a file or want to specify a custom date and time format, or read data from a password-protected file. These scenarios can be addressed by setting specific options accordingly.

In my case, there are two rows at the beginning of each file that need to be skipped during data reading. You can see below.

If you use the same method as described above, you may obtain an incorrect output due to the presence of the initial rows that need to be skipped. You can see the result below which is not useful for me.

To address this issue, you can utilize the “.option(“dataAddress”, “‘NameOfYourExcelSheet’!YourCellRange”)”. This option allows you to skip rows or specify specific data to be read. In my case, I’ve used “.option(“dataAddress”, “ ‘Sheet1’ !A3:E393”)” to skip the initial rows and read the data from row 3 to row 393(last row) in the ‘Sheet1’ of my Excel sheet.

Now I got the desired result. This was just one option. Here are some options you can define according to the requirements.

.option("header", "true") \ #use first row as header
.option("inferSchema", "true") \ #to infer schema from file
.option("dataAddress", "'SheetName'!A3:E393") \ #to select specific cells or sheet(give only sheet name) only
.option("treatEmptyValuesAsNulls", "true") \ #to treat empty values as null
.option("timestampFormat", "yyyy-MM-dd HH:mm:ss") \ #specify timestamp format
.option("dateFormat", "yyyy-MM-dd") \ #specify date format
.option("workbookPassword", "your_password") \ #to specify password for protetced files

You can explore more about options here. Remember that you don’t need to specify all of these options in every case. You should choose the options that are relevant to your specific Excel file and the way you want to read it. Adding unnecessary options may complicate your code and potentially slow down the reading process.

Thank you for reading! If you enjoyed this article and would like to stay updated with my future content, feel free to follow me on Medium.

If you have any queries or any topic-related suggestions, feel free to reach out to Me via LinkedIn.

--

--