Using PySpark, reading a Parquet file from blob storage, adding and renaming columns, and saving it as a table in Databricks
This blog shows how to build a table in Databricks by adding and renaming columns using PySpark.
The blog will go over the three steps listed below in detail:
• Reading data from blob storage.
• Adding and renaming a column in the Parquet file using PySpark.
• Saving the file as a table in Databricks.
Reading data from blob storage:
Step 1: Click the Create tab on the left side of Azure Databricks.
Step 2: Select ‘Notebook’ from the Create tab. In the resulting dialog box, enter the necessary details such as Name, Default Language, and Cluster, and then click Create. The notebook will be created.
Step 3: To retrieve the storage account details from Azure Blob Storage, copy the storage account name and key from the Access keys tab.
Step 4: Mount the Azure Blob storage container in the newly created notebook by configuring the storage account name and key as shown below.
# Mount an Azure Blob storage container
dbutils.fs.mount(
    source="wasbs://db-input@<storage-account-name>.blob.core.windows.net/",
    mount_point="/mnt/<mount-name>",
    extra_configs={"fs.azure.account.key.<storage-account-name>.blob.core.windows.net": "<key>"})
* mount-name: Any name of the user’s choosing.
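To verify the mount, you can list the contents of the mounted path (using the mount name chosen above):

# List the files under the new mount point to confirm the mount worked
display(dbutils.fs.ls("/mnt/<mount-name>"))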
Step 5: Open the blob storage and read the Parquet file.
* Create a dataframe in the notebook cell, for example ‘df’, to hold the result of the read command that reads the file from the blob. The show() function is used to display the dataframe’s contents.
df = spark.read.parquet("/mnt/blob-mnt/sample_file.parquet")
df.show()
Output:
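Before moving on, you can also inspect the schema Spark inferred from the Parquet file:

# Print the column names and types inferred from the Parquet file
df.printSchema()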
Adding and renaming a column in the Parquet file using PySpark:
Step 1: Run ‘from pyspark.sql.functions import lit’ in a notebook cell to import the lit function from pyspark.sql.functions.
Step 2: Create a new dataframe, for example ‘df1’, that reads ‘df’ and adds the desired column with its value; the withColumn() function is used for this.
Step 3: The value of the Date column is the current DateTime; the lit() function is used to specify a literal value, and the datetime.now() function is used to retrieve the current date and time, as shown in the sketch below.
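Putting Steps 2 and 3 together, a minimal sketch looks like this (the column name ‘DATE’ is assumed here so that it matches the rename in the next step):

from datetime import datetime
from pyspark.sql.functions import lit

# Add a DATE column whose value is the current date and time
df1 = df.withColumn('DATE', lit(datetime.now()))
df1.show()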
Step 4: Create another dataframe, for example ‘df2’, with the column renamed from ‘df1’.
df2 = df1.withColumnRenamed('DATE', 'ORDER_DATE')
df2.show()
* The withColumnRenamed() function is used to rename the column (‘DATE’ to ‘ORDER_DATE’).
Saving the file as a table in Databricks:
Step 1: Select the dataframe and write it as a table with the following PySpark code:
df2.write.mode("append").saveAsTable('Customer_Table')
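Note that mode("append") adds rows to the table if it already exists; to replace the table’s contents on each run, overwrite mode can be used instead:

# Replace any existing table contents rather than appending to them
df2.write.mode("overwrite").saveAsTable('Customer_Table')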
Step 2: Select from the table created in Databricks by switching the cell’s language to SQL (for example, with the %sql magic command) and executing a SELECT statement.
%sql
SELECT * FROM Customer_Table
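The same query can also be run from a Python cell without switching languages:

# Query the saved table directly with PySpark
spark.sql("SELECT * FROM Customer_Table").show()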
To view the table, go to the Data tab on the left side of Azure Databricks and open the ‘Default’ database. A table with the given name will be listed there.
About us:
Bi3 has been recognized for being one of the fastest-growing companies in Australia. Our team has delivered substantial and complex projects for some of the largest organizations around the globe and we’re quickly building a brand that is well known for superior delivery.
Website: https://bi3technologies.com/
Follow us on:
LinkedIn: https://www.linkedin.com/company/bi3technologies
Instagram: https://www.instagram.com/bi3technologies/
Twitter: https://twitter.com/Bi3Technologies