Databricks sample application and scheduling

Nagaraju Gajula
Better Data Platforms
3 min read · Aug 11, 2023

Introduction:

Databricks is a cloud-based platform that provides a unified workspace for analytics and data science. It has a built-in Spark engine and supports running Spark jobs and applications.

Databricks UI:

  1. Create: From here you can create notebooks, tables, clusters, and more.
  2. Workspace: The Workspace is where you create, view, and manage your notebooks, dashboards, libraries, and data. You can organize these resources into folders and share them with other users or groups.
  3. Recent: Recent lists the notebooks you have opened most recently, so you can reopen them directly from this area.
  4. Search: You can search for any notebook from this area.
  5. Compute: Clusters are the compute resources you use to run your Spark jobs and applications. In the Compute tab, you can create, edit, and terminate clusters, and configure settings such as the number and type of nodes, the Spark version, and the auto-scaling behavior.
  6. Workflows: Jobs are the way to run your Spark applications on Databricks. In the Workflows tab, you can create, edit, and manage jobs. You can specify the main application file or notebook, the cluster to use, the command-line arguments, and the scheduling options. You can also monitor job runs, view the logs, and receive email notifications.
  7. Data: Databricks provides several ways to access and manage your data. You can use the Workspace to upload, browse, and preview data files, or use the DBFS API or the Databricks CLI to manage data programmatically (a small example follows this list). In addition, Databricks integrates with external storage systems such as AWS S3, Azure Blob Storage, and Google Cloud Storage.
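
As a small illustration of programmatic data access, the sketch below lists and copies files on DBFS using the dbutils.fs utilities that are available inside a Databricks notebook. The paths under /FileStore/tables are placeholders chosen for this example.

# List the files uploaded under /FileStore/tables (path is a placeholder)
files = dbutils.fs.ls("dbfs:/FileStore/tables")
for f in files:
    print(f.path, f.size)

# Copy an example file to another DBFS location (both paths are placeholders)
dbutils.fs.cp("dbfs:/FileStore/tables/input.csv", "dbfs:/FileStore/archive/input.csv")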

Notebooks: Notebooks are interactive documents that allow you to write and execute code, visualize data, and document your analysis. In the Workspace, you can create, edit, and delete notebooks. You can also collaborate with others in real time, version-control your notebooks, and schedule them to run as jobs.

Here’s an example PySpark application that can be run on Databricks:

from pyspark.sql import SparkSession

if __name__ == "__main__":
    # Create a SparkSession
    spark = SparkSession.builder.appName("Example").getOrCreate()

    # Read a CSV file into a DataFrame
    df = spark.read.csv("dbfs:/FileStore/tables/input.csv", header=True, inferSchema=True)

    # Transform the DataFrame: filter on column3, then keep two columns
    df2 = df.filter(df["column3"] > 0).select("column1", "column2")

    # Write the transformed DataFrame to a Parquet file
    df2.write.parquet("dbfs:/FileStore/tables/output.parquet")

    # Stop the SparkSession
    spark.stop()

Note that the paths for the input and output files use the dbfs scheme, which refers to the Databricks File System (DBFS), a file system layer that Databricks provides on top of cloud object storage. You can also use other storage systems like AWS S3 or Azure Blob Storage, but you'll need to configure your Databricks cluster to access them.
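
As an illustration, the snippet below reads the same CSV from an S3 bucket via the s3a scheme. The bucket name and key are placeholders, and it assumes the cluster can already authenticate to S3 (for example through an instance profile).

# Read the input from S3 instead of DBFS (bucket and key are placeholders).
df = spark.read.csv("s3a://my-example-bucket/path/input.csv", header=True, inferSchema=True)

# Write the result back to S3 in Parquet format
df.filter(df["column3"] > 0).select("column1", "column2") \
    .write.parquet("s3a://my-example-bucket/path/output.parquet")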

To schedule a Databricks notebook, you can create a Databricks job. Here are the steps to create a job that runs a notebook on a schedule (an equivalent Jobs API call is sketched after the list):

  1. In the Databricks workspace, open the notebook that you want to schedule.

  2. Click the “Schedule” button in the right-hand sidebar.
  3. Provide the job name.
  4. Choose whether the job is triggered manually or on a schedule.
  5. If you choose Scheduled, provide the scheduling details (frequency and time zone).
  6. Choose between an existing cluster and a new job cluster.
  7. If you choose a job cluster, provide the cluster settings.
  8. If your notebook takes parameters, provide their values.
  9. Enter the email addresses that should receive alerts.
  10. Click Create.
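
For automation, the same scheduled job can be defined through the Databricks Jobs REST API. The sketch below is a minimal example only: the workspace URL, token, notebook path, cluster settings, email address, and cron expression are all placeholder assumptions, and the field names follow the Jobs API 2.1, so verify them against your workspace before relying on this.

import requests

# All of these values are placeholders for illustration.
WORKSPACE_URL = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

job_spec = {
    "name": "example-notebook-job",
    "tasks": [
        {
            "task_key": "run_notebook",
            "notebook_task": {
                "notebook_path": "/Users/you@example.com/ExampleNotebook",
                "base_parameters": {"run_date": "2023-08-11"},
            },
            "new_cluster": {
                "spark_version": "13.3.x-scala2.12",
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
        }
    ],
    # Run every day at 06:00 in the given time zone (Quartz cron syntax)
    "schedule": {
        "quartz_cron_expression": "0 0 6 * * ?",
        "timezone_id": "UTC",
    },
    "email_notifications": {"on_failure": ["you@example.com"]},
}

response = requests.post(
    f"{WORKSPACE_URL}/api/2.1/jobs/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=job_spec,
)
response.raise_for_status()
print("Created job with id:", response.json()["job_id"])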

Conclusion:

By following these steps, you’ll have established a job that automates the execution of your notebook according to the defined schedule and settings, helping you efficiently manage and streamline your data processing and analysis tasks.
