Things to know about Databricks Clusters as a Data Engineer

Aditya Shaw
2 min read · May 13, 2022


As developers, we often miss the small things that are running in the backend. Most of us are aware of clusters, but we overlook small details that are very helpful and that we should know.

As per the official documentation, a Databricks cluster is a set of computation resources and configurations on which you run data engineering, data science, and data analytics workloads, such as production ETL pipelines, streaming analytics, ad-hoc analytics, and machine learning.

Some points that we should be aware of about the clusters are given below:

  1. There are two types of clusters, and each has its own pros and cons as mentioned below.

· All Purpose: Used to analyse data collaboratively via interactive notebooks; it can be shared by multiple users.

· Job Cluster: Used to run fast and robust automated jobs; it terminates after the job is completed and cannot be restarted once terminated.
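The difference shows up directly in how a job task is defined. Below is a minimal sketch of two Jobs API 2.1 task payloads: one pinned to an existing all-purpose cluster, one requesting a fresh job cluster per run. The cluster ID, notebook paths, runtime version and node type are placeholder values.

```python
# Task pinned to an existing all-purpose cluster: starts faster if the
# cluster is already running, but the cluster is shared and keeps running
# (and billing) after the job finishes.
interactive_task = {
    "task_key": "adhoc_analysis",
    "notebook_task": {"notebook_path": "/Shared/analysis"},  # placeholder path
    "existing_cluster_id": "0513-123456-abcdef12",           # placeholder ID
}

# Task with a job cluster: Databricks creates the cluster for the run and
# terminates it when the job completes, so you only pay for the run itself.
automated_task = {
    "task_key": "nightly_etl",
    "notebook_task": {"notebook_path": "/Jobs/etl"},  # placeholder path
    "new_cluster": {
        "spark_version": "10.4.x-scala2.12",  # example runtime version
        "node_type_id": "Standard_DS3_v2",    # example Azure node type
        "num_workers": 2,
    },
}
```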

2. Notebooks attached to an all-purpose cluster can be seen on the cluster's information page.

3. There are three cluster modes in Databricks.

· High Concurrency: It is optimised to run concurrent SQL, Python and R workloads. It does not support Scala.

· Standard: It is recommended for single-user clusters. It can run SQL, Python, R and Scala workloads.

· Single Node: A cluster with no workers. It is recommended for single-user computing on small data volumes.

The default cluster mode is Standard.
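A Single Node cluster is configured through the cluster spec itself. Here is a sketch of a Clusters API payload, assuming the commonly documented field names; the cluster name, runtime version and node type are placeholders.

```python
# Sketch of a Clusters API payload for a Single Node cluster.
single_node_cluster = {
    "cluster_name": "single-node-demo",      # placeholder name
    "spark_version": "10.4.x-scala2.12",     # example runtime version
    "node_type_id": "Standard_DS3_v2",       # example Azure node type
    "num_workers": 0,  # no workers: the driver executes the Spark jobs itself
    "spark_conf": {
        # These two settings are what make the cluster run in Single Node mode
        "spark.databricks.cluster.profile": "singleNode",
        "spark.master": "local[*]",
    },
    "custom_tags": {"ResourceClass": "SingleNode"},
}
```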

4. While starting a terminated cluster, Databricks re-creates the cluster with the same cluster ID, automatically installs all the libraries, and re-attaches the notebooks.

5. The auto-termination of the cluster depends on the mode of the cluster. Standard clusters are configured to terminate automatically after 120 minutes, whereas High Concurrency clusters are configured not to terminate automatically.
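Both defaults can be overridden in the cluster spec via the `autotermination_minutes` field. A minimal sketch (cluster names are placeholders):

```python
# Mirror the Standard-mode default: terminate after 120 idle minutes.
standard_like = {"cluster_name": "dev-standard", "autotermination_minutes": 120}

# Mirror the High Concurrency behaviour: 0 disables auto-termination,
# so the cluster keeps running (and billing) until stopped manually.
always_on = {"cluster_name": "shared-hc", "autotermination_minutes": 0}
```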

6. The direct print and log statements from your notebooks, jobs, and libraries go to the Spark driver logs and help in debugging. These logs have three outputs:

· Standard output

· Standard error

· Log4j logs
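In a notebook, the first two outputs map directly onto where you write from Python; the messages below are placeholders used only for illustration.

```python
import sys

# In a Databricks notebook, print() output shows up under the
# "Standard output" tab of the driver logs, writes to sys.stderr show up
# under "Standard error", and the Log4j tab collects the JVM-side Spark logs.
stdout_msg = "rows processed: 1000"        # placeholder message
stderr_msg = "warning: 3 rows skipped\n"   # placeholder message

print(stdout_msg)             # -> Standard output tab
sys.stderr.write(stderr_msg)  # -> Standard error tab
```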

For detailed documentation on Databricks clusters, please refer to the official documentation on clusters.

If you have any queries or doubts please feel free to connect with me on LinkedIn.


Aditya Shaw

Data Engineer, Tech Enthusiast. Interests in Personal Finance, Stock Market, Investments & Photography. Connect: https://www.linkedin.com/in/adityashaw18/