Free Trial Databricks Tutorial

Huishuanghsu
6 min read, Feb 11, 2024

New to Databricks and exploring Databricks features

Photo from 銀菁夢 website

Brief Introduction

Databricks is a leading data analytics platform designed to simplify the process of big data analysis for businesses and organizations. It was founded by the original creators of Apache Spark, a powerful open-source processing engine built around speed, ease of use, and sophisticated analytics. Databricks aims to empower users to leverage the potential of big data and artificial intelligence (AI) through a unified analytics platform. This platform offers an integrated environment for data engineering, collaborative data science, full-scale machine learning, and business analytics.

One notable feature of Databricks is MLflow, an open-source platform that manages the machine learning lifecycle, including experimentation, development, and deployment. MLflow enhances Databricks by facilitating the development, refinement, and deployment of machine learning models at scale. This accelerates the transformation of data into actionable insights and promotes a culture of innovation within organizations.

In this blog post, I will explain how to register for and use Databricks Community Edition. Unlike the 14-day free trial of Databricks, the Community Edition is somewhat limited in functionality, but it does not require you to have your own cloud account or to provide your own compute or storage resources. It lets you try Apache Spark on Databricks without time restrictions.

Registration

Create an account

Please follow the link below and enter your personal information.

https://www.databricks.com/try-databricks#account

Image by Author

Continue

Clicking “Continue” will take you to the next page, where you can choose “Get started with Community Edition” at the bottom. The Databricks Community Edition does not require selecting a cloud provider.

Image by Author

You will receive a welcome email from Databricks. Click the link in that email to verify your email address. The system will then prompt you to create a Databricks password.

When you click on “Submit”, the system will take you to the Databricks Community Edition homepage.

Databricks Community Edition homepage.

https://community.cloud.databricks.com/

Image by Author

Getting Started — Create cluster

A cluster consists of a set of computational resources that are utilized to run data engineering, data science, and machine learning workloads in a scalable and managed environment. The Community Edition provides access to a micro-cluster, which includes 15.3 GB of memory, 2 cores, and 1 DBU.

Click “Create”, then select “Cluster” from the left column.

Image by Author

You can select a different runtime version. Choose the Databricks Runtime based on compatibility, stability, performance, and specific feature requirements for your data processing and machine learning tasks.

For more runtime information: https://docs.databricks.com/en/release-notes/runtime/index.html

Image by Author

Click “Create compute”.

Cluster status can be monitored and managed via the Compute section.

Image by Author

After starting the cluster, you can attach your notebook to it. The cluster will automatically terminate after an idle period of one or two hours.

Input Data

Click “Create” again, then click “Table”.

Upload the data, then select the cluster to which you want to attach it.

Click “Create Table with UI”, choose a cluster, and then click “Preview Table”.

Image by Author

Processing a large dataset may take some time. You can edit the table name, the column names, and the column data types. Select “First row is header” if the first row of your dataset contains the column names.
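To make the effect of “First row is header” concrete, here is a minimal sketch in plain Python. It is not how Databricks implements the import. The sample data, the `read_table` helper, and the gene identifiers are hypothetical. The `_c0`, `_c1` placeholder names imitate the default column names Spark assigns when no header is given.

```python
import csv
import io

# Hypothetical sample standing in for an uploaded CSV file.
sample = "gene,genename\nENSG01,BRCA1\nENSG02,TP53\n"

def read_table(text, first_row_is_header):
    """Return (column_names, data_rows) for CSV text."""
    rows = list(csv.reader(io.StringIO(text)))
    if first_row_is_header:
        # The first row supplies the column names; the rest is data.
        return rows[0], rows[1:]
    # No header: generate placeholder names and keep every row as data.
    names = [f"_c{i}" for i in range(len(rows[0]))]
    return names, rows

with_header = read_table(sample, first_row_is_header=True)
without_header = read_table(sample, first_row_is_header=False)

print(with_header[0], len(with_header[1]))        # column names, row count
print(without_header[0], len(without_header[1]))
```

Without the option, the header line is counted as a data row and the columns get placeholder names, which is usually not what you want for a file like this one.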

Image by Author

Click “Create Table”.

Image by Author

Create Notebook

Click “Create” again, then click “Notebook”.

Image by Author

You can select your preferred language for the whole notebook or for an individual cell.

Notebook

Image by Author

Cell

Image by Author

Check columns and rows in table

SELECT COUNT(*) FROM genes_csv;
SHOW COLUMNS FROM genes_csv;

Start coding

This is an example of how to join tables and extract the desired data.

Create a view that shows the sum of absolute weights per gene name, restricted to genes with a significant p-value (lower than 1E-8). The view joins weights_csv with genes_csv and perf_csv, using ‘gene’ as the key common to all three tables.

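-- Sum of absolute weights per gene name, for significant genes only.
-- OR REPLACE lets the cell be re-run without a "view already exists" error.
CREATE OR REPLACE VIEW weight_view_1 AS
SELECT SUM(ABS(weights_csv.weight)) AS sum_weight, genes_csv.genename
FROM weights_csv
JOIN genes_csv ON genes_csv.gene = weights_csv.gene
JOIN perf_csv ON perf_csv.gene = weights_csv.gene
WHERE ABS(perf_csv.pred_perf_pval) < 0.00000001  -- threshold of 1E-8
GROUP BY genes_csv.genename;

-- Inspect the result.
SELECT * FROM weight_view_1;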
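If you want to check what this view computes before uploading real data, the sketch below runs the same query on a tiny made-up dataset. It uses Python's built-in sqlite3 module as a stand-in for Spark SQL, so it is only an approximation of the Databricks behavior. The table and column names come from the query above, while the rows (`g1`, `g2`, `ALPHA`, `BETA`, and their values) are invented for illustration.

```python
import sqlite3

# In-memory SQLite database as a stand-in for the Databricks tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE genes_csv (gene TEXT, genename TEXT);
CREATE TABLE weights_csv (gene TEXT, weight REAL);
CREATE TABLE perf_csv (gene TEXT, pred_perf_pval REAL);

-- Invented sample rows: g1 is significant, g2 is not.
INSERT INTO genes_csv VALUES ('g1', 'ALPHA'), ('g2', 'BETA');
INSERT INTO weights_csv VALUES ('g1', 0.5), ('g1', -0.25), ('g2', 2.0);
INSERT INTO perf_csv VALUES ('g1', 1e-10), ('g2', 0.05);

CREATE VIEW weight_view_1 AS
SELECT SUM(ABS(weights_csv.weight)) AS sum_weight, genes_csv.genename
FROM weights_csv
JOIN genes_csv ON genes_csv.gene = weights_csv.gene
JOIN perf_csv ON perf_csv.gene = weights_csv.gene
WHERE ABS(perf_csv.pred_perf_pval) < 0.00000001
GROUP BY genes_csv.genename;
""")

rows = conn.execute("SELECT * FROM weight_view_1").fetchall()
print(rows)  # [(0.75, 'ALPHA')]
```

Only ALPHA survives the p-value filter, and its two weights (0.5 and -0.25) add up to 0.75 in absolute value. BETA is dropped because its p-value of 0.05 is above the threshold.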

Data and Visualization

Click the plus sign to access the various features provided by Databricks, allowing you to visualize your output.

SELECT * FROM weight_view_1
Image by Author

In the Visualization section, you have the option to choose from various graph types to display your results. This interface allows you to customize the desired X and Y axes, as well as the color of the bars, among other settings.

Complete your settings, then save.

Image by Author
Image by Author

You can edit the graph title by clicking on the tab.

Image by Author

Click the plus sign on the right again, and then select Data Profile. Databricks will provide information about your table, similar to the information generated by pandas’ df.describe() method.
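For readers who have not used df.describe(), the following is a rough sketch of the kind of per-column summary involved, written with Python's standard library only. It is not the Data Profile implementation, and it covers just a few of the statistics. The `describe` helper and the sample `sum_weight` values are hypothetical.

```python
import statistics

def describe(values):
    """Summarize one numeric column with a few basic statistics."""
    ordered = sorted(values)
    return {
        "count": len(ordered),
        "mean": statistics.mean(ordered),
        "std": statistics.stdev(ordered),  # sample standard deviation
        "min": ordered[0],
        "median": statistics.median(ordered),
        "max": ordered[-1],
    }

# Invented values standing in for the sum_weight column.
sum_weight = [1.0, 2.0, 3.0, 4.0, 5.0]
summary = describe(sum_weight)
print(summary)
```

The Data Profile view shows this kind of information for every column at once, along with charts, so you rarely need to compute it by hand.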

Image by Author

To create a dashboard, click “Add to Dashboard” to incorporate all cell outputs from the notebook. You can remove any unwanted graphs by clicking the “x” or rearrange them by dragging them to your preferred location.

Image by Author
Image by Author

To download the output to your local computer, click on “Table”, then choose “Download All Rows”.

Conclusion

Thank you for joining me on my learning journey with Databricks. Databricks is a powerful platform that handles data input, transformation, extraction, and visualization in one place, which simplifies the whole process of building and deploying models.

Collaborating with colleagues and sharing your work is seamless on this platform. In addition to the features already highlighted, Databricks also provides AutoML, model deployment, performance monitoring, webhooks, and task scheduling, none of which are available in the Community Edition. Moreover, the small cluster in the Community Edition can lead to long task execution times. To explore Databricks in more depth and access more features, you might want to consider subscribing or opting for the 14-day trial.

For training courses or certifications, please visit:

https://www.databricks.com/learn/training/home

Happy Learning!
