What is Databricks? And why use it?

Nathira Wijemanne
Published in Technoid Community · 11 min read · Jun 12, 2023

Databricks was founded in 2013 by a team of engineers who worked on Apache Spark at the University of California, Berkeley. Their main goal was to make big data processing and machine learning accessible to a broader audience by providing a cloud-based platform that simplifies data transformation and analytics.

What is Databricks?

Databricks is essentially a unified analytics platform designed for large-scale data processing and machine learning applications. It is cloud-based and provides an integrated environment for data engineers, scientists, and other stakeholders to work together on data projects. Databricks supports SQL, Python, Scala, R, and Java to perform data analysis and processing, and offers several commonly used libraries and frameworks for data processing and analysis.

Here are some reasons why you may want to use Databricks:

Scalability: Databricks is built on top of Apache Spark, a distributed computing framework designed for processing large datasets in parallel. Spark allows users to scale out their data processing across a cluster of machines, enabling them to process data faster and more efficiently. This makes Databricks a great choice for projects that require processing large datasets. Databricks also offers other features that contribute to its scalability, such as elastic compute on the underlying cloud provider (for example, Amazon EC2) and autoscaling, which dynamically adjusts a cluster's compute resources based on workload demands, as well as Delta Lake, an open-source storage format that provides scalable data storage and management capabilities.
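
As a rough illustration, here is what an autoscaling cluster definition might look like when submitted to the Databricks Clusters API. This is a minimal sketch: the cluster name, runtime version, node type, and worker counts are placeholder values you would adapt to your workload and cloud provider.

```python
import requests

# Placeholder values -- replace with your own workspace URL and access token.
DATABRICKS_HOST = "https://<your-workspace>.cloud.databricks.com"
TOKEN = "<personal-access-token>"

# Cluster definition with autoscaling: Databricks adds or removes workers
# between min_workers and max_workers based on the workload.
cluster_spec = {
    "cluster_name": "autoscaling-demo",
    "spark_version": "13.3.x-scala2.12",   # a Databricks runtime version
    "node_type_id": "i3.xlarge",            # AWS instance type; varies by cloud
    "autoscale": {"min_workers": 2, "max_workers": 8},
    "autotermination_minutes": 30,          # shut down idle clusters to save cost
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
print(resp.json())  # returns the new cluster_id on success
```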

Collaboration: Databricks provides a collaborative environment for data teams to work together on projects. This includes version control, shared notebooks, and the ability to collaborate in real time. Databricks notebooks are the most basic form of collaboration: through the Databricks workspace, multiple users can edit and work on the same notebook simultaneously. Beyond notebooks, Databricks also provides other collaboration tools such as Databricks Repos, the Databricks File System (DBFS), shared libraries, and shared workspaces.

Databricks Repos is a version control system integrated with Databricks that allows users to manage their code and collaborate with other team members on data engineering, data science, and machine learning projects. It is based on the Git version control system and provides features similar to other Git tools, including branching and merging, code reviews, code search, commit history, and collaboration. Users can create and manage branches to work on different features or experiments and then merge changes back into the main branch when they are ready, open pull requests so that tech leads can review and approve code changes before they are merged into the main branch, and collaborate on code by commenting on pull requests, opening issues, and assigning tasks.

Databricks Repos is tightly integrated with Databricks notebooks and other Databricks features, allowing users to easily manage code and data in a single platform. What is most convenient about Databricks Repos is that it also integrates with other popular Git providers, such as GitHub, Bitbucket, GitLab, Azure DevOps, and AWS CodeCommit, allowing users to connect Databricks Repos to their existing workflows. Setup instructions are available in the Databricks documentation.

Native Git providers that can be integrated with Databricks Repos

Automation: Databricks provides automation tools that can help you automate the process of data ingestion, transformation, and analysis, which can save your team a lot of time and effort. Databricks Jobs (Workflows), Delta Live Tables, notebooks, the Databricks CLI, and REST APIs are the main automation tools included in Databricks. Databricks allows you to create workflows with dependent tasks so that your pipeline runs seamlessly. The ability to define your own workflows gives users full autonomy in designing and improving ETL processes for different use cases. And if a workflow fails on a single task, Databricks lets users “Repair and Re-run” the flow from the failed task onwards, so processing and computations aren’t duplicated, saving time and cost.
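
For illustration, a multi-task workflow with dependencies can be defined as a Jobs API payload roughly like the following. This is only a sketch: the job name, notebook paths, task names, and cluster ID are hypothetical.

```python
# Sketch of a Databricks Jobs (Workflows) definition with dependent tasks.
# The "transform" task only runs after "ingest" succeeds; if it fails, the job
# can be repaired and re-run from the failed task onwards.
job_spec = {
    "name": "daily-etl",
    "tasks": [
        {
            "task_key": "ingest",
            "notebook_task": {"notebook_path": "/Repos/etl/ingest"},
            "existing_cluster_id": "<cluster-id>",
        },
        {
            "task_key": "transform",
            "depends_on": [{"task_key": "ingest"}],
            "notebook_task": {"notebook_path": "/Repos/etl/transform"},
            "existing_cluster_id": "<cluster-id>",
        },
    ],
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # run daily at 02:00
        "timezone_id": "UTC",
    },
}
# This payload would be submitted to the Jobs API (POST /api/2.1/jobs/create)
# or built equivalently through the Workflows UI.
```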

Integrations: Databricks integrates with many popular data tools and services, including AWS, Azure, and Google Cloud. This makes it easy to integrate Databricks into your existing data infrastructure. In addition to this, Partner Connect is a feature in Databricks Premium Tier that allows users to easily discover and connect with Databricks partners. These partners include software vendors, system integrators, and consulting firms who have developed solutions and services that integrate with Databricks.

With Partner Connect, users can quickly browse and search for partners based on their specific needs, and can access their solutions and services directly within Databricks, without the need to install or configure anything.

Partner Connect provides a simple way for users to access a range of third-party solutions, such as data connectors, machine learning libraries, and custom applications, which helps users save time and effort. This means that in just a few clicks you can integrate your lakehouse with everyday tools such as Power BI, Tableau, Azure Data Factory, and many more.

A few third party tools that can be integrated through Partner Connect

Security: Databricks provides robust security features that help keep your data safe. This includes access controls, encryption, secret scopes, and compliance with industry standards.
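
Secret scopes in particular keep credentials out of notebooks. Here is a minimal sketch; the scope name, key name, and JDBC connection details are hypothetical placeholders.

```python
# Read a credential from a Databricks secret scope instead of hard-coding it.
# "jdbc-scope" and "warehouse-password" are placeholder names for this example.
password = dbutils.secrets.get(scope="jdbc-scope", key="warehouse-password")

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://db.example.com:5432/analytics")
    .option("dbtable", "public.orders")
    .option("user", "report_user")
    .option("password", password)   # value is redacted if printed in a notebook
    .load()
)
```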

In-built Data Profiling and Visualisation Capabilities: Databricks notebooks have in-built visualisation and profiling capabilities when data tables are viewed. The “Visualization Editor” allows users to customize their visualisations by providing fields for grouping, categorising, and even simple formatting options.

Visualization Editor in Databricks Notebooks
Types of in-built visualisations available on a Databricks Notebook

What is Delta Lake?

Delta Lake is one of the key features that makes Databricks a useful tool for data scientists, engineers, and other stakeholders. It is an open-source data lake technology designed to provide scalable data storage and management capabilities for big data workloads. It is a storage layer that runs on top of Apache Spark, which allows it to leverage Spark’s distributed computing capabilities for high-performance data processing and analytics.

Delta Lake provides several features that make it a powerful technology for big data workloads, including:

ACID transactions: Delta Lake supports atomic, consistent, isolated, and durable (ACID) transactions, which ensures that data is consistent and accurate. This means that multiple users can read and write to the same data lake without conflicts or data inconsistencies.
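
For example, an upsert into a Delta table with MERGE runs as a single atomic transaction: either every row is applied or none are, and concurrent readers never see a half-written result. The table names below are hypothetical.

```python
# Upsert new and changed records into a Delta table as one ACID transaction.
spark.sql("""
    MERGE INTO customers AS target
    USING customer_updates AS source
      ON target.customer_id = source.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")
```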

Schema enforcement: Delta Lake enforces schema validation, which ensures that data adheres to a predefined schema. This helps prevent data quality issues and makes it easier to work with data across multiple teams and applications.
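
For instance, appending a DataFrame whose columns do not match the table's schema fails by default, and you must opt in explicitly to evolve the schema. The table and column names here are illustrative.

```python
# Delta Lake rejects writes that don't match the target table's schema.
new_rows = spark.createDataFrame(
    [(1, "click", "2023-06-01")],
    ["user_id", "event_type", "event_date"],
)

# If the "events" table lacks an "event_date" column, this append raises an
# AnalysisException instead of silently corrupting the table.
new_rows.write.format("delta").mode("append").saveAsTable("events")

# To deliberately evolve the schema, opt in with mergeSchema:
(new_rows.write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .saveAsTable("events"))
```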

Time travel: Delta Lake provides time travel capabilities, which allow users to access and revert to earlier versions of data. This makes it easy to recover from data corruption or accidental data changes. You can configure how far back you want to be able to time travel in just a few lines of code.

Code snippet to change log retention configurations
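
Since the original snippet appeared as an image, here is a rough equivalent: a sketch that sets the retention properties on a hypothetical "events" table and then queries an earlier state of it; the retention intervals, version number, and timestamp are example values.

```python
# Keep the Delta transaction log (and therefore time travel history) for 30 days.
spark.sql("""
    ALTER TABLE events SET TBLPROPERTIES (
      'delta.logRetentionDuration' = 'interval 30 days',
      'delta.deletedFileRetentionDuration' = 'interval 30 days'
    )
""")

# Query an earlier state of the table by version or by timestamp.
v5 = spark.sql("SELECT * FROM events VERSION AS OF 5")
yesterday = spark.sql("SELECT * FROM events TIMESTAMP AS OF '2023-06-11'")
```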

Data versioning: Delta Lake provides data versioning, which allows users to track changes to data over time. This makes it easy to audit data changes and track data lineage.

Data reliability: Delta Lake provides automatic data compaction and optimization, which ensures that data is reliable and consistent over time. This reduces the risk of data corruption and makes it easier to manage and maintain large data lakes.
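
Compaction and clean-up can also be triggered explicitly when needed; a small sketch, with "events" standing in for your own table name.

```python
# Compact many small files into fewer, larger ones for faster reads.
spark.sql("OPTIMIZE events")

# Remove data files no longer referenced by the table, keeping 7 days of history.
spark.sql("VACUUM events RETAIN 168 HOURS")
```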

Compact Data Storage: Delta Lake stores data in the Parquet file format, which is a columnar format. This means all values of a particular column are stored together, regardless of which rows they belong to. As a result, querying a specific column is much faster than querying an entire row, because the system only needs to read the values of the desired column rather than scanning through all columns of every row. This is particularly useful in data engineering and business intelligence, as data is stored and retrieved more efficiently. Columnar storage can also be more efficient for compression and encoding.

For your real-time streaming needs, the Delta Lake provides another useful feature: Delta Live Tables. This feature provides a real-time data processing layer on top of Delta Lake, allowing you to ingest and process streaming data in real-time. This enables users to create and manage continuous queries that process streaming data and write the results back to Delta Lake in real-time. This feature uses Apache Spark Streaming and Structured Streaming under the hood to provide scalable, fault-tolerant streaming.
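
A Delta Live Tables pipeline is declared in Python (or SQL) with simple decorators. The following is a minimal sketch that assumes a cloud storage location of JSON events; the source path and table names are hypothetical.

```python
import dlt
from pyspark.sql.functions import col

@dlt.table(comment="Raw events ingested as a stream")
def events_bronze():
    # Auto Loader incrementally picks up new files from cloud storage.
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("s3://my-bucket/raw/events/")   # hypothetical source location
    )

@dlt.table(comment="Cleaned events, continuously updated")
def events_silver():
    return dlt.read_stream("events_bronze").where(col("event_type").isNotNull())
```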

How is Databricks different from other tools out there?

Uses Apache Spark: Databricks is built on Spark, which was created specifically for processing large datasets and is optimized for interactive and iterative processing, whereas many other analytics tools are designed for more traditional SQL-based data processing.

Cloud-native platform: Databricks is a cloud-native platform, which means that it is designed to work seamlessly with cloud infrastructure and services. This makes it easy to deploy and scale, and it also provides a range of cloud-specific features, such as autoscaling, elastic compute, and serverless computing.

User-friendly and Intuitive Interface: Unlike more traditional analytics tools like Amazon Redshift and Azure Synapse Analytics, which follow a more conventional database model focused on SQL-based querying and data warehousing, Databricks has a more user-friendly interface that makes it easier to work with data, along with a more comprehensive set of features for data processing and analytics.

Cost: While other warehousing and analytics tools charge based on storage and processing capacity, Databricks charges for the compute resources used. However, this does not guarantee that Databricks will be the cheaper option.

Data Processing: Databricks can handle both batch and streaming data processing, sometimes through the same API, unlike more specialised analytics tools such as Redshift, Azure SQL Data Warehouse/Synapse Analytics, or Amazon Kinesis Data Analytics, which cater primarily to either batch or streaming workloads.
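
For example, reading the same Delta table as a batch snapshot or as a stream differs by little more than a method call; a sketch, with hypothetical table names and checkpoint path.

```python
# Batch: read the current snapshot of a Delta table.
batch_counts = spark.read.table("events").groupBy("event_date").count()

# Streaming: treat the same table as an unbounded source and keep the
# aggregate continuously updated in another Delta table.
(spark.readStream.table("events")
    .groupBy("event_type").count()
    .writeStream
    .outputMode("complete")
    .option("checkpointLocation", "/tmp/checkpoints/event_counts")  # illustrative path
    .toTable("event_counts"))
```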

But like any tool, Databricks does have some limitations that you need to consider:

Cost: Databricks can be expensive, especially for small organizations or startups with limited budgets. Pricing is based on usage, and costs can quickly add up if you are processing large volumes of data without considering optimizations.

Learning Curve: Databricks has a steep learning curve, particularly for individuals who are not familiar with Apache Spark and programmatically defining data pipelines. While Databricks provides many resources and tutorials, it can still take time to learn the platform. But if you are interested in learning, you can find a whole lot of help on their website.

Dependence on Cloud Infrastructure: Databricks is a cloud-based platform, which means it depends on cloud infrastructure providers such as AWS, Azure, or Google Cloud. If these providers experience outages or other issues, Databricks may be impacted. This was evident in 2023 when a brief Azure outage caused Databricks workspaces running on Azure to be impacted as well.

Data Security: While Databricks provides robust security features, organizations with sensitive data may prefer to keep their data in-house rather than using a cloud-based platform like Databricks.

Integration Limitations: While Databricks integrates with many popular data tools and services, there may be limitations or challenges in integrating with certain tools or services, especially those that are not cloud-based.

Regardless of whether you decide to follow them or not, as with any engineering tool, the Databricks community has come up with a few best practices for using it:

  1. Use a separate workspace for each environment: To ensure better organization and governance, create separate workspaces for different environments such as development, testing, UAT, and production.
  2. Use version control: Use a version control system to track changes made to your code and notebooks. This helps in maintaining a history of changes and enables rollbacks if necessary.
  3. Optimize cluster configuration: Always optimize your cluster configuration to balance performance and cost. Start with small clusters and increase the size only when required. It is also important to make sure to use appropriate instance types based on the workload requirements. And remember: based on your task, the configuration that best suits you will vary. So be on the lookout for that!
  4. Optimize notebook execution: Use best practices when creating notebooks and workflows to optimize execution for your use case. This includes using the correct Spark configurations, job-level configurations, avoiding nested loops, and minimizing data shuffling.
  5. Use the Delta Lake feature: It really is a powerful tool that helps you manage large-scale data pipelines. By providing ACID compliance on top of Parquet files, Delta Lake helps bridge the modern data lake paradigm and the traditional data warehouse, bringing out the best of both.
  6. Monitor resource utilization: Monitor resource utilization to ensure that your clusters are not over or underutilized. This can help you optimize the cluster configuration and save costs.
  7. Use automation: Use automation tools like Azure DevOps or other CI/CD pipelines to streamline the deployment process. This helps to reduce the time required for manual deployment and reduces the risk of errors.
  8. Enable security features: Enable security features like network isolation, authentication, role-based access control, encryption, and authorization. This would ensure the safety of your data and applications. And do not forget to use features like Key Vaults and Secret Scopes to securely store confidential information such as client secrets and tokens.
  9. Avoid mounting Storage Accounts: Databricks recommends moving away from mounting your storage accounts onto DBFS. While mounting is a useful feature, it comes with its own performance and stability concerns. To access external data sources, users can instead create session-scoped connections whose credentials come from secret scopes backed by providers such as Azure Key Vault or AWS Parameter Store (see the sketch after this list).
  10. Z-Ordering: Columns with high cardinality (i.e., those that have a large number of distinct values, like an ID) are not recommended for partitioning because of I/O overhead, but if you expect them to be commonly used in query predicates, use Z-ORDER BY instead (also shown below). Delta Lake automatically lays out the data in the files based on the column values and uses the layout information to skip irrelevant data while querying.
  11. Modularise functions: Modularise functions into libraries that you import, and treat the notebook like a main file. This helps with code reusability, simplified debugging, and easier maintainability, and reduces time spent on development and maintenance.
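
To make points 9 and 10 concrete, here is a minimal sketch assuming an Azure storage account named "mydatalake", a "raw" container, a secret scope named "kv-scope" backed by Azure Key Vault, and a Delta table "events" with a high-cardinality "user_id" column; all of these names are placeholders.

```python
# Best practice 9: session-scoped access to external storage instead of mounting.
storage_account = "mydatalake"  # hypothetical storage account name
spark.conf.set(
    f"fs.azure.account.key.{storage_account}.dfs.core.windows.net",
    dbutils.secrets.get(scope="kv-scope", key="storage-account-key"),
)
df = spark.read.format("delta").load(
    f"abfss://raw@{storage_account}.dfs.core.windows.net/events"
)

# Best practice 10: Z-order a Delta table on a high-cardinality predicate column.
spark.sql("OPTIMIZE events ZORDER BY (user_id)")
```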

Databricks offers its users three pricing tiers to choose from based on their requirements:

  • Community Edition: This is a free version of Databricks that offers a very limited set of features and is primarily for small-scale use cases. It includes a single workspace with shared resources, limited compute and storage resources, and just basic support.
  • Standard: This is for small to medium-sized enterprises and offers more advanced features and capabilities than the Community Edition. It includes advanced security features, unlimited workspaces, enhanced compute and storage resources, and technical support. The pricing of compute resources varies based on the cloud provider.
  • Enterprise: This is primarily for large-scale enterprises and offers the most advanced features and capabilities of the Databricks platform. It includes several advanced security and compliance features, premium technical support, and dedicated compute and storage resources.

If you think Databricks could be a useful tool for your organization, or even just for yourself, you can head over to the Databricks website and sign up for the Community Edition to evaluate it. You will also find many useful tutorials there to help you get started.

