How Akamai Leverages Databricks Unity Catalog For Distributed Data Governance

Gilad Asulin
Jun 20, 2023


This article was authored by Gilad Asulin, Pulkit Chadha, and Katy Mufler.

Introduction

Data governance plays a crucial role in modern organizations, ensuring data integrity, security, and compliance. However, with the increasing volume and complexity of data, traditional centralized approaches to data governance are proving to be inadequate. This blog post explores how Akamai, the leading content delivery network provider, leveraged Databricks Unity Catalog to distribute and collaborate on data governance. We will delve into the challenges faced by Akamai, the benefits of a distributed approach, and how Databricks Unity Catalog facilitated their data governance journey.

Akamai was founded in 1998 as a content delivery network (CDN) and is the leading solution in the market, with more than 4,000 locations in roughly 800 cities across 134 countries. Roughly 30% of internet traffic is served through Akamai’s network today. In addition to delivering content, Akamai provides security solutions that protect websites, applications, and APIs from cyber threats, including DDoS (Distributed Denial of Service) attacks, web application attacks, and data breaches. In the security applications division, Akamai analyzes more than 650 TB of data per day, containing roughly 120 billion suspicious events. In 2022, Akamai entered the cloud computing market with its acquisition of Linode.

Akamai has been using the Databricks Lakehouse Platform since 2020. Since then, we have processed more than 50 exabytes of data in Delta Lake on Databricks, and we now run 80 workspaces used by hundreds of users. As Akamai expanded its data infrastructure over the years, it faced challenges in ensuring consistent data governance practices across teams and departments. The traditional centralized data governance model hindered collaboration, delayed decision-making, and limited the scalability of data governance efforts.

Overcoming Data Governance Hurdles: Akamai’s Success with Unity Catalog

Streamlining Data Sharing between Databricks Workspaces

Prior to Unity Catalog, Akamai AppSec teams leveraged Hive Metastore and encountered significant obstacles when it came to sharing managed tables across Databricks workspaces. With Hive, there was no built-in mechanism for seamless data sharing. As a result, they had to resort to creating external tables for each table that needed to be shared. This approach led to code duplication and increased maintenance efforts. For example, if there were five workspaces that needed access to a common table, they had to create five mount points, resulting in redundancy and complexity. Furthermore, relying on mount points for data sharing proved to be insecure and unreliable, as they were prone to errors and inconsistencies.

To illustrate the challenge, consider a scenario where Akamai’s data science team and analytics team both needed access to a centralized customer dataset. Without Unity Catalog, they would have had to create separate external tables for the same dataset in each workspace. This not only resulted in redundant code but also made it challenging to ensure data consistency across the workspaces. Unity Catalog’s introduction resolved this issue by providing a unified metadata catalog, eliminating the need for duplicate table creation and simplifying data sharing.
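To make the contrast concrete, here is a minimal sketch that generates the SQL each approach would require. Everything named in it is hypothetical and for illustration only: the workspace names, the `/mnt/...` mount paths, and the `main.sales.customers` table are assumptions, not Akamai's actual objects. With a separate Hive Metastore per workspace, each workspace needs its own external-table definition over a mount point; with Unity Catalog, workspaces attached to the same metastore refer to the table by one three-level name.

```python
# Hypothetical names throughout; none of these come from Akamai's environment.
WORKSPACES = ["data-science", "analytics", "marketing", "finance", "engineering"]


def hive_external_table_ddl(database: str, table: str, mount_path: str) -> str:
    """DDL that had to be run separately in each workspace's Hive Metastore."""
    return (
        f"CREATE TABLE IF NOT EXISTS {database}.{table} "
        f"USING DELTA LOCATION '{mount_path}/{database}/{table}'"
    )


def unity_catalog_name(catalog: str, schema: str, table: str) -> str:
    """Three-level name that resolves the same way in every attached workspace."""
    return f"{catalog}.{schema}.{table}"


# Before: one mount point plus one external-table definition per workspace.
per_workspace_ddl = {
    ws: hive_external_table_ddl("sales", "customers", f"/mnt/{ws}-shared")
    for ws in WORKSPACES
}

# After: a single table definition, queried by the same name from any workspace.
shared_query = f"SELECT * FROM {unity_catalog_name('main', 'sales', 'customers')}"

print(len(per_workspace_ddl), "definitions to maintain before")
print(shared_query)
```

The point of the sketch is the count: the number of definitions to keep in sync grows with the number of workspaces in the first approach and stays at one in the second.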

Implementing Access Isolation Through Least Privilege

Akamai also faced difficulties in enforcing access isolation based on the principle of least privilege within Databricks workspaces. Initially, there was no straightforward method to restrict user permissions at the workspace level. Although creating separate workspaces seemed like a potential solution, it introduced its own set of challenges, particularly regarding data sharing. For example, if the marketing team required access to certain tables for analysis purposes, but those tables were located in a workspace dedicated to the data science team, collaboration became cumbersome. Restricting access at the workspace level limited cross-functional collaboration and hindered productivity.

To elaborate, imagine a scenario where Akamai’s finance team needed access to specific financial datasets, while the engineering team required access to different datasets for performance analysis. Without adequate access isolation, the finance team might inadvertently gain access to engineering-specific datasets, potentially compromising data security and confidentiality. With Unity Catalog, Akamai gained the ability to define fine-grained access controls at the table and column level, ensuring that users only had access to the data necessary for their specific roles. This enabled more precise access isolation, reducing the risk of unauthorized data access and improving overall data security.
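As an illustration of what such controls can look like, the sketch below builds Unity Catalog SQL for a least-privilege setup. It is not Akamai's configuration: the group names, the `main.finance.invoices` table, and the `amount` column are hypothetical. The first helper emits the minimum grants a group needs to read one table (`USE CATALOG`, `USE SCHEMA`, `SELECT`). The second shows one common way to restrict a column, a dynamic view using the `is_account_group_member` function; other mechanisms exist, and the source does not say which one Akamai chose.

```python
from typing import List


def least_privilege_grants(catalog: str, schema: str, table: str, group: str) -> List[str]:
    """Minimum grants for `group` to read exactly one table and nothing else."""
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{group}`",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{schema} TO `{group}`",
        f"GRANT SELECT ON TABLE {catalog}.{schema}.{table} TO `{group}`",
    ]


def masked_view_ddl(view: str, source: str, visible_columns: List[str],
                    sensitive_column: str, allowed_group: str) -> str:
    """Dynamic view that returns NULL for one column unless the caller is in a group."""
    cols = ", ".join(visible_columns)
    return (
        f"CREATE OR REPLACE VIEW {view} AS SELECT {cols}, "
        f"CASE WHEN is_account_group_member('{allowed_group}') "
        f"THEN {sensitive_column} ELSE NULL END AS {sensitive_column} "
        f"FROM {source}"
    )


# Hypothetical example: finance analysts read invoices; everyone else who can
# query the view sees the amount column as NULL.
for statement in least_privilege_grants("main", "finance", "invoices", "finance-analysts"):
    print(statement)

print(masked_view_ddl(
    view="main.finance.invoices_masked",
    source="main.finance.invoices",
    visible_columns=["invoice_id", "customer_id"],
    sensitive_column="amount",
    allowed_group="finance-analysts",
))
```

Granting to groups rather than individual users keeps the statements stable as people join or leave a team.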

Improving User Management and Governance

Finally, Akamai faced challenges with user management across its more than 80 Databricks workspaces, each of which managed users independently. This resulted in duplicate user management efforts and fragmented governance processes. Managing user access control became complex and time-consuming, particularly as the organization scaled and the number of users increased. In addition, the lack of centralized identity and access management made it difficult to enforce consistent security policies and ensure compliance with regulatory requirements.

To illustrate the issue, consider a scenario where Akamai had to onboard a new employee. With the standalone user management system, they had to manually create a new user account within each workspace, configure the appropriate permissions, and manage user access separately in each environment. This not only consumed valuable administrative resources but also increased the risk of human error and inconsistencies in user provisioning. By integrating Databricks with Azure Active Directory (AAD) and using Unity Catalog’s central user management, Akamai simplified user management and governance. User provisioning, role assignments, and access control could be managed centrally through AAD for all Databricks workspaces, reducing administrative overhead and providing a unified and consistent approach to user management across the organization.

Key Benefits Realized with Unity Catalog Adoption

Distributing data governance through Databricks Unity Catalog transformed Akamai’s approach to managing and governing data. Akamai has successfully migrated over 100 tables, including several large ones ranging from 40 to 65 terabytes each. With Unity Catalog, we are now managing and governing over 6 petabytes of data with fine-grained access controls on rows and columns. While it is challenging to quantify the exact savings achieved, it is evident that the migration has significantly reduced the need for duplicate work across various teams at Akamai.

Previously, Akamai’s lakehouse lacked visibility, making it difficult to track and measure usage metrics and user activity. With Unity Catalog, we now have a comprehensive view of the lakehouse through data lineage and rich system tables for auditing. This increased visibility lets us monitor and analyze those metrics and activities, providing valuable insight into our data operations.
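For readers who want a feel for this kind of auditing, here is a small sketch that builds a query summarizing recent Unity Catalog activity per user. It assumes the Databricks audit system table is available as `system.access.audit` with the columns `user_identity.email`, `service_name`, `action_name`, and `event_date`; treat those names as assumptions to verify in your own workspace, since system tables must be enabled and their schema may change. The helper name and defaults are hypothetical, and this is not a query taken from Akamai.

```python
def audit_activity_query(days: int = 7, limit: int = 20) -> str:
    """Build a SQL query counting Unity Catalog audit events per user and action.

    Assumes the audit system table lives at system.access.audit (verify locally).
    """
    if days <= 0 or limit <= 0:
        raise ValueError("days and limit must be positive integers")
    return (
        "SELECT user_identity.email AS user_email, action_name, COUNT(*) AS events "
        "FROM system.access.audit "
        "WHERE service_name = 'unityCatalog' "
        f"AND event_date >= date_sub(current_date(), {int(days)}) "
        "GROUP BY user_identity.email, action_name "
        "ORDER BY events DESC "
        f"LIMIT {int(limit)}"
    )


# The resulting string could be passed to spark.sql(...) in a notebook.
print(audit_activity_query(days=30, limit=10))
```

Keeping the query in a function makes the look-back window explicit and easy to change when the same report is reused by different teams.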

By adopting a collaborative model, Akamai addressed the limitations of centralized data governance, unlocking new opportunities for growth, innovation, and data-driven decision-making.

As organizations continue to navigate the complexities of managing data at scale, a distributed approach empowered by tools like Databricks Unity Catalog offers a compelling solution to achieve effective data governance.

Akamai has significantly improved the efficiency, security, and collaboration capabilities of its Databricks deployment. The introduction of Unity Catalog, alongside enhanced data sharing, access isolation, and integrated user management, has brought about several benefits.

Streamlined Collaboration: With Unity Catalog, Akamai can now seamlessly share managed tables across Databricks workspaces without the need for code duplication or insecure mount points. This fosters collaboration among teams by providing a unified and consistent view of data across different workspaces. The data science team, analytics team, and other stakeholders can access and analyze shared datasets without the limitations and complexities that existed previously.

Enhanced Data Security: The implementation of access isolation based on the principle of least privilege ensures that users have access only to the data they require for their specific roles. Unity Catalog’s fine-grained access controls at the table and column level enable precise control over data access, reducing the risk of unauthorized data exposure and ensuring data security. This is particularly crucial for sensitive data, such as financial information or personally identifiable information (PII), which can now be protected with greater precision.

Simplified User Management and Governance: By integrating Databricks with Azure Active Directory (AAD) alongside Unity Catalog, Akamai has achieved centralized user management and governance. User provisioning, role assignments, and access control can now be managed consistently through AAD, reducing administrative overhead and minimizing the risk of errors and inconsistencies. This centralized approach to user management improves efficiency, enables better auditability, and ensures compliance with regulatory requirements.

Scalability and Future-Proofing: The implementation of Unity Catalog equips Akamai with a scalable foundation for their Databricks deployment. As the organization grows and the data landscape evolves, Unity Catalog provides a flexible and extensible framework to accommodate future requirements. It empowers Akamai to adapt and scale their data infrastructure without compromising security, collaboration, or user management.

There is More…

If you want to learn more about Akamai’s journey in adopting Databricks Unity Catalog, and how the team at Akamai implemented and rolled it out, join the session that Pulkit Chadha and I are presenting, “Distributing Data Governance: How Unity Catalog Allows for a Collaborative Approach,” at Data and AI Summit 2023 in San Francisco, June 26-29.

Additional Sessions at DAIS 2023

Want to learn more about Akamai’s use of Databricks and Apache Spark? Here are other Data and AI Summit Sessions you can tune into:

Visit the full Data and AI Sessions Catalog here
