Enterprise Data Platform @ Compass

Henry Xu

Published in Compass True North

May 10, 2023

Why Compass chose Databricks to build its modern data platform

Introduction

Founded in 2012, Compass was ranked the #1 real estate brokerage by volume in the United States in 2022. The technology-enabled brokerage provides an end-to-end platform that pairs the industry’s top talent with technology to deliver exceptional service to seller and buyer clients.

At Compass, data is the foundation for business decisions and for the product features that drive business growth. In 2019, a data platform team was formed with a vision to build a comprehensive platform that democratizes information and knowledge, enabling data-driven decisions across the company. The team focused on building infrastructure and tools to efficiently ingest, transform, and store large-scale data sets. It also developed tools and processes to ensure high-quality data and foster data collaboration on the platform.

Challenges and Architectural Decisions for the Data Platform at Compass

In this section, we outline the challenges the data platform team faced and the architectural decisions it made to address them.

When Compass formed the data platform team, the team encountered the following challenges:

Multiple overlapping analytics systems: Compass had multiple overlapping analytics systems for different use cases. Duplicate data co-existed on several storage systems, such as S3, Redshift, and Snowflake, and maintaining data consistency was a pain. The team believed building a centralized data infrastructure would be in the best interest of Compass, and this led to the creation of the Enterprise Data Platform at Compass.
Ad hoc operational reports: Multiple operational reports were maintained in ad hoc ways using Google Sheets, Looker, or Tableau. The manual process was inefficient and created tribal knowledge, leading to errors. A high-quality reporting system was a critical desired outcome for the leadership, product, and business operation teams.
Multiple machine learning infrastructures: Teams had built multiple machine learning projects on top of different ML infrastructures. Infrastructure setup and feature engineering took up half of the effort on those projects. A unified data & AI platform was needed to simplify AI feature development.
Scaling bottleneck: As the volume of data snowballed, a legacy system built several years earlier started to hit scaling bottlenecks. The AI team's feature engineering queries often competed with the business intelligence team's reporting queries for limited compute resources in a warehouse that lacked workload isolation. For instance, a user activity report took two hours to load, and half the time the underlying query failed.

For these reasons and others, Compass’s data & analytics vision required a comprehensive data platform to support end-to-end data engineering, data analysis, business intelligence, and data science use cases. We established the following criteria for the next-generation data platform infrastructure:

Running AI + BI + DI use cases on one platform: The next-generation data platform infrastructure must support running artificial intelligence (AI), business intelligence (BI), and data intelligence (DI) use cases on one platform. By eliminating the need to duplicate data across multiple systems, Compass can keep its data architecture simple and efficient.
Scalability, reliability, and security: The next-generation data infrastructure must be scalable, reliable, and secure. The platform must handle the growing volume of analytics and near-real-time streaming data. Additionally, it must support scalable computation loads with reliability, QoS, and security features that keep the platform healthy.
Platform acceleration & expertise build-up: The next-generation data platform must be implemented with a five-year technology vision, using state-of-the-art, future-oriented technologies and architecture. The goal was to build the platform within a year and mature it rapidly. Along the way, it was vital to train a team of experts responsible for promoting guardrails, best practices, and data platform knowledge across the company.

Among various data platform options, we chose Databricks Lakehouse as the foundation to implement the Compass Enterprise data platform for the following reasons:

The Databricks Lakehouse architecture allowed Compass to store and manage our analytics data on one platform that could scale for future growth. In addition, the architecture would create one environment where AI, BI, and DI teams can collaborate without creating data silos.
The Databricks SQL warehouse, backed by Databricks’ proprietary Photon engine, offered a warehouse that separates storage from compute and delivers competitive performance. Managed MLflow on Databricks offered a standard way to streamline end-to-end ML pipelines, with data easily accessible on the same platform.
Databricks was (and is) built on top of Apache Spark, one of the world’s most reputable big data technologies. The technology depth and innovation strategy of Databricks offered Compass a forward-thinking data infrastructure built for scalability, reliability, and security.
Features such as data lake transactions and optimization and Unity Catalog could significantly accelerate the maturity of the Compass data platform and the data team, in line with an architectural vision for the next five years.

In summary, Databricks offered a reliable cloud data infrastructure that could help Compass accelerate the strategy of its enterprise data platform.

The Architecture of the Enterprise Data Platform at Compass

This section describes the current architecture of the enterprise data platform at Compass. It is worth noting that the platform is continuously evolving.

Overall Architecture

As depicted in the following diagram, the Compass data platform is built on top of Databricks. This lets us reliably deploy workloads at scale across four areas: a data engineering component, the Compass data lake, SQL warehouses, and machine learning apps. All of them are guarded by our governance framework.

The data engineering component is responsible for ingesting and transforming data from various sources, including landing S3 buckets, vendor web APIs, Kafka streaming inputs, and Compass operational data stores. This component uses Databricks Auto Loader and Spark jobs to process the data.
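To make the ingestion step more concrete, here is a minimal sketch of how files in a landing bucket could be loaded incrementally with Auto Loader. It is not the Compass pipeline: the function names, the JSON source format, and all paths and table names passed in are hypothetical, and `spark` is assumed to be the SparkSession that a Databricks notebook or job provides.

```python
from typing import Dict


def autoloader_options(source_format: str, schema_location: str) -> Dict[str, str]:
    """Build the option map for an Auto Loader (cloudFiles) stream."""
    return {
        # File format of the incoming files (assumed JSON in the example below).
        "cloudFiles.format": source_format,
        # Where Auto Loader persists the inferred schema between runs.
        "cloudFiles.schemaLocation": schema_location,
    }


def ingest_to_bronze(spark, landing_path: str, bronze_table: str, state_path: str):
    """Load files that arrived since the last run into a Bronze Delta table.

    All arguments are illustrative; `spark` is the session supplied by Databricks.
    """
    options = autoloader_options("json", f"{state_path}/schema")
    stream = (
        spark.readStream
        .format("cloudFiles")  # "cloudFiles" is the Auto Loader source
        .options(**options)
        .load(landing_path)
    )
    return (
        stream.writeStream
        # The checkpoint records which files were already processed.
        .option("checkpointLocation", f"{state_path}/checkpoint")
        # Process everything currently available, then stop (batch-style run).
        .trigger(availableNow=True)
        .toTable(bronze_table)
    )
```

On a cluster, a call could look like `ingest_to_bronze(spark, "s3://<landing-bucket>/events/", "<catalog>.bronze.events", "s3://<state-bucket>/events")`, with the placeholders replaced by real locations. The `availableNow` trigger suits scheduled jobs; a continuously running stream would use a different trigger.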

We use Databricks SQL to serve BI dashboard queries. Unity Catalog ensures that data access and permissions are applied consistently across all workspaces for all platform users.

Data science teams use MLflow as the machine learning infrastructure, Databricks notebooks for development, and Databricks jobs to run production workloads.

The governance framework includes a data quality framework, security and compliance, data classification and lifecycle management, discovery & lineage, and documentation backed by Unity Catalog. This component is still under development.

The business intelligence team chose Tableau as the reporting and data visualization tool because Tableau offered rich enterprise reporting features, security, and ease of use.

Compass Data Lake

The Compass data lake is the core storage component of the data platform. It is built on Delta Lake, which provides essential features that the data engineering team depends on, such as table-level ACID transactions, optimization, and time travel.

The data lake mainly comprises storage, metadata, and the data lake structure.

Storage. All Compass Data Lake tables are external tables in Databricks, and the actual data is stored in an S3 bucket in the Compass AWS account via the Delta Lake APIs.

Metadata. Spark needs table metadata stored in a metastore so that it can locate the data when a user runs a query. Compass adopted Databricks Unity Catalog because of its simplicity and security, which allows us to manage permissions across all workspaces at the account level.

Structure. We structure data in the Compass data lake as databases and tables. Adopting the medallion architecture, we organize the data into Bronze, Silver, and Gold layers. This structure allows us to implement features such as a data quality runner and fine-grained access control.

Organizing the Compass data lake in the medallion architecture allows the data engineering team to organize data, build pipelines, and perform data quality checks in a structured way.
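As an illustration of how a layered structure can support fine-grained access, the sketch below builds three-level Unity Catalog table names and the `GRANT` statements that would give a group read access to a single table. It assumes, purely for the example, that each medallion layer maps to one schema; the catalog, table, and group names are hypothetical and do not describe the actual Compass layout.

```python
from typing import List

# Assumption for this sketch: one schema per medallion layer.
LAYERS = ("bronze", "silver", "gold")


def table_name(catalog: str, layer: str, table: str) -> str:
    """Return the three-level name catalog.schema.table for a layer."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    return f"{catalog}.{layer}.{table}"


def read_grants(catalog: str, layer: str, table: str, group: str) -> List[str]:
    """Build the SQL statements that let `group` read exactly one table.

    Unity Catalog requires USE CATALOG and USE SCHEMA on the parents
    in addition to SELECT on the table itself.
    """
    full_name = table_name(catalog, layer, table)
    return [
        f"GRANT USE CATALOG ON CATALOG {catalog} TO `{group}`",
        f"GRANT USE SCHEMA ON SCHEMA {catalog}.{layer} TO `{group}`",
        f"GRANT SELECT ON TABLE {full_name} TO `{group}`",
    ]


if __name__ == "__main__":
    # Hypothetical names: BI analysts may read one Gold table and nothing else.
    for statement in read_grants("demo_catalog", "gold", "listings_daily", "bi-analysts"):
        print(statement)
```

Each generated statement could then be executed from a notebook or a SQL warehouse. Granting at the table level keeps Bronze and Silver data out of reach of consumers who only need curated Gold tables.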

Lessons Learned

Initially, we used AWS instance profiles as the data access management mechanism, which complicated the implementation of security and compliance requirements. After collaborating with the Databricks technology leadership, we adopted Unity Catalog for metadata and access management as a long-term solution.

Another lesson we learned was to implement combined Databricks & AWS cost metrics before onboarding a large number of users to the platform. Our engineers built the combined cost metrics and used them to drive cost reduction promptly. For example, we discouraged running jobs on all-purpose compute clusters, because all-purpose compute was 5.7 times more expensive than job clusters. Additionally, Databricks classic SQL was less efficient than serverless SQL for low-utilization usage patterns, because classic SQL had a 5-minute cold start time, compared to the 5-second start time we observed for serverless SQL.
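The idea of a combined cost metric can be sketched as the Databricks charge (DBUs consumed times the DBU rate) plus the AWS charge for the underlying instances. The snippet below is only an illustration of that arithmetic: the class, the field names, and every rate in the example are placeholders, not real pricing and not the metrics Compass built.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Usage:
    """Usage of one workload over a billing window (illustrative fields)."""
    dbus: float               # Databricks units consumed
    dbu_rate_usd: float       # price per DBU for the compute type used
    instance_hours: float     # AWS instance hours behind the cluster
    instance_rate_usd: float  # AWS price per instance hour


def combined_cost(usage: Usage) -> float:
    """Databricks charge plus AWS charge for the same workload."""
    databricks_cost = usage.dbus * usage.dbu_rate_usd
    aws_cost = usage.instance_hours * usage.instance_rate_usd
    return databricks_cost + aws_cost


def cost_ratio(a: Usage, b: Usage) -> float:
    """How many times the combined cost of `a` is that of `b`."""
    baseline = combined_cost(b)
    if baseline == 0:
        raise ValueError("baseline workload has zero cost")
    return combined_cost(a) / baseline


if __name__ == "__main__":
    # Placeholder rates: same job, billed once at a job-cluster DBU rate
    # and once at a higher all-purpose DBU rate.
    on_job_cluster = Usage(dbus=100, dbu_rate_usd=0.25, instance_hours=50, instance_rate_usd=0.5)
    on_all_purpose = Usage(dbus=100, dbu_rate_usd=0.5, instance_hours=50, instance_rate_usd=0.5)
    print(cost_ratio(on_all_purpose, on_job_cluster))
```

Tracking both components together matters because a change that lowers the Databricks charge can raise the AWS charge, or the reverse; only the sum shows whether a change actually saved money.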

We also found it valuable to have a centralized data infrastructure engineering team focus on standardizing guardrails, best practices, cost management, security and privacy, and knowledge build-up. The benefits greatly outweighed the cost of cross-team collaboration for infrastructure support.

Conclusion

By leveraging the Databricks technologies, the Compass data platform team built a modern data platform that largely addressed the pain points the team was formed to solve. Eventually, the overlapping data platforms will be deprecated, freeing up engineering and infrastructure resources from duplicate work. Today, the platform has become the go-to place for data and machine learning needs across the entire Compass organization. The journey towards a unified Enterprise Data Platform is now underway, with limitless potential to harness the power of data as a competitive advantage for Compass.

In subsequent blog posts, we will describe how we standardized data ingestion, implemented data transformation, ensured data quality, and enabled machine learning use cases.