Introducing Google Workspace DLP: How Compass scales security data leak prevention automation

Ashley Graves
Published in Compass True North, Aug 17, 2022

Save money while you save data

Intro

Compass, the largest real estate brokerage in the US in terms of closed sales volume as of Mar 25, 2022, is in a unique and challenging position due to the nature of our business: we facilitate a large volume of high-value, time-sensitive real estate transactions. As a result, we have strict requirements for detector accuracy, user transparency, and supported file types. Because Google Workspace is one of our core collaboration tools, we aimed to build advanced security features on top of its existing controls.

Compass’ Google Drive Data Loss Prevention (DLP) is an internally developed tool that prevents overly permissive sharing of sensitive data by allowing administrators to define automated remediation actions when documents are insecurely shared. This tool was built to support our unique DLP requirements while utilizing native Google Workspace features, allowing us to apply strict controls to mitigate risk, decrease operating costs, and prevent interruption of ongoing transactions. Supported actions include administrative notifications, end-user notifications to the document owner, and automatic revocation of permissions. Today we are open sourcing it on GitHub for other teams to use as well.

Build vs Buy Decision

When looking for a DLP solution, we evaluated solutions based on:

  • Detector quality and granularity: A broad range of pre-defined detectors with support for increased confidence (using detection thresholds, contextual clues, or mathematical verification like the Luhn checksum).
  • Audience Scoping: Ability to apply different policies per Google Workspace Organizational Unit (OU).
  • Flexible Policies: Ability to apply different policies based on sharing settings: public, domain, searchable, and target audiences. Many products focus only on external sharing, whereas we also aim to mitigate insider risk and the blast radius of potential attacks by limiting internal over-sharing as well.
  • Flexible Response: Ability to notify users, notify admins, and revoke the violating permission.
  • Customized Messaging: User notifications must clearly describe what changed, who changed it, which file was affected, which permission was removed, what policy was violated, and how they can get help.
  • OCR: Since many real estate transaction documents include scanned documents, OCR support is required in addition to traditional text analysis.
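
To make the Customized Messaging requirement concrete, here is a minimal sketch of a user notification that covers each item listed above. The template text, the field names, and `render_notice` are illustrative assumptions, not the wording or code used by the Compass tool.

```python
from string import Template

# Hypothetical notification template; every field name here is an assumption.
NOTICE_TEMPLATE = Template(
    'Subject: Sharing changed on "$file_name"\n'
    "\n"
    'What changed: the "$permission" permission was removed.\n'
    "Changed by: $changed_by\n"
    "File: $file_name ($file_url)\n"
    "Policy violated: $policy\n"
    "Need help? Contact $help_contact.\n"
)


def render_notice(**fields: str) -> str:
    """Fill in the template with the details of one remediation."""
    # substitute() raises KeyError when a field is missing, so an
    # incomplete notice fails loudly instead of being sent.
    return NOTICE_TEMPLATE.substitute(**fields)


if __name__ == "__main__":
    print(render_notice(
        file_name="Closing docs.pdf",
        file_url="https://example.com/f/1",
        permission="Anyone with the link",
        changed_by="Drive DLP automation",
        policy="No public sharing of SSNs",
        help_contact="security@example.com",
    ))
```

Failing on a missing field is a deliberate choice in this sketch: a notice that omits the affected file or the violated policy would undercut the transparency requirement.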

Flexibility in where and how we applied policies, such as per Organizational Unit, was required for us to roll out the solution effectively to all populations and support department-specific needs. OCR was necessary for detecting sensitive data in scanned documents such as driver's licenses, tax documents, and checks. We also needed a low false positive rate to ensure minimal business interruption, minimize engineering time spent reviewing false positives, and gain leadership buy-in for organization-wide deployment.

The vast majority of out-of-the-box enterprise solutions we evaluated were missing at least one, if not several, of the above requirements, in addition to carrying a high price, typically in the hundreds of thousands of dollars. A shocking number of these had false positive rates of up to 80%, even for SSNs and credit card numbers; credit card numbers in particular can be validated with the Luhn algorithm, a widely used checksum for identification numbers. One vendor suggested that we reduce false positive rates by maintaining multiple DLP rules, one for each credit card vendor’s card number range. This would result in a complex, difficult-to-manage ruleset that would require manual updates both to the detector (as credit card vendors change their card number ranges) and to each rule.
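
For readers unfamiliar with the Luhn checksum mentioned above, the following is a minimal sketch of how a detector could use it to discard digit strings that merely look like card numbers. The function name `luhn_valid` is our own, and this is not the code behind Google's or any vendor's detector.

```python
def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in `candidate` pass the Luhn checksum."""
    # Allow the separators commonly seen in card numbers.
    cleaned = candidate.replace(" ", "").replace("-", "")
    if len(cleaned) < 2 or not (cleaned.isascii() and cleaned.isdigit()):
        return False

    total = 0
    # Walk the digits right to left, doubling every second one.
    for index, char in enumerate(reversed(cleaned)):
        digit = int(char)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0


if __name__ == "__main__":
    print(luhn_valid("4111 1111 1111 1111"))  # well-known test card number
    print(luhn_valid("4111 1111 1111 1112"))  # last digit changed
```

A check like this needs no per-vendor number ranges, which is why a single rule with checksum validation is easier to maintain than the ruleset the vendor proposed.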

Google Workspace data loss rules natively support detector logic, exception management, and Optical Character Recognition (OCR). However, the only actions that could be performed automatically were blocking on external share, warning on external share, and labeling. Since Google was already performing the operationally expensive and developmentally complex functions, we looked into using these initial detection activities as an input to a custom response-action workflow. Using this custom workflow, the final cost for scanning, logging, and taking automated actions on files for 30,000 users is approximately $150/mo in AWS costs.

Design

Compass Drive DLP leverages Google Workspace’s native cloud DLP pre-defined content detectors for initial detection of files containing sensitive data.

High level design diagram
  1. A DLP rule is created and enabled within the Google Admin UI with the desired content detectors defined.
  2. Drive DLP Collector queries the Google Workspace Activity API for content detection events that match the rule and enriches them with additional permissions metadata, such as permission types, permission roles, and whether file discovery is enabled.
  3. Drive Policy Engine determines which actions need to be taken based on whether the file event metadata matches a defined policy.
  4. Drive Response Actions perform the actions defined in the policy for each violation.
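
Step 3 above can be sketched as a small matching function. Everything in this sketch is a hypothetical illustration: the `FileEvent` and `Policy` shapes, the detector labels, and the action names are assumptions, not the schema used by the open-sourced tool.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Tuple


@dataclass(frozen=True)
class FileEvent:
    """A detection event after the collector has added sharing metadata."""
    file_id: str
    owner_ou: str  # e.g. "/Agents/NYC"
    detector: str  # illustrative label, e.g. "CREDIT_CARD"
    sharing: str   # "public", "domain", "searchable", or "target_audience"


@dataclass(frozen=True)
class Policy:
    name: str
    org_unit: str               # applies to this OU and its children
    detectors: FrozenSet[str]
    sharing: FrozenSet[str]
    actions: Tuple[str, ...]    # e.g. ("notify_owner", "revoke_permission")

    def matches(self, event: FileEvent) -> bool:
        # "/Agents" should match "/Agents/NYC" but not "/AgentsX".
        prefix = self.org_unit.rstrip("/") + "/"
        in_ou = (event.owner_ou == self.org_unit
                 or event.owner_ou.startswith(prefix))
        return (in_ou
                and event.detector in self.detectors
                and event.sharing in self.sharing)


def actions_for(event: FileEvent, policies: Iterable[Policy]) -> List[str]:
    """Collect the actions of every matching policy, without duplicates."""
    actions: List[str] = []
    for policy in policies:
        if policy.matches(event):
            for action in policy.actions:
                if action not in actions:
                    actions.append(action)
    return actions


POLICIES = [
    Policy("ssn-public", "/", frozenset({"US_SSN"}), frozenset({"public"}),
           ("notify_owner", "revoke_permission")),
    Policy("cards-agents", "/Agents", frozenset({"CREDIT_CARD"}),
           frozenset({"public", "domain"}), ("notify_admin",)),
]

if __name__ == "__main__":
    event = FileEvent("file-1", "/Agents/NYC", "CREDIT_CARD", "domain")
    print(actions_for(event, POLICIES))
```

Keeping the policy as plain data and the matching as a pure function means the response actions in step 4 only need the returned list, and new policies can be added without touching the matching code.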

Future Opportunities

Currently, our DLP tool does not support Shared Drives, certain forms of excessive access, or alternative notification types. We welcome any community suggestions or contributions in the meantime.

  • Other forms of excessive access: Compass Google DLP does not detect forms of excessive access such as 1,000 individuals being added to a document, or a Google Group containing all employees being added to a document.
  • Admin notification integrations (Slack, etc): Admin notifications are currently limited to email.

Visualization

While visualization is not natively built into the tool, it can be implemented using Amazon QuickSight and the data stored in DynamoDB. By using Athena data connectors (refer to this excellent blog post), you can build continuous monitoring of your DLP detection and response with little effort.

Recap

By leveraging native application services and APIs and letting Google do the expensive stuff, we were ultimately able to develop a low-cost DLP solution with high security impact and low employee friction that met our defined requirements while surpassing the functionality of many commercial tools. Because our scanner has a low false positive rate, we can let it run unattended, allowing Enterprise Security Engineering to continue doing innovative work in this space.
