Compliant Data Access in MariaDB & MySQL: Access procedure for debug queries

Maxime Renou · Inside Personio · Nov 17, 2021 · 7 min read

Engineers who work in SaaS companies like Personio know that invalid or inconsistent data can cause a lot of pain, and the bugs created from bad data can be difficult to reproduce without the ability to investigate the data stores. But how do you find a balance between keeping data private and secure and giving engineers the tools they need to debug customer issues? To solve this, we built a self-service tool that handles sensitive data requests in a secure way.

(Note that this article follows up on my prior post, Compliant Data Access in MariaDB & MySQL: Enforcing Tenant Separation for Debug Queries — make sure to read that first!)

Maxime Renou, Senior Site Reliability Engineer at Personio

Problem

Typically we develop locally with fixtures and use anonymized data for bug fixing, but in some cases we need to understand the real data. Anonymization can also make certain problems disappear, such as encoding issues, which is another reason engineers occasionally need to access real data. Ensuring the data stays as secure as possible during these investigations was our primary challenge.

In the past, we had used a ticketing system where engineers could request access to a specific set of data for debugging purposes. Tickets were processed manually by an SRE, and the results were sent back to the engineers in a secure and ephemeral manner.

While this worked for a while, the process didn't scale as Personio grew. So we looked into building a self-service tool that would allow engineers to access real data in a fast and secure way.

Requirements

Since Personio stores sensitive data from our customers, we had several strong requirements:

  • It must allow access to the data of a single tenant at a time.
  • Access must be granted by an authorized approver.
  • Engineers must pair up to work on the database (following the four-eyes principle).
  • Every action conducted using the tool must be auditable.
  • It should easily integrate with our existing tools (such as OneLogin & Slack).
  • It should be low maintenance, easy to use, and cost-effective.

Implementation

With those requirements in mind, we started to design the architecture of this tool, which we called dbaccess. Because we use AWS as a cloud provider, and this is where our databases and applications live, we designed the architecture around AWS managed services. We have multiple datastores in place, so we needed a tool that could grant access to all of them while restricting access to one tenant at a time.

Our main design choices included:

  • A client/API architecture for the tool, as it needs to communicate with our databases. Because these live in private subnets, the API has to be deployed in the same VPC/subnets. And since our engineers don't have access to that VPC, they need a client that communicates with the API on their behalf.
  • Golang, to write the client and the API. We chose Golang because it's efficient, has good AWS SDK support, and is easy to distribute as a CLI tool.
  • Lambda, to run the API of dbaccess, as it’s low maintenance and cost-effective.
  • API Gateway, to expose the API (Lambda functions) to the client while plugging into Cognito for the authentication aspect (Cognito itself is connected to OneLogin as the identity provider).
  • DynamoDB as a datastore to keep track of the requests and their status. We benefited from DynamoDB features such as TTL, to limit how long a request is valid, and DynamoDB Streams, to send audit logs to Slack when a request expires (see the record sketch after this list).
  • Slack, for the approval and audit log processes (more on these below), along with CloudWatch for logging and SSM for configuration.
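To make the DynamoDB piece more concrete, here is a minimal sketch of what the request record could look like in Go. The field names, validity window, and helper below are assumptions for illustration, not our actual schema:

// Illustrative sketch of the access-request record stored in DynamoDB.
// Field names and the TTL window are assumptions, not our actual schema.
package dbaccess

import "time"

// AccessRequest is one item in the requests table. DynamoDB's TTL feature
// deletes the item once ExpiresAt (a Unix timestamp) has passed, which in
// turn triggers the audit-log Lambda through DynamoDB Streams.
type AccessRequest struct {
	RequestID       string `dynamodbav:"request_id"` // partition key
	Engineer        string `dynamodbav:"engineer"`
	SecondReviewer  string `dynamodbav:"second_reviewer"`
	Approver        string `dynamodbav:"approver"`
	CompanyID       int    `dynamodbav:"company_id"`
	Reason          string `dynamodbav:"reason"`
	JiraTicket      string `dynamodbav:"jira_ticket"`
	NumberOfQueries int    `dynamodbav:"number_of_queries"`
	Status          string `dynamodbav:"status"`     // pending, approved, or declined
	ExpiresAt       int64  `dynamodbav:"expires_at"` // the table's TTL attribute
}

// NewAccessRequest builds a pending request that expires after the given TTL.
func NewAccessRequest(id, engineer string, companyID int, ttl time.Duration) AccessRequest {
	return AccessRequest{
		RequestID: id,
		Engineer:  engineer,
		CompanyID: companyID,
		Status:    "pending",
		ExpiresAt: time.Now().Add(ttl).Unix(),
	}
}

Driving both expiry and audit logging from a single TTL attribute keeps the lifecycle in one place: once the item is gone, the access is gone, and the audit trail is emitted automatically.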

We then tackled the processes where the engineer, the pairing engineer, and the authorized approver get involved:

Request Access Process

When an engineer requests access to customer data.

  • An engineer requests access via the CLI client.
  • The request goes through the API Gateway.
  • The authentication is done via Cognito and OneLogin.
  • The request goes to the request-company-access Lambda function.
  • The request is persisted to DynamoDB.
  • A Slack notification is sent to the authorized approver.

Even though we decided not to open source this tool, as it's too specific to our company processes, we still wanted to show its usage to showcase the tool's simplicity.

Requesting access can be achieved with the client by running something like:

dbaccess request-company-access --company-id=1 --reason="Debugging live issue" --jira-ticket="XX-1234" --number-of-queries=2 --manager="john@personio.de" --second-reviewer="sarah@personio.de"

Let’s review the parameters:

  • company-id -> ID of the tenant to access.
  • reason -> A brief description for the authorized approver to get some quick context.
  • jira-ticket -> The Jira bug ticket associated with that request. It can be consulted by the approver and used for later auditing.
  • number-of-queries -> The number of queries the engineer expects to run. While we don't currently enforce a hard limit based on that number, we want engineers to think ahead about how many queries they will need.
  • manager -> This is the authorized approver.
  • second-reviewer -> The engineer to pair with to execute the queries.
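Under the hood, a handler like request-company-access could look roughly like the following sketch. Everything here (table and field names, the validity window, the notifyApprover helper) is an assumption for illustration, not our actual implementation:

// Hypothetical sketch of the request-company-access Lambda. Table and field
// names, the validity window, and notifyApprover are illustrative assumptions.
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/dynamodb"
	"github.com/aws/aws-sdk-go/service/dynamodb/dynamodbattribute"
)

type RequestInput struct {
	Engineer        string `json:"engineer"` // resolved from the Cognito/OneLogin identity
	CompanyID       int    `json:"company_id"`
	Reason          string `json:"reason"`
	JiraTicket      string `json:"jira_ticket"`
	NumberOfQueries int    `json:"number_of_queries"`
	Manager         string `json:"manager"`
	SecondReviewer  string `json:"second_reviewer"`
}

var db = dynamodb.New(session.Must(session.NewSession()))

func handle(ctx context.Context, in RequestInput) error {
	item, err := dynamodbattribute.MarshalMap(map[string]interface{}{
		"request_id":        fmt.Sprintf("%s-%d-%d", in.Engineer, in.CompanyID, time.Now().Unix()),
		"engineer":          in.Engineer,
		"second_reviewer":   in.SecondReviewer,
		"approver":          in.Manager,
		"company_id":        in.CompanyID,
		"reason":            in.Reason,
		"jira_ticket":       in.JiraTicket,
		"number_of_queries": in.NumberOfQueries,
		"status":            "pending",
		// TTL attribute: the request is only valid for a limited window.
		"expires_at": time.Now().Add(24 * time.Hour).Unix(),
	})
	if err != nil {
		return err
	}
	// Persist the pending request so the approval and execution steps can check it.
	if _, err := db.PutItemWithContext(ctx, &dynamodb.PutItemInput{
		TableName: aws.String("dbaccess-requests"),
		Item:      item,
	}); err != nil {
		return err
	}
	// Placeholder for the interactive Slack message sent to the approver.
	return notifyApprover(in.Manager, in)
}

func notifyApprover(approver string, in RequestInput) error {
	fmt.Printf("notifying %s about access request for company %d\n", approver, in.CompanyID)
	return nil
}

func main() { lambda.Start(handle) }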

Approval Process

When an approver receives a request and approves/declines it.

  • An authorized approver receives a Slack notification (see below).
  • After approving/declining the request, the response goes to the slack-interactive-endpoint Lambda function.
  • The request status is updated in DynamoDB.
  • A notification is sent to the engineer and the pairing engineer.

This is an example of the notification received by the approver:

If the request is approved, the engineer and pairing engineer receive another Slack notification to let them know that they can start pairing for the next process.
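For illustration, here is a hypothetical sketch of how the slack-interactive-endpoint function could process that button click. The payload parsing, the action value format, and the table name are assumptions, not the real code:

// Hypothetical sketch of the slack-interactive-endpoint Lambda. The payload
// shape, action value format, and table name are illustrative assumptions.
package main

import (
	"context"
	"encoding/json"
	"net/url"
	"strings"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

var db = dynamodb.New(session.Must(session.NewSession()))

// slackAction is the small slice of Slack's interactive payload we need.
type slackAction struct {
	Actions []struct {
		Value string `json:"value"` // e.g. "approved:<request_id>" or "declined:<request_id>"
	} `json:"actions"`
}

func handle(ctx context.Context, req events.APIGatewayProxyRequest) (events.APIGatewayProxyResponse, error) {
	// Slack sends interactive responses form-encoded as payload=<json>.
	form, err := url.ParseQuery(req.Body)
	if err != nil {
		return events.APIGatewayProxyResponse{StatusCode: 400}, nil
	}
	var action slackAction
	if err := json.Unmarshal([]byte(form.Get("payload")), &action); err != nil || len(action.Actions) == 0 {
		return events.APIGatewayProxyResponse{StatusCode: 400}, nil
	}
	decision, requestID, _ := strings.Cut(action.Actions[0].Value, ":")

	// Flip the request status; the engineer and pairing engineer are notified next.
	_, err = db.UpdateItemWithContext(ctx, &dynamodb.UpdateItemInput{
		TableName:                aws.String("dbaccess-requests"),
		Key:                      map[string]*dynamodb.AttributeValue{"request_id": {S: aws.String(requestID)}},
		UpdateExpression:         aws.String("SET #s = :s"),
		ExpressionAttributeNames: map[string]*string{"#s": aws.String("status")},
		ExpressionAttributeValues: map[string]*dynamodb.AttributeValue{
			":s": {S: aws.String(decision)}, // "approved" or "declined"
		},
	})
	if err != nil {
		return events.APIGatewayProxyResponse{StatusCode: 500}, nil
	}
	return events.APIGatewayProxyResponse{StatusCode: 200, Body: "Request " + decision}, nil
}

func main() { lambda.Start(handle) }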

Execute Query Process

When an engineer executes a query for a specific company.

  • An engineer executes a query via the client (together with the pairing engineer).
  • The query goes through the API Gateway.
  • The authentication is done via Cognito and OneLogin.
  • The query goes to the execute-query Lambda function.
  • The execute-query function verifies in DynamoDB that the engineer is allowed to access that company ID.
  • The query is executed against the target database.
  • The results are returned to the client.

The query can be executed by running something like:

dbaccess execute-query "SELECT id, company_id FROM employees" --company-id=1 --instance=hr_management --db=org
+----+------------+
| id | company_id |
+----+------------+
|  1 |          1 |
|  2 |          1 |
|  3 |          1 |
+----+------------+

The parameters are relatively simple:

  • First, the query to be executed.
  • company-id -> The tenant ID to access.
  • instance -> Name of the RDS instance.
  • db -> Name of the database in that RDS instance.

And the tool returns the result in a familiar format, limited to the tenant for which access was granted.

To ensure that queries can only be executed within a tenant scope (for us, this means one company at a time), engineers must specify which company ID they want to access. We then rely on RLS (Row Level Security) for PostgreSQL databases or on our custom implementation for MariaDB (which you can read about here), as sketched below.
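As a rough sketch of the PostgreSQL side (the session variable name and the RLS wiring are assumptions for illustration, not our exact implementation), the query executor can pin a single connection, scope it to the tenant, and only then run the engineer's query:

// Illustrative sketch of tenant scoping for a PostgreSQL database with RLS.
// The session variable name (app.current_tenant) is an assumption; policies
// would compare it against each table's company_id column.
package dbaccess

import (
	"context"
	"database/sql"
	"fmt"

	_ "github.com/lib/pq" // PostgreSQL driver
)

// RunScopedQuery pins one connection, scopes it to a tenant, and executes the
// engineer's query; rows from other tenants stay invisible thanks to RLS.
func RunScopedQuery(ctx context.Context, db *sql.DB, companyID int, query string) ([][]string, error) {
	// Use a single connection so the session setting applies to the query.
	conn, err := db.Conn(ctx)
	if err != nil {
		return nil, err
	}
	defer conn.Close()

	// Session setting that RLS policies read via current_setting('app.current_tenant').
	if _, err := conn.ExecContext(ctx, fmt.Sprintf("SET app.current_tenant = '%d'", companyID)); err != nil {
		return nil, err
	}

	rows, err := conn.QueryContext(ctx, query)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	cols, err := rows.Columns()
	if err != nil {
		return nil, err
	}
	var results [][]string
	for rows.Next() {
		raw := make([]sql.RawBytes, len(cols))
		ptrs := make([]interface{}, len(cols))
		for i := range raw {
			ptrs[i] = &raw[i]
		}
		if err := rows.Scan(ptrs...); err != nil {
			return nil, err
		}
		row := make([]string, len(cols))
		for i, v := range raw {
			row[i] = string(v)
		}
		results = append(results, row)
	}
	return results, rows.Err()
}

In this scheme, an RLS policy on each table compares company_id against current_setting('app.current_tenant'), so the engineer's query simply cannot see other tenants' rows, no matter how it is written.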

Slack Audit Logs Process

When an engineer’s access expires.

  • When a request record in DynamoDB expires via TTL, the dynamodb-event Lambda function is invoked by the DynamoDB stream.
  • If the request was approved, the Lambda function retrieves all queries executed under that request ID.
  • Then a Slack notification is sent to the authorized approver, listing the queries that were run and the number of rows returned.

This is an example of the notification received:

In addition to all of that, audit logs are pushed to CloudWatch for every action taken against dbaccess.
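Sketched in the same hypothetical spirit, the stream handler could look something like this (field names and the two helpers are assumptions for illustration):

// Hypothetical sketch of the dynamodb-event Lambda reacting to TTL expiries.
// Field names and the two helpers are illustrative assumptions.
package main

import (
	"context"
	"fmt"

	"github.com/aws/aws-lambda-go/events"
	"github.com/aws/aws-lambda-go/lambda"
)

func handle(ctx context.Context, e events.DynamoDBEvent) error {
	for _, record := range e.Records {
		// TTL deletions arrive as REMOVE events; OldImage carries the expired item.
		// (A hardened version would also check record.UserIdentity to confirm the
		// deletion came from the TTL service rather than from a user.)
		if record.EventName != "REMOVE" {
			continue
		}
		old := record.Change.OldImage
		if old["status"].String() != "approved" {
			continue // only approved requests produce an audit summary
		}
		requestID := old["request_id"].String()
		postSlackAudit(old["approver"].String(), requestID, fetchQueries(requestID))
	}
	return nil
}

// fetchQueries would look up the queries (and row counts) executed under this request.
func fetchQueries(requestID string) []string { return nil }

// postSlackAudit would send the audit summary to the authorized approver.
func postSlackAudit(approver, requestID string, queries []string) {
	fmt.Printf("audit for %s sent to %s: %d queries\n", requestID, approver, len(queries))
}

func main() { lambda.Start(handle) }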

Conclusion

Five months after we introduced the dbaccess tool, 330 queries had been executed through it. To put that number into perspective, during that same period 230 other queries were requested via tickets, which means roughly 60% of all requests were handled by the tool in a self-service manner. The remaining requests involved query types the tool doesn't yet support (write queries or multi-tenant read queries). We also received positive feedback from our engineers, who appreciated the ability to run queries independently, without waiting for an SRE to become available.

From a cost perspective, the infrastructure to run the tool costs on average $0.20 per day. This low cost is possible because we went fully serverless and therefore only pay when the tool is used.

But, as always, there is room for improvement. Our engineers have already asked us to extend the tool to more data stores, such as Elasticsearch or SQS. We are also considering an administrative mode reserved for SREs to perform maintenance tasks. Stay tuned for more as we explore each of these options!
