Observability at Pave

Quan · Pave Engineering · Aug 3, 2021

Here’s how our investment into observability with Datadog has unlocked new value for engineering, built a process around metrics, and enabled the engineering team to scale.

When I started at Pave, we were a simple Series A startup with four engineers and one launched customer. Our “observability stack” was dropping console.log statements in various places and spelunking through the log stream viewer in Google Cloud Functions. Reproducing issues was almost impossible, our “alerts” were email complaints, and we wasted time guessing what our users were doing.

Fast-growing startups are messy, but the lack of metrics meant we were flying blind. Just as you wouldn’t trust an airplane with bad instruments, you shouldn’t trust a startup with bad telemetry.

One of the first tech initiatives I pushed for at Pave was best-in-class observability for engineers.

The system diagram of our engineering system and how it feeds into our observability stack

1. Choosing the Stack

Deciding on the technology was a classic buy versus build decision for us. Here is how we decided:

  1. We valued doing due diligence on our key technology decisions, so we ran a “bake-off” between ELK and Datadog: a one-week sprint to implement a hasty MVP of each, which we then compared and shared with the team.
  2. We valued ease of use, because engineering time was our limiting resource. We would rather spend mental cycles evolving our product than tinkering with our infrastructure.
  3. We triangulated our findings with what we saw other successful tech companies using. Success does not happen in a vacuum.

Takeaway: Datadog is more expensive, but we favored it because we believe best-in-class observability is worth paying for.

2. Implementation and Consensus

Datadog generates automatic metrics and dashboards through its integration with GCP, but it’s not completely plug and play. We needed to do some work on our end.

  1. We needed to generate business data specific to Pave. Our logging library needed to autolog key business information about our system (requestId, companyId) with every log line and metric.
  2. We needed to avoid logging compensation data, which is particularly sensitive. This meant de-identifying our data (no emails!) and avoiding logging salary or equity information wherever possible.
  3. We needed to agree on a consistent logging standard. Everyone has opinions.
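The three requirements above can be sketched as a thin logging wrapper. This is a minimal illustration, not Pave’s actual library; the field names beyond requestId and companyId, the sensitive-key list, and the function names are assumptions.

```typescript
// Sketch of a structured logger that auto-attaches request context
// (requestId, companyId) to every line and redacts sensitive
// compensation fields before emitting.
type LogContext = { requestId: string; companyId: string };

// Keys we never want to see in a log line (assumed list for illustration).
const SENSITIVE_KEYS = new Set(["email", "salary", "equity"]);

function redact(payload: Record<string, unknown>): Record<string, unknown> {
  const clean: Record<string, unknown> = {};
  for (const [key, value] of Object.entries(payload)) {
    clean[key] = SENSITIVE_KEYS.has(key) ? "[REDACTED]" : value;
  }
  return clean;
}

function makeLogger(ctx: LogContext) {
  return {
    info(message: string, payload: Record<string, unknown> = {}): string {
      // Every line carries the shared context plus the redacted payload.
      const line = JSON.stringify({
        level: "info",
        message,
        ...ctx,
        ...redact(payload),
      });
      console.log(line);
      return line;
    },
  };
}

// Example: the salary field is scrubbed, the context fields are kept.
const requestLog = makeLogger({ requestId: "req-123", companyId: "acme" });
requestLog.info("updated band", { band: "L4", salary: 180000 });
```

Because the context is bound once per request, no engineer has to remember to attach requestId or companyId by hand, and the redaction runs on every line regardless of call site.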

Takeaway: Although getting started with Datadog and its default integrations and dashboards was easy, most of our energy went into making it specific to Pave and building team consensus on standards.

Example of our web infrastructure metrics: status codes, number of instances, and CPU + memory utilization

3. Key results

Here are some problems we solved with our new observability stack:

  1. Identify scaling issues: Early-stage companies eventually sell to larger customers, and that is when performance problems surface. We pay particular attention to endpoint performance broken down by account and companyId.
  2. Complete full audits on our endpoints: As we gained visibility, we also attracted attackers looking for vulnerabilities. With full auditability, we can determine who called which endpoint, how, and when.
  3. Improve quality: We can assign triage, safely remove dead endpoints, and detect unhandled errors and exceptions.
  4. Focus on our product: We can trust our system because it autoscales with load and is trusted by the biggest tech companies out there. Had we gone with an ELK solution, we’d be periodically resizing our Elasticsearch instance or reindexing documents.

Takeaway: Investing in our observability stack unlocked new capabilities for our engineering team.

4. Learnings

In the beginning, I had personal doubts about whether the investment in observability was worth it, but I am glad we did this project. Some key learnings for me:

  1. The increase in telemetry and knowledge about the system was eye-opening and surfaced new opportunities.
  2. The process of building dashboards and alerts sparked productive conversations about useful metrics and ownership.
  3. Tools that are easy to use scale well and result in good adoption.

Takeaway: Observability is a worthwhile investment for any company, whether you are an early seed-stage startup or a late-stage venture-backed company.

Two charts that tell a narrative. Left: Monitoring for a p95 execution time of a pipeline to ensure that we don’t hit the wrong thresholds. Right: Distribution of times it takes to process large and small companies.
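The p95 in the left chart is simply the 95th percentile of observed execution times over a window. A quick sketch of how such a threshold check could be computed (illustrative only; Datadog computes this for you, and the nearest-rank method shown is one of several percentile definitions):

```typescript
// Sketch: compute the p95 of a window of durations using the
// nearest-rank method, then compare against an alert threshold.
function p95(samplesMs: number[]): number {
  const sorted = [...samplesMs].sort((a, b) => a - b);
  // Nearest-rank: index of the 95th-percentile sample in the sorted window.
  const rank = Math.ceil(0.95 * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

const samples = [120, 130, 90, 400, 110, 95, 105, 100, 115, 125];
const shouldAlert = p95(samples) > 300; // true here: the slow tail is 400 ms
```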

Last words

I am a firm believer that investment in one’s observability stack is well worth the cost, given its outsized impact. It has made the lives of engineers at Pave better and has improved product quality.

If you don’t have feedback, you’re just guessing. ✈️
