Probing is a technique for performing regular checks on a service at a short interval. Probes provide signals that can significantly cut down debug time. This post describes probes and how they can be used to drill down into errors and make debugging more focused by partitioning the debug space.

Probes

Probes are targeted checks, performed as request / response actions, on a short (~1 minute or less) interval. Some common applications of probes are:

  • Uptime Probes: Internal Debugging (the focus of this post). Probes make a binary yes/no determination of whether a service is functioning as expected.
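As a rough illustration of the request / response check described above, here is a minimal sketch in Go. The names `probeOnce` and `newStubService`, the 2 second timeout, and the rule "any 2xx status counts as healthy" are assumptions made for this example and are not taken from the post. The stub server stands in for a real service so the sketch runs on its own.

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// probeTimeout bounds a single check (assumed value for this sketch).
const probeTimeout = 2 * time.Second

// probeOnce sends one GET request and makes a binary yes/no call:
// true only when the service answers with a 2xx status in time.
func probeOnce(url string) bool {
	ctx, cancel := context.WithTimeout(context.Background(), probeTimeout)
	defer cancel()

	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return false
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false // timeouts and connection errors count as failures
	}
	defer resp.Body.Close()
	return resp.StatusCode >= 200 && resp.StatusCode < 300
}

// newStubService starts a local server that always replies with status.
// It is a hypothetical stand-in for the service being probed.
func newStubService(status int) (url string, stop func()) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(status)
	}))
	return srv.URL, srv.Close
}

func main() {
	url, stop := newStubService(http.StatusOK)
	defer stop()

	// A real probe would tick roughly every minute or less;
	// the interval is shortened here so the demo finishes quickly.
	ticker := time.NewTicker(100 * time.Millisecond)
	defer ticker.Stop()
	for i := 1; i <= 3; i++ {
		<-ticker.C
		fmt.Printf("probe %d ok=%t\n", i, probeOnce(url))
	}
}
```

In practice the boolean result would be emitted as a metric or log line so that failures can be correlated with a specific dependency or code path.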


Successfully changing systems requires an understanding of the current system’s state. Profiling is a tool for understanding systems at a point in time. Without a good understanding of the current state, changes can be suboptimal, counterproductive, or even dangerous. Profiling is used to break down a system’s current state using dimensions, and is a prerequisite for successfully modifying systems.

What is Profiling?

Profiling describes the current state of a system. Knowing the current state helps to inform changes. Consider the following goals and how profiling helps each.

  • Reduce monthly spending. Requires understanding current spending.
  • Eat healthier. Requires understanding what you’re currently eating.
  • Reduce…



JSON is the de facto logging standard. JSON is so ubiquitous that the popular logging data tools (such as Elasticsearch) accept JSON by default. Although JSON is an evolution over previous logging standards, JSON’s lack of strict types makes it insufficient for long-term persistence or as a foundation for a data lake. This post describes the problem with JSON and proposes a solution using a strictly typed interchange format such as Protocol Buffers.
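One way to see the lack of strict types is to decode two log lines that use the same field name with different value types. The sketch below is only an illustration: the `fieldType` helper and the sample log lines are hypothetical and do not come from the post.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// fieldType decodes one JSON log line into an untyped map and reports
// the Go type the decoder chose for the named field.
func fieldType(line, field string) (string, error) {
	var record map[string]interface{}
	if err := json.Unmarshal([]byte(line), &record); err != nil {
		return "", err
	}
	return fmt.Sprintf("%T", record[field]), nil
}

func main() {
	// Two producers log the "same" field; nothing in JSON stops one
	// from writing a number and the other a string.
	lines := []string{
		`{"msg":"request served","status":200}`,
		`{"msg":"request served","status":"200"}`,
	}
	for _, line := range lines {
		t, err := fieldType(line, "status")
		if err != nil {
			fmt.Println("decode error:", err)
			continue
		}
		fmt.Printf("%s -> status decoded as %s\n", line, t)
	}
}
```

Both lines are valid JSON, yet `status` decodes as `float64` in the first and `string` in the second. A schema-based format would reject one of the two at write time instead of leaving the mismatch for downstream consumers to discover.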

The Trouble With JSON

JSON logs establish an explicit structure. JSON parsers are available in most languages, which makes JSON accessible as a log standard. JSON logs are referred to as…


Alerting on SLOs is an SRE practice which enables teams to be notified proactively when a level of service is not being met. When an SLO alert fires, teams can be confident that a client is impacted. Alternative alerting techniques have difficulty quantifying customer impact, which can complicate incident response. This post describes SLO alerts and the benefits they provide over alternatives. The Google SRE book describes how to alert on SLOs, and this post aims to describe why to alert on SLOs.

Terminology Refresher

SLOs are a quantifiable target representing a client’s experience. SLOs are built on SLIs, and SLIs are…


Since February I’ve been working on an asynchronous AWS Lambda service processing 60,000 events / second from Kinesis. Lambda provides minimal operational overhead, fast deploys, predictable pricing, and enforces many tenets of 12-factor apps, out of the box. After working with Lambda almost daily for the past 3 months, I’m convinced it’s the future for asynchronous processing, and here’s why.

Outsourcing

Lambda removes complexity over stateful deployment models. This minimizes the surface area service engineers are responsible for. Lambda removes the need for system alerts, instance sizing, autoscaling, provisioning servers, and choosing process monitors, amongst others. There’s no SSH’ing into instances…


AWS Lambda is a serverless solution which enables engineers to deploy single functions. AWS Lambda handles orchestrating, executing, and scaling the function invocations. It’s important to structure Go Lambda projects so that the lambda is a simple entry point into the application, equivalent to cmd/. After a project is structured, it’s important to keep logic outside the lambda, which allows for easy reuse and testing of the application logic. The following are a series of steps which can be used in Go-based Lambda projects to help keep projects structured and increase their testability.
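A minimal sketch of the "thin entry point" idea is shown here. The `Event` and `Result` types and the `Greet` function are hypothetical and exist only for illustration. In a real project the logic would likely live in its own package and `main` would pass `handler` to `lambda.Start` from `github.com/aws/aws-lambda-go/lambda`; the handler is called directly here so the example runs with the standard library alone.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
)

// Event and Result are hypothetical payload types for this sketch.
type Event struct {
	Name string `json:"name"`
}

type Result struct {
	Greeting string `json:"greeting"`
}

// Greet is the application logic. It knows nothing about Lambda,
// so it can be unit tested and reused outside of a function invocation.
func Greet(name string) (string, error) {
	name = strings.TrimSpace(name)
	if name == "" {
		return "", errors.New("name is required")
	}
	return "Hello, " + name, nil
}

// handler is the thin entry point: unpack the event, delegate to the
// application logic, and wrap the result.
func handler(ctx context.Context, e Event) (Result, error) {
	greeting, err := Greet(e.Name)
	if err != nil {
		return Result{}, err
	}
	return Result{Greeting: greeting}, nil
}

func main() {
	// A deployed function would call lambda.Start(handler) here instead.
	res, err := handler(context.Background(), Event{Name: "gopher"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(res.Greeting)
}
```

Because `Greet` takes and returns plain Go values, its tests need no Lambda runtime, event fixtures, or AWS credentials.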

Structure

In Go, it’s common…


Contact tracing is a technique to determine everyone that an individual came into contact with during some period of time. It’s important to identify everyone exposed to a contagious individual so that they can self-isolate until the sickness incubation period has passed. With contact tracing it’s possible to significantly reduce the spread of disease. As soon as an individual tests positive for a sickness, they can apply contact tracing to identify the people they came in contact with and alert them to self-isolate.

Identification

Consider an individual who tests positive for coronavirus:


Contact Listing is the first step of tracing. The…


ValueStream now supports generating pull request reports for GitHub repositories. Many of the leading code review metric SaaS products charge hundreds or thousands of dollars a month for similar data. We believe it shouldn’t cost an arm and a leg to gather and aggregate YOUR data. The ValueStream report generation provides a high-level view into pull request performance.

ValueStream ships with CLI tools for pulling and performing aggregations to generate reports. This post uses the ImpactInsights/valuestream pull request information as an example.

The first step is pulling the “raw” pull request data from GitHub. Private repos require generating…


Tracking engineering performance from code review (referred to as pull requests) metrics needs to be done carefully. Focusing solely on pull requests for performance metrics leads to a myopic view of engineering performance. Is there anything that pull requests can indicate about engineering performance? Which metrics (if any) should be used, and how should they be used?

What Can Pull Requests Indicate?

To understand what pull requests can indicate about performance, it’s important to first understand how pull requests are regularly used and the role that they play in software delivery. Performance, in this context, refers to the ability of a company to turn software…


Creating sustainable, long-lasting change in human systems is only achievable through modifying human behavior. Anyone who has worked with people should understand the difficulty of this task. One common method for enforcing change is command and control, which often leads to hostility and resentment. The other end of the spectrum, no control or oversight, can be equally damaging, leading to a free-for-all or local fiefdoms emerging. This article describes a middle-ground method for creating change through enhanced information management and carefully designed system constraints.

This strategy is frequently seen in tech and especially relevant to distributed organizations…
