Photo by Brook Anderson on Unsplash

I’m rarely happy with database observability. There are always more metrics to collect and more data to aggregate and visualise. Most importantly, I want to understand how the database is used and by whom, so that I can improve it.

MongoDB has a decent set of primitives for analyzing database usage: “slow query” logging, the “profiler”, a couple of “top” utilities, and a set of system collections and functions for additional insights. However, they are just that: primitives. Considerable automation is required to make them usable at scale.

What’s not so great is the overhead of some of those tools. Running the full profiler has a noticeable performance cost, and on a busy database the system.profile collection can roll over faster than you can tail it. The same goes for logging: until MongoDB 4.4 the logs were only semi-structured, which made them prohibitively expensive to parse. Furthermore, since logging is subject to slowOpThresholdMs, we only capture the “slow” queries. …
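To make the logging point concrete, here is a minimal sketch of aggregating slow-query entries from the structured (JSON) logs of MongoDB 4.4 and later. It assumes each log line is a single JSON document whose slow operations carry a msg of "Slow query", with the namespace and duration under attr.ns and attr.durationMillis. The function name and the sample lines are made up for illustration and are not taken from a real deployment.

```python
import json
from collections import defaultdict


def summarize_slow_queries(lines):
    """Aggregate "Slow query" log entries per namespace.

    Assumes MongoDB 4.4+ structured logging, one JSON document per line.
    Lines that are not valid JSON (e.g. older log formats) are skipped.
    """
    stats = defaultdict(lambda: {"count": 0, "total_ms": 0, "max_ms": 0})
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a structured log line
        if not isinstance(entry, dict) or entry.get("msg") != "Slow query":
            continue  # some other log message
        attr = entry.get("attr", {})
        ns = attr.get("ns", "<unknown>")
        ms = attr.get("durationMillis", 0)
        bucket = stats[ns]
        bucket["count"] += 1
        bucket["total_ms"] += ms
        bucket["max_ms"] = max(bucket["max_ms"], ms)
    return dict(stats)


# Hypothetical sample lines, shaped like structured log output.
SAMPLE_LOG = [
    '{"t":{"$date":"2020-10-01T12:00:00.000+00:00"},"s":"I","c":"COMMAND",'
    '"id":51803,"ctx":"conn1","msg":"Slow query",'
    '"attr":{"type":"command","ns":"shop.orders","durationMillis":120}}',
    '{"t":{"$date":"2020-10-01T12:00:01.000+00:00"},"s":"I","c":"COMMAND",'
    '"id":51803,"ctx":"conn2","msg":"Slow query",'
    '"attr":{"type":"command","ns":"shop.orders","durationMillis":300}}',
    '{"t":{"$date":"2020-10-01T12:00:02.000+00:00"},"s":"I","c":"NETWORK",'
    '"id":22943,"ctx":"listener","msg":"Connection accepted","attr":{}}',
    "2020-10-01T12:00:03.000+0000 I COMMAND [conn3] an old-style line",
    '{"t":{"$date":"2020-10-01T12:00:04.000+00:00"},"s":"I","c":"COMMAND",'
    '"id":51803,"ctx":"conn4","msg":"Slow query",'
    '"attr":{"type":"command","ns":"shop.users","durationMillis":150}}',
]

if __name__ == "__main__":
    for ns, s in sorted(summarize_slow_queries(SAMPLE_LOG).items()):
        print(ns, s["count"], s["total_ms"], s["max_ms"])
```

A summary like this still only sees operations that crossed the slow-operation threshold, so it illustrates the parsing step rather than a complete picture of database usage.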


Photo by Ben Davis, Instagram slovaceck_

Running autonomous robots on city streets is very much a software engineering challenge. Some of this software runs on the robot itself, but a lot of it runs in the backend: remote control, path finding, matching robots to customers, and fleet health management, as well as interactions with customers and merchants. All of this needs to run 24x7 without interruptions and scale dynamically to match the workload.

SRE at Starship is responsible for providing the cloud infrastructure and platform services for running these backend services. We’ve standardized on Kubernetes for our microservices and run it on top of AWS. MongoDB is the primary database for most backend services, but we also like PostgreSQL, especially where strong typing and transactional guarantees are required. For async messaging, Kafka is the platform of choice, and we use it for pretty much everything aside from shipping video streams from robots. For observability we rely on Prometheus and Grafana, Loki, Linkerd and Jaeger. …

About

Martin Pihlak

Site Reliability Engineer. Rust enthusiast. Pragmatic. Enjoys sailing, lifting and running.
