I spend a large part of every day shelled into cloud servers, viewing logs, checking alerts in Slack channels, looking at pages on my phone, glancing at the kitchen clock as I walk by to get coffee, and otherwise behaving like a typical engineer. These activities have something in common: they all involve timestamps of one form or another, and most of them are different.

Yeah, I hate time zones, and you probably do too. Our servers are on UTC military time. Our Slack channel shows 12-hour local time, as do the kitchen clock and my phone. My colleagues are…


I don’t often write “in the trenches” stories about production issues, although I enjoy reading them. One of my favorite sources for that kind of tale is rachelbythebay. I’ve been inspired by her writing in the past, and I’m always on the lookout for opportunities to write more posts like hers. As it happens, we experienced an unusual incident of our own recently. The symptoms and sequence of events involved make an interesting story and, as is often the case, contain a couple of valuable lessons. Not fun to work through at the time, but perhaps fun to replay here.


Logging is one of those plumbing things that often gets attention only when it’s broken. That’s not necessarily a criticism. Nobody makes money off their own logs. Rather, we use logs to gain insight into what our programs are doing… or have done, so we can keep the things we do make money from running. At small scale, or in development, you can get the necessary insights from printing messages to stdout. Scale up to a distributed system and you quickly develop a need to aggregate those messages to some central place where they can be useful. …


I’m a software engineer and so I usually fill this space with software and systems engineering topics. It’s what I do and love, and I enjoy writing about it, but not today. Instead I’m going to talk about what my wife does, and loves doing, and how the times we are living through have affected her job and our lives together. In many ways we’re among the lucky ones: we both have incomes and health insurance, and I already worked from home. In other ways we’re not so fortunate. The current crisis facing the world is like nothing any of…


Last night we migrated a key service to a new environment. Everything went smoothly and we concluded the maintenance window early, exchanged a round of congratulations and killed the Zoom call. This morning I settled in at my desk and realized that this key service’s builds were breaking on master. My initial, and I think understandable, impulse was to assume that I had somehow broken the build when I merged my work branch for the migration into master the night before. …


A couple of years ago I lost all of what I would have considered, up to that point, my intellectual life, not to mention a lot of irreplaceable photos, in a hard drive failure. And while this post is not about the technical and behavioral missteps that allowed the loss to occur, those things nonetheless make up a part of the story. How does it happen that an experienced software engineer, someone who is often responsible for corporate data and has managed to not get fired for losing any of it, suffers a hard drive failure and finds himself in…


This morning I disabled private notes on my stories, for a couple of reasons. Philosophically I just don’t like them. If we’re editing a story together then private notes might make a lot of sense. But for a published piece I much prefer suggestions, questions, objections, etc. to be made in the public arena where everyone can read and comment on them. Back when I was an active moderator on the AnandTech forum I used to steer a lot of DMs back into the public forum for the same reason: if you’re going to make a point or ask something…


At Olark we’ve been running production workloads on Kubernetes in GKE since early 2017. In the beginning our clusters were small and easily managed. When we upgraded Kubernetes on the nodes, our most common cluster-wide management task, we could just run the process in the GKE console and keep an eye on things for a while. Upgrading involves tearing down and replacing nodes one at a time, and consumes about 4–5 minutes per node in the best case. When we were at 20 nodes it might take 90–120 minutes, which is in a tolerable range. It was disruptive, but all our…


Usually my posts here are about some thing I think I might have figured out and want to share. Today’s post is about a thing I’m pretty sure I haven’t figured out and want to share. I want to talk about a problem we’ve been wrestling with over the last couple of weeks; one which we can suggest a potential fix for but do not yet know the root cause of. In short, if you are running certain types of services behind a GCE class ingress on GKE you might be getting traffic even when your pods are unready, as…


If you’re a GKE user and you’ve created a cluster within the last six months or so you might have noticed a new option:

You may also have caught the press release announcing this feature back in May, or the announcement last October of container-native load balancing for GKE pods, a related thing. …

Mark Betz

Senior Devops Engineer at Olark, husband, father of three smart kids, two unruly dogs, and a resentful cat.
