A Day In The Life Of Ops — (Part 1)
“If you’re looking for sympathy you’ll find it between shit and syphilis in the dictionary.”
**This is part 1 of what started as a single post, and grew into a series of related posts. Consider this an introductory glimpse into who I am, what I do, and the technical background I’ve come from. Hello world.**
What it’s like:
I walk to the entrance of the office building where I work, burnt out from a 13-hour (overnight) MySQL DB migration of critical patient health information which I painstakingly specced, planned, practiced, meditated on, practiced again, prayed for, coordinated, and executed…completely alone. It was an endeavor I’d never before undertaken (with or without help). It is an endeavor I’d rather not undertake again.
When I swing the door open and cruise into the office around noon (the boss said I could sleep until noon, but had to be at work and on-call the next day, having worked a total of 21 hours the day before…thanks, dude), I half-expect that I’ll be walking into a room full of sunshine and the warm smell of freshly baked apple-cinnamon muffins, compliments of Martha Stewart herself. The developers and junior admins will be standing up to applaud me, beaming with expressions of love and appreciation, a glint of teary-eyed admiration in their eyes. There will be rose petals falling from the sky, paving the way to my desk, and I will solemnly make my way to my little corner as I humbly nod, acknowledging each of my colleagues to the sound of trumpets and other assorted brass instruments that herald the return of a triumphant hero from the grueling trials of battle. It would sound something like the “Victory Celebration” theme that plays in Episode VI: Return of the Jedi, during the Ewok and Rebel celebration after the Battle of Endor. Somewhere in there would also be a beautiful damsel who swoons, a hug from my mother, a pat on the back from my dad, my high school varsity coach approvingly giving me the thumbs up, and a unicorn. ← The ordering of things in that last sentence is very psychologically concerning. Let’s create a card for that after the next scrum meeting and include the fix in our upcoming sprint.
Instead, I’m greeted by the stale, amalgamated scent of office snacks, coffee, and human breath, and by the sound of furious keystrokes at each of my colleagues’ desks. I can also hear my own sense of humor diabolically pointing and laughing at the rest of my psyche for having expected — or hoped for — any sort of warm welcome or acknowledgement. I set my bag down at my desk, realizing I need to get myself a new backpack and noting that my desk is dusty and filthy. I walk over to the coffee machine. Nobody bothers to look up from their screen.
Great. No f#cking coffee pods.
I’ll have to go negotiate something with the medispa next door; these days, I feel like trading desktop support (summarily stated as bullshit) for favors like a cup of substandard coffee is beneath me. I’m a cloud architect/engineer now! I don’t need to concern myself with this sort of plebeian nonsense, unless the request is coming from management or HR. As the lead Ops/IT guy in the organization, you can’t expect to get something for nothing. The special hell that IT folks gain exclusive membership to is a world in which no good deed goes unpunished. When you do a favor for someone, it is inevitable that they’ll fuck up some new, vaguely related thing. Whether that new trainwreck takes place in a day, a week, or a month, you’ll be the one they hold responsible for it.
Why? Because no good deed goes unpunished.
…Because you’re the IT/Ops guy, and if a system is composed of circuit boards, is powered by electricity, can beep, possesses flashing lights, resides on an electronic screen, or requires any input/interaction via keyboard or touch-screen keypad, it is your job to fix it. Somehow, this includes the water cooler and the office microwave too. Fax machines and copiers are to be avoided like the plague.

Having sorted and secured my caffeination requirements, I settle in behind my keyboard and hop on Reddit while sipping my coffee-flavored water. Nothing makes me feel better than aimlessly spelunking through Reddit comment threads where people essentially take time out of their workday to shit all over one another, somehow diverging from the OP’s topic of US politics and tangentially arriving at insults about one another’s gardening skills. There’s no liberation like that of feeling fully self-expressed. We’re going to revisit that last statement in a later post.
My blissful oblivion is interrupted by my boss. He has somehow managed to sneak up behind me with the sort of stealth only a long-time middle management ninja can possess, and is now clumsily initiating what appears to be a brief sequence of small-talk subterfuge which will inevitably segue into a request for some new/other bit of monolithic, monstrous work. I say as few words as possible, arms crossed, waiting for him to talk himself into a conversational cul-de-sac, the only departure from which will be for him to spit out what he wants next from me.

“So, we didn’t get any calls or complaints from the customers this morning. The devs haven’t mentioned any issues they’ve noticed. Most importantly, the interns report no new or suspicious log messages/errors in Graylog.”
“…yeah…no, just the usual fucking set of errors & failures we log and ignore by the thousands each day,” I think to myself.
I look up from my coffee cup as I sip slowly while simultaneously clenching my jaw, replying, “I know, I was online watching each and every request coming over the wire. I saw traffic start to trickle in from our East Coast customers around 5:30 AM our time (PST).” I say this with all of the arrogant nonchalance and ironic detachment I can muster up.
On the very long list of things that my non-technical boss doesn’t understand (or grasp the value of) is the time I’ve spent painstakingly instrumenting every last aspect & metric of any worth across our app servers, load balancers, and InnoDB/MySQL engines. I also built out centralized/aggregated logging by myself, without any spec or input from any of the devs. Total situational awareness makes my job (of wizarding, unicorn wrangling, and DevOps magic) possible. This is doubly true when I’m the sole engineer correlating system events to application-logged failures. The tricky part is knowing when too much data is too much, since finding the signal gets harder in proportion to how much systems noise you’re listening to. You don’t want to be the boy who cried wolf, or the canary in the coal mine, or any other such shitty cliche involving animals and chirping or crying sounds.
Catastrophizing false positives could cause you to lose face and credibility. And with engineers — who incidentally happen to be some of the most opinionated, stubborn, intelligent, and creative people on the planet — credibility is a very hard thing to earn back.
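(For the curious: stripped of the wizard hat, that instrumentation isn’t magic. Below is a minimal sketch of the flavor of it, in Python with PyMySQL, polling a few MySQL/InnoDB status counters and printing structured log lines that a shipper could forward to something like Graylog. It is illustrative only; the host, user, and counter list are placeholders, not my actual setup.)

```python
# Minimal flavor of the home-grown DB instrumentation described above -- illustrative
# only. Host, user, password, and the counter list are placeholders.
import json
import time

import pymysql

INTERESTING = (
    "Threads_connected",
    "Slow_queries",
    "Innodb_row_lock_waits",
    "Innodb_buffer_pool_wait_free",
)

def poll_mysql_status(conn):
    """Grab a handful of global status counters worth graphing and alerting on."""
    with conn.cursor() as cur:
        cur.execute("SHOW GLOBAL STATUS")
        status = dict(cur.fetchall())
    return {name: int(status[name]) for name in INTERESTING if name in status}

if __name__ == "__main__":
    conn = pymysql.connect(host="db.internal", user="monitor", password="...", autocommit=True)
    while True:
        # One structured line per poll; a log shipper forwards these to the aggregator.
        print(json.dumps({"ts": time.time(), **poll_mysql_status(conn)}))
        time.sleep(10)
```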
“So anyhow,” he continues, “…good job. It seems to have gone off without a hitch. Now, I know you’re a little tired, but I was meaning to ask you about…”
What happened:
I don’t think I’ll ever be able to impress upon this man the sheer magnitude or the precision of the work that was accomplished, or the risks that I had to calculate, factor in, and then mitigate. This was my career equivalent of landing the Philae lander on a moving comet after slingshotting it in figure eights around half the solar system from very, very far away. This was far beyond the metaphor of a sniper simply landing a precision shot. For me, at that stage in my career and with my level of experience, this was Ops rocket science, and I had to teach it to myself. Here’s a quick list of the fuckery I endured:
- The data was hosted in Amazon’s RDS service and, for compliance reasons, needed to be moved to private instances. RDS wasn’t covered as a HIPAA-compliant service back then. RDS itself is a black box, meaning no system-level access to the underlying MySQL host. Good luck getting a systems-level guarantee that you’ve exported everything.
- The employees who’d designed and built the DB and its underlying data structure were long gone, and they’d taken all their knowledge & documentation with them. The schema is terribly structured, almost schizoid, and monolithic in its design (with the most important table storing 120x more rows — in the millions — than the next 3 largest tables, on average), and the damned thing is poorly indexed.
- Given that (at this point in AWS’s history) you couldn’t easily set up a hand-built EC2 MySQL instance in a VPC as a replication slave of the RDS instance, there was no version of this data migration in which I could have shuffled, coupled, mirrored, and cobbled together all of the moving parts so that we could avoid downtime through live replication and a quick cutover.
- In order to safely conduct the migration, we’d have to ensure that all connections from all dependent services and applications we run were killed/blocked. If, in the middle of the migration, something somewhere contended with my database locks and tried to update or insert even a single row, we *might* be FUBAR. What’s fun about this is that we’ve got some old, brittle services running in dark, dusty corners of our infrastructure that we flipped on and forgot about. Some are clever enough to possess retry logic, and in those cases, they persistently reopen the connections I try to kill. They are not fully documented. (A rough sketch of the kind of connection-reaping loop this requires follows just after this list.)
- The sheer size and amount of data, the monolithic architectural philosophy/style/design pattern (more accurately described as ignorance and laziness) which permeates both the database and the application (monolithic Java methods!), and the way RDS was set up for “production” from the outset meant that even if I were able to get all of the data over the wire safely and completely, we’d still be fucked if, in the morning, the users started using the app, only to find out after an hour or two that something had *been* wrong all along. What I mean to say is that while reverting to the old RDS DB would be possible, our clients would lose all the data (patient health information = billable medical files/visits) they’d added to the system since the morning the migration completed. At best, I could try to replay the new binlogs against the old DB, and/or carve up the data manually, but that would mean complete and total downtime. It wouldn’t be lost forever, but it wouldn’t be available immediately either.
- We lacked any form of internal tooling for programmatic validation of the exported dataset. Thankfully, I found out about pt-table-sync, and knew enough Perl to manually hack it so that I could work around settings that Percona’s version of the tool required (which RDS did not permit you to set in MySQL). (A sketch of what that validation boils down to also follows after this list.)
- Additionally, there was no guarantee that the private EC2 MySQL host I’d set up would be able to withstand the performance demands that our shitty code, terrible indexing, and unfriendly data structure might make of it. Since none of the brilliant employees prior to my arrival had ever thought about load-testing or traffic reproduction (and hadn’t helped me AT ALL with getting some sort of real-traffic-based load-testing framework in place), I actually lacked the ability to fire-test the new hosts I’d built and tuned. Given the deadline hung over my head (compliance issues are really, really, really urgent), I lacked the time required to build out any reliable load test.
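To make the connection-killing bullet concrete, here is a minimal sketch of the sort of reaper loop I mean, written with PyMySQL purely for illustration (my actual tooling was different). The endpoint, account names, and allow-list are placeholders, and note that plain KILL is restricted on RDS, where the managed mysql.rds_kill procedure is the usual route.

```python
# Illustrative connection-reaper run during the maintenance window -- not my actual
# script. Hostnames, users, and the allow-list below are placeholders.
import time

import pymysql

ALLOWED_USERS = {"migration_user", "rdsadmin"}  # accounts permitted to stay connected

def reap_foreign_connections(conn):
    """Kill any session not owned by the migration. Retry-happy services will simply
    reconnect, so this has to keep running for the entire maintenance window."""
    with conn.cursor() as cur:
        cur.execute("SELECT id, user, host FROM information_schema.processlist")
        for conn_id, user, host in cur.fetchall():
            if user in ALLOWED_USERS:
                continue
            try:
                # On RDS, plain KILL is restricted; you would call the managed procedure:
                # cur.execute("CALL mysql.rds_kill(%s)", (conn_id,))
                cur.execute("KILL %s", (conn_id,))
                print(f"killed {user}@{host} (thread {conn_id})")
            except pymysql.err.OperationalError:
                pass  # the session probably went away on its own; move on

if __name__ == "__main__":
    conn = pymysql.connect(host="db-endpoint.example.internal",
                           user="migration_user", password="...", autocommit=True)
    while True:
        reap_foreign_connections(conn)
        time.sleep(2)
```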
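And in the same spirit, here is roughly what “programmatic validation of the exported dataset” boils down to: compare per-table row counts and CHECKSUM TABLE results between the source and the target. This is a toy stand-in for what pt-table-checksum/pt-table-sync do far more rigorously; the endpoints, credentials, and schema name (“ehr”) are made up.

```python
# Toy post-migration validation: fingerprint every table on both sides and diff them.
# Endpoints, credentials, and the schema name ("ehr") are placeholders.
import pymysql

def table_fingerprints(host, user, password, database):
    """Return {table: (row_count, checksum)} for every table in the given schema."""
    conn = pymysql.connect(host=host, user=user, password=password, database=database)
    fingerprints = {}
    with conn.cursor() as cur:
        cur.execute("SHOW TABLES")
        tables = [row[0] for row in cur.fetchall()]
        for table in tables:
            cur.execute(f"SELECT COUNT(*) FROM `{table}`")
            row_count = cur.fetchone()[0]
            # Caveat: CHECKSUM TABLE values can differ across row formats/versions,
            # which is a big part of why Percona's tooling exists in the first place.
            cur.execute(f"CHECKSUM TABLE `{table}`")
            checksum = cur.fetchone()[1]
            fingerprints[table] = (row_count, checksum)
    conn.close()
    return fingerprints

if __name__ == "__main__":
    old = table_fingerprints("old-rds-endpoint", "ro_user", "...", "ehr")
    new = table_fingerprints("new-ec2-mysql", "ro_user", "...", "ehr")
    for table in sorted(set(old) | set(new)):
        if old.get(table) != new.get(table):
            print(f"MISMATCH {table}: rds={old.get(table)} ec2={new.get(table)}")
```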
Anyone who has ever had to tune databases knows that one incorrect configuration parameter, one unhappy buffer option, or just 100MB of misallocated memory could expose some sort of edge-case doomsday that might blow the whole database up. Now, imagine that the integrity, security, and preservation of the data in this database — patient health information — comes with grave legal implications. Also, remind yourself about the personality profile of doctors in general, and imagine how they’d react if the application they require for seeing patients were down for longer than two minutes.
Also note: I am not a DBA. Up until this point in my career, my administration of MySQL had gone as far as installing it, querying tables, and setting up replication. That’s it.
Basically, I was shooting from the hip. With blindfolds on. And one hand tied behind my back. Standing on one leg. In a hailstorm. And instead of the hail being solidified water, it was turds. Lots of big turds angrily and ruthlessly raining down upon me from the sky, without an umbrella, set to the soundtrack of thunderclaps and angry rumbles from the depths of my ulcerated stomach.

In the mind of any self-respecting NOC technician, sysadmin, or Ops guy, I was up shit creek without a paddle…or a canoe. No one was there to throw me a life vest, or provide me with a helmet. Considering the risks involved, most folks I know would have walked away from this without a moment’s hesitation.
But this is not so in what I have come to understand as the practice of DevOps. And in the next few posts, I’ll tell you exactly how and why. I’ll also lay out how I was able to plan and complete the project safely, sanely, and without any residual troubles. And most importantly (yes, I started a sentence with the word “and”), I’ll talk about some of the lessons I learned which remain foundational in my personal style of DevOps practice.
For what it’s worth, this all transpired a few years ago, and that stack continues to purr like a kitten - without data loss or security breach - to this very day.