Devopsdays Minneapolis 2019

Mickey Boxell
Oracle Developers
9 min read · Aug 27, 2019

Devopsdays Minneapolis started a day early for me when, coincidentally, my seatmate on the flight from San Francisco turned out to be headed to the conference as well. In another coincidence, he worked for Mesosphere, and that happened to be the morning the company officially announced it was becoming D2iQ. The next morning I joined 850 attendees to hear Bridget Kromhout’s introduction to the sixth year of the Minneapolis conference. I was especially grateful to Bridget for helping me connect with Techquity, a Minneapolis-based non-profit focused on diversity in the tech community. I arranged for Techquity to share our space and had a great time working with the folks on their team.

As usual, the conference consisted of a morning full of keynotes followed by an afternoon of open spaces: breakout sessions where people discuss various topics of interest. This conference also included the option of attending hands-on labs as an alternative to the open spaces.

Keynotes

Operational Excellence

The keynotes kicked off with Liz Fong-Jones’s excellent overview of how to achieve operational excellence when working with distributed systems. Liz reviewed how the shift to microservices introduces new challenges related to complexity and environment sprawl, which in turn can overload development and operations teams: there are so many fires to put out that no time remains for innovation and new projects. Liz shared how production excellence, meaning the design of reliable, resilient, and friendly systems, matters both for troubleshooting systems and for the sanity of those tasked with performing the troubleshooting.

The talk touched on the importance of taking a holistic approach to operations that involves multiple lines of business (operations, development, business, sales, support, etc.), because in the event of an incident the entire business is impacted. Liz reiterated the importance of observable systems: ones that can explain themselves to us without deploying additional code and that can explain the variance between good and bad events. Liz also noted that building a perfect system is impossible; by acknowledging that and creating an error budget, an allowance for failure, you leave room for innovation and progress. For example, softening a service level objective related to latency may have a negligible impact on user satisfaction, yet free up considerable time for a developer to work on another project. It was great to hear best practices from such a knowledgeable practitioner.
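To make the error-budget arithmetic concrete, here is a minimal sketch with hypothetical numbers; the 99.9% SLO and 30-day window are my own assumptions, not figures from the talk:

    # Hypothetical numbers: a 99.9% availability SLO over a 30-day window.
    slo = 0.999
    window_minutes = 30 * 24 * 60                      # 43,200 minutes in the window
    error_budget_minutes = (1 - slo) * window_minutes  # downtime you can "spend"

    print(f"Error budget: {error_budget_minutes:.1f} minutes per 30 days")
    # -> Error budget: 43.2 minutes per 30 days

Softening that SLO to 99.5% would grow the budget to 216 minutes, which is the kind of trade-off between reliability targets and engineering time the talk described.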

Security

Yolonda Smith, an infosec analyst from Target, presented on how to build security practices without handcuffing the business. Similar to the error budget described by Liz, Yolonda acknowledged that traditional security practices, for instance a security team building its own set of tools in isolation and mandating their use, can hinder innovation. However, she suggested that this can be overcome by avoiding a reactive approach and shifting security left: incorporating security earlier in the DevOps pipeline and factoring it in earlier in the lifecycle of the product. Yolonda succinctly boiled this down to putting the right controls in the right place in order to add security without slowing down agility. The talk also described security as a spectrum rather than a binary: the notion of something being “secure” depends on the context. It also reminded the audience that security is a lifetime obligation that comes with a decay rate: deployments are less secure on day 10 than they are on day 1.

Failure Events

John Engelman provided a fascinating recap of a large-scale failure event at Target. The incident, which involved cascading failures in a complex distributed system, provided a case study in the importance of observability. At the time, their environment lacked insight into historical data on client request rates, so they were unable to see trends. John reviewed the timeline of incident response over the months the problem persisted. Their response began with immediate problem solving to address the acute symptoms and avoid customer impact. The following week was focused on gaining insight into the root cause of the problem, which was made more difficult by the lack of a baseline of normal behavior against which current system data could be compared. The subsequent weeks and then months were focused on maintaining the stable, degraded state of the system. During this time they also worked to identify points of system complexity and to instrument levers enabling them to make small, precise changes to the system and observe their impact over time. Finally, John shared that they stumbled upon the answer to their problem almost by accident.

Containers

In the most energizing keynote of the conference, Alice Goldfuss ripped apart most other container 101 overviews by explaining how containers actually work and when they are appropriate to use. I would highly recommend watching a recording of this talk for the full effect. Alice defined containers as processes, born from tarballs, anchored to namespaces, and controlled by cgroups. I cannot imagine a more concise way to describe a container. She expanded on the description by sharing that Dockerfiles are used to build a container image tarball, which is run like a process. Namespaces determine what a process can see, a slice of what is happening on the host, and cgroups determine the resources (CPU, memory, storage, etc.) a process can use. Alice suggested that rather than visualizing containers as a whale, we should view them the way a machine views them: through the top command used to monitor running processes.
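One way to see the “container = process” framing for yourself is to inspect a process’s namespaces and cgroups directly from /proc. The sketch below is my own Linux-only illustration, not something from the talk; run it on the host and again inside a container and compare the output:

    # Linux-only sketch: inspect the namespaces and cgroups of this process.
    import os

    # Each entry in /proc/self/ns is a symlink naming a namespace (uts, pid,
    # mnt, net, ...) and its ID; processes in the same namespace share the ID.
    for ns in sorted(os.listdir("/proc/self/ns")):
        print(ns, "->", os.readlink(f"/proc/self/ns/{ns}"))

    # /proc/self/cgroup shows which cgroup hierarchy governs this process.
    with open("/proc/self/cgroup") as f:
        print(f.read())

Inside a container the namespace IDs and cgroup paths differ from the host’s, which is exactly the isolation Alice described.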

Alice emphasized that while containers have strengths, they are not appropriate for every use case. Containers are great for ephemeral, disposable processes, for instance stateless applications: take data in, change it, and send it back out. They are portable: if something runs locally, it will run in production. They are easy to upgrade, iterate on, and roll back. They also simplify troubleshooting because they make it easy to run multiple application versions simultaneously. With all of that said, containers have weaknesses as well. Certain aspects of your environment cannot or should not be stateless. Elements such as databases require persistence to operate properly. While there are workarounds to make them work with containers, it is considerably easier to use a database as a service, which provides automatic failover, scalability, read replicas, multi-region support, and so on, than to hire a bespoke team for something that is not the main part of your business.

Hands-On Labs

A number of hands-on labs took place at the same time as the open spaces. I appreciated the opportunity to get hands-on experience with tools related to several of the keynote topics, with guidance directly from experts.

  • In “Pack your bags: Build Cloud Native Application Bundles” Carolyn Van Slyck and Jeremy Rickard from Microsoft shared how the Cloud Native Application Bundles (CNAB) specification can be used to bundle, install, and manage container-native apps and their coupled services. Specifically, they walked through using Docker, Terraform, and Kubernetes to build a bundle and deploy it with Porter.
  • In “Container Security” Michael Ducy from Sysdig covered how to implement runtime security for containerized environments using the open-source project Falco.
  • In “Chaos Engineering 101” Ana Margarita Medina and Rich Burroughs from Gremlin walked through the process of using chaos engineering within an engineering organization.
  • In “Blameless Postmortems: How to Actually Do Them” Lilia Gutnik and Matty Stratton from PagerDuty shared techniques for performing a blameless postmortem following an incident, along with a real-world example of those techniques in practice.

Open Spaces

Security

Security in DevOps was a hot topic for discussion after being covered in several of the keynotes. I attended one session about where to begin with securing Kubernetes for those new to the platform, another about pipeline security, and a third about how to encourage security best practices throughout your place of business. The three overlapped considerably in content. In attendance were people from Target, Optum (UnitedHealth Group), and various other Minnesota-based tech companies. The majority of companies appeared to be early in their cloud security journey and in need of best practices for where to begin. We all agreed that after making sure the underlying cluster resources were secured, something many cloud providers do reasonably well by default, API authorization through RBAC was a great place to start for cluster security. PodSecurityPolicies would be a great second step.
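As a flavor of that first step, here is a minimal sketch using the official Kubernetes Python client to grant a group read-only access to pods in a single namespace; the “dev” namespace and “developers” group are hypothetical names chosen for illustration:

    # Minimal RBAC sketch (hypothetical names): a read-only Role for pods in
    # the "dev" namespace, bound to a "developers" group.
    from kubernetes import client, config

    config.load_kube_config()  # assumes a local kubeconfig with admin access
    rbac = client.RbacAuthorizationV1Api()

    # Role: may get/list/watch pods, nothing else.
    rbac.create_namespaced_role("dev", {
        "metadata": {"name": "pod-reader", "namespace": "dev"},
        "rules": [{"apiGroups": [""], "resources": ["pods"],
                   "verbs": ["get", "list", "watch"]}],
    })

    # RoleBinding: attach the Role to the "developers" group.
    rbac.create_namespaced_role_binding("dev", {
        "metadata": {"name": "pod-reader-binding", "namespace": "dev"},
        "roleRef": {"apiGroup": "rbac.authorization.k8s.io",
                    "kind": "Role", "name": "pod-reader"},
        "subjects": [{"kind": "Group", "name": "developers",
                      "apiGroup": "rbac.authorization.k8s.io"}],
    })

Starting with narrow, namespaced roles like this and widening only as needed is the least-privilege posture the group was advocating.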

Regarding pipeline and container security, at least one company mentioned performing runtime scans on deployed images. No one mentioned using an admission controller in their environment. A couple of the groups who had made the most progress in securing their environments stressed the importance of image provenance: knowing what an image is, where it came from, and whether it is vulnerable. This included the ability to prevent dependencies with known vulnerabilities from entering the environment in the first place, as well as a centralized, queryable catalog of the dependencies contained in each image, used to generate audit reports for compliance and to pinpoint images affected in a security incident. There was a consensus that for pipeline security the best place to start was with dependency management and image signing. Other pipeline security topics included using PodSecurity and ContainerSecurity policies in the cluster, leveraging IDE plug-ins that perform static code analysis, and using a trusted builder. Certificate rotation and distribution also came up a number of times.

Single Environment

One interesting open space I attended was created to discuss the pros and cons of using a single environment rather than one for production and lesser environments for development and testing. The argument was that the rise of progressive delivery techniques, including canary and blue/green releases, A/B testing, and feature flags, has eliminated the need to release new iterations of software to multiple environments. These techniques reduce the risk of deploying new features by allowing you to limit a release to a subset of users before delivering it more widely. No one came out strongly in favor of the status quo, but there was a general reticence from the group when someone asked if anyone would take this idea back to their leadership and advocate for a major change to their software delivery process.
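To give a flavor of the mechanics behind a percentage rollout, here is a minimal feature-flag sketch of my own (not something presented at the session) that deterministically assigns each user to a bucket:

    # Feature-flag sketch: hash each user id to a stable bucket in [0, 100)
    # and enable the new code path only for buckets below the rollout percent.
    import hashlib

    def flag_enabled(user_id: str, rollout_percent: float) -> bool:
        bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
        return bucket < rollout_percent

    # Release the new behavior to roughly 5% of users first, then widen.
    for uid in ("user-1", "user-2", "user-3"):
        path = "new" if flag_enabled(uid, 5) else "stable"
        print(uid, "->", path)

Because the hash is deterministic, a given user sees a consistent experience as the rollout percentage ramps up, which is what makes canarying in a single environment tractable.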

On-Call

After entering the wrong room and feeling too sheepish to leave, I found myself in a discussion about how best to balance one’s personal life with on-call responsibilities. People talked about life on-call and the impact that being responsive beyond normal office hours can have on your life outside of work. They mentioned how difficult it is to commit to anything during an on-call period because you never know when you might have to drop everything to respond. Another challenge discussed was alert fatigue: becoming both desensitized and hypersensitive to alert chimes. Some described PTSD-like symptoms whenever they heard the distinct chime used to indicate an issue.

While some suggested advocating for systemic changes to make the job easier, others preferred to focus on actions within their control. One example not requiring systemic change was supporting other engineers on your team by trading shifts to accommodate significant events that could not be missed. A number of folks brought up Alice Goldfuss’s contributions to the on-call space, including her talk “Martyrs on Film: Learning to hate the #oncallselfie” and her handbook. According to the group, maintaining an on-call diary with the times, durations, and anecdotes of on-call incidents provides a source of evidence that can be used to justify hiring additional engineers and to prove your value to the company. Reminding management that working on little sleep after an active shift does not produce good results, along with other reminders about the importance of boundaries and quality of life, was suggested as well.

Incidentally, the next week I delivered a talk at a meetup where one of my coworkers happened to be speaking as well. During my presentation, he got an alert from PagerDuty that kicked off what ended up being a multi-day issue. After the meetup, we rode back to San Francisco together and spent only about 30 seconds of an hour-long car ride talking. After working a full day and then some because of the meetup, he sat there with his laptop open, troubleshooting, his phone ringing every few minutes. In the office the next day the ringing continued.

Wrap Up

Devopsdays Minneapolis provided a great space for me to learn about DevOps best practices from leading experts, hear about the latest cloud-native innovations, and connect with others in the industry. I look forward to the next time I have the opportunity to attend.

Mickey Boxell
Product Manager — OCI Container Engine for Kubernetes (OKE)