RootConf 2016 :: Learning From Failure

RootConf is an annual conference in Bangalore, India, organised by the wonderful folks at HasGeek. It was held on the 14th and 15th of April 2016. RootConf is dedicated to #devops in general, but drew folks from all streams of the locally thriving IT and startup industry. This was my first RootConf, although I have attended previous conferences organised by HasGeek such as The Fifth Elephant and JSFoo. HasGeek gets “Hacker Culture”, right from distributing CAT-5e UTP LAN cables as lanyards, to giving out Yubikeys to every participant in a very professionally packed sponsor kit, to having a well-defined code of conduct. This year’s conference had over 250 attendees.

We also had a bunch of volunteers from the Fedora India Project participating in the conference. The Fedora Project was a community sponsor of the conference. As a systems engineer who has spent most of my career close to the platform, dealing with Application Binary Interfaces and writing tooling around software assurance, I found this a good opportunity to learn how developer operations folks and system administrators tackle operating system constraints with popular open source solutions.

This year’s theme was “Learning from failure”, as told by devops folks who face these issues day in and day out. Most of the talks stuck to the theme, while a few patterns related to architecture, people/process hacks and devops emerged. Talks about failures were educational and were also narrative stories which many developers and devops could relate to. Interestingly, most of the projects the speakers talked about using in production did not come from a Red Hat, Oracle or Canonical, but rather from companies like ClusterHQ, HashiCorp, Twitter, LinkedIn, Etsy and Netflix: products like Kafka, ZooKeeper, Flume, Mesos, etcd, Serf, Chaos Monkey, StatsD and many more, which just work in distributed production environments.

So, without going into the specifics of each talk, here is the general gist of the talks and of the discussions around the Fedora booth. Videos of all the talks are being uploaded to HasGeek TV :: https://hasgeek.tv/rootconf/2016

Patterns

  • Failure and Embracing Risk.
  • A truce between developers, devops and system administrators defines the culture, irrespective of the size of your company.
  • Pay attention to configurations, error handling and monitoring.

We can generalize to economic growth. The problem is that these discussions of “growth” are made by people who have never taken risks. - Nassim Nicholas Taleb. Source :: https://www.facebook.com/nntaleb/posts/10153701042473375

Pretty much everything is distributed these days, including your favourite project of the month. My personal take is that no matter what you do, you always have to make a trade-off between stability and agility. In my experience, most outages are due to configuration bugs and improper error handling. Yes, programming languages and language extensions do matter to an extent, if and only if you have verified their proper usage, but understanding your hardware is important too. I would emphasize asking developers to speed up the code; one such pattern is to first identify the bottleneck, such as CPU, local disk I/O or external resources. Understanding RAM also helps: for example, avoiding random reads and accessing memory sequentially could speed things up. You could also use CPU vector instructions. This video by Ulrich Drepper is a good introduction to what I mean by CPU utilization: https://youtu.be/DXPfE2jGqg0

Vectorization for system performance.

Speed up your code on a single node by understanding the hardware you have invested in; simple, efficient code can be amazingly competitive. 10x faster could translate to 10x fewer servers to invest in. Clusterize in the end if needed: clusters are hard and should be a last resort.
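One rough way to act on the advice about identifying the bottleneck before optimizing is to compare wall-clock time with CPU time for a piece of work. The sketch below is only an illustration and is not from any of the talks: the helper names (`profile`, `classify`, `busy_loop`, `pretend_io`) and the 0.5 threshold are my own assumptions. If most of the wall-clock time was spent on the CPU, the code is probably CPU-bound; if not, it was mostly waiting on something else (disk, network, an external resource).

```python
import time


def profile(fn, *args):
    """Run fn once and return (wall_seconds, cpu_seconds) for this process."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn(*args)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


def classify(wall, cpu, threshold=0.5):
    """Label a run by the share of wall-clock time spent on the CPU.

    The 0.5 threshold is an arbitrary cut-off chosen for illustration.
    """
    if wall <= 0:
        return "too-fast-to-tell"
    return "cpu-bound" if cpu / wall >= threshold else "waiting"


def busy_loop(n):
    """Hypothetical CPU-heavy workload: sum of squares."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def pretend_io(seconds):
    """Hypothetical stand-in for disk or network waits."""
    time.sleep(seconds)


if __name__ == "__main__":
    for name, fn, arg in [("busy_loop", busy_loop, 2_000_000),
                          ("pretend_io", pretend_io, 0.2)]:
        wall, cpu = profile(fn, arg)
        print(f"{name}: wall={wall:.3f}s cpu={cpu:.3f}s -> {classify(wall, cpu)}")
```

This only tells you where to look next; a real profiler or system tools would be needed to pin down which disk, lock or remote call is responsible.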

The paper “Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems” is a great read on proper error handling. Having a good process around test and staging environments is vital, irrespective of the size of the organization.
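To make the point about configuration bugs and error handling concrete, here is a small sketch contrasting a handler that swallows every error with one that handles each failure explicitly. Everything in it is hypothetical and chosen for illustration: the `ConfigError` class, the two `load_port_*` functions, the `port` key and the `8080` default are not taken from the paper or from any talk.

```python
import json


class ConfigError(ValueError):
    """Raised when the configuration is missing or invalid."""


def load_port_swallowing(text):
    # Anti-pattern: every failure is hidden behind a silent default,
    # so a typo in the config goes unnoticed until production.
    try:
        return int(json.loads(text)["port"])
    except Exception:
        return 8080


def load_port_strict(text):
    # Handle each failure explicitly and say what went wrong.
    try:
        config = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ConfigError(f"config is not valid JSON: {exc}") from exc
    if not isinstance(config, dict) or "port" not in config:
        raise ConfigError("config must be an object with a 'port' key")
    port = config["port"]
    # bool is a subclass of int in Python, so reject it separately.
    if isinstance(port, bool) or not isinstance(port, int):
        raise ConfigError(f"port must be an integer, got {port!r}")
    if not 1 <= port <= 65535:
        raise ConfigError(f"port {port} is outside 1..65535")
    return port
```

The strict version is easy to exercise in a test or staging environment: feed it a handful of broken configs and check that each one fails loudly instead of falling back to a default.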

Fedora Project, CentOS, Storage and Project Atomic

The audience that visited the Fedora booth was curious about Fedora Atomic Workstation and Ansible right after Kushal’s talk on Fedora Cloud, as well as containers on CentOS, persistent storage in containers and OpenShift.

They were quite curious about Project Atomic because of the stickers. I was quite surprised that many in the devops community did not know about Project Atomic. We spoke about the contributions to the Kubernetes project, Docker and also the Open Container Initiative. For the folks who asked about the Atomic Host, here are the links ::

Atomic Host for Fedora :: https://getfedora.org/en/cloud/download/atomic.html
Atomic Host for CentOS :: https://wiki.centos.org/SpecialInterestGroup/Atomic
Atomic Host for RHEL :: https://access.redhat.com/documentation/en/red-hat-enterprise-linux-atomic-host/version-7/getting-started-guide/

For the curious minds, you could also check out the Fedora effort around the layered Docker build services. Here are a few pictures from the conference, hosted on Flickr ::

Pictures from the Fedora booth and RootConf 2016

Swag ::

The Fedora buttons were an instant hit; for some reason folks wanted more buttons. The DVDs, on the other hand, were frowned upon by a few, although a couple of students picked some up. I guess it has a lot to do with the fact that most laptops do not ship with DVD drives any more. A few folks requested USB sticks; hopefully we will have some budget for them next time around.

I would like to thank the Fedora India community and Red Hat India for supporting me to get to this conference.

Books, papers and further reading ::

  1. Drift into Failure
  2. How Complex Systems Fail
  3. Notes on Distributed Systems for Young Bloods
  4. gRPC, a high performance, open source, general RPC framework that puts mobile and HTTP/2 first, http://www.grpc.io/
  5. Revisiting Distributed Synchronous SGD, http://arxiv.org/pdf/1604.00981v2
  6. What Bugs Live in the Cloud, http://ucare.cs.uchicago.edu/pdf/socc14-cbs.pdf
  7. Maglev: A Fast and Reliable Software Network Load Balancer, https://www.usenix.org/system/files/conference/nsdi16/nsdi16-paper-eisenbud.pdf
  8. Paxos Quorum Leases: Fast Reads Without Sacrificing Writes