For the past few years, I’ve been publishing a list of my favorite tech talks from the previous year. As always, the usual caveats apply here, viz., this list isn’t comprehensive and excludes talks from many fields (blockchain, AI/ML, web and mobile development) that I do not have much experience in. However, if you do like systems engineering, cloud computing, performance, Linux and low level programming, this list might be right up your street.
In 2019 I was invited to be the track host at QCon New York. QCon is an industry conference which eschews a CFP process in favor of inviting specific individuals to be “track hosts” to curate a track. I was asked to curate the “Modern CS in the Real World” track, which is about applied CS research in the real world. My interpretation was slightly broader, in that I wanted the track to be about the application of computer science concepts, and not just purely CS research, in industry.
While I’ve served on the program committee of a plethora of industry conferences in the past, this was the first time I’ve been invited to curate a track. I still can’t believe how well this track turned out! Getting not just one but so many incredible engineers to present was nothing short of amazing. The first five talks on this list are the talks I got to curate myself.
1. Let’s Talk Locks: Kavya Joshi
Kavya is an absolutely amazing speaker. Pretty much every talk she’s given has been incredibly technical and presented in a manner that leaves me awestruck. This talk was no different, and goes into the guts of the use of locks in various systems (Linux system calls, the Go programming language) and the performance implications of locks.
2. EBtree — Design for a Scheduler and Use (Almost) Everywhere: Andjelko Iharos
HAProxy is one of my favorite systems of all time. There are so many design choices and implementations that have been pulled off so well that I feel they should become a textbook case of how to do systems engineering “right”. One such underlying feature is the EBtree, an improvement on the typical radix tree.
EBtrees were implemented to manage suspended and active tasks in the HAProxy scheduler (100ns inserts for over 200K TCP connections and 350K HTTP connections per second). This talk goes into the evolution of HAProxy’s internals, the need for a data structure like the EBtree and how it powers the timer, access control and LRU cache implementations.
3. PID Loops and the Art of Keeping Systems Stable: Colm MacCárthaigh
This was a talk I’d solicited after watching another talk (which featured in my best of 2018 list!) that very briefly touched upon PID loops and the use of control theory in building modern distributed systems.
The talk explains in great detail how a framework of “proportional, integral and derivative” components can help one analyze and stabilize large scale systems. It’s thoroughly approachable and draws from a number of real world systems at AWS where these design principles have been applied.
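To make the three components concrete, here is a minimal sketch of a textbook discrete PID controller. It is not taken from the talk: the gains, the `simulate` helper and the toy "system" (a value that moves by exactly the controller's output each step) are assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class PID:
    """Textbook discrete PID controller (illustrative only)."""
    kp: float  # proportional gain: reacts to the current error
    ki: float  # integral gain: reacts to the accumulated error
    kd: float  # derivative gain: reacts to how fast the error changes
    integral: float = 0.0
    prev_error: float = 0.0

    def step(self, setpoint: float, measured: float, dt: float) -> float:
        error = setpoint - measured
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(steps: int = 200, target: float = 100.0) -> float:
    """Drive a hypothetical system value toward `target`.

    The system model is an assumption: each step, the value changes by
    exactly the controller output. The gains are arbitrary but stable
    for this model.
    """
    pid = PID(kp=0.5, ki=0.1, kd=0.05)
    value = 0.0
    for _ in range(steps):
        value += pid.step(target, value, dt=1.0)
    return value


if __name__ == "__main__":
    print(f"value after 200 steps: {simulate():.6f}")
```

In a real system the measured value would come from a metric (queue depth, utilization) and the output would feed an actuator such as a fleet size or a rate limit; picking gains for such a system is the hard part, and this sketch does not address it.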
4. Leaving the Ivory Tower: Research in the Real World: Armon Dadgar
Armon is, hands down, one of the top 5 speakers I’ve seen present, ever. I’ve watched Armon present at a number of meetups in San Francisco over the years and have thoroughly enjoyed every single talk I’ve seen. I was thrilled to bits when Armon accepted my invitation to present on how Hashicorp incorporates CS research into their amazingly successful products like Consul, Vault, Terraform and more.
The talk was nothing short of brilliant and charts out the history of Hashicorp from inception. It pulls back the curtain on how research is incorporated into some of the most commercially successful infrastructure tooling and distributed systems.
While we’re on the topic of Hashicorp, yet another fantastic talk on how Hashicorp turns to academic research to solve cutting edge problems in distributed systems is the talk Using Randomized Communication for Robust, Scalable Systems by Jon Currey. The use case here is the evolution of randomized communication in Consul via the use of local health information for more accurate failure detection. These ideas were formalized in a research paper called Lifeguard. More importantly, the talk also gives concrete pointers on how to evaluate academic research in production for use cases that might invalidate underlying assumptions of research papers.
5. The State of Serverless Computing: Chenggang Wu
Serverless Computing: One Step Forward, Two Steps Back was one of my favorite white papers from 2018. Chenggang Wu, the PhD student behind breakthroughs such as the Anna K/V store, is one of the best up-and-coming researchers in the game.
This was a great talk on the current pitfalls of the serverless model and ways of overcoming them. This was also, incidentally, the best attended talk in the whole track, despite being what I’d like to think of as an “academic talk”.
Another interesting talk that addresses the issue of state in serverless is Stateful Programming Models in Serverless Functions from QCon San Francisco by Chris Gillum. The talk proposes how stateful architectures might be built on FaaS using the “workflow” pattern as well as the “actor” pattern.
6. Performance Matters: Emery Berger
This was a truly mindblowing talk from Strangeloop on measuring performance. The end of Moore’s Law and Dennard Scaling means that we can no longer rely on computers to get faster. This talk goes into the factors that affect performance on modern hardware such as memory layout, caches, branch prediction, TLB misses, instruction prefetching, and surprising factors like one’s username or the use of environment variables (which incurs a copy into the address space of every process that uses it and moves the program stack).
The talk then goes into the tool called stabilizer which repeatedly randomizes program layouts (function address, stack frames, heap allocations etc) during runtime, which when used to evaluate every optimization in LLVM proved that by and large, compiling at -O2 over -O1 was a win, whereas -O3 over -O2 was inconclusive. The talk finally introduces a “causal profiler” called coz that allows for optimization along certain dimensions, accurately predicting the impact of optimizations. It’s definitely one of the best talks I’ve watched on measuring performance in a long time.
This was an absolutely brilliant talk for load-balancing aficionados from SRECon Asia/Pacific on the evolution of client-side load balancing at Twitter. The talk sheds light on the need to load balance sessions as well as requests, the cons of the power of two load balancing algorithm, what the random aperture algorithm looks like in theory and practice, and an improvement called the deterministic random aperture which uses a deterministic subsetting mechanism.
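For readers who haven't met it, the baseline “power of two choices” algorithm mentioned above can be sketched in a few lines. This is a generic illustration rather than Twitter's implementation: the `pick_p2c` and `simulate` names are hypothetical, and "load" here is simply a running count of assigned requests, with no request completion modeled.

```python
import random


def pick_p2c(loads, rng):
    """Sample two distinct backends at random and return the less loaded one."""
    a, b = rng.sample(range(len(loads)), 2)
    return a if loads[a] <= loads[b] else b


def simulate(num_backends=10, num_requests=10_000, seed=42):
    """Compare the load spread (max - min) of P2C against purely random choice."""
    rng = random.Random(seed)
    p2c_loads = [0] * num_backends
    random_loads = [0] * num_backends
    for _ in range(num_requests):
        p2c_loads[pick_p2c(p2c_loads, rng)] += 1
        random_loads[rng.randrange(num_backends)] += 1
    p2c_spread = max(p2c_loads) - min(p2c_loads)
    random_spread = max(random_loads) - min(random_loads)
    return p2c_spread, random_spread


if __name__ == "__main__":
    p2c_spread, random_spread = simulate()
    print(f"P2C spread: {p2c_spread}, random spread: {random_spread}")
```

Note that every client in this sketch can pick from every backend. Aperture-style algorithms, as described above, restrict each client to a subset of backends.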
8. Structured Concurrency: Roman Elizarov
This was a fantastic odyssey from Hydraconf into the evolution of asynchronous APIs in programming languages, starting from .NET (the pioneer of async/await) to the evolution of coroutines in Kotlin. I’m not a Java developer, but the Kotlin coroutines API is the one I personally find the most usable and intuitive of all the different async APIs I have experience with (CSP inspired goroutines in Go, coroutines in Python3, Netty style Promise/Future based APIs). This talk walks through not just the evolution of coroutines in Kotlin but also the rationale behind each evolution and why “structured concurrency” is table stakes for concurrent APIs.
9. Service Discovery Challenges at Scale: Ruslan Nigmatullin
This was a great talk about scaling Zookeeper as a service discovery system. The talk sheds a lot of light on several strategies for scaling Zookeeper, but I feel some of the takeaways are more generically applicable to systems design.
Another important talk for those interested in learning about service discovery at scale is Eventually Consistent Service Discovery by Suhail Patel. The talk covers some of the distributed systems fundamentals behind service discovery and the eventually consistent variant (Raft, SWIM, Gossip Protocols and more). If you’re new to the concept of service discovery, this might be a good place to start learning more about service discovery algorithms.
10. Testing in Production at Scale: Amit Gud
There’s been quite a bit of chatter on the benefits of testing in production in the past couple of years. This was the first talk that explained what testing in production in practice looks like, the challenges presented by massively distributed architectures and how to route traffic based on tenancy. It’s a talk by a practitioner for practitioners.
11. All of Our ML Ideas Are Bad (and We Should Feel Bad): Todd Underwood
This is a hilarious talk on why a lot of “AIOps” is, not to put too fine a point on things, snake oil.
12. How to specify it! A guide to writing properties of pure functions: John Hughes
Property based testing can be incredibly powerful and effective, provided one picks the right property to test. Choosing the right property is a lot easier said than done. This is a great talk from CodeMesh London from the creator of the Haskell QuickCheck library (the OG property based testing library) on how to pick good functional properties.
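As a rough illustration of what a property of a pure function looks like, here is a hand-rolled sketch. It does not use QuickCheck or any property based testing library: the `check` helper is a stand-in random generator loop without shrinking, and `insert_sorted` plus the two properties are examples made up for this post, not taken from the talk.

```python
import random
from bisect import insort


def insert_sorted(xs, x):
    """Pure function under test: return a new sorted list with x inserted."""
    out = list(xs)
    insort(out, x)
    return out


def prop_invariant(xs, x):
    """Invariant property: the result is still sorted."""
    result = insert_sorted(xs, x)
    return all(a <= b for a, b in zip(result, result[1:]))


def prop_model(xs, x):
    """Model based property: agrees with an obviously correct reference."""
    return insert_sorted(xs, x) == sorted(xs + [x])


def check(prop, trials=500, seed=0):
    """Run prop on random sorted lists; return a counterexample or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = sorted(rng.randint(-50, 50) for _ in range(rng.randint(0, 20)))
        x = rng.randint(-50, 50)
        if not prop(xs, x):
            return xs, x
    return None


if __name__ == "__main__":
    for prop in (prop_invariant, prop_model):
        print(prop.__name__, "counterexample:", check(prop))
```

The invariant alone would pass for a buggy function that drops elements, whereas the model based property would catch it. A real library adds shrinking of counterexamples on top of this.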
13. Keeping Master Green At Scale: Sundaram Ananthanarayanan et al.
This was a fantastic paper from EuroSys 2019 on the tooling required in a monorepo world to keep master green at scale at all times. It’s almost like a meta-build system, a load balancer and a speculative executor rolled into one, with a dose of data science thrown in for good measure.
14. Reduce your storage costs with Transient Replication and Cheap Quorums: Alex Petrov
Voting With Witnesses is a ~35 year old paper on reducing storage costs. It’s incorporated into Google’s Megastore and Spanner, and has now for the first time made its way into an open source database, Apache Cassandra. This is a fantastic talk from Hydraconf by one of the core contributors to Cassandra on the use of transient replication and cheap read quorums to guarantee serializable consistency.
15. Taiji: managing global user traffic for large-scale internet services at the edge: David Chou et al.
This was one of my favorite SOSP papers from 2019 which talks about how Facebook models load balancing as a constraint satisfaction problem to generate the optimal dynamic routing table. Also fascinating is the use of dynamic, connection aware routing based on the “social hash”. Even better, the paper goes into the aspects of productionizing such a system: how Taiji is debugged, how they test such a system in production, what kind of tools need to be built to simplify operations and most importantly, what the limitations of this system are.
16. Time Travel: Applying Gradual Typing to Time Types with Clang’s LibTooling: Hyrum Wright
Gradual typing is something that’s enjoying its moment in the sun in multiple language ecosystems, notably Typescript, Python and Ruby. This was a fun talk on what gradual typing in the context of a typed language like C++ might look like. The example used in this talk was the migration of the underlying backing store of a particular type from a collection of integers and floating point types to the much more strongly typed absl::Duration types representing time instants and intervals. The talk is also notable for shedding some light into the infrastructure behind large scale code changes and refactors at Google. I expect much more of such content to be covered in the upcoming book Software Engineering at Google.
17. From Laptop to Lambda: Outsourcing Everyday Jobs to Thousands of Transient Functional Containers: Sadjad Fouladi et al.
This was a fantastic paper from Usenix ATC which demonstrates the true scope and power of serverless. FaaS can be a great compute substrate for “burst parallel” use-cases when marrying ideas from functional programming, graph theory and more. I call this “serverless”-ing a build system and it’s pretty dope.
18. Maple Tree: Liam Howlett
The Linux kernel has two tree implementations: a trie for range scans and a red black tree for the completely fair scheduler implementation. Both have their pros and cons, some of the cons of the red black tree being that it’s not cache efficient and suffers from high lock contention.
This was a fascinating talk from Linux Plumbers on replacing the (augmented) red-black tree currently used to track virtual memory areas in Linux with a “maple tree”. Traditional B trees are disk optimized data structures with large nodes, whereas the Maple tree is an in-memory, RCU safe, range based B tree. Being RCU safe requires a lot of allocations, which in turn requires the nodes to be “small” (120b, two cache lines). Definitely a talk for the data structure nerds out there.
19. CPU Controller on a Single Runqueue: Rik van Riel
Another talk from Linux Plumbers on the implementation of a scalable cgroups (used most famously in “container” technology) CPU controller in the Linux scheduler using a single runqueue instead of hierarchical runqueues. Yet another talk for the data structure nerds out there.
20. Compacting the Uncompactable: Bobby Powers
This talk from Strangeloop accompanies one of the most brilliant papers in recent years: Mesh: Compacting Memory Management for C/C++ applications.
Mesh is a drop in replacement for malloc(3) that compacts the heap without updating application pointers. Put very simply, it achieves this by “merging” similar objects from two pages onto a single page followed by mapping it on two virtual addresses in a thread safe way. Mesh is reported to reduce the memory consumption of Firefox by 16% and Redis by 39%.
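The core trick can be modeled in a few lines of Python. This is a toy model under loud assumptions, not Mesh's actual implementation: the `Heap` class, the dict based "pages" and the integer page ids are all hypothetical, and it ignores thread safety and how Mesh finds candidate pages. It only shows why pointers stay valid: two pages can be merged when their occupied offsets don't overlap, and a (virtual page, offset) pair keeps resolving to the same object afterwards.

```python
class Heap:
    """Toy model: virtual pages map to physical pages holding objects by offset."""

    def __init__(self):
        self.physical = {}    # physical page id -> {offset: object}
        self.page_table = {}  # virtual page id -> physical page id

    def read(self, vpage, offset):
        """Resolve a 'pointer' (virtual page, offset) to the stored object."""
        return self.physical[self.page_table[vpage]][offset]

    def try_mesh(self, v_keep, v_drop):
        """Merge the physical page behind v_drop into the one behind v_keep.

        Succeeds only if no offset is occupied on both pages.
        """
        p_keep, p_drop = self.page_table[v_keep], self.page_table[v_drop]
        if p_keep == p_drop:
            return False
        keep, drop = self.physical[p_keep], self.physical[p_drop]
        if keep.keys() & drop.keys():
            return False  # an offset collides, so these pages cannot be meshed
        keep.update(drop)  # copy objects over at the *same* offsets
        for vpage, ppage in list(self.page_table.items()):
            if ppage == p_drop:
                self.page_table[vpage] = p_keep  # remap; virtual addresses unchanged
        del self.physical[p_drop]  # the physical page is now free
        return True


def demo():
    heap = Heap()
    heap.physical = {0: {0: "a", 2: "c"}, 1: {1: "b", 3: "d"}}
    heap.page_table = {100: 0, 101: 1}
    meshed = heap.try_mesh(100, 101)
    return meshed, len(heap.physical), heap.read(100, 0), heap.read(101, 3)


if __name__ == "__main__":
    print(demo())
```

After the merge, both virtual pages resolve to one physical page, so the application's existing pointers keep working while one page of physical memory is released.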
I feel this is required watching for anyone interested in understanding how memory allocation works. The talk goes into the different memory allocation strategies used by different allocators and what the pitfalls of some of these strategies are.
22. Malloc for everyone and beyond NUMA: Jerome Glisse
There are several applications where the application data is moved around to different physical memory to keep the data local to the processor (CPU, GPU, FPGA etc). The prevalent NUMA model isn’t quite amenable to the heterogeneous compute model of a single application. This was a great talk from Linux Plumbers on the need for a unified API for various memory models.
23. pidfds: Process File Descriptors on Linux: Christian Brauner
This was a super fascinating talk from Linux Plumbers that explains how the kernel traditionally identifies processes and threads globally via process identifiers (pids), which it recycles once a process has been reaped. Pids as such aren’t stable references when shared between processes, since it’s possible that a pid can be recycled without all processes that hold a reference to it being notified.
pidfd is a new API in the kernel that allows callers to maintain a stable reference to any given process.
24. Go scheduler: Implementing language with lightweight concurrency: Dmitry Vyukov
This was a phenomenal talk from Hydraconf on the internals of the Go runtime, touching on the design of the goroutine scheduler, preemption, implementation of goroutine stacks and more.
25. No Moore Left To Give: Bryan Cantrill
That Moore’s Law is dying is no longer disputable. This was one of the best talks from QCon New York that can be succinctly described as the “past, present and future” of Moore’s Law, and I can think of no better speaker than Bryan Cantrill to do justice to a talk of this nature.
26. Speculation & Leakage: Timing side channels & multi-tenant computing: Eric Brandwine
This was a really fantastic talk from re:Invent on the anatomy of a side channel attack and what it means for multi-tenant computing. This talk really builds from the ground up, so even someone without a deep understanding of computer architecture can easily grasp the fundamentals.
27. Preventing Spectre One Branch at a Time: The Design and Implementation of Fine Grained Spectre v1 Mitigation APIs: Zola Bridges and Devin Jeanpierre
This talk goes into some of the Spectre (variant 1) mitigation strategies that can be enforced at the compiler level and the challenges in designing and implementing fine-grained Spectre v1 mitigations in language APIs.
28. Snap: a microkernel approach to host networking: Michael Marty, et al.
This was one of my favorite papers accepted to SOSP 19 describing Google’s network stack that’s been in production for the last 4 years. It describes Google’s replacement for the Linux TCP/IP stack: a user land microkernel based networking stack that achieves “over 3x Gbps/core improvement for RPC workloads, RDMA-like perf of up to 5M IOPS/core.”
If you read the paper, it’s almost like Google turned Linux networking into a microservices architecture, complete with “a control plane” centered around RPC serving, and a data plane, centered around “engines” and more. Also interesting is how they call out the need for rapid releases as one of the driving forces behind Snap.
In practice, a change to the kernel-based stack takes 1–2 months to deploy whereas a new Snap release gets deployed to our fleet on a weekly basis.
One of the more welcome changes in recent years has been the fact that Amazon has become a tad more open about some of their best practices for building highly resilient distributed systems. This talk from re:Invent, AWS’s annual developer conference, is a series of 10 minute talks from engineers at AWS on topics such as measuring latency, static stability, jittering retries, shuffle sharding and more. The Amazon Builder’s Library (you can think of it as AWS’s SRE Book) has very detailed articles on many of these topics.
30. My Love Letter To Computer Science Is Very Short And I Also Forgot To Mail It: James Mickens
A hilarious, quintessentially James Mickens talk.
If there’s only one talk you watch from this list, make it this. It’s also probably the shortest talk on this list to boot.