Best of 2018 in Tech Talks

For the past two years, I’ve been publishing a list of my favorite tech talks from the previous year (here’s the 2016 edition of this post and here’s the 2017 edition).

This list isn’t by any means comprehensive and I’m certain there are many tech talks from 2018 that I’ll only discover much later. But among the talks I attended or watched, these were some of the best (in no particular order).

1. The Future of Microprocessors, Sophie Wilson

Sophie Wilson, famed pioneer of the original ARM chip, seems to hold the belief that Moore’s Law is coming to an end (along with certain others listed later in this post). This was a phenomenal talk from JuliaCon that goes into the history, evolution and future of microprocessors.

Video

2. The Hurricane’s Butterfly: Debugging pathologically performing systems, Bryan Cantrill

Two of the talks that made it to my previous year’s list were Zebras all the way down and Debugging under fire: Keeping your head when systems have lost their mind. This was a talk in a similar vein and delivered with the quintessential Cantrillian flair, vim and vigor we’ve grown to expect. Software is built as a stack of abstractions, with seemingly minor issues in one layer (the butterflies) having the potential to transform into systemic pathological performance issues in another (the hurricane). Given such a hurricane, how does one find the butterflies?

Slides Video

3. Close Loops & Opening Minds: How to Take Control of Systems, Big & Small, Colm MacCarthaigh

Admittedly I haven’t watched all the talks from AWS re:Invent, but of the ones I did watch, this was possibly my favorite talk. It lays down some design principles for building highly stable and reliable systems (such as control planes).

1. Checksum all the things
2. Cryptographic Authentication
3. Cells, Shells and “Poison Tasters”
4. Asynchronous Coupling
5. Closed Feedback Loops
6. Small Pushes and Large Pulls (for configuration)
7. Avoid Cold Starts and Cold Caches
8. Throttles
9. Deltas
10. Modality and Constant Work

It was fascinating to learn how certain anti-patterns and unintuitive designs might, in fact, help improve the stability of systems. Possibly the most interesting part of the talk was the idea that stable control systems require a “PID loop” — proportional, integral, derivative components, and that being able to look at a system’s design and spot if it’s missing any one of these is a superpower. This is the first time I’m hearing about this “PID loop”; the talk recommends the book Designing Distributed Control Systems to learn more about how principles of control theory can apply to distributed systems engineering.
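For readers who, like me, are new to the idea, here is a minimal sketch of a textbook discrete PID controller. It is not taken from the talk: the class name, the gains and the toy "plant" being controlled are all assumptions chosen purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class PIDController:
    """Textbook discrete PID controller (illustrative, not from the talk)."""
    kp: float  # proportional gain: reacts to the current error
    ki: float  # integral gain: reacts to the accumulated past error
    kd: float  # derivative gain: reacts to how fast the error is changing
    integral: float = 0.0
    previous_error: float = 0.0

    def update(self, setpoint: float, measured: float, dt: float) -> float:
        """Return the control signal for one time step of length dt."""
        error = setpoint - measured
        self.integral += error * dt
        derivative = (error - self.previous_error) / dt
        self.previous_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


def simulate(steps: int = 500, dt: float = 0.1) -> float:
    """Drive a toy system towards a setpoint of 100 and return its final value."""
    # The gains below are hypothetical values picked so the toy loop settles.
    controller = PIDController(kp=0.8, ki=0.3, kd=0.05)
    value = 0.0
    for _ in range(steps):
        control = controller.update(setpoint=100.0, measured=value, dt=dt)
        value += control * dt  # toy plant: the output integrates the control signal
    return value
```

Dropping any one of the three terms changes the behavior of the loop, which is roughly the "spot what is missing" skill described above.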

It was also interesting to learn the hierarchy of priorities at AWS: security, durability, availability, speed.

Video Slides

4. A Golden Age for Computer Architecture, David Patterson and John Hennessy

This was a fantastic talk on the history and evolution of microprocessors: the move from CISC to RISC machines, then the end of Moore’s Law and Dennard scaling, which in turn presents unprecedented opportunities for advances in the “domain specific architecture” space. “Domain specific architecture” includes both advances in hardware (from neural network processors for machine learning such as TPUs, to NVIDIA’s GPUs, to FPGAs) and domain specific software (like Swift for TensorFlow). The talk concludes with the story of the inception and growth of the RISC-V ISA.

For those who prefer a written article to a video, this month’s Communications of the ACM has an article authored by Hennessy and Patterson (authors of the famous book Computer Architecture) on this very topic. Moore’s Law for transistors might’ve ended, but there appears to be a Moore’s Law-esque growth in the number of machine learning papers being published in recent years.

Video [Hennessy at Stanford, ~1hour]

Video [Patterson at Facebook’s @Scale Conference ~30 mins]

5. Safe Client Behavior, Ariel Goh

This must be obvious to distributed systems old hands, but it’s worth reiterating that clients are an important part of a distributed system and must thus participate in resiliency efforts. This is a fantastic talk from SRECon Asia/Australia about best practices for client design for improving the resiliency of the entire system. Techniques proposed include jittering client requests, adding randomness so all clients don’t accidentally end up synchronizing when they make requests, when not to retry, jittering retries, retries with exponential backoffs (and concomitant gotchas), “retry budgets” (like error budgets), moving some of the control to the server and establishing a feedback loop between the server and clients, adaptive throttling on clients and much more.
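To make a few of these techniques concrete, here is a minimal sketch of capped exponential backoff with jitter combined with a simple retry budget. It is not code from the talk: `RetryBudget`, `backoff_delay`, `call_with_retries` and all the default values are hypothetical names and numbers I picked for illustration, and treating `ConnectionError` as the only retryable failure is an assumption.

```python
import random
import time


class RetryBudget:
    """Allow retries only while they stay below a fraction of all requests."""

    def __init__(self, ratio: float = 0.1) -> None:
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def try_spend(self) -> bool:
        """Consume one retry if the budget allows it, otherwise refuse."""
        if self.retries + 1 > self.ratio * self.requests:
            return False
        self.retries += 1
        return True


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Jittered delay: uniform in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(operation, budget, max_attempts=4, sleep=time.sleep):
    """Call operation, retrying connection errors while the budget allows."""
    budget.record_request()
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not budget.try_spend():
                raise  # give up instead of piling more load onto the server
            sleep(backoff_delay(attempt))
```

The randomized delay keeps clients from retrying in lockstep, and the budget caps how much extra load retries can add when the server is already struggling.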

Video

6. How to Serve and Protect (with Client Isolation), Frances Johnson

This is another excellent talk from SRECon Asia/Australia about protecting a service like Google Maps (with a plethora of internal and external clients) from overload. The talk touches upon problems such as system overload (and attendant problems like downstreams being oblivious of a system’s overload), cascading failure, the pitfalls of static quotas, the pros and cons of implementing graceful degradation techniques at different layers of the stack (client, edge, frontend, backend).

Video

7. Applied Performance Theory, Kavya Joshi

This is an incredible talk (as always) by Kavya from QCon London on how to use performance modeling techniques to be able to answer questions such as what additional load a system can support without degrading response time and how to detect a system’s utilization bottlenecks. The talk first walks us through a typical example of web server to demonstrate how to analyze performance in “open systems” followed by an example of “closed systems”, and how both hinge on different assumptions and require different techniques to analyze.
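As a rough illustration of the kind of question such a model can answer, here is a sketch based on the textbook M/M/1 queue for an open system with a single server. The choice of model and the function names are my assumptions, not something lifted from the talk.

```python
def utilization(arrival_rate: float, service_time: float) -> float:
    """Utilization law: fraction of time a single server is busy."""
    return arrival_rate * service_time


def mean_response_time(arrival_rate: float, service_time: float) -> float:
    """Mean response time of an M/M/1 queue: S / (1 - utilization)."""
    rho = utilization(arrival_rate, service_time)
    if rho >= 1.0:
        raise ValueError("arrival rate exceeds capacity; the queue grows without bound")
    return service_time / (1.0 - rho)


def max_arrival_rate(service_time: float, response_time_target: float) -> float:
    """Highest arrival rate that keeps mean response time within the target."""
    if response_time_target <= service_time:
        raise ValueError("target must exceed the bare service time")
    # Solving S / (1 - rate * S) <= T for rate gives rate <= 1/S - 1/T.
    return 1.0 / service_time - 1.0 / response_time_target
```

Under these assumptions, a server with a 10 ms service time responds in about 20 ms at 50% utilization but about 100 ms at 90%, which is why response time degrades so sharply near saturation.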

Slides Video

8. Amazon Aurora: Design considerations for high throughput cloud-native relational databases, Sailesh Krishnamurthy

This was an absolutely cracking talk from Facebook’s @Scale Conference on some of the design decisions and tradeoffs that underpin Amazon Aurora, the storage engine powering many popular AWS database offerings. Aurora is claimed to auto-scale up to 64TB per database instance and deliver high performance and availability with up to 15 low-latency read replicas, point-in-time recovery, continuous backup to S3, and replication across three Availability Zones.

There are two [1] [2] accompanying white papers published by Amazon on Aurora. The talk references many points from the second paper in particular, with the main takeaway being that distributed consensus kills performance and that local state can actually be a good thing. With an immutable log as the source of truth, Aurora avoids distributed consensus for membership changes by leveraging some “oases of consistency”, using epochs as guards as a form of write quorum and avoiding quorum reads altogether. It’s interesting that in an era where transactional systems are making something of a comeback and Google is preaching that we should choose strong consistency whenever possible, Amazon picks different tradeoffs.

Video

9. Future of FoundationDB Storage Layer, Steve Atherton

This was an exciting talk on the future of the Storage Layer of FoundationDB from the FoundationDB Summit. FoundationDB is a distributed, ordered key value store, but the storage layer itself is non-distributed and is accessed by a single process from a single thread. The talk goes into the requirements of a new storage engine and the non-requirements (concurrent writers, low commit latency), then explores the pros and cons of several data structures du jour (B+ trees, LSM trees) and the reasons behind picking Redwood, a versioned B+ tree.

Video

Another great talk from the FoundationDB Summit was the one on the document layer, the video of which can be found here.

10. Autonomous Testing and the Future of Software Development, Will Wilson

First of all, Will is possibly one of the best speakers I’ve ever watched speak (his previous talk on Testing Distributed Systems with Deterministic Simulation from Strangeloop 2014 is one of my all-time favorites).

This is a phenomenal talk from the inaugural FoundationDB Summit which makes a pretty compelling case for an AI-driven approach to testing. The talk identifies 3 main problems with testing: fragility (your test comes to rely on properties of your system that are incidental — that are not the ones you thought you were testing), lack of exhaustiveness and flakiness.

The talk argues that tests are great for turning up regressions but almost completely useless for detecting unknown-unknowns. The talk goes on to deem all of the aforementioned problems as the symptoms, with the real underlying problem being that testing is still totally manual. Even “automated testing” only ever involves Jenkins running a test suite manually authored by humans. The talk then lays down the dream of autonomous testing as the need for automated creation of tests, in addition to automated execution of tests.

Video

11. Designing Distributed Systems with TLA+, Hillel Wayne

This was a wonderfully accessible talk from CodeMesh on the use of formal specification for designing distributed systems. Think of it as a gentle introduction to TLA+. Quotable takes include:

Give a system enough time and it will do everything, including fail.
Code is not design. Code does not show you how your system works. It’s just your implementation; it’s not supposed to be your design, it cannot be your design. And if you think you can design a system and understand a system with just the code, I have a bridge to sell you, and I’m going to sell it you twice, concurrently.

Video

12. What We Got Wrong: Lessons From The Birth Of Microservices at Google, Ben Sigelman

This was a whirlwind talk about the mainspring of distributed computing at Google, touching on everything from what Google got right to practices that weren’t quite, but had strong parallels to, what we know as “microservices” these days. The talk highlights where the broader industry actually does certain things better than Google did (such as service meshes), when and why emulating Google’s technological choices and practices doesn’t work well for the rest of us, and why it becomes especially important to be able to answer certain kinds of questions before adopting architectural paradigms du jour (such as “serverless”).

Video Slides

13. Distributed Log-Processing Design Workshop, Laura Nolan, Phillip Tischler, Salim Virji

This is an absolutely incredible talk on the practicalities of architecting a large scale distributed system, including how to approach scaling, how to evaluate tradeoffs along various axes as well as tons of back of the envelope calculations to justify each decision.

The SRE Workbook (freely available online) from Google has an entire chapter called Non-Abstract Large System Design dedicated to this very topic, and I’ve heard it is the crucial interview in the entire Google SRE interview loop, as it’s the one that is statistically most likely to trip up a candidate followed by the coding interviews. Personally, I think this isn’t just relevant to SREs but should be required reading for everyone building and operating distributed systems.

Unfortunately, I haven’t been able to find a video for this.

Slides

14. Load Balancing at Hyper Scale, Alan Halachmi and Colm MacCarthaigh

This is a really fascinating talk from Facebook’s Networking @Scale conference about the evolution of load balancing at AWS. It sheds light on Hyperplane, a system that underlies AWS’s S3 Load Balancer, VPC NAT Gateway and PrivateLink, and more. I especially enjoyed learning about the SHOCK principle proposed (Self Healing or Constant Work), which suggests that when you build a system, it should be resilient to even large shocks. Or put differently, “if something big changes, the system should be able to carry on as normal”. The talk proposes that:

1. Constant effort and recovery from failure are the natural states
2. Always operate in repair mode. When a node fails, Hyperplane actually does less work!
3. When designing large scale systems, we don’t want them to be complex. We want them to be as simple as possible. To this end, we want as few modes of operation as possible (Hyperplane has no retry mode, for example. It piggybacks on TCP’s innate retry mechanism). Piling on different modes of operation results in a combinatorial explosion of complexity, which makes the system incredibly hard to test. We want a system that is consistent and always performs the way we expect.
4. The talk also introduces the idea of shuffle sharding, a DDoS mitigation technique (where isolation is the primary mitigation technique) that’s now widely deployed across many AWS services.
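The idea behind shuffle sharding, mentioned in the last point, can be sketched in a few lines. This is only my illustration of the general technique, not how AWS implements it: the function name `shuffle_shard`, the hashing scheme and the node names are all assumptions.

```python
import hashlib
import random
from typing import List, Sequence


def shuffle_shard(customer_id: str, nodes: Sequence[str], shard_size: int) -> List[str]:
    """Deterministically pick shard_size distinct nodes for a customer."""
    # Seed a private random generator from a hash of the customer id, so the
    # same customer always maps to the same small subset of nodes.
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return sorted(rng.sample(list(nodes), shard_size))


# Hypothetical fleet of eight nodes, with two nodes per customer.
NODES = ["node-%d" % i for i in range(8)]
```

With eight nodes and two per customer there are 28 possible shards, so a customer sending abusive traffic takes down both nodes only for the customers that happen to share exactly the same pair; everyone else keeps at least one healthy node.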

Video

15. Isolation Without Containers, Tyler McMullen

One of my areas of interest is what I’ve been describing to friends as the “spectrum of compute”: VMs, microVMs, nested VMs, containers (and flavors of “sandboxed containers” like Kata containers) and “serverless” (or functions as a service). I’m especially interested in the “spectrum of isolation” these offer, from strict process level isolation to isolation via a sandbox such as V8. Several technologies have emerged in recent years in the virtualization space, from gVisor (a sandbox which implements a subset of the Linux kernel API in user space) to Firecracker, a virtual machine monitor that’s built for running lightweight and serverless workloads in microVMs, itself built on top of crosvm (Chrome OS’s Virtual Machine Monitor). One of the most fascinating developments in this space is WebAssembly. Initially designed as a target for native code to run on browsers, WASM is now being leveraged by CDN providers to run arbitrary code without any form of process based isolation. While I still think the jury’s out on whether this form of isolation truly passes muster, this was a fascinating talk from Strangeloop about this very topic which explains the features of WASM that make this even somewhat tenable.

Video

16. How C++ Debuggers Work, Simon Brand

The title is pretty self-explanatory. The talk explains everything from what ELF binaries are, to DWARF symbols, the mechanics of how breakpoints work, what stepping through code truly entails, working with multi-threaded applications in a debugger and a lot more. This is definitely one of my top three talks on this list of incredible talks.

Video

17. A Philosophy of Software Design, John Ousterhout

The book A Philosophy of Software Design was hands down the best technical book I read in 2018. Every single chapter in the book is worth its weight in gold, but the chapter on deep modules is probably the one I’ve cited the most. The talk touches upon some of the main ideas and red flags introduced in the book, but if I were you, I’d just buy the book and be done with it.

Video Book

18. Clangd: architecture of a scalable C++ language server, Ilya Biryukov

One of the most interesting developments from Microsoft in recent years has been the Language Server Protocol. The 5.0 release of the clang compiler introduced Clangd, LLVM’s implementation of the Language Server Protocol, which provides features like code completion, fix-its, goto definition, renaming etc. for clients such as C/C++ source editors. This was a good talk from CPPCon that touched on some of the limitations of libclang and explained the motivations behind the development of Clangd as well as its general architecture.

19. Coroutine Representations and ABIs in LLVM, John McCall

Coroutine support in LLVM was first added by Microsoft’s Gor Nishanov and was designed around the needs of the C++ Coroutines TS. This was an incredible talk from the LLVM Developer’s Meeting that goes into some of the pros and cons of various implementation considerations, from ceding control (context switching, coroutine splitting with shared resumption and per yield resumption functions), to storing local state (stackful coroutines, side allocation, stack cohabitation), to yielding data, as well as the challenges of generating code for language features powered by coroutines such as generators. The talk then addresses some of the details of a different type of lowering called “returned continuation flavor” for the Swift programming language, where some of the optimizations happen at Swift’s SIL layer and not directly at the LLVM level.

Video

PS: All talks from the LLVM Developer’s Meeting are deeply educational. I’ve only watched this one talk, but I’m certain I’d happily recommend all of the others too, once I get around to watching them.

20. Developing Kotlin/Native infrastructure with LLVM/Clang, Nikolay Igotti

Kotlin/Native is a super interesting development of recent years which allows Kotlin code to be compiled down to platform binaries (ELF, Mach-O, WASM etc.), so it can be run natively in addition to being able to run inside a JVM. This was a really good talk from the European LLVM Developer’s Meeting about the mechanics of Kotlin/Native, including some of the challenges of implementing memory management outside the JVM, handling exceptions, and porting to WASM (which has no runtime, no memory allocator, no exceptions etc.), as well as some of the general problems encountered with LLVM (slow codegen and linking, missing public LLDB plugin API etc.).

Slides Video

21. Fresh Async With Kotlin, Roman Elizarov

The spectrum of asynchronous programming is wide and varied. This was a fantastic talk from GOTO Copenhagen about the challenges underpinning some of these paradigms of asynchronous programming, in particular the callback based approach with Futures. The talk then goes on to address how Kotlin aims to solve this problem with coroutines by providing a synchronous interface to the user (via the suspend primitive) while under the hood using continuations and suspension points to construct a state machine. The most fascinating part of the talk was the comparison between Kotlin’s approach and the C# approach of async/await, with the mainspring behind Kotlin’s design choices being that concurrency is hard and ergo has to be explicit. The talk ends with how even CSP-esque patterns can be implemented using the coroutine primitives of Kotlin.

Video

22. Kotlin Native Concurrency Model, Nikolay Igotti

Kotlin has no concurrency primitives at the language level. Kotlin coroutines, as described in a talk above, are a library based construct that targets the JVM. Kotlin/Native eschews the JVM style shared object heap and locking by maintaining an invariant that an object is either owned by a single execution context, or it is immutable (shared XOR mutable). This was a great talk from KotlinConf that goes into how this is achieved with “not externally referred object subgraphs”.

Furthermore, Kotlin as a language doesn’t have immutability built into the type system. Immutability is achieved by the concept of freezing, which makes the transitive closure of all objects reachable from a given object immutable. In addition, Kotlin/Native also allows the transfer of ownership of objects across execution contexts. The talk introduces the basic safe concurrency primitives provided by Kotlin/Native such as “detachable object graphs”, atomics and actor-style “workers”, how reference counting based memory management works in Kotlin/Native as well as how it achieves interoperability with other runtimes.

Slides Video

23. Is it time to write an Operating System in Rust, Bryan Cantrill

I’ve often been subject to someone or other armchair theorizing that Rust is “a language designed to write a kernel in”.

Well, is it?

This is a great talk from the foremost expert on the topic on why Rust is particularly well suited for writing systems software, as well as some of the challenges of potentially writing an entire kernel in Rust. If you like trips down the annals of computing history and that rare brand of hot takes that are actually underpinned by well-informed and reasoned thinking, this might be the talk for you.

Slides Video

24. What do you mean “thread-safe”?, Geoffrey Romer

This was a wonderful talk from CPPCon which aims to disambiguate terms like “thread-safe” or more precise terms like “data race” and “race condition” that often operate at the wrong level of abstraction. The talk proposes using the notion of an “API race” and invariants that can be built around an “API race”, followed by recommendations for both C++ library and application authors around

Video

25. Fast Safe Mutable State, Ben Cohen

When it comes to mutable state, it’s important to remember that it’s shared mutable state that is bad, not mutable state per se. This was a wonderful talk from the Functional Swift conference about when and how to use local mutable state without sacrificing safety or performance. The talk walks through some of the language features in Swift that lend it a certain functional flavor by preventing certain categories of errors possible in mutating functions.

Video

26. The Dos and Donts of Error Handling, Joe Armstrong

I had the pleasure of watching this talk live at GOTO Copenhagen. The main thrust of this talk is that it’s impossible to achieve fault tolerance using a single machine; message passing thus becomes inevitable. Building fault tolerant distributed systems boils down to detecting and acting on errors. The philosophy of error handling deemed most prudent is one where software can be proven correct at compile time and where software is assumed to be de facto incorrect and expected to fail at runtime. Large assemblies of small things are impossible to prove correct; thus it becomes important to be able to define the “error kernel”, which is the subset of the system that must be correct. If as a programmer you don’t know what to do, crash. Then your software becomes simpler.

You’d be forgiven for thinking of it as a 45 minute explanation for the existence of the Erlang programming language.

Video

27. QUIC: Developing and Deploying a TCP Replacement for the Web, Ian Swett and Jana Iyengar

This was a great talk from NetDev which gives an introduction to the QUIC protocol developed at Google, the design decisions (why layer it on top of UDP, better loss recovery, flexible congestion control), its evolution, as well as myriad adventures scaling QUIC on Linux.

Slides Video

28. Introducing Network.framework: A modern alternative to Sockets, Josh Graessley, Tommy Pauly, Eric Kinnear

Sockets can be difficult to use when it comes to connection establishment or data transfer (even with non-blocking sockets) or mobility.

Network.framework is a modern transport API that’s an alternative to sockets on Apple platforms. This was a great talk from WWDC 2018 that walked one through everything from the anatomy of an initial connection establishment to the lifecycle of a connection, along with the myriad optimizations made at different stages. It’s also possibly the best presented talk on this list.

Video

29. Kubernetes and the Path to Serverless, Kelsey Hightower

It’s a talk by Kelsey Hightower.

Do I need to say any more? I think not.

Video

30. Using Rust for Game Development, Catherine West

The talk starts off saying “This is probably the most boring talk …”

It’s not.

It might actually be the very best talk on this list.

Video