Beware Transparent Layers

Terry Crowley
Aug 28, 2017 · 7 min read

A consistent pattern in building complex systems that need to evolve over time is that you need to choose what layers or components to build on — when to leverage some functional component built by someone else or when to build that functionality yourself on some lower-level interface. Building on a functional component that abstracts complexity and continues to track technology changes over time can be a major benefit. In contrast, betting on a component that adds complexity and makes it difficult to respond to evolving technical realities can be a significant drag on your product strategy.

Back in 2013, the Outlook and Exchange teams deployed changes to a key component used to manage communications between them. I thought it was an interesting story, both from the perspective of clarifying what had always felt like a complex problem space as well as teaching more general lessons around software layering, abstractions and taking dependencies.

First some history. Exchange and Outlook use MAPI (Messaging Application Programming Interface) to communicate. This is a Remote Procedure Call-based interface, initially designed for local area networks and implemented over TCP. This is a “stateful” protocol which essentially means that you establish a connection, perform operations that establish a shared state context between the client and the server across that connection and then execute operations that are interpreted in the context of that shared state. This contrasts with stateless (e.g. “REST-ful”) protocols that assume that each operation can be interpreted independently. In practice, this stateful design was targeted for an environment where clients stay connected for long periods of time, connections are stable and long-lived and it made sense to optimize for the efficiency of individual operations (since they happen frequently) rather than the efficiency of establishing communication and that shared state (since that happens rarely).
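The stateful-versus-stateless distinction above can be made concrete with a small sketch. This is illustrative Python with hypothetical names (none of it is the actual MAPI or Exchange API): a stateful session builds up shared context once so that later operations can be small, while a stateless request must carry its full context every time.

```python
class StatefulSession:
    """Sketch of a stateful protocol: server-side context is established
    once on the connection, and later operations are interpreted
    relative to that shared state."""

    def __init__(self):
        self.context = {}  # shared state built up across the connection

    def open_folder(self, name):
        # Relatively expensive setup, done once per connection...
        self.context["folder"] = name

    def fetch_next(self):
        # ...so each subsequent operation can be terse: it only makes
        # sense in the context established earlier.
        return f"next message in {self.context['folder']}"


def stateless_fetch(folder, cursor):
    """Sketch of a stateless (REST-style) request: every call names its
    full context, so each request can be interpreted independently."""
    return f"message {cursor} in {folder}"


session = StatefulSession()
session.open_folder("Inbox")
print(session.fetch_next())          # relies on previously shared state
print(stateless_fetch("Inbox", 7))   # self-describing request
```

The trade-off the article describes falls out directly: the stateful style makes each operation cheap but makes connection setup (and re-setup) expensive, while the stateless style pays a per-request cost in exchange for cheap reconnection.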

From a layering perspective, the core underlying RPC module was a Windows component that handled such issues as connection establishment, sharing and management and also the packing and unpacking of function parameters. In practice, the MAPI code made little use of the RPC parameter packing and unpacking functionality since all operations were bundled into a “black box” that was packed and unpacked by the MAPI code.

Back in the Office 2003 product cycle, we were working to enable WAN connectivity (big “I” Internet vs. intranet) between Outlook and Exchange. It’s a long convoluted story, but for a variety of reasons we (and the industry) moved to using HTTP vs. raw TCP to enable this type of functionality. IT administrators were being forced to allow HTTP access through their firewalls for web connectivity and tools for managing and locking down HTTP connectivity were more robust than raw TCP. Or at least that was the general perspective.

In any case, the solution was to build RPC-over-HTTP and allow RPC to tunnel through an HTTP connection. The Windows team extended the RPC component to support this connection mode and the Exchange and Outlook teams leveraged this work. This was a nice, clean, (almost) transparent solution to establishing connectivity in a more complex network environment. Ah, some gotchas in that “almost”. There were a variety of ways that it was only “almost” transparent.

Perhaps the most significant is that the core philosophy of HTTP was originally designed around very short-lived single request-response connections. I remember being surprised when looking at HTTP originally back in 1991 since it seemed counter to the typical usage of TCP at the time (e.g. telnet or FTP) that involved long-lived multi-request protocols. Over time HTTP has added functionality to allow request batching to optimize multiple requests between client and service, but that was not the design point and much infrastructure (proxies, firewalls) built up to take advantage of that original design point and semantics. On top of the fact that the underlying protocol was designed for short-lived connections, this period also saw the rise of laptops using less reliable wireless connections and much more frequently cycling between running and sleep states. So both the protocol (and the ecosystem around it) and the actual evolving operating environment were in conflict with the original MAPI model of stable, long-lived connections.

For ease of integration with the rest of the software layers, RPC-over-HTTP was implemented as two independent HTTP connections to support the full duplex nature of RPC. Although this was technically unnecessary (TCP is inherently full-duplex), it made for a simpler mapping to the underlying RPC layers. Additionally, RPC-over-HTTP implemented long-lived and server-push functionality in what evolved to be “non-standard” ways of achieving this functionality (it issues very large 1 gig “requests” but these are left outstanding as actual responses are chunked inside this meta-request). The evolved “standard” approach is a combination of HTTP 1.1 multi-request (which optimizes the connection setup for multiple requests but doesn’t otherwise change the underlying semantics of each request and response) and “long-polling” which has the client issue a request that the server waits to respond to until it has some notification to send.
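The "long-polling" approach mentioned above can be sketched in a few lines. This is a simplified in-process simulation (the queue stands in for the server's notification source, and the names are hypothetical): the key idea is that the server deliberately holds the client's request open and answers only when it has something to report, rather than replying immediately.

```python
import queue
import threading
import time

# Stand-in for the server's pending-notification source.
notifications = queue.Queue()


def server_long_poll(timeout=5.0):
    """Handle a long-poll request: block until a notification arrives,
    or return None on timeout (the client then simply re-issues the
    poll). Contrast with a normal request, which answers immediately."""
    try:
        return notifications.get(timeout=timeout)
    except queue.Empty:
        return None


def deliver_later():
    # Simulate a notification arriving after the poll is outstanding.
    time.sleep(0.1)
    notifications.put("new mail")


threading.Thread(target=deliver_later).start()
result = server_long_poll()   # blocks briefly, then returns the event
print(result)
```

This is the "standard" shape the article contrasts with the RPC-over-HTTP trick of leaving a single enormous request outstanding and chunking responses inside it: long-polling keeps each request/response pair short-lived and ordinary-looking, which is exactly what the intermediating proxies and firewalls expect.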

On top of this, HTTP and RPC have independent authentication mechanisms. So tunneling also required some rigmarole around how authentication tunnels through the layers.

At the end of this process, we had a system that “worked”, but frankly was misaligned with where both the industry and operating environment were going. While everyone might want to feel they are special, one thing you find over time is that you are much better off if you build on widely used components and protocols and if your load patterns are consistent with broad usage. It is hard to specify all the problems you will run into when you are “special”, but they tend to arise from reduced ongoing investment in rarely used components and poor validation coverage of your unusual usage patterns. It’s a bad place to be.

Over time this manifested in different ways. Perhaps the most visible manifestation was in comparing reconnection times to alternate ways of accessing mail that arose over this time, like using Outlook Web Access (over straight HTTP) or mobile clients over EAS. Both of these were designed for more typical web and mobile scenarios, optimizing for quick connection (and reconnection) times vs. stable long-running connections. This was not simply the connection protocol itself, but the server infrastructure and systems that supported it.

Ultimately this just made the core experience of “open my device and see my latest mail” faster using these other modes. Additionally, as the OS made enhancements to support better app-level notification and interaction with wireless connection status, it was extremely difficult to plumb this through an RPC component that was essentially on life support in the OS group. This component was originally designed to opaquely manage connection state, most importantly because it was designed to transparently share connections between multiple applications that might be communicating with the single server. This design point became less critical than being able to rapidly respond to the changing nature of network behavior.

The non-standard use of HTTP mostly surfaced as fragility in a complex network environment of routers, firewalls, proxies, and load-balancers that often had behavior that presumed knowledge of “typical” HTTP usage and how to optimize for it. Misbehavior in these layers is notoriously difficult to debug and often arose from local configuration (or misconfiguration) issues, so it was also very difficult to validate broadly. The use of dual interdependent HTTP connections was also non-standard and contributed to overall fragility. Each problem that arose here ultimately had some “root cause” (that at times required developers working months to debug) but really the deeper “root cause” was this fundamental mismatch in characteristic usage of the underlying protocol with the broad industry.

Ultimately we decided that we simply needed to rip out this RPC-over-HTTP layer. Outlook and Exchange together worked to design and deploy the new protocol layer. It separates context and connection (which obviously required significant service infrastructure to support — it was not “transparent” in any way!) and allows for much faster reconnection times by caching the context across connection failures rather than needing to reestablish it which would require additional IO, computation and client-server interactions. It now uses a single HTTP connection, using standard authentication mechanisms, with typical HTTP semantics and using an industry-standard “long-polling” approach for service initiated notifications. On top of these changes, the teams also implemented extensive client telemetry to allow them to monitor, analyze and tune the client-service behavior. Ultimately, end-to-end ownership of this critical interaction path was much more important than trying to optimize for minimizing changes in other software layers.
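The separation of context from connection described above can be sketched as follows. This is a hypothetical illustration, not the actual Outlook/Exchange protocol: the server caches session context under a token, so a client that loses its transport connection can resume against the cached context instead of paying the full re-establishment cost.

```python
import uuid


class Server:
    """Sketch of context/connection separation: session context is
    keyed by a token and survives the loss of any particular transport
    connection."""

    def __init__(self):
        self.sessions = {}  # cached context, independent of connections

    def establish_context(self):
        # The expensive part (IO, computation, client-server round
        # trips in the real system) happens once, up front.
        token = str(uuid.uuid4())
        self.sessions[token] = {"folder": "Inbox", "sync_state": 42}
        return token

    def reconnect(self, token):
        # Reconnection is a cheap lookup of the cached context; returns
        # None if the context has expired and must be rebuilt.
        return self.sessions.get(token)


server = Server()
token = server.establish_context()  # first connection: full setup
# ... the transport connection drops (sleep, wireless handoff, etc.) ...
ctx = server.reconnect(token)       # fast resume from cached context
print(ctx["sync_state"])
```

In the old stateful design, the context lived implicitly in the connection itself, so every dropped connection forced the full setup cost; decoupling the two is what made reconnection fast, at the (non-transparent) price of server-side infrastructure to manage the cached contexts.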

I am always very, very leery of interposing “transparent” layers that provide new functionality but that try to cover for fundamentally different operating environments. Most typically, the layer is “transparent” except for radically changing performance, latency and error characteristics! Our experience in building applications is that these are precisely the aspects of the operating environment that require deep consideration around overall application architecture and user experience design and where extreme care needs to be taken in attempting to hide information between layers. Beware OS and framework developers bearing transparent new layers!
