Racing the Zebra — Benchmark Performance Architecture for F# Web Server

Update 25–06–18: The benchmark finished but Zebra crashed out, barely working. After some investigation, it appears the Response was not being closed out correctly, preventing subsequent requests; it worked on Chrome due to a quirky Chrome workaround but not on standard clients.

Update 10–08–18: The fix has been merged into the TechEmpower PR here; the AsyncTaskMethodBuilder would not sync the contexts, so I implemented my own state machine with state completion synced by a TaskCompletionSource. In addition I have added a test for Fortunes (templating) and another simple, slimmed-down micro implementation that will allow a better ranking but is classed as Micro (not Framework) for that reason.

Update 15–08–2018: Results are here https://www.techempower.com/benchmarks/#section=test&runid=9d3b8d36-c4a8-499f-bd0a-0fe33cf820cf. Zebra performed better than expected, nearly 2x faster than Giraffe, with more room for improvement; but more than anything, it confirms the architecture arguments discussed here. On applications with complex routing/parsing and chained async binding, the speedup would be even greater.

Intro

I, along with many others, put in a fair amount of work designing & testing the architecture of Giraffe to ensure it would perform as well as MVC. Historically, with Suave and other functional frameworks, all the functional constructs meant we had to compromise on performance, the cost we paid for a more elegant functional API. For those interested in how we designed in these performance elements, you can check out the GitHub issues Task computation expression instead of async or Continuations over binding HttpHandlers.

Issues with Next Continuation

Although there was initial friction over the next continuation format, most people seem comfortable with it … but there are some bugs in the design, particularly with route parsing, that prevent the pre-application of next functions at startup and instead cause runtime allocations & binding, which defeats the purpose. This is most noticeable in parsing, as the first argument of our parsing functions is a runtime-parsed variable and the second is the next , the next that is supposed to be applied at startup to build the route branch structure. The parsing & parameter application need to be more tightly controlled to get a consistent pre-built execution tree.

fun next ctx -> task { ... return! next ctx }
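To make the shape above concrete, here is a minimal, synchronous sketch of the Giraffe-style next continuation (simplified types for illustration only; the real library returns Task<HttpContext option>, not a plain option):

```fsharp
// Simplified, synchronous sketch of the Giraffe handler shape.
type Ctx = { mutable Response : string }
type HttpFunc = Ctx -> Ctx option          // Some ctx = handled, None = fall through
type HttpHandler = HttpFunc -> HttpFunc    // takes next, yields a new HttpFunc

let setText (text : string) : HttpHandler =
    fun next ctx ->
        ctx.Response <- text
        next ctx                           // hand the mutated context to next

// compose pre-applies next once at startup, so by the time requests
// arrive the pipeline is a single pre-built function chain.
let compose (a : HttpHandler) (b : HttpHandler) : HttpHandler =
    fun next -> a (b next)

let pipeline : HttpFunc = compose (setText "hello") (setText "hello world") Some
```

The key point is that `next` is applied once, up front; the parsing bug described above breaks exactly this pre-application.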

Issues with TaskBuilder.fs

The original slowness of Suave on asp.net core was due to the constant conversion back and forth between Async & Task. These constructs are part of async pipelines with their own scheduling, so converting between them was not as simple as converting an integer to a float; it was extremely wasteful. With Giraffe I proposed using Tasks throughout instead, and this was adopted using TaskBuilder.fs, leading to massive performance boosts, although we were still very far behind C# async/await binding. This is due to the restrictions of state passing in computation expressions versus async/await, which has a specialised, optimized compiler case: a single AsyncMethodBuilder per pipeline, such that all the management for a chain of async methods uses a single state iterator. TaskBuilder is overall an excellent library that does the best it can within the limitations of F# & computation expressions, but on each bind it allocates a closure continuation as well as a Discriminated Union, which is very heap-chatty on top of all the task scheduling, and each CE creates its own state machine versus C#'s ability to use one per pipeline of multiple async/await methods.

Issues with Routing

The default implementation of routing in Giraffe is very inefficient as it tries each path, one by one, to find a route match, giving very poor, inconsistent performance on large lists of routes. On top of that, the current routing uses Regex, which works poorly in .net given it is non-contextual generated IL that doesn't get fully optimized into its surrounding code. Using some sort of Dictionary/Map structure allows better lookup performance, and in building the TokenRouter I got around both of these issues: Radix-style tokens allow quick jumping down a node tree, along with hand-rolled parsers that perform far faster than Regex. The simple answer for router performance could be to use the TokenRouter instead, but for the benchmarks it is not allowed as it is marked beta, and in addition there are two issues with TokenRouter: its API is considered a little more difficult than the default hit/miss approach, and like the default implementation, it currently boxes each result (often ValueTypes) before boxing into a tuple, lots of unneeded allocs.
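In the spirit of that approach, here is a hypothetical sketch of dictionary-based route matching (illustrative only; TokenRouter's real implementation tokenizes differently and adds hand-rolled parsers): one lookup per path token, rather than trying every registered route in sequence.

```fsharp
open System.Collections.Generic

// Hypothetical radix-style route tree: each segment is a dictionary hop.
type RouteNode =
    { Children : Dictionary<string, RouteNode>
      mutable Handler : (unit -> string) option }

let newNode () = { Children = Dictionary () ; Handler = None }

let rec addRoute (node : RouteNode) (tokens : string list) handler =
    match tokens with
    | [] -> node.Handler <- Some handler
    | t :: rest ->
        let child =
            match node.Children.TryGetValue t with
            | true, c -> c
            | _ ->
                let c = newNode ()
                node.Children.[t] <- c
                c
        addRoute child rest handler

// Matching cost is O(path depth) dictionary hops, independent of how
// many routes are registered.
let rec matchRoute (node : RouteNode) (tokens : string list) =
    match tokens with
    | [] -> node.Handler
    | t :: rest ->
        match node.Children.TryGetValue t with
        | true, child -> matchRoute child rest
        | _ -> None
```

This is what makes lookup performance consistent regardless of route-list size.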

Return Call Stacks, Type Coercion & Chattiness

A side effect of Object Oriented programming and over-refactoring is that many developers create many more busy little functions, which have a number of drawbacks despite the benefit of "more maintainable code". For performance, methods/functions should be chunky, not chatty: chunky methods benefit from fewer instance loads (from memory), as well as better branch/execution locality within the single method. Functions are refactored into multiple smaller functions for maintainability, but for performance we are un-batching the memory & execution operations.

Nesting many function calls (when not tail-call optimised) results in return call stack frames building up; placeholders for the return types and hops on resume all lead to tiny slowdowns that add up. With F# functions being first-class, we can minimise return types using CPS & state machines instead.

Another side effect of the many-small-functions & OO mentality is that we end up needing to hack/bend the type system so that return types match up and adhere to a kind of monoid composability, over-generalising and avoiding explicitness. Functional programming is well suited to CPS, where generalised return types become less important; instead we can nest execution branches with localised, explicitly typed continuations, avoiding trying to stuff or wrap a square block into a circular hole. A classic example of this is .NET's Task: we create pipelines with return type Task<'T> even though we have functions in the pipeline that are synchronous and therefore have to wrap results in a Task or use another hack like ValueTask<'T>. Giraffe reduces this hack with its next continuation but there is more that can be done.

The Ceiling For Giraffe

One of the reasons Giraffe currently beats MVC in some benchmarks is the way the handler branching is built out at startup: a tree of nested continuations that only need be applied at run-time. This allowed a high single-digit improvement over the prior Option binding. The continuation format also provided an elegant way to prevent pauses on hotpaths from async wraps (that existed just to maintain an async pipeline) … but after all these optimizations there is still a bit of fat/bloat from the aspnetcore MW (middleware) that is also in the benchmark and that Giraffe is built on top of. Looking through the code, a screamingly obvious issue is the chatty, high-allocation TaskBuilder.fs, and improving its performance has proven difficult, as the original is well built and does the best it can as a general-purpose Task CE. In order to beat these limitations and the benchmark, we need to redesign some things and specialize.

Introducing Zebra

To work around the limitations and try out new design patterns I have created Zebra, a smaller, faster, Giraffe-like animal; the name is inspired by Steffen & Krzy.

Return Nothing, You’re on Your Own

I wanted to get away from the forced Task wrapping. Task and its underlying continuation infrastructure with a state machine don't return tasks at every step, so why should we… all asp.net core cares about is being notified when the entire request is complete so that it can dispose and return pooled resources. On this basis we instead, figuratively, head off into the wild with a flare gun: no back and forth, just a final flare signal of completion. We can do this with a TaskCompletionSource that provides asp.net core the Task completion signal it needs.
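A minimal sketch of the flare-gun idea (the names here are illustrative, not Zebra's actual types): handlers return unit, and a single TaskCompletionSource hands the host the one completion Task it awaits.

```fsharp
open System.Threading.Tasks

// The only Task in the pipeline: a single completion signal per request.
type Signal =
    { Completion : TaskCompletionSource<unit> }
    member s.Fire () = s.Completion.SetResult ()   // the flare: request done

// The entry point returns the one Task the host awaits; the handler
// itself returns unit and signals completion through the state it holds.
let runPipeline (handler : Signal -> unit) : Task =
    let signal = { Completion = TaskCompletionSource<unit> () }
    handler signal
    signal.Completion.Task :> Task
```

Whether the pipeline completes synchronously or hops threads along the way, asp.net core only ever sees that one Task.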

Front Loading execution structure

A consistent pattern in any performance application is utilizing an optimised data-structure state that allows for fewer runtime allocations, as well as a shortened runtime execution path. Algorithms and data structures are very similar things, a data structure being a subset of an algorithm in proportion to its optimisation: data structures optimise algorithm execution by improving relevant state locality and minimising allocation & branching. You cannot "run" a data structure; the algorithm (the execution) is always needed, but more and more of the algorithm's logic can be absorbed by data structures to improve performance.

Key to Giraffe being fast is building the execution tree structure upfront, whereas Suave would bind a pipeline at runtime. The way Giraffe builds this structure can be seen in the compose function, but it is like building a one-sided BTree, as each closure has a next captured and failure is communicated through the returned option. To expand on this, and given we have no return type, I decided to explicitly implement a BTree structure under the covers: each node has (1) a handler, (2) a next node & (3) a fail node. fail will short-cut us directly to the desired fallback node without being tied down by return types and pulling back a load of return stack frames.
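A sketch of that node shape (illustrative, not Zebra's exact types): failure jumps straight to the fail node instead of unwinding a stack of option returns.

```fsharp
// Each node carries its handler plus pre-wired success and failure branches.
type ExecNode<'S> =
    { Handler      : 'S -> bool               // true = proceed, false = fall back
      mutable Next : ExecNode<'S> option
      mutable Fail : ExecNode<'S> option }

let node handler = { Handler = handler; Next = None; Fail = None }

// Walking the tree builds no chain of wrapped return values:
// each step just picks a branch and recurses (a tail call).
let rec run (n : ExecNode<'S>) (state : 'S) =
    let branch = if n.Handler state then n.Next else n.Fail
    match branch with
    | Some next -> run next state
    | None -> ()
```

The Next/Fail links are wired up once at startup, so run-time execution is pure branch-and-jump.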

Zebra Handler

The re-designed handler returns nothing (unit), with full management of the execution path handled by the incoming (function input) state. Before the handler is executed, the BTree execution node is injected to provide a reference to the next & fail handlers. The handler signature now loses the next continuation and returns nothing, so it is simply State<'T> -> unit. The functions are not waiting on anything as they internally manage the state of the whole pipeline.


Introducing Zebra’s State

In Giraffe we pass in the HttpContext and that is the state … I realized that if I wanted to use a single shared Task state machine for a pipeline, I would need to pass it as state. Then I had the issue that I wanted a CE that could somehow internally use this task state machine, so I initially thought of a hybrid Task CE that took the state machine as a constructor argument in each expression. This API was a bit tedious and ugly, allocated on every CE expression, and after all the complaints/issues around Giraffe I knew I needed to keep it as simple as possible … It then occurred to me that I could break the general CE mold: I didn't want a general, all-purpose Task CE, I wanted a new specialised Zebra CE.

Zebra State CE

The stateful CE could be created with a new async state machine at the start of the pipeline, shared as it does the binding in each function, and it doesn't need to mess around with returning / task-wrapping returns. We are not really returning data/work, we are mutating the HttpContext, so I figured the CE return can just be a bool, similar to how the Context Option binding was working, but it avoids the hotpath sync wrapping as we can deal with that under the covers … and thanks to the Zero method on the builder, you can have an expression with no return and it will implicitly apply a true case to proceed to the next handler.
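A heavily simplified sketch of the bool-returning CE idea (synchronous, and not Zebra's real builder, which threads the shared state machine underneath): Return carries the proceed/fail decision, and Zero makes "no return at all" mean proceed.

```fsharp
// Illustrative bool-deciding builder: no Task machinery, just the
// Return/Zero shape described above.
type DecideBuilder () =
    member _.Return (b : bool) = b                 // explicit return true/false
    member _.Zero () = true                        // no return at all => proceed
    member _.Bind (v : 'a, f : 'a -> bool) = f v   // sync stand-in for the shared state machine bind

let decide = DecideBuilder ()

let allow = decide { return true }
let deny  = decide { return false }

let implicitAllow (x : int) =
    decide {
        if x < 0 then return false   // fail branch is explicit
        // falling off the end hits Zero () and proceeds to the next handler
    }
```

In the real thing, true routes execution to the Next node and false to the Fail node of the pre-built tree.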

Although having HttpContext on State is not ideal, helper methods on state allow easy access to HttpContext functionality without needing to go through HttpContext.

The specialised CE also allows for custom operations; I have built `text` & `json` for a start. What's even nicer is that each request has its own pooled MemoryStream, so all writing by custom operations is sync to buffer, allowing multiple, accumulative writes, with the end of the pipeline automatically setting content-length etc. headers and doing an async copy. This approach allows ideal performance for json & view rendering without chunking.

Zebra Task Binding

Zebra’s Task handling is not only greatly improved by the use of a single async state machine for the entire pipeline, it also does practically no allocation, as it takes advantage of storing applicator structs in the reusable state machine slot, structs that don’t get boxed to call their async interface methods thanks to struct ref constraint passing. The only allocation in these pipelines should be the closure wrapping of the bind continuation (if there is a captured variable, which is unavoidable).

If we look at TaskBuilder.fs:

If we have to do an await, we are allocating both on the `Await(awt,_)` DU case and on the `(fun () -> continuation(awt.GetResult()))` closure.
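Since the original gist is not reproduced here, this is a simplified sketch of that stepping pattern (illustrative only, not TaskBuilder.fs's actual code): every true suspension allocates a DU node and a closure capturing the awaiter.

```fsharp
open System.Runtime.CompilerServices
open System.Threading.Tasks

// Simplified sketch of the TaskBuilder.fs stepping pattern.
type Step<'T> =
    | Ret of 'T                                                // finished synchronously
    | Await of ICriticalNotifyCompletion * (unit -> Step<'T>)  // heap: DU case + closure

let bindTask (t : Task<'a>) (continuation : 'a -> Step<'b>) : Step<'b> =
    let awt = t.GetAwaiter ()
    if awt.IsCompleted then
        continuation (awt.GetResult ())    // hot path: no suspension allocations
    else
        // Boxing the struct awaiter into the interface, plus the closure:
        // two heap allocations for every real await.
        Await (awt :> ICriticalNotifyCompletion, fun () -> continuation (awt.GetResult ()))
```

Multiply those two allocations by every bind in every request and the heap chattiness adds up fast.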

In Zebra we use mutable struct refs to prevent boxing and minimise heap allocations:
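The real code is not shown here, but the underlying trick can be sketched with hypothetical types (these are not Zebra's): calling an interface method through a struct-constrained generic parameter is emitted as a constrained call on the value itself, so the struct is never boxed, whereas an explicit upcast to the interface copies it to the heap.

```fsharp
// Hypothetical "applicator" interface standing in for the awaiter /
// state machine interfaces Zebra works with.
type IApplicator =
    abstract member Apply : int -> int

[<Struct>]
type Adder (n : int) =
    interface IApplicator with
        member _.Apply x = x + n

// Boxes: taking the parameter as the interface forces the struct
// onto the heap at the call site.
let applyBoxed (app : IApplicator) (x : int) = app.Apply x

// No box: 'A stays a struct and the interface call is a constrained
// callvirt straight on the value.
let applyConstrained<'A when 'A :> IApplicator> (app : 'A) (x : int) =
    app.Apply x
```

Zebra applies the same idea to the structs stored in the reusable state machine slot.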

Option Binding to Continuations and back again

The irony of coming full circle is not lost on me: after pushing for continuations over Option binding … I now provide a bool binding. Although it looks like bool binding, under the covers it uses the bool to apply state/context to the Next/Fail function in the pre-built tree as before … it’s a hybrid of both that I feel does not compromise perf materially but returns the API to the preferred, perhaps even better, format of implicit/explicit bool CE returns. The HttpContext being wrapped in an Option in Suave or Giraffe is a wasted allocation, as the mutable HttpContext can be referenced in the binding function from elsewhere; wrapping & passing would make sense if it were immutable.

For the simpler API, the extra operations are: 1) injecting a new next / fail node into the state before the handler is run; 2) having Return evaluate the if/else to determine how to move forward, next or fail.

Routing Static Type Constrained Parsers

To address the routing issue, I have integrated a modified token router into the system as the default, with improved API usability; and to avoid all the boxing I have been able to build a fully type-constrained parsing system that links the various parsers directly into applicative functions, once again minimizing allocation. This means no more Tuple objects, and with the continuation construction now managed via nodes, we can ensure that parse routes are built out at start-up, as well as have the nicer composition format people expect.

One side effect of the explicit curried parse format is that parse routes need the number of parameters appended at the end instead of the ‘f’, so in keeping with Saturn-style method routing you get `get1` or `post2` etc.

We use the inline ternary operator to allow different handler shapes to be bound together so that the experience is seamless for the developer: they only need to use `=>` to glue things together.
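To illustrate the idea of one operator accepting different shapes, here is a hedged sketch using the standard static-member overload trick (this is only in the spirit of Zebra's `=>`, not its actual implementation; the `Route` and `Glue` types are invented for this example):

```fsharp
// A toy route tree: a path can be glued either to a handler (leaf) or
// to a list of sub-routes (branch), all through one => operator.
type Route =
    | Leaf of string * (unit -> string)
    | Branch of string * Route list

// Overload carrier: one static member per accepted right-hand shape.
type Glue = Glue with
    static member (<+>) (Glue, h : unit -> string) = fun path -> Leaf (path, h)
    static member (<+>) (Glue, rs : Route list)    = fun path -> Branch (path, rs)

// inline defers overload resolution to each call site (SRTP), so the
// one visible operator works for every supported shape.
let inline (=>) path x = (Glue <+> x) path
```

Each use of `=>` statically picks the right overload, so the glue costs nothing at run-time.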

So what would a Zebra application look like?

Below is just a basic mashup of functionality

What Does Zebra mean for Giraffe / Saturn?

Nothing; this is a proof-of-concept design built to perform better in the benchmarks while still being user-friendly enough to be considered a framework. I won’t be building it out or extending it, as Giraffe & Saturn already offer a rich, full developer experience and I don’t want to sidetrack those projects with my bizarre pet project. The design was a result of work I had been doing to integrate the TokenRouter into Saturn, combined with Giraffe entering the TechEmpower Benchmark.

So this is purely for benchmark and not usable?

It is a flaky alpha that needs some API hacks, but it may be an interesting source for some bizarre patterns that you rarely see anywhere else, with:

  • passing stateful computation expressions
  • multi-case ternary operator binding
  • leaf-reverse-built execution BTree functions
  • SRTP operator type overloading
  • AsyncMethodBuilder struct applicatives

Just to name a few.

I am working with a stripped-down version on my fork of TechEmpower and will not be releasing the full version. Please do not use or evaluate the framework, as it is not production ready and not intended for others’ use.

I won’t be releasing this as a nuget package or anything … I’m just hoping to rank F# a little higher in the benchmark while maintaining Framework (MVC-comparable) status. I’m @gerardtoconnor on twitter if you have any questions.

https://github.com/gerardtoconnor/FrameworkBenchmarks/tree/master/frameworks/FSharp/Zebra