Building a High-Performance TCP Server From Scratch

Ali Khalili

Introduction

Building a high-performance TCP server requires a deep understanding of how sockets work and how to manage connections efficiently. In this blog post, we’ll take a close look at the .NET Socket class, uncovering how it's implemented. By understanding these details, we can leverage its full potential to build a TCP server capable of handling high-throughput, low-latency communication, especially for scenarios like receiving heartbeat messages from numerous IoT devices.

We’ll also compare the performance of our custom server with Kestrel, the highly optimised ASP.NET web server, to demonstrate that a well-crafted .NET socket server can match its efficiency.

Join me as we dive into the architecture of .NET sockets and see how to create a high-performance, low-latency TCP server that delivers results on par with Kestrel.

The code is accessible in the accompanying GitHub repository.

Heartbeat TCP Server

Our TCP server will handle heartbeat messages from thousands of IoT devices, each periodically sending a small HTTP/1.1 request that contains a header with its unique Device-Id . Upon receiving a heartbeat, the server will update the last known timestamp for that device in an in-memory data structure, send an HTTP 204 No Content response to acknowledge the receipt, and then close the TCP connection.
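The in-memory data structure can be as simple as a concurrent dictionary keyed by device id. The sketch below is illustrative; the type and member names are my own and may differ from the accompanying repository:

```csharp
using System;
using System.Collections.Concurrent;

// Minimal sketch of the in-memory heartbeat store (names are illustrative,
// not taken from the accompanying repository).
class HeartbeatStore
{
    private readonly ConcurrentDictionary<string, DateTimeOffset> _lastSeen = new();

    // Record (or overwrite) the last time a device was heard from.
    public void RecordHeartbeat(string deviceId) =>
        _lastSeen[deviceId] = DateTimeOffset.UtcNow;

    public bool TryGetLastSeen(string deviceId, out DateTimeOffset timestamp) =>
        _lastSeen.TryGetValue(deviceId, out timestamp);
}
```

ConcurrentDictionary keeps the update lock-free in the common case, which matters when many connections update timestamps concurrently.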

curl -X PUT -H "Device-Id: 1234" -v 127.0.0.1:9096
* Trying 127.0.0.1:9096...
* Connected to 127.0.0.1 (127.0.0.1) port 9096
> PUT / HTTP/1.1
> Host: 127.0.0.1:9096
> User-Agent: curl/8.8.0
> Accept: */*
> Device-Id: 1234
>
< HTTP/1.1 204 No Content
<
* Connection #0 to host 127.0.0.1 left intact

The goal is to build a high-performance TCP server, with two key metrics in mind: latency and throughput. Latency refers to the time taken to receive and process a new heartbeat message — updating the last known timestamp in memory. Reducing latency improves the server’s responsiveness. Throughput, on the other hand, measures how many heartbeat messages the server can process in a given time. Increasing throughput allows the server to handle more IoT devices, improving overall efficiency.

Balancing these two metrics is critical, as high throughput doesn’t necessarily mean low latency. Optimising both ensures the server performs well under load.

Socket

To implement our server, we’ll use Socket, which is an abstraction provided by the operating system to facilitate network communication. Both the client and server must create and manage sockets to exchange data. In .NET, the Socket class from the System.Net.Sockets namespace acts as a managed wrapper around the OS's native socket functionality, providing a convenient interface for establishing and managing network connections.

Initialise a Server Socket

The following code initialises a TCP socket, binding it to a specific IP address and port. By calling the Listen method, the server signals the operating system to begin queuing incoming connections. The backlog parameter defines how many connections the OS will hold in the queue. However, while the OS can queue connections, the server still needs to actively accept them.

To do this, the server calls the Accept method, which dequeues a connection and returns a new socket dedicated to that connection. This new socket can then be used for sending and receiving data, while the original socket continues listening for additional connections.

var listenerSocket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
listenerSocket.Bind(new IPEndPoint(IPAddress.Any, _port));
listenerSocket.Listen(backlog: 4096);

while (true)
{
    var acceptSocket = listenerSocket.Accept();
    HandleNewConnection(acceptSocket); // sequential model
}

Handling New Connections

The simplest approach to handle new connections is through a sequential model as we saw, where the server processes one request at a time. It accepts a connection, handles it completely, and then moves on to the next. While easy to implement, this model is inefficient when multiple clients try to connect simultaneously, as only one client can interact with the server at any given time. This forces other clients to wait, leading to delays and poor performance, especially under heavy load.

To improve throughput and reduce latency, concurrency is key. Concurrency allows the server to handle multiple requests at once, treating each request as an independent task that can be processed in parallel.

There are two main concurrency models: processes and threads.

  • Processes are more isolated, each running in its own memory space with separate resources. While this isolation increases security and stability, it makes processes slower and more resource-intensive. Communication between processes often requires complex mechanisms like inter-process communication (IPC).
  • Threads, on the other hand, are lightweight and share the same memory space within a single process, allowing for faster and more efficient communication. Since threads operate within the same process, they are better suited for handling multiple connections in a TCP server, where concurrency is critical. Threads enable the server to manage several tasks simultaneously with lower overhead, making them ideal for high-performance environments.

A common strategy is to create a new thread for each incoming connection, allowing the server to handle each request independently in its own thread, as shown in the following code.

while (true)
{
    var acceptSocket = listenerSocket.Accept();
    // concurrent model: one thread per connection
    var thread = new Thread(() => HandleNewConnection(acceptSocket));
    thread.Start();
}

Thread Pool

To boost server performance, a thread pool reuses long-running worker threads, reducing the overhead of creating new threads for each task. Instead of managing individual threads, a thread pool assigns tasks to available threads, making it more efficient, especially for short-lived tasks.

In .NET, you can use ThreadPool.QueueUserWorkItem or ThreadPool.UnsafeQueueUserWorkItem to queue tasks. The key difference is that QueueUserWorkItem captures the current execution context and restores it on the worker thread, while UnsafeQueueUserWorkItem skips this capture for slightly better performance, at the cost of not flowing ambient state such as AsyncLocal<T> values.

Setting preferLocal to false places the work item on the thread pool's global queue rather than the calling thread's local queue, so any worker thread can pick it up, which improves load balancing in high-concurrency scenarios.


while (true)
{
    var acceptSocket = listenerSocket.Accept();
    // concurrent model: hand the connection off to the thread pool
    var handler = new ConnectionHandler(acceptSocket);
    ThreadPool.UnsafeQueueUserWorkItem(handler, preferLocal: false);
}

class ConnectionHandler : IThreadPoolWorkItem
{
    private readonly Socket _acceptSocket;

    public ConnectionHandler(Socket acceptSocket)
    {
        _acceptSocket = acceptSocket;
    }

    public void Execute()
    {
        HandleNewConnection(_acceptSocket);
    }
}

Blocking vs Non-Blocking IO

After accepting a connection, a server uses the socket to read from and write to the client. In blocking I/O, each read operation suspends the thread until data arrives, causing the thread to wait and preventing it from handling other tasks. This can lead to high context-switching overhead as the number of threads increases, reducing overall throughput and scalability.
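As a concrete picture of the blocking model, a blocking version of the HandleNewConnection used in the earlier snippets might look like the sketch below. This is illustrative only (the fixed 204 reply mirrors the curl transcript above; real parsing and timestamp updates are elided):

```csharp
using System.Net.Sockets;
using System.Text;

// Blocking handler sketch: the calling thread reads the request,
// replies, and closes the connection.
static void HandleNewConnection(Socket acceptSocket)
{
    var buffer = new byte[512];

    // Blocks the calling thread until data arrives (or the peer closes).
    int received = acceptSocket.Receive(buffer);

    // ... parse the request and update the device's last-seen timestamp ...

    byte[] response = Encoding.ASCII.GetBytes("HTTP/1.1 204 No Content\r\n\r\n");
    acceptSocket.Send(response);

    acceptSocket.Shutdown(SocketShutdown.Both);
    acceptSocket.Close();
}
```

Every thread running this handler is parked inside Receive while it waits, which is exactly the cost the following sections work to avoid.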

Non-blocking I/O allows the server to request data without waiting for it to arrive. For example, a non-blocking read initiates the request and continues executing other code, such as handling additional connections, while waiting for data. This method enhances efficiency by not tying up threads during I/O operations.
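In .NET, a Socket can be switched to non-blocking mode via its Blocking property; a read that would otherwise wait then throws a SocketException with SocketError.WouldBlock instead of suspending the thread. A minimal sketch (helper name is my own):

```csharp
using System.Net.Sockets;

// Attempt a non-blocking read: returns the number of bytes read,
// or -1 when no data is available yet (instead of blocking the thread).
static int TryNonBlockingRead(Socket socket, byte[] buffer)
{
    socket.Blocking = false;
    try
    {
        return socket.Receive(buffer);
    }
    catch (SocketException ex) when (ex.SocketErrorCode == SocketError.WouldBlock)
    {
        return -1; // no data yet; the caller can poll again later
    }
}
```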

However, non-blocking I/O can involve busy-waiting, where the server continuously checks the status of multiple sockets, which can be inefficient. For instance, with thousands of sockets, constantly polling each one — even if only the last one has data — can lead to excessive CPU usage and diminished performance.

// busy-wait polling loop (pseudocode)
while (true)
{
    foreach (var socket in sockets)
    {
        data = socket.NonBlockingRead();
        if (data is not null)
        {
            ProcessData(data);
        }
    }
}

Event-driven multiplexing

To overcome the inefficiencies of busy-waiting in non-blocking I/O, modern systems use I/O multiplexing with an event-driven model. This approach leverages an event loop and callbacks to efficiently manage multiple I/O operations. In this model, the application registers interest in specific events (e.g., data availability or readiness to write) for each socket. The event loop waits for notifications from the operating system about these events instead of continuously polling each socket.

When an event occurs, the event loop triggers a callback function to handle the I/O operation, allowing the application to respond as needed. This pattern, known as the reactor pattern, reduces CPU usage and overhead by eliminating constant polling. It efficiently handles thousands of connections by focusing only on active, ready-to-process connections, making it ideal for high-performance and responsive network servers.

// register interest in read/write events for each socket (pseudocode)
foreach (var socket in sockets)
{
    OS.EventRegistration(socket, callback, event_type: read | write);
}

// event loop
while (true)
{
    events = OS.WaitForSocketEvents();
    foreach (var ev in events)
    {
        ev.callback_function(ev.socket, ev.event_type);
    }
}

Event-Driven Sockets in .NET

In .NET, socket operations on Unix-based systems (like Linux) are managed using an event-driven model, facilitated by the SocketAsyncEngine class. This class implements an event loop that listens for notifications from the operating system’s native mechanisms, such as epoll on Linux or kqueue on macOS. When these notifications are received, the event loop schedules the corresponding socket operations (e.g., reads and writes) as work items in the ThreadPool. The following code provides a simplified example of how the event loop is implemented by SocketAsyncEngine.

class SocketAsyncEngine : IThreadPoolWorkItem
{
    private readonly ConcurrentQueue<SocketIOEvent> _eventQueue = new();

    private SocketAsyncEngine()
    {
        var thread = new Thread(static s => ((SocketAsyncEngine)s!).EventLoop())
        {
            IsBackground = true,
            Name = ".NET Sockets"
        };
        thread.UnsafeStart(this);
    }

    private void EventLoop()
    {
        var handler = new SocketEventHandler(this);
        while (true)
        {
            // block until the OS (epoll/kqueue) reports ready sockets
            Interop.Sys.WaitForSocketEvents(_port, handler.Buffer, &numEvents);
            if (handler.HandleSocketEvents(numEvents))
            {
                // schedule this engine to drain the event queue on the thread pool
                ThreadPool.UnsafeQueueUserWorkItem(this, preferLocal: false);
            }
        }
    }

    public void Execute()
    {
        // drain queued socket events until the queue is empty
        while (_eventQueue.TryDequeue(out var ev))
        {
            ev.Context.HandleEvents(ev.Events);
        }
    }
}

To register a callback for non-blocking socket operations such as Receive or Send in .NET, we use the SocketAsyncEventArgs class. This class allows us to associate a callback with the socket operation, which is triggered when the event loop detects a relevant event. The code snippet below demonstrates how to use SocketAsyncEventArgs for this purpose:

var receiveEventArgs = new SocketAsyncEventArgs(unsafeSuppressExecutionContextFlow: true);
receiveEventArgs.SetBuffer(buffer);
receiveEventArgs.Completed += RecvEventArg_Completed;

// returns true when the operation is pending and Completed will fire;
// false means it completed synchronously
acceptSocket.ReceiveAsync(receiveEventArgs);

void RecvEventArg_Completed(object? sender, SocketAsyncEventArgs e)
{
    // consume e.MemoryBuffer / e.BytesTransferred here
}

When a read event occurs for a specific socket, the event loop triggers the registered callback via the SocketAsyncEventArgs.Completed event. By default, ReceiveAsync captures the execution context so it can be restored before the callback runs. For improved performance, you can suppress this capture when it's not needed by passing unsafeSuppressExecutionContextFlow: true to the SocketAsyncEventArgs constructor.

For asynchronous overloads like ReceiveAsync that do not take a SocketAsyncEventArgs parameter, .NET internally reuses a cached SocketAsyncEventArgs object. These overloads return a ValueTask, which completes once the underlying operation finishes and the internal SocketAsyncEventArgs is released for reuse.
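A minimal sketch using the Memory-based ReceiveAsync overload, which returns a ValueTask<int> (the wrapper method name is my own):

```csharp
using System;
using System.Net.Sockets;
using System.Threading.Tasks;

// Await-based receive: the runtime resumes this method when the event loop
// observes data on the socket; no thread blocks while waiting.
static async Task<int> ReceiveOnceAsync(Socket socket, Memory<byte> buffer)
{
    int received = await socket.ReceiveAsync(buffer, SocketFlags.None);
    return received; // 0 indicates the peer closed the connection
}
```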

SocketAsyncEventArgs Pool

Reusing SocketAsyncEventArgs objects from a pool can improve performance by avoiding the overhead of allocating a new instance for each socket operation. However, pooling should be used carefully: if a pooled object's memory is no longer resident in the CPU cache, accessing it incurs cache misses, and in some cases allocating a fresh object can actually be faster than reusing a cold one.

The key is to measure and balance the trade-offs between pooling and performance, ensuring the pool enhances rather than hinders efficiency.
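A minimal pool can be built on a ConcurrentQueue. This is an illustrative sketch, not the repository's implementation; production pools typically also bound their size:

```csharp
using System.Collections.Concurrent;
using System.Net.Sockets;

// Simple pool: rent an existing SocketAsyncEventArgs when one is available,
// otherwise allocate a new instance; return it after the operation completes.
class SocketAsyncEventArgsPool
{
    private readonly ConcurrentQueue<SocketAsyncEventArgs> _pool = new();

    public SocketAsyncEventArgs Rent() =>
        _pool.TryDequeue(out var args) ? args : new SocketAsyncEventArgs();

    public void Return(SocketAsyncEventArgs args)
    {
        args.AcceptSocket = null; // drop per-connection state before reuse
        _pool.Enqueue(args);
    }
}
```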

Inline Completion

In .NET, socket continuations are usually dispatched to the ThreadPool from the event thread to prevent blocking the event handling loop. However, by setting PreferInlineCompletions to true, continuations can be executed directly on the event thread, reducing the overhead of dispatching to the ThreadPool.

By default, PreferInlineCompletions is set to false. You can enable inline completions by setting the DOTNET_SYSTEM_NET_SOCKETS_INLINE_COMPLETIONS environment variable to 1.

void HandleEvents(Event[] events)
{
    foreach (var ev in events)
    {
        if (PreferInlineCompletions)
        {
            // run the continuation directly on the event thread
            ev.Callback();
        }
        else
        {
            // dispatch to the thread pool so the event loop stays free
            ThreadPool.UnsafeQueueUserWorkItem(ev.Callback, preferLocal: false);
        }
    }
}

Socket File Descriptors

Each time an application creates a new socket on Linux, the operating system generates a file descriptor, an integer that represents the created socket. In .NET, you can access this file descriptor to make direct system calls, but this approach requires caution. The file descriptor becomes invalid once the socket is closed, and using it after that can lead to errors or undefined behaviour.

In .NET, the SafeSocketHandle property of a Socket contains this file descriptor. To access its value, you can use the DangerousGetHandle() method of SafeSocketHandle.
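For example (the handle is only valid while the socket is open):

```csharp
using System;
using System.Net.Sockets;

var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);

// On Linux this is the socket's file descriptor; it must not be used
// after the socket is disposed.
IntPtr fd = socket.SafeSocketHandle.DangerousGetHandle();
Console.WriteLine($"File descriptor: {fd}");

socket.Dispose();
```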

Event-Based Heartbeat Server

Now that we’ve covered the key concepts and building blocks, you can see everything in action in the event-based server I’ve implemented. This server handles heartbeat requests and is fully configurable — you can enable or disable features like inline completions or socket pooling using the corresponding option values.

To get started, clone the repository and run the server using the following command:

cd src
dotnet run -c Release --InlineCompletions false --SocketPolling false

Once you run the server, you should see the following output in your terminal, indicating that it’s ready for action:

Server started
ServerOptions: Port=9096, Address=0.0.0.0, MaxRequestSizeInByte=512, InlineCompletions=False, SocketPolling=False

At this point, the server is live, and you’re ready to run your benchmarks to test its performance!

Benchmarking the TCP Server

To ensure our TCP server performs well, we need to measure its throughput and latency. We’ll compare its performance against a baseline, the same Heartbeat server built using .NET Kestrel. Kestrel is highly optimised in the .NET ecosystem, and while our server is purpose-built, this comparison helps us verify that we’re using sockets efficiently.

For benchmarking, we used Bombardier, a widely used fast HTTP benchmarking tool also employed by the .NET team in the Crank project to measure Kestrel’s performance.

To establish the baseline, I created a simple ASP.NET application exposing a minimal endpoint for handling heartbeat messages.

Our tests will make HTTP requests to both the heartbeat server and the baseline Kestrel server. Since IoT devices establish new TCP connections for each heartbeat, we need to simulate this behavior by ensuring each test request creates a new TCP client connection. Unlike Kestrel, which supports connection reuse (keep-alive), our server closes the connection after every request.

Here’s the command we used to run the benchmark:

~$ bombardier -c 32 -m PUT -H "Device-Id: 1234" --http1 -a 127.0.0.1:9096

Bombarding http://127.0.0.1:9096 for 10s using 32 connection(s)
[==================================================================================================================] 10s
Done!
Statistics        Avg       Stdev        Max
  Reqs/sec     10114.69    1085.95   12151.48
  Latency        3.13ms     0.96ms    20.28ms
HTTP codes:
  1xx - 0, 2xx - 101093, 3xx - 0, 4xx - 0, 5xx - 0
  others - 0
Throughput:    1.84MB/s

While the default output is shown above, in our actual benchmarking, we used JSON output for easier parsing and charting.

Benchmark Results

The benchmark was run 35 times, with the first 5 iterations discarded as warm-up. We tested with varying numbers of connections: 1, 16, 32, 128, and 256 concurrent connections.

In all scenarios, the custom heartbeat server outperformed the baseline Kestrel server, both in terms of latency and throughput (requests per minute). This demonstrates that our event-based server, despite being purpose-built, handles high connection loads more efficiently than the baseline, proving that our socket implementation is well-optimised.

The following chart provides a comprehensive overview of the results:

Wrap Up

In this post, we walked through building a high-performance TCP server using .NET's Socket class to handle heartbeat messages from IoT devices. We took a deep dive into socket handling, connection management, concurrency models, and efficient I/O techniques like event-driven multiplexing. We explored the custom TCP server with both blocking and non-blocking I/O, along with advanced techniques such as thread pooling and inline completions. Finally, we benchmarked the server against the highly optimised Kestrel server and showed how a well-crafted, event-driven architecture can outperform it in handling high connection loads while maintaining low latency.

Next Step: Moving forward, we can further explore the data handling aspect by examining efficient ways to manage and consume buffers. This will allow for more optimised processing of incoming data, enhancing the server’s overall performance.
