Proof Engineering: FIX Gateways

Prerak Sanghvi
Published in Proof Reading
20 min read · Sep 24, 2021

How we use the FIX Protocol within our Algorithmic Trading Platform

This is a two-part technical article about FIX and FIX gateways. FIX stands for Financial Information eXchange and is the lingua franca of most electronic trading systems. The first part of this article is a general description of FIX and FIX gateways, while the second part details how we modified QuickFIX/J to work with our sequenced-stream-based architecture, such that our gateways can recover state without using disk-based FIX logs.

This is the third installment of our technology series on how we built a high-performance trading system in the cloud:

The Algorithmic Trading Engine
The Message Bus
FIX Gateways ← this article

Table of Contents

Part 1 — The FIX Protocol
- FIX Message Structure
- FIX Session Layer
- FIX Application Layer
- FIX Domain Model (aka Data Dictionary)
- FIX Alternatives
- FIX Gateways
Part 2 — Proof Modifications to QuickFIX/J
Concluding Thoughts

Part 1 — The FIX Protocol

FIX was created in the 1990s and quickly gained popularity with a wide range of participants and asset classes. There are many versions of FIX, and the protocol has changed in some significant ways in the later versions. Yet, the most popular version of FIX in use today is FIX 4.2 (at least in US Equities), which initially came out back in 2000 (the widely used errata revision followed in 2001).

A lot of the technology/messaging purists only grudgingly accept that FIX essentially runs our industry. They appreciate it because it drastically simplifies the job of communicating with counterparties across the street, but they also kinda hate it because it “mixes the layers”. You see, FIX is an all-in-one protocol/standard.

Here are all of the things that FIX defines:

  • Message Encoding
  • Session Layer Messaging Protocol
  • Application Layer Messaging Protocol
  • Domain Model (aka Data Dictionary) for Orders, Executions, Market Data, Allocations, etc.

While this co-mingling of layers infuriates purists, I think this simplicity is also the reason for its dominance in the industry. Once you adopt FIX, you don’t really need to refer to any other protocol or data dictionary. Even when FIX doesn’t quite fit the bill, you can make it work because customization and extensibility are built natively into the protocol definition.

Let’s take a quick look at each of these layers. Unless otherwise specified, the following discussion applies to FIX 4.2 and FIX 4.4, the two most popular versions.

FIX Message Structure

FIX is an ASCII-encoded key-value pair protocol, where keys are integers, referred to as tag numbers. FIX messages have a standard header and a standard trailer, and the fields are separated by the SOH character (i.e. ASCII code 1, or Ctrl-A).

A FIX message always starts with tags 8 (BeginString, aka FIX Version), 9 (BodyLength), and 35 (MsgType), in that order. And the message always ends with tag 10 (CheckSum). Thus, there isn't much in the way of message framing other than tags 8 and 10. There is no such thing as a record or a message separator. Not only that, all messages have the same structure, and tag 35 (MsgType) ends up being the load-bearing tag; depending on the type of the message, different parts of the software application will be responsible for processing the message.

A sample FIX message looks like this (where spaces are really the SOH character):

8=FIX.4.2 9=65 35=A 34=1 49=PROOF 52=20210907-12:00:56.460 56=TEST 98=0 108=30 10=009
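To make the framing rules concrete, here is a minimal sketch of how tags 8, 9, and 10 wrap a message body. BodyLength (tag 9) counts the bytes after the BodyLength field up to and including the SOH that precedes the CheckSum field, and CheckSum (tag 10) is the sum of all preceding bytes modulo 256, written as three digits. The class and method names are made up for this illustration and do not come from any FIX engine.

```java
import java.nio.charset.StandardCharsets;

final class FixFraming {
    static final char SOH = '\u0001';

    /** Sum of all bytes modulo 256, rendered as three digits (the value of tag 10). */
    static String checksum(String data) {
        int sum = 0;
        for (byte b : data.getBytes(StandardCharsets.US_ASCII)) {
            sum += b & 0xFF;
        }
        return String.format("%03d", sum % 256);
    }

    /** Wraps a body (starting at tag 35, every field SOH-terminated) with tags 8, 9 and 10. */
    static String frame(String beginString, String body) {
        int bodyLength = body.getBytes(StandardCharsets.US_ASCII).length;
        String head = "8=" + beginString + SOH + "9=" + bodyLength + SOH + body;
        return head + "10=" + checksum(head) + SOH;
    }
}
```

For example, FixFraming.frame("FIX.4.2", "35=0" + FixFraming.SOH) produces a five-byte body, so the result carries 9=5 and a checksum of 161.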

FIX Session Layer

FIX is a point-to-point, sequenced, session-oriented protocol. This means that a FIX session must be established before any sequenced messages may be exchanged. However, sequenced messages must be exchanged in order to establish a FIX session (you see what I mean by “FIX mixes the layers”?).

FIX is meant to be agnostic to the underlying transport, although, in practice, it is almost always implemented over a long-running TCP connection, where the lifecycle of the FIX session coincides with the lifecycle of the TCP connection. The use of FIX sequence numbers over TCP is seemingly redundant, given that TCP itself is a sequenced, session-oriented protocol, although that’s not a fair complaint, since FIX’s sequence numbers are meant to survive TCP disconnects.

Each FIX session starts with a Logon message (35=A) and usually ends with a Logout message (35=5). There are two parties to each FIX session - one is designated as the initiator, and the other as the acceptor. Once the TCP connection is established, the initiator sends a Logon message, and the acceptor responds back with its own Logon message. If both of these messages are accepted, the FIX session is established.

FIX also provides mechanisms for recovering lost messages through the use of sequence numbers. Each party keeps track of its own outgoing sequence number, but also an expected incoming sequence number from the counterparty. If a party sends a Logon with an unexpectedly high sequence number, the counterparty will issue a Resend Request (35=2) message with the sequence numbers it is missing. In response to this, the first party should respond back with the missing messages, or else, it can send a Sequence Reset (35=4) indicating that there are no "interesting" messages to resend (this could happen if all of the missed messages are session-level messages from a previous session that are irrelevant in this session).
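The receiving side of this logic can be sketched as a small decision function. This is an illustration of the rules described above rather than code from a FIX engine; the class name and the Action values are invented for the example. A real engine would also queue the out-of-order message and build the actual Resend Request (BeginSeqNo in tag 7, EndSeqNo in tag 16).

```java
final class InboundSequence {
    enum Action { PROCESS, SEND_RESEND_REQUEST, IGNORE_DUPLICATE, LOGOUT_SEQ_TOO_LOW }

    private int expected = 1; // next MsgSeqNum (tag 34) we expect from the counterparty

    Action onMessage(int msgSeqNum, boolean possDupFlag) {
        if (msgSeqNum == expected) {
            expected++;
            return Action.PROCESS;
        }
        if (msgSeqNum > expected) {
            // Gap detected: everything from `expected` onward must be requested again.
            return Action.SEND_RESEND_REQUEST;
        }
        // Lower than expected: harmless only if flagged as a possible duplicate (43=Y).
        return possDupFlag ? Action.IGNORE_DUPLICATE : Action.LOGOUT_SEQ_TOO_LOW;
    }

    int expected() {
        return expected;
    }
}
```

Note that the expected number only advances when a message arrives in order, which is what lets the counterparty's resend fill the gap later.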

Once a FIX session is established and message recovery has been completed as necessary, the parties send periodic Heartbeat (35=0) messages to indicate to each other that they're still connected. If one of the parties hasn't received a Heartbeat for a certain amount of time, it can issue a Test Request (35=1) to see if the counterparty is still connected and will respond to this probe. If the counterparty does not respond in a timely manner, the session will likely be terminated (either by using a Logout message or just disconnecting at the TCP level).
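A rough sketch of that liveness check follows. The thresholds are assumptions chosen for illustration (probe after one silent heartbeat interval, give up after two); real engines use their own multipliers, and the names here are hypothetical.

```java
final class HeartbeatMonitor {
    enum Action { NONE, SEND_TEST_REQUEST, DISCONNECT }

    // Assumed thresholds for illustration: probe after one silent interval, give up after two.
    static Action check(long silenceMillis, int heartBtIntSeconds, boolean testRequestPending) {
        long interval = heartBtIntSeconds * 1000L;
        if (silenceMillis > 2 * interval) {
            return Action.DISCONNECT;
        }
        if (silenceMillis > interval && !testRequestPending) {
            return Action.SEND_TEST_REQUEST;
        }
        return Action.NONE;
    }
}
```

Here heartBtIntSeconds corresponds to the HeartBtInt value (tag 108) agreed in the Logon message, such as the 108=30 in the sample message earlier.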

Lastly, if a party encounters a malformed or an otherwise unacceptable message from the counterparty, it can use a session-level Reject (35=3) message to indicate the same.

FIX Application Layer

Once a FIX session is established and the sequence negotiations have been completed, application-layer messages may be sent. I won’t go through every single message type available, but below I’ll talk about some of the messages involved in order processing. There are other message types available for representing market data, clearing/settlement information, and even freeform news.

Let’s take a look at a typical lifecycle of an order (in FIX 4.2):

  • An order is typically placed using a New Order Single message (35=D)
  • The order is acknowledged or rejected using an ExecutionReport message (35=8) with OrdStatus=New (39=0) or OrdStatus=Rejected (39=8)
  • Assuming the order is accepted, it may receive a partial fill, which is again represented using an ExecutionReport with OrdStatus=Partially filled (39=1). If the order is fully filled, the message will contain OrdStatus=Filled (39=2).
  • A request to amend the order parameters is sent as an Order Cancel/Replace Request message (35=G). [A pet peeve of mine is the use of "Cancel/Replace" when they just mean "Replace". Not only is it a mouthful, it is also easy to confuse with an Order Cancel Request.]
  • An amendment request is accepted, again using an ExecutionReport message, by specifying ExecType=Replaced (150=5).
  • A request to cancel the order is sent as an Order Cancel Request message (35=F).
  • Acceptance of the order cancelation is sent using, again, an ExecutionReport message with OrdStatus=Canceled (39=4).
  • An amendment or a cancelation request can be rejected using an Order Cancel Reject message (35=9)

FIX Domain Model (aka Data Dictionary)

As you can see above already, FIX provides a model for how to think about and represent trading-related messages. Besides the sequence of messages, it also codifies the content of these messages.

For example, all order-related messages carry an id called the ClOrdID (tag 11), which must be unique for each "request" (where request = order/amendment/cancel). If an order is replaced, the replace request should carry a new ClOrdID, and the original ClOrdID of the order being replaced should be referred to in the OrigClOrdID field (tag 41). For some of the fields, such as OrdStatus (tag 39), FIX even provides the list of acceptable values. For example, here is the complete list of order status values. Of course, an order may have multiple logical statuses: for example, it may be Partially filled (39=1), but also Pending Cancel (39=6). In such cases, FIX also prescribes the precedence for order statuses (in this example, Pending Cancel status has higher precedence than Partially filled).
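The chaining of tag 11 and tag 41 can be sketched as a small lookup that follows every request back to the order it belongs to. This is only an illustration with made-up names; it assumes ids are unique within a single session and ignores rejected requests.

```java
import java.util.HashMap;
import java.util.Map;

final class ClOrdIdChain {
    // Maps every id seen in tag 11 to the id of the original order it belongs to.
    private final Map<String, String> rootByClOrdId = new HashMap<>();

    /** Registers a New Order Single; returns false if the id in tag 11 was already used. */
    boolean onNewOrder(String clOrdId) {
        return rootByClOrdId.putIfAbsent(clOrdId, clOrdId) == null;
    }

    /** Registers a replace or cancel request: tag 11 is the new id, tag 41 names the prior request. */
    boolean onAmendOrCancel(String clOrdId, String origClOrdId) {
        String root = rootByClOrdId.get(origClOrdId);
        if (root == null || rootByClOrdId.containsKey(clOrdId)) {
            return false; // unknown order, or a reused id
        }
        rootByClOrdId.put(clOrdId, root);
        return true;
    }

    String rootOf(String clOrdId) {
        return rootByClOrdId.get(clOrdId);
    }
}
```

With this, an order A replaced by B and then canceled by request C resolves rootOf("C") back to "A".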

Aside from the usual order attributes such as Symbol (tag 55), Side (tag 54), OrderQty (tag 38), and Price (tag 44), FIX specifies tags for other order attributes such as OrdType (tag 40), ExecInst (tag 18), MinQty (tag 110), and MaxFloor (tag 111). I won't go into details of each tag, because if you're familiar with equity trading, the names should be self-explanatory (you can browse the entire list of order attributes here).

The list of order attributes supported by FIX is usually rich enough to represent even the most complex order types, but for cases where it is not, FIX provides an easy solution: “custom tags” (aka User Defined Fields). As the name implies, you can just create your own tags/fields, as long as your counterparty agrees to the same tag numbers and values. And indeed, most FIX specifications (including ours) use a whole host of custom tags. Some of these tags are so common that they’ve become a “standard” in their own right. As an example, order executions are often reported with “liquidity indicators”, which are essentially codes that identify the conditions under which the order was traded. FIX 4.2 does not include this concept at all, but FIX 4.4 sort of does, and almost all FIX 4.2 implementations have adopted the LastLiquidityInd (tag 851) from FIX 4.4. Additionally, even tag 851 doesn't really suffice, so most firms using FIX 4.2 also end up using tags 9730 and 9882 to represent the raw liquidity codes (the actual codes are still unfortunately non-standard and specific to each trading venue).

This built-in extensibility is extremely powerful. Different segments of the trading community have been able to create their own set of “standard tags” to meet their needs. This mechanism of designating custom tags has even been formalized by a “Governance Board” that helps reserve tags for specific use.

FIX Alternatives

The common complaint of “FIX mixes the layers” was addressed finally in FIX 5.0, which separates the FIX session layer from the FIX application layer. At least in Equities though, it has not gained any traction.

In terms of alternatives, nearly every exchange accepts orders over a binary order entry protocol. These protocols solve various issues such as performance (FIX is computationally intensive to encode/decode), verbosity (FIX is text-based), session-layer complexity (if you assume TCP is the underlying protocol, you can simplify some of the session layer mechanics), or novel/custom interactions (e.g. conditional orders). However, all instances of these protocols remain one-off and exchange-specific, and no real consensus or standard has emerged. Also, in every instance, the actual content of the messages in such protocols is very much inspired by the FIX Data Dictionary.

FIX itself has a binary standard called “Simple Binary Encoding” (SBE). We actually use this at Proof, but as a general-purpose binary encoding mechanism. It is extremely fast, but it only addresses the message representation and encoding/decoding of messages, not what fields should be included in any particular message. For example, as far as I know, there is no way to represent a New Order Single (NOS) message in SBE such that all possible tags/fields are part of the message (without blowing up the message size to over 2KB and losing all performance benefits). You have to pick and choose the exact fields you want on all of your NOS messages, and if there is a field that is infrequently used, you have to make a decision of whether you want to include the field on every NOS or not (and there are consequences either way). I can see why it was designed this way (fixed-position fields provide fast random access within the message), but SBE can't be considered functionally equivalent to regular FIX, which is infinitely flexible and does not constrain you in the tags you can use in a message.
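To illustrate the trade-off, here is a toy fixed-layout encoding written with a plain ByteBuffer. It is not SBE and does not use any SBE tooling; the layout, offsets, and names are invented for the example. The point is that reading a field is a single offset lookup, while adding an optional field means changing the layout for every message.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

final class FixedLayoutOrder {
    // Hypothetical layout: every field lives at a fixed offset, so there is no room for optional tags.
    static final int SYMBOL_OFFSET = 0;  // 8 bytes, ASCII, space-padded
    static final int SIDE_OFFSET = 8;    // 1 byte, same codes as FIX tag 54
    static final int QTY_OFFSET = 9;     // 4-byte int
    static final int PRICE_OFFSET = 13;  // 8-byte long, price in 1/10000ths
    static final int LENGTH = 21;

    static ByteBuffer encode(String symbol, char side, int qty, long priceE4) {
        ByteBuffer buf = ByteBuffer.allocate(LENGTH).order(ByteOrder.LITTLE_ENDIAN);
        byte[] padded = String.format("%-8s", symbol).getBytes(StandardCharsets.US_ASCII);
        buf.put(padded, 0, 8); // relative put at position 0; longer symbols are truncated
        buf.put(SIDE_OFFSET, (byte) side);
        buf.putInt(QTY_OFFSET, qty);
        buf.putLong(PRICE_OFFSET, priceE4);
        return buf;
    }

    // Random access: each getter reads directly at its offset without scanning the message.
    static int qty(ByteBuffer buf) {
        return buf.getInt(QTY_OFFSET);
    }

    static long priceE4(ByteBuffer buf) {
        return buf.getLong(PRICE_OFFSET);
    }

    static String symbol(ByteBuffer buf) {
        byte[] raw = new byte[8];
        for (int i = 0; i < raw.length; i++) {
            raw[i] = buf.get(SYMBOL_OFFSET + i);
        }
        return new String(raw, StandardCharsets.US_ASCII).trim();
    }
}
```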

Lastly, certain online trading firms provide a REST API (example here) for entering and interacting with orders. The data model for these APIs is often derived from FIX, although the protocol/paradigm is certainly different. These do a great job of making the trading interfaces accessible to a wider audience but aren’t typically used in performance-sensitive scenarios, where persistent connections and compact representations are favored over ease-of-use and interoperability.

FIX Gateways

Typical Functions

FIX Gateways are typically used to bridge external FIX communications with a firm’s internal systems. These internal systems may use FIX as well for inter-component communication, but more often, there is a different FIX-like protocol (binary or otherwise). FIX Gateways then provide the following functions:

  • Translation: They translate FIX to any internal protocols and vice-versa. For instance, the ClOrdID values sent by the clients are typically translated into an internal value. There are multiple reasons to do this, the primary one being that this tag value may not be unique across clients or even different FIX sessions of the same client.
  • Normalization: Different counterparties may speak slightly different dialects of FIX, and the gateway usually normalizes between these dialects. For example, every exchange has a slightly different way of designating a midpoint-pegged order. The FIX gateway to each exchange would be responsible for taking care of the venue-specific nuances in most cases. The same applies to symbology in incoming orders: some clients may send exchange tickers, while others may send CUSIP, SEDOL, ISIN, etc.
  • Validation: If invalid values for fields are detected, the gateway can flag the problem and either reject the message at the business level (e.g. order reject) or at the session level. Most FIX engines provide a data-dictionary-level validation of tags, where required tags and acceptable values for tags are automatically checked against the FIX 4.2/4.4 data dictionary. Aside from this though, there may be further logical validations (e.g. for a Short Sell order, you may require that the client obtain locates elsewhere and set LocateReqd=N (114=N)).
  • Transformation: For gateways that interact with lots of counterparties with lots of different dialects, there is often a whole rules-based transformation engine that helps with the normalization. For example, in an algorithmic trading engine, if the algo suite supports 30 parameters, and there are multiple versions of these parameter sets over time, you may need a transformation engine to manage the conversion of parameters from older versions to the most recent version.
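As a small illustration of the translation function, the sketch below assigns an internal id to each (session, client order id) pair, so that the same client-supplied id arriving on two sessions does not collide. The names and the choice of a numeric internal id are assumptions made for the example, not a description of our gateway.

```java
import java.util.HashMap;
import java.util.Map;

final class ClOrdIdTranslator {
    private final Map<String, Long> internalByExternal = new HashMap<>();
    private final Map<Long, String> externalByInternal = new HashMap<>();
    private long nextInternalId = 1;

    /** Returns a new internal id for (session, client id), or -1 if that pair was already seen. */
    long assign(String sessionId, String clOrdId) {
        String key = sessionId + '|' + clOrdId; // sketch only: assumes '|' never appears in a session id
        if (internalByExternal.containsKey(key)) {
            return -1;
        }
        long id = nextInternalId++;
        internalByExternal.put(key, id);
        externalByInternal.put(id, clOrdId);
        return id;
    }

    /** Reverse lookup, needed to echo the client's own id on outgoing responses. */
    String externalFor(long internalId) {
        return externalByInternal.get(internalId);
    }
}
```

The -1 return value doubles as a duplicate check, which is one reason a gateway may need to remember every id it has seen.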

Typical Challenges

Below are some of the typical challenges that a FIX gateway developer/operator may face:

  • Server affinity: FIX Gateways often tend to get confined to the server they are started on. Firstly, they run a persistent TCP connection, so it is difficult to move them around without disrupting the connection/session. Secondly, if they’re running in the “acceptor” mode, they are dependent on the server having the correct IP address assigned. This is often addressed by using a load-balancer or a virtual IP of some sort in front of the FIX gateways so that the connections may be directed to the appropriate instance. Thirdly, most FIX engines write disk-based FIX logs that are needed for sequence number negotiation. If the FIX engine is moved to a different server, it wouldn’t have access to these FIX logs and would not be able to negotiate a connection. This is often solved by using a shared file system like NFS (see next point).
  • High availability: Most trading systems are created with high availability as a design goal, where if a particular component dies, the functions fail over to a backup instance (automatically or manually). FIX Gateways keep local session state (the aforementioned FIX logs), which makes this challenging. Some ways of sharing this session state are shared file systems like NFS or even databases. The downside of this approach is that making reliable writes to NFS or databases can be extremely slow and prohibitive in terms of latency. The next section of this article discusses how we solved this problem by integrating QuickFIX/J (the most popular open-source FIX engine) with our sequenced stream architecture.
  • Performance: As discussed, FIX is an ASCII-based key-value pair protocol. This means creating and parsing FIX messages requires a lot of string/character manipulation. This tends to slow things down, and in languages like Java, this is much worse given that strings are immutable and any manipulation creates garbage. A key-value pair structure also does not easily provide random access to fields in the message. This problem has been solved in a few different ways, but essentially it comes down to writing efficient code using lower-level constructs (e.g. using C++, or using pre-allocated Java byte arrays). For incoming messages, a standard technique is to parse the message byte-by-byte and create a tag → value map. This ensures that the FIX message is traversed only once, and conversion from string to native data types (e.g. integer) is done only once as well. For outgoing messages, a standard technique is to pre-create a FIX message template for each outbound message type, with all of the expected tags and placeholders for tag values, and then to quickly replace the designated placeholder bytes with actual data content at send time.
  • State management: As discussed above, FIX gateways often map ClOrdID values to internal ids. This often requires the gateways to keep track of the mapping between internal and external ClOrdID values. If the gateway handles a million different requests, it might need to keep a map of those million ids, which can get memory-intensive and difficult to recover if the gateway is restarted. There are ways to pare this down and only keep track of ids for open or filled orders, but the gateway might still need to know every id it has seen, for the purpose of detecting duplicates. Aside from this, the gateways often end up keeping a bunch of other state such as client/security/account-related information and even the order state for the currently open orders, for the purpose of echoing certain tags on response messages.
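The single-pass parsing technique mentioned under Performance can be sketched as follows. This is a simplified illustration with hypothetical names: it assumes a well-formed message, keeps only the last value of a repeated tag (so repeating groups are not handled), and allocates a String per value, which a latency-sensitive implementation would avoid.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

final class FixParser {
    private static final byte SOH = 1;

    /** Walks the bytes once, splitting on '=' and SOH, and returns tag -> value in wire order. */
    static Map<Integer, String> parse(byte[] msg) {
        Map<Integer, String> fields = new LinkedHashMap<>();
        int tag = 0;
        int valueStart = -1; // -1 while we are still reading the tag digits
        for (int i = 0; i < msg.length; i++) {
            byte b = msg[i];
            if (valueStart < 0) {
                if (b == '=') {
                    valueStart = i + 1;
                } else {
                    tag = tag * 10 + (b - '0'); // tag number converted to int exactly once
                }
            } else if (b == SOH) {
                fields.put(tag, new String(msg, valueStart, i - valueStart, StandardCharsets.US_ASCII));
                tag = 0;
                valueStart = -1;
            }
        }
        return fields;
    }
}
```

Because the value scan only stops at SOH, an '=' character inside a value does not confuse the parser.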

Part 2 — Proof Modifications to QuickFIX/J

This part of the article is for those brave souls who’ve dared to glance into the belly of the FIX Engine beast and lived to tell the tale. Or perhaps those who haven’t dared yet, but would like to hear, from a safe distance, our story of how we modified QuickFIX/J to integrate into our sequenced stream architecture.

If you don’t already know, QuickFIX/J is the Java version of QuickFIX, the most popular open-source FIX engine. There’s also a QuickFIX/N for .NET lovers, and QuickFIX/Go, if golang is your thing. QuickFIX/J is not the fastest FIX Engine available, but it is one of the only free engines with a Java API, and arguably has the most complete FIX implementation (although, this is starting to look really good).

Background

The FIX protocol is a sequenced message protocol, in the sense that the messages sent using this protocol are assigned incrementing sequence numbers. When two parties communicate using the FIX protocol, there are a total of 4 sequence numbers being tracked. Each party keeps track of two sequence numbers: 1) its sender sequence number, and 2) the counterparty’s sequence number, also known as the target sequence number.

The most basic responsibility of any FIX application is to keep track of its own sender sequence number. If a FIX application loses track of the target sequence number, it is possible to recover the connection without any loss of messages by using standard resend mechanisms (see above). However, if the sender sequence number is lost, there is no hope — no recovery is possible, and this is considered a fatal error with the connection. Even if both parties agree to recover the session by manually resetting the sequence numbers, there is a significant potential for message loss.

Most FIX applications use a FIX engine library (such as QuickFIX/J, B2BITS, OnixS, Chronicle FIX, or Philadelphia), which typically handles the session-level concerns of the FIX protocol. This includes persistence and recovery of the FIX session state (e.g. sender sequence numbers) using file-based stores. In addition, the library often sends administrative (aka session-level) messages to the counterparty on its own, either in response to incoming messages or because of the passage of time (e.g. Heartbeats).

Problem Statement

We would like to create a highly available FIX application while continuing to use QuickFIX/J as the FIX engine library for convenience. In more technical terms, we would like to be able to replicate or recreate the local file-based stores used by the FIX engine on a backup server.

Possible Solutions

There are a few brute force methods for achieving this replication of file-based stores:

  1. Use a shared file system like NFS. This would cause high latencies as data is stored over the network. Since we do not intend to use NFS or such for any other production application, this would mean that the FIX applications would have special infrastructure needs.
  2. Use a network-based synchronous replication mechanism. This would require a backup instance of the application to be running at all times, which would need to be discovered by the primary instance, and a network connection would need to be maintained between the two. If the connection between the instances fails, there would need to be a recovery mechanism.
  3. Use file system (block-level or file-level) replication tools for file copy. This would be an asynchronous copy, which means it may not work reliably. And again, this would create special infrastructure and deployment hassles.

We reject the above solutions and propose that the application be written in such a way that the store can be recreated simply by replaying the inputs into the application. This would allow multiple instances of an application to run in parallel, consume the same inputs and create the same session store.
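The idea can be sketched with a toy session store whose state is a pure function of the inputs it has consumed. Everything here is a simplification made up for illustration (a real store also holds the sent messages themselves, and the stream entries carry actual payloads), but it shows why two instances reading the same stream end up with the same sequence numbers.

```java
import java.util.List;

final class ReplayableSessionStore {
    /** The two kinds of stream entries that matter for this toy example. */
    enum Input { INBOUND_ACCEPTED, OUTBOUND_SENT }

    private int nextSenderSeqNum = 1;
    private int nextTargetSeqNum = 1;

    // State changes only as a result of applying a stream input, never from timers or local decisions.
    void apply(Input input) {
        if (input == Input.INBOUND_ACCEPTED) {
            nextTargetSeqNum++;
        } else {
            nextSenderSeqNum++;
        }
    }

    /** Rebuilds a store from scratch by replaying the recorded inputs in order. */
    static ReplayableSessionStore replay(List<Input> stream) {
        ReplayableSessionStore store = new ReplayableSessionStore();
        for (Input input : stream) {
            store.apply(input);
        }
        return store;
    }

    int nextSenderSeqNum() {
        return nextSenderSeqNum;
    }

    int nextTargetSeqNum() {
        return nextTargetSeqNum;
    }
}
```

A backup instance that calls replay on the same stream reaches the same numbers as the primary, with no file copy involved.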

Session-Level Messages Get In The Way

The main problem with this approach is that at least some of the information in the store comes directly from the FIX engine library and it is not possible to recreate this information from the inputs of the application.

Ideally, we would change the FIX engine to generate new messages exclusively through application callbacks. In other words, instead of generating and sending new messages on its own, the FIX engine would generate the new messages, but not send them, and instead hand off these messages to the application. The application would then ensure that the messages are sent out after they have been recorded on the system sequenced stream. We call this “roundtripping the message” through the system, and we can call such messages “deferred session-level messages”.

Session-Level Messages

If we want to ensure that we intercept all session messages generated by the library, we need to understand the circumstances under which each session-level message type is generated and the impact of not sending it immediately. Incidentally, the below discussion is valid for all versions of FIX from 4.0 to 5.0.

Logon (35=A): For initiators, this is sent based on a timer. For acceptors, it is sent in response to an incoming Logon.

  • We need to ensure that the session is not marked as fully logged on until the Logon is fully sent
  • Any message replay resulting from tag 789 (NextExpectedMsgSeqNum) in the incoming Logon must be suppressed until Logon is sent

Heartbeat (35=0): Sent based on a timer, as well as in response to a Test Request. No special considerations when deferring.

Test Request (35=1): Sent based on a timer. No special considerations when deferring.

Resend Request (35=2): Sent in response to an incoming message. No special considerations when deferring.

Session Reject (35=3): Sent in response to an incoming message. No special considerations when deferring.

Sequence Reset (35=4): Sent under two circumstances:

  • Gap Fill mode: Allows skipping over certain messages during normal resend processing
  • Reset mode: Allows reestablishing a FIX session after an unrecoverable application failure (afaik, QuickFIX/J does not use this)

There are no special considerations when deferring. Sequence Reset messages do not “burn” a sequence number, so as such they do not even need to be intercepted. In Gap Fill mode, a Sequence Reset message always uses the sequence number of a previously transmitted message, with an indicator for what the next sequence number should be.
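As an illustration, the fields that distinguish a Gap Fill can be assembled like this. The helper is hypothetical and deliberately incomplete: it leaves out the remaining header fields (such as tags 49, 52, and 56) and the framing tags, and only shows how MsgSeqNum (tag 34), PossDupFlag (tag 43), GapFillFlag (tag 123), and NewSeqNo (tag 36) relate to the skipped range.

```java
final class GapFill {
    static final char SOH = '\u0001';

    /**
     * Fields specific to a gap fill that skips messages beginSeqNo..endSeqNo (inclusive).
     * The message reuses beginSeqNo as its own sequence number and announces endSeqNo + 1 as next.
     */
    static String gapFillFields(int beginSeqNo, int endSeqNo) {
        return "35=4" + SOH
                + "34=" + beginSeqNo + SOH
                + "43=Y" + SOH
                + "123=Y" + SOH
                + "36=" + (endSeqNo + 1) + SOH;
    }
}
```

For example, skipping messages 5 through 9 yields a message numbered 34=5 that tells the counterparty to expect 36=10 next.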

Logout (35=5): Sent under these circumstances:

  • Based on a user request (for example, the app is shutting down)
  • Based on an illegal state (such as when an incoming sequence number is too low, or a garbled message is received)
  • In response to an incoming Logout

There are a couple of tricky things to consider when deferring:

  • The connection should typically be disconnected after sending a Logout in some cases, but not all. The Logout message that is intercepted and roundtripped through the system should include enough information to make a decision on whether to disconnect.
  • While the Logout message is being roundtripped through the system, we must suppress any further messages being transmitted. I didn’t quite figure out how to do this yet (given that we would need to also deal with the sequence numbers for the suppressed messages). For now, we always just disconnect and the deferred Logout is not sent. This is far from ideal, as the Logout message could have important information such as “Sequence number too low. Expected: X”.

Business Message Reject (35=j): Strictly speaking, this isn’t a session-level message, but QuickFIX/J sends it in response to certain types of malformed messages, so we need to account for it. There aren’t any special considerations when deferring these messages.

Key Concepts When Intercepting Messages

1. For an incoming message that triggers an outbound session-level message, we should fully complete the incoming message processing, which can include items such as:

  • Validation: missing sequence number field, incorrect FIX version, data dictionary validation
  • Sequence number validation: Too Low or Too High conditions
  • Increment target sequence number, if the sequence number is equal to the expected target sequence number
  • Collect any session state such as Logon Received, Logon Sent, Logout Received, Logout Sent

2. When handing off the intercepted session-level message to the application, don’t increment the sender sequence number until the message is ready to be transmitted to the counterparty. The key idea is that a sender sequence number can only be “burned” due to some app input from the system.
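The second concept can be sketched with in-memory stand-ins for the sequenced stream and the TCP connection. This is a toy model with invented names, not the code in our fork; the only point is that generating a message does not consume a sequence number, while transmitting it does.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

final class DeferredAdminSender {
    private final Queue<String> sequencedStream = new ArrayDeque<>(); // stand-in for the system stream
    private final List<String> wire = new ArrayList<>();              // stand-in for the TCP connection
    private int nextSenderSeqNum = 1;

    /** Engine hook: the generated message is recorded, not sent, and no sequence number is burned. */
    void defer(String msgType) {
        sequencedStream.add(msgType);
    }

    /** Called once the stream has echoed the message back: only now is tag 34 assigned. */
    void sendNextFromStream() {
        String msgType = sequencedStream.poll();
        if (msgType != null) {
            wire.add("35=" + msgType + "|34=" + nextSenderSeqNum++);
        }
    }

    int nextSenderSeqNum() {
        return nextSenderSeqNum;
    }

    List<String> wire() {
        return wire;
    }
}
```

After defer("0") the sender sequence number is still 1; it becomes 2 only once sendNextFromStream() puts the Heartbeat on the wire.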

Code Repository

For the following description, refer to the forked QuickFIX/J code repository here: https://github.com/prerak-proof/quickfixj

A sample application demonstrating the principle is located here: https://github.com/prerak-proof/quickfixj/tree/master/quickfixj-examples/executor/src/main/java/com/prooftrading/demo

Intercepting Generated Session Messages

  • The quickfix.Session class implements nearly all of the FIX protocol handling.
  • First, we observe that the sender sequence number is incremented using the this.state.incrNextSenderMsgSeqNum() method call, where state is an object of the class quickfix.SessionState.
  • If we search through all usages of this method, we are delighted to find that there is a single invocation to this method in the entire library: Session.sendRaw.
  • Following the sendRaw usages further, we find that all of the generated session-level messages are sent using this method.
  • If we replace the session-message-related usages of sendRaw with a new method sendAdmin, we can install a hook to intercept these messages (the change itself is in the forked repository linked above).
  • Note that for the interception to work, the Application object provided to the library must implement the ApplicationAsyncAdmin interface.
  • Note the addition of the message.isAsyncAdminEligible method, which returns true for all session-level messages (message types A, 0, 1, 2, 3, 4, and 5) and the j (Business Message Reject) message type.
  • Once we successfully intercept the session-level messages and roundtrip them through the system before sending them out, we realize that the processing of certain messages requires special considerations (see above).
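For readers who want a feel for the eligibility check without opening the repository, here is a standalone stand-in. It is not the method from the fork (which lives on the message object); this hypothetical version takes the MsgType (tag 35) value as a string and applies the rule stated above.

```java
final class AdminEligibility {
    // Session-level types A, 0-5, plus j (Business Message Reject), as listed in the article.
    private static final String ELIGIBLE_TYPES = "A012345j";

    /** Returns true if a message with this MsgType (tag 35) should be deferred and roundtripped. */
    static boolean isAsyncAdminEligible(String msgType) {
        return msgType.length() == 1 && ELIGIBLE_TYPES.indexOf(msgType.charAt(0)) >= 0;
    }
}
```

Application messages such as D (New Order Single) or 8 (ExecutionReport) fall through and continue to be sent the usual way.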

Sending Deferred Messages

After a deferred message is roundtripped and ready to be sent, the application calls a method provided by the modified library to transmit it (see the forked repository linked above).

Unit Testing

  • The acceptance test harness was enhanced to also test the library in an async-admin mode. All current tests pass.
  • For the rest of the unit tests, the UnitTestApplication has been enhanced to work in async mode optionally. All but 45 tests pass, and the only ones that fail are the ones that expect responses to messages to be sent out synchronously. All of these tests passed after adding a short Thread.sleep(10) after the session.next() calls, or after adding a CountDownLatch to prevent tests from proceeding until a requested async send operation is complete.

Known issues:

  • If the sender or target sequence numbers are reset either using the API or using the JMX Bean admin functionality, the changes are not represented in a replicated store. ApplicationAsyncAdmin could be enhanced to provide additional callbacks to help defer these administrative actions until after they’ve roundtripped through the system.
  • As discussed earlier, Logout processing is incomplete for the case when a message with a lower-than-expected sequence number is received. This needs additional work.

Concluding Thoughts

This article was an odd mix of a high-level introduction to FIX and a low-level deep dive into a FIX engine. I suspect only one of the sections was useful for any particular reader. However, if you’re the rare kind who already knew everything in the first section, and was interested in the gory details of the second section, please come work with us (doesn’t matter where you’re located)! You can reach us at careers @ prooftrading.com. We are a lean startup and are not afraid to hand out hefty equity grants to those who share our mindset and our mission enough to join us.

For comments, feedback, or suggestions on what topics we should cover in future posts, you can reach me on Twitter: @preraksanghvi
