Rearchitecting the Symbology technology stack towards a new GraphQL API

FactSet
5 min read · Feb 1, 2024

The investment industry trusts Symbology as the glue that binds various applications together to provide a seamless experience. Symbology allows software applications to map symbols to financial instruments and entities. For example, the CUSIP identifier 303075105 denotes the common stock issued by FactSet. Similarly, the Legal Entity Identifier 549300ZSJE7NBK6K9P30 denotes FactSet, the company. Symbology is typically used to translate one symbol into another, for example a CUSIP into a Legal Entity Identifier or vice versa.

This article illustrates the considerations involved in unifying Symbology data into a coherent graph data model. During this effort, various graph databases were tested to host Core Symbology data, but none were efficient and affordable enough. To address evolving industry requirements and replace legacy web services, a custom solution proved to be the way forward. The project involved rearchitecting data storage, API definition, and service transport to use more modern and performant technologies.

Data storage

In order to retain performance, we need to:

  1. Keep data in-memory within the service process.
  2. Look up keys with sub-microsecond latency.
  3. Scale read operations with CPU cores, leveraging parallelization to further speed up request processing.
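The third goal can be sketched with the standard library alone. The example below is a minimal illustration, not the production design: `SymbolTable` and `lookupAll` are hypothetical names, and a `std::unordered_map` stands in for the real store. It relies on one assumption, that the table is never modified while lookups run, so worker threads can share it without locks.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical in-memory table: symbol -> translated symbol.
using SymbolTable = std::unordered_map<std::string, std::string>;

// Resolve every key against a table that is read-only for the duration
// of the call. The keys are split into contiguous chunks, one per thread.
std::vector<std::optional<std::string>> lookupAll(
    const SymbolTable& table,
    const std::vector<std::string>& keys,
    std::size_t workers = std::thread::hardware_concurrency())
{
    std::vector<std::optional<std::string>> results(keys.size());
    if (keys.empty()) return results;

    // hardware_concurrency() may return 0 when the core count is unknown.
    workers = std::clamp<std::size_t>(workers, 1, keys.size());
    const std::size_t chunk = (keys.size() + workers - 1) / workers;
    assert(chunk > 0);

    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t begin = w * chunk;
        const std::size_t end = std::min(begin + chunk, keys.size());
        if (begin >= end) break;
        pool.emplace_back([&table, &keys, &results, begin, end] {
            // Each thread writes only to its own slice of `results`.
            for (std::size_t i = begin; i < end; ++i) {
                const auto it = table.find(keys[i]);
                if (it != table.end()) results[i] = it->second;
            }
        });
    }
    for (auto& t : pool) t.join();
    return results;
}
```

Because no thread ever writes to the table, throughput in a design like this is limited mainly by the number of cores and by memory bandwidth rather than by lock contention.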

Keeping the above goals in mind, RocksDB stood out. Developed and maintained by Facebook’s Database Engineering Team, this embedded key-value data storage solution fits the requirements well. The following are some notable features that are quite useful for Symbology:

  • PlainTable format support, which is optimized for high-performance point lookups.
  • RocksDB’s OpenForReadOnly functionality allows us to further optimize lookup performance by eliminating some locks within its read-only code paths.
  • Built-in support for Bloom filters, to quickly handle the significant number of unrecognizable symbols (i.e. symbols that do not exist in the database).
  • RocksDB allows you to keep the data segregated into different tables using ColumnFamilies.
  • PinnableSlice is similar to std::string_view and allows you to read the values from RocksDB without copying them.
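To show why a Bloom filter helps with unrecognizable symbols, here is a toy version written against the standard library only. It is not RocksDB's implementation: the class name, the sizes, and the hashing scheme (FNV-1a combined with `std::hash` through double hashing) are assumptions chosen for brevity. The property that matters is that a negative answer is always correct, so a lookup for an unknown symbol can stop before touching the table.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string_view>
#include <vector>

// Toy Bloom filter. Sizes are illustrative, not tuned.
class BloomFilter {
public:
    BloomFilter(std::size_t bitCount, std::size_t probeCount)
        : bits_(bitCount, false), probes_(probeCount) {
        assert(bitCount > 0 && probeCount > 0);
    }

    void add(std::string_view key) {
        const Hashes h = hashKey(key);
        for (std::size_t i = 0; i < probes_; ++i) bits_[index(h, i)] = true;
    }

    // false: the key was definitely never added.
    // true:  the key may have been added (false positives are possible).
    bool mayContain(std::string_view key) const {
        const Hashes h = hashKey(key);
        for (std::size_t i = 0; i < probes_; ++i)
            if (!bits_[index(h, i)]) return false;
        return true;
    }

private:
    struct Hashes { std::uint64_t first; std::uint64_t second; };

    static Hashes hashKey(std::string_view key) {
        std::uint64_t fnv = 14695981039346656037ULL;  // FNV-1a offset basis
        for (unsigned char c : key) { fnv ^= c; fnv *= 1099511628211ULL; }
        const std::uint64_t other = std::hash<std::string_view>{}(key);
        return {fnv, other | 1};  // odd step so probes do not collapse to one bit
    }

    std::size_t index(const Hashes& h, std::size_t i) const {
        return static_cast<std::size_t>((h.first + i * h.second) % bits_.size());
    }

    std::vector<bool> bits_;
    std::size_t probes_;
};
```

A caller would check `mayContain` first and only perform the real lookup when it returns true.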

Data Format

RocksDB allows fast lookups. However, the values must also be deserialized with little overhead. FlatBuffers serves that purpose: accessing a FlatBuffers-encoded value is effectively a no-op, so values can be read from the database without any memory allocation or copying. This also makes the application more threading-friendly. The attributes table, for example, is described by a FlatBuffers schema.

The flatc compiler, run with its --cpp option, generates C++ code for reading such buffers. For a schema whose root type is Attributes, the generated code includes (amongst other things) a GetAttributes function that casts a buffer to an Attributes table.

Combining GetAttributes with PinnableSlice allows very low overhead data reads.
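The zero-copy read path can be illustrated without either library. The sketch below is neither FlatBuffers nor RocksDB: it assumes a made-up fixed byte layout, and `AttributesView` and `encodeAttributes` are hypothetical names. What it shows is the general idea of accessors that read directly from the stored bytes, much as a generated accessor would read from a buffer pinned by a PinnableSlice.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <string_view>

// Hypothetical layout (NOT the FlatBuffers wire format):
//   bytes 0..3  entity id, uint32 in host byte order
//   bytes 4..5  name length, uint16 in host byte order
//   bytes 6..   name characters
class AttributesView {
public:
    // The view does not own the bytes; the buffer must outlive it.
    explicit AttributesView(std::string_view buffer) : buf_(buffer) {
        assert(buf_.size() >= kHeader);
        assert(buf_.size() >= kHeader + nameLength());
    }

    std::uint32_t entityId() const { return read<std::uint32_t>(0); }

    // Points into the original buffer; no characters are copied.
    std::string_view name() const { return buf_.substr(kHeader, nameLength()); }

private:
    static constexpr std::size_t kHeader = 6;

    std::uint16_t nameLength() const { return read<std::uint16_t>(4); }

    // memcpy of a few bytes avoids unaligned access; no heap allocation.
    template <typename T>
    T read(std::size_t offset) const {
        T value;
        std::memcpy(&value, buf_.data() + offset, sizeof value);
        return value;
    }

    std::string_view buf_;
};

// Builds sample bytes for the sketch; stands in for a stored value.
std::string encodeAttributes(std::uint32_t id, std::string_view name) {
    const auto length = static_cast<std::uint16_t>(name.size());
    std::string out(6, '\0');
    std::memcpy(&out[0], &id, sizeof id);
    std::memcpy(&out[4], &length, sizeof length);
    out.append(name);
    return out;
}
```

Because the view only holds a pointer and a length, many threads can read the same stored value concurrently without allocating or synchronizing.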

GraphQL API

FactSet offers access to Core Symbology data using a GraphQL API accessible via HTTP(S). GraphQL is the preferred API standard for several reasons:

  • Core Symbology has a well-defined ontology and can easily be translated into a GraphQL type system.
  • GraphQL has a powerful tooling ecosystem with GUIs, editor integrations, code generation and linting.
  • GraphQL allows fetching multiple data items in one request.

You can use cppgraphqlgen to create an introspection-enabled GraphQL API implementation. The library provides a schemagen utility that generates C++ code based on a GraphQL schema. The generated code consists of C++ concepts that can be used to implement the application logic. This approach enables interfacing with the generated code while avoiding tight coupling with it. Please refer to the cppgraphqlgen documentation for more details.

Optimizing JSON handling

JSON is the de facto response format for GraphQL APIs, but it is usually not the most efficient. To make JSON handling more performant, the following optimizations can be made using rapidjson:

  • cppgraphqlgen’s toJSON function makes an unnecessary copy of the JSON buffer after serialization. To avoid that, it can be replaced with a custom StringBuffer implementation that allows the underlying buffer to be moved out.
  • parseJSON parses the request JSON into a new buffer. Replacing it with an implementation that uses rapidjson’s ParseInsitu function, which consumes the input buffer in place without making any copies, improves performance.
  • Enabling SSE-based SIMD optimization for rapidjson in a custom implementation improves performance for whitespace handling.
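The first two optimizations can be sketched in plain C++ without rapidjson. Both pieces below are illustrative assumptions rather than the team's code: `MovableStringBuffer` mimics the shape of a rapidjson-style output stream (a `Ch` type plus `Put` and `Flush`) and hands its contents over by move, while `unescapeInSitu` shows the principle behind in-situ parsing by decoding a few simple escapes inside the input buffer itself.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>

// Output sink whose contents can be moved out instead of copied.
class MovableStringBuffer {
public:
    using Ch = char;
    void Put(Ch c) { data_.push_back(c); }
    void Flush() {}

    // Transfers ownership of the serialized text to the caller.
    std::string release() {
        std::string out = std::move(data_);
        data_.clear();  // leave the buffer empty and reusable
        return out;
    }

private:
    std::string data_;
};

// Decodes \" \\ \/ \n and \t in place. `text` holds the characters
// between the quotes of a JSON string. Decoded text is never longer than
// the escaped text, so it can safely overwrite the input.
// Simplification: \uXXXX is not handled, and any other escaped character
// is kept as-is with the backslash dropped.
std::string_view unescapeInSitu(std::string& text) {
    std::size_t write = 0;
    for (std::size_t read = 0; read < text.size(); ++read) {
        char c = text[read];
        if (c == '\\' && read + 1 < text.size()) {
            const char next = text[++read];
            switch (next) {
                case 'n': c = '\n'; break;
                case 't': c = '\t'; break;
                default:  c = next; break;  // covers \" \\ \/
            }
        }
        text[write++] = c;
    }
    assert(write <= text.size());
    return std::string_view(text.data(), write);  // view into the same buffer
}
```

In both cases the saving comes from the same place: the bytes are produced or decoded once, and ownership or a view is passed along instead of a second copy.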

Web Server

Finally, the GraphQL API can be integrated with proxygen, an open-source web server created by Facebook. Here are some reasons to leverage proxygen:

  • Non-blocking, performant web server that supports HTTP 1.1/2/3.
  • Battle tested by Facebook for almost a decade in a production environment.
  • Supports response compression (using zstd and gzip compression formats) out of the box.
  • Supports chaining of request handlers that allows for a more modular implementation.
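The modularity that handler chaining brings can be shown with a small standalone sketch. Everything here is hypothetical: the `Request`, `Response`, `Handler` and filter types are simplified stand-ins and do not reflect proxygen's actual classes, and the compression step only sets a flag. The point is that each concern lives in its own handler and the chain is assembled at startup.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

struct Request  { std::string path; std::string body; };
struct Response { int status = 200; std::string body; bool compressed = false; };

// Hypothetical handler interface, not proxygen's.
class Handler {
public:
    virtual ~Handler() = default;
    virtual Response handle(const Request& request) = 0;
};

// Innermost handler: stands in for GraphQL execution.
class GraphQLHandler : public Handler {
public:
    Response handle(const Request&) override {
        Response r;
        r.body = "{\"data\":{}}";
        return r;
    }
};

// Base class for handlers that wrap another handler.
class Filter : public Handler {
public:
    explicit Filter(std::unique_ptr<Handler> next) : next_(std::move(next)) {
        assert(next_ != nullptr);
    }
protected:
    Handler& next() { return *next_; }
private:
    std::unique_ptr<Handler> next_;
};

// Rejects requests for unknown paths before they reach the API.
class PathFilter : public Filter {
public:
    using Filter::Filter;
    Response handle(const Request& request) override {
        if (request.path != "/graphql") { Response r; r.status = 404; return r; }
        return next().handle(request);
    }
};

// Post-processes the response; a flag stands in for real zstd/gzip encoding.
class CompressionFilter : public Filter {
public:
    using Filter::Filter;
    Response handle(const Request& request) override {
        Response r = next().handle(request);
        r.compressed = !r.body.empty();
        return r;
    }
};

// Outermost first at runtime: compression -> path check -> GraphQL.
std::unique_ptr<Handler> buildChain() {
    std::unique_ptr<Handler> chain = std::make_unique<GraphQLHandler>();
    chain = std::make_unique<PathFilter>(std::move(chain));
    chain = std::make_unique<CompressionFilter>(std::move(chain));
    return chain;
}
```

Adding a new concern, such as request logging, would then mean writing one more filter and inserting it in `buildChain`, leaving the other handlers untouched.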

Putting it all together

The following picture outlines how the different modules described above fit together to constitute our technology stack.

The components depicted in yellow were written by the FactSet Symbology team while the blue boxes highlight the various open-source libraries that we use in our application.

Enhancing API performance and usability

The new GraphQL API allows FactSet applications to make more complex queries in one request, as opposed to making requests to multiple REST endpoints and then stitching together the responses. An example is fetching all securities issued by the subsidiaries of a given company. This reduces both the time spent on network round trips and the complexity of querying.
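The subsidiaries example maps naturally onto a graph traversal performed entirely on the server. The sketch below assumes a hypothetical in-memory model (`Entity`, `EntityGraph`) with made-up identifiers; it is not the Core Symbology schema. It collects the securities of all direct and indirect subsidiaries in a single call, which is the work a client would otherwise spread over several REST round trips.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical graph node: an entity with ownership and issuance edges.
struct Entity {
    std::vector<std::string> subsidiaries;  // ids of directly owned entities
    std::vector<std::string> securities;    // ids of securities it issued
};
using EntityGraph = std::unordered_map<std::string, Entity>;

// Returns the securities issued by every direct or indirect subsidiary of
// `companyId`, sorted for a deterministic result. The company's own
// securities are excluded, and ownership cycles are tolerated.
std::vector<std::string> securitiesOfSubsidiaries(const EntityGraph& graph,
                                                  const std::string& companyId) {
    assert(!companyId.empty());
    std::vector<std::string> result;
    const auto root = graph.find(companyId);
    if (root == graph.end()) return result;

    std::unordered_set<std::string> visited{companyId};
    std::vector<std::string> pending = root->second.subsidiaries;
    while (!pending.empty()) {
        std::string id = std::move(pending.back());
        pending.pop_back();
        if (!visited.insert(id).second) continue;  // already processed
        const auto it = graph.find(id);
        if (it == graph.end()) continue;           // dangling reference
        const Entity& entity = it->second;
        result.insert(result.end(), entity.securities.begin(), entity.securities.end());
        pending.insert(pending.end(), entity.subsidiaries.begin(), entity.subsidiaries.end());
    }
    std::sort(result.begin(), result.end());
    return result;
}
```

With data and traversal in the same process, each hop is an in-memory lookup rather than a network request.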

Some FactSet applications make substantial symbology requests when processing large portfolios. This new technology stack leverages parallel processing to significantly reduce response times for such requests, which leads to more responsive and functional products.

Final thoughts

We have successfully demonstrated that this technology stack meets functional and performance requirements by leveraging modern open-source technologies. In general, this technology stack can be considered for applications that:

  • Require sub-millisecond response times.
  • Work with highly connected datasets (usually modeled as a graph).
  • Want to expose a GraphQL API to serve more complex requests in a performant way.

Author: Himanshu Barthwal (Principal Software Engineer)

Contributors: Nick Gallagher, Michael Kristofik, Jens Maurer, Darin Miller, Philip Ng, and Nick Weatherley

Reviewers: Hannes Schreiber (Lead Sales Engineer) & Josh Gaddy (VP, Director, Developer Advocacy)


FactSet delivers data, analytics, and open technology in a digital platform to help the financial community see more, think bigger, and do their best work.