A common data format
Every day we parse financial data from hundreds of sources, including custodial banks, prime brokerages, fund administrators, private banks, and market data vendors. We receive data in a variety of formats: CSVs, TSVs, text files, custom formats — and there are basically as many representations as sources.
This variety of data formats is a side effect of Finance’s long history with Tech: the industry has been a major force of innovation. Financial institutions have invested a lot to ensure that they have the technology they need to operate. As a result, competing complex systems are now used to represent the same data. Each different system introduces different constraints, and each constraint can have implications for how people think about the inner workings of the financial world.
How can we push all this data into a consistent Financial Graph? The answer lies in defining a simple common format.
Receiving data in a lot of different formats isn’t necessarily a hard problem on its own: the greater challenge is handling the significant differences in how the same data is represented from source to source.
To better illustrate this problem, let’s take a very simple example. We have a single account that already has 100 shares of Alphabet and $15,000 cash. From that account, we buy 20 shares of Apple at $140 per share with a $50 broker fee for the transaction.
The following are just a few hypothetical ways this same, simple set of transactions could be represented at the end of the day:
Date ,Account,Symbol,Name ,Quantity,CurrCode ,Type,Exchange
20170131,1234 ,GOOG ,Google Inc,100 ,USD ,STK ,NASDAQ
20170131,1234 ,USD ,US Dollars,12150 ,USD ,CRY ,
20170131,1234 ,AAPL ,Apple Inc ,20 ,USD ,STK ,NASDAQ
20170131,1234 ,20 ,2800 ,AAPL ,USD ,BY ,20170131
20170131,1234 ,0 ,50 ,AAPL ,USD ,FEE ,20170131
Account Date Symbol Quantity
1234 20170131 GOOG 100
1234 20170131 USD 12150
1234 20170131 AAPL 20
Date Account Symbol Type Quantity Price Fees
20170131 1234 AAPL BUY 20 140 50
Symbol Name CC Cat Date Price
GOOG Alphabet USD STOCK 013117 830
AAPL Apple USD STOCK 013117 140
USD Cash USD CASH 013117 1
As you can see, the files are completely different not only in their format, but also in the way they capture what happened on this day. While we could create mappers to parse this data and directly push it into the Financial Graph, doing so would limit us in our goal to become a platform for the financial world. Anyone who wanted to build applications on top of Addepar would need a deep knowledge of the inner workings of the graph; it would be harder to ensure that the data pushed directly into the platform met the same requirements as data processed by us; and you would need to test on ‘live data’.
Avoiding direct mappers is a challenge, but ultimately also offers an opportunity to standardize the data format for the interchange of accounting data.
By separating out common concepts from the process we use to transform the data, we can validate the data before we try to fit it directly into the graph. We divide these processes to keep our interfaces to the external world as simple as possible while taking in a wide breadth of inputs. Our internal data teams also gain efficiency by dividing ownership between 1) translating the incoming data, and 2) extending our models to new financial entities and transactions.
Our Current Solution
The nature of the financial data we receive dictated our initial, structural decisions.
First of all, we have to account for many fields that might appear in only one source leading to our data being very sparse. When handling sparse data, you generally want to store it in a dynamic way having an entry only for the fields that are populated. Besides, if you use pre-determined columns then you have to account for the union of all possible attributes of the data types that you’re trying to capture and you might end up with hundreds of columns or have complex logic to use a single column for multiple purposes.
Additionally, we had to find a level of abstraction general enough to correctly model the vast majority of common financial situations but also flexible enough to allow us to capture the nuances in the ways financial institutions differentiate their data. We looked at existing solutions, such as OFX or SWIFT, but they didn’t allow us to capture the variety in the financial data that we see on a daily basis.
Finally, we wanted to reduce the cognitive load required to inspect the data. Storing different kinds of data in separate files then switching between them or filtering for the right account introduces a lot of context switching when answering even simple questions. We want to have all the relevant data for a single portfolio in one place.
To accommodate the above considerations, we arrived at a JSON-based representation where we differentiate three main sections: positions, transactions, and securities. It allows us to have an object-oriented representation, solves the problem of sparsity by only populating fields that apply, and allows all the information for a given day to be stored in a single file.
In the first section, we store all the account-specific end of day position information (i.e. what the custodian tells us you have in your investment account at closing). We only store basic information here such as the security identifier, units held, or the market value of the position.
Daily transactions are stored in the transactions section. We can easily capture the variety in different transaction types and only store the information relevant to each. For example, a forex transaction would require specifying two currencies as well as the two cash accounts that were involved while in most cases populating a second currency or cash account field is unnecessary. We keep this section lean by only allowing transaction-specific information here and linking that to other relevant information through internally consistent identifiers.
Finally, in the securities section we store information about all the entities that appear in the positions or transactions and link them through an internally consistent ID. We populate as many attributes as we can capture from the source here, such as security type, name, currency code, ISIN, SEDOL, ticker symbol, identifier for the underlying security for derivatives, etc.
Of course, using a JSON-based format has its drawbacks: in general, people are less used to thinking in tree-based structures than tabular representations. Most people are familiar with Excel, and many people in the field know how to query a MySQL database even if they don’t have a computer science background. Using a compatible format, we could easily concatenate multiple rows to search across a large date range; however, there are fewer programs that allow you to easily view and explore across many JSONs.
To ensure that the data in every file is correct before pushing it into the Financial Graph, we perform a number of validation steps. Using Java and an object-oriented representation guarantees that the right type of data ends up in each field, but we also want to make sure that we don’t accidentally populate certain fields for various types that make no sense. For example, having an option exercise transaction on a cash account would raise quite a few eyebrows and result in many errors further along the road. Similarly, a treasury bond with a maturity date more than 30 years away is assumed to be incorrect; however, this logic can be easily adjusted in our model if longer maturities are introduced by the Department of Treasury.
Additionally, we validate that the data is complete by making sure that every security identifier that appears in a position or a transaction has a corresponding entry in the securities section. We also double check that we’re not missing any necessary information for any entries. For example, cash accounts would always need an associated currency, bonds need a maturity date, or options need an exercise date.
This data format helps us process hundreds of thousands of accounts every day efficiently and provides a simple common language across all our sources. It is also a work in progress and we hire the best people in the industry to expand our format to more data types and ensure that our data is represented accurately. For example, we’ve recently added the ability to capture custom security attributes and support for representing performance summary data for various securities.
We’re also working to represent transactions as transaction effects to match how they are represented in our Financial Graph. For example, buying 50 shares of Apple for $7,000 would cause two transaction effects: one on Apple to increase our number of shares by 50 and another on our USD cash account to decrease it by 7,000. Using this approach allows us to model even more complex transactions than what our current format allows. Right now, a three-way split with cash effects is generally represented in 3–7 different transactions (depending on the source) — imagine being able to capture all those legs in a single transaction!
We’re working hard on improving our data format to capture information more granularly while keeping its accuracy and flexibility. By sharing our work, we hope to start a discussion and work towards widespread adoption of the format — if you want to be part of the conversation, leave a comment or email us at firstname.lastname@example.org!