XBRL at Clarity AI
At Clarity AI, we want to make it easier for investors to allocate their capital towards companies which have a positive impact on society. To achieve this goal, we need data to feed our AI models and web app. Thus, we are always searching for new, good quality data sources. That is where XBRL enters into the equation.
XBRL (eXtensible Business Reporting Language) is a freely available, global framework for exchanging business information. It is XML-based and relies on XML syntax and related XML technologies. The SEC (US Securities and Exchange Commission) requires all American public companies to file their reports using XBRL and makes this information available to the public.
XBRL is a standard that can be tough to understand for beginners. The standard specifications can be found here:
I also found this book by the creators of the standard very useful:
Our goal in this article is to share some of what we learned while incorporating XBRL into our data pipelines. Hopefully, you will find it a good starting point for deepening your understanding of XBRL and for accessing all this publicly available, good quality data.
Let’s first try to understand a specific filing. Let’s take Apple’s annual report from 2020, which can be found here. As you can see, this report has 6 xml documents attached:
- Instance document: This is the document containing the information. The information is divided into concepts. A concept is a small piece of information containing a value, concept name, etc. (we’ll dive further into concepts later on).
- Calculation linkbase: The XML document containing formulas to verify concepts that are calculated from other concepts. For example, GrossProfit is a calculated concept: GrossProfit = Revenue - CostOfGoodsSold. The instance document contains all three concepts, and the calculation linkbase is used to verify that the GrossProfit value reported in the instance document is right.
- Definition linkbase: This linkbase is used to create associations between concepts. We might have a generic concept like performanceMeasures with three children: NetIncome, NetIncomeBeforeTax and NetIncomeAfterTax.
- Presentation linkbase: The presentation linkbase is used as a guide for the creation of user interfaces, rendering, or visualization. It also creates associations between concepts. This linkbase specifies how a user interface should display the concepts so that it follows the conventions and standards of business reporting.
- Label linkbase: The label linkbase provides a human readable label for most common concepts. There cannot be duplicated labels for the same concept. An example would be the concept with name RevenueFromContractWithCustomerExcludingAssessedTax, whose label is NetSales.
- Schema document: The schema document expresses how instance documents and taxonomies are to be built. The schema document inherits and extends other taxonomies. Two compulsory taxonomies that companies need to use when reporting to the SEC are dei (used for company-related information such as address or trading symbol) and us-gaap (used for financial reporting). Companies can also create their own taxonomies to report concepts which are unique to their business. As an example, Apple has its own taxonomy called aapl.
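To make the role of the calculation linkbase more concrete, here is a minimal sketch of the kind of check it enables. The values, the dictionary layout and the check_calculations helper are all hypothetical and exist only for illustration; a real calculation linkbase expresses these relationships as XML arcs with a weight, which is what the +1 and -1 below stand in for.

```python
from decimal import Decimal

# Hypothetical reported values (illustrative only, not taken from a real filing).
reported = {
    "Revenue": Decimal("1000"),
    "CostOfGoodsSold": Decimal("600"),
    "GrossProfit": Decimal("400"),
}

# One calculation relationship: parent concept -> list of (child concept, weight).
# A weight of +1 adds the child to the total, a weight of -1 subtracts it.
calculations = {
    "GrossProfit": [("Revenue", Decimal("1")), ("CostOfGoodsSold", Decimal("-1"))],
}


def check_calculations(facts, calcs):
    """Return {parent: (reported, computed)} for every parent that does not add up."""
    mismatches = {}
    for parent, children in calcs.items():
        computed = sum(
            (facts[child] * weight for child, weight in children), Decimal("0")
        )
        if computed != facts[parent]:
            mismatches[parent] = (facts[parent], computed)
    return mismatches


print(check_calculations(reported, calculations))  # prints {}
```

An empty result means every calculated concept matches the sum of its weighted children. Real filings report rounded values, so a production check would likely need a tolerance rather than strict equality.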
Warning
We are mainly going to focus on understanding the instance document. However, to parse XBRL well you should understand all six XML files. Explaining all of them is beyond the scope of this article, and if your goal is to incorporate XBRL data into your systems, we suggest that you dive into the official standards after reading it. Some other good reads are:
Namespaces and schemas
Let’s see a few examples of how all this data is related. The instance document starts by defining the namespaces it will use:
Bear in mind these are not links, they are namespaces. To be able to download the taxonomy related to each namespace, we need to go to the schemaRef or Schema document, which imports the taxonomies:
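As a rough illustration of the difference between namespaces and the schemaRef, the sketch below reads both from a tiny, made-up instance document using only Python's standard library. The document content, the file name example-20200926.xsd and the two helper functions are assumptions for the example, not an excerpt from a real filing.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed instance document (illustrative only).
INSTANCE = b"""<?xml version="1.0" encoding="UTF-8"?>
<xbrli:xbrl
    xmlns:xbrli="http://www.xbrl.org/2003/instance"
    xmlns:link="http://www.xbrl.org/2003/linkbase"
    xmlns:xlink="http://www.w3.org/1999/xlink"
    xmlns:us-gaap="http://fasb.org/us-gaap/2020-01-31">
  <link:schemaRef xlink:type="simple" xlink:href="example-20200926.xsd"/>
</xbrli:xbrl>
"""

LINK_NS = "http://www.xbrl.org/2003/linkbase"
XLINK_NS = "http://www.w3.org/1999/xlink"


def read_namespaces(xml_bytes):
    """Map each declared prefix to its namespace URI (an identifier, not a download link)."""
    namespaces = {}
    for _event, (prefix, uri) in ET.iterparse(io.BytesIO(xml_bytes), events=("start-ns",)):
        namespaces[prefix] = uri
    return namespaces


def read_schema_refs(xml_bytes):
    """Return the href of every schemaRef, i.e. where the schema document actually lives."""
    root = ET.fromstring(xml_bytes)
    return [el.get(f"{{{XLINK_NS}}}href") for el in root.iter(f"{{{LINK_NS}}}schemaRef")]


print(read_namespaces(INSTANCE)["us-gaap"])
print(read_schema_refs(INSTANCE))
```

The namespace URI only identifies the taxonomy; the schemaRef href (usually relative to the filing's location) is the starting point for actually fetching the schema and the taxonomies it imports.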
Concepts and data
Let’s come back to the instance document. Just after defining the namespaces, we have context items and concepts:
Each concept can only have one context, while contexts can have many concepts associated with them. We can later on link the concept with the rest of the xml files to get more information regarding a specific concept.
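The relationship between concepts and contexts can be sketched with the standard library alone. The instance document below is made up and trimmed (units and most attributes are omitted), and parse_contexts and parse_facts are hypothetical helpers, not part of any library. The idea is simply to resolve each concept's contextRef and flatten the result into one row per reported value.

```python
import xml.etree.ElementTree as ET

XBRLI = "http://www.xbrl.org/2003/instance"

# Hypothetical, trimmed instance document (values are illustrative only).
INSTANCE = b"""<xbrli:xbrl
    xmlns:xbrli="http://www.xbrl.org/2003/instance"
    xmlns:us-gaap="http://fasb.org/us-gaap/2020-01-31">
  <xbrli:context id="c1">
    <xbrli:entity>
      <xbrli:identifier scheme="http://www.sec.gov/CIK">0000000000</xbrli:identifier>
    </xbrli:entity>
    <xbrli:period>
      <xbrli:startDate>2019-09-29</xbrli:startDate>
      <xbrli:endDate>2020-09-26</xbrli:endDate>
    </xbrli:period>
  </xbrli:context>
  <xbrli:context id="c2">
    <xbrli:entity>
      <xbrli:identifier scheme="http://www.sec.gov/CIK">0000000000</xbrli:identifier>
    </xbrli:entity>
    <xbrli:period>
      <xbrli:instant>2020-09-26</xbrli:instant>
    </xbrli:period>
  </xbrli:context>
  <us-gaap:GrossProfit contextRef="c1" decimals="-6">400</us-gaap:GrossProfit>
  <us-gaap:Assets contextRef="c2" decimals="-6">900</us-gaap:Assets>
</xbrli:xbrl>"""


def parse_contexts(root):
    """Map each context id to its period (duration or instant)."""
    contexts = {}
    for ctx in root.iter(f"{{{XBRLI}}}context"):
        period = ctx.find(f"{{{XBRLI}}}period")
        contexts[ctx.get("id")] = {
            "start": period.findtext(f"{{{XBRLI}}}startDate"),
            "end": period.findtext(f"{{{XBRLI}}}endDate"),
            "instant": period.findtext(f"{{{XBRLI}}}instant"),
        }
    return contexts


def parse_facts(root):
    """Return one flat row per concept, joined with the context it references."""
    contexts = parse_contexts(root)
    rows = []
    for el in root:
        ctx_id = el.get("contextRef")
        if ctx_id is None:  # contexts, units, schemaRef, ... carry no contextRef
            continue
        namespace, _, name = el.tag[1:].partition("}")
        rows.append(
            {"concept": name, "namespace": namespace, "value": el.text, **contexts[ctx_id]}
        )
    return rows


for row in parse_facts(ET.fromstring(INSTANCE)):
    print(row)
```

Because every row is a plain dictionary with the same keys, a list like this can be handed to pandas.DataFrame(rows) to obtain a tabular view.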
So far so good. We now understand the structure of an XBRL submission. What now? We would like to transform all this data, scattered across different XML files, into a pandas dataframe we can work with. At the end of the day, our goal is to generate a pandas dataframe which looks like this:
To achieve this goal we have used an open source library to which we have contributed: https://github.com/manusimidt/py-xbrl. This library simplifies the parsing of XBRL submissions substantially.
The tricky part
However, there is one tricky part we have not yet spoken about. The most difficult part of parsing an XBRL submission, and the one where most people get stuck (even the SEC itself has not been able to do it yet), is handling dimensional concepts. Dimensional concepts can be better understood using an example:
Apple divides its Net Sales into 5 different business lines. This information is shown in the concept as follows:
which references the following context:
The context has a segment element which specifies that the concepts which reference this context are part of a dimension. This gets even more complicated as Apple then divides each Revenue not only by business but by continent too.
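One way to keep dimensional values apart from their parent is to read the explicit members declared inside each context's segment and attach them to every concept that references that context. The sketch below does this for a made-up document; the ex prefix, the concept and member names, and both helper functions are hypothetical, and periods and units are left out for brevity. It only handles explicit dimensions (explicitMember), not typed ones.

```python
import xml.etree.ElementTree as ET

XBRLI = "http://www.xbrl.org/2003/instance"
XBRLDI = "http://xbrl.org/2006/xbrldi"

# Hypothetical, trimmed instance document (names and values are illustrative only).
INSTANCE = b"""<xbrli:xbrl
    xmlns:xbrli="http://www.xbrl.org/2003/instance"
    xmlns:xbrldi="http://xbrl.org/2006/xbrldi"
    xmlns:ex="http://example.com/taxonomy">
  <xbrli:context id="total">
    <xbrli:entity>
      <xbrli:identifier scheme="http://example.com/ids">0000000000</xbrli:identifier>
    </xbrli:entity>
  </xbrli:context>
  <xbrli:context id="phones">
    <xbrli:entity>
      <xbrli:identifier scheme="http://example.com/ids">0000000000</xbrli:identifier>
      <xbrli:segment>
        <xbrldi:explicitMember dimension="ex:ProductAxis">ex:PhoneMember</xbrldi:explicitMember>
      </xbrli:segment>
    </xbrli:entity>
  </xbrli:context>
  <ex:NetSales contextRef="total">100</ex:NetSales>
  <ex:NetSales contextRef="phones">60</ex:NetSales>
</xbrli:xbrl>"""


def context_dimensions(root):
    """Map each context id to its {dimension: member} pairs (empty if non-dimensional)."""
    dims = {}
    for ctx in root.iter(f"{{{XBRLI}}}context"):
        members = ctx.iter(f"{{{XBRLDI}}}explicitMember")
        dims[ctx.get("id")] = {m.get("dimension"): (m.text or "").strip() for m in members}
    return dims


def facts_with_dimensions(root):
    """Return one row per concept, flagged as parent when its context has no dimensions."""
    dims = context_dimensions(root)
    rows = []
    for el in root:
        ctx_id = el.get("contextRef")
        if ctx_id is None:  # skip contexts and other non-fact elements
            continue
        rows.append(
            {
                "concept": el.tag.rpartition("}")[2],
                "value": el.text,
                "dimensions": dims[ctx_id],
                "is_parent": not dims[ctx_id],
            }
        )
    return rows


for row in facts_with_dimensions(ET.fromstring(INSTANCE)):
    print(row)
```

With the dimensions kept next to each value, the same concept name can appear many times without ambiguity: the row with no dimensions is the total, and every other row says exactly which axis and member it belongs to.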
Verify XBRL data quality
You can verify that the data you get after parsing a submission is right. To do so you can either use the SEC’s interactive data or look it up on Yahoo Finance or any other freely available data source.
The SEC offers free XBRL-generated datasets every quarter here. I imagine their dataset generation process uses some of the same ideas as the open source library we have used. However, their datasets fail to indicate whether a concept is part of a dimension (a child) or not. For example, they provide 6 different values for Net Sales for Apple’s 2020 report, without specifying the segments anywhere. This makes most of the data provided in these datasets useless for many use cases, as there is no way to tell which of the 6 values is the parent.
At this moment, if you want to access all this data you can either buy it from data vendors or parse all XBRL filings yourself, for which we, once again, suggest using this open source library:
Thank you
At Clarity AI we want to thank Manuel Schmidt, who has not only created a great XBRL parsing library, but also helped us understand the hidden details of XBRL. Without him, we would have struggled a lot more to accomplish our goals.