Unlocking the world’s financial data
In this post I want to update you on our PDF crawling and extraction efforts, the first step towards hosting data not only from US-based companies but from any listed company around the world.
How we gather data currently
In case you don’t know yet, we currently get all the fundamental data on SimFin from the SEC database, which offers fundamental data for US companies in a machine-readable format called XBRL. While it’s great that the SEC is offering this, the XBRL format still has a lot of problems more than ten years after its launch. This is not due to XBRL per se (which is just a way of structuring information), but rather due to companies not reporting things correctly in the XBRL format.
In fact, we observe time and time again inconsistencies between the data reported in the XBRL filings and the actual reports (which are in HTML on the SEC website but look very much like a PDF, including page numbers etc.), where the XBRL is simply incorrect and the actual report is fine. This led us to think that it’s not worth improving our current XBRL crawler, and that we should rather focus on a task that’s much more difficult than parsing XBRL: parsing the actual annual and quarterly reports. These are the PDFs that companies around the world publish in quarterly intervals.
Why XBRL is still vital for us
While reading PDFs poses a lot of challenges, there are by now many free software packages available that make this task much easier and actually quite doable. In fact, the results we are getting so far are so promising that we are confident we can release a first version of our PDF extraction engine before the end of this year.
The interesting thing though is that we couldn’t have come as far as we are right now without the help of XBRL — our new crawler relies heavily on machine learning in order to identify which PDF from a company website is relevant and where the information we are looking for is located inside the PDF. The dataset for these machine learning models was built using the data that is on SimFin already, and this data comes from XBRL. Basically we compare the structured data on SimFin with the unstructured data in the actual reports from the companies in the SimFin database to automatically build big datasets for supervised machine learning models.
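To make the idea of automatic labelling more concrete, here is a minimal Python sketch of how known figures could be matched against the text of report pages to produce training labels. It is an illustration under assumptions, not the actual SimFin pipeline: the function names, the `min_hits` threshold, and the sample figures are all hypothetical, and the page text is assumed to have been extracted from the PDF already.

```python
import re

# Matches digit runs that may contain thousands separators, e.g. "1,250".
NUMBER_TOKEN = re.compile(r"\d[\d,]*")


def number_variants(value):
    """Return the ways a known figure might be printed in a report.

    Reports often state figures in units, thousands or millions, so the
    same value can appear as 1250000000, 1250000 or 1250 (hypothetical helper).
    """
    variants = set()
    value = abs(value)
    for divisor in (1, 1_000, 1_000_000):
        if value % divisor == 0:
            scaled = str(value // divisor)
            # Skip very short numbers: they match page numbers, years, etc.
            if len(scaled) >= 3:
                variants.add(scaled)
    return variants


def page_tokens(text):
    """Return all numbers on a page with thousands separators removed."""
    return {tok.replace(",", "") for tok in NUMBER_TOKEN.findall(text)}


def label_pages(pages, known_values, min_hits=2):
    """Label a page 1 if it contains at least `min_hits` known figures.

    `pages` is a list of page texts, `known_values` maps a field name to
    the structured value already in the database (assumed integer units).
    """
    variants = {name: number_variants(v) for name, v in known_values.items()}
    examples = []
    for text in pages:
        tokens = page_tokens(text)
        hits = sorted(name for name, forms in variants.items() if forms & tokens)
        examples.append((text, int(len(hits) >= min_hits), hits))
    return examples


# Made-up figures for illustration only.
known = {"revenue": 1_250_000_000, "net_income": 310_000_000}
pages = [
    "Annual report 2017. Dear shareholders, it was a strong year.",
    "Income statement (in millions) Revenue 1,250 Net income 310",
]
labelled = label_pages(pages, known)
```

In this sketch the second page is labelled as relevant because both known figures appear on it, while the first page is not. Requiring several matches per page is one simple way to reduce false positives from numbers that coincide by chance.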
Introducing the SimFin PDF crawler
All this wouldn’t have been possible without all the amazing open source software packages we used to come this far. So besides contributing to the “open data” movement with our data on SimFin, we are now also starting to release more software as open source. The first release is our crawler, which collects all PDFs from a given company website (the starting point for the PDF extraction task). You can find it here if you are curious:
This crawler was mainly built by Remy Gwaramadze and Joseph Albers, two developers who wanted to use their free time to support SimFin’s mission of making financial data more accessible. If you also want to help us code the future of financial data, feel free to reach out to firstname.lastname@example.org.
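For readers wondering what the core step of collecting PDFs from a company website looks like, here is a minimal Python sketch using only the standard library. It is not the code of the SimFin crawler: the class name, the function name, and the example URLs are hypothetical. It only shows how links on a single, already downloaded page could be split into PDF links and same-site pages to visit next.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collect PDF links and same-site page links from one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_links = []
        self.page_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page they were found on.
        url = urljoin(self.base_url, href)
        parsed = urlparse(url)
        if parsed.path.lower().endswith(".pdf"):
            self.pdf_links.append(url)
        elif parsed.netloc == urlparse(self.base_url).netloc:
            # Only follow links that stay on the company website.
            self.page_links.append(url)


def extract_links(html, base_url):
    """Return (pdf_links, page_links) found in `html`."""
    parser = PdfLinkParser(base_url)
    parser.feed(html)
    return parser.pdf_links, parser.page_links


# Hypothetical page content and URLs for illustration.
html = (
    '<a href="/reports/annual-2017.pdf">Annual report</a>'
    '<a href="/investors">Investor relations</a>'
    '<a href="https://other.example/news">External news</a>'
)
pdfs, next_pages = extract_links(html, "https://company.example/ir/")
```

A full crawler would repeat this step for every page in `next_pages`, keep track of pages it has already visited, and download the collected PDFs.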