Topology of Business: A Knowledge Graph of Federal Tax Service
One of the first knowledge graphs (KG) we at DataFabric built (and continue to evolve) was the knowledge graph of the Federal Tax Service of the Russian Federation (FTS), which accumulates a tremendous amount of data about Russian companies and individuals. In this story I dive into the details of the steps we took, and continue to take, to maintain the knowledge graph.
What is the Federal Tax Service’s data about?
Here and below, the numbers may be inexact: during the generation of the knowledge graph some data may be lost because of problems (syntactic errors, etc.) in the original data.
The knowledge graph contains information about 10,325,245 companies, including their:
- names, IDs, status (active, inactive, etc.),
- registered address, commercial activities,
- stockholders, signatories and managing companies or people,
- licenses, etc.
Information about 13,562,875 sole proprietorships, including their:
- names, IDs, status,
- commercial activities,
- owner, etc.
The graph also contains information about 26,739,725 people, who may act as owners, stockholders, signatories or managers of companies or sole proprietorships. The following attributes are known for people:
- names (in Russian and sometimes English),
- IDs, gender, etc.
Finally, there are other entities: 28,690 government organizations and 558 funds.
The other large part of the knowledge graph consists of the registered addresses of companies and the buildings extracted from those addresses, deduplicated and linked to the hierarchy of regions, cities, streets, etc. The numbers:
- registered addresses — 2,314,147,
- buildings — 11,690,604.
Read our other story about the technologies we use to build knowledge graphs and the terminology we employ.
How is the original data published?
Unfortunately, the data is not open: even though the FTS is a public agency, you have to buy the data to use it. We, like a number of other companies, buy it and provide end-user services on top of it.
Challenges we’ve faced
We've faced several challenges while working on the knowledge graph. These challenges aren't unique to this particular graph, though you may not face all of them when working on a KG of your own.
Big Data as it is. The knowledge graph has grown from 1 billion to more than 6 billion triples, which is already quite a big number. It forces Blazegraph, the triplestore we currently use, to work at the limit of its capabilities. We use the single-node edition of Blazegraph 2.1.4 on a machine with 4 CPUs, 26 GB of memory and a 1.5 TB SSD disk. An import of the whole knowledge graph in raw RDF (N-Triples) takes several days on a machine with 64 GB of memory, and some queries may fail to complete within a reasonable time. In addition, any processing involving all the data requires a cluster of dozens of machines, but that isn't a problem if you use Apache Beam and Google Dataflow as we do :)
As for the problem with the triplestore, we're experimenting with Apache Rya, an RDF triplestore running on top of Apache Accumulo and HDFS. So far the results are promising; follow us for more details in upcoming stories.
Dirty and broken data. The original data is entered manually, through all sorts of forms filled in by FTS employees and company owners, so there may be dirty values, such as typos in names and share sizes, as well as different names for the same things, e.g. cities and streets. Apart from that, there may be syntactic errors in the XML files and ZIP archives in which the data is published.
For now we fix only the share sizes and percentages by recovering them from other related information, e.g. the total share size, the percentage from the size, or the size from the percentage. Typos in names are fixed semi-automatically by creating curated catalogs of cities, street names, etc. Entries with syntactic errors are skipped, and as many valid entries as possible are read from the XML files and ZIP archives.
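The arithmetic recovery of shares can be sketched roughly as follows. This is a minimal illustration, not our actual pipeline code (which is written in Kotlin); the field names are made up:

```python
def recover_share(size=None, percent=None, total=None):
    """Recover a missing share size or percentage from the other two
    values when they are present.  A sketch of the idea: in the real
    pipelines the values come from POJO fields with version metadata."""
    if size is None and percent is not None and total is not None:
        size = total * percent / 100
    if percent is None and size is not None and total is not None and total != 0:
        percent = size / total * 100
    return size, percent
```

When neither the size nor the total is known, nothing can be recovered and the value stays missing.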
Multiple URIs for the same entity. Triples describing an entity may be generated in parallel pipelines: e.g. a person's full name, gender and VAT number are generated in one pipeline, but an ownership relation between a company and that person in another. To make these triples describe the same person, the person must get the same URI in both pipelines; in other words, the URI should be generated deterministically. Such stable URIs are usually derived from existing stable IDs, e.g. the VAT number of a person, the registration ID of a company and so on. Unfortunately, there are situations when there is no stable ID to use, for example because of a gap in the data or a mistake in the ID itself. In such situations we end up with a person or a company that has multiple URIs, and we can't say for sure that these URIs denote the same entity.
Currently we don't do anything to fix this, but there are several approaches for dealing with such entities. For now, I can only suggest looking at the existing research on this topic, e.g. A. Hogan et al., “Scalable and distributed methods for entity matching, consolidation and disambiguation over linked data corpora”.
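The deterministic-URI idea can be sketched like this. The namespace and hashing scheme below are illustrative, not our production ones; the fallback to a random UUID shows exactly how duplicate URIs for one entity come to exist:

```python
import hashlib
import uuid

BASE = "http://example.org/entity/"  # illustrative namespace

def person_uri(vat_number=None):
    """Derive a stable URI from a stable ID (here, a VAT number).
    Without a stable ID we can only mint a random URI, so the same
    person processed twice ends up with two different URIs."""
    if vat_number:
        digest = hashlib.sha1(vat_number.encode("utf-8")).hexdigest()
        return BASE + digest
    return BASE + uuid.uuid4().hex
```

Any parallel pipeline that sees the same VAT number produces the same URI, which is what links the triples together.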
Historical data and everyday updates. The original data is changed on an everyday basis. However, the historical data is important for due diligence, BI analytics, etc. So the challenge is to be able to store the historical data and apply the everyday updates to a running triplestore without any downtime.
To deal with it, we developed an ontological model based on RDF reification statements that allows us to keep both current and historical data and a set of pipelines which incrementally apply incoming changes to a running triplestore. More details in the next sections.
The schema of the knowledge graph
The schema of the graph is based on the FIBO ontologies, their extensions specific to the Russian jurisdiction, and a number of domain-independent RDFS and OWL ontologies.
FIBO is huge, and to ease its usage it's divided into modules, each of which covers a specific topic, e.g. corporations, sole proprietorships, loans, etc. I'm going to describe a few examples of entities, because the full schema is too big to be presented in detail.
A company. On the screenshot below you can see how Yandex LLC is represented in the user interface. Only some of the relations are shown, but we can see that it has an owner and 21 subsidiaries, that it's registered in Moscow, and that the CEO is Бунина Е.И. (E. Bunina). We can go even further by opening other relations; you can do it yourself, just open the link.
Now let's look at how the company is represented in RDF with FIBO. In the snippet below, you can see three relations. Two of them are current relations, which hold at this moment, and the other one is historical and no longer holds (the signatory relation).
Each relation has a reification statement, e.g. line 6 is a relation and lines 8–11 are a reification statement, which conveys the version number (line 12) and version date (line 13) of the relation. The version date and number come from the original data, so it's possible to trace any relation back to the source. The difference between a current and a historical relation is that for a current one we have both the relation and its reification statement, whereas for a historical one we only have the reification statement. Look at lines 17 and 23–26 for an example of a current relation and lines 37–40 for a historical one. This is a really simple model, isn't it?
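The shape of the model can be sketched in Turtle. The URIs and property names below are illustrative, not our actual vocabulary; what matters is the pattern of a relation plus its reification statement:

```turtle
@prefix ex:  <http://example.org/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# A current relation: both the triple itself and its reification statement.
ex:company1 ex:hasSignatory ex:person1 .
[] a rdf:Statement ;
   rdf:subject      ex:company1 ;
   rdf:predicate    ex:hasSignatory ;
   rdf:object       ex:person1 ;
   ex:versionNumber "42" ;
   ex:versionDate   "2019-05-01"^^xsd:date .

# A historical relation: only the reification statement remains.
[] a rdf:Statement ;
   rdf:subject      ex:company1 ;
   rdf:predicate    ex:hasSignatory ;
   rdf:object       ex:person2 ;
   ex:versionNumber "17" ;
   ex:versionDate   "2017-03-12"^^xsd:date .
```

Queries over current data touch only the plain triples, while the full history stays available through the reification statements.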
A person. On this screenshot we see a person who owns 10 companies and is a signatory at another 2 companies.
Below is the information about the same person, but in RDF. Here only two relations are shown: the full name and the relation with a company in which he owns 90% of the shares, equal to 61,323,318 rubles.
Hopefully, the snippets above have given you an idea of the schema used in the knowledge graph.
FIBO extensions. Obviously, FIBO ontologies don’t have everything you may need, because every jurisdiction is a bit different, so we added several extensions to it. You can find all the extensions at our repository.
- Types of commercial activities (a.k.a. OKVEDs).
- More specific properties, e.g. to denote a VAT number.
- Company statuses. In total 25 statuses.
- Types of signatories, e.g. CEO, Chief Accountant, etc. In total 10 types.
An overview of the ETL pipelines
The knowledge graph is generated with a set of ETL pipelines which we developed from scratch using Kotlin, Apache Beam (GCloud Dataflow) and Apache Jena. The pipelines are orchestrated by Apache NiFi. Read more about using Apache NiFi with Apache Beam in our previous story.
The pipelines are divided into two sets:
- full loading pipelines — they process, at once, all the original data we've collected over several years. They're used to generate the whole graph from zero to 6 billion triples.
- rolling update pipelines — they take as input only some snapshots of the original data, calculate the difference between the current data in the knowledge graph and the snapshots, and apply the detected changes. These pipelines are executed to apply monthly or daily updates.
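At its core, the rolling update boils down to a set difference between what the store currently holds and what the new snapshot says. A minimal sketch, with triples as plain tuples rather than versioned RDF:

```python
def diff_triples(current, snapshot):
    """Compute the changes that bring the store in line with a new
    snapshot.  In the real pipelines the comparison also takes the
    version metadata of the reification statements into account."""
    current, snapshot = set(current), set(snapshot)
    to_insert = snapshot - current   # triples that newly appeared
    to_delete = current - snapshot   # triples no longer in the source
    return to_insert, to_delete
```

Only the two resulting sets need to be applied to the running triplestore, which is what makes updates possible without downtime.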
In the next chapter I describe the full loading pipelines in much more detail, but I leave the rolling update pipelines for another story on Medium, so don’t forget to follow us :)
The full loading ETL pipelines
The input data is ZIP archives containing large XML files. Each XML file contains up to 10k snapshots of data about companies or sole proprietorships. A snapshot is created in the original data each time something changes, e.g. the share of one of the owners. The data which didn't change is copied from the previous snapshot as is.
There are two types of snapshots, for a company and for a sole proprietorship; other entities such as people, government organizations, etc. don't have their own snapshots, but are extracted from these.
In the pipeline above, the snapshots are converted to JSON objects, and at the same time the pipeline filters out snapshots with syntactic errors. The JSON objects are converted to POJOs, one POJO per snapshot. Since there may be more than one snapshot per company, the POJOs are merged and duplicated fields are discarded. Two fields are duplicates if they have the same version date and number, so at this step we don't look at the values.
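The merge step can be sketched as follows. The field representation is made up for illustration (our POJOs are Kotlin classes); the point is that the deduplication key is the version metadata, not the value:

```python
def merge_snapshots(snapshots):
    """Merge the field lists of several snapshots of one company.
    Two fields with the same name, version date and version number are
    considered duplicates, regardless of their values."""
    merged = {}
    for snapshot in snapshots:
        for field in snapshot:
            key = (field["name"], field["version_date"], field["version_number"])
            merged.setdefault(key, field)  # keep the first copy, drop duplicates
    return list(merged.values())
```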
Now having these POJOs for companies and sole proprietorships, we’re able to generate POJOs for the other entities.
The next set of pipelines is responsible for the generation of RDF out of the POJOs. Each field in a POJO is unique in terms of object structure, so for each field we have a separate processor written in Kotlin which maps the field to the corresponding triples. Below is the rest of the pipelines, which transform POJOs to RDF.
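The per-field processor idea looks roughly like this. Everything here is hypothetical: the processor names, the property URI and the tuple representation of triples are illustrations, not our actual Kotlin code:

```python
def name_to_triples(uri, field):
    """A hypothetical processor for a company's "name" field."""
    return [(uri, "http://example.org/hasLegalName", field["value"])]

# One processor per POJO field; in reality there are many more entries.
PROCESSORS = {"name": name_to_triples}

def pojo_to_triples(uri, fields):
    """Dispatch each field of a POJO to its dedicated processor and
    collect the resulting triples."""
    triples = []
    for field in fields:
        processor = PROCESSORS.get(field["name"])
        if processor:
            triples.extend(processor(uri, field))
    return triples
```

Keeping one small processor per field makes each mapping easy to test and change in isolation.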
The pipelines also do another round of deduplication, but this time the fields are deduplicated based on their actual values, and the latest value is marked to ease the further generation of RDF.
As you may have noticed, there is an intermediate ontology which is used before the FIBO ontologies. This is because the FTS ontology (the intermediate one) was used in the first version of the knowledge graph instead of FIBO. At that moment we decided it'd be easier to write mapping rules between the ontologies than to rewrite the whole pipeline. The mapping rules are simple SPARQL CONSTRUCT queries that are executed by Apache Jena. So, as the output of the pipelines, we get RDF that is ready to be bulk-imported into a triplestore.
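Such a mapping rule can be sketched as a CONSTRUCT query. The prefixes and property names below are illustrative placeholders, not the actual FTS or FIBO URIs:

```sparql
PREFIX fts:  <http://example.org/fts/>
PREFIX fibo: <http://example.org/fibo/>

# Rewrite an ownership relation from the intermediate FTS ontology
# into the FIBO-based vocabulary.
CONSTRUCT {
  ?person fibo:isEquityHolderOf ?company .
}
WHERE {
  ?company fts:owner ?person .
}
```

Each rule rewrites one pattern; running the full set over the intermediate RDF yields the FIBO-shaped graph.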
The second set of pipelines is run for each entity type separately: e.g. one works only with companies, another only with people. The reason is that data about people exists in both company and sole proprietorship snapshots, and we want to reuse the code. The linkage between pipelines at the level of relations between entities (e.g. a person owns a company, where the RDF for the person is generated in one pipeline but the RDF for the ownership in another) is guaranteed by the use of stable URIs for such entities.
And… lessons we learned
Start with the simplest schema, but be ready to change it. At the beginning we used an in-house ontology as the schema, but then the requirements changed and we needed to migrate. Be ready for that: ontologies aren't set in stone, they change.
Validation is a must-have. It's not possible, or even desirable, to cover the transformation pipelines with unit tests that would guarantee the validity of the generated RDF. Look instead at automated ways to run validation tests based on declarative constraint rules, e.g. SHACL or OWL axioms.
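As an example of the SHACL approach, a shape like the following could check that every company carries exactly one well-formed registration ID. The class and property URIs are illustrative, not the ones from our schema:

```turtle
@prefix sh: <http://www.w3.org/ns/shacl#> .
@prefix ex: <http://example.org/> .

ex:CompanyShape a sh:NodeShape ;
    sh:targetClass ex:Company ;
    sh:property [
        sh:path ex:registrationId ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
        sh:pattern "^[0-9]{13}$" ;   # Russian OGRNs are 13 digits
    ] .
```

A SHACL engine run over the generated RDF then reports every company violating the shape, instead of the violations silently landing in the triplestore.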
There is room for improvement :) We've spent a lot of time developing these pipelines, and we've got an idea of how such ETLs could be improved. We even published a short survey of existing tools for non-RDF to RDF transformations. Having such hands-on experience, and knowing that there is no tool which could help us, we decided to develop one ourselves. If you're interested in our recent developments, contact us and we'll be glad to share more details.