The vision of Fairhair.ai is to harness the power of diverse online sources by using 5 advanced AI systems in combination to surface, structure, connect, understand, reason and derive actionable insights. In the Fairhair.ai Knowledge graph team we are connecting, cleaning and distilling all this information from different sources into one coherent store and view.
All these data points tells us something different about each entity, sometimes complementary and sometimes conflicting. Online sources vary in both coherence, reliability and accuracy.
Our customers wants to answer questions such as “Which of Pepsi Co competitors in the beverage segment has done the most acquisitions?”
To answer this question we need to aggregate all information pertaining to Pepsi Co and its competitors. This information is available in both unstructured format such Wikipedia, news articles, corporate websites; and structured sources i.e DBPedia and Wikidata. How do we fuse information from all these sources to create one source of knowledge? .
This problem is often called the veracity problem or truth discovery: which information, from a range of conflicting sources, is the most accurate information .
In this post I’ll walk you through one possible approach to this problem for structured data integration and provide coding examples and explanations.
TruthFinder  is an algorithm that tries to solve the veracity problem in an iterative process, similar to the HITS algorithm or PageRank. However in comparison to the HITS algorithm, which tells you the odds of a random walker/surfer ending up on the source, TruthFinder emphasizes correctness of facts and not the quantity of facts provided from that source[1, 7].
Representing the problem
The problem is best represented visually as in the paper, using a graph structure:
Where w is a source and f is a fact or attribute and o is a book or object. Different sources provide different facts and they provide facts for different objects. All of these facts are of the same type, they all provide authors for books. Here’s our graph with some actual author data filled in.
TruthFinder algorithm assumptions
There are some underlying assumptions/heuristics that the TruthFinder algorithm makes use of:
- There is only one true fact for a property of an object— sometimes referred to as the “single-truth assumption”  — a simplification we’ll use.
- (A) true fact is the same or similar on different websites — true facts are often similar across sources.
- False facts on different websites are less likely to be the same or similar— false facts are often more random than true facts.
- In a certain domain, a website that provides mostly true facts for many objects will likely provide true facts for other objects— similar to the Authority Hub assumption: a good hub represents a page that points to many other pages, and a good authority represents a page that is linked to by many different hubs.
You may well be thinking these are all very general assumptions and not always true. However, they will serve as a good enough hypothesis for now. Condensed these assumptions become:
- A fact has high confidence if it is provided by (many) trustworthy websites.
- A website is trustworthy if it provides many facts with high confidence.
We’ll walk through the 2 major steps in the TruthFinder algorithm  :
- Compute fact confidence score
- Compute source trustworthiness score
Repeat this until there is no change in trustworthiness or fact confidence.
Exploring the data
Let’s start by exploring the data by loading the data into Pandas and displaying the first 10 rows.
Each book has a unique ISBN identifier, which we’ll use to identify each book also called “object”. Each datapoint has a source, a title and an authors attribute field. The authors field is the fact or attribute that we are trying to reconcile and find the most accurate value for; which author(s) are the correct ones for a specific book.
Additionally checking for row duplicates, with -
reveals a substantial number of rows that are identical. Let’s drop all of the duplicates so that our results are not skewed.
data = data.drop_duplicates()
The TruthFinder steps explained
Let’s continue defining some constants and initialize our sources to make sure all w (sources) have an equal trustworthiness of 0.9 to start with .
we continue to define the general structure of our iterative process using the basic assumptions/heuristics of the TruthFinder algorithm. Let’s define the iterative process we described so far in Python, defining each step as a separate function for easier implementation.
Don’t worry I’ll walk you through each step,
Step 1. Compute fact confidence score
There are 3 sub-steps to compute the facts confidence using TruthFinder.
- Compute the attribute/facts confidence by summing all source confidence scores from sources providing that exact fact
2. Adjust the confidence score by adding to the score if there are implicating similar facts, but subtract from the score if there are conflicting facts.
3. Handle negative probabilities of facts and the independence assumption of sources.
Adding a dampening factor gamma to the sigmoid function.
Let’s dive into more detail of the 3 sub steps of the fact confidence score calculation.
- Compute initial fact confidence by summating the log of the trustworthiness of the sources providing that fact.
In this step we want to calculate the fact confidence using the sum of the source trustworthiness for all sources providing that specific fact.
So in our initial instance two sources provide George Luger and the initial state both have 0.9 this would be calculated as follows
First, we want to get all of the sources that agree on a particular fact. Two sources claims that the author of “Art of computer programming” is “Donald Knuth”. One source claims it is “D. Knuth”. Sum the scores of all sources providing identical facts. Remember at this stage all sources have the same trustworthiness, thus “Donald Knuth” will get double the score of “D. Knuth”.
σ(f) is what is called the confidence score, it is the negative log probability of a fact being incorrect by summing τ(w), the confidence scores of all sources w in W(F). τ(w) is the negative log probability of source trustworthiness.
Why are these defined like this? Generally computing the fact confidence is defined as follows.
this could lead to underflow in the multiplication (∏ is the symbol for product of a sequence ). Another reason is given in the paper “suppose there are two websites w1 and w2 with trustworthiness t(w1) = 0.9 and t(w2) = 0.99. We can see that w2 is much more accurate than w1, but their trustworthiness do not differ much as t(w2) = 1.1 × t(w1). If we measure their accuracy with trustworthiness score, we will find τ (w2) = 2×τ (w1), which better represents the accuracy of websites.”
Given that both sides are logarithms, we can use the logarithmic product rule, to produce the initial equation
Last thing we do is update the Pandas DataFrame and data sample with this fact confidence score.
2. Enhance the confidence score for a fact using similarity to other facts
ρ is a number between 0–1 and controls the influence of a fact on another fact. The implication function imp(f’ → f) is a function returning for instance Levenshtein or Jaro Winkler similarity (string distance metrics) between the facts f`and f. I used Jaro-Winkler which tends to work better for names.
Facts that are similar are probably implicating other similar facts; George, Luger implicates George Luger F, because the name is identical except for the last initial. This similarity function is domain specific, meaning, it can differ for different attributes. For numerical values, such as country population numbers, the difference between two values might be good enough.
One important property of this implication function is that if the fact is conflicting, we want it to be negative, we solve this by subtracting a base similarity constant.
There is an important caveat mentioned in the paper, if the value of the similarity is less than the base similarity constant (0.5), it will count as a conflicting fact and thus be less than 0 (negative) and for all similarities higher than the base similarity they will be larger than 0 (positive).
3. Adjust the facts confidence score
To calculate the fact confidence probability s(f) from σ*(f) we could do
However, this equation assumes that sources are independent, whereas in reality they are not; they copy and steal from each other. To account for this a dampening factor γ is added to the sigmoid function.
Why the sigmoid function? The confidence of a fact f can be negative if f is conflicting with some facts provided by very trustworthy websites. This in turn would make the probability s∗(f) negative. This makes no sense, since even with negative evidences, there is still a real chance that f is correct. To account for this, the paper uses a variety of the widely adopted sigmoid function, but as mentioned adjusted using the gamma dampening factor.
The code below implements the equation above,
We have now implemented the first part of the algorithm! If you are still with us this far you’ve come a long way. Last stretch ahead.
Step 2: Compute the source trustworthiness
This is not as involved as computing the confidence for the facts; source trustworthiness is the average of all the facts from that specific source.
where F(w) is the set of facts provided by w .
We loop through each source and retrieve all confidence values from all facts that source provides about different objects. We get the average of these confidence score, to get the average confidence score for the facts provided by that source.
Finally if we want the probability s(f) we do ln(1 — confidence score) of the source.
That is it! Now if we continuously iterate and update the facts confidence and consecutively the source trust score, we will eventually reach an equilibrium where the scores won’t change.
The result will be a list of our source trustworthiness scores and a score for each fact. We select the fact for each object that has the highest confidence score.
Results and conclusion
The golden set of authors can be found here. Comparing the results of the paper with my implementation is unfortunately very difficult, because of the author cleaning and accuracy scoring differs greatly from the original paper.
However this implementation of TruthFinder gave a 10% relative improvement compared to simple naive majority voting.
This post is an attempt to serve as an introduction to data fusion of multiple conflicting sources. If you arrived all the way down here, you might already be thinking of both simpler and more sophisticated methods  to solve this problem. A simpler one, is a naive majority voting, counting which fact is provided by the most sources. However, simple majority voting can be problematic, as sources might be biased and plagiarism between sources is common .
There are a plethora of methods that are more sophisticated, more accurate and supports more use-cases. Some interesting papers are referenced in the References section for further exploration.
Hopefully this post enables and encourages you to create your initial data fusion prototype for whatever use case you may have. Let me know if you create something cool or if you have any feedback on the post.
Thanks to all, mostly colleagues, that provided valuable feedback on the post. Thanks to Norah Klintberg Sakal, for additional feedback.
All equations created with https://www.codecogs.com/latex/eqneditor.php
 Truth Discovery with Multiple Conflicting Information Providers on the Web X. Yin, J. Han and P. S. Yu. http://hanj.cs.illinois.edu/pdf/kdd07_xyin.pdf
 Global Detection of Complex Copying Relationships Between Sources. http://www.vldb.org/pvldb/vldb2010/pvldb_vol3/R120.pdf
 A Graph-Based Approach for Effective Multi-Truth Discover, yhttps://arxiv.org/pdf/1708.02018.pdf
 Towards Veracity Challenge in Big Data, https://cse.buffalo.edu/~jing/doc/sdm16veracity.pdf
 A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration http://vldb.org/pvldb/vol5/p550_bozhao_vldb2012.pdf
 Veracity of Big Data CIKM 2015 Slides http://da.qcri.org/dafna/home_sections/tutorial-CIKM2015.pdf
 Veracity of Data: From Truth Discovery Computation Algorithms to Models of Misinformation dynamics https://books.google.com/books?id=eWVOCwAAQBAJ&printsec=frontcover#v=onepage&q&f=false
 Data Fusion: Resolving Conflicts from Multiple Sources https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41657.pdf
 Product of a sequence or Capital Pi notation https://en.wikipedia.org/wiki/Multiplication#Capital_Pi_notation