Truth discovery in practice — aggregating conflicting datasources

Published in

FairhairAI

10 min readJan 21, 2020

The vision of Fairhair.ai is to harness the power of diverse online sources by using 5 advanced AI systems in combination to surface, structure, connect, understand, reason and derive actionable insights. In the Fairhair.ai Knowledge graph team we are connecting, cleaning and distilling all this information from different sources into one coherent store and view.

All these data points tells us something different about each entity, sometimes complementary and sometimes conflicting. Online sources vary in both coherence, reliability and accuracy.

Our customers wants to answer questions such as “Which of Pepsi Co competitors in the beverage segment has done the most acquisitions?”

To answer this question we need to aggregate all information pertaining to Pepsi Co and its competitors. This information is available in both unstructured format such Wikipedia, news articles, corporate websites; and structured sources i.e DBPedia and Wikidata. How do we fuse information from all these sources to create one source of knowledge? [5].

This problem is often called the veracity problem or truth discovery: which information, from a range of conflicting sources, is the most accurate information [1].

In this post I’ll walk you through one possible approach to this problem for structured data integration and provide coding examples and explanations.

TruthFinder

TruthFinder [1] is an algorithm that tries to solve the veracity problem in an iterative process, similar to the HITS algorithm or PageRank. However in comparison to the HITS algorithm, which tells you the odds of a random walker/surfer ending up on the source, TruthFinder emphasizes correctness of facts and not the quantity of facts provided from that source[1, 7].

We’ll start out with a real world dataset [2] of 1265 books and 26554 author attributes from 895 different e-commerce sites. It can be downloaded directly from here.

Representing the problem

The problem is best represented visually as in the paper, using a graph structure:

Source nodes (E-commerce sites), attribute/fact nodes(Authors), object nodes (Books)

Where w is a source and f is a fact or attribute and o is a book or object. Different sources provide different facts and they provide facts for different objects. All of these facts are of the same type, they all provide authors for books. Here’s our graph with some actual author data filled in.

Sources e-Campus and Indoo both claim the author Luger George to be the author of AI: structures and strategies

TruthFinder algorithm assumptions

There are some underlying assumptions/heuristics that the TruthFinder algorithm makes use of:

There is only one true fact for a property of an object— sometimes referred to as the “single-truth assumption” [3] — a simplification we’ll use.
(A) true fact is the same or similar on different websites — true facts are often similar across sources.
False facts on different websites are less likely to be the same or similar— false facts are often more random than true facts.
In a certain domain, a website that provides mostly true facts for many objects will likely provide true facts for other objects— similar to the Authority Hub assumption: a good hub represents a page that points to many other pages, and a good authority represents a page that is linked to by many different hubs.

You may well be thinking these are all very general assumptions and not always true. However, they will serve as a good enough hypothesis for now. Condensed these assumptions become:

A fact has high confidence if it is provided by (many) trustworthy websites.
A website is trustworthy if it provides many facts with high confidence.

We’ll walk through the 2 major steps in the TruthFinder algorithm [4] :

Compute fact confidence score
Compute source trustworthiness score

Repeat this until there is no change in trustworthiness or fact confidence.

Exploring the data

Let’s start by exploring the data by loading the data into Pandas and displaying the first 10 rows.

Each book has a unique ISBN identifier, which we’ll use to identify each book also called “object”. Each datapoint has a source, a title and an authors attribute field. The authors field is the fact or attribute that we are trying to reconcile and find the most accurate value for; which author(s) are the correct ones for a specific book.

Additionally checking for row duplicates, with -

sum(data.duplicates())

reveals a substantial number of rows that are identical. Let’s drop all of the duplicates so that our results are not skewed.

data = data.drop_duplicates()

The TruthFinder steps explained

Let’s continue defining some constants and initialize our sources to make sure all w (sources) have an equal trustworthiness of 0.9 to start with [1].

Equal source/website source trustworthiness

we continue to define the general structure of our iterative process using the basic assumptions/heuristics of the TruthFinder algorithm. Let’s define the iterative process we described so far in Python, defining each step as a separate function for easier implementation.

Don’t worry I’ll walk you through each step,

Step 1. Compute fact confidence score

There are 3 sub-steps to compute the facts confidence using TruthFinder.

Compute the attribute/facts confidence by summing all source confidence scores from sources providing that exact fact

2. Adjust the confidence score by adding to the score if there are implicating similar facts, but subtract from the score if there are conflicting facts.

3. Handle negative probabilities of facts and the independence assumption of sources.
Adding a dampening factor gamma to the sigmoid function.

Adjusted sigmoid function

Let’s dive into more detail of the 3 sub steps of the fact confidence score calculation.

Compute initial fact confidence by summating the log of the trustworthiness of the sources providing that fact.
In this step we want to calculate the fact confidence using the sum of the source trustworthiness for all sources providing that specific fact.

So in our initial instance two sources provide George Luger and the initial state both have 0.9 this would be calculated as follows

Compute confidence score for fact f, f4 has a higher confidence than other facts.

First, we want to get all of the sources that agree on a particular fact. Two sources claims that the author of “Art of computer programming” is “Donald Knuth”. One source claims it is “D. Knuth”. Sum the scores of all sources providing identical facts. Remember at this stage all sources have the same trustworthiness, thus “Donald Knuth” will get double the score of “D. Knuth”.

σ(f) is what is called the confidence score, it is the negative log probability of a fact being incorrect by summing τ(w), the confidence scores of all sources w in W(F). τ(w) is the negative log probability of source trustworthiness.

Fact confidence score sigma for confidence s(f)

Source trustworthiness score tau for source trustworthiness

Why are these defined like this? Generally computing the fact confidence is defined as follows.

this could lead to underflow in the multiplication (∏ is the symbol for product of a sequence [9]). Another reason is given in the paper “suppose there are two websites w1 and w2 with trustworthiness t(w1) = 0.9 and t(w2) = 0.99. We can see that w2 is much more accurate than w1, but their trustworthiness do not differ much as t(w2) = 1.1 × t(w1). If we measure their accuracy with trustworthiness score, we will find τ (w2) = 2×τ (w1), which better represents the accuracy of websites.”

Given that both sides are logarithms, we can use the logarithmic product rule, to produce the initial equation

Last thing we do is update the Pandas DataFrame and data sample with this fact confidence score.

2. Enhance the confidence score for a fact using similarity to other facts

ρ is a number between 0–1 and controls the influence of a fact on another fact. The implication function imp(f’ → f) is a function returning for instance Levenshtein or Jaro Winkler similarity (string distance metrics) between the facts f`and f. I used Jaro-Winkler which tends to work better for names.

Facts that are similar are probably implicating other similar facts; George, Luger implicates George Luger F, because the name is identical except for the last initial. This similarity function is domain specific, meaning, it can differ for different attributes. For numerical values, such as country population numbers, the difference between two values might be good enough.

One important property of this implication function is that if the fact is conflicting, we want it to be negative, we solve this by subtracting a base similarity constant.

There is an important caveat mentioned in the paper, if the value of the similarity is less than the base similarity constant (0.5), it will count as a conflicting fact and thus be less than 0 (negative) and for all similarities higher than the base similarity they will be larger than 0 (positive).

3. Adjust the facts confidence score
To calculate the fact confidence probability s(f) from σ*(f) we could do

However, this equation assumes that sources are independent, whereas in reality they are not; they copy and steal from each other. To account for this a dampening factor γ is added to the sigmoid function.

Why the sigmoid function? The confidence of a fact f can be negative if f is conflicting with some facts provided by very trustworthy websites. This in turn would make the probability s∗(f) negative. This makes no sense, since even with negative evidences, there is still a real chance that f is correct. To account for this, the paper uses a variety of the widely adopted sigmoid function, but as mentioned adjusted using the gamma dampening factor.

The code below implements the equation above,

We have now implemented the first part of the algorithm! If you are still with us this far you’ve come a long way. Last stretch ahead.

Step 2: Compute the source trustworthiness

Source trustworthiness is computed now using the f confidence scores.

This is not as involved as computing the confidence for the facts; source trustworthiness is the average of all the facts from that specific source.

Equation for source trustworthiness computation

where F(w) is the set of facts provided by w [1].

We loop through each source and retrieve all confidence values from all facts that source provides about different objects. We get the average of these confidence score, to get the average confidence score for the facts provided by that source.
Finally if we want the probability s(f) we do ln(1 — confidence score) of the source.

Final steps

That is it! Now if we continuously iterate and update the facts confidence and consecutively the source trust score, we will eventually reach an equilibrium where the scores won’t change.

The result will be a list of our source trustworthiness scores and a score for each fact. We select the fact for each object that has the highest confidence score.

Results and conclusion

The golden set of authors can be found here. Comparing the results of the paper with my implementation is unfortunately very difficult, because of the author cleaning and accuracy scoring differs greatly from the original paper.
However this implementation of TruthFinder gave a 10% relative improvement compared to simple naive majority voting.

http://da.qcri.org/dafna/#/dafna/exp_sections/comparativeStudy/rwd-analysis.html

This post is an attempt to serve as an introduction to data fusion of multiple conflicting sources. If you arrived all the way down here, you might already be thinking of both simpler and more sophisticated methods [5] to solve this problem. A simpler one, is a naive majority voting, counting which fact is provided by the most sources. However, simple majority voting can be problematic, as sources might be biased and plagiarism between sources is common [8].

There are a plethora of methods that are more sophisticated, more accurate and supports more use-cases. Some interesting papers are referenced in the References section for further exploration.

Hopefully this post enables and encourages you to create your initial data fusion prototype for whatever use case you may have. Let me know if you create something cool or if you have any feedback on the post.

I work in the Knowledge Graph team at Meltwater where we do this stuff all day. You can learn more about what we do at underthehood.meltwater.com, fairhair.ai and meltwater.com.

Acknowledgments

Thanks to all, mostly colleagues, that provided valuable feedback on the post. Thanks to Norah Klintberg Sakal, for additional feedback.
All equations created with https://www.codecogs.com/latex/eqneditor.php

References

[1] Truth Discovery with Multiple Conflicting Information Providers on the Web X. Yin, J. Han and P. S. Yu. http://hanj.cs.illinois.edu/pdf/kdd07_xyin.pdf

[2] Global Detection of Complex Copying Relationships Between Sources. http://www.vldb.org/pvldb/vldb2010/pvldb_vol3/R120.pdf

[3] A Graph-Based Approach for Effective Multi-Truth Discover, yhttps://arxiv.org/pdf/1708.02018.pdf

[4] Towards Veracity Challenge in Big Data, https://cse.buffalo.edu/~jing/doc/sdm16veracity.pdf

[5] A Bayesian Approach to Discovering Truth from Conflicting Sources for Data Integration http://vldb.org/pvldb/vol5/p550_bozhao_vldb2012.pdf

[6] Veracity of Big Data CIKM 2015 Slides http://da.qcri.org/dafna/home_sections/tutorial-CIKM2015.pdf

[7] Veracity of Data: From Truth Discovery Computation Algorithms to Models of Misinformation dynamics https://books.google.com/books?id=eWVOCwAAQBAJ&printsec=frontcover#v=onepage&q&f=false

[8] Data Fusion: Resolving Conflicts from Multiple Sources https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/41657.pdf

[9] Product of a sequence or Capital Pi notation https://en.wikipedia.org/wiki/Multiplication#Capital_Pi_notation