[Update: several days after this article was published, Google fixed the attribution error. Bing also eventually fixed the attribution, but only after a much longer delay. I still think it makes a great case study. Thanks to everyone who told Google they were wrong!]
I have been doing some research on how Knowledge Graphs extract facts from text documents to summarize information. I was rather surprised to find that the very first quote Google lists for my hero, Charles Darwin, was wrong. In the right margin of Google's results (called the InfoBox summary), the first quote is something that Darwin never said. Here is the quote:
It is not the strongest of the species that survive, nor the most intelligent, but the one most responsive to change.
One would think that Google, being a well-respected company, would do a little bit of fact checking to verify the first quote on one of the most important figures in all of science. Wrong!
I know this quote well because I frequently use it in my slides about database selection. Just replace the word “species” with “database architecture” and you get the picture. It also applies to companies, schools, and business strategies. This quote is really at the heart of business agility. But Darwin never said this. The author was Leon C. Megginson, a business professor who said this in 1963.
Now let’s look at the Wikipedia InfoBox summary for Darwin:
You will note a few important differences between the Google summary and the Wikipedia summary. Both summaries include the birth and death dates. That’s good. The people who created the Wikipedia InfoBox decided that the most important facts are not the names of Darwin’s children, but the links to his groundbreaking books: The Voyage of the Beagle and On the Origin of Species. They got that right! And there is no incorrect quote attribution in the Wikipedia article. Darwin fans police this page to make sure it is accurate and trustworthy. Thank you, Wikipedia authors and editors!
So why did Wikipedia get this right and Google fail? The answer is that Google’s InfoBox was entirely generated by machine learning. Google’s web crawlers scan every document on the web, parse sentences with natural language processing tools and find people’s names like “Charles Darwin”. They create a graph with a single “Node” for Charles Darwin. They then add facts to this node as new arcs to other nodes. Some of the arcs are of the “quote” type, and they point to a node holding the text of the quote. The graph gracefully grows as each new page is scanned. The more pages scanned, the bigger the graph. It is rumored that Google has about 85 billion facts in their graph. Very cool! Right?
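To make the node-and-arc idea concrete, here is a minimal sketch in Python. It is only my illustration of the structure described above, not Google's actual implementation. The class name KnowledgeGraph, its methods, and the list of extracted triples are all hypothetical stand-ins for whatever a real crawl-and-parse pipeline would produce.

```python
from collections import defaultdict

MISATTRIBUTED = ("It is not the strongest of the species that survive, "
                 "nor the most intelligent, but the one most responsive to change.")
GENUINE = "There is grandeur in this view of life"


class KnowledgeGraph:
    """Toy knowledge graph: one node per entity, typed arcs to other nodes."""

    def __init__(self):
        # subject node -> arc type -> target node -> number of pages asserting it
        self.arcs = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))

    def add_fact(self, subject, arc_type, target):
        """Add an arc, or bump its page count if the arc already exists."""
        self.arcs[subject][arc_type][target] += 1

    def targets(self, subject, arc_type):
        """Return {target: page_count} for one node and one arc type."""
        return dict(self.arcs[subject][arc_type])


# Hypothetical output of the crawl-and-parse step: one triple per scanned page.
extracted = [
    ("Charles Darwin", "quote", MISATTRIBUTED),
    ("Charles Darwin", "quote", GENUINE),
    ("Charles Darwin", "quote", MISATTRIBUTED),
    ("Charles Darwin", "wrote", "On the Origin of Species"),
]

graph = KnowledgeGraph()
for subject, arc_type, target in extracted:
    graph.add_fact(subject, arc_type, target)

for quote, pages in graph.targets("Charles Darwin", "quote").items():
    print(pages, quote)
```

Notice that nothing in add_fact asks whether a fact is true. The graph simply records what the pages say, which is exactly why a popular misattribution gets in.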
However, because this quote was misattributed to Darwin a long time ago, the error just keeps getting propagated across the internet like a virus. The fact that Google has it as the very first quote on Darwin’s page shows the power of the quote. It really does sound exactly like something that Darwin might have said. Now every junior high school student who types “Charles Darwin” into Google’s search engine will copy and paste it into their school report and their Tumblr blog, which will in turn get picked up by Google’s search engine, and the ranking will continue to go up. Google’s confidence that it must be right gets even stronger. An endless cycle of falsehoods and fake news.
However, there is one important point we must make clear. It is not the architecture of a graph that causes the error. It is the limitations of the machine learning algorithms that are at fault. They blindly scan for text, and the more pages that repeat a quotation, the higher that quotation ranks. At Google, statistics rule.
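Here is a small sketch of that feedback loop, assuming the simplest possible ranking rule: the quote that appears on the most pages wins. The starting page counts, the number of rounds, and the copies per round are made-up numbers for illustration only. I have no knowledge of how Google actually scores quotes.

```python
from collections import Counter


def top_quote(counts):
    """Rank purely by frequency: the most repeated quote wins."""
    return counts.most_common(1)[0][0]


def simulate_feedback(counts, rounds, copies_per_round):
    """Each round, new pages copy whatever quote is currently ranked first."""
    counts = Counter(counts)  # work on a copy so the input is left untouched
    for _ in range(rounds):
        counts[top_quote(counts)] += copies_per_round
    return counts


# Made-up starting point: pages on the web repeating each quote.
before = Counter({"misattributed": 60, "genuine": 40})
after = simulate_feedback(before, rounds=5, copies_per_round=10)

for label, counts in (("before", before), ("after", after)):
    share = counts["misattributed"] / sum(counts.values())
    print(f"{label}: misattributed quote appears on {share:.0%} of pages")
```

In this toy run the genuine quote never gains a single page, because only the top-ranked quote gets copied. A frequency-only ranking cannot correct itself. It can only amplify whatever was already ahead.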
Now to be fair, Google does have a feedback link at the bottom of the InfoBox. You can click on it and it puts each “fact” in a selection list, so you can select the fact you think is wrong and explain why it is wrong. However, despite my doing this, the quote is still there. At Google, machines rule over people’s feedback. Google might also say they don’t really claim Darwin actually said it, only that this quote is the one most strongly associated with something that Darwin should have said! Right?
Here is the take-home point. As database architects, whenever we use machine learning to extract facts, we must remember one of the “V”s associated with data: Volume, Velocity, Variety and Veracity. Veracity is conformity to facts, that is, the accuracy of our data. We need ways to validate that the facts we extract from text and place into our graph really are true. Many well-intentioned projects fail due to poor data quality. Only when machine learning systems can actually search the Darwin collections and look up articles on sites like quoteinvestigator.com to check their work will we be able to trust Google’s summary more than Wikipedia’s summary. Until then, beware of the many machine-generated Knowledge Graphs! Double-check your facts with other sources, and Wikipedia is a good place to start.
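One way to picture a veracity check is a gate that only accepts a “quote” arc when the text can be found in a trusted primary source. The sketch below assumes we already have such a corpus in hand. Here it is just a one-line excerpt from On the Origin of Species standing in for the full Darwin collections. The function names and the simple substring match are my own simplifications, and a real system would need much more robust matching.

```python
import re

MISATTRIBUTED = ("It is not the strongest of the species that survive, "
                 "nor the most intelligent, but the one most responsive to change.")
GENUINE = "There is grandeur in this view of life"

# Stand-in for a trusted corpus; a real one would hold Darwin's full texts.
PRIMARY_SOURCES = [
    "There is grandeur in this view of life, with its several powers",
]


def normalize(text):
    """Lowercase and collapse punctuation and whitespace to single spaces."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def is_verified(quote, primary_sources):
    """True only if the quote appears in at least one trusted source."""
    needle = normalize(quote)
    return any(needle in normalize(doc) for doc in primary_sources)


def split_by_veracity(candidates, primary_sources):
    """Separate candidate quotes into (verified, unverified) lists."""
    verified = [q for q in candidates if is_verified(q, primary_sources)]
    unverified = [q for q in candidates if not is_verified(q, primary_sources)]
    return verified, unverified


verified, unverified = split_by_veracity([MISATTRIBUTED, GENUINE], PRIMARY_SOURCES)
print("verified:", verified)
print("needs human review:", unverified)
```

The design choice that matters is that popularity plays no part in the decision. A quote repeated on a million pages still lands in the review pile if no primary source contains it.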
And one final request. Please do Leon Megginson and Darwin scholars everywhere a favor and continue to tell Google they messed up! Perhaps if we all do this, we can correct this travesty of attribution for one of the greatest quotes of our time.
Thanks! — Dan