Freebase is dead, long live Freebase
Earlier this week an important chapter in the history of technology quietly came to an end. At its launch, Tim O’Reilly described Metaweb’s Freebase as “a system for building the synapses for the global brain”, and nearly a decade later this remarkable part of the global brain — having since been absorbed into Google’s Knowledge Graph — was switched off.
Thomas L. Friedman recently remarked upon the cornucopia of invention which occurred in 2007:
In 2007, Apple came out with the iPhone, beginning the smartphone/apps revolution; in late 2006 Facebook opened its doors to anyone, not just college and high school students, and took off like a rocket; Google came out with the Android operating system in 2007; Hadoop launched in 2007, helping create the storage/processing power for the big data revolution; Github, launched in 2007, scaling open-source software; Twitter was spun off as its own separate platform in 2007. Amazon came out with the Kindle in 2007. Airbnb started in 2007.
But he failed to note the most significant internet-related development of 2007: the launch of Freebase by Metaweb in March of that year. ‘Significant’ not in terms of popularity or brand recognition, but in terms of the weight of its ideas and its vision of a greatly changed future, a future driven by artificial intelligence.
Artificial intelligence has made significant strides of late and received a lot of publicity, notably the triumph of Google DeepMind’s AlphaGo over Lee Sedol and IBM Watson’s victory on Jeopardy!. The term ‘deep learning’ has risen to become the buzzword of 2016, but as Alexander Wissner-Gross noted, the algorithms behind deep learning have been around in various forms for a number of years. The current advances have been made possible largely by the recent increase in the availability of data about the subject at hand.
This, I believe, is what Danny Hillis, Robert Cook, John Giannandrea and colleagues perceived in the years running up to 2007. They understood artificial intelligence could be far more powerful with more information. In particular they recognised the limitations of computers: machines struggle to absorb knowledge in the way humans do. If computers had access to the entire wealth of human knowledge the possibilities were near limitless — an android Minerva. All that was required was a means to transform and store information in a form which could readily be absorbed by computers. This was to be Freebase.
The creators of Freebase understood that information as a long list of facts or data isn’t much use on its own. What makes data useful is the relationship between facts; the connections between facts matter as much as, if not more than, the facts themselves. As every detective in a TV show or film knows, it is connecting the pieces in the puzzle of evidence that makes the case; you only go to court once you’ve conclusively linked the weapon to the accused.
What this required was a rethinking of how data is stored in computers, in a way that makes the relationships between facts as important as the facts themselves. The typical way of storing data so that computers understand it is to use a relational database. These databases hold tables of data which, superficially, look much like an Excel spreadsheet. They work well when the type of information you use is known ahead of time, such as corporate finances or customer purchases. But introducing new types of facts to existing information quickly becomes difficult with typical software, e.g. “I’d like to add birthdays to my database of clients so I can remember to send them a gift, and I’d like to record what gift I sent them each year” (you can hear the groans from IT already, and expect a large invoice from your external software developer shortly). For the near-infinite types of connections possible between all the facts in the sum of human knowledge, this approach very quickly becomes impractical.
Thinking about how to store all human knowledge leads to deeper philosophical introspection: “what is a fact?”. Parts of this question are well explored, and we can build upon well-established schools of thought such as mathematical graph theory. Here, a fact can be represented very simply as two dots connected by a line.
If we label the first dot as “Bono”, the second dot as “U2” and the line as “Member of”, we can represent the fact that Bono is a member of U2. To represent another fact about Bono, we create another dot (“Ireland”) and another line (“nationality”), and so on until we have created a network of facts. We can continue to grow this ad infinitum until we have reached the sum of human knowledge.
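This dots-and-lines model can be sketched as a list of (subject, predicate, object) triples. The toy in-memory store below illustrates the idea only; it is not how graphd actually worked:

```python
# A fact is two nodes joined by a labelled edge: (subject, predicate, object).
facts = [
    ("Bono", "member_of", "U2"),
    ("Bono", "nationality", "Ireland"),
    ("The Edge", "member_of", "U2"),
]

def about(subject):
    """Return every (predicate, object) pair for a given subject."""
    return [(p, o) for s, p, o in facts if s == subject]

def members_of(band):
    """Follow 'member_of' edges backwards to find a band's members."""
    return [s for s, p, o in facts if p == "member_of" and o == band]

print(about("Bono"))      # [('member_of', 'U2'), ('nationality', 'Ireland')]
print(members_of("U2"))   # ['Bono', 'The Edge']
```

Note that adding an entirely new kind of fact — a birthday, an award, a favourite gift — is just one more triple in the list; no table needs redesigning.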
Danny Hillis outlined his idea in a paper in 2000, but the technology began in earnest in 2003 and spun off into a separate company, Metaweb, in 2005. The database which powered Freebase was known as graphd. Graphd introduced a number of innovative ideas which, when combined, created a powerful system on which to build a global-scale ‘brain’.
As an example of some of its features, each concept in graphd was given a unique identification number, much like a barcode; so the band U2 was 0dw4g and George Clooney 014zcr. Without unique IDs, ideas sharing the same name — such as the 41 places in the USA named Springfield — could not be differentiated. This would lead to ambiguities and inevitable confusion for any artificial intelligence trying to make sense of the world.
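The point can be shown with a toy lookup table. The U2 and George Clooney IDs are the real Freebase ones mentioned above; the Springfield IDs are invented for illustration:

```python
# Names are ambiguous; stable machine IDs are not.
# (Springfield IDs below are made up for illustration.)
name_to_ids = {
    "U2": ["0dw4g"],
    "George Clooney": ["014zcr"],
    "Springfield": ["0spr1", "0spr2", "0spr3"],  # one ID per distinct place
}

def resolve(name):
    """Map a human-readable name to its unique ID, if unambiguous."""
    ids = name_to_ids.get(name, [])
    if len(ids) == 1:
        return ids[0]
    raise LookupError(f"{name!r} is ambiguous: {len(ids)} candidates")

print(resolve("U2"))  # 0dw4g
```

Any software wanting to say something about “Springfield” must first decide *which* Springfield; the IDs make that decision explicit and permanent.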
Graphd was ‘append only’: new information was always added to the end of the log, creating an immutable, fully auditable timeline chronicling the growth of knowledge in Freebase. Once information had been written, graphd would only ever read it, never modify it. Information could not be deleted (except in rare cases, e.g. the removal of confidential information), so the entire history of edits and amendments was available and transparent for all to discover.
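The append-only idea can be sketched as follows (a simplification of the concept, not graphd’s actual storage format). A ‘retraction’ is itself just another appended entry, so nothing is ever erased and the full history stays readable:

```python
import time

# Append-only log of assertions: facts are never updated in place.
log = []

def assert_fact(subject, predicate, obj):
    """Append a new fact; return its position in the log as its ID."""
    entry = {"id": len(log), "s": subject, "p": predicate, "o": obj,
             "ts": time.time(), "retracts": None}
    log.append(entry)
    return entry["id"]

def retract(fact_id):
    """'Delete' a fact by appending a retraction entry pointing at it."""
    log.append({"id": len(log), "retracts": fact_id, "ts": time.time()})

def current_facts():
    """The live view: every fact not covered by a later retraction."""
    retracted = {e["retracts"] for e in log if e.get("retracts") is not None}
    return [(e["s"], e["p"], e["o"]) for e in log
            if "s" in e and e["id"] not in retracted]

fid = assert_fact("Bono", "member_of", "U2")
assert_fact("Bono", "nationality", "Ireland")
retract(fid)
print(current_facts())  # [('Bono', 'nationality', 'Ireland')]
print(len(log))         # 3 -- the retracted fact is still in the log
```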
The most powerful feature of graphd was that you didn’t have to worry upfront about how the data might be connected in the future. With most relational databases, particularly those used in large corporations, there are teams of people whose job it is to work out what data they want to store and how it might be used later. They then spend a lot of time working out, in simple terms, what database tables are required, how they will be connected, and what the columns in each table will be called (the database’s ‘schema’). All of this must be determined and set in stone before the database can be used and any data added to it. Graphd required none of this upfront work. It allowed you to throw in any data you wanted (with a touch of ‘magic’ to identify potential duplicates and reconcile them with existing facts) and later, at your convenience, and only if you wished, to model a schema describing richer connections between the data.
Graphd alone didn’t make Freebase a great application; there had been graph databases before it — development of the popular Neo4j database began in 2002, for example. What made Freebase great, particularly for people like me who aren’t software engineers by trade, was the relatively simple user interface available at Freebase.com. As an example of its innovation, Freebase launched its autocomplete search tool, Freebase Suggest, in December 2007. At the same time Google was working on its similar Google Suggest feature, which launched in August 2008 and later became the Autocomplete feature we are familiar with.
Freebase provided the ability to edit facts about an entity in a single click, much like Wikipedia allows collaborative editing of articles.
Freebase allowed anyone to create their own schema, their own interpretation of how data should be connected, and made this complex and esoteric task as easy as pointing and clicking. For example, Freebase already contained an article for David Beckham but had no way of showing that he is an Officer of the Order of the British Empire. Some years after information on both David Beckham and the Order of the British Empire had been added to Freebase, I created a schema which allowed these two topics — and hundreds of other OBE recipients — to be connected. This community-driven effort to create more ways to ‘connect the dots’ — a type of folksonomy — was a novel and powerful feature of Freebase, building on the idea that connections add far more value to data than the facts alone.
It is here, in the task of modelling your own representation of how facts are connected, that it becomes apparent that your understanding of the world differs from others’. It is particularly interesting to reflect on how a language — just one representation of information — can be so ambiguous and unclear. It was fitting to find, within Freebase’s community and among Metaweb’s employees, philosophers and linguists working alongside software engineers.
“Die Grenzen meiner Sprache bedeuten die Grenzen meiner Welt”. (The limits of my language mean the limits of my world.) — Ludwig Wittgenstein
Freebase obviously had great appeal to the technically inclined and introduced a number of technologies related to web development. Metaweb created its own query language for Freebase, MQL, whose syntax was closer to Datalog than to SQL. This had benefits for querying the underlying graph structure: queries aligned with the shape of the expected data, joins were implicit, and patterns could be applied to filter or select variables. MQL was, in effect, an early relative of what we now know as GraphQL. MQL was created by Metaweb around 4 years before GraphQL was developed at Facebook, and 8 years before Facebook released GraphQL to the public.
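To give a flavour of the style: an MQL query was a JSON document shaped like the answer you want, with null and [] marking the values to be filled in. The sketch below shows this shape in Python (the live Freebase endpoint no longer exists, so nothing is sent anywhere, and the nested example is illustrative of the style rather than a verified query):

```python
import json

# An MQL query is a JSON template shaped like the expected result:
# null means "fill in this value", [] means "fill in a list of values".
query = {
    "type": "/music/artist",
    "name": "U2",
    "album": [],          # ask for every album by this artist
}

# Joins are implicit: nesting a sub-template traverses an edge in the graph.
nested = {
    "type": "/music/artist",
    "name": None,                            # fill in: which artist...
    "album": {"name": "The Joshua Tree"},    # ...released this album?
}

print(json.dumps(query, indent=2))
```

The resemblance to GraphQL is striking: in both, the query mirrors the shape of the response rather than describing tables and joins.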
Automated bots extracted information from Wikipedia and other databases and attempted to work out where these facts connected into Freebase. As the rise of chatbots is showing us, much of this work is powered by humans. Freebase created an application that harnessed the power of crowdsourcing, asking questions of humans: “Is David Beckham a football (soccer) player? Did Winston Churchill die in 1965?”. Freebase opened this application’s API, RABJ, to other developers so they could ask questions in the particular domains that interested them. As a result of the hundreds of thousands of questions answered by members of the community and by contractors, the breadth and quality of Freebase’s database grew.
Danny Hillis’ original vision discussed the idea of a shared knowledge web, perhaps building on Tim Berners-Lee’s idea of a semantic web, which he described in 2006 as:
[The Semantic Web] is about making links, so that a person or machine can explore the web of data.
For Metaweb this meant not just consuming data to be integrated into graphd, but seeking out other databases and creating links directly between topics elsewhere. The idea of a unique identifier for each concept, embedded in the graphd technology, was reflected in Freebase’s API and website. Each topic had its own URL, allowing direct, deep linking by others. Freebase’s RDF API launched in October 2008, providing a W3C-compatible machine interface to its data over the internet.
Not all publishers of data, in fact very few, adhere to the rules of Linked Open Data and the Semantic Web. Instead, most published data lives in text files, Excel spreadsheets and database dumps; it is often erroneous, ambiguous and incomplete. In short, much data is messy and not usable without considerable effort spent tidying it up. It became apparent that much of the community contributing bulk data to Freebase was struggling with these problems, and that no good method existed to deal with them. From this struggle arose Gridworks, later renamed OpenRefine, an open source tool started in collaboration between Metaweb’s employees — notably David Huynh and Stefano Mazzocchi — and Freebase’s developer community. This open source, desktop-based software allows anyone to open a file full of data and visually inspect and fix problems in it. Freebase thus further lowered the barrier to contributing data to its database, and in return provided a novel tool which can be used in a diverse number of ways.
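One of OpenRefine’s best-known clustering techniques is the ‘fingerprint’ key-collision method: normalise each value, then sort and de-duplicate its tokens, so that trivially different spellings collide on the same key. A simplified sketch of that idea:

```python
import re
from collections import defaultdict

def fingerprint(value):
    """Key-collision fingerprint in the style of OpenRefine's clustering:
    lowercase, strip punctuation, then sort and de-duplicate the tokens."""
    tokens = re.sub(r"[^\w\s]", "", value.lower()).split()
    return " ".join(sorted(set(tokens)))

# Messy real-world name variants...
names = ["Beckham, David", "David Beckham", "david beckham.", "Victoria Beckham"]

# ...cluster together when their fingerprints collide.
clusters = defaultdict(list)
for n in names:
    clusters[fingerprint(n)].append(n)

for key, members in clusters.items():
    print(key, "->", members)
```

The three spellings of David Beckham share the key `beckham david` and fall into one cluster, while Victoria Beckham stays separate; a human then confirms the merge, which is exactly the inspect-and-fix workflow OpenRefine puts a friendly interface around.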
Metaweb was purchased by Google in July of 2010. The technology, team and data have, in parts, been integrated into Google’s Knowledge Graph and other search technologies (Metaweb’s John Giannandrea now leads Google Search).
Many of Freebase’s applications, including its website, have been released as open source software. Tools such as OpenRefine are still maintained by an active community. However, graphd, the underlying database technology, has never been released. Similar database technology with matching features — an append-only, ACID, schema-last graph database using Datalog-style queries — can be found in software such as Rich Hickey’s Datomic and Barak Michener’s Cayley.
At the time Freebase was closed it had 1.9 billion facts in its database. Its entire database is still available to download, and has additionally been made available as open data to the Wikidata project.
I first came across Freebase in 2008 when I was looking for a publicly accessible project that was collating raw data. I’d just read Toby Segaran’s Programming Collective Intelligence (still a good book and recommended; coincidentally, Toby later worked at Metaweb) and was looking for some data to play around with. From that point I went deep down the rabbit hole for a few years, as part of the public community of volunteers contributing to, maintaining and exploring Freebase. I found in Freebase a supportive community of like-minded people who, for wide-ranging reasons of their own, were working together on this hugely ambitious project.
Freebase.com closed this week not because it failed but because it was successful. At the vanguard of Machine Learning and the Semantic Web, Freebase showed a way forward for the likes of Google’s Knowledge Graph and Wikidata into which it has now been absorbed.
And thank you to everyone who has been part of Freebase’s awesome community over the past decade.