When I started working on Blikk I didn’t think about the Semantic Web. I had forgotten that it was ever a thing and that billions of dollars had been poured into it. Perhaps I had wanted to forget. Helping to structure existing data on the web, one of the things we are trying to do, is what the Semantic Web was all about. And it didn’t work out that well. But a couple of days ago this discussion on HN brought the Semantic Web back onto my radar. So perhaps it’s time for a recap. This post is my personal interpretation of what happened.
We should probably start by defining what exactly the Semantic Web is. Unfortunately that’s a problem already. The Semantic Web means different things to different people. For some people it’s all about technologies such as RDF, SPARQL and OWL, and for some people it’s more about the overarching goal of annotating content with semantic meaning. The term is as fuzzy as Big Data is today, and RDF is like the Hadoop of the Semantic Web. I think the argument over whether or not the Semantic Web failed has evolved into an argument over terminology. People redefine what the Semantic Web means in order to argue in their own favor. While some people are convinced that the Semantic Web is dead, others argue that we are using it every day. Of course, both are right.
The original vision of the Semantic Web, that individuals and organizations would annotate their own content with metadata, clearly didn’t come to fruition. Cory Doctorow explains why such human-created metadata is a utopian idea and will likely never exist. Many of Cory’s points rely on the fact that the incentives to publish truthful metadata aren’t aligned across consumers and producers of data. Why would I want to publish my data if doing so doesn’t directly benefit my business, makes it easier for competitors to challenge me, or makes me look bad?
However, many of the technologies that were originally built for the Semantic Web are being successfully used by companies today, including Google, Yahoo!, Microsoft, IBM and many others. And if nothing else, the Semantic Web has clearly spurred research. The ISWC conference demonstrates that the research community is healthy. It even has an industry track. This is both good and bad, since the overly academic nature of many semantic technologies was probably one of the things that hindered adoption.
But let’s stop arguing over details and terminology. Let’s forget about RDF, SPARQL, OWL, triple stores and quad stores for a moment. Instead, let’s take a step back and look at the high-level goal of the Semantic Web. According to W3C it was to:
Allow data to be shared and reused across application, enterprise, and community boundaries.
That clearly sounds like something that would be useful. One of the reasons the Internet is so powerful is that it makes the world’s information accessible to anyone. But that’s only partially true. In fact, getting the information we want is often harder than we may realize. We have gotten so used to keyword-based search that we don’t consider the possibility that there may be a better way to do things. And if we believe that searching for something would involve too much work, then we may just go and look for alternative solutions. Paul Graham calls this Schlep Blindness, the tendency to unconsciously avoid doing things that are unpleasant. To demonstrate my point, let me ask some basic questions:
- Which journalists wrote about Travis Kalanick this week? Was the sentiment positive or negative?
- Who left Microsoft last year, and where did those people usually go?
- What kind of technical skills do I need if I want to get a job as a “growth hacker”?
- Which places are available for rental if I want to spend 3 months in Barcelona sometime next year?
Is the raw data to answer these questions available online? Yes. Are the answers to these questions at your fingertips? Nope. You would have to combine data from a range of sources, put them into an Excel spreadsheet, and then analyze it somehow. If you are a developer you may be able to query APIs or write a custom scraper to get the data you need, and then write code to aggregate them into something meaningful. But in many cases even that wouldn’t be feasible due to technical, legal or time constraints.
I think anyone would be hard-pressed to argue that knowledge like the above wouldn’t be valuable. To answer such questions, though, we need to make the right data accessible to machines, which can aggregate and analyze it on our behalf. I believe there is still an incredible amount of value hidden online that isn’t currently being utilized, because we don’t have easy access to the information we need. Doctorow, whose essay I mentioned earlier, got this right over a decade ago. He calls such data observational metadata:
Observational metadata is far more reliable than the stuff that human beings create for the purposes of having their documents found. It cuts through the marketing bullshit, the self-delusion, and the vocabulary collisions.
Did the API Economy succeed where the Semantic Web failed?
Instead of the rise of semantic technologies we have been seeing a different trend over the years: APIs are proliferating. Some people like to call it the API Economy. REST and JSON have become the de facto standard for exposing structured data. Tools for dealing with REST and JSON are mature and easy to use, exactly as they should be, because the target audience is web developers and production systems. Semantic Web technologies were complex and opaque, made by academics for academics, not accessible to many developers, and not scalable to industrial workloads. If you think about it, REST APIs come pretty close to what the W3C was trying to do with the Semantic Web, to “allow data to be shared and reused across application, enterprise, and community boundaries.” And more recently, startups like Kimono and Import.io are making it easier to create APIs for websites that don’t have them. Now anyone can contribute to the API economy.
However, it is also true that REST and JSON don’t expose the same semantics you would get with RDF. LinkedIn’s /people resources are not semantically connected to Twitter’s /people resources. But REST APIs are a pragmatic approach that just works. They are good enough for most use cases. Technologies like JSON-LD may change this. At least they have what I believe to be the right idea:
JSON-LD was created for Web Developers that are working with data that is important to other people and must interoperate across the Web.
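To make the idea concrete, here is a rough sketch of how a JSON-LD style `@context` could connect two otherwise unrelated /people resources. Everything in it is an assumption for illustration: the payloads and field names are made up and do not match any real LinkedIn or Twitter response, and `expand` is a toy stand-in for the real JSON-LD expansion algorithm, which handles far more cases. Only the schema.org property IRIs are real.

```python
# Hypothetical payloads: the field names are invented for this sketch and
# do not correspond to real LinkedIn or Twitter API responses.
linkedin_person = {
    "@context": {
        "fullName": "http://schema.org/name",
        "headline": "http://schema.org/jobTitle",
    },
    "fullName": "Ada Example",
    "headline": "Engineer",
}

twitter_person = {
    "@context": {
        "display_name": "http://schema.org/name",
    },
    "display_name": "Ada Example",
}


def expand(doc):
    """Rename each key to the IRI that the document's @context maps it to.

    This is a toy version of JSON-LD expansion: it only handles flat
    documents with a simple term-to-IRI context.
    """
    context = doc.get("@context", {})
    return {context[key]: value for key, value in doc.items() if key in context}


def shared_properties(a, b):
    """Return the property IRIs that both documents describe."""
    return sorted(set(expand(a)) & set(expand(b)))


if __name__ == "__main__":
    print(expand(linkedin_person))
    print(expand(twitter_person))
    print(shared_properties(linkedin_person, twitter_person))
```

The two services still use their own key names, but once each key is mapped to a shared IRI, a machine can tell that `fullName` and `display_name` mean the same thing without anyone writing service-specific glue code.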
The role of Machine Learning and AI
Besides the proliferation of APIs there has been another trend over the past decade: Machine Learning techniques have gotten significantly better at extracting meaning from semi-structured data, unstructured text, images and videos. This is due both to the availability of more data (which often beats smarter algorithms) and to actual algorithmic advances such as those in Deep Learning.
So, what if, instead of letting individuals and organizations annotate their own data, or build their own APIs, we could let machines do it for us? This is already happening. Google’s Knowledge Vault project, which is the repository behind its Knowledge Graph product, can extract meaningful facts from the web corpus with surprisingly high accuracy. It uses a combination of Machine Learning and probabilistic techniques, together with lots of data. And then there is IBM’s Watson, which (among other things) identifies meaningful facts in unstructured data sources and uses them to provide Question-Answer capabilities for a range of industries. And the more semantically annotated data becomes available online, the easier it will become to train Machine Learning systems to do an even better job.
However, most likely the process of structuring data can never be fully automated. At least not until we reach a state of Strong AI, which won’t happen anytime soon. Machines are lacking the type of commonsense knowledge and life experience that we humans use to make decisions. Even in the case of Google’s Knowledge Vault, extracted facts must still be reviewed by humans before they make it into the search results. But perhaps this is as it should be. Let both machines and humans do what they are good at. We don’t need machines to do all of the work for us, as long as they can drastically reduce the time we spend on mundane tasks or allow us to do things we couldn’t do before. As Peter Thiel likes to say:
The most valuable companies in the future won’t ask what problems can be solved with computers alone. Instead, they’ll ask: how can computers help humans solve hard problems?
So perhaps there are opportunities for machine-human collaborations that could make use of the existing API economy and bring the overarching goal of the Semantic Web a bit closer?
Note: I originally posted this here.