Tackling Big Data Challenges with Linked Data

The Challenges

Big Data is a deceptively simple term. It immediately implies “large amounts of data”, which is true but not the whole picture. Categorising data as “big data” by size alone is difficult, because what counts as a large amount of data depends on whether you are speaking to a company like Amazon or to a small business owner. It is easier to think of Big Data as data that traditional storage and querying methods cannot handle. This includes data which changes rapidly, varies wildly, and may be inconsistent. Any combination of these aspects is enough to qualify a dataset as Big Data.

More and more data is becoming available which companies can use to optimise business processes, predict trends, or aggregate to better understand their customers. Unfortunately, this data is often stored in isolated data silos due to the vast number of data sources and convoluted data governance. Different departments consequently arrive at vastly different conclusions, as each can only see a subset of the data without any additional context. Data integration is therefore a critical challenge for companies that wish to make well-informed business decisions.

The Semantic Web

The semantic web makes it easy to find, share, reuse and, importantly, combine information. The relationships between data and real objects are recorded in a common language to create a mesh of information that can be processed by machines on a global scale. Tim Berners-Lee described the architecture of the semantic web as seven layers, as follows:

T. Berners-Lee, J. Hendler, O. Lassila, et al. The Semantic Web. Scientific American, 284(5): 28–37, 2001.

The semantic web uses Unicode and URIs as its foundation. Unicode is responsible for handling the character encoding of a resource, while the URI is responsible for identifying it. XML is a streamlined markup language that combines the richness of the Standard Generalized Markup Language (SGML) with the usability of HTML. XML is machine readable and simplifies data sharing, transport, availability and platform changes.

RDF is the data modelling language in which all semantic web information is stored and represented. It connects information using RDF triples to form a directed, labelled graph such as the one shown below:

In this graph we have a collection of many RDF triples, one of which is (http://...isbn/000651409X, a:title, The Glass Palace). Without RDF, datasets are difficult to combine because the data is not represented in a common language. With this common language, integration becomes easy: further datasets can be attached from any number of sources to form an information web as shown below:
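As a rough illustration of why a shared triple format makes integration easy, the sketch below models each triple as a plain Python tuple and merges two datasets with a set union. It does not use a real RDF library, and the URI, the `b:` predicates and all values other than the title are hypothetical placeholders.

```python
# Each RDF triple is modelled as a (subject, predicate, object) tuple.
# BOOK is a hypothetical URI; it is not the real identifier from the figure.
BOOK = "http://example.org/book/1"

# Dataset A: a hypothetical bookshop catalogue.
catalogue = {
    (BOOK, "a:title", "The Glass Palace"),
    (BOOK, "a:format", "Paperback"),
}

# Dataset B: a hypothetical library system describing the same resource.
holdings = {
    (BOOK, "b:shelf", "Fiction"),
    (BOOK, "b:copies", "3"),
}


def merge(*graphs):
    """Combine any number of triple sets into one graph (a set union)."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


web = merge(catalogue, holdings)

# Because both datasets use the same URI for the book, every fact about it
# is now reachable from a single subject.
about_book = {triple for triple in web if triple[0] == BOOK}
for subject, predicate, obj in sorted(about_book):
    print(predicate, "->", obj)
```

The merge needs no schema mapping because both sources already agree on the triple shape and on the identifier for the book. That agreement is the part that RDF and URIs provide.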

The full context of data can be found by traversing the web of linked data, ensuring business decisions are based on all of the available data.
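Traversal of this kind can be sketched as a breadth-first walk over the triples. The function name `context_of`, the `ex:` identifiers and the sample data are all hypothetical, and the walk only follows links in the direction of the edges.

```python
from collections import deque


def context_of(graph, start):
    """Return every triple reachable from `start` by following outgoing links."""
    subjects = {s for s, _, _ in graph}
    seen = {start}
    found = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for s, p, o in graph:
            if s != node:
                continue
            found.add((s, p, o))
            # Follow the link only if the object is itself described in the graph.
            if o in subjects and o not in seen:
                seen.add(o)
                queue.append(o)
    return found


# Hypothetical linked data: a book, its publisher, and the publisher's city.
graph = {
    ("ex:book1", "a:title", "The Glass Palace"),
    ("ex:book1", "a:publisher", "ex:publisher1"),
    ("ex:publisher1", "a:city", "ex:city1"),
    ("ex:city1", "a:name", "Example City"),
    ("ex:book2", "a:title", "Unrelated Book"),
}

context = context_of(graph, "ex:book1")
print(len(context), "triples describe the context of ex:book1")
```

Starting from one resource, the walk collects facts that originally lived in separate datasets (the publisher and the city) while leaving out the unrelated book.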

Current Uses

The semantic web is an important characteristic of what is being called “Web 3.0”. While it was not created for big data, semantic web technologies have been shown to address many of the limitations that arise when using traditional methods to tackle Big Data challenges.

Ontologies have been created to standardise the representation of medical data. Alzheimer’s disease, for example, requires information from many disciplines, including neurology, neuronal physiology, psychiatry, microscopic anatomy, genetics, bioinformatics, biochemistry and molecular biology, so it is vital that researchers can easily analyse data from multiple sources. The semantic web technologies RDF, SPARQL and OWL are being used in pursuit of this goal. Aeronautical researchers are similarly using these technologies to combine their data.

Commercial uses are also prevalent. The BBC was one of the first organisations to utilise linked data, as they could not otherwise maintain all of their site pages. In 2009 they had almost 8 million track pages that had to be kept up to date, which could not have been handled without linked data. Now, almost a decade later, the BBC still uses the same technologies to manage their ever-growing expanse of information.

The semantic web has also been used to combine electrical power data, which was used to create the smart grid ontology. This is used not only to analyse and resolve power outages, but also to enhance electrical power security, as anomalies can be identified more accurately.

Conclusion

While Big Data means that more data than ever is being collected and analysed, the unstructured and unrelated nature of these datasets makes cross-referencing them near impossible, which limits their potential. The relationship between Big Data and the semantic web is best exploited when the data is stored as RDF, which helps ensure that the exponential growth of global data is an asset rather than a burden.

In future posts I plan to dive into the RDF query language called SPARQL and how Wallscope has used these technologies to help businesses find, share, reuse and combine their data.