Building Your Data Network

Tony Seale
10 min read · Feb 16, 2023


… how to give each data item a Data URL


Making Data Searchable

There’s a growing understanding that the key to a successful modern business lies in the quality and depth of the data it holds. Established companies have access to vast amounts of valuable data, but it’s scattered and fragmented across various systems, making it difficult to access and use. This results in inconsistent, duplicated, and often incomplete information that can’t be easily measured for accuracy, coherence, or origin.

Raw data is limited in its value. True value derives from the information and knowledge bound within the relationships between different datasets. When an organisation’s data is connected into a unified whole, its value and richness multiply, because the scope, complexity and magnitude of the questions that can be asked of it are transformed. Suddenly, answers to questions such as “what would be the consequences of leaving market X?”, “how many applications use the data from this particular system?” or “what are all our interactions with this or that customer?” are available at the organisation’s fingertips.

The solution lies in a ubiquitous aspect of modern life that we now often take for granted: the URL. The URL makes documents on the web connected and searchable and it can do the same thing for your data.

Digital Doorways

The URL may seem like a simple thing, but it has played a critical role in shaping the digital landscape we know today. It serves as a cornerstone for the World Wide Web and it has also sparked a revolution in application integration through the use of something called RESTful Microservices.

The power of URLs lies in the following properties:

Universal Identifiability: URLs provide a globally unique identifier for a resource on the web, allowing for easy referencing and retrieval of the resource.

Standardisation: URLs are standardised, allowing for consistent and interoperable access to resources via the open HTTP and HTTPS protocols.

Human-readable format: URLs are written in a clear and concise format that is easily readable by both humans and computers. This subtle but pivotal feature creates an identifier that means something to both humans and machines.

Searchability: URLs are used by search engines to crawl and index, making it possible to discover and access resources through search. This has made it easier for users to find the information they need and has helped to fuel the growth of the Internet.

These properties have allowed URLs to play a critical role in integrating the digital world and they are the reason why the humble URL turns out to be the key to connecting our organisations’ data.

By providing each piece of data with a URL, we establish a data connectivity mechanism that is straightforward, universal, scalable, adaptable, and searchable. Data URLs are platform-agnostic and can be used with a wide range of tools and technologies. This means you can achieve seamless data integration without having to invest in expensive software or specialised expertise.

Data URLs can serve as the foundation of a cutting-edge data integration strategy, connecting text, data, and applications in one machine-readable and AI-ready network. With this approach, we can simplify our data architecture, improve data quality, and enable powerful analytics.

How to give your data items a URL

So we need a simple way to give our data items a URL, and also a way to link to related data items that have been given URLs by other teams. JSON-LD (JavaScript Object Notation for Linked Data) is a lightweight and flexible format for expressing structured data.

JSON-LD uses URLs to both identify and dereference resources, and it does so in three ways:

1) Through the use of a special @id property. This property can be used to assign a unique URL to each resource described in the JSON-LD document, and this is how a data item is given a URL.

2) In the values of properties. For example, when describing a person, the property “colleague” can have another person’s URL as its value, and this is how data items can be linked together.

3) Through the use of the @context property. This property is used to map the terms used in the JSON-LD document to URLs, which can then dereference shared vocabularies, and this is how data items can be conformed to common organisational and departmental models. See this related article for more details.

JSON-LD is simple, making it easy for developers to work with. It uses familiar JSON syntax, which is widely used and well-understood by developers.
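The three uses of URLs described above can be sketched in a few lines of Python. This is only an illustration: the “name” and “colleague” terms and their vocabulary URLs under “https://schema.your.net/” are assumptions made for the example, not a published schema.

```python
import json

# Hypothetical JSON-LD description of one person.
person = {
    # 3) @context maps the short terms used below to vocabulary URLs.
    #    "@type": "@id" says the value of "colleague" is itself a URL.
    "@context": {
        "name": "https://schema.your.net/name",
        "colleague": {"@id": "https://schema.your.net/colleague", "@type": "@id"},
    },
    # 1) @id gives this data item its own URL.
    "@id": "https://data.your.net/person/tony-seale",
    "@type": "https://schema.your.net/Person",
    "name": "Tony Seale",
    # 2) A property value that is another data item's URL links the two items.
    "colleague": "https://data.your.net/person/chuck-norris",
}

document = json.dumps(person, indent=2)
print(document)
```

Because the result is ordinary JSON, any application that can already produce JSON can publish it without new tooling.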

URL Naming Conventions

“What’s in a name? That which we call a rose, by any other word would smell as sweet.” — Romeo and Juliet.

When designing the URL naming conventions for your data, it’s important to take into account the same considerations as when creating a user-friendly website or RESTful API. These include factors such as readability, usability, consistency, persistence, and the security of the URLs for your data items.

A good way to organise your URLs is to have two main categories:

1. A “conformed” or upper namespace that is based on the concepts in your.schema.org model. Data in this namespace must come from an authoritative source and conform to the relevant type in your.schema.org. URLs in this namespace would look something like this: “https://data.your.net/person/tony-seale” and each data item would conform to the schema specified at “https://schema.your.net/Person”.

2. An “open” or lower set of namespaces where each application has its own subdomain and anyone can publish anything. URLs in this namespace would look like this: “https://myapp.your.net/widget/sub/123”.

Having both a structured upper namespace and a flexible lower namespace creates a healthy balance between centralisation and decentralisation within the organisation. The structured namespace makes data easily accessible and linkable, while the lower namespace allows for innovation without requiring constant approval from a central team.

Why is it important to have the upper namespace? The reason is that if “https://data.your.net/person/chuck-norris” is present in your HR system, you want the sales dataset to use the same URL in the case of any sales made by Chuck. That is how you invert the cost of linking your data.

Why not simply have the upper namespace? Because these are URLs, not just URIs, and if you do not control the “data.your.net” domain then you cannot make a URL under “data.your.net/person” resolve over HTTP. You can still publish your own view of the data under your application’s subdomain, for example “https://myapp.your.net/data/223”, and include a link back to the main record indicating that this Chuck Norris is the same as “https://data.your.net/person/chuck-norris”.
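As a rough sketch, a record in the lower namespace that links back to the main record might look like the following. The article does not name a property for the “same as” link, so mapping a “sameAs” term to owl:sameAs is an assumption here, as is the “nickname” field and its vocabulary URL.

```python
import json

# Hypothetical record published by an application in its own subdomain.
local_record = {
    "@context": {
        # Assumption: owl:sameAs expresses "this is the same thing as".
        "sameAs": {"@id": "http://www.w3.org/2002/07/owl#sameAs", "@type": "@id"},
        # Illustrative application-specific term.
        "nickname": "https://myapp.your.net/vocab/nickname",
    },
    # The application's own URL for its view of the person.
    "@id": "https://myapp.your.net/data/223",
    "nickname": "Chuck",
    # Link back to the authoritative record in the upper namespace.
    "sameAs": "https://data.your.net/person/chuck-norris",
}

print(json.dumps(local_record, indent=2))
```

The application keeps full control of what it says under its own URL, while the single link back lets anyone join its data to the authoritative record.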

Searching for Hyperlinks

The applications that need to publish data are typically aware of the other applications and datasets they need to connect to. The solution to reducing the cost of integrating an organisation’s data lies in making it easy for these applications to discover the URLs they require. If I want to publish trade data and I have the name of the trader then I need a simple way to get the unique URL for that person.

Linking Method One: Add the URL naming convention to your.schema.org

Making connections to data items in the upper namespace is straightforward as it is based upon strong conventions. To make these conventions well known you simply include a section in your.schema.org so developers can look up the type that they wish to connect to and then find the URL naming convention for that type.

For example, if you know the URL for a person is structured like this: https://data.your.net/person/{first name}-{last name}, and you have a person’s first and last name, then you can create their unique URL by joining the text together. So Chuck Norris would become https://data.your.net/person/chuck-norris.

It is surprising how far you can get with this very simple, fast and efficient, decentralised, linking strategy.
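The concatenation can be sketched as a small helper. The exact slug rules (lowercasing, stripping accents, replacing other characters with hyphens) are assumptions for illustration; the real rules would be whatever the published convention specifies.

```python
import re
import unicodedata

# Base of the person convention: https://data.your.net/person/{first name}-{last name}
PERSON_BASE = "https://data.your.net/person/"


def slug(text: str) -> str:
    """Lowercase, strip accents, and turn runs of other characters into single hyphens."""
    ascii_text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^a-z0-9]+", "-", ascii_text.lower()).strip("-")


def person_url(first_name: str, last_name: str) -> str:
    """Build a person's URL purely from the naming convention, with no lookup."""
    return f"{PERSON_BASE}{slug(first_name)}-{slug(last_name)}"


print(person_url("Chuck", "Norris"))  # https://data.your.net/person/chuck-norris
```

A helper like this needs no network call, which is what makes the strategy fast and decentralised; it does assume that the convention yields a unique URL for each person.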

Linking Method Two: Building a Lookup Service to index your data

Sometimes, simple text concatenation will not suffice. For instance, if you only have a person’s email address but need their first and last names to build a URL, a lookup service can assist. In time, this service can evolve into a sophisticated decentralised entity resolution mechanism, but for this article, we will keep things simple by using Elasticsearch.

Elasticsearch is a distributed, open-source search engine that is capable of handling large amounts of data. You can load all the URLs from your datasets into Elasticsearch and index all the data URLs in your organisation, similar to how Google indexes all webpages on the internet.

You can also add labels and descriptions, and even index additional data attributes for specific types. For example, for people, you could index by the email field, so that a person’s URL could be located by their email address.

Elasticsearch also offers a multi-search API (_msearch), so many URLs can be resolved in a single request, which speeds up the linking process for large datasets.
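A minimal sketch of resolving many email addresses in one request is shown below. It only builds the newline-delimited body that Elasticsearch's multi-search (_msearch) endpoint expects; sending it (as “application/x-ndjson”) is left out. The index name “data-urls” and the keyword field “email” are hypothetical and would depend on how your index is set up.

```python
import json

INDEX = "data-urls"  # hypothetical index holding one document per data URL


def build_msearch_body(emails):
    """Build an _msearch body: one header line and one query line per email."""
    lines = []
    for email in emails:
        # Header line: which index this search runs against.
        lines.append(json.dumps({"index": INDEX}))
        # Body line: exact match on the (assumed) keyword field "email".
        lines.append(json.dumps({"query": {"term": {"email": email}}, "size": 1}))
    # The multi-search format requires the body to end with a newline.
    return "\n".join(lines) + "\n"


body = build_msearch_body(["chuck.norris@your.net", "tony.seale@your.net"])
print(body)
```

The responses come back in the same order as the queries, so each hit can be paired with the email address that produced it.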

By using these two linking mechanisms, linking our data items can be as effortless as creating a hyperlink to connect this article with another webpage.

URLs Should be Clickable

The URLs you use for your data items must be accessible via HTTPS. As long as you have access rights, any data item should be easily retrieved by simply inputting its address into a web browser.

Clickability guarantees that all identifiers are globally unique, as each network address can only lead to one place. It also ensures that all data across the organisation is universally accessible by one simple mechanism.

For data items in the lower namespace, where the application has published its data in its subdomain, the company proxy server will handle finding the correct server. For example, the proxy server will see the URL “https://myapp.your.net/widget/sub/123” and will resolve the application's RESTful API at “https://myapp.your.net”. It will then hand over to that application to find the specific data item “widget/sub/123” and return the JSON-LD for that resource.

For data items in the upper namespace, a redirection system is required to route requests to the correct server. NGINX, an open-source HTTP and reverse proxy server, can serve as a high-performance solution to route requests to the canonical data service that the organisation has chosen to provide the data for a specific type defined in the your.schema.org model.

For instance, NGINX could have a rule that redirects requests for people to the human resources application. Calls to “https://data.your.net/person/” would redirect to “https://hr.your.net/person/”, so a call to “https://data.your.net/person/chuck-norris” would redirect to “https://hr.your.net/person/chuck-norris”, and the HR application would then return the JSON-LD for “person/chuck-norris”.
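The logic of such a rule can be sketched in Python. This is not NGINX configuration, only an illustration of the mapping it would implement; the routing table and the choice to keep the path unchanged on redirect are assumptions.

```python
from urllib.parse import urlsplit

# Hypothetical routing table: type in the upper namespace -> canonical service host.
CANONICAL_HOSTS = {"person": "hr.your.net"}


def redirect_target(url: str):
    """Return the canonical service URL for an upper-namespace URL, or None."""
    parts = urlsplit(url)
    if parts.netloc != "data.your.net":
        return None  # not in the upper namespace
    # The first path segment names the type, e.g. "person".
    type_segment = parts.path.lstrip("/").split("/", 1)[0]
    host = CANONICAL_HOSTS.get(type_segment)
    if host is None:
        return None  # no canonical service registered for this type
    # Keep the path so the target service can locate the specific item.
    return f"{parts.scheme}://{host}{parts.path}"


print(redirect_target("https://data.your.net/person/chuck-norris"))
# https://hr.your.net/person/chuck-norris
```

Keeping the mapping in one small table means the canonical service for a type can be changed later without changing any of the published URLs.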

For performance reasons, data items are also grouped into downloadable datasets, and for large or complex datasets it may not be feasible to make each data item individually resolvable. In such cases, a hash URL can serve as a persistent identifier for specific information within the larger resource. For instance, a hash URL of “http://example.com/books/123#chapter2” can be used to refer to a specific chapter in a book with the URL “http://example.com/books/123”. Hash URLs transfer the burden of finding the specific data item from the server to the client.
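The client-side work that a hash URL implies can be sketched as follows. The dataset layout (a JSON-LD “@graph” list) and the chapter titles are illustrative assumptions, and the `fetch` argument stands in for an HTTP download so that the example runs without a network.

```python
from urllib.parse import urldefrag

# Stand-in for the dataset a client would download from the book's URL.
dataset = {
    "@id": "http://example.com/books/123",
    "@graph": [
        {"@id": "http://example.com/books/123#chapter1", "title": "Chapter 1"},
        {"@id": "http://example.com/books/123#chapter2", "title": "Chapter 2"},
    ],
}


def find_item(hash_url, fetch):
    """Download the whole resource, then pick out the item named by the hash URL."""
    # The fragment is never sent to the server, so only the part before "#" is fetched.
    document_url, _fragment = urldefrag(hash_url)
    data = fetch(document_url)
    # The client, not the server, searches for the specific item.
    for item in data.get("@graph", []):
        if item.get("@id") == hash_url:
            return item
    return None


item = find_item("http://example.com/books/123#chapter2", lambda url: dataset)
print(item)
```

The server only ever sees a request for the whole book, which is why this approach suits datasets that are published as a single download.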

Clickable Knowledge

Leveraging the power of data URLs within an organisation can unlock untold potential and drive success in the information age. By creating a seamless web of interconnected data, you can tap into the collective knowledge of your organisation, streamlining processes and making information more easily accessible.

The process of linking one system to another is a valuable investment, allowing for efficient and effective data sharing throughout the organisation. It should be done once, done right and done at source.

Through the power of network effects, as described by Metcalfe’s law, the value of your data network grows roughly with the square of the number of connected items rather than linearly. The more data points that are linked, the more valuable the information becomes, providing a wealth of insights and knowledge at your fingertips. By carefully curating these connections, you can reap the rewards of increased knowledge and stay ahead in the increasingly competitive landscape of the information age.

We can rarely see what is right in front of our eyes, but in retrospect, the URL is a blindingly obvious solution.
