Workshop report: Building Linked Data heatmaps with Clojurescript & thi.ng

thi.ng
Nov 16, 2015 · 13 min read

Last week, during the second 3-day data visualization workshop with Clojure, Clojurescript and the thi.ng libraries, we built a small web application using data from the UK’s Office for National Statistics (ONS) and the London Data Store.

Still from a previous London 3D data viz project done with & for The Open Data Institute in 2013, here showing knife crime incidents by borough.

The workshop was intended to introduce participants to important Clojure/Clojurescript concepts (e.g. laziness, reducers, transducers, concurrency primitives), data structures (sets, maps, protocols, custom types), libraries (core.async, reagent, thi.ng) and workflows (figwheel), as well as to teach Linked Data & Semantic Web concepts, technologies and standards, and show how these can be used to combine data from disparate sources in order to simplify their analysis and later visualization. After much theory, sketching and many small, isolated exercises, we finally tied everything together by building an interactive heatmap and price charts of London property sales, prices & trends (data from 2013/14) in both Clojure and Clojurescript. The web UI also includes a query editor to experiment with new queries (and visualize their structure) directly in the browser…

Heatmap of average London property price per borough (2013/14), showing a clear bias towards high-price areas. Important note: we only used a sample of ~23k transactions to save time during the workshop. The full dataset published at data.gov.uk contains >200k transactions.
Alternative heatmap based on number of sales per borough in 2013/14. There were ~2x as many sales in the south-east (orange) as in other areas (dark blue).
Charts of individual property sales per borough in 2013/14, sorted by date. Note the clearly visible upward trend for most boroughs. Charts are generated for all 33 boroughs, with the remainder omitted here for space reasons.
Screenshot of the query editor and auto-generated visualization of the shown query’s structure (courtesy thi.ng/fabric and Graphviz on the server, editor uses CodeMirror)

Clone the project from GitHub and run it locally (instructions further below).

The remainder of this article sheds some more light on implementation details and the role of the thi.ng libraries in realising this project, and provides lots of links for further reading…

Marcin Ignac (one of the participants and a fellow computational designer) also just shared some of his own workshop experiences (and some great “homework” examples) over on his blog.

Graphs, RDF, SPARQL and the Linked Data model

With all the current excitement around these two platforms (for example), I’m always somewhat taken aback by the fact that so many developers have either never heard of RDF (the W3C’s Resource Description Framework) or (worse) never want to work with it again. I can somewhat share the top-level thinking behind the latter sentiment, since historically (RDF has existed since 1999) the format had been closely associated with XML (originally the de-facto encoding) and its large/verbose Java tooling. However, RDF is an abstract data model encoding knowledge statements with clearly defined semantics. It’s not tied to XML or any other particular representation. All that really matters is a standard way to encode data/knowledge as triples of subject, predicate, object, and to use, where possible, URIs to provide uniqueness and dereferencing capabilities to look up unknown terms. The result of this extremely simple setup is that knowledge can be encoded and stored in a completely distributed fashion, largely becomes self-describing and, equally important, becomes open for aggregation, regardless of where a piece of data has been retrieved from. This is what the term Linked Data (LD) stands for.
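
Since everything reduces to subject/predicate/object statements, triples map naturally onto plain Clojure data. A purely illustrative sketch (prefixed names stand in for full, dereferenceable URIs):

;; three triples describing a single property sale, as plain data
[["sale:1234" "rdf:type"        "schema:SellAction"] ;; what it is
 ["sale:1234" "schema:price"    485000]              ;; literal object
 ["sale:1234" "schema:location" "borough:camden"]]   ;; link to another resource

Because the object of one triple can be the subject of another, statements from different sources sharing URIs automatically join up into one graph.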

Placing our data in a graph system that doesn’t support these principles will still give us potentially more flexible query capabilities locally, but will not automatically solve the old questions of how to easily combine knowledge from multiple sources, or how to provide our own data to others in a semantically interoperable format.

The Freebase Parallax UI, a research project by David Huynh from 2008, is still one of my favourite examples, nicely showing this current limitation and contrasting it with the potential a Linked Data approach can offer (even though he only uses one large dataset [Freebase] in the example). Now, 7 years later, we still can’t get answers like this from the market leaders in search, and this lack of interoperability in many (most?) public datasets is also holding back entire disciplines (UX, UI, data vis) at large.

Over the past 10 years the LD community has adopted a number of alternative, more lightweight formats, standardized the embedding of metadata in HTML (RDFa), defined licensing options etc. RDF is also increasingly used and embraced by the biggest commercial players and governments worldwide. Institutions like the Open Knowledge Foundation and The ODI are actively furthering this cause by tirelessly working with holders of datasets large & small.

The Linked Open Data cloud as of 2014: an overview of interlinked open datasets describing over 8 billion resources

In addition to the sheer amount of Linked Data available, there are also hundreds of well-defined, freely available data vocabularies (ontologies) to define terms and express semantics of all complexities in a standard, interoperable and machine-readable way; something anyone seriously interested in data analysis/visualization should be embracing, or at least welcoming… For many use cases, a handful of core vocabularies is sufficient to express at least the most common relationships in an interoperable manner. Data integration is almost always a continuous effort, but small, incremental changes can go a long way. It’s also important to recognize that these vocabularies are themselves expressed in RDF, so there’s no distinction between data and language. This should feel familiar to any Lisper/Clojurian ;)
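
To make that last point concrete, here (in the same illustrative triple notation as above) is how a vocabulary term is itself defined with more triples; the label/comment values are illustrative:

;; a vocabulary term is just more data
[["schema:price" "rdf:type"     "rdf:Property"]
 ["schema:price" "rdfs:label"   "price"]
 ["schema:price" "rdfs:comment" "The price of an offer"]]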

Back to the workshop exercise…

Dataset 1: London geodata

Dataset 2: London property sales

I prepared a little Clojure utility namespace and demonstrated how we can convert the CSV to an RDF graph model using terms from the general-purpose schema.org vocabulary (supplemented with a few ad-hoc ones of our own). From a user perspective, this code boils down to just this:
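
(The original code embed is missing from this export; below is a hedged reconstruction of the idea. Namespace and helper names are hypothetical; only the use of schema.org terms follows the workshop approach.)

(ns workshop.sales-import
  (:require [clojure.data.csv :as csv]
            [clojure.java.io :as io]))

(defn row->triples
  "Converts one CSV row [id price date postcode] into
  subject/predicate/object triples using schema.org terms."
  [[id price date postcode]]
  (let [sale (str "sale:" id)]
    [[sale "rdf:type"            "schema:SellAction"]
     [sale "schema:price"        (Double/parseDouble price)]
     [sale "schema:purchaseDate" date]
     [sale "schema:postalCode"   postcode]]))

(defn csv->triples
  [path]
  (with-open [r (io/reader path)]
    (->> (csv/read-csv r)
         rest                     ;; skip header row
         (mapcat row->triples)
         doall)))                 ;; realize before the reader closes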

thi.ng/fabric


thi.ng/fabric is a still young, modular framework for Clojure/Clojurescript, providing a general purpose compute graph model as the foundation to build more context-specific applications on (from spreadsheets, navigation and inferencing to knowledge graphs). In the compute graph, nodes store values and can send & receive signals to/from any connected neighbor. At first this sounds similar to the well-known Actor model; however, in the Signal/Collect approach this library is loosely based on, each graph edge is represented as a function, which can transform or inhibit outgoing signals and hence perform computation on the original node values (similar to the mapping phase of Map-Reduce). Furthermore, the collection of received signals also happens via user-defined functions, enabling further transformations only possible when combining multiple signals (reduction). A choice of customizable schedulers (incl. parallel & async options) allows for different approaches to controlling the overall computation. Many graph algorithms can be expressed (more) succinctly using this setup, but for the workshop we focused on the two library modules allowing this architecture to be used as a fairly well-featured and ready-to-go Linked Data development server. To the best of my knowledge, it’s also the only pure Clojure solution thus far.
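
To illustrate the Signal/Collect idea itself, here’s a toy sketch in plain Clojure (not fabric’s actual API): nodes hold values, each edge carries a signal function, and every target node collects incoming signals with a reduction.

(def graph
  {:nodes {:a 1 :b 2 :c 0}
   ;; edges as [from to signal-fn]: the fn transforms the outgoing signal
   :edges [[:a :c inc]
           [:b :c #(* 10 %)]]})

(defn step
  "One synchronous signal/collect sweep: every edge signals the
  transformed source value, then each target node collects (here:
  sums) its incoming signals onto its current value."
  [{:keys [nodes edges] :as g}]
  (let [signals (for [[from to f] edges] [to (f (nodes from))])]
    (assoc g :nodes
           (reduce (fn [acc [to v]] (update acc to + v))
                   nodes signals))))

;; (:nodes (step graph)) ;; => {:a 1, :b 2, :c 22}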

thi.ng.fabric.facts & thi.ng.fabric.ld

What do I mean by “SPARQL-like” query engine? Well, since I’ve been using Clojure for most of my data-centric work in recent years, I wanted the ability to apply these kinds of queries to any data without having to restrict myself to the requirements of pure-RDF tools. Furthermore, since the boundary between code and data can easily be blurred in Clojure, it made sense to define everything in as Clojuresque a way as possible, e.g. by using Clojure data structures (maps, symbols, s-expressions) to define the queries and gain programmatic manipulation/construction as a result. In some ways this is similar to Allegrograph’s SPARQL S-expressions, though these were not a motive; it’s quite natural to do in any Lisp. The other reason for “SPARQL-like” is that, fabric still being a young project, not all aspects (e.g. federated queries, construct queries, named graphs) are implemented yet, but it’s a work-in-progress.

As an example, the query to compute the complete aggregate heatmap data and lat/lon polygons for all boroughs can be expressed as a Clojure EDN map. Note: the :aggregate expressions are not function calls, but will be compiled into functions during query execution:
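
(The gist embed with the actual query is missing from this export; the map below is a hypothetical reconstruction to show the flavour. Key names and aggregate fns are illustrative assumptions, not fabric’s verbatim query syntax.)

;; ?-prefixed symbols are query variables
'{:q         [{:where [[?sale "schema:postalCode" ?zone]
                       [?sale "schema:price"      ?price]
                       [?zone "schema:polygon"    ?poly]]}]
  :group-by  ?zone
  :aggregate {?avg   (agg-avg ?price)       ;; compiled into fns
              ?num   (agg-count ?sale)
              ?polys (agg-collect ?poly)}
  :select    [?zone ?avg ?num ?polys]}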

Ready, set, go…

The thi.ng/fabric readme for the LD module contains several examples of how to interact with the server via HTTP.

Important: Do not view the web app via localhost:3449 (figwheel’s port), since none of the queries will work (originally due to CORS, though I also switched to relative query paths). Use http://localhost:8000/ only; figwheel will still apply any code changes via its WebSocket connection.

The code in the repo is fully commented and also acts as an illustrated use case and combined example of various libraries in the thi.ng collection (among others). A quick breakdown of the various parts follows:

Clojure server

The server shells out to Graphviz to render the query structure visualizations shown in the editor (see screenshot above), so make sure it’s installed, e.g. on macOS:

brew install graphviz

Clojurescript frontend

Integrating 3rd party JS libraries into CLJS projects used to be somewhat painful until not so long ago. However, since the advent of the cljsjs project, which re-packages JS libs for CLJS, this is thankfully a thing of the past and in some ways almost easier to handle than with npm. As an integration example, we imported CodeMirror and used Reagent’s create-class mechanism to build a “reactified” editor instance with Clojure syntax highlighting, learned about the reaction macro mechanism to minimize component render updates, and experimented with submitting queries to the server and visualizing their structure.
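
A condensed sketch of that pattern (simplified from the workshop repo; the exact editor options here are assumptions):

(ns workshop.editor
  (:require [cljsjs.codemirror]
            [cljsjs.codemirror.mode.clojure]
            [reagent.core :as r]))

(defn editor
  "Reactified CodeMirror: the real DOM node is only created once the
  component mounts; edits are mirrored into the given ratom."
  [query]
  (r/create-class
   {:component-did-mount
    (fn [this]
      (let [cm (js/CodeMirror. (r/dom-node this)
                               #js {:mode        "clojure"
                                    :lineNumbers true
                                    :value       @query})]
        ;; keep app state in sync with the editor contents
        (.on cm "change" #(reset! query (.getValue %)))))
    :reagent-render
    (fn [_] [:div.editor])}))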

Always be visualizing…

Dynamically generated 3D meshes rendered in SVG with different (composable) software shaders
Blender’s Suzanne imported as STL and rendered in SVG with a Phong software shader

In order to translate a value range to colors, some form of gradient lookup table (or function, or both) is required. The thi.ng/color library provides a namespace to define complex color gradients using just 12 numbers (4x RGB cosine wave params). The original idea for these gradients comes from the master, IQ himself, and the library provides a few useful presets for our purposes. (The library also provides some of Cynthia Brewer’s categorical palettes, which are often better suited for visualization purposes.)
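
The underlying formula is tiny: per RGB channel, color(t) = a + b * cos(2π(c*t + d)), i.e. 4 coefficients x 3 channels = 12 numbers. A plain-Clojure sketch of the idea (the library’s own preset names/API are not reproduced here):

(defn cosine-color
  "t in [0..1]; a b c d are [r g b] coefficient vectors (12 numbers)."
  [a b c d t]
  (mapv (fn [a b c d]
          (-> (+ a (* b (Math/cos (* 2.0 Math/PI (+ (* c t) d)))))
              (max 0.0)
              (min 1.0)))
        a b c d))

;; sampling 100 colors of IQ's classic rainbow preset:
;; (map #(cosine-color [0.5 0.5 0.5] [0.5 0.5 0.5] [1 1 1] [0 0.33 0.67]
;;                     (/ % 99.0))
;;      (range 100))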

Example gradient presets

Adding some of these presets to a dropdown menu then allows the user to see London in “different colors” (not all of them good or useful):

Heatmap based on average sale price per borough
Heatmap based on number of sales per borough, same color preset as above. Dark green = lowest, cyan = highest numbers.

Using core.async & SVG chart pre-rendering

At application start we execute two queries: 1) retrieve the set of polygons and aggregate values for each borough, 2) obtain individual property sale transaction details, about 23,000… Both of these queries utilize fabric’s registered query feature, which means these queries are stored as part of the compute graph, their results are always immediately available (without incurring new work for each connected client) and will update automatically (on the server side) should the underlying set of facts change. Since the second query returns approx. 1.4MB of EDN data which needs to be parsed, processed and transformed into SVG charts, the entire application startup process is handled asynchronously using the fabulous clojure.core.async library. Replacing the original callbacks with async channel operations to coordinate the different processing steps allowed us to keep a linear structure in our functions and avoid blocking the DOM during pre-processing of the charts.
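
A sketch of that startup flow (fetch-query! and pre-render-charts are hypothetical helpers standing in for the repo’s actual functions):

(ns workshop.app
  (:require [cljs.core.async :refer [<! chan put!]])
  (:require-macros [cljs.core.async.macros :refer [go]]))

(defn fetch-query!
  "Hypothetical: sends a registered query id to the server and
  delivers the parsed EDN result on the returned channel."
  [id]
  (let [ch (chan)]
    ;; XHR elided; the response handler calls (put! ch result)
    ch))

(defn pre-render-charts
  "Hypothetical: transforms sale transactions into SVG chart data."
  [sales]
  sales)

(defn start-app!
  [state]
  (go
    ;; <! parks (doesn't block), so the UI stays responsive,
    ;; yet the code reads as a linear sequence of steps
    (let [boroughs (<! (fetch-query! :borough-aggregates))
          sales    (<! (fetch-query! :sale-transactions))]
      (swap! state assoc
             :boroughs boroughs
             :charts   (pre-render-charts sales)))))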

thi.ng/geom’s visualization engine is completely declarative and essentially transforms a specification map defining all axes, datasets, layout methods and styling configs into a nested DOM-like data structure. Because both the initial visualization spec and the result are pure data, either can easily be defined or manipulated programmatically/interactively. E.g. changing axis behavior or layout method is very easy: just update a key in the input spec map, or add an event listener in the result tree post-generation. Together with the other CLJS workflow ingredients (e.g. figwheel live code updates), this allows for quick iterative design exploration…
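
For flavour, a hedged sketch of such a spec (axis parameters abbreviated and sale-points assumed to be a seq of [day price] pairs; consult the thi.ng/geom viz examples for the full option set):

(require '[thi.ng.geom.viz.core :as viz]
         '[thi.ng.geom.svg.core :as svg])

(defn price-chart
  [sale-points]
  (->> {:x-axis (viz/linear-axis {:domain [0 365]   :range [50 580] :pos 280})
        :y-axis (viz/log-axis    {:domain [1e5 3e7] :range [280 20] :pos 50})
        :grid   {:minor-y true}
        :data   [{:values  sale-points
                  :attribs {:stroke "#0af"}
                  :layout  viz/svg-line-plot}]}
       (viz/svg-plot2d-cartesian)      ;; spec -> SVG as nested data
       (svg/svg {:width 600 :height 320})
       (svg/serialize)))

Switching between the linear and logarithmic charts shown below amounts to swapping the :y-axis value in the spec map.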

Here we explore the impact of different axis scales and rendering methods:

Using a linear-scale y-axis is a bad choice for this data, due to extreme price fluctuations in some boroughs (e.g. outliers like Kensington’s 27.9 million or Lambeth’s 7 million property sales cause havoc)
The same data for the same boroughs mapped using a logarithmic scale
And once more, using a line chart with gradient

Previous workshop

47k airports (magenta = IATA, cyan = non-IATA)

The GitHub repository for that workshop has more information.

Future workshops

The next workshop will be about one of my other passions:

Embedded devices, ARM C programming and DIY polyphonic synthesizer

London, 5–6 December 2015

The ARM Cortex-M processor family is used in many embedded devices, from IoT and wearables to phones and more demanding use cases, and is rapidly gaining traction. This workshop will give you an overview of and hands-on experience with programming the STM32F4 (a Cortex-M4 CPU) and working with various peripherals (GPIO, gyro, USB I/O, audio). Over the 2 days we will build a fully customizable, CD-quality, polyphonic MIDI synth and cover some generative music approaches to round things off.


Finally… Clojure community FTW!
