Making a Geohash of it

The technology behind BBC Terrific Scientific

What is BBC Terrific Scientific?

Despite intervention, STEM (Science, Technology, Engineering, Maths) subjects continue to be less popular amongst school children, with a low percentage of 10–14 year olds wanting to pursue a career in science. This could potentially leave us with a skills gap in various fields in the not too distant future.

Additionally, some teachers may find science a difficult subject to teach, and others may find it challenging to keep engagement and enthusiasm levels up.

Terrific Scientific is a campaign launched by the BBC in November 2016, aiming to inspire and encourage primary school pupils to get into science. The campaign involves a number of child-friendly experiments to complete in class, the results of which will be sent to a selection of universities across the UK to collate and analyse.

The project began with all primary schools in the UK being issued a pack containing the equipment and instructions needed to get going with the investigations. The investigations have been designed to be fun, hands-on experiments that (hopefully!) the children will enjoy taking part in. Schools are invited to sign up to the campaign, which enables them to upload their investigation results to the Terrific Scientific website. These results are then used to populate our Terrific Scientific map with data.

By using a map to host our school data, we aim to give a really clear illustration of how the results differ across the UK for each of the different experiments. Children and teachers can search for their individual schools on the map and view the results submitted by each class in that school.

How is it put together?

As the campaign ideas were formalised, we identified a number of technical challenges and questions to answer before we could launch the site.

Let’s have a look at a few of these questions:

Question 1: How can we build a map to display the results of multiple experiments, in an accessible and responsive way?

When we said we were going to be plotting school results on a map, our first thought was to use Google Maps. This seemed like a good solution as it is accessible and certainly responsive, but we also wanted to have a detailed level of control over how the map looked and how the data was represented.

The construction of the map would be a collaborative effort between the BBC and a third party company who would provide the initial build, so it was essential we went with something straightforward that we could all get to grips with.

Answer: OpenLayers

OpenLayers is an open source JavaScript library for displaying map data. The beauty of OpenLayers is its configurability and its responsiveness.

We provide OpenLayers with a base image for the map, then a series of GeoJSON files to provide the map with the school data, town labels and landmark images.

GeoJSON is a simple, lightweight format for encoding a variety of geographic data structures. Individual data points can be represented as a “Feature”, whilst a group of data points can be represented as a “FeatureCollection”. A single school may be represented using the following structure:

{
  "type": "Feature",
  "properties": {
    "schoolName": "Test School",
    "schoolPostCode": "M52 2BH"
  },
  "geometry": {
    "type": "Point",
    "coordinates": [-104.99404, 39.75621]
  }
}
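As a rough illustration of how several schools could be grouped into a “FeatureCollection”, here is a minimal Python sketch. The helper names and the sample schools are made up for this example and are not taken from the Terrific Scientific codebase. Note that GeoJSON coordinates are ordered longitude first, then latitude.

```python
import json


def school_feature(name, postcode, lon, lat):
    """Build a GeoJSON Feature for one school.

    GeoJSON positions are ordered [longitude, latitude].
    """
    return {
        "type": "Feature",
        "properties": {"schoolName": name, "schoolPostCode": postcode},
        "geometry": {"type": "Point", "coordinates": [lon, lat]},
    }


def feature_collection(features):
    """Wrap a sequence of Features in a GeoJSON FeatureCollection."""
    return {"type": "FeatureCollection", "features": list(features)}


# Hypothetical sample data: (name, postcode, longitude, latitude)
schools = [
    ("Test School", "M52 2BH", -2.29, 53.48),
    ("Another Test School", "M52 3XY", -2.27, 53.47),
]

collection = feature_collection(school_feature(*s) for s in schools)
geojson_text = json.dumps(collection, indent=2)
```

The resulting text could be saved as a file and handed to a mapping library as a single data source for one experiment.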

When we had a new experiment to render, we swapped out the base image and provided OpenLayers with a new GeoJSON file containing our data.

Here’s how our different experiment data could be represented:

Terrific Scientific Base Map
Water Investigation
Time Investigation
Trees Investigation

The next challenge was how to represent this data.

Question 2: How can we display lots of school experiment results on a map of the UK, whilst making it look terrific?

Fun fact: There are approximately 22,000 primary schools in the UK.

Say every single one of these 22,000 schools participated in this campaign and submitted their results. Could you imagine what our map would look like trying to display all that data? A severe case of chickenpox comes to mind. We needed a method of clustering our results across multiple zoom levels in a way that could easily be understood and navigated by 9–11 year olds.

There are numerous ways of clustering data on a map. One suggestion was clustering by postcode. Every school has one, and a postcode is a good way of representing a geographic location. The problem with postcodes is that the areas they represent can differ greatly in size. Some postcodes represent a small area, yet others can represent very large, oddly shaped areas.

Area covered by “M5” postcode

The above image highlights the area represented by the postcode “M5”. As you can see, it is very oddly shaped, not to mention the gap between the two blocks. If we used this method, we might end up with some very odd looking clusters indeed, with schools potentially ending up in clusters that we would not expect.

Answer: Geohash clustering

Given the challenges with using postcodes, we decided to utilise geohashes.

A geohash is a method of representing a location with a short code. The longer the code, the more specific the region it represents. Traditionally, we’d use latitude and longitude values but these can get rather long and of course, to represent a location you need two values. Geohashes simplify this a bit by hashing the lat/long values of a given location into a single string.
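To make this a little more concrete, here is a minimal sketch of the standard geohash encoding in Python. It is not the code we ran in production, and the function name is just for illustration. The idea is to repeatedly halve the longitude and latitude ranges, record one bit per halving (starting with longitude), and turn every five bits into one character of the geohash base-32 alphabet.

```python
# Geohash base-32 alphabet (digits plus lowercase letters, without a, i, l, o)
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=10):
    """Encode a latitude/longitude pair as a geohash of `precision` characters."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0        # bits collected for the current character
    bit_count = 0   # how many bits collected so far (0..5)
    use_lon = True  # bits alternate between longitude and latitude

    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        bits <<= 1
        if value >= mid:
            bits |= 1
            rng[0] = mid  # keep the upper half of the range
        else:
            rng[1] = mid  # keep the lower half of the range
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0

    return "".join(chars)


# A point in central Manchester (approximate coordinates)
print(geohash_encode(53.48, -2.24, precision=3))  # gcw
```

Because each extra character only refines the same halving process, a shorter geohash is always a prefix of the longer geohash for the same point. That prefix property is what makes geohashes convenient for grouping nearby locations.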

If we look at the following images of a UK map, we can see how the UK can be represented by four separate 2 character geohashes (gf, gc, gb, u1):

2 Character Geohash

If we zoom in a level, we start adding characters on to our geohash. The more characters in the hash, the more precise the location it represents:

3 Character Geohash
4 Character Geohash

Let’s look at how we could apply this to our school data:

So we have two example schools above, each one represented by a ten character geohash.

At zoom level zero (the starting overview), we use a three character geohash for our data clustering. We can see in this instance that both School A and School B have the same first three characters (gcw) so they will fall into the same data cluster.

At zoom level one, we use a four character geohash. We can see that once again, School A and School B still share the same geohash (gcw2) so they will remain in the same data cluster.

At zoom level two, we use a five character geohash. This time, School A’s geohash differs from that of School B, so each school will now fall into a separate data cluster.

Using this method, we had an effective way of representing our school data on the map across multiple zoom levels, as a geohash represents a regular grid cell rather than the irregular shapes that we’ve observed with postcodes.
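The zoom level logic described above boils down to grouping schools by a geohash prefix whose length grows with the zoom level. Here is a small Python sketch of that idea. The function name and the two ten character geohashes are hypothetical and chosen only so that they share the prefix “gcw2” and differ at the fifth character.

```python
from collections import defaultdict

# Zoom level 0 clusters on three characters, level 1 on four, and so on
BASE_PRECISION = 3


def cluster_by_geohash(schools, zoom):
    """Group school names by the geohash prefix used at the given zoom level."""
    prefix_length = BASE_PRECISION + zoom
    clusters = defaultdict(list)
    for name, geohash in schools.items():
        clusters[geohash[:prefix_length]].append(name)
    return dict(clusters)


# Hypothetical geohashes for two example schools
schools = {
    "School A": "gcw2hzs6c5",
    "School B": "gcw2m3k9qe",
}

for zoom in range(3):
    print(zoom, cluster_by_geohash(schools, zoom))
```

At zoom levels zero and one both schools land in a single cluster (“gcw” and “gcw2”), and at zoom level two they separate into “gcw2h” and “gcw2m”.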

So we had our school results and our map looking great with some nicely clustered data. The final piece of the puzzle was identifying the potential performance risks and how these could be dealt with.

Question 3: How can we ensure our map data renders seamlessly when moving about within the map?

Once again, say we had 22,000 schools’ worth of data submitted to the map. That is a lot of information needing to be clustered and rendered at once. If we zoom in, we’d want our data to be re-clustered and re-rendered. That is a lot of work for our API to be doing each time.

Our first attempt at improving this was a bounding box clustering approach, where only data within the current viewport would be retrieved, clustered and rendered. Basically, this means we’d get the coordinates of the box that is currently visible in the user’s browser and ask the API for all data that falls within this region. Each time, this would hit our Amazon Aurora database, pull back results, cluster them and return them to our map. If we zoomed in, zoomed out, or dragged the map up or down, we would be doing another database query and cluster. This resulted in a very interrupted user experience where data clusters or individual schools would move about or flicker when rendering.
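For illustration, the bounding box idea can be sketched in a few lines of Python. This is an in-memory stand-in rather than the database query we actually ran, and the class, function and sample schools are made up for the example.

```python
from typing import NamedTuple


class BoundingBox(NamedTuple):
    """The visible viewport, expressed as longitude/latitude limits."""
    min_lon: float
    min_lat: float
    max_lon: float
    max_lat: float

    def contains(self, lon, lat):
        # Simple containment check; ignores viewports that cross the antimeridian
        return (self.min_lon <= lon <= self.max_lon
                and self.min_lat <= lat <= self.max_lat)


def schools_in_view(schools, box):
    """Return the names of schools whose coordinates fall inside the viewport."""
    return [name for name, lon, lat in schools if box.contains(lon, lat)]


# Hypothetical schools: (name, longitude, latitude)
schools = [
    ("School A", -2.24, 53.48),
    ("School B", -0.12, 51.50),
    ("School C", -3.19, 55.95),
]

# A viewport roughly covering the north west of England
viewport = BoundingBox(min_lon=-3.5, min_lat=53.0, max_lon=-1.5, max_lat=54.0)
visible = schools_in_view(schools, viewport)
```

Every pan or zoom changes the viewport, so a filter like this has to be re-run each time, which is where the repeated work came from.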

Answer: cache the clusters (using Redis)

We decided that a caching layer would be a sensible way to solve our data rendering issues. We could set up an AWS Lambda function to check the database every couple of minutes and add new data to our cache. Our map could then hit the cache for its data.

For this caching layer, we decided on Redis, an in-memory data structure store, running on Amazon ElastiCache. I think the best description I have found of it is “memcache on steroids”. The real difference between Redis and memcache is that Redis uses data structures, meaning you can fine-tune exactly what you are storing in the cache, although there are a number of cool things that Redis offers in addition to pure caching.

The real deal clincher for Redis was its geospatial functionality. Since the 3.2 release, Redis has come with some really useful geo functions that we could use to store and retrieve geographical data. This saved us from manually working out which areas to look at, as Redis did all the maths for us. The two functions that we were interested in for Terrific Scientific were:

  • GEOADD
  • GEORADIUS

Let’s have a look at these in a bit more detail.

GEOADD

Data can be added to Redis using the GEOADD function. A typical command to add an individual school may look like the following:

GEOADD water_0 0.078 53.464 "School 1"

As with memcache, all values are added against a specified key. In our case, we wanted a key per experiment, per zoom level. So for the example above, the key would be “water_0” (water experiment at zoom level 0). We then provide the coordinates, which Redis expects as longitude followed by latitude, and the item to be added against the key. A single command can add one item or several at once.
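As a small sketch of the key scheme, the Python below builds a key name per experiment and zoom level, plus the argument list for a GEOADD call covering several schools. The helper names are hypothetical, and the code only assembles the arguments rather than talking to a Redis server. The one detail worth getting right is that GEOADD takes longitude before latitude for each member.

```python
def cache_key(experiment, zoom):
    """Key naming used in this article: one key per experiment, per zoom level."""
    return f"{experiment}_{zoom}"


def geoadd_args(key, schools):
    """Build the argument list for a single GEOADD command.

    `schools` is an iterable of (name, latitude, longitude) tuples.
    GEOADD expects longitude first, then latitude, then the member name.
    """
    args = ["GEOADD", key]
    for name, lat, lon in schools:
        args += [str(lon), str(lat), name]
    return args


# Hypothetical schools: (name, latitude, longitude)
schools = [
    ("School 1", 53.464, 0.078),
    ("School 2", 53.480, -2.240),
]

command = geoadd_args(cache_key("water", 0), schools)
print(" ".join(command))
```

Keeping latitude and longitude as named tuple positions and swapping them in exactly one place makes it harder to get the order wrong when adding data.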

GEORADIUS

To retrieve data from Redis, we can use the GEORADIUS function. This is a particularly useful one:

GEORADIUS water_0 0.078 53.464 100 mi

So here, we call GEORADIUS with a key (the same key as before), a center point given as longitude then latitude, and a radius (in this case, 100 miles). The function queries the key using the given point as the center and retrieves all data stored under that key within the radius. So this query will retrieve all data within a 100 mile radius of latitude 53.464, longitude 0.078.
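To give a feel for the maths that Redis takes off our hands, here is a rough Python sketch of a radius search using the haversine great-circle distance on a spherical Earth. It illustrates the concept only and is not how Redis implements the query internally. The function names, the Earth radius constant and the sample points are assumptions for the example.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, treating the Earth as a sphere


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_radius(points, center_lat, center_lon, radius_miles):
    """Return names of points no further than `radius_miles` from the center."""
    return [
        name
        for name, lat, lon in points
        if haversine_miles(center_lat, center_lon, lat, lon) <= radius_miles
    ]


# Hypothetical points: (name, latitude, longitude)
points = [
    ("Near School", 53.50, 0.10),
    ("Far School", 51.50, -0.12),
]

nearby = within_radius(points, 53.464, 0.078, 100)
```

Here only “Near School” is returned, since the second point is roughly two degrees of latitude (well over 100 miles) to the south.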

There’s plenty more that Redis has to offer in terms of geospatial functionality so I’d advise you check it out if this kind of thing interests you.

Conclusion

Of course, there were several other things involved in putting Terrific Scientific together, but hopefully this article has highlighted a few of the main technical challenges we ran into and how we attempted to overcome them. If maps are your thing, I’d highly recommend you check out OpenLayers. It’s really straightforward to get up and running with some simple maps and there are some great demo apps online. I’d also recommend checking out Redis if you’re not already familiar with it, as there are loads of useful features just waiting to help you improve your apps.

If you know of any primary school teachers who may be interested in the campaign, or if you just want a look yourself, please feel free to check out the Terrific Scientific website.
