TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.

Geocoding and generalisations

6 min read · Jan 23, 2021

Image by author

Grid cells are excellent for generalising information. This story is about geocoding, generalisations and the availability of spatial datasets which can be used to present spatially referenced data in geographical information systems. It concludes by pointing to the excellent Zenodo website, where processed files can be distributed.

Geocoding

Geocoding is a way of using a combination of text and numbers to uniquely represent a geographic entity. It could be a location or an object. This is embedded in our language through our use of place names.

Most of us live in a numbered house on a named street. Abbey Road 5 is one such geocode. It is not a particularly good one, as there are tens if not hundreds of roads with that name around the world. Some of you might be familiar with the geocoding system what3words. It builds a unique identifier from three words. It works all right.

The geocoding system presented in this article is not perfect, not the best and not the easiest to use. But it is good and it is free!

Generalisations

In ecology and the social sciences, generalisation is key to understanding spatial patterns. Spatial patterns allow us to understand ecological conditions, disease spread, demographic patterns and much more.

Generalisation is a trade-off. Knowing exactly where the animals in a part of nature are at any given point in time is not very helpful. To manage a national park it makes more sense to know in general terms where the animals are over time. If we know that the animals move within certain cells in a regular grid, we can intersect this information with other things we know about the area. Less detail thus gives us more useful knowledge about an area. So to manage the animals you will have to let go of their exact positions. That does not matter as long as you can generalise their positions to a useful level. This is the power of not knowing everything.

How the grid cells can be used to illustrate species observations in Tanzania. (Image by author)
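
As a minimal sketch of this kind of generalisation, the Python below snaps point observations to the regular grid cell that contains them and counts the observations per cell. The coordinates are made up for illustration, and the helper names `cell_origin` and `count_per_cell` are hypothetical; they are not part of any QDGC tooling.

```python
import math
from collections import Counter


def cell_origin(lon, lat, cell_size):
    """Return the south-west corner of the grid cell containing a point."""
    return (
        math.floor(lon / cell_size) * cell_size,
        math.floor(lat / cell_size) * cell_size,
    )


def count_per_cell(points, cell_size):
    """Count how many (lon, lat) points fall in each grid cell."""
    return Counter(cell_origin(lon, lat, cell_size) for lon, lat in points)


# Made-up observation points (longitude, latitude), for illustration only.
observations = [
    (34.81, -2.33),
    (34.86, -2.41),
    (34.93, -2.27),
    (35.12, -2.45),
    (35.20, -2.31),
]

# Half-degree cells: the exact positions are dropped, the pattern remains.
counts = count_per_cell(observations, cell_size=0.5)
for (west, south), n in sorted(counts.items()):
    print(f"cell starting at lon {west}, lat {south}: {n} observations")
```

The exact positions are gone after this step, but the per-cell counts can be joined with anything else that is known about the same cells.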

Finding the right level of generalisation is not easy. The system behind the data sets presented in this article lets the user choose a relevant grid size. It allows you to go all the way from full degree grids to the size of a tea cup, if that's your cup of tea.
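
To get a feeling for the available sizes, here is a small sketch that assumes each level halves the cell edge, starting from a full degree at level 0 (the scheme described further down). The figure of 111.32 km per degree is an approximation that holds near the equator, and the function names are hypothetical.

```python
KM_PER_DEGREE = 111.32  # approximate length of one degree near the equator


def cell_size_degrees(level):
    """Edge length of a cell in degrees, halving once per level."""
    return 1.0 / (2 ** level)


def cell_size_metres(level):
    """Approximate edge length of a cell in metres near the equator."""
    return cell_size_degrees(level) * KM_PER_DEGREE * 1000.0


for level in (0, 1, 2, 3, 7, 20):
    print(
        f"level {level:2d}: {cell_size_degrees(level):.7f} degrees, "
        f"about {cell_size_metres(level):,.2f} m"
    )
```

Under these assumptions a cell edge only shrinks to roughly tea cup size (about 10 cm) at around level 20.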

Quarter degree grid cells

Quarter degree grid cells combine both of the above. The standard represents a way of making (almost) equal area squares covering a specific area to represent specific qualities of the area covered. The squares themselves are based on the degree squares covering earth. Around the equator we have 360 lines of longitude, one per degree, and between the north and south poles there are 180 one-degree bands of latitude. Together this gives us 360 x 180 = 64,800 tiles covering earth.

How does this become a geocoding system? Each degree square is designated by a full reference to the main degree square, for example E010S01. E means the square is east of the zero meridian, and S means it is south of the equator. The numbers give the degrees of longitude and latitude.

How a longitude/latitude cell is divided into smaller grid cells. (Image by author)

A square with no sublevel reference is also called QDGC level 0. This is a square based on a full degree of longitude by a full degree of latitude. Each QDGC level 0 square is divided into four squares named A, B, C and D, which make up QDGC level 1.

The process can be described as follows:

  1. Each degree square is designated by seven characters. The first character indicates whether the square is east or west of the zero meridian (E or W). The three following digits indicate at which longitude the square starts. The next character indicates whether the square is north or south of the equator (N or S). Next are two digits indicating the latitude, that is the distance in degrees from the equator. This is a reference to a unique degree square and is referred to as QDGC level 0.
  2. Each degree square is divided into four squares where the upper left quadrant is designated A, the upper right B, the lower left C and the lower right D. This is QDGC level 1.
  3. Each resulting quadrant is subject to the same process as described in Step 2. A quadrant indicator character (A, B, C or D) is added at the end for each iteration. The level number is increased by one per iteration.

The steps necessary to make a QDGC code. (Image by author)
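
The three steps can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the code used to produce the datasets: the function name `qdgc_code` is hypothetical, "upper" is taken to mean north and "left" to mean west, squares south or west of zero are named after the degree line closest to zero, and points exactly on the equator or the zero meridian are treated as north and east.

```python
import math


def qdgc_code(lon, lat, level):
    """Build a QDGC-style code for a point (a sketch, see assumptions above)."""
    if not (-180 <= lon < 180 and -90 <= lat < 90):
        raise ValueError("lon must be in [-180, 180) and lat in [-90, 90)")

    # Step 1: name the degree square from its south-west corner.
    west = math.floor(lon)
    south = math.floor(lat)
    lon_part = f"E{west:03d}" if west >= 0 else f"W{-west - 1:03d}"
    lat_part = f"N{south:02d}" if south >= 0 else f"S{-south - 1:02d}"
    code = lon_part + lat_part

    # Steps 2 and 3: halve the square once per level and append A, B, C or D.
    letters = {
        (True, False): "A",   # north-west (upper left)
        (True, True): "B",    # north-east (upper right)
        (False, False): "C",  # south-west (lower left)
        (False, True): "D",   # south-east (lower right)
    }
    size = 1.0
    for _ in range(level):
        size /= 2
        in_east_half = lon >= west + size
        in_north_half = lat >= south + size
        if in_east_half:
            west += size
        if in_north_half:
            south += size
        code += letters[(in_north_half, in_east_half)]
    return code


# An illustrative point in central Tanzania (approximate coordinates).
print(qdgc_code(35.74, -6.17, 0))  # E035S06
print(qdgc_code(35.74, -6.17, 3))  # E035S06BAD
```

Because each level only appends one character, a coarser code is always a prefix of the finer codes inside it, which makes it cheap to roll detailed cells up to a more general level.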

You can read more about this system in an article some colleagues and I wrote some years ago.

Producing the QDGC grid

I started producing the grid cells for personal use some 15 years ago. Back then I used ArcGIS and Python to produce the files.

Code sample from when Python was used. (Image by author)

Producing the grid is not difficult when we are looking at level 1 or 2. Tanzania has 529 cells on level 1, 1,980 on level 2 and 7,830 on level 3. On level 7, which is roughly 870 x 870 meters near the equator, the number has grown to 1,960,800. It does not take a math genius to figure out that producing grid cells for all African countries is a daunting task. For each level the number of cells increases by a factor of close to four.
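
The growth can be checked against the Tanzanian numbers quoted above. The sketch below compares them with a pure factor-of-four projection from level 1. The projection is an upper bound, presumably because only cells that touch the country are kept, so a border cell does not always pass on all four of its children. The function name `upper_bound` is hypothetical.

```python
# Cell counts for Tanzania as quoted in the text, keyed by QDGC level.
reported = {1: 529, 2: 1980, 3: 7830, 7: 1960800}


def upper_bound(level, base_level=1, base_count=reported[1]):
    """Cell count if every cell split into exactly four at each level."""
    return base_count * 4 ** (level - base_level)


for level, count in sorted(reported.items()):
    bound = upper_bound(level)
    print(
        f"level {level}: reported {count:>9,}, "
        f"factor-four bound {bound:>9,} ({count / bound:.0%} of bound)"
    )
```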

Today the production is done entirely with functions written in PL/pgSQL, which rely heavily on PostGIS 3.1. Doing all the processing in PostGIS sped up the whole process. It still takes 4 minutes and 36 seconds to process grids from level 1 to 7 for Tanzania. Doing that for all African countries takes a lot of time.

Coding functions in PL/pgSQL. (Image by author)
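
The PL/pgSQL functions themselves are only shown as an image. As a rough, library-free illustration of what producing the grid involves, the Python sketch below enumerates the cells that overlap a bounding box by repeatedly splitting degree squares into quadrants. It is not the author's implementation: it filters on a simple bounding box rather than a country polygon, and all function names are hypothetical.

```python
import math


def degree_square_name(west, south):
    """Name a degree square from its integer south-west corner."""
    lon_part = f"E{west:03d}" if west >= 0 else f"W{-west - 1:03d}"
    lat_part = f"N{south:02d}" if south >= 0 else f"S{-south - 1:02d}"
    return lon_part + lat_part


def subdivide(code, west, south, size):
    """Split one cell into its four quadrants A, B, C and D."""
    half = size / 2
    return [
        (code + "A", west, south + half, half),         # upper left
        (code + "B", west + half, south + half, half),  # upper right
        (code + "C", west, south, half),                # lower left
        (code + "D", west + half, south, half),         # lower right
    ]


def overlaps(west, south, size, bbox):
    """True if the cell shares some area with the bounding box."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return (
        west < max_lon and west + size > min_lon
        and south < max_lat and south + size > min_lat
    )


def qdgc_grid(bbox, level):
    """List (code, west, south, size) for cells overlapping bbox at a level."""
    min_lon, min_lat, max_lon, max_lat = bbox
    cells = [
        (degree_square_name(w, s), float(w), float(s), 1.0)
        for w in range(math.floor(min_lon), math.ceil(max_lon))
        for s in range(math.floor(min_lat), math.ceil(max_lat))
    ]
    for _ in range(level):
        cells = [
            child
            for cell in cells
            for child in subdivide(*cell)
            if overlaps(*child[1:], bbox)
        ]
    return cells


# A made-up bounding box (min_lon, min_lat, max_lon, max_lat).
for code, west, south, size in qdgc_grid((34.6, -2.4, 35.2, -2.1), level=1):
    print(code, west, south, size)
```

Discarding cells outside the area of interest at every level keeps the number of candidates from growing by a full factor of four each time.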

Over a couple of evenings this last week I completed the processing and can now present QDGC files for all African countries. The raw files are distributed in GeoPackage format. To save space they are compressed with 7-Zip.

The 54 files for Africa in compressed format are a total of 1.2 GB. The uncompressed size is around 32 GB.

Sharing using Zenodo

Where I earlier relied on distributing the files from my own website, I have now moved on to sharing them through Zenodo.

The OpenAIRE project, in the vanguard of the open access and open data movements in Europe was commissioned by the EC to support their nascent Open Data policy by providing a catch-all repository for EC funded research. CERN, an OpenAIRE partner and pioneer in open source, open access and open data, provided this capability and Zenodo was launched in May 2013. (Source https://about.zenodo.org/)

Zenodo has communities, which represent an affiliation for the files uploaded to Zenodo. All QDGC files are uploaded to the qdgc community.

Zenodo equips every file it stores with a unique digital object identifier (DOI). Should you end up using a dataset in your research, or need a reference for other good reasons, Zenodo secures this. The quarter degree grid cells dataset version 1.0.0 for Tanzania can therefore be referenced like this:

Ragnvald Larsen. (2021). qdgc Tanzania (Version 1.0.0) [Data set]. Zenodo. http://doi.org/10.5281/zenodo.4452672

From now on this is where I will deposit all files related to quarter degree grid cells.

What does it mean?

Not much, really. If you need a set of geocoded, spatially referenced objects which are suitable for generalisations, just use it. If you don't, I hope you have enjoyed the read. Thank you for reading!

Published in TDS Archive

Written by Ragnvald Larsen

Geographer working with GIS, data management and development cooperation. My opinions are my own. https://www.linkedin.com/in/ragnvald/