# The Geolytix Retailpoint Dataset

Part I: National Level exploratory data analysis

I am having a fabulous time analysing this dataset (which can be found here), looking at the locations of the major supermarkets using Excel and visualising the results with 3D Maps.

From there I wondered how the stores are sited. The Huff model indicates that stores are located near areas of high population. So I downloaded a population dataset from the UK Office for National Statistics (ONS), which gives the 2011 census population by postcode. Again using Excel, I analysed and visualised this dataset.

There is good qualitative agreement between the distribution of the population and the distribution of retail stores. It was time to load the data into Python.

Fortunately, Python has some great GIS tools, including Matplotlib Basemap, GeoPandas and Shapely, along with helpers such as the haversine function for measuring distances on a map.

By using GeoPandas and UK shapefiles to compare the number of shops in a postcode with the population of the same postcode, I was able to see how the number of shops varies with the population. Let’s look at the regression line using seaborn. For small areas, the number of shops is approximately proportional to the local population.

Statsmodels reports an R-squared of 0.950 for this fit, which indicates a strong linear relationship.
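To make the R-squared figure concrete, here is a minimal, dependency-free sketch of an ordinary least squares fit. It is not the Statsmodels call used for the analysis above; the function name `fit_line` and the sample numbers are made up purely for illustration.

```python
def fit_line(x, y):
    """Ordinary least squares fit of y = slope * x + intercept.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R-squared = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Hypothetical (population, shop count) pairs, invented for illustration only.
population = [5_000, 12_000, 20_000, 31_000, 45_000]
shops = [1, 3, 4, 7, 9]

slope, intercept, r_squared = fit_line(population, shops)
print(f"shops per 10,000 residents: {slope * 10_000:.2f}, R-squared: {r_squared:.3f}")
```

An R-squared close to 1 means the straight line accounts for almost all of the variation in shop counts between areas.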

Now let’s look at the spatial distribution of shops. Here I use haversine distances and a GeoPandas spatial join (sjoin) with shapefiles to analyse the cumulative distribution of retail stores against distance for London, Birmingham, Newcastle and Glasgow.
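As a rough sketch of the distance part of that calculation, the haversine formula can be written in a few lines of plain Python. The helper names `haversine_km` and `cumulative_counts`, the approximate London coordinates and the store locations are all assumptions for illustration; the shapefile join is not reproduced here.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlam = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def cumulative_counts(centre, stores, radii_km):
    """For each radius, count the stores lying within that distance of the centre."""
    distances = [haversine_km(centre[0], centre[1], lat, lon) for lat, lon in stores]
    return [sum(d <= r for d in distances) for r in radii_km]


# Approximate coordinates for central London; the store locations are invented.
london = (51.5074, -0.1278)
stores = [(51.51, -0.13), (51.55, -0.10), (51.40, -0.25), (51.75, -0.34)]
print(cumulative_counts(london, stores, radii_km=[1, 5, 20, 50]))
```

Plotting the returned counts against the radii gives one cumulative curve per city.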

In Part II of this series, I’ll explore these cumulative curves in detail, explaining why they are not smooth and how the cumulative number of stores depends on population size and spatial distribution. I was interested to see that whilst more populous cities have more stores, the dependence is comparatively weak.

I’ll now look at how many stores there are at a particular distance from the centre of a city.

What is going on here? A couple of ideas. From 0–10 km we are simply crossing London, so the circumference of the sampling ring grows with distance and the ring captures more stores. The core seems to have a roughly uniform density of stores, judging by this data and the cumulative store number curves. Then, from approximately 10 km, the number of stores at a particular distance starts to decrease because we have effectively left the populous core of London and are sampling the less populous fringe areas. At larger distances, the sampling ring occasionally crosses satellite towns, for example at 80 km, and the number of stores increases sharply.
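The ring argument can be checked with a small sketch: count the stores in consecutive rings, then divide each count by the ring's area. If the density is uniform, the raw counts grow with distance while the per-area figure stays flat. The function names, ring width and distances below are hypothetical and chosen only to illustrate the idea.

```python
from math import pi


def ring_counts(distances_km, ring_width_km, max_km):
    """Count stores in consecutive rings [r, r + width) around the centre."""
    n_rings = int(max_km // ring_width_km)
    counts = [0] * n_rings
    for d in distances_km:
        idx = int(d // ring_width_km)
        if 0 <= idx < n_rings:
            counts[idx] += 1
    return counts


def ring_densities(counts, ring_width_km):
    """Divide each ring's count by its area, pi * (r_outer^2 - r_inner^2)."""
    densities = []
    for i, count in enumerate(counts):
        r_inner = i * ring_width_km
        r_outer = r_inner + ring_width_km
        densities.append(count / (pi * (r_outer ** 2 - r_inner ** 2)))
    return densities


# Invented distances (km) from a city centre, for illustration only.
distances = [0.5, 1.2, 1.8, 2.1, 2.4, 2.9, 3.3, 7.5]
counts = ring_counts(distances, ring_width_km=1.0, max_km=4.0)
print(counts)
print([round(rho, 3) for rho in ring_densities(counts, ring_width_km=1.0)])
```

Dividing by ring area separates the purely geometric growth in counts from genuine changes in store density, such as the drop at the edge of the core or the spike where a ring crosses a satellite town.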

In the next couple of stories, I’ll be looking at the distribution of stores in finer detail, aiming eventually to explain why Tesco, New Malden has the highest sales in the Tesco estate.

Written by

## David Horgan

#### I am a theoretical physicist with a data science background. At present, I am developing a UK retail market using ABM, ML and computational econometrics.
