A Gentle Introduction to Geospatial Data Science

Get familiar with Spatial Data, Location Intelligence, Heatmap, GeoPandas, and many more topics.

Mangesh Gupta
CodeX
6 min read · Jun 9, 2021


Photo by Brett Zeck on Unsplash

Have you ever wondered how highly successful companies like Burger King, Amazon, and Lenskart decide where to put their physical stores across a nation or a city? Is it a random, intuition-based process, or is there an underlying analysis of demographics, sales trends, traffic, weather, and so on?

Did you know it is possible to predict how much money a humongous chain like Walmart will earn in the next quarter? Yes, Geospatial Data Science enables us to use technology in incredible ways. In this article, we’ll get familiar with this blooming area of Data Science.

What is Spatial Data?

“Without geography, you’re nowhere.” — Jimmy Buffett

Let us start with the most foundational concept in Geospatial Data Science: spatial data. Spatial data is information about the location and shape of geographical features and the relationships between them. We most commonly represent a geographical location using (latitude, longitude) coordinates. The shape, on the other hand, depends on the type of data: vector or raster. Vector data is represented through points, lines, and polygons, commonly stored in shapefiles (.shp), whereas raster data is image-like data stored as a grid of pixels (e.g. satellite imagery).

Another format for storing spatial data is GeoJSON. If you want to get familiar with it, you can visit geojson.io, a very intuitive tool for editing GeoJSON data using a map interface. The following is an example of spatial data represented as a polygon:

Here, you can see how GeoJSON files store such shapes using the coordinates of geographic locations.
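To make the idea concrete, here is a minimal sketch of how a polygon can be written as GeoJSON using only Python's standard library. The coordinates and the "example area" name are made up for illustration and are not taken from any dataset used in this article.

```python
import json

# A hypothetical rectangle. GeoJSON positions are written as
# [longitude, latitude], and a polygon ring must end on the same
# position it starts with.
ring = [
    [2.30, 48.85],
    [2.40, 48.85],
    [2.40, 48.90],
    [2.30, 48.90],
    [2.30, 48.85],
]

# A Feature pairs a geometry with free-form properties.
feature = {
    "type": "Feature",
    "properties": {"name": "example area"},
    "geometry": {"type": "Polygon", "coordinates": [ring]},
}

# A FeatureCollection is the usual top-level object in a GeoJSON file.
feature_collection = {"type": "FeatureCollection", "features": [feature]}

geojson_text = json.dumps(feature_collection, indent=2)
print(geojson_text)
```

You can paste the printed text into geojson.io to see the shape drawn on a map.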

Typically in Data Science, we train a model that takes some features of the data as input and predicts a label. In Geospatial Data Science, the features reference a geographic location and can therefore be placed on a map. We call such data spatial data.

Spatial Analysis with GeoPandas in Python

Now, let's step into the shoes of a Spatial Data Scientist. If you are not familiar with the Python programming language, you can still follow the graphical outputs and the analysis in the later sections of this article.

While we use the Pandas library in Python to preprocess most types of data, the GeoPandas library, built on top of Pandas, helps us preprocess spatial data. Sounds interesting? Want to get started with GeoPandas? I recommend using the Jupyter Notebook that comes with the Anaconda distribution, or Google Colab. To install GeoPandas for use in Jupyter Notebook, type the command below in your Anaconda Prompt.

conda install -c conda-forge geopandas

For installation in Google Colab, you can run a cell with the command mentioned below.

!pip install git+https://github.com/geopandas/geopandas.git

Reading data in GeoPandas

I will use the Spatial Data for CORD-19 (COVID-19 ORDC) dataset from Kaggle to demonstrate how to read spatial data with GeoPandas. You can practise the same steps by visiting the dataset page and clicking the “New Notebook” button in the upper right of the page. The notebook opens in the same folder as the dataset, so you can start writing code to read and analyse the data right away.

Notice that the data I’m reading has the extension .shp, i.e. it is a shapefile. Shapefiles (.shp) are one of the most common file formats for spatial data. If you download a data archive from the internet and want to read its shapefile, keep the companion files that come with it (.shx, .prj, .dbf) in the same folder as the .shp file; otherwise GeoPandas will not be able to read the data properly.

Note: Older GeoPandas releases (before version 1.0) also ship a few inbuilt datasets that you can use for practice. In those versions, running geopandas.datasets.available in your IPython notebook lists them; the datasets module was removed in GeoPandas 1.0. An example of using inbuilt datasets can be seen here.

Case Study: Happiness of Citizens

Now, let’s analyse a geographically interesting dataset using GeoPandas. Every year, the United Nations’ Sustainable Development Solutions Network releases a World Happiness Report, which ranks nations by the happiness of their citizens based on a happiness index (a score calculated from a country's performance on 6–7 happiness parameters, e.g. GDP per capita and healthy life expectancy). I will use the World Happiness Report 2021 and Countries population by the year 2020 datasets from Kaggle to analyse the happiness of nations, and try to find out which factors make a country happy or unhappy and how these factors affect happiness.

By visualising this data using GeoPandas, we find that happier countries generally tend to be less populous, have fewer children, and have older citizens. The happiness index also depends on other important factors that we did not analyse here, as this is only meant to be an introduction. These include corruption, social support, GDP per capita, and freedom to make life choices. You can take it as an exercise to do a spatial analysis on these parameters and draw some more conclusions :).
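A minimal sketch of the kind of join behind such an analysis is shown below. All country names, shapes, scores, and populations here are made-up placeholders, not values from the Kaggle datasets, and the column names are assumptions. The real files will need their own column names and some cleaning of country names before merging.

```python
import geopandas as gpd
import pandas as pd
from shapely.geometry import box

# Made-up placeholder tables, NOT taken from the real reports.
happiness = pd.DataFrame(
    {"country": ["Country A", "Country B", "Country C"],
     "score": [7.5, 5.0, 3.5]}
)
population = pd.DataFrame(
    {"country": ["Country A", "Country B", "Country C"],
     "population": [5e6, 6e7, 2e8]}
)

# Placeholder country shapes: three unit squares side by side.
shapes = gpd.GeoDataFrame(
    {"country": ["Country A", "Country B", "Country C"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(2, 0, 3, 1)],
    crs="EPSG:4326",
)

# Join the two tables on the country name, then attach them to the shapes.
stats = happiness.merge(population, on="country", how="inner")
world = shapes.merge(stats, on="country", how="left")

# Pearson correlation between score and population (toy numbers only).
corr = world["score"].corr(world["population"])
print(world[["country", "score", "population"]])
print(f"correlation: {corr:.2f}")
```

Because the merge starts from the GeoDataFrame, the result keeps its geometry column and can be plotted on a map directly.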

Maps for data visualisation

An example of a heatmap

In Python, we use the folium package to create interactive maps such as marker clusters, bubble maps, heatmaps, and choropleth maps. A heatmap is used when we want to show the geographic clustering of some feature in our data. For instance, in a COVID-19 spatial analysis, we can make heatmaps of case counts to find out which cities to categorise as hotspots. Another example is planning physical store locations, where heatmaps can show the areas with the highest sales density. The heatmap code used for this article is taken from the official documentation page of GeoPandas.

Another useful type of map for visualising data is the choropleth map (a map where the colour of each shape is based on the value of an associated variable). It can be created easily with GeoPandas.

With this much knowledge, you can now try visualising marker clusters, bubble maps, flow maps, and so on with some data of your own. Good luck with your endeavours :)!
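Here is a small sketch of a choropleth drawn with the plot method of a GeoDataFrame. It assumes matplotlib is installed, and the three regions and their values are made up for illustration.

```python
import io

import matplotlib

matplotlib.use("Agg")  # headless backend; remove this line in a notebook

import geopandas as gpd
from shapely.geometry import box

# Made-up regions: three squares with one value each.
regions = gpd.GeoDataFrame(
    {"region": ["R1", "R2", "R3"], "value": [10, 25, 40]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(2, 0, 3, 1)],
)

# Colour each shape by its "value" column and add a colour bar.
ax = regions.plot(column="value", cmap="OrRd", legend=True, edgecolor="black")
ax.set_title("Toy choropleth")

# Save the figure to an in-memory PNG (use a file path to save to disk).
buf = io.BytesIO()
ax.figure.savefig(buf, format="png")
print(len(buf.getvalue()), "bytes of PNG written")
```

Swapping the toy squares for real country or district polygons gives the familiar shaded map.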

What is Location Intelligence?

Location intelligence is a concept that many industries use to solve real-world problems. We can define location intelligence as the insights we derive from analysing geospatial data. These insights can be any actionable information drawn from the trends and relationships found in spatial data, and such trends and relationships appear in everything from consumer behaviour to environmental factors.

One of the best use cases of location intelligence is “The Traffic Jam Whopper” by Burger King in Mexico. Cities in Mexico witness some of the world’s worst traffic jams, and Burger King treated this situation as a huge opportunity. They used live spatial data to reach customers even during peak traffic hours, making it possible for people to place an order and receive it while stuck in traffic. Have a look at this short video on the Traffic Jam Whopper.

Well, this is how Burger King became Mexico’s number one and most beloved fast-food app. Location intelligence has great scope to be used creatively in the near future and to evolve rapidly. It has made a palpable difference in the way businesses conduct their market research.

Future of Geospatial Data Science?

So far, we have seen a few things in Geospatial Data Science, but how bright is the future of this technology?

According to a global survey of hundreds of thought leaders from various enterprises, nearly 68% of organisations are likely to increase their investment in Geospatial Data Science in the coming years, which is one reason this technology is worth learning. Geospatial Data Science is proving useful in building resilient cities, tracking biodiversity, smart farming, fighting deadly diseases, and more. Hence, in my opinion, computational geography will eventually become the new normal.

Conclusion

Geospatial Data Science is the branch of data science that encompasses locational analytics, satellite imagery, remote sensing, analysis of projection systems, and raster and vector data. If you are a data science enthusiast, consider doing at least one case study in this field, as it is seldom studied by learners and will add some uniqueness to your portfolio. Geospatial Data Science is still open to more in-depth exploration.

I hope this article was insightful. You can reach me at mangeshgupta402@gmail.com.
