Spatial Visualization and Network Analysis with GeoPandas in Python

Imam Muhajir · Published in Analytics Vidhya · 7 min read · Jan 23, 2020

Spatial data refers to data objects or elements that are located in a geographical space. It makes it possible to find and locate individuals or devices anywhere in the world.

Spatial data is used in geographical information systems (GIS) and other geolocation or positioning services. It consists of points, lines, polygons, and other geometric primitives that can be mapped by location, and it is commonly classified as raster or vector data, each providing distinct information about geographical locations. In this article we will work with two of these types: spatial points and spatial polygons.
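These primitives map directly onto the geometry types GeoPandas uses under the hood (via the Shapely library). A minimal sketch of the three basic types, with made-up coordinates purely for illustration:

from shapely.geometry import Point, LineString, Polygon

# a single location (longitude, latitude)
pt = Point(106.8, -6.2)
# an ordered sequence of locations, e.g. a road or a fault line
line = LineString([(106.8, -6.2), (107.6, -6.9), (110.4, -7.0)])
# a closed ring of locations, e.g. a province boundary
poly = Polygon([(95.0, 6.0), (141.0, 6.0), (141.0, -11.0), (95.0, -11.0)])

print(pt.within(poly))  # True: the point lies inside the polygon
print(line.length)      # length in coordinate (degree) units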

1. Spatial Polygon

In Python, spatial data is handled with the GeoPandas package. You can install GeoPandas for a Jupyter notebook in an Anaconda environment with:

pip install geopandas

However, if you run into problems on your own machine, you can use a notebook on kaggle.com; Kaggle notebooks already have the geopandas package installed. You can find polygon data on the internet, for example from my dataset at https://www.kaggle.com/imammuhajir/map-data/settings

or

http://www.info-geospasial.com/2015/10/data-shp-seluruh-indonesia.html

Let’s try it in a Kaggle notebook:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import geopandas as gpd
import os

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

Import the shapefile data:

indo = gpd.read_file("/kaggle/input/map-data/INDONESIA_PROP.shp")
indo.head()

Now you can try visualizing the polygon data; GeoPandas makes this very easy:

indo.plot(column='Propinsi', cmap='Set1', legend=True, figsize = (30,40))
plt.title('TITIK GEMPA')
plt.show()

This is the polygon visualization. Next, you can filter the regions you want:

Filtering the GeoDataFrame

We keep only the regions "SUMATERA UTARA", "SUMATERA BARAT", "JAWA TENGAH", "PROBANTEN", "DI. ACEH", "JAWA TIMUR", "JAWA BARAT", "LAMPUNG", "DKI JAKARTA", "RIAU", "JAMBI", "BANGKA BELITUNG", "SUMATERA SELATAN", and "DAERAH ISTIMEWA YOGYAKARTA".

# read the data
geo_df = gpd.read_file("/kaggle/input/map-data/INDONESIA_PROP.shp")

# provinces to keep
filter_sumatra_jawa = [ "SUMATERA UTARA", "SUMATERA BARAT","JAWA TENGAH","PROBANTEN","DI. ACEH","JAWA TIMUR","JAWA BARAT" ,"LAMPUNG" ,"DKI JAKARTA","RIAU", "JAMBI","BANGKA BELITUNG" ,"SUMATERA SELATAN","DAERAH ISTIMEWA YOGYAKARTA"]

ori_len = len(geo_df)
geo_df['Wilayah'] = list(range(len(geo_df)))

# flag rows whose province is in the filter list, then keep only those rows
for i in range(ori_len):
    if geo_df['Propinsi'][i] in filter_sumatra_jawa:
        geo_df['Wilayah'][i] = 1
    else:
        geo_df['Wilayah'][i] = 0
geo_df = geo_df[geo_df['Wilayah'] == 1]
geo_df.head()
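As a side note, the same filter can be written more concisely, and without the chained-assignment warnings, using .isin(); a sketch assuming the same geo_df and filter_sumatra_jawa defined above:

# keep only rows whose province is in the filter list
geo_df = geo_df[geo_df['Propinsi'].isin(filter_sumatra_jawa)].copy()
geo_df.head()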

Visualization again…

leg_kwds = {'title':'NAMA WILAYAH', 'loc':'upper left', 'bbox_to_anchor':(1,1.05), 'ncol':1}
geo_df.plot(column='Propinsi', cmap='Set1', legend=True, legend_kwds = leg_kwds, figsize = (20,50))
plt.title('TITIK GEMPA')
plt.show()

2. Spatial Point and Spatial Polygon

For this visualization we use earthquake data, which you can download from https://www.kaggle.com/imammuhajir/data-gempa

This data contains coordinate locations (latitude and longitude); combining these two variables gives the point locations on your map.
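GeoPandas can also turn such latitude/longitude pairs into Point geometries directly. A small, self-contained sketch, assuming a reasonably recent GeoPandas; the coordinates below are made-up examples, not taken from the dataset:

import pandas as pd
import geopandas as gpd

# hypothetical earthquake coordinates, for illustration only
df = pd.DataFrame({'Latitude': [-6.2, -7.8, 3.6], 'Longitude': [106.8, 110.4, 98.7]})
# x = longitude, y = latitude; EPSG:4326 is the usual latitude/longitude reference system
points = gpd.GeoDataFrame(
    df,
    geometry=gpd.points_from_xy(df['Longitude'], df['Latitude']),
    crs='EPSG:4326',
)
print(points.geometry.head())

In this article, however, we keep the coordinates as plain columns and plot them with plt.scatter on top of the polygon map.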

train_09 = pd.read_csv('/kaggle/input/data-gempa/2009.csv')
train_10 = pd.read_csv('/kaggle/input/data-gempa/2010.csv')
train_11 = pd.read_csv('/kaggle/input/data-gempa/2011.csv')
train_12 = pd.read_csv('/kaggle/input/data-gempa/2012.csv')
train_13 = pd.read_csv('/kaggle/input/data-gempa/2013.csv')
train_14 = pd.read_csv('/kaggle/input/data-gempa/2014.csv')
train_15 = pd.read_csv('/kaggle/input/data-gempa/2015.csv')
train_16 = pd.read_csv('/kaggle/input/data-gempa/2016.csv')
train_17 = pd.read_csv('/kaggle/input/data-gempa/2017.csv')
train_18 = pd.read_csv('/kaggle/input/data-gempa/2018.csv')
train_all = [train_09,train_10,train_11,train_12,train_13,train_14,train_15,train_16,train_17,train_18 ]
# rename the columns of each yearly dataset
for i in train_all:
    i.columns = ['index', 'Date', 'Time', 'Latitude', 'Longitude','Depth',
                 'Magnitude', 'TypeMag', 'smaj', 'smin', 'az', 'rms', 'cPhase',
                 'Region']
    i.drop(0, axis=0, inplace=True)

# combine the yearly data into one DataFrame
train = pd.concat(train_all)

This data is quite messy: some columns mix types, and to compute or visualize we have to convert them to numeric values. So let's clean the data.

# split the strings and keep only the numeric part
train["Longitude"] = train.Longitude.apply(lambda x : x.split()[0])
train["Depth"] = train.Depth.apply(lambda x : x.split()[0])
train["Magnitude"] = train.Magnitude.apply(lambda x : x.split()[0])
train["Time"] = train.Time.apply(lambda x : x.split()[0])

# extract date and time components
train["year"] = train.Date.apply(lambda x : x.split('/')[2])
train["date"] = train.Date.apply(lambda x : x.split('/')[1])
train["month"] = train.Date.apply(lambda x : x.split('/')[0])
train["hour"] = train.Time.apply(lambda x : x.split(':')[0])
train["Date"] = train.Date.apply(lambda x : x.split()[0])
train['Date'] = train[['Date', 'Time']].apply(lambda x : ' '.join(x), axis = 1)
train['Date'] = pd.to_datetime(train['Date'])
train.index = train['Date']

# convert latitude to signed decimal degrees ('LS' marks southern latitude, so negate it)
train_len = len(train)
for i in range(train_len):
    if train['Latitude'][i].split()[1] == 'LS':
        train['Latitude'][i] = float(train['Latitude'][i].split()[0]) * -1
    else:
        train['Latitude'][i] = train['Latitude'][i].split()[0]
train.head()

Convert the columns to float and filter to the Indonesian region:

# convert column types to float
columns = ['Latitude','Longitude', 'Depth', 'Magnitude']
for var in columns:
    train[var] = train[var].astype("float64")

# filter to Indonesia's bounding box
train = train.loc[((train['Latitude'] >= -12.5) & (train['Latitude'] <= 7) & (train['Longitude'] >= 94.31644) & (train['Longitude'] <= 142.71813) )]

Visualize the spatial data, with the polygon data as the background and the coordinate data as points:

leg_kwds = {'title':'NAMA WILAYAH', 'loc':'upper left', 'bbox_to_anchor':(1,1.05), 'ncol':1}
geo_df.plot(column='Propinsi', cmap='Set1', legend=True, legend_kwds = leg_kwds, figsize = (30,30))
plt.title('TITIK GEMPA')
plt.scatter(y= train.Latitude, x = train.Longitude , c = train.Magnitude, alpha=0.8);
plt.show()

Filter data based on magnitude values…

train5 = train[train['Magnitude'] >= 5.5]
leg_kwds = {'title':'NAMA WILAYAH', 'loc':'upper left', 'bbox_to_anchor':(1,1.05), 'ncol':1}
geo_df.plot(column='Propinsi', cmap='Set1', legend=True, legend_kwds = leg_kwds, figsize = (30,30))
plt.title('TITIK GEMPA')
plt.scatter(y= train5.Latitude, x = train5.Longitude , c = train5.Magnitude, alpha=0.8);
plt.show()

3. Network Analysis of Seismic Stations and Earthquakes

To analyze the network between seismic stations and earthquakes, we first have to determine the station locations, because the data does not include them. To estimate the station locations, we assume that earthquakes with an "smin" of less than 0.3 occurred close to a seismic station.
Note: "smin" is the distance from the earthquake to the nearest station.

Let’s look at the results of the filtering

train_loc = train.copy()
train_loc = train_loc[train_loc.smin < 0.3]
leg_kwds = {'title':'NAMA WILAYAH', 'loc':'upper left', 'bbox_to_anchor':(1,1.05), 'ncol':1}
geo_df.plot(column='Propinsi', cmap='Set1', legend=True, legend_kwds = leg_kwds, figsize = (30 ,30))
plt.title('TITIK GEMPA')
plt.scatter(y= train_loc.Latitude, x = train_loc.Longitude, alpha=0.8);
plt.show()

Now that we know the distribution of earthquakes close to stations, we need to determine the exact station points; we use K-means clustering to estimate them.

from sklearn.cluster import KMeans

train_clus = train_loc[['Latitude', 'Longitude']]
num_clusters = 152
km = KMeans(n_clusters=num_clusters, random_state = 1)
km.fit(train_clus)
clusters = km.labels_.tolist()
train_clus['Clustering'] = clusters

# estimate each station location as the centroid of its cluster
loc_statiun = train_clus.groupby('Clustering').mean()
loc_statiun = loc_statiun.reset_index()
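The choice of 152 clusters is an assumption about how many stations the network has. If you are unsure, one common check is an elbow plot of the K-means inertia over a range of cluster counts; a minimal sketch reusing train_clus from above (the candidate values of k are arbitrary, and train_clus must have more rows than the largest k):

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# inertia (within-cluster sum of squares) for several candidate cluster counts
ks = list(range(50, 251, 25))
inertias = []
for k in ks:
    km_k = KMeans(n_clusters=k, random_state=1).fit(train_clus[['Latitude', 'Longitude']])
    inertias.append(km_k.inertia_)

plt.plot(ks, inertias, marker='o')
plt.xlabel('number of clusters k')
plt.ylabel('inertia')
plt.show()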

Visualization result:

leg_kwds = {'title':'NAMA WILAYAH', 'loc':'upper left', 'bbox_to_anchor':(1,1.05), 'ncol':1}
geo_df.plot(column='Propinsi', cmap='Set1', legend=False, legend_kwds = leg_kwds, figsize = (30,30))
plt.title('Lokasi Statiun Gempa')
plt.scatter(y= loc_statiun.Latitude, x = loc_statiun.Longitude, marker='^', alpha=1, c = '#1f77b4');
plt.show()

Analysis of the earthquake station network using the nearest earthquake points.

Create new features: the distance between each earthquake and its nearest station, and the identifier of that nearest station.

#from pandas.tools.plotting import scatter_matrix
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.preprocessing import LabelBinarizer
from sklearn.preprocessing import StandardScaler
from geopy.distance import vincenty  # note: removed in geopy 2.0 (see the geodesic note below)
import gc

# map each cluster id to its (latitude, longitude) station coordinates
statiun_coords = {}
for dat in loc_statiun.iterrows():
    row = dat[1]
    statiun_coords[int(row['Clustering'])] = (float(row['Latitude']), float(row['Longitude']))

A function that finds the closest station to a given earthquake point:

def closest_point(location, location_dict):
    """ take a tuple of latitude and longitude and
        compare to a dictionary of locations where
        key = location name and value = (lat, long)
        returns tuple of (closest_location , distance) """
    closest_location = None
    for city in location_dict.keys():
        distance = vincenty(location, location_dict[city]).kilometers
        if closest_location is None:
            closest_location = (city, distance)
        elif distance < closest_location[1]:
            closest_location = (city, distance)
    return closest_location

train_clus['close_station'] = train_clus.apply(lambda x: closest_point((x['Latitude'],x['Longitude']), statiun_coords), axis = 1)
train_clus['close_station_name'] = [x[0] for x in train_clus['close_station'].values]
train_clus['close_station_dist'] = [x[1] for x in train_clus['close_station'].values]
train['close_station'] = train.apply(lambda x: closest_point((x['Latitude'],x['Longitude']), statiun_coords), axis = 1)
train['close_station_name'] = [x[0] for x in train['close_station'].values]
train['close_station_dist'] = [x[1] for x in train['close_station'].values]
train.head()
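Note that vincenty was removed in geopy 2.0; on newer installs, geodesic is the drop-in replacement with the same call pattern. A small sketch with illustrative coordinates:

from geopy.distance import geodesic

# distance in kilometers between two (latitude, longitude) tuples
d = geodesic((-6.2, 106.8), (-7.8, 110.4)).kilometers
print(round(d, 1))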

Visualize the network:

train.plot(kind='scatter', x='Longitude', y='Latitude', marker='^', alpha=1, c = '#1f77b4', figsize= (30, 10))

# draw a line from each earthquake to its nearest station
for line in train.iterrows():
    dat = line[1]
    x1 = dat['Longitude']
    y1 = dat['Latitude']
    p2 = statiun_coords[dat['close_station_name']]
    x2 = p2[1]
    y2 = p2[0]
    plt.plot([x1,x2],[y1, y2], 'k-',linewidth=0.5)

#plt.imshow(peta_gempa, extent=[ 93 , 118.5, -13.3 , 7.7], alpha=0.9)
#plt.imshow(peta_gempa, extent=[ 90 , 115.5, -14.7 , 7.9], alpha=1)
plt.ylabel("Latitude", fontsize=14)
plt.xlabel("Longitude", fontsize=14)
plt.title("Map Jaringan Lokasi Stasiun dengan Gempa", fontsize = 20)
plt.show()

The network, filtered to earthquakes with magnitude of 5.5 or higher:

train5 = train[train['Magnitude'] >= 5.5]

# graph of vectors connecting each earthquake point to its nearest station
train5.plot(kind='scatter', x='Longitude', y='Latitude', marker='^', alpha=1, c = '#1f77b4', figsize = (30, 10))

for line in train5.iterrows():
    dat = line[1]
    x1 = dat['Longitude']
    y1 = dat['Latitude']
    p2 = statiun_coords[dat['close_station_name']]
    x2 = p2[1]
    y2 = p2[0]
    plt.plot([x1,x2],[y1, y2], 'k-',linewidth=0.5)

#plt.imshow(peta_gempa, extent=[ 93 , 118.5, -13.3 , 7.7], alpha=0.9)
plt.show()

Thank you for reading. You can find the complete code in my Kaggle notebook.


Imam Muhajir
Data Scientist at KECILIN.ID || Physicist || Writer about Data Analysis, Big Data, Machine Learning, and AI. LinkedIn: https://www.linkedin.com/in/imammuhajir92/