A Visual Approach to Determining a Strategic Location for a New Covid-19 Testing Center in Jakarta Using Data Science / Data Analytics Tool Kits

Rama Yusvana
Apr 22, 2020

1. Introduction / Business Problem

Since the beginning of 2020, Jakarta, like hundreds (if not thousands) of other cities around the world, has been under attack by an invisible army called the ‘novel coronavirus’, whose disease is known as ‘Covid-19’. While every effort has been focused on solving or minimizing the problems at hand, other professionals such as data scientists may need to assess the current situation in the city, such as the availability, number and geographical distribution (i.e. location) of health infrastructure such as virus testing centers or ‘specialist’ hospitals capable of detecting and handling a disease that standard hospitals cannot. This is because testing for a viral disease requires more sophisticated instruments, reagents and assay techniques and, more importantly, high-throughput sample analysis platforms. As a former nucleic acid testing scientist for HIV and Hepatitis C at a federally funded laboratory in Australia, I am well aware of this ‘difference’, and can say a lot more about it on another occasion. In this article, however, I would like to describe a simple analysis for determining a strategic location for a new Covid-19 testing center in any city (Jakarta in this case) using data science / data analytics tool sets.

Case Scenario:

As one of the main stakeholders, the city government may need to allocate additional resources, such as building a new testing center, to better prepare the city for a potential future outbreak of another virus. Alternatively, private investors could take up the business investment opportunity, especially in a big city like Jakarta. Typically, the investors would be interested to know:

  1. The number of Covid-19 positive patients and their relative distribution within the city.
  2. The number and locations of the existing and approved Covid-19 testing centers in the city (i.e. the competitors).
  3. TWO (2) candidate strategic locations for the new Covid-19 testing center, relatively far from the competition.
  4. The population density of the proposed area (a social distancing indicator).
  5. A basic profile of the proposed neighborhood (e.g. the presence of distinct clusters such as residential areas or business / office areas, and the number and types of other business venues).

For point 1, the investors obviously want to know the extent of the epidemic in the area. For point 2, they need to make sure that their investment is located relatively far from any competitor with similar facilities at the time their business venture starts. Point 3 requires visualization, which we will discuss below. Point 4 indicates the potential target customers: the higher the population density, the larger the pool of potential users of the facility. Point 5 indicates the current level of business activity in the proposed area: the more business venues there are, the higher the level of business activity. Investors would generally prefer a neighborhood where many different businesses are already established.

2. Data

Based on the investors' requirements above, the information needed for this analysis will come from the following online sources:

  1. The number of positive patients: https://corona.jakarta.go.id/en/peta-persebaran, which updates daily the number of positive Covid-19 patients as well as the numbers recovered, deceased and under observation. The site provides an Excel file that the public can download for analysis.
  2. The population density, obtained from the Wikipedia page for Jakarta (https://en.wikipedia.org/wiki/Jakarta). Jakarta is the special capital region of Indonesia, with a total area of 661.5 square km and a population of 10,770,487 in 2020. However, I will scrape the table on the Wikipedia page, which gives the population density only up to 2014. Jakarta has 6 main cities or municipalities, as shown in Section 3 (Government & Politics) of that page.
  3. The existing and approved Covid-19 testing centers or ‘specialist’ hospitals, from the website of the Ministry of Health of Indonesia (http://www.yankes.kemkes.go.id/).
  4. The Foursquare API, to collect information on other venues/competitors in the neighborhoods of Jakarta.
  5. The OpenCage Geocoder API, to collect the latitude and longitude of cities, areas or facilities (e.g. hospitals).

In addition to these resources, there are other highly relevant sources that could be used, for example for the number of hospital beds available for patients. This can be obtained (for future reference) from the Ministry of Health website (http://www.yankes.kemkes.go.id/). I also used other tools such as Folium for visualization and BeautifulSoup for scraping websites, as described in more detail in the following sections. To gather all the data, I will start by installing and importing all the required dependencies (libraries) for the data analysis toolkit.

import requests  # library to handle HTTP requests
import pandas as pd  # library for data analysis
import numpy as np  # library to handle data in a vectorized manner
import random  # library for random number generation
!conda install -c conda-forge geopy --yes
from geopy.geocoders import Nominatim  # module to convert an address into latitude and longitude values
# libraries for displaying images
from IPython.display import Image, HTML, display_html

# transforming a JSON file into a pandas dataframe
from pandas import json_normalize  # pandas >= 1.0; on older versions use: from pandas.io.json import json_normalize
!conda install -c conda-forge folium=0.5.0 --yes
import folium  # plotting library
from bs4 import BeautifulSoup
from sklearn.cluster import KMeans
import matplotlib.cm as cm
import matplotlib.colors as colors
print('Folium installed')
print('Libraries imported.')
Collecting package metadata (current_repodata.json): done
Solving environment: done
# All requested packages already installed.
Collecting package metadata (current_repodata.json): done
Solving environment: done
# All requested packages already installed.
Folium installed
Libraries imported.

2.1. Population Data of Jakarta City from Wikipedia site

The following describes a web scraping technique commonly used by data scientists to get information from a public website. After inspecting the original Wikipedia page, one needs to isolate the relevant HTML code for the desired table using the Requests and BeautifulSoup libraries.

# Use HTML5Lib as the parser
source = requests.get('https://en.wikipedia.org/wiki/Jakarta').text
soup = BeautifulSoup(source, 'html5lib')
my_table = soup.find('table',{'class':'wikitable sortable'})
my_table
<table class="wikitable sortable" style="margin-left: auto; margin-right: auto; border: none;">
<caption>Jakarta's cities/municipalities (<i>Kota Administrasi/Kotamadya</i>)
</caption>
<tbody><tr>
<th>City/regency
</th>
<th width="70">Area (km<sup>2</sup>)
</th>
<th style="width:100px;">Total population (2010 Census)
</th>
<th style="width:100px;">Total population (2014)<sup class="reference" id="cite_ref-pop2014_5-2"><a href="#cite_note-pop2014-5">[5]</a></sup>
</th>
<th style="width:100px;">Population density<br/>(per km<sup>2</sup>)<br/>in 2010
</th>
<th>Population density<br/>(per km<sup>2</sup>)<br/>in 2014
</th>
<th><a href="/wiki/Human_Development_Index" title="Human Development Index">HDI</a><br/><sup class="reference" id="cite_ref-71"><a href="#cite_note-71">[71]</a></sup> 2015 estimates
</th></tr>
<tr style="text-align:right;">
<th><a href="/wiki/South_Jakarta" title="South Jakarta">South Jakarta</a>
</th>
<td>141.27</td>
<td>2,057,080</td>
<td>2,164,070</td>
<td>14,561
</td>
<td>15,319</td>
<td>0.833 (<span style="color:green;">Very High</span>)
</td></tr>
<tr style="text-align:right;">
<th><a href="/wiki/East_Jakarta" title="East Jakarta">East Jakarta</a>
</th>
<td>188.03</td>
<td>2,687,027</td>
<td>2,817,994</td>
<td>14,290
</td>
<td>14,987</td>
<td>0.807 (<span style="color:green;">Very High</span>)
</td></tr>
<tr style="text-align:right;">
<th><a href="/wiki/Central_Jakarta" title="Central Jakarta">Central Jakarta</a>
</th>
<td>48.13</td>
<td>898,883</td>
<td>910,381</td>
<td>18,676
</td>
<td>18,915</td>
<td>0.796 (<span style="color:green;">High</span>)
</td></tr>
<tr style="text-align:right;">
<th><a href="/wiki/West_Jakarta" title="West Jakarta">West Jakarta</a>
</th>
<td>129.54</td>
<td>2,278,825</td>
<td>2,430,410</td>
<td>17,592
</td>
<td>18,762</td>
<td>0.797 (<span style="color:green;">High</span>)
</td></tr>
<tr style="text-align:right;">
<th><a href="/wiki/North_Jakarta" title="North Jakarta">North Jakarta</a>
</th>
<td>146.66</td>
<td>1,645,312</td>
<td>1,729,444</td>
<td>11,219
</td>
<td>11,792</td>
<td>0.796 (<span style="color:green;">High</span>)
</td></tr>
<tr style="text-align:right;">
<th><a href="/wiki/Thousand_Islands_(Indonesia)" title="Thousand Islands (Indonesia)">Thousand Islands</a>
</th>
<td>8.7</td>
<td>21,071</td>
<td>23,011</td>
<td>2,422
</td>
<td>2,645</td>
<td>0.688 (<span style="color: #fc0;">Medium</span>)
</td></tr></tbody></table>
# Isolate the HTML code 'a' tag containing the 6 main cities / municipalities of Jakarta
a_tag = my_table.find_all('a')
a_tag
[<a href="#cite_note-pop2014-5">[5]</a>,
<a href="/wiki/Human_Development_Index" title="Human Development Index">HDI</a>,
<a href="#cite_note-71">[71]</a>,
<a href="/wiki/South_Jakarta" title="South Jakarta">South Jakarta</a>,
<a href="/wiki/East_Jakarta" title="East Jakarta">East Jakarta</a>,
<a href="/wiki/Central_Jakarta" title="Central Jakarta">Central Jakarta</a>,
<a href="/wiki/West_Jakarta" title="West Jakarta">West Jakarta</a>,
<a href="/wiki/North_Jakarta" title="North Jakarta">North Jakarta</a>,
<a href="/wiki/Thousand_Islands_(Indonesia)" title="Thousand Islands (Indonesia)">Thousand Islands</a>]

To import the population data and reconstruct the table in our report, the following lines of Python code walk through the process step by step:

# Create an empty list, then fill it with the titles of the 'a' tags,
# which include the 6 main cities / municipalities of Jakarta
cities = []
for data in my_table.find_all('a', text=True):
    cities.append(data.get('title'))
df = pd.DataFrame()
cities = cities[3:]  # Drop the first three links (two citations and HDI), which are not cities
df['City'] = cities
# Isolate population data for the 6 main cities / municipalities in Jakarta
population = []
for data2 in my_table.find_all('td'):
    population.append(data2.text.strip())  # strip the trailing newline that some cells carry
# print(population)
# Construct the pandas dataframe: each city owns 6 consecutive cells
df_population = pd.DataFrame()
df_population[df['City'][0]] = population[0:6]
df_population[df['City'][1]] = population[6:12]
df_population[df['City'][2]] = population[12:18]
df_population[df['City'][3]] = population[18:24]
df_population[df['City'][4]] = population[24:30]
df_population[df['City'][5]] = population[30:36]
df_population = df_population.transpose()
df_population.columns = ['Area (km^2)', 'Total Population 2010', 'Total Population 2014', 'Population Density 2010', 'Population Density 2014', 'HDI']
df_population.rename_axis(index='City', inplace=True)
df_population.reset_index(inplace=True)
df_population  # Display the dataframe
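The six hard-coded slices above can also be expressed as a single chunking step. The following is a minimal sketch rather than the code used for this report: it assumes the flat list of cell texts has already been extracted (the values are copied from the table output shown earlier), uses only the standard library, and `chunk_rows` is a hypothetical helper name.

```python
# Sketch: split a flat list of table cells into one row per city.
# The cell values are copied from the Wikipedia table output shown earlier.
cities = ['South Jakarta', 'East Jakarta', 'Central Jakarta',
          'West Jakarta', 'North Jakarta', 'Thousand Islands (Indonesia)']
cells = [
    '141.27', '2,057,080', '2,164,070', '14,561', '15,319', '0.833 (Very High)',
    '188.03', '2,687,027', '2,817,994', '14,290', '14,987', '0.807 (Very High)',
    '48.13', '898,883', '910,381', '18,676', '18,915', '0.796 (High)',
    '129.54', '2,278,825', '2,430,410', '17,592', '18,762', '0.797 (High)',
    '146.66', '1,645,312', '1,729,444', '11,219', '11,792', '0.796 (High)',
    '8.7', '21,071', '23,011', '2,422', '2,645', '0.688 (Medium)',
]

def chunk_rows(flat, width):
    """Return consecutive slices of `flat`, each `width` items long."""
    return [flat[i:i + width] for i in range(0, len(flat), width)]

# Pair each city with its own row of 6 cells
rows = dict(zip(cities, chunk_rows(cells, 6)))
print(rows['Central Jakarta'])
```

With this shape, a table that gains or loses a city needs no change to the slicing logic, only to the row width if the number of columns changes.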

As can be seen from the table, Jakarta consists of 6 municipalities or cities; the table gives the total area of each, along with population figures up to 2014. Because 2020 figures are unavailable, I will assume in this analysis that the population growth rate is similar across the cities, so that the ranking by population density in 2020 is the same as in 2014. That is, Central Jakarta still has the highest population density in 2020, followed by West Jakarta and South Jakarta.
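As a quick sanity check on the ranking, the 2014 densities can be recomputed from the area and population columns. This is only a sketch using the standard library; the figures are copied from the Wikipedia table, and `to_number` is a hypothetical helper for stripping the thousands separators.

```python
# Sketch: recompute the 2014 population density from area and population.
# Figures are copied from the Wikipedia table (area in km^2, 2014 population).
table_2014 = {
    'South Jakarta': ('141.27', '2,164,070'),
    'East Jakarta': ('188.03', '2,817,994'),
    'Central Jakarta': ('48.13', '910,381'),
    'West Jakarta': ('129.54', '2,430,410'),
    'North Jakarta': ('146.66', '1,729,444'),
    'Thousand Islands': ('8.7', '23,011'),
}

def to_number(text):
    """Convert a string such as '2,164,070' to a float."""
    return float(text.replace(',', ''))

# density = population / area, in people per km^2
density = {city: to_number(pop) / to_number(area)
           for city, (area, pop) in table_2014.items()}

# Cities ordered from most to least densely populated
ranking = sorted(density, key=density.get, reverse=True)
for city in ranking:
    print('{:<18} {:>8,.0f} people per km^2'.format(city, density[city]))
```

The recomputed values agree with the density column of the table to the nearest whole number, which confirms that the scraped strings were converted correctly.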

2.2. Import the Covid-19 Data from the Official Website for the City of Jakarta

The following section describes how the data are obtained from the city's official Covid-19 website, followed by a series of data preparation and cleaning steps that isolate the columns required for this analysis.

# Download the latest data from https://riwayat-file-covid-19-dki-jakarta-jakartagis.hub.arcgis.com/ to local drive
df_covid = pd.read_excel(open('/Users/ramayusvana/Desktop/Standar Kelurahan Data Corona (20 April 2020 Pukul 10.30).xlsx', 'rb'), sheet_name='data_kecamatan')
df_covid.head()
# Remove / drop irrelevant columns for this analysis
df_covid.drop(columns =["ID_KEC", "Nama_provinsi", "ODP", "Proses Pemantauan", "Selesai Pemantauan", "PDP", "Masih Dirawat", "Pulang dan Sehat", "Dirawat", "Sembuh", "Meninggal", "Self Isolation"], inplace=True)
df_covid.head()
# Remove irrelevant rows for this analysis
indexNames = df_covid[(df_covid['nama_kecamatan'] == 'BELUM DIKETAHUI') | (df_covid['nama_kota'] == 'LUAR DKI JAKARTA')].index
df_covid.drop(indexNames, inplace=True)
df_covid.head()
df_covid.tail()
# Remove the last row containing NaN and sum of values
df_covid = df_covid.drop([df_covid.index[-1]])
df_covid.tail()
# Rename columns name to English
df_covid = df_covid.rename(columns = {'nama_kota':'CITY', 'nama_kecamatan':'DISTRICT', 'POSITIF':'POSITIVE'})
df_covid.head()
# Get the number and the names of each municipality or city in Jakarta
df_covid['CITY'].unique()
array(['JAKARTA TIMUR', 'JAKARTA PUSAT', 'JAKARTA BARAT',
'JAKARTA SELATAN', 'JAKARTA UTARA', 'KAB.ADM.KEP.SERIBU'],
dtype=object)
# Get the number of districts (i.e. counts) in Jakarta.
# Get the mean number of positive cases of each district in the city and the standard deviation
df_covid.describe()

# Check if there are any missing or null values
df_covid.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 44 entries, 1 to 45
Data columns (total 3 columns):
CITY 44 non-null object
DISTRICT 44 non-null object
POSITIVE 44 non-null int64
dtypes: int64(1), object(2)
memory usage: 1.4+ KB
# Get the number of total / confirmed POSITIVE cases in Jakarta per 20 April 2020
df_covid['POSITIVE'].sum()
1895
# Group the data by CITY
df_covid_grp = df_covid.groupby(['CITY'])
df_covid_grp
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1a2338fd50>
df_covid_grp['POSITIVE'].sum()
CITY
JAKARTA BARAT 388
JAKARTA PUSAT 321
JAKARTA SELATAN 445
JAKARTA TIMUR 466
JAKARTA UTARA 275
KAB.ADM.KEP.SERIBU 0
Name: POSITIVE, dtype: int64

From these data mining, preparation and exploration activities, the total number of confirmed Covid-19 positive cases in Jakarta is 1,895 as of 20 April 2020, distributed across the 6 main municipalities or cities and 44 districts (or ‘kecamatan’) of Jakarta, out of a population of just over 10 million. East Jakarta (Jakarta Timur) has the highest number of POSITIVE cases, with 466 confirmed positives. As in any other city, each municipality has many neighborhoods, which can be used to pinpoint the location of the proposed new Covid-19 testing center and to analyse the neighborhood further using the Foursquare API and Folium map visualization.
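Raw counts favor the larger cities, so it can help to also look at cases per 100,000 residents. The sketch below is an illustration, not part of the original analysis: it combines the grouped totals above with the 2014 Wikipedia populations as a stand-in for 2020 (an assumption, since 2020 figures per city were not available), and matches the Indonesian names to the English ones (Barat = West, Pusat = Central, Selatan = South, Timur = East, Utara = North, Kepulauan Seribu = Thousand Islands).

```python
# Sketch: positive cases per 100,000 residents for each city.
# Case counts: the grouped totals above (20 April 2020).
# Populations: 2014 Wikipedia figures, used as a stand-in for 2020 (assumption).
positive = {
    'JAKARTA BARAT': 388,
    'JAKARTA PUSAT': 321,
    'JAKARTA SELATAN': 445,
    'JAKARTA TIMUR': 466,
    'JAKARTA UTARA': 275,
    'KAB.ADM.KEP.SERIBU': 0,
}
population_2014 = {
    'JAKARTA BARAT': 2430410,
    'JAKARTA PUSAT': 910381,
    'JAKARTA SELATAN': 2164070,
    'JAKARTA TIMUR': 2817994,
    'JAKARTA UTARA': 1729444,
    'KAB.ADM.KEP.SERIBU': 23011,
}

# rate = cases / population * 100,000
per_100k = {city: positive[city] / population_2014[city] * 100000
            for city in positive}

for city in sorted(per_100k, key=per_100k.get, reverse=True):
    print('{:<20} {:>6.1f} cases per 100,000'.format(city, per_100k[city]))
```

Under these assumptions, the ranking by rate need not match the ranking by raw counts: Central Jakarta (Jakarta Pusat) comes out with the highest rate even though East Jakarta has the most cases.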

2.3. List of Existing ‘Approved’ Covid-19 Testing Centers (Specialist Hospitals) in Jakarta

According to information from the Ministry of Health of Indonesia (www.yankes.kemkes.go.id), the following hospitals are the existing reference hospitals for Covid-19 testing in the Jakarta area:

  1. Rumah Sakit Umum Pusat Fatmawati, Cilandak, Jakarta Selatan
  2. Dr. Mintohardjo Naval Hospital, Tanah Abang, Jakarta Pusat
  3. RSUD Pasar Minggu, Pasar Minggu, Jakarta Selatan
  4. Gatot Soebroto Army Hospital, Senen, Jakarta Pusat
  5. RS Umum Bhayangkara Tk. I R. Said Sukanto, Kramat jati, Jakarta Timur
  6. RSUD Cengkareng, Cengkareng, Jakarta Barat
  7. Rs Persahabatan Jaktim, Pulo Gadung, Jakarta Timur
  8. RSPI Prof Dr Sulianti Saroso, Tanjung Priok, Jakarta Utara

Let us construct a pandas dataframe for subsequent data analysis.

# Upload text file containing the names of existing and approved Covid-19 testing centers from local drive
df_hospitals = pd.read_csv(open('/Users/ramayusvana/Desktop/Hospitals.csv', 'r'))
df_hospitals

This sums up our data mining and data exploration section. In the following METHODOLOGY section, we will describe how to take a ‘visual’ approach in order to better understand our data using data science and data analytics tool kits.

3. Methodology

To assist in the analysis, I will use the free* service provided by OpenCage Geocoder (https://opencagedata.com/) to get the latitude and longitude of cities, districts, and particular venues or neighborhoods. We will start by opening an account and installing the required package for our analysis (*terms and conditions apply; please refer to their website for further details).

# Import required package for obtaining Latitude and Longitude of each cities
# Need to get Latitude & Longitude of Jakarta city and the districts
# Get API key from the service provider (Open Cage Geocode)
from opencage.geocoder import OpenCageGeocode
key = 'xxxxxxxxxxx'
geocoder = OpenCageGeocode(key)
query = 'Jakarta, Indonesia'
results = geocoder.geocode(query)
# print (results)
# Isolate only the Latitude & Longitude of Jakarta from the Json file
lat = results[0]['geometry']['lat']
lng = results[0]['geometry']['lng']
print('The Latitude and Longitude of Jakarta is {} and {} respectively.'.format(lat, lng))
The Latitude and Longitude of Jakarta is -6.1753942 and 106.827183 respectively.

Similarly, we can use the API service from OpenCage Geocoder to obtain the latitude and longitude of all districts in Jakarta.

# Get latitude and longitude of all districts
list_lat = []   # create empty list for latitude
list_long = []  # create empty list for longitude
for index, row in df_covid.iterrows():  # iterate over rows in dataframe
    District = row['DISTRICT']
    query = str(District) + ', Jakarta'
    results = geocoder.geocode(query)
    lat = results[0]['geometry']['lat']
    long = results[0]['geometry']['lng']
    list_lat.append(lat)
    list_long.append(long)
# create new columns from lists
df_covid['Latitude'] = list_lat
df_covid['Longitude'] = list_long
df_covid.head(10)
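The loop above indexes `results[0]` directly, so a single lookup that returns nothing would stop the whole run. The following is a hedged sketch of a more tolerant version. It assumes that a failed lookup yields an empty or None result, `geocode_all` is a hypothetical helper name, and `fake_geocode` is a stub standing in for `geocoder.geocode` so that the example runs offline.

```python
# Sketch: geocode a list of place names while tolerating empty results.
# `geocode_fn` stands in for geocoder.geocode; a stub is used here so the
# example runs without network access or an API key.
def geocode_all(names, geocode_fn, suffix=', Jakarta'):
    """Return {name: (lat, lng)}; a name with no result maps to (None, None)."""
    coords = {}
    for name in names:
        results = geocode_fn(str(name) + suffix)
        if results:  # an empty or None result would break results[0]
            geometry = results[0]['geometry']
            coords[name] = (geometry['lat'], geometry['lng'])
        else:
            coords[name] = (None, None)
    return coords

def fake_geocode(query):
    """Hypothetical stand-in that only knows one place."""
    known = {
        'Kembangan, Jakarta': [{'geometry': {'lat': -6.191395, 'lng': 106.7405859}}],
    }
    return known.get(query, [])

coords = geocode_all(['Kembangan', 'Nowhere'], fake_geocode)
print(coords)
```

In the notebook, passing `geocoder.geocode` as `geocode_fn` and `df_covid['DISTRICT']` as `names` would give the same coordinates as the loop above, with missing districts left as `None` for later inspection.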

In addition, we also need the latitude and longitude of all the existing Covid-19 testing centers in Jakarta, whose names we already obtained from the website of the Ministry of Health of Indonesia.

# Get the latitude and longitude of all of the specialist hospitals
list2_lat = []   # create empty list for latitude
list2_long = []  # create empty list for longitude
for index, row in df_hospitals.iterrows():  # iterate over rows in dataframe
    hosp = row['HOSPITAL']
    distr = row[' DISTRICT']  # the key keeps its leading space, presumably matching the CSV header
    query = str(hosp) + ', ' + str(distr) + ', Jakarta'
    results = geocoder.geocode(query)
    lat = results[0]['geometry']['lat']
    long = results[0]['geometry']['lng']
    list2_lat.append(lat)
    list2_long.append(long)
# create new columns from lists
df_hospitals['Latitude'] = list2_lat
df_hospitals['Longitude'] = list2_long
df_hospitals

We then need a map of the city we are interested in (i.e. Jakarta) in order to present our data to the stakeholders using a ‘visualization’ approach. We installed and imported all the required dependencies earlier in the report, so we are now ready to use the Folium library as described in the following section.

# Define the map object and then display using the specified latitude and longitude values
map_jkt = folium.Map(location=[-6.2, 106.8], zoom_start=12)
map_jkt

The map shows the main outer ring roads surrounding the city of Jakarta. It does NOT, however, show the official territorial boundary between the city and the neighboring administrative regions to the east, west and south. Being from Indonesia, I know roughly which neighborhoods belong to Jakarta and which do not. In this scenario, we want to propose a strategic location (i.e. neighborhood) for the investing group within the Jakarta administrative area.

4. Results

To better understand and estimate the areas that fall under the administrative government of Jakarta, we need to plot all the districts from the Covid-19 dataset together with their latitude and longitude values. The following lines of Python code carry out the task using Folium.

# Construct a map of all district neighborhoods in Jakarta
map_jkt = folium.Map(location=[-6.2, 106.8], zoom_start=11)
for lat, lng, label in zip(df_covid['Latitude'], df_covid['Longitude'], df_covid['DISTRICT']):
    label = folium.Popup(label, parse_html=True)
    folium.CircleMarker(
        location=[lat, lng],
        radius=5,
        popup=label,
        color='blue',
        fill=True,
        fill_color='#3186cc',
        fill_opacity=0.7,
        parse_html=False).add_to(map_jkt)
map_jkt

As you can see from the map above, most of the districts lie within the main outer ring roads (highways) surrounding the city, while the others are situated outside them.

To address the 1st point in our business case scenario, we need to show the extent and distribution of Covid-19 positive patients within the city of Jakarta, based on the numbers that we obtained from the government site. The following lines of Python code achieve this and present the data in a clear visual form.

# Plot a map of Covid-19 distribution in the city of Jakarta per April 20, 2020
map_covid_jkt = folium.Map(location=[-6.2, 106.8], zoom_start=11)
for lat, lon, area, size in zip(df_covid['Latitude'], df_covid['Longitude'], df_covid['DISTRICT'], df_covid['POSITIVE']):
    folium.CircleMarker(
        location=[lat, lon],
        popup=area,
        radius=size/2,  # circle radius grows with the number of positive cases
        color='red',  # a CSS color name ('r' is a matplotlib shorthand, not a CSS color)
        opacity=0.5,
        fill=True,
        fill_opacity=0.5,
        fill_color='red',
    ).add_to(map_covid_jkt)
map_covid_jkt

This map is similar to the one published by the government task force for Covid-19 in Jakarta, which can be seen at https://corona.jakarta.go.id/id/peta-persebaran. As you can see, most regions of Jakarta are now in the ‘RED’ zone, with the radius of each circle representing the relative number of Covid-19 cases in that district. Therefore, if someone from Bekasi wants to go to Soekarno-Hatta International Airport in the west of the city (Tangerang area), the safer route could be along the outer ring road going counter-clockwise rather than through the city. A better presentation of the data would be a ‘slider’ on the map that shows the growth of the circles day by day, or simply an animation of the daily growth of Covid-19 cases in the city. App developers might build an app that alerts road users when they are entering a RED zone within the city. Such an app could save lives!
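On the circle sizes used in the map above: with a radius proportional to the case count, the visible area grows with the square of the count, which can exaggerate large districts. One possible alternative, sketched here as a suggestion rather than what the map above does, is to scale the radius with the square root of the count so that area is proportional to cases. `scaled_radius` and `max_radius` are hypothetical names, and the city totals are used only as sample inputs.

```python
import math

def scaled_radius(count, max_count, max_radius=30.0):
    """Radius in pixels such that the circle AREA is proportional to `count`."""
    if max_count <= 0:
        return 0.0
    return max_radius * math.sqrt(count / max_count)

# Sample inputs: the city totals reported earlier (20 April 2020)
totals = {'JAKARTA BARAT': 388, 'JAKARTA PUSAT': 321, 'JAKARTA SELATAN': 445,
          'JAKARTA TIMUR': 466, 'JAKARTA UTARA': 275, 'KAB.ADM.KEP.SERIBU': 0}
largest = max(totals.values())
radii = {city: scaled_radius(n, largest) for city, n in totals.items()}

for city, r in radii.items():
    print('{:<20} radius {:.1f} px'.format(city, r))
```

The resulting value could be passed as `radius=` to `folium.CircleMarker` in place of `size/2`.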

The next problem we need to solve is to show the locations of the existing and approved Covid-19 testing centers (or reference hospitals) and see how they are distributed relative to each other and across the regions of Jakarta. The following lines of Python code carry out the task. We will first plot the hospitals WITHOUT the RED circles, as those might be distracting.

# Construct a map of all existing Covid-19 testing hospitals in Jakarta
map_hosp = folium.Map(location=[-6.2, 106.8], zoom_start=12)
for lat, lng, hosp in zip(df_hospitals['Latitude'], df_hospitals['Longitude'], df_hospitals['HOSPITAL']):
    label = folium.Popup(hosp, parse_html=True)
    folium.Marker(
        location=[lat, lng],
        popup=label,
        icon=folium.Icon(color='blue', icon='header'),
    ).add_to(map_hosp)
map_hosp

As you can see, the hospitals are quite widely spread out, except for the two hospitals in the south, which are relatively close to each other (i.e. Fatmawati and Pasar Minggu hospitals). Let’s see how well placed they are to serve the positive patients in the city. We can do this by overlaying the two datasets on a single map, as shown in the following code:

# Plot a combined map of Covid-19 distribution in the city of Jakarta & current available testing centers
map_covid_hosp_jkt = folium.Map(location=[-6.2, 106.8], zoom_start=11)
for lat, lon, area, size in zip(df_covid['Latitude'], df_covid['Longitude'], df_covid['DISTRICT'], df_covid['POSITIVE']):
    folium.CircleMarker(
        location=[lat, lon],
        popup=area,
        radius=size/2,
        color='red',  # a CSS color name ('r' is a matplotlib shorthand, not a CSS color)
        opacity=0.5,
        fill=True,
        fill_opacity=0.5,
        fill_color='red',
    ).add_to(map_covid_hosp_jkt)
# Add all existing Covid-19 testing hospitals in Jakarta to the same map
for lat, lng, hosp in zip(df_hospitals['Latitude'], df_hospitals['Longitude'], df_hospitals['HOSPITAL']):
    label = folium.Popup(hosp, parse_html=True)
    folium.Marker(
        location=[lat, lng],
        popup=label,
        icon=folium.Icon(color='blue', icon='header'),
    ).add_to(map_covid_hosp_jkt)
map_covid_hosp_jkt

By analysing this hybrid map, we can see that there are TWO (2) possible sites that lack a Covid-19 testing center. One is in the westernmost part of the city and the other is in the south-east region (near Halim Perdana Kusuma Airport), where there is quite a big blob of circles surrounding the airport area. Zooming in closely on these TWO areas, I would propose:

  1. Kebon Pala neighborhood (East Jakarta)
  2. Kembangan neighborhood (West Jakarta)

as the TWO (2) candidate neighborhoods for an investing group's business venture. The stakeholders could choose their own neighborhoods based on this hybrid map; for illustration purposes, however, I will propose the TWO (2) neighborhoods above. The following lines of Python code visualize their locations (using GREEN pins) together with both the positive Covid-19 cases and the existing competitors. But first, we need to get the latitudes and longitudes of our candidate neighborhoods and put them in a pandas dataframe.

# Construct pandas dataframe
prop_neighborhood = pd.DataFrame({
    'candidate_neigh': ['Kebon Pala', 'Kembangan']
})
# Get their coordinates
list3_lat = []   # create empty list for latitude
list3_long = []  # create empty list for longitude
for index, row in prop_neighborhood.iterrows():  # iterate over rows in dataframe
    neigh = row['candidate_neigh']
    query = str(neigh) + ', Jakarta'
    results = geocoder.geocode(query)
    lat = results[0]['geometry']['lat']
    long = results[0]['geometry']['lng']
    list3_lat.append(lat)
    list3_long.append(long)
# create new columns from lists
prop_neighborhood['Latitude'] = list3_lat
prop_neighborhood['Longitude'] = list3_long
prop_neighborhood
# Plot a combined map of all Covid-19 distribution, current testing centers and the proposed neighborhood (GREEN pin)

map_prop_hosp = folium.Map(location=[-6.2, 106.8], zoom_start=12)

for lat, lon, area, size in zip(df_covid['Latitude'], df_covid['Longitude'], df_covid['DISTRICT'], df_covid['POSITIVE']):
    folium.CircleMarker(
        location=[lat, lon],
        popup=area,
        radius=size/2,
        color='red',  # a CSS color name ('r' is a matplotlib shorthand, not a CSS color)
        opacity=0.5,
        fill=True,
        fill_opacity=0.5,
        fill_color='red',
    ).add_to(map_prop_hosp)

# Add all existing Covid-19 testing hospitals in Jakarta (BLUE markers)

for lat, lng, hosp in zip(df_hospitals['Latitude'], df_hospitals['Longitude'], df_hospitals['HOSPITAL']):
    label = folium.Popup(hosp, parse_html=True)
    folium.Marker(
        location=[lat, lng],
        popup=label,
        icon=folium.Icon(color='blue', icon='header'),
    ).add_to(map_prop_hosp)

# Add the two candidate neighborhoods (GREEN pins)

for lat, lng, neigh in zip(prop_neighborhood['Latitude'], prop_neighborhood['Longitude'], prop_neighborhood['candidate_neigh']):
    label = folium.Popup(neigh, parse_html=True)
    folium.Marker(
        location=[lat, lng],
        popup=label,
        icon=folium.Icon(color='green', icon='pushpin'),
    ).add_to(map_prop_hosp)

map_prop_hosp

The hybrid map clearly shows that our TWO (2) candidate neighborhoods (GREEN pins) are quite well situated relative to the competitor hospitals. The hybrid map also shows the potential users of the facility based on the extent of Covid-19 cases (RED circles) as of April 20, 2020.

In the following DISCUSSION section, we will choose which of the TWO (2) neighborhoods is the best candidate for a business venture and discuss the reasons behind the choice by presenting relevant data on both.

5. Discussion

In our business scenario, we have to propose ONE (1) best candidate neighborhood to the investing group out of the TWO (2) chosen areas, which are:

  1. Kebon Pala neighborhood (East Jakarta)
  2. Kembangan neighborhood (West Jakarta)

To assist in our analysis, we will use the FOURSQUARE API to gather data relevant to a particular location. To do this, we need to open an account with them and go to the developer section to get a client ID and client secret. We installed and imported all the required dependencies for this task earlier, so we can proceed with the following Python code:

# Explore neighborhood for one of the candidate location (Kembangan) using FourSquare API
# Define Foursquare Credentials and Version
CLIENT_ID = 'xxxxxx' # your Foursquare ID
CLIENT_SECRET = 'xxxxxx' # your Foursquare Secret
VERSION = '20180605' # Foursquare API version
print('Your credentials:')
#print('CLIENT_ID: ' + CLIENT_ID) - not printed for privacy
#print('CLIENT_SECRET:' + CLIENT_SECRET) - not printed for privacy
Your credentials:

Next we need to create a variable for latitude and longitude of BOTH Kebon Pala and Kembangan area. We can do this by executing the following lines of Python code:

# Get the Kebon Pala neighborhood's latitude and longitude values.
neighborhood_latitude1 = prop_neighborhood.loc[0, 'Latitude'] # neighborhood latitude value
neighborhood_longitude1 = prop_neighborhood.loc[0, 'Longitude'] # neighborhood longitude value
neighborhood_name1 = prop_neighborhood.loc[0, 'candidate_neigh'] # neighborhood name
print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name1,
    neighborhood_latitude1,
    neighborhood_longitude1))
Latitude and longitude values of Kebon Pala are -6.253495, 106.8766303.
# Get the Kembangan neighborhood's latitude and longitude values.
neighborhood_latitude2 = prop_neighborhood.loc[1, 'Latitude'] # neighborhood latitude value
neighborhood_longitude2 = prop_neighborhood.loc[1, 'Longitude'] # neighborhood longitude value
neighborhood_name2 = prop_neighborhood.loc[1, 'candidate_neigh'] # neighborhood name
print('Latitude and longitude values of {} are {}, {}.'.format(neighborhood_name2,
    neighborhood_latitude2,
    neighborhood_longitude2))
Latitude and longitude values of Kembangan are -6.191395, 106.7405859.
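Coordinates like these also make it possible to put a number on "relatively far", instead of judging distance by eye on the map. The sketch below uses the haversine formula with only the standard library. Because the hospital coordinates are not reproduced in this article, it measures each candidate against the Jakarta city-center coordinates obtained earlier, purely as a stand-in; `haversine_km` is a hypothetical helper name.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

# Coordinates printed earlier in this article
jakarta_center = (-6.1753942, 106.827183)
candidates = {
    'Kebon Pala': (-6.253495, 106.8766303),
    'Kembangan': (-6.191395, 106.7405859),
}

# Distance from each candidate to the reference point (stand-in for a hospital)
distances = {name: haversine_km(*point, *jakarta_center)
             for name, point in candidates.items()}
for name, d in distances.items():
    print('{} is about {:.1f} km from the city-center coordinates'.format(name, d))
```

In the notebook, the same function could be applied to each row of `df_hospitals` to find the nearest existing testing center for each candidate neighborhood.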

Now, let’s try to get the top 100 venues that are within Kebon Pala neighborhood and are within a radius of 500 meters of our candidate Covid-19 testing center using FOURSQUARE API. First, let’s create the GET request URL. Name that URL, url1.

# Get url for the API in Kebon Pala neighborhood
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius in meter
# create URL
url1 = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
CLIENT_ID,
CLIENT_SECRET,
VERSION,
neighborhood_latitude1,
neighborhood_longitude1,
radius,
LIMIT)
# url1 - not printed for privacy
# Send the GET request and examine the results for Kebon Pala neighborhood
results1 = requests.get(url1).json()
# results1 - not printed for shortening of the report
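Since the same URL is built again for the second neighborhood, the string formatting can be wrapped in a small helper. This is a sketch under stated assumptions: `build_explore_url` is a hypothetical name, the endpoint and parameter names are the ones used in the URL above, and the credentials are placeholders. It uses `urllib.parse.urlencode` from the standard library so that parameters are named explicitly and escaped consistently.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, version, lat, lng, radius, limit):
    """Build the venues/explore URL used above from named parameters."""
    params = {
        'client_id': client_id,
        'client_secret': client_secret,
        'v': version,
        'll': '{},{}'.format(lat, lng),
        'radius': radius,
        'limit': limit,
    }
    # safe=',' keeps the comma in 'll' as-is instead of percent-encoding it
    return 'https://api.foursquare.com/v2/venues/explore?' + urlencode(params, safe=',')

# Placeholder credentials; coordinates are those of Kebon Pala printed earlier
url_demo = build_explore_url('xxxxxx', 'xxxxxx', '20180605',
                             -6.253495, 106.8766303, 500, 100)
print(url_demo)
```

Calling the helper once per neighborhood avoids copying the format string, which is where a variable name is easy to get wrong.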

The request returns a JSON document for the target area. We then need to isolate the specific fragment of this JSON that is relevant to our search. To do this, we define a custom function that does the job, using the following lines of code:

# Define function that extracts the category of the venue based on the returned JSON file
def get_category_type(row):
    # The category list lives under 'categories' or, after flattening, 'venue.categories'
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']

    if len(categories_list) == 0:
        return None
    else:
        return categories_list[0]['name']
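To see what the function does before applying it to real data, it can be tried on a few hand-made rows. The rows below are hypothetical and only mimic the shape of the flattened Foursquare response (a list of category dictionaries under `venue.categories` or `categories`); the function is repeated so that the example runs on its own.

```python
def get_category_type(row):
    """Return the first category name of a venue row, or None if it has none."""
    try:
        categories_list = row['categories']
    except KeyError:
        categories_list = row['venue.categories']
    if len(categories_list) == 0:
        return None
    return categories_list[0]['name']

# Hypothetical rows shaped like the flattened Foursquare response
sample_rows = [
    {'venue.name': 'Venue A', 'venue.categories': [{'name': 'Coffee Shop'}]},
    {'venue.name': 'Venue B', 'venue.categories': []},
    {'name': 'Venue C', 'categories': [{'name': 'Pharmacy'}, {'name': 'Clinic'}]},
]

labels = [get_category_type(row) for row in sample_rows]
print(labels)
```

Only the first category of each venue is kept, and a venue with no categories yields `None`.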

Next, we will use the function to extract information from the JSON file related to venues in the Kebon Pala neighborhood. The following line of code should do the trick:

venues1 = results1['response']['groups'][0]['items']

nearby_venues1 = json_normalize(venues1) # flatten JSON
# filter columns
filtered_columns1 = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues1 = nearby_venues1.loc[:, filtered_columns1]
# filter the category for each row
nearby_venues1['venue.categories'] = nearby_venues1.apply(get_category_type, axis=1)
# clean columns
nearby_venues1.columns = [col.split(".")[-1] for col in nearby_venues1.columns]
nearby_venues1
# Check how many venues were returned by Foursquare?
print('{} venues were returned by Foursquare for {} neighborhood.'.format(nearby_venues1.shape[0], prop_neighborhood.iloc[0,0]))
3 venues were returned by Foursquare for Kebon Pala neighborhood.

As we can see, Kebon Pala returns only 3 venues, which suggests that little business activity is established in the area. Let's see whether our second neighborhood, Kembangan, returns more. We will repeat the same analysis that we did for Kebon Pala, this time with the parameters for the Kembangan area.
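
Since the same extraction is run once per neighborhood, it could also be wrapped in a small reusable function. The following is only a sketch in plain Python: the function name extract_venues and the sample response are assumptions, and the nesting (response, groups, items, venue) is the one implied by the column names used in this report.

```python
def extract_venues(results):
    """Return a list of (name, category, lat, lng) tuples from an explore response."""
    items = results['response']['groups'][0]['items']
    venues = []
    for item in items:
        venue = item['venue']
        categories = venue.get('categories', [])
        # Keep only the first category name, or None when the list is empty
        category = categories[0]['name'] if categories else None
        location = venue['location']
        venues.append((venue['name'], category, location['lat'], location['lng']))
    return venues

# Hypothetical response with two venues, for illustration only
sample_results = {'response': {'groups': [{'items': [
    {'venue': {'name': 'Venue A', 'categories': [{'name': 'Coffee Shop'}],
               'location': {'lat': -6.1910, 'lng': 106.7400}}},
    {'venue': {'name': 'Venue B', 'categories': [],
               'location': {'lat': -6.1920, 'lng': 106.7410}}},
]}]}}

sample_venues = extract_venues(sample_results)
print('{} venues were found.'.format(len(sample_venues)))
```

With a function like this, each neighborhood needs only one call on its own results object.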

Let's begin by trying to get the top 100 venues that are within Kembangan neighborhood and are within a radius of 500 meters of our candidate Covid-19 testing center using FOURSQUARE API. First, let's create the GET request URL. Name that URL, url2.

# Get url for the API in Kembangan neighborhood
LIMIT = 100 # limit of number of venues returned by Foursquare API
radius = 500 # define radius in meter
# create URL
url2 = 'https://api.foursquare.com/v2/venues/explore?&client_id={}&client_secret={}&v={}&ll={},{}&radius={}&limit={}'.format(
CLIENT_ID,
CLIENT_SECRET,
VERSION,
neighborhood_latitude2,
neighborhood_longitude2,
radius,
LIMIT)
# url2 - not printed for privacy

Next, let's send the request using the requests library and name the query results for the Kembangan area results2.

# Send the GET request and examine the results
results2 = requests.get(url2).json()
# results2 - not printed for shortening of the report

Next, we will use the above function (get_category_type) to extract information from the JSON file related to venues in the Kembangan neighborhood. The following line of code should do the trick:

venues2 = results2['response']['groups'][0]['items']

nearby_venues2 = json_normalize(venues2) # flatten JSON

# filter columns
filtered_columns2 = ['venue.name', 'venue.categories', 'venue.location.lat', 'venue.location.lng']
nearby_venues2 = nearby_venues2.loc[:, filtered_columns2]

# filter the category for each row
nearby_venues2['venue.categories'] = nearby_venues2.apply(get_category_type, axis=1)

# clean columns
nearby_venues2.columns = [col.split(".")[-1] for col in nearby_venues2.columns]
nearby_venues2
# Check how many venues were returned by Foursquare?
print('{} venues were returned by Foursquare for {} neighborhood.'.format(nearby_venues2.shape[0], prop_neighborhood.iloc[1,0]))
72 venues were returned by Foursquare for Kembangan neighborhood.

As the results returned by the Foursquare API show, Kembangan already has far more established business activity than Kebon Pala, and we learned this without visiting the neighborhood ourselves. We will still need to visit at some point during our analysis to take actual photos of the area for the investing group. In the meantime, we can present a pie chart of the business categories that have more than one venue, as shown below:

# Sort venues around Kembangan area
df_kembangan_neigh = nearby_venues2.groupby('categories').count()
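# 'cluster_label' only exists once the clustering step further below has been run, so ignore it if missing
df_kembangan_neigh.drop(columns=['lat', 'lng', 'cluster_label'], errors='ignore', inplace=True)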
df_kembangan_neigh.sort_values(by='name', ascending=False, inplace=True)
df_kembangan_neigh1 = df_kembangan_neigh.iloc[0:17]
df_kembangan_neigh2 = df_kembangan_neigh.iloc[17:]
df_kembangan_neigh1.reset_index()
# Plot the neighborhood business categories for Kembangan area (having greater than 1 venue)
plot = df_kembangan_neigh1.plot.pie(y='name', figsize=(10, 8), legend=False, label=None)
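
The slice iloc[0:17] above depends on the particular data that were returned. As a hedged alternative, the sketch below selects the categories with more than one venue by their counts instead of by position. The function name and the sample category list are hypothetical.

```python
from collections import Counter

def categories_with_multiple_venues(categories, min_count=2):
    """Return (category, count) pairs with at least min_count venues, largest first."""
    counts = Counter(c for c in categories if c is not None)
    return [(name, n) for name, n in counts.most_common() if n >= min_count]

# Hypothetical category column, for illustration only
sample_categories = ['Coffee Shop', 'Bakery', 'Coffee Shop', 'Noodle House',
                     'Bakery', 'Coffee Shop', 'Pharmacy', None]

print(categories_with_multiple_venues(sample_categories))
# [('Coffee Shop', 3), ('Bakery', 2)]
```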

Clearly, Kembangan is the better choice in terms of the diversity of business activities already established in the area. We can analyze the area further, for example its population density and any clustering of these venues (such as clusters of residential areas or of business / office areas), based on the Euclidean distance of these venues to our proposed Covid-19 testing center in the Kembangan area.
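
One caveat on distances: latitude and longitude are measured in degrees, not meters. Near Jakarta the difference is small (at a latitude of about 6 degrees, one degree of longitude is roughly 0.994 times the length of one degree of latitude), so treating the coordinates as planar is a reasonable approximation at a 500-meter scale. When a distance in meters is needed, the haversine formula can be used. The sketch below is a minimal version; the function name, the Earth radius constant and the sample venue are assumptions, while the center coordinates are the Kembangan values printed earlier.

```python
import math

EARTH_RADIUS_M = 6371000  # mean Earth radius in meters (approximation)

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

center = (-6.191395, 106.7405859)  # Kembangan coordinates from this report
venue = (-6.1900, 106.7420)        # hypothetical venue, for illustration only

distance = haversine_m(center[0], center[1], venue[0], venue[1])
print('The venue is about {:.0f} m from the proposed site.'.format(distance))
```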

To address the next question from the investing group, population density, we can refer to the population data table shown earlier in this report and plot the population density of the Jakarta areas.

# Plot bar chart for the population density in Jakarta (year 2014)
%matplotlib inline
import matplotlib.pyplot as plt
# Remove comma and convert data in column Population Density 2014 to numeric values
popden = df_population['Population Density 2014'].str.replace(',', '').astype(int)
# Visualize the Population Density data as a Bar Chart
plt.barh(df_population['City'], popden)
plt.xlabel('Population Density 2014') # add x-label to the plot
plt.ylabel('') # leave the y-label empty
plt.title('Jakarta Cities by Population Density') # add title to the plot
plt.show()
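
The conversion step above (removing the thousands separator before casting to an integer) can be checked in isolation. The sketch below does the same thing in plain Python; the function name and the two sample rows are made up and are not the actual Jakarta figures.

```python
def parse_density(text):
    """Convert a string such as '12,345' to the integer 12345."""
    return int(text.replace(',', '').strip())

# Hypothetical rows, NOT the real population table
sample_rows = [('City A', '6,789'), ('City B', '12,345')]

# Rank the cities from the highest to the lowest density
ranked = sorted(((city, parse_density(density)) for city, density in sample_rows),
                key=lambda pair: pair[1], reverse=True)
print(ranked)  # [('City B', 12345), ('City A', 6789)]
```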

The bar chart shows that West Jakarta, the city that contains the Kembangan neighborhood, has the highest population density after Central Jakarta. Together with the numerous business venues in the area, this high density could also make social distancing harder to maintain in Kembangan than in the Kebon Pala area.

The next challenge is to gain slightly more insight into the profile of the Kembangan area. As mentioned in our business scenario, the investing group might ask whether there are any clusters in the neighborhood. To keep our analysis simple, we will use a Euclidean (distance-based) clustering technique, which is a form of unsupervised machine learning. In particular, we will use K-means clustering. To start, we need to choose a suitable K-value. We will fit K-means for a range of K-values and compare the resulting scores in an elbow curve. The following lines of code carry out the task.

# Apply unsupervised Machine Learning clustering technique to the neighborhood data in Kembangan
K_clusters = range(1,10)
kmeans = [KMeans(n_clusters=i) for i in K_clusters]
# Score every K on both coordinates, the same columns used for the final clustering
coords = nearby_venues2[['lat', 'lng']]
score = [kmeans[i].fit(coords).score(coords) for i in range(len(kmeans))]
# Visualize
plt.plot(K_clusters, score)
plt.xlabel('Number of Clusters')
plt.ylabel('Score')
plt.title('Elbow Curve')
plt.show()
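
For readers wondering what the plotted score measures: in scikit-learn, KMeans.score returns the negative of the K-means objective, that is, minus the sum of squared distances from each point to its nearest cluster center. The sketch below computes that sum by hand for a toy data set; the function name, the points and the centers are made up for illustration.

```python
def inertia(points, centers):
    """Sum of squared distances from each point to its nearest center."""
    total = 0.0
    for px, py in points:
        total += min((px - cx) ** 2 + (py - cy) ** 2 for cx, cy in centers)
    return total

# Toy data: two tight groups of two points each
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
one_center = [(5, 1)]
two_centers = [(0, 1), (10, 1)]

print(inertia(points, one_center))   # 104.0
print(inertia(points, two_centers))  # 4.0
```

The sharp drop from one center to two, followed by much smaller gains for additional centers, is the 'elbow' that we look for in the curve.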

The X-axis of the plot shows the K-values that we can use for our clustering analysis. As we can see from the chart, the curve starts to flatten at K = 3. Therefore, we will use a K-value of 3 to cluster the neighborhood surrounding our proposed Covid-19 testing center. The following lines of code assign a cluster label to every venue within the 500-meter radius of our Covid-19 testing center in the Kembangan area:

kmeans = KMeans(n_clusters = 3, init ='k-means++')
coords = nearby_venues2[['lat', 'lng']] # Venue coordinates used for clustering.
nearby_venues2['cluster_label'] = kmeans.fit_predict(coords) # Compute k-means clustering and label each venue.
centers = kmeans.cluster_centers_ # Coordinates of cluster centers.
labels = kmeans.labels_ # Labels of each point
nearby_venues2
# Check whether all the cluster labels exist in the data
list(nearby_venues2['cluster_label'].unique())
[0, 2, 1]

To better visualize the clustering of our neighborhood, we create a custom function called 'regioncolors' that assigns a color to each venue within the 500-meter radius of our proposed facility, based on its cluster label. The following lines of code do this.

def regioncolors(counter):
    if counter['cluster_label'] == 0:
        return 'green'
    elif counter['cluster_label'] == 1:
        return 'blue'
    elif counter['cluster_label'] == 2:
        return 'red'
    else:
        return 'error'

nearby_venues2["color"] = nearby_venues2.apply(regioncolors, axis=1)
nearby_venues2
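
The same mapping can also be expressed as a dictionary lookup, which is easier to extend if a different K-value is chosen later. This is only a sketch: the names CLUSTER_COLORS and color_for_cluster are assumptions, and 'gray' is used here as a fallback for unexpected labels.

```python
# Cluster label -> marker color (same assignments as regioncolors)
CLUSTER_COLORS = {0: 'green', 1: 'blue', 2: 'red'}

def color_for_cluster(label, default='gray'):
    """Return the marker color for a cluster label, or the default if unknown."""
    return CLUSTER_COLORS.get(label, default)

print([color_for_cluster(label) for label in [0, 2, 1, 7]])
# ['green', 'red', 'blue', 'gray']
```

With pandas, the dictionary could be passed to the map method of the cluster_label column to fill the color column in one step.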

At this stage, we have assigned a cluster label to each of our neighborhood venues and a unique color to each cluster. Next, we can plot the clustering results on a Folium map to see how these venues are distributed geographically within the 500-meter radius around the proposed facility.

# Construct a map of neighborhood venues in Kembangan, West Jakarta
map_kembangan = folium.Map(location=[-6.191395, 106.7405859], zoom_start=16)

# Add one marker per venue, colored by its cluster
for lat, lng, cat, col in zip(nearby_venues2['lat'], nearby_venues2['lng'], nearby_venues2['categories'], nearby_venues2['color']):
    label = folium.Popup(cat, parse_html=True)
    folium.Marker(
        location=[lat, lng],
        popup=label,
        icon=folium.Icon(color=col),
    ).add_to(map_kembangan)

# Add the candidate testing center locations (dark blue, 'H' icon)
for lat, lng, neigh in zip(prop_neighborhood['Latitude'], prop_neighborhood['Longitude'], prop_neighborhood['candidate_neigh']):
    label = folium.Popup(neigh, parse_html=True)
    folium.Marker(
        location=[lat, lng],
        popup=label,
        icon=folium.Icon(color='darkblue', icon='header'),
    ).add_to(map_kembangan)

map_kembangan

The map above shows 3 clusters within the 500-meter radius of our proposed facility (labelled ‘H’, dark blue). However, the clustering does not distinguish business categories (e.g. restaurant areas, residential areas, etc.); it only groups the venues into 3 clusters based on their geographic coordinates, that is, on how close they are to one another around our proposed site.

6. Conclusion

In this article, the author has demonstrated a ‘visual’ approach to determining the best location for a business venture using data science / data analytics tool kits. The same process can be applied to other cities in the world as long as the required data are available. Some improvements could certainly be made, as briefly mentioned in the report: for example, using a slider or animation to visualize how the data change from day to day, or using different algorithms to improve the clustering analysis discussed here. I think a visual approach is a better way to present our analysis to non-specialists such as the investing group or other stakeholders.

Contact: email: yusvana@gmail.com

Connect: https://www.linkedin.com/in/ramayusvana


IBM Certified Professional Data Scientist, University Lecturer — BioEngineering