Customized Coordinat Point Grouping

There was a time, dimana gue menemukan sebuah permasalahan bisnis, yang mana untuk menyelesaikannya, gue harus membagi bagi point koordinat ke dalam beberapa grup, dan masing masing grup sudah ditentukan jumlah point nya. As an amateur data engineer, of course what I was doing is checking stackoverflow with “clustering” as the keyword to find a solution. I’ve visited some page, but found no solution which I can actually working on, like below:

Finally, I realize that I’ve been using the wrong keyword all this time. If you want the number of point result member to be same size, you are doing a grouping, not clustering (unsupervised learning). I finally able to solve the problem using simple python script and postgis query.

For example, I have a table called tb_points which has 4000 coordinate point, and you want to divide it into 10 same size group, which will contain 400 coordinate point each. Here is the example of the table structure

CREATE TABLE tb_points (
outlet_id INTEGER,
longitude FLOAT,
latitide FLOAT,
group_id INTEGER

Then what I was doing are:

  1. Find the first coordinate that will be your starting point
  2. Find nearest coordinate from your starting point, order ascending by distance, limit the result by the number of your preferred coordinate point member (in this case 400)
  3. Update the result by updating the group_id column
  4. Do 3 steps above 10 times for the rest of data, which group_id column is still NULL

This is the implementation in python:

import psycopg2

dbhost = ''
dbuser = ''
dbpass = ''
dbname = ''
dbport = 5432

conn = psycopg2.connect(host = dbhost,
user = dbuser,
password = dbpass,
database = dbname,
port = dbport)

def fetch(sql):
cursor = conn.cursor()
rs = None
rs = cursor.fetchall()
except psycopg2.Error as e:
rs = 'error'
return rs

def execScalar(sql):
cursor = conn.cursor()
rowsaffected = cursor.rowcount
except psycopg2.Error as e:
rowsaffected = -1
return rowsaffected

def select_first_cluster_id():
sql = """ SELECT a.outlet_id as ori_id, a.longitude as ori_lon,
a.latitude as ori_lat, b.outlet_id as dest_id, b.longitude as
dest_lon, b.latitude as dest_lat,
4326) AS geography),
CAST(ST_SetSRID(ST_Point(b.longitude,b.latitude),4326) AS
AS air_distance FROM tb_points a CROSS JOIN tb_points b WHERE
a.outlet_id != b.outlet_id and a.group_id is NULL and b.group_id
is null order by air_distance desc limit 1 """
return sql

def update_group_id(group_id, ori_id, limit_constraint):
sql = """ UPDATE tb_points
set group_id = %s
where outlet_id in
(select b.outlet_id
from tb_points a,
tb_points b
where a.outlet_id = '%s'
and a.group_id is null
and b.group_id is null
order by
4326) AS geography),
CAST(ST_SetSRID(ST_Point(b.longitude,b.latitude),4326) AS
geography)) asc
limit %s)
""" % (group_id, ori_id, limit_constraint)
return sql

def clustering():
data_constraint = [100]
n = 1
while n <= 10:
sql = select_first_cluster_id()
res = fetch(sql)
ori_id = res[0][0]

sql = update_group_id(n, ori_id, data_constraint[0])

n += 1

Coordinate Grouping Result

You can actually customize the code, like how many group you want to make, and the number of point coordinate member in each group. Turn out I don’t need fancy algorithm to solve the problem… lol