Customized Coordinat Point Grouping

3 min readSep 27, 2017

There was a time, dimana gue menemukan sebuah permasalahan bisnis, yang mana untuk menyelesaikannya, gue harus membagi bagi point koordinat ke dalam beberapa grup, dan masing masing grup sudah ditentukan jumlah point nya. As an amateur data analyst, of course what I was doing is checking stackoverflow with “clustering” as the keyword to find a solution. I’ve visited some page, but found no solution which I can actually working on, like below:

https://stackoverflow.com/questions/5452576/k-means-algorithm-variation-with-equal-cluster-size (this is the most relevant stackoverflow page I found related to my problem)
https://elki-project.github.io/tutorial/same-size_k_means (and so far this is the best solution I can found, but it was written in Java, so I skipped it)

Finally, I realize that I’ve been using the wrong keyword all this time. If you want to have same amount of point in each group, you are doing a grouping, not clustering (unsupervised learning). I finally able to solve the problem using simple python script and postgis query.

For example, I have a table called tb_points which has 4000 coordinate points, and you want to divide it into 10 group, where each group contains 400 coordinate points. Here is the example of the table structure

CREATE TABLE tb_points (
  id SERIAL PRIMARY KEY,
  outlet_id INTEGER,
  longitude FLOAT,
  latitide FLOAT,
  group_id INTEGER
);

Then I do the following steps:

Find the first coordinate that will be your starting point
Find nearest coordinate from your starting point, order ascending by distance, limit the result by the number of your preferred coordinate point member (in this case 400)
Update the result by updating the group_id column
Do 3 steps above 10 times for the rest of data, which group_id column is still NULL

This is the implementation code in python:

import psycopg2

dbhost = ''
dbuser = ''
dbpass = ''
dbname = ''
dbport = 5432

conn = psycopg2.connect(host = dbhost,
       user = dbuser,
       password = dbpass,
       database = dbname,
       port = dbport)

def fetch(sql):
    cursor = conn.cursor()
    rs = None
    try:
        cursor.execute(sql)
        rs = cursor.fetchall()
    except psycopg2.Error as e:
        print(e.pgerror)
        rs = 'error'
    cursor.close()
    return rs

def execScalar(sql):
    cursor = conn.cursor()
    try:
        cursor.execute(sql)
        conn.commit()
        rowsaffected = cursor.rowcount
    except psycopg2.Error as e:
        print(e.pgerror)
        rowsaffected = -1
        conn.rollback()
    cursor.close()
    return rowsaffected


def select_first_cluster_id():
    sql = """ SELECT a.outlet_id as ori_id, a.longitude as ori_lon,
    a.latitude as ori_lat, b.outlet_id as dest_id, b.longitude as
    dest_lon, b.latitude as dest_lat,
    ST_Distance(CAST(ST_SetSRID(ST_Point(a.longitude,a.latitude),
    4326) AS geography), 
    CAST(ST_SetSRID(ST_Point(b.longitude,b.latitude),4326) AS
    geography))
    AS air_distance FROM  tb_points a CROSS JOIN tb_points b WHERE
    a.outlet_id != b.outlet_id and a.group_id is NULL and b.group_id
    is null order by air_distance desc limit 1 """
    return sql

def update_group_id(group_id, ori_id, limit_constraint):
    sql = """ UPDATE tb_points
    set group_id = %s
    where outlet_id in
    (select b.outlet_id
    from tb_points a,
    tb_points b
    where a.outlet_id = '%s'
    and a.group_id is null
    and b.group_id is null
    order by
    ST_Distance(CAST(ST_SetSRID(ST_Point(a.longitude,a.latitude),
    4326) AS geography),
    CAST(ST_SetSRID(ST_Point(b.longitude,b.latitude),4326) AS
    geography)) asc
    limit %s)
    """ % (group_id, ori_id, limit_constraint)
    return sql

def clustering():
    data_constraint = [100]
    n = 1
    while n <= 10:
        sql = select_first_cluster_id()
        res = fetch(sql)
        ori_id = res[0][0]

        sql = update_group_id(n, ori_id, data_constraint[0])
        print(sql)
        execScalar(sql)

        n += 1

clustering()

Tadaaa. You got the result. You can also customize the code, like how many group you want to make, and the amount of point coordinate in each group. Turn out I don’t need fancy algorithm to solve the problem… lol, just need to see the problem from different perspective

Customized Coordinat Point Grouping

Written by Mia Renauly