The n_jobs parameter of KMeans in scikit-learn can make results inconsistent

Bryan Yang

Published in A multi hyphen life

4 min read · Jul 4, 2017

Notes from stepping on a landmine

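scikit-learn is a widely used machine learning package, and KMeans is hands down everyone's favorite clustering model; even people who have never used it have at least heard of it. Today's focus is not the algorithm itself but one of its parameters: n_jobs.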

Here is the official description of the n_jobs parameter.

n_jobs : int
The number of jobs to use for the computation. This works by computing each of the n_init runs in parallel.
If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used.
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html
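
The rule in the quoted documentation can be written out as a small function. This is only a sketch of the arithmetic: `effective_jobs` and `n_cpus` are hypothetical names, not part of scikit-learn, and both the clamp to at least one worker and the error for n_jobs = 0 are assumptions of this sketch.

```python
import os


def effective_jobs(n_jobs, n_cpus=None):
    """Return how many workers an n_jobs value maps to, per the quoted rule."""
    if n_cpus is None:
        # os.cpu_count() may return None on some platforms; fall back to 1.
        n_cpus = os.cpu_count() or 1
    if n_jobs == 0:
        # Assumption of this sketch: treat 0 as invalid.
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs < 0:
        # -1 -> all CPUs, -2 -> all but one, and so on.
        # Assumption of this sketch: never go below one worker.
        return max(n_cpus + 1 + n_jobs, 1)
    return n_jobs


print(effective_jobs(-1, n_cpus=16))  # 16
print(effective_jobs(-2, n_cpus=16))  # 15
print(effective_jobs(1, n_cpus=16))   # 1
```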

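A quick explanation: KMeans uses ten random starting points by default (n_init=10). With the default n_jobs=1, it runs k-means ten times one after another and keeps the best result. With n_jobs>1, those ten random starts are run in parallel. Sounds great, right? Using 16 CPUs at once feels luxurious.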

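It sounds great, but why did my two runs, one with n_jobs=1 and one with n_jobs>1, give inconsistent results? (Note that random_state was the same in both.)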

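The documentation also says that random_state fixes the seed, meaning the result should be the same no matter how many times you run it. So why does it change when n_jobs>1? I had no choice but to dig through the source code QAQ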

random_state : integer or numpy.RandomState, optional
The generator used to initialize the centers. If an integer is given, it fixes the seed. Defaults to the global numpy random number generator.

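In the source code, when n_jobs=1 the n_init runs are sent to kmeans_single one after another, and the random_state used inside is the random_state we passed in. BUT when n_jobs>1, the random_state handed to kmeans_single is no longer our original random_state. A code comment says so (though the documentation does not): the seeds are changed to ensure variety. No wonder the results differ QQ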

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/cluster/k_means_.py
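
The two code paths described above can be modeled with a small, simplified sketch. It uses Python's standard `random` module instead of NumPy, and `fake_kmeans_single`, `sequential_inits`, `parallel_inits`, and `MAX_SEED` are hypothetical names made up for illustration; this is my reading of the seeding logic, not the scikit-learn code itself.

```python
import random

MAX_SEED = 2**31 - 1  # assumption: stands in for the largest int32 value


def fake_kmeans_single(rng):
    """Hypothetical stand-in for one k-means run: draws one 'initial center'."""
    return rng.random()


def sequential_inits(random_state, n_init):
    # n_jobs == 1: every run shares one generator, so each run continues
    # the random stream where the previous run stopped.
    rng = random.Random(random_state)
    return [fake_kmeans_single(rng) for _ in range(n_init)]


def parallel_inits(random_state, n_init):
    # n_jobs > 1: draw one integer seed per run from the user's generator,
    # then give every run its own generator built from that seed.
    rng = random.Random(random_state)
    seeds = [rng.randrange(MAX_SEED) for _ in range(n_init)]
    return [fake_kmeans_single(random.Random(seed)) for seed in seeds]


seq = sequential_inits(42, n_init=3)
par = parallel_inits(42, n_init=3)

# Each path is repeatable on its own for a fixed random_state...
print(seq == sequential_inits(42, n_init=3))
print(par == parallel_inits(42, n_init=3))
# ...but the two paths do not start from the same initial points.
print(seq == par)
```

In this model, a fixed random_state makes each path reproducible by itself, yet switching between the two paths changes the initializations, which matches the behavior I ran into.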

My workaround is to set n_init to a larger value, so that KMeans searches more times and tries more random seeds, and then picks the best solution among them.
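
To illustrate why more restarts help, here is a minimal pure-Python sketch of k-means on 1-D toy data. `kmeans_1d`, `best_of`, and the synthetic data are all hypothetical and are not scikit-learn's implementation. It only shows that, with a shared generator, more restarts can never give a worse best inertia; it does not make n_jobs=1 and n_jobs>1 return identical results.

```python
import random


def kmeans_1d(points, k, rng, max_iter=50):
    """One run of Lloyd's algorithm on 1-D data; returns (inertia, centers)."""
    centers = rng.sample(points, k)  # random initial centers
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: (p - centers[j]) ** 2)
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster (keep it if empty).
        new_centers = [
            sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    inertia = sum(min((p - c) ** 2 for c in centers) for p in points)
    return inertia, centers


def best_of(points, k, n_init, seed):
    """Run k-means n_init times from one shared generator and keep the best."""
    rng = random.Random(seed)
    return min(kmeans_1d(points, k, rng) for _ in range(n_init))


# Toy data: four 1-D blobs of 30 points each.
data_rng = random.Random(0)
points = [data_rng.gauss(mu, 0.5) for mu in (0, 3, 6, 9) for _ in range(30)]

few = best_of(points, k=4, n_init=2, seed=1)[0]
many = best_of(points, k=4, n_init=20, seed=1)[0]
print(few, many)  # best inertia with few restarts vs. many restarts
```

Because both calls consume the same random stream, the first two restarts of the n_init=20 call repeat the n_init=2 call exactly, so its best inertia is at most as large.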
