4/30 — Applying K-Means, PCA and MDS

Shane Liu
Visualization@SBU
2 min read · May 8, 2020

Progress for the SBU CSE564 final project

We applied data preprocessing methods in the previous post. In this post, we demonstrate how we apply dimensionality reduction tools to our datasets.

Merge Data

Before applying K-Means, PCA, and MDS, we merge our datasets together.

# Merge the trump and DowJones data on the 'Date' column
merge_data = pd.merge(trump, DowJones, how='left', on='Date')
Overview of the merged data

Applying Elbow Method to find the Optimal K

We found that the k = 5 may be the optimal k.
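The post doesn't include the elbow-method code itself, so here is a minimal sketch of how it could be computed with scikit-learn; the function name and k range are illustrative, not from the original project.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale


def elbow_inertias(data, k_range=range(1, 11)):
    """Fit KMeans for each k and collect the inertia (within-cluster SSE).

    Plotting inertia against k and looking for the 'elbow' where the
    curve flattens suggests the optimal k (k = 5 in our case).
    """
    X = scale(data)  # standardize features before clustering
    return [
        KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in k_range
    ]
```

The inertias can then be plotted against `k_range` (e.g. with `matplotlib`) to locate the elbow visually.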

Finding the best PCA

We used the following code to find the best number of principal components for our data. From the explained variance, we found that 4 components is the best choice.

def best_PCA(data):
    # Encode categorical columns as integers, then cast to float
    data = data.apply(preprocessing.LabelEncoder().fit_transform)
    data = data.astype('float64')
    s_kmeans = KMeans(n_clusters=5).fit(data)
    pca_dataset = PCA().fit(scale(data))
    return pca_dataset.explained_variance_ratio_
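To make the choice of 4 components concrete, here is a self-contained sketch of reading the cumulative explained variance from a fitted PCA; the function name and threshold are illustrative assumptions, not part of the original code.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale


def cumulative_explained_variance(data):
    """Fit PCA on scaled data and return the cumulative variance ratio.

    The smallest number of components whose cumulative ratio crosses a
    chosen threshold (e.g. ~0.8) is a common pick; for our data that
    point was 4 components.
    """
    pca = PCA().fit(scale(data))
    return np.cumsum(pca.explained_variance_ratio_)
```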

Visualizing the data via MDS

Euclidean

The code below shows how we computed the Euclidean-distance MDS embedding.

def get_euclidean(sample):
    mds_df_results = pd.DataFrame()
    sample_matrix = pd.DataFrame(scale(sample))
    # MDS with the default Euclidean dissimilarity
    euclidean_vec = MDS(n_components=4).fit_transform(sample_matrix)
    euclidean_vec = pd.DataFrame(euclidean_vec)
    mds_df_results['x'] = euclidean_vec[0]
    mds_df_results['y'] = euclidean_vec[1]
    return mds_df_results
Plot of the result, using year as the label

Correlation

The code below shows how we computed the correlation-distance MDS embedding.

def get_correlation(sample):
    mds_df_results = pd.DataFrame()
    sample_matrix = pd.DataFrame(scale(sample))
    sample_matrix = sample_matrix.transpose()
    sample_corr_m = sample_matrix.corr()
    # Use 1 - correlation as a precomputed dissimilarity matrix
    correlations = MDS(n_components=4,
                       dissimilarity='precomputed').fit_transform(1 - sample_corr_m)
    correlations = pd.DataFrame(correlations)
    mds_df_results['x'] = correlations[0]
    mds_df_results['y'] = correlations[1]
    return mds_df_results
Plot of the result, using year as the label

Congrats! We have achieved our goal.
