4/30 — Applying K-Means, PCA and MDS

Shane Liu
Visualization@SBU
2 min read · May 8, 2020

Progress for the SBU CSE564 final project

We applied data preprocessing methods in the previous post. In this post, we demonstrate how we apply dimensionality reduction tools to our datasets.

Merge Data

Before applying K-Means, PCA, and MDS, we merge our datasets together.

# Merge the trump and DowJones data on the 'Date' column
merge_data = pd.merge(trump, DowJones, how='left', on='Date')
Overview of the merged data

Applying Elbow Method to find the Optimal K

We found that the k = 5 may be the optimal k.
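The post doesn't include the elbow-method code itself, so here is a minimal sketch of how it could be computed with scikit-learn; the function name and k range are illustrative, not from the original project.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import scale


def elbow_inertias(data, k_range=range(1, 11)):
    """Fit KMeans for each k and collect the inertia (within-cluster SSE).

    Plotting inertia against k and looking for the 'elbow' where the
    curve flattens suggests the optimal k (k = 5 in our case).
    """
    X = scale(data)  # standardize features before clustering
    return [
        KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
        for k in k_range
    ]
```

The inertias can then be plotted against `k_range` (e.g. with `matplotlib`) to locate the elbow visually.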

Finding the best PCA

We used the following code to find the best number of principal components for our data. From the explained variance, we found that 4 components is the best choice.

def best_PCA(data):
    # Encode categorical columns as integers, then cast to float
    data = data.apply(preprocessing.LabelEncoder().fit_transform)
    data = data.astype('float64')
    s_kmeans = KMeans(n_clusters=5).fit(data)
    pca_dataset = PCA().fit(scale(data))
    return pca_dataset.explained_variance_ratio_
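To make the choice of 4 components concrete, here is a self-contained sketch of reading the cumulative explained variance from a fitted PCA; the function name and threshold are illustrative assumptions, not part of the original code.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale


def cumulative_explained_variance(data):
    """Fit PCA on scaled data and return the cumulative variance ratio.

    The smallest number of components whose cumulative ratio crosses a
    chosen threshold (e.g. ~0.8) is a common pick; for our data that
    point was 4 components.
    """
    pca = PCA().fit(scale(data))
    return np.cumsum(pca.explained_variance_ratio_)
```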

Visualizing the data via MDS

Euclidean

The code below shows how we computed the Euclidean-distance MDS embedding.

def get_euclidean(sample):
    mds_df_results = pd.DataFrame()
    sample_matrix = pd.DataFrame(scale(sample))
    # MDS with the default Euclidean dissimilarity
    euclidean_vec = MDS(n_components=4).fit_transform(sample_matrix)
    euclidean_vec = pd.DataFrame(euclidean_vec)
    mds_df_results['x'] = euclidean_vec[0]
    mds_df_results['y'] = euclidean_vec[1]
    return mds_df_results
Plot of the result, using year as the label

Correlation

The code below shows how we computed the correlation-distance MDS embedding.

def get_correlation(sample):
    mds_df_results = pd.DataFrame()
    sample_matrix = pd.DataFrame(scale(sample))
    sample_matrix = sample_matrix.transpose()
    sample_corr_m = sample_matrix.corr()
    # Use 1 - correlation as a precomputed dissimilarity matrix
    correlations = MDS(n_components=4,
                       dissimilarity='precomputed').fit_transform(1 - sample_corr_m)
    correlations = pd.DataFrame(correlations)
    mds_df_results['x'] = correlations[0]
    mds_df_results['y'] = correlations[1]
    return mds_df_results
Plot of the result, using year as the label

Congrats! We have achieved our goal.
