Visualizing Data Made Easy with ProSphera
Business analytics often involves segmenting data (clustering) and communicating complex insights to stakeholders. This typically relies on unsupervised learning techniques to identify clusters and on descriptive statistics to summarize them. But is there a simple, efficient way to visualize predefined clusters?
To better understand multidimensional datasets in vector data analysis, one option is to use TensorFlow’s Embedding Projector. This tool allows for visually appealing and informative exploration of high-dimensional embeddings. Based on my personal experience, I can confirm its significant impact on data analysis.
However, like any other tool, it has limitations. Integrating it into your code can be challenging due to the complex process of saving your data externally and then uploading it to the online tool. This can be time-consuming and burdensome.
As a result, our team has developed an alternative solution that offers comparable performance but is specifically designed for seamless integration into your workflow.
Our open-source package, prosphera (stands for ‘Projection on Sphera’), utilizes Principal Component Analysis (PCA) and a cosine kernel to generate easy-to-understand visualizations of complex data. These visualizations are organized in a three-dimensional space, clearly depicting your data.
The `_apply_pca` method is a crucial component of our data processing pipeline and sits inside the "black box" of the package.
from sklearn.decomposition import KernelPCA
from sklearn.preprocessing import robust_scale

def _apply_pca(self, data):
    # Scale robustly to dampen the influence of outliers.
    scaled_data = robust_scale(data, quantile_range=(5, 95))
    # Reduce to three components with a cosine kernel for 3D plotting.
    pca = KernelPCA(
        n_components=3,
        kernel='cosine',
        copy_X=False,
        random_state=self.random_state,
        n_jobs=-1)
    return pca.fit_transform(scaled_data)
The method starts by applying robust scaling to the input data using sklearn's `robust_scale` function, which centers and scales the data using statistics that are resistant to outliers (here, the 5th–95th percentile range).
The next step initializes a KernelPCA object with the cosine kernel, which is well suited to capturing non-linear relationships within the data. The dimensionality is reduced to just three components, making the result suitable for 3D visualization.
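To see the two steps in isolation, here is a minimal standalone sketch of the same pipeline (robust scaling followed by cosine KernelPCA) run on synthetic clustered data; the `make_blobs` dataset and the parameter values are illustrative, not part of the package:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import KernelPCA
from sklearn.preprocessing import robust_scale

# Synthetic clustered data: 300 points in 10 dimensions.
data, _ = make_blobs(n_samples=300, n_features=10, centers=4, random_state=42)

# Same steps as in the pipeline: robust scaling, then cosine KernelPCA.
scaled = robust_scale(data, quantile_range=(5, 95))
pca = KernelPCA(n_components=3, kernel='cosine', random_state=42)
reduced = pca.fit_transform(scaled)

print(reduced.shape)  # (300, 3)
```

Whatever the input dimensionality, the output always has exactly three columns, ready for a 3D scatter plot.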
After the PCA transformation, the data must be normalized: the points are centered around the origin and rescaled to unit norm, so the resulting vectors can be placed consistently on a sphere. This is handled by the `_scale_vectors_on_sphere` method.
import numpy as np
from sklearn.preprocessing import minmax_scale

def _scale_vectors_on_sphere(data_pca, scaling_range=(0.1, 1)):
    # Center the data around the origin.
    vectors = data_pca - np.mean(data_pca, axis=0)
    # Normalize each vector to unit length.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    normalized = vectors / norms
    # Compress the original magnitudes logarithmically into the target range.
    scaled_magnitudes = minmax_scale(
        np.log(norms**2),
        feature_range=scaling_range)
    # Reapply the scaled magnitudes to the unit vectors.
    vecs = normalized * scaled_magnitudes
    return vecs, scaled_magnitudes
First, it centres the data around zero by subtracting the mean of each feature. It then normalizes the vectors by dividing them by their magnitudes, so every vector has length 1. The original magnitudes are compressed logarithmically into a predefined range, and the unit vectors are multiplied by these scaled magnitudes. This pulls points that were far from the centre back towards the sphere's surface, producing a roughly spherical arrangement of the data.
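The key property of this step is that every transformed vector's length ends up inside the requested range. The following sketch re-runs the same operations on random data to illustrate this (the Gaussian input is an assumption for demonstration):

```python
import numpy as np
from sklearn.preprocessing import minmax_scale

# Illustrative input: 200 random 3D points standing in for PCA output.
rng = np.random.default_rng(0)
data_pca = rng.normal(size=(200, 3))

# Same operations as in the method above.
vectors = data_pca - np.mean(data_pca, axis=0)
norms = np.linalg.norm(vectors, axis=1, keepdims=True)
normalized = vectors / norms
scaled_magnitudes = minmax_scale(np.log(norms**2), feature_range=(0.1, 1))
vecs = normalized * scaled_magnitudes

# Every resulting vector length lies within the scaling range (0.1, 1).
lengths = np.linalg.norm(vecs, axis=1)
print(lengths.min(), lengths.max())
```

Because the unit vectors are simply rescaled by the clipped magnitudes, the shortest vector has length 0.1 and the longest has length 1, confining all points to a thin spherical shell.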
Arranging the data points on the surface of a sphere creates a clear, engaging display. The spherical representation makes the data's structure easier to grasp and supports interactive exploration through Plotly's API.
Here is how prosphera deals with generated data:
One more example with the ‘digits’ dataset:
And another one with a ‘housing’ dataset with labels, created by splitting the target into several bins:
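For the labeled example, the continuous target can be turned into discrete labels by binning. A minimal sketch of that preprocessing step, using synthetic target values (the data and bin count are illustrative, not taken from the housing dataset):

```python
import numpy as np

# Hypothetical continuous target values, e.g. house prices.
rng = np.random.default_rng(1)
target = rng.uniform(50, 500, size=1000)

# Split the target into 5 equal-width bins to use as labels.
edges = np.linspace(target.min(), target.max(), num=6)
labels = np.digitize(target, edges[1:-1])

print(sorted(set(labels)))  # → [0, 1, 2, 3, 4]
```

Each point's label is then simply the bin its target value falls into, which lets a regression dataset be colored like a clustered one.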
If you are a data scientist, analyst, or anyone working on complex data clustering tasks, we encourage you to try prosphera in your projects. We hope it will simplify your workflow and provide valuable insights.