Andrews Curve For Data Visualization

Ravish Kumar
Published in EnjoyAlgorithms
5 min read · Feb 20, 2024

Analyzing high-dimensional data is challenging because there is no direct way to visualize more than three dimensions simultaneously. This difficulty is one face of the Curse of Dimensionality, which refers to a scenario where a dataset has “too many” features.

The definition of “too many” is subjective and depends heavily on the project at hand. Still, with the current trend in data collection, data engineers tend to record as many features as possible, because dropping a feature later is far easier than recreating the same scenario and collecting the entire dataset again.

Dimensionality reduction

This opens up another challenge in analyzing data samples. Before sending data to an ML model, we first need to analyze it and ensure it will be helpful for learning. But to analyze data of such high dimension, we typically reduce the dimensionality first, explore the relationships, and only then feed the data into ML models.

Data features in the higher dimension have properties of their own, such as variance and the relative distances between samples. When we reduce the dimensionality, we lose some of these properties, and techniques that incur the least loss are considered better.

Two prevalent methods for reducing dimensionality are t-distributed Stochastic Neighbor Embedding (t-SNE) and Principal Component Analysis (PCA). Here, however, we will learn a less renowned yet effective technique to analyze the relationships among data samples in a higher dimension.

Andrews Curve

The Andrews curve is a projection of each row of the data onto the vector

(1/√2, sin(t), cos(t), sin(2t), cos(2t), …)

This vector is dynamic: it changes with respect to the variable “t”. If a sample has “d” features, its row vector looks like [x1, x2, x3, …, xd]. Projecting this sample onto the dynamic vector gives

f(t) = x1/√2 + x2·sin(t) + x3·cos(t) + x4·sin(2t) + x5·cos(2t) + …

We can think of this projection as a function of the variable “t”, with “t” ranging over (−π, +π).
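To make the projection concrete, here is a minimal sketch (not the pandas implementation) of the Andrews function for a single sample; the coefficient pattern follows the standard Fourier-series definition above:

```python
import numpy as np

def andrews_function(x, t):
    """Evaluate the Andrews curve f(t) for one sample x = [x1, ..., xd]:
    f(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + ...
    """
    result = x[0] / np.sqrt(2)
    for i, xi in enumerate(x[1:], start=1):
        k = (i + 1) // 2          # harmonic number: 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 1:            # odd positions pair with sin
            result = result + xi * np.sin(k * t)
        else:                     # even positions pair with cos
            result = result + xi * np.cos(k * t)
    return result

# One Iris-like sample with d = 4 features, evaluated over t in (-pi, pi)
sample = [5.1, 3.5, 1.4, 0.2]
t = np.linspace(-np.pi, np.pi, 200)
curve = andrews_function(sample, t)
print(curve.shape)  # (200,)
```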

What does the Andrews Curve represent?

The Andrews curve brings multi-dimensional data into a lower space while retaining the relative distances between samples and keeping the variance similar. Variance can be correlated with the information present in the data. Plotting the projected functions therefore answers the question: “How similar are the data samples that represent the same or similar output?” For example, if Sample-1 and Sample-2 both produce the same class label, they should be relatively close to each other in the higher dimension, and that small relative distance should translate into similar behavior in the lower dimension.
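The distance-preservation claim can be checked numerically. Because the basis functions 1/√2, sin(t), cos(t), sin(2t) are orthogonal on (−π, π), the squared L2 distance between two Andrews curves equals π times the squared Euclidean distance between the samples. A small sketch for two hypothetical 4-feature samples:

```python
import numpy as np

def andrews(x, t):
    # Andrews function for a 4-feature sample:
    # f(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t)
    return (x[0] / np.sqrt(2) + x[1] * np.sin(t)
            + x[2] * np.cos(t) + x[3] * np.sin(2 * t))

a = np.array([5.1, 3.5, 1.4, 0.2])
b = np.array([6.3, 3.3, 6.0, 2.5])

# Integrate the squared gap between the two curves over one period
t = np.linspace(-np.pi, np.pi, 20001)
gap = andrews(a, t) - andrews(b, t)
curve_dist_sq = np.sum(gap[:-1] ** 2) * (t[1] - t[0])

# Matches pi * ||a - b||^2, so close samples stay close after projection
print(curve_dist_sq, np.pi * np.sum((a - b) ** 2))
```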

Andrews Curve on Iris Dataset using Python

Introduction to IRIS Dataset

IRIS is a widely popular dataset containing three flower classes: Setosa, Versicolor, and Virginica. Each sample has four features: sepal length, sepal width, petal length, and petal width. ML models segregate these flowers by type based on these features.
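The structure of the dataset can be inspected quickly. The sketch below uses scikit-learn's bundled copy of Iris (an assumption for convenience; the article itself loads a CSV in the next section):

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.feature_names)  # sepal length/width, petal length/width (cm)
print(iris.target_names)   # ['setosa' 'versicolor' 'virginica']
print(iris.data.shape)     # (150, 4): 150 samples, 4 features
```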

Loading the IRIS Dataset into the Python Code

We can download this dataset (iris.csv) from the official GitHub repository of the Pandas library. The lines below load the data into our code environment.

import pandas as pd

df = pd.read_csv(
'https://raw.github.com/pandas-dev/'
'pandas/main/pandas/tests/io/data/csv/iris.csv'
)

print(df.head())


'''
   SepalLength  SepalWidth  PetalLength  PetalWidth         Name
0          5.1         3.5          1.4         0.2  Iris-setosa
1          4.9         3.0          1.4         0.2  Iris-setosa
2          4.7         3.2          1.3         0.2  Iris-setosa
3          4.6         3.1          1.5         0.2  Iris-setosa
4          5.0         3.6          1.4         0.2  Iris-setosa
'''

Please note that the data is supervised: the labels are present in the column “Name”.

Plotting Separate Classes

As the data has 4-dimensional features, we cannot easily spot similarities among the samples by plotting the raw features. So we will use the Andrews curve to project each sample onto the dynamic vector. To make the understanding more firm, we will first draw the Andrews curve for samples belonging to a single class. Let’s start with Setosa.

Setosa

To extract the samples of the Setosa class, we will use df[df['Name']=='Iris-setosa']. The Pandas library provides a direct function, andrews_curves, to plot the curve from a dataframe; the complete function and its arguments can be seen below.

pandas.plotting.andrews_curves(frame, class_column, ax=None, samples=200, color=None, colormap=None, **kwargs)

It expects a dataframe, the column holding the class labels, the plot color, and the number of points at which each curve is evaluated. A complete list of these arguments can be found in the pandas documentation.

import matplotlib.pyplot as plt

df_setosa = df[df['Name']=='Iris-setosa']

plt.figure('setosa')
pd.plotting.andrews_curves(df_setosa, 'Name', color=['r'])
plt.xlabel('t')
plt.ylabel('f(t)')
plt.show()

Versicolor

Similarly, for the Versicolor class samples:

df_versicolor = df[df['Name']=='Iris-versicolor']

plt.figure('versicolor')
pd.plotting.andrews_curves(df_versicolor, 'Name', color=['g'])
plt.xlabel('t')
plt.ylabel('f(t)')
plt.show()

Virginica

df_virginica = df[df['Name']=='Iris-virginica']

plt.figure('virginica')
pd.plotting.andrews_curves(df_virginica, 'Name', color=['b'])
plt.xlabel('t')
plt.ylabel('f(t)')
plt.show()

Please note that the curves for Virginica and Versicolor are almost similar. The same behavior was noticed while applying the k-means clustering algorithm to this dataset: the two corresponding clusters were very close to each other. If we run k-means on the Iris dataset, the clusters form like this:

We can observe that the purple and yellow samples are very close to each other compared to the green samples representing the Setosa class.
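The clustering comparison above can be reproduced with a short sketch. Using scikit-learn's KMeans here is an assumption (the original article does not show its clustering code); the pairwise distances between the fitted cluster centers make the overlap explicit:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

# Load the 4-dimensional Iris features
X, y = load_iris(return_X_y=True)

# Cluster into 3 groups, matching the 3 flower classes
km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)

# Pairwise distances between cluster centers: the two centers covering
# Versicolor and Virginica sit much closer together than either is to Setosa
centers = km.cluster_centers_
d = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
print(np.round(d, 2))
```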

Plotting All Classes to Observe the Data Pattern

plt.figure('Andrews Curve')

pd.plotting.andrews_curves(df, 'Name', color=['r', 'g', 'b'])
plt.show()

This complete plot now lets us identify hidden patterns in the 4-dimensional dataset. From the individual class plots, we learned that the Versicolor and Virginica classes are very similar, and the k-means clustering output confirmed the same.

Conclusion

The Andrews Curve is a way to represent higher-dimensional data in a lower space so that analysis becomes easy. It is defined via a finite Fourier series and can also be thought of as a projection of higher-dimensional data points onto a dynamic vector.

Enjoy Learning!


Ravish Kumar

Deep Learning Engineer@Deeplite || Curriculum Leader@ enjoyalgorithms.com || IIT Kanpur || Entrepreneur || Super 30