Similarity Search of Spotify songs using GCP Vector Search & Vertex AI Python SDK in 15 minutes

Sid · Published in CodeX · 6 min read · May 3, 2024

In this article, I am going to show you how to take a Spotify songs dataset containing over 20k songs with various attributes, create vector embeddings from those attributes, and eventually run a semantic search to fetch the relevant songs.

This use case can be expanded to various scenarios where you want to capture the semantic similarity or relationships between items in a high-dimensional space. For instance, it can be applied to recommendation systems in e-commerce platforms to suggest products similar to ones a user has shown interest in, or in content discovery algorithms to recommend movies, articles, or music based on user preferences and content similarity.

The possibilities are vast, making vector embeddings and semantic search invaluable tools across diverse domains.

With that said, we will be using the below services on Google Cloud Platform:

1: GCP Vector Search (managed vector database offering)

2: Vertex AI Embeddings and Vector Search public endpoints

3: Vertex AI TextEmbeddingModel (for embeddings)

The entire code execution will be done using Python on a Vertex AI Jupyter workbench. You can run this code on your local system as well. Please watch my other articles, which show how to create a Jupyter workbench on Vertex AI. It's quite simple. Once you have a workbench up and running, just create a new Jupyter kernel and start with the execution below.

Source Code: https://github.com/sidoncloud/CloudAI-LLM/tree/main/gcp-vector-search-spotify

1: Start by installing/upgrading the modules.

%pip install --upgrade --user --quiet google-cloud-aiplatform google-cloud-storage

2. Initialize Vertex AI and read the input dataset into a pandas dataframe.

NOTE: Make sure to upload the input dataset music.csv to the root directory of your workbench before executing the next code block.

import pandas as pd
from google.cloud import storage, aiplatform
from vertexai.preview.language_models import TextEmbeddingModel
import vertexai
import tqdm
import time

PROJECT_ID = "project-id"
LOCATION = "us-central1"

BUCKET_NAME = "bucket-name"

CSV_FILE_PATH = "music.csv"

vertexai.init(project=PROJECT_ID, location=LOCATION)

df = pd.read_csv(CSV_FILE_PATH)

Once read into a dataframe, we create a new dataframe containing only 2000 rows of data. We do this because creating embeddings of 20k+ songs containing several attributes could take an hour or more.

But feel free to skip this part if you want to embed the entire dataset.

df = df.head(2000)

3. Create Embeddings using the relevant attributes

We will now select the below attributes of all the songs from the input dataset in order to create embeddings and then create an instance of TextEmbeddingModel (textembedding-gecko).

track_name

popularity

danceability

loudness

track_genre

tempo

instrumentalness

The next code block creates a new column called combined_details in the pandas dataframe, which concatenates all of the above attributes into a single string that we will embed later on.

df['combined_details'] = df.apply(lambda row: f"{row['track_name']} {row['popularity']} {row['danceability']} {row['loudness']} {row['track_genre']} {row['tempo']} {row['instrumentalness']}", axis=1)

model = TextEmbeddingModel.from_pretrained("textembedding-gecko@001")
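
If you want to sanity-check the model before embedding the full dataframe, a quick optional one-off call confirms the 768-dimensional output we will rely on when creating the index later:

# Sanity check: embed a single string and inspect the vector size.
sample_embedding = model.get_embeddings(["test sentence"])[0]
print(len(sample_embedding.values))  # textembedding-gecko returns 768-dimensional vectors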

Next, we define a simple function which takes a list of input texts and returns their embeddings in batches.

def get_embeddings_wrapper(texts, batch_size=5):
    embeddings = []
    for i in tqdm.tqdm(range(0, len(texts), batch_size)):
        time.sleep(1)  # brief pause between batches to stay within API rate limits
        batch_texts = texts[i:i + batch_size]
        batch_embeddings = model.get_embeddings(batch_texts)
        embeddings.extend([embedding.values for embedding in batch_embeddings])
    return embeddings

Let's invoke the above function by passing the contents of the combined_details column to it and storing the result in a new column, embedding, inside our dataframe.

The execution of this step will take a couple of minutes.

combined_texts = df['combined_details'].tolist()

df['embedding'] = get_embeddings_wrapper(combined_texts)

Now that we have successfully created embeddings and stored them in a pandas dataframe, the next step is to export this dataframe into a JSONL file and move this file to a GCS bucket.

This step is necessary because, in the subsequent steps, we will be storing these embeddings in GCP Vector Search, which expects the data in JSONL format stored inside a GCS bucket.

While creating the JSONL file, we will only select the necessary columns from the dataframe rather than all of them.

jsonl_string = df[["id", "track_id", "artists", "album_name", "track_name", "embedding"]].to_json(orient="records", lines=True)

with open("songs.json", "w") as f:
    f.write(jsonl_string)

BUCKET_URI = f"gs://{BUCKET_NAME}"
! gsutil cp songs.json {BUCKET_URI}

Once you execute this, you should be able to see the json file at the root of your GCS bucket.
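
If you would rather confirm this from the notebook than from the Cloud console, here is a minimal sketch using the storage client we imported earlier (it assumes BUCKET_NAME refers to the same bucket as BUCKET_URI):

# List the bucket contents and peek at the first record of songs.json.
storage_client = storage.Client(project=PROJECT_ID)
bucket = storage_client.bucket(BUCKET_NAME)

print([blob.name for blob in bucket.list_blobs()])  # should include songs.json

first_record = bucket.blob("songs.json").download_as_text().splitlines()[0]
print(first_record[:200])  # one JSON record per line: id, metadata and the embedding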

Alright, we are halfway there. Next, we deal with the Vector Search part.

4. GCP Vector Search implementation

We import aiplatform and initialize it. Then we create an index by invoking the method create_tree_ah_index.

You give your index a custom name, pass the bucket URI which contains the JSON file created in the previous step, set the dimensions (768 for these text embeddings), define the approximate neighbor count for retrieval, and set the distance measure type to dot product.

NOTE: The value of BUCKET_URI can be the root path of your GCS bucket or a path to a sub-directory inside the bucket; in either case, this path must only contain the JSON file created previously. The below code block will fail if it finds any other file in the same location.

from google.cloud import aiplatform

aiplatform.init(project=PROJECT_ID, location=LOCATION)

my_index = aiplatform.MatchingEngineIndex.create_tree_ah_index(
    display_name="spotify-songs-idx",
    contents_delta_uri=BUCKET_URI,
    dimensions=768,
    approximate_neighbors_count=10,
    distance_measure_type="DOT_PRODUCT_DISTANCE",
)

Creating an index will take roughly 5 minutes, and once done, you can head over to Vector Search from your Vertex AI dashboard and find your newly created index.
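
If you prefer to check from the notebook instead of the console, the SDK can also list the Vector Search indexes in your project; a small optional sketch:

# List existing Vector Search (Matching Engine) indexes in this project and region.
for idx in aiplatform.MatchingEngineIndex.list():
    print(idx.display_name, idx.resource_name)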

The next step is to create an empty endpoint on GCP Vector Search, which completes almost instantly.

index_endpoint = aiplatform.MatchingEngineIndexEndpoint.create(
    display_name="songs-endpoint",
    public_endpoint_enabled=True
)

Finally, we deploy the index to the endpoint. This part is going to take about 10–15 minutes. You can monitor the status of the deployment under Deployed indexes on the index endpoint page in the console.

DEPLOYED_INDEX_ID = "spotify_songs_idx"

index_endpoint.deploy_index(index=my_index, deployed_index_id=DEPLOYED_INDEX_ID)

And we are done :-). The last step is just querying this endpoint to retrieve the relevant results.

You can ask any relevant questions and see the results, for example "Recommend some happy songs" or "Recommend songs similar to {song name}". Keep in mind that the query text has to be embedded with the same model before it is sent to the index.
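
Here is a minimal sketch of what such a query could look like: we embed the query text with the same textembedding-gecko model and call find_neighbors on the public endpoint. The join back to the dataframe assumes the id values we exported to the JSONL file match the datapoint IDs returned by Vector Search.

# Embed the natural-language query with the same model used for the songs.
query = "Recommend some happy songs"
query_embedding = model.get_embeddings([query])[0].values

# Query the deployed index through the public endpoint.
response = index_endpoint.find_neighbors(
    deployed_index_id=DEPLOYED_INDEX_ID,
    queries=[query_embedding],
    num_neighbors=10,
)

# Map the returned datapoint IDs back to song rows in our dataframe.
for neighbor in response[0]:
    match = df[df["id"].astype(str) == neighbor.id]
    print(neighbor.distance, match["track_name"].values, match["artists"].values)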

P.S. Make sure to delete the endpoint and the index so you don't incur any additional costs, and thanks for reading.
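
For reference, cleanup can also be done from the same notebook; a short sketch, assuming the objects created above are still in scope:

# Undeploy the index from the endpoint, then delete both resources.
index_endpoint.undeploy_index(deployed_index_id=DEPLOYED_INDEX_ID)
index_endpoint.delete()
my_index.delete()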
