Image-based song matching using AWS services

Published in AI2 Labs, May 31, 2019

When we first heard about the AI Hackathon by Amazon, we were excited and decided to take part. As machine learning practitioners, we were keen to try the APIs that Amazon provides. Amazon Web Services (AWS) brings computer vision, natural language processing, speech recognition, text-to-speech, and machine translation within the reach of every developer. The hackathon required us to use more than two services, so the first round of brainstorming was about which API services to use in our project. Since we had mainly worked with image and video processing in the past, we selected Amazon Rekognition to extract metadata from an image and Amazon Comprehend to identify insights and relationships in that metadata. The next round of brainstorming led us to the idea of an application that matches songs to an input image: “You provide images, we match songs for you!” This is the background story of our AWS Hackathon project.

Motivation

You can find the hackathon here. The main objective is to build intelligent applications with AWS serverless services such as AWS Lambda and pre-trained machine learning services such as Amazon Comprehend, Amazon Transcribe, Amazon Polly, and Amazon Rekognition. Our idea is to describe pictures with songs. We use Amazon Rekognition to describe what can be seen in the input images and Amazon Comprehend to tell the emotions of these images. Finally, we match songs using Amazon Elasticsearch Service and a song lyrics dataset. I will explain the details in the following sections.

Main Architecture

When the user uploads an image, there are two processes involved:

First, the image is sent to our server for visual sentiment analysis.
Second, once the upload completes, it triggers a Lambda function that performs object identification and facial analysis using Amazon Rekognition.

Both processes produce descriptive words (including synonyms) in JSON format. We call this JSON file the image’s feature identifier. An index library that we built beforehand in Amazon S3 storage is used to match song lyrics against the image’s feature identifier. The server then uses the YouTube API to look for open-licensed videos of the matched songs and embeds them in our web page so the user can listen to them.
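The YouTube lookup can be sketched as follows. This is a minimal illustration, not our exact code: it assumes the YouTube Data API v3 `search.list` endpoint, where `videoLicense=creativeCommon` restricts results to Creative Commons videos. The function names and the `apiKey` argument are hypothetical.

```javascript
// Build a YouTube Data API v3 search URL for one matched song.
// `apiKey` is a placeholder for your own API key (assumption for this sketch).
function buildYouTubeSearchUrl(title, artist, apiKey) {
  const params = new URLSearchParams({
    part: 'snippet',
    q: `${title} ${artist}`,
    type: 'video',                  // required when filtering by videoLicense
    videoLicense: 'creativeCommon', // only open-licensed videos
    videoEmbeddable: 'true',        // only videos that may be embedded
    maxResults: '1',
    key: apiKey,
  });
  return `https://www.googleapis.com/youtube/v3/search?${params.toString()}`;
}

// Pick the first video id out of a search.list response, or null if none matched.
function firstVideoId(searchResponse) {
  const items = searchResponse.items || [];
  return items.length > 0 ? items[0].id.videoId : null;
}

// URL that can be used as the `src` of an <iframe> on the result page.
function embedUrl(videoId) {
  return `https://www.youtube.com/embed/${videoId}`;
}
```

The server would fetch the search URL, pass the parsed JSON to `firstVideoId`, and render an iframe pointing at `embedUrl(...)`.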

This way, the user can find songs that match their feelings with just an image, because feelings are sometimes hard to express in words.

Input process and storage

The frontend is a simple page based on Node.js. Its main function lets users upload their images to Amazon S3.

First, we need to set up the storage:

In the Amazon S3 console, create an Amazon S3 bucket that you will use to store the photos in the album. Make sure you have both Read and Write permissions on Objects.

Second, configure the AWS SDK for JavaScript by adding the following code to your JavaScript file. For the region, accessKeyId and secretAccessKey fields, check out the following tutorial to find out how to obtain these values.

// Load the SDK for JavaScript
var AWS = require('aws-sdk');
// Set the region
AWS.config.update({
  region: 'XXXXXXX',
  accessKeyId: 'XXXXXXX',
  secretAccessKey: 'XXXXXXXXXX'
});

After that, the image upload itself follows the approach described in the following article.
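For readers who want a rough idea of the upload step, here is a minimal sketch. It assumes the AWS SDK for JavaScript v2, where `s3.upload(params).promise()` resolves with the object's `Location`. The `uploads/` prefix, the timestamp naming scheme and the function names are assumptions for illustration only.

```javascript
// Build the parameter object accepted by s3.upload() in aws-sdk v2.
function buildUploadParams(bucket, fileName, body, contentType) {
  // Replace characters that are awkward in object keys, and prefix a timestamp
  // so two uploads of "photo.jpg" do not overwrite each other (assumed scheme).
  const safeName = fileName.replace(/[^\w.-]/g, '_');
  return {
    Bucket: bucket,
    Key: `uploads/${Date.now()}-${safeName}`,
    Body: body,
    ContentType: contentType,
  };
}

// `s3` is expected to be an AWS.S3 instance; it is passed in so the function
// can be exercised with a stub instead of a real bucket.
async function uploadImage(s3, bucket, fileName, body, contentType) {
  const params = buildUploadParams(bucket, fileName, body, contentType);
  const result = await s3.upload(params).promise();
  return { key: params.Key, location: result.Location };
}
```

With the configuration above in place, this would be called as `uploadImage(new AWS.S3(), 'my-bucket', file.name, file.data, file.mimetype)`, where the bucket name and the `file` object are placeholders.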

Image process and recognition

When a user uploads an image to the server, the server triggers an AWS Lambda function to process the image. In this processing step, we use the Amazon Rekognition service and visual sentiment analysis.
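When the Lambda function is triggered by an S3 upload, it receives an event that lists the new objects. The sketch below shows one way to read the bucket and key from that event. The helper name is hypothetical, and the decoding step reflects the fact that S3 URL-encodes object keys in its notifications.

```javascript
// Extract bucket/key pairs from an S3 "ObjectCreated" event delivered to Lambda.
function uploadedObjects(event) {
  return (event.Records || []).map((record) => ({
    bucket: record.s3.bucket.name,
    // Keys arrive URL-encoded, with "+" standing in for spaces.
    key: decodeURIComponent(record.s3.object.key.replace(/\+/g, ' ')),
  }));
}
```

A Lambda handler could then loop over `uploadedObjects(event)` and pass each bucket and key to the image analysis.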

With Amazon Rekognition, we focused on object identification and facial analysis. With object identification, we can identify thousands of objects (e.g. bike, telephone, building) and scenes (e.g. parking lot, beach, city).

Object Identification. Image from https://aws.amazon.com/rekognition/
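As a rough sketch of the object identification call, the snippet below builds a `detectLabels` request for an image stored in S3 and reduces the response to a list of words. The `MaxLabels` and `MinConfidence` values are assumptions chosen for illustration, not the settings we tuned.

```javascript
// Request parameters for rekognition.detectLabels() in aws-sdk v2.
function buildDetectLabelsParams(bucket, key) {
  return {
    Image: { S3Object: { Bucket: bucket, Name: key } },
    MaxLabels: 10,     // assumed cap on the number of labels
    MinConfidence: 80, // assumed threshold, in percent
  };
}

// Reduce a detectLabels response to lowercase label names, e.g. "beach".
function labelNames(response) {
  return (response.Labels || []).map((label) => label.Name.toLowerCase());
}

// `rekognition` is expected to be an AWS.Rekognition instance.
async function describeObjects(rekognition, bucket, key) {
  const params = buildDetectLabelsParams(bucket, key);
  const response = await rekognition.detectLabels(params).promise();
  return labelNames(response);
}
```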

With facial analysis, we can analyze the attributes of faces in images we provide to determine things like happiness, age range, eyes open, glasses, facial hair, etc.

Facial Analysis. Image from https://aws.amazon.com/rekognition/
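The facial analysis step can be sketched in the same spirit. With `Attributes: ['ALL']`, `detectFaces` returns a list of emotions with confidence scores for every face. The helper below keeps only the most confident emotion per face, which is one possible reduction and an assumption of this sketch.

```javascript
// Request parameters for rekognition.detectFaces(); 'ALL' asks for emotions,
// age range, glasses and the other facial attributes.
function buildDetectFacesParams(bucket, key) {
  return {
    Image: { S3Object: { Bucket: bucket, Name: key } },
    Attributes: ['ALL'],
  };
}

// For each detected face, keep the emotion with the highest confidence.
function dominantEmotions(response) {
  return (response.FaceDetails || [])
    .map((face) => {
      const emotions = face.Emotions || [];
      if (emotions.length === 0) return null;
      const top = emotions.reduce((best, e) => (e.Confidence > best.Confidence ? e : best));
      return top.Type.toLowerCase();
    })
    .filter((type) => type !== null);
}
```

The response would come from `rekognition.detectFaces(buildDetectFacesParams(bucket, key)).promise()`, with `rekognition` being an `AWS.Rekognition` instance.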

For visual sentiment analysis, we used a model called DeepSentiBank, which felt like black magic to us. It describes an image in terms of the emotions it conveys. Here are some examples of DeepSentiBank.

If we combine object identification, facial analysis, and visual sentiment analysis, we get many descriptive words for each input image. Based on these words, we can match the image to songs through their lyrics.
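Combining the three sources could look like the sketch below. The shape of the resulting JSON (an `image` key plus a `keywords` array) is an assumption for illustration, as is the idea that each analysis step hands over a plain list of words.

```javascript
// Merge the words from all three analyses into one de-duplicated keyword list.
function buildFeatureIdentifier(imageKey, labels, emotions, sentimentConcepts) {
  const keywords = [...labels, ...emotions, ...sentimentConcepts]
    .map((word) => word.trim().toLowerCase())
    .filter((word) => word.length > 0);
  // A Set keeps the first occurrence of each word and preserves insertion order.
  return { image: imageKey, keywords: [...new Set(keywords)] };
}
```

Calling `JSON.stringify(buildFeatureIdentifier(...))` gives a compact document that can be stored or sent to the matching step.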

Lyrics process and match

In this section, we focus on lyrics data processing and song matching.

We collected lyrics data from Kaggle. The dataset contains more than 380,000 lyrics from many different artists and genres, arranged by year. The structure is artist/year/song. This quantity is sufficient for our case.

When the lyrics data is uploaded to Amazon S3, a Lambda function is triggered to process the data.

Amazon Comprehend was used to extract key phrases, sentiments, entities and topics from each lyric.
Amazon Elasticsearch Service was used to store these keywords together with the song titles and artists.
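To make the two steps above more concrete, here is a minimal sketch of turning one song into a searchable document. It assumes the AWS SDK v2 Comprehend calls `detectKeyPhrases` and `detectSentiment`. The document fields, the 0.9 score threshold and the 4,500-character cut-off (a conservative guess to stay under Comprehend's per-request size limit) are assumptions, and entities and topics are left out for brevity.

```javascript
// Combine Comprehend responses into the document we would index for one song.
function toSongDocument(song, keyPhrasesResponse, sentimentResponse) {
  return {
    title: song.title,
    artist: song.artist,
    sentiment: sentimentResponse.Sentiment.toLowerCase(),
    keywords: keyPhrasesResponse.KeyPhrases
      .filter((phrase) => phrase.Score >= 0.9) // assumed confidence threshold
      .map((phrase) => phrase.Text.toLowerCase()),
  };
}

// `comprehend` is expected to be an AWS.Comprehend instance.
async function analyzeSong(comprehend, song) {
  // Truncate long lyrics; the cut-off is an assumption of this sketch.
  const request = { Text: song.lyrics.slice(0, 4500), LanguageCode: 'en' };
  const [phrases, sentiment] = await Promise.all([
    comprehend.detectKeyPhrases(request).promise(),
    comprehend.detectSentiment(request).promise(),
  ]);
  return toSongDocument(song, phrases, sentiment);
}
```

The returned object is what would then be written to the Elasticsearch index.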

Once we have the descriptions of the input image, they are used as the query for Elasticsearch. The search returns the titles and artists of the most similar songs, and we show only the top 3 results on the final page. Here is a demo video of our project.
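A query of this kind can be sketched with the standard Elasticsearch query DSL. The `keywords`, `title` and `artist` field names are assumptions about the index layout, and a plain `match` query is only one of several ways to score similarity.

```javascript
// Build a search body that matches image keywords against indexed song keywords.
function buildSongQuery(imageKeywords, size = 3) {
  return {
    size, // only the top results are needed
    _source: ['title', 'artist'],
    query: {
      match: {
        // Any matching word contributes to the relevance score.
        keywords: { query: imageKeywords.join(' '), operator: 'or' },
      },
    },
  };
}

// Reduce an Elasticsearch search response to title, artist and score.
function topSongs(searchResponse) {
  return searchResponse.hits.hits.map((hit) => ({
    title: hit._source.title,
    artist: hit._source.artist,
    score: hit._score,
  }));
}
```

The body would be sent to the index's `_search` endpoint, and `topSongs` applied to the JSON that comes back.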

Demo

Conclusion

All in all, we are pretty satisfied with our project, even though we faced a few hiccups during the integration process. The current application can easily be extended by integrating other machine learning algorithms to provide a whole new level of services and experience to users. We have learned a lot from this project and are looking forward to more APIs from Amazon in the future.

Written by Hongze