Building an ML-powered application: emoji2text

Purgen
6 min read · Oct 23, 2023

Introduction

Hello, my name is Vas. I am a student at Harbour.Space, pursuing a Master’s degree in Data Science, and I want to share a project I worked on during my studies. This article isn’t scientific research or a groundbreaking discovery; it aims to be practical and informative for Python interns and data-science beginners.

prepare to be emojized!

Task

Our project involved designing an MVP that uses an LLM to recognize songs based on emoji sequences.

You can find every step of our work at github.com/aguschin/lyrics2emoji.

The First Brainstorming

Team

Our team consisted of several members, each with a specific role.

Philip

Philip focused on designing the testing environment for our project.

Vas (me)

I was responsible for testing and metrics. I needed to understand which approaches and changes improved or hindered the search results.

Anton

Anton handled data preparation and representations. He created reference data using the OpenAI API.

Nicolas

Nicolas was in charge of data scraping and the web interface for the emoji2text application. He created a gamified web page for guessing songs from emoji sequences, offering three answer options.

Ming

Ming worked on integrating the Spotify API into the web interface to personalize the user experience. The application suggested the most popular songs for authorized users, aiming to enhance song recognition based on emoji sequences.

1st Week

We started our project by exploring our teacher’s Kaggle notebook.

kaggle.com/code/aguschin/lyrics-to-emoji

The teacher also configured a remote server for working with pre-trained, non-optimized models, so we had to set up a working environment using Jupyter Lab/Notebook. We experimented with less popular LLMs, including mpt-7b, dolly-v2-3b, and Mistral-7B-v0.1, hoping to find a highly specialized model that would produce better results than the classic distilBERT.

To gauge translation quality, we wrote our first metric based on cosine similarity, creating the embeddings from the first four lines of each song’s lyrics.

Cosine similarity function
Embed function
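
The original helpers are shown above as screenshots. Here is a minimal sketch of what they could look like, assuming mean-pooled distilBERT embeddings (the model and pooling choices are assumptions, not necessarily what the screenshots contain):

```python
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

# Assumption: distilbert-base-uncased with mean pooling; the project's
# actual model and pooling strategy may differ.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

def embed(text: str) -> np.ndarray:
    """Mean-pooled embedding of a text."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0).numpy()

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```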

Unfortunately, the similarity stayed below 0.75, which was not promising; those results don’t come close to distilBERT’s 0.96.

To make sure the embeddings were generated correctly, we ran the text search against the texts’ own embeddings, expecting a similarity of exactly one.

Text search checking
Match plotting
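
A minimal sketch of that sanity check, reusing the embed and cosine_similarity helpers sketched above (the corpus here is a placeholder):

```python
# Placeholder corpus; the project used the first four lines of each song.
lyrics = ["first four lines of song A ...", "first four lines of song B ..."]
vectors = [embed(text) for text in lyrics]

for i, query in enumerate(vectors):
    scores = [cosine_similarity(query, v) for v in vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    print(i, best, scores[best])  # expect best == i with score ~ 1.0
```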

At first glance, the plot seems to show negative match values. But if you pay attention to the scale, it becomes clear that the values all cluster around one.

We then decided to look at what we were actually feeding into the embeddings and were unpleasantly surprised: because of the encoding, all emojis looked almost identical to the model.

As a result, by the end of the first week, we had learned that we should stick with distilBERT and find a better approach to our metrics.

distilBERT

2nd Week

We found another approach to evaluating translation accuracy: we moved from plain cosine similarity to an AnnoyIndex for both translation and search evaluation. AnnoyIndex let us retrieve the nearest neighbors of a query embedding.
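
A minimal sketch of building such an index with the annoy library; the dimensionality, metric, and tree count below are assumptions:

```python
import numpy as np
from annoy import AnnoyIndex

dim = 768  # assumption: distilBERT embedding size
index = AnnoyIndex(dim, "angular")  # angular distance tracks cosine similarity

# Placeholder corpus of song embeddings
song_embeddings = [np.random.rand(dim).astype("float32") for _ in range(2100)]
for i, vec in enumerate(song_embeddings):
    index.add_item(i, vec)
index.build(10)  # 10 trees: more trees give better accuracy, a bigger index

# Position of the true song among the neighbors of an emoji-sequence embedding
query = np.random.rand(dim).astype("float32")
neighbors = index.get_nns_by_vector(query, len(song_embeddings))
true_song_id = 42  # hypothetical id of the correct song
position = neighbors.index(true_song_id)
```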

Then we used each emoji’s metadata (its name and description from a custom CSV file) instead of the raw emoji characters.

Get emoji metadata
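
A minimal sketch of that lookup, assuming a hypothetical emoji_metadata.csv with emoji, name, and description columns (the project’s actual file layout may differ):

```python
import pandas as pd

# Hypothetical file and column names
emoji_meta = pd.read_csv("emoji_metadata.csv")
meta_by_emoji = {
    row["emoji"]: f'{row["name"]} {row["description"]}'
    for _, row in emoji_meta.iterrows()
}

def emoji_to_text(sequence: str) -> str:
    """Replace each emoji in a sequence with its name and description."""
    return " ".join(meta_by_emoji[ch] for ch in sequence if ch in meta_by_emoji)
```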

At that point, the target song’s average position among the neighbors was at the 30.9% mark (roughly 650th out of 2100 in total). That was better than random (50%), but still far from ideal.

I needed to expand the metric to understand whether I was moving in the right direction, so I turned to improving the lyrics search code. For that, I needed a set of “verified” translated emoji sequences, and the fastest way to generate them was ChatGPT. So I made 10 emoji sequences from the first lines of the lyrics.

Translation results from ChatGPT

But how could I extract the metadata now? I went looking for a solution and found the emoji library.

Emoji parsing
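
A minimal sketch of that parsing step with the emoji library’s demojize function (the exact names it returns can vary between library versions):

```python
import emoji

sequence = "🔥🌙"
# demojize replaces each emoji with its textual name; the delimiters
# argument controls what wraps each name.
text = emoji.demojize(sequence, delimiters=(" ", " "))
# -> " fire  crescent_moon "
```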

At this point, I received updated data for the songs from Anton and was now working with 300 entries. I played around with separators and concluded that dots and underscores affected the results the best.

Search step logging

Now it looked a bit better (25%, 75th out of 300). I suppose I was moving in the right direction.

Our plan for the last week was to obtain emoji sequences from song lyrics using the ChatGPT API and to address issues with phrases being translated as one emoji, or one word as multiple emojis.

3rd Week

We started using embeddings from the OpenAI API to set a quality standard against which to compare our own translation and search quality.
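
A minimal sketch of fetching such reference embeddings with the current openai Python client; the model name below is an assumption, as the article doesn’t specify one:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def openai_embed(text: str) -> list[float]:
    """Fetch a reference embedding from the OpenAI API."""
    response = client.embeddings.create(
        model="text-embedding-ada-002",  # assumption: the article doesn't name a model
        input=text,
    )
    return response.data[0].embedding
```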

Additionally, I looked at the cosine similarity of the top neighbors and found that the average cosine value was 0.72. I realized that the top results often contained songs that were corrupted in some way, such as lines consisting of single symbols or interjections.

To further improve the quality standard, I decided to clean the text by removing words that our translator could not interpret as emojis. This refinement improved our average search position (14%, 42nd out of 300).

Clean text function
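
A minimal sketch of such a cleaning step; the word_to_emoji mapping is hypothetical and stands in for whatever vocabulary the translator actually covers:

```python
def clean_text(text: str, word_to_emoji: dict[str, str]) -> str:
    """Drop words the translator cannot map to an emoji before embedding."""
    return " ".join(w for w in text.lower().split() if w in word_to_emoji)
```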

We concluded that preprocessing improvements were necessary before generating embeddings.

Outro

I have only talked about my small part of the work. The result of the whole team’s work, a demo of the web application, will be available for a while:

guess-a-song.streamlit.app

Our project is still a work in progress. We have several tasks on our TO-DO list, such as:

  • tracking monthly users
  • tracking the average number of games played
  • tracking error rates
  • implementing search across the entire song
  • setting up Grafana for data collection and display
  • implementing CI/CD
  • adding a share button and tracking the share ratio

The Last Brainstorming

We also plan to introduce two leaderboards for the two game modes (guessing and explaining) and to fine-tune the embeddings based on the weight of players’ choices.

This is where the tip of the iceberg ends. Thank you for your time, and good luck in your own endeavors!

github.com/aguschin/lyrics2emoji
