How to add text similarity to your applications easily using MediaPipe and Python

Juan Guillermo Gómez Torres
Google Developer Experts
4 min read · Nov 1, 2023
Machine learning is everywhere when you use a mobile app: recommending people or products, detecting people or tags in an image, and so on.

Recommending means finding similar items so the user can discover and consume something new.

Embeddings are an essential component when you need to measure similarity, and they are used for:

  • Semantic search
  • Clustering
  • Recommendations
  • Anomaly detection
  • Classification

An embedding is a numerical representation or a vector of floating point numbers, like this:

"embedding": [
-0.006929283495992422,
-0.005336422007530928,
...
-4.547132266452536e-05,
-0.024047505110502243
]

Now you can perform operations between two embeddings, for example, measuring the distance between them. A small distance suggests a strong relationship, and a large distance suggests a weak one. The three most commonly used measures are listed below:

  • Euclidean Distance — Distance between ends of vectors
  • Cosine — Cosine of the angle between vectors
  • Dot product — Cosine multiplied by lengths of both vectors

How do you get embeddings? This is our next question. There are some libraries and APIs to get embeddings like the following:

  • MediaPipe
  • Vertex AI client libraries
  • PaLM API by Google
  • OpenAI API
  • SentenceTransformers
  • FastText by Facebook

In the next section, we'll talk about MediaPipe.

MediaPipe Solutions

MediaPipe Solutions provides a set of libraries and tools to apply machine learning (ML) to your applications quickly. You can use SDKs for multiple platforms, including Android, Python, and the Web, to get your machine learning up and running in minutes. Additionally, you can customize models using transfer learning. MediaPipe offers libraries and resources to use in your applications, as well as tools to customize and evaluate new solutions.

MediaPipe also offers a library for getting image and text embeddings; in our case, we use it for text. The MediaPipe documentation recommends the Universal Sentence Encoder model.

How to get embeddings with MediaPipe

We are using the SDK for Python to get embeddings.

These are the steps:

  • Install the MediaPipe Python package
pip install mediapipe
  • Download the model and store it in your project [download model].
  • Then you can get text embeddings like this:
Embeddings with MediaPipe
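The embedded snippet from the original post is not recoverable here, so below is a minimal sketch of the MediaPipe Text Embedder API in Python. The model filename is an assumption; use the path where you stored the downloaded model. Setting quantize=True produces integer vectors like the ones shown in the example further down; omit it to get floating-point embeddings. This sketch requires the downloaded model file, so it is not runnable standalone.

```python
from mediapipe.tasks import python
from mediapipe.tasks.python import text

# Assumed path to the downloaded Universal Sentence Encoder model
MODEL_PATH = "universal_sentence_encoder.tflite"

base_options = python.BaseOptions(model_asset_path=MODEL_PATH)
# quantize=True returns integer (uint8) embeddings, as in the example below
options = text.TextEmbedderOptions(base_options=base_options, quantize=True)

with text.TextEmbedder.create_from_options(options) as embedder:
    result = embedder.embed("How's it going?")
    print(result.embeddings[0].embedding)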

Okay, the next step would be to determine the level of similarity between the sentences.

Let’s look at an example with two sentences and their embeddings, something like this:

Sentences:

{
"text1":"How's it going?",
"text2":"I am fine"
}

Embeddings:

{
"text1embedding": "[127 16 185 127 82 127 128 50 127 127 172 10 127 128 127 127 7 160\n 128 128 128 90 127 238 70 127 246 128 127 127 170 128 182 185 9 76\n 154 196 4 42 136 127 127 127 128 28 151 127 127 4 135 127 80 157\n 77 90 113 41 15 127 128 167 127 83 1 127 217 60 128 90 255 2\n 161 232 24 171 127 9 55 12 127 210 127 87 181 79 127 88 128 124\n 128 7 128 128 128 19 127 127 250 145]",
"text2embedding": "[127 44 209 127 35 127 128 128 127 81 176 26 127 128 127 127 242 180\n 139 128 128 127 127 147 126 127 230 128 127 127 200 137 128 9 65 70\n 217 128 22 124 142 127 118 127 194 131 128 127 110 245 142 127 127 151\n 127 50 67 61 248 127 128 128 127 36 216 127 218 106 151 78 20 223\n 182 189 222 233 127 1 76 11 127 253 127 33 186 127 127 235 128 121\n 128 4 128 128 175 187 127 87 228 141]"
}

How to get the distance between two embeddings

There are three well-known methods for obtaining the similarity between embeddings:

  • Euclidean Distance — Distance between ends of vectors

The L2 norm calculates the distance of a vector from the origin of the vector space; for this reason, we can calculate the Euclidean distance between two embeddings with Python and the NumPy library by applying the norm function to their difference.

Euclidean Distance with Python and Numpy
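A minimal sketch of this calculation, using two small illustrative vectors in place of real embeddings:

```python
import numpy as np

# Two small example vectors (real embeddings have ~100 dimensions)
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

# Euclidean distance is the L2 norm of the difference vector
distance = np.linalg.norm(u - v)
print(distance)  # ~5.196
```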
  • Dot product — Cosine multiplied by lengths of both vectors
https://datahacker.rs/dot-product-inner-product/

The dot product is a simple vector operation in which we multiply the vectors element by element and then add up those products. We can perform this operation with Python and the NumPy library using the dot function.

Dot product operation with Python and Numpy
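A minimal sketch of the dot product with NumPy, using the same illustrative vectors:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

# Element-wise multiplication followed by a sum: 1*4 + 2*5 + 3*6
dot = np.dot(u, v)
print(dot)  # 32.0
```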
  • Cosine — Cosine of the angle between vectors
https://medium.com/nerd-for-tech/a-comparison-of-cosine-similarity-vs-euclidean-distance-in-als-recommendation-engine-51898f9025e7

We can perform this operation with Python and the NumPy library, something like this:

Cosine Similarity
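A minimal sketch of cosine similarity with NumPy: the dot product divided by the product of the two vector norms.

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

# Cosine of the angle between the vectors, in [-1, 1]
cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(cos_sim)  # ~0.9746
```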

Although it is easy to calculate the similarity with NumPy, with MediaPipe it is even easier, because you only need to call one method, like this:

MediaPipe cosine similarity
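A sketch of that single call, assuming result1 and result2 are TextEmbedderResult objects returned by earlier embedder.embed() calls (so this fragment is not runnable on its own):

```python
from mediapipe.tasks.python import text

# result1 and result2 are assumed to come from embedder.embed(...)
similarity = text.TextEmbedder.cosine_similarity(
    result1.embeddings[0], result2.embeddings[0]
)
print(similarity)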

So let’s look at a complete example.

Embeddings and cosine similarity with MediaPipe
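The complete example embedded in the original post is not recoverable here; a sketch of what it likely looks like end to end, assuming the model has been downloaded to the path below:

```python
from mediapipe.tasks import python
from mediapipe.tasks.python import text

# Assumed path to the downloaded Universal Sentence Encoder model
MODEL_PATH = "universal_sentence_encoder.tflite"

base_options = python.BaseOptions(model_asset_path=MODEL_PATH)
options = text.TextEmbedderOptions(base_options=base_options)

with text.TextEmbedder.create_from_options(options) as embedder:
    # Embed the two sentences from the earlier example
    result1 = embedder.embed("How's it going?")
    result2 = embedder.embed("I am fine")

    # Cosine similarity in [-1, 1]; closer to 1 means more similar
    similarity = text.TextEmbedder.cosine_similarity(
        result1.embeddings[0], result2.embeddings[0]
    )
    print(similarity)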

Juan Guillermo Gómez Torres
Google Developer Experts

Tech Lead in Wordbox. @GDGCali Founder. @DevHackCali Founder. Firebase & GCPcloud & Kotlin & ODML @GoogleDevExpert . Android lover.