How to add text similarity to your applications easily using MediaPipe and Python

Juan Guillermo Gómez Torres
Google Developer Experts
4 min read · Nov 1, 2023
Machine learning is everywhere when you use a mobile app: recommending people or products, detecting people or tags in an image, and so on.

Recommending means finding similar items so the user can discover and consume something new.

Embeddings are an essential component when you need to measure similarity, and they are used for:

  • Semantic search
  • Clustering
  • Recommendations
  • Anomaly detection
  • Classification

An embedding is a numerical representation or a vector of floating point numbers, like this:

"embedding": [
-0.006929283495992422,
-0.005336422007530928,
...
-4.547132266452536e-05,
-0.024047505110502243
]

Now you can perform operations between two embeddings, for example, measuring the distance between them. A small distance suggests a strong relationship, and a large distance suggests a weak one. The three most commonly used measures are listed below:

  • Euclidean Distance — Distance between ends of vectors
  • Cosine — Cosine of the angle between vectors
  • Dot product — Cosine multiplied by lengths of both vectors

How do you get embeddings? This is our next question. There are some libraries and APIs to get embeddings like the following:

  • MediaPipe
  • Vertex AI client libraries
  • PaLM API by Google
  • OpenAI API
  • SentenceTransformers
  • FastText by Facebook

In the next section, we'll talk about MediaPipe.

MediaPipe Solutions

MediaPipe Solutions provides a set of libraries and tools to apply machine learning (ML) to your applications quickly. You can use SDKs for multiple platforms, including Android, Python, and the Web, to get your machine learning up and running in minutes. Additionally, you can customize models using transfer learning. MediaPipe offers libraries and resources to use in your applications, as well as tools to customize and evaluate new solutions.

MediaPipe also offers a library for getting image and text embeddings; in our case, we use it for text. The MediaPipe documentation recommends the Universal Sentence Encoder model.

How to get embeddings with MediaPipe

We are using the SDK for Python to get embeddings.

These are the steps:

  • Install the MediaPipe Python package
pip install mediapipe
  • Download the model and store it in your project [download model].
  • Then you can get text embeddings like this:
Embeddings with MediaPipe
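The embedded snippet from the original post is not recoverable here, so below is a minimal sketch of the MediaPipe Text Embedder API in Python. The model filename is an assumption; use the path where you stored the downloaded model. Setting quantize=True produces integer vectors like the ones shown in the example further down; omit it to get floating-point embeddings. This sketch requires the downloaded model file, so it is not runnable standalone.

```python
from mediapipe.tasks import python
from mediapipe.tasks.python import text

# Assumed path to the downloaded Universal Sentence Encoder model
MODEL_PATH = "universal_sentence_encoder.tflite"

base_options = python.BaseOptions(model_asset_path=MODEL_PATH)
# quantize=True returns integer (uint8) embeddings, as in the example below
options = text.TextEmbedderOptions(base_options=base_options, quantize=True)

with text.TextEmbedder.create_from_options(options) as embedder:
    result = embedder.embed("How's it going?")
    print(result.embeddings[0].embedding)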

Okay, the next step would be to determine the level of similarity between the sentences.

Let’s look at an example with two sentences and their embeddings, something like this:

Sentences:

{
"text1":"How's it going?",
"text2":"I am fine"
}

Embeddings:

{
"text1embedding": "[127 16 185 127 82 127 128 50 127 127 172 10 127 128 127 127 7 160\n 128 128 128 90 127 238 70 127 246 128 127 127 170 128 182 185 9 76\n 154 196 4 42 136 127 127 127 128 28 151 127 127 4 135 127 80 157\n 77 90 113 41 15 127 128 167 127 83 1 127 217 60 128 90 255 2\n 161 232 24 171 127 9 55 12 127 210 127 87 181 79 127 88 128 124\n 128 7 128 128 128 19 127 127 250 145]",
"text2embedding": "[127 44 209 127 35 127 128 128 127 81 176 26 127 128 127 127 242 180\n 139 128 128 127 127 147 126 127 230 128 127 127 200 137 128 9 65 70\n 217 128 22 124 142 127 118 127 194 131 128 127 110 245 142 127 127 151\n 127 50 67 61 248 127 128 128 127 36 216 127 218 106 151 78 20 223\n 182 189 222 233 127 1 76 11 127 253 127 33 186 127 127 235 128 121\n 128 4 128 128 175 187 127 87 228 141]"
}

How to get the distance between two embeddings

There are three well-known methods for obtaining the similarity between embeddings:

  • Euclidean Distance — Distance between ends of vectors

The L2 norm calculates the distance of a vector from the origin of the vector space; for this reason, we can calculate the Euclidean distance between two embeddings with Python and the NumPy library by applying the norm function to their difference.

Euclidean Distance with Python and Numpy
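A minimal sketch of this calculation, using two small illustrative vectors in place of real embeddings:

```python
import numpy as np

# Two small example vectors (real embeddings have ~100 dimensions)
u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

# Euclidean distance is the L2 norm of the difference vector
distance = np.linalg.norm(u - v)
print(distance)  # ~5.196
```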
  • Dot product — Cosine multiplied by lengths of both vectors
https://datahacker.rs/dot-product-inner-product/

The dot product is a simple vector operation in which we multiply the vectors element by element and then add up those products. We can perform this operation with Python and the NumPy library using the dot function.

Dot product operation with Python and Numpy
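A minimal sketch of the dot product with NumPy, using the same illustrative vectors:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

# Element-wise multiplication followed by a sum: 1*4 + 2*5 + 3*6
dot = np.dot(u, v)
print(dot)  # 32.0
```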
  • Cosine — Cosine of the angle between vectors
https://medium.com/nerd-for-tech/a-comparison-of-cosine-similarity-vs-euclidean-distance-in-als-recommendation-engine-51898f9025e7

We can perform this operation with Python and the NumPy library, something like this:

Cosine Similarity
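A minimal sketch of cosine similarity with NumPy: the dot product divided by the product of the two vector norms.

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

# Cosine of the angle between the vectors, in [-1, 1]
cos_sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
print(cos_sim)  # ~0.9746
```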

Although it is easy to calculate the similarity with NumPy, with MediaPipe it is even easier, because you only need to call one method, like this:

MediaPipe cosine similarity
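A sketch of that single call, assuming result1 and result2 are TextEmbedderResult objects returned by earlier embedder.embed() calls (so this fragment is not runnable on its own):

```python
from mediapipe.tasks.python import text

# result1 and result2 are assumed to come from embedder.embed(...)
similarity = text.TextEmbedder.cosine_similarity(
    result1.embeddings[0], result2.embeddings[0]
)
print(similarity)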

So let’s look at a complete example.

Embeddings and cosine similarity with MediaPipe
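The complete example embedded in the original post is not recoverable here; a sketch of what it likely looks like end to end, assuming the model has been downloaded to the path below:

```python
from mediapipe.tasks import python
from mediapipe.tasks.python import text

# Assumed path to the downloaded Universal Sentence Encoder model
MODEL_PATH = "universal_sentence_encoder.tflite"

base_options = python.BaseOptions(model_asset_path=MODEL_PATH)
options = text.TextEmbedderOptions(base_options=base_options)

with text.TextEmbedder.create_from_options(options) as embedder:
    # Embed the two sentences from the earlier example
    result1 = embedder.embed("How's it going?")
    result2 = embedder.embed("I am fine")

    # Cosine similarity in [-1, 1]; closer to 1 means more similar
    similarity = text.TextEmbedder.cosine_similarity(
        result1.embeddings[0], result2.embeddings[0]
    )
    print(similarity)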

Juan Guillermo Gómez Torres
Google Developer Experts

Tech Lead in Wordbox. @GDGCali Founder. @DevHackCali Founder. Firebase & GCPcloud & Kotlin & ODML @GoogleDevExpert . Android lover.