The Ideal Model Concept: Embeddings

Embeddings are a core part of modern AI models and their main way of representing “context”. Whether that context is expressed as text, image, sound, or a combination of all three, its essence remains unchanged.

Say you have an idea or a scenario in your mind. You could represent it as self-talk (text, regardless of the language), picture it in front of you (image or video), or use any other medium. In this case, the idea is the context, and that is what the ideal embedding is trying to represent.

Example: Say there is a context that I want to deliver to you. Here are three mediums, each of which can represent most of it.

  • Text: “A cozy reading room with a fireplace while it’s raining outside”
  • Image: [generated image]
  • Audio: [audio clip]

As you can see, each medium delivers the context differently. Each one narrows down uncertainties and completes more of the picture in your mind, which deepens your understanding of the context. However, there is still a huge gap between observing these representations and sensing the context in reality, because experiencing a situation firsthand conveys much deeper information than any representation of it.

The Main Idea

The Ideal Embedding Model Philosophy: A model that can completely understand a context from any medium and represent it perfectly, whether that medium is text (perfect storytelling), an image (perfect photography), or any other.
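To make the idea a little more concrete, here is a minimal toy sketch in Python. It assumes a shared embedding space in which the same context, delivered through different mediums, lands on nearby vectors. Every number below is made up by hand for illustration and does not come from any real model; the names `cozy_room` and `busy_street_text` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical, hand-written embeddings (NOT produced by any real model).
# Each vector stands for the same "cozy reading room" context,
# delivered through a different medium.
cozy_room = {
    "text":  [0.90, 0.80, 0.10, 0.70],
    "image": [0.85, 0.75, 0.15, 0.80],
    "audio": [0.80, 0.90, 0.05, 0.60],
}

# A different context, e.g. a busy street at noon (also made up).
busy_street_text = [0.10, 0.05, 0.95, 0.20]

same_context = cosine_similarity(cozy_room["text"], cozy_room["image"])
different_context = cosine_similarity(cozy_room["text"], busy_street_text)

print(f"text vs image, same context: {same_context:.3f}")
print(f"cozy room vs busy street:    {different_context:.3f}")
```

In this toy setup the text and image vectors of the same context score much closer to each other than two text vectors of different contexts do. An ideal embedding model, in the sense described above, would behave this way for every medium.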

That is simple, even ordinary, to say, but the true value of such a model is its ability to represent anything. Think of how great writers strive to deliver their ideas through stories, how directors try to shoot a movie that delivers the story’s idea perfectly, and how musicians convey ideas through sound. It is a concept worth reflecting on.

Other Concepts

Dimensionality: Have you ever struggled to communicate an idea effectively? Your ideas are multi-dimensional: they include background knowledge, imagination, experiences, and other information that together construct the idea. Text, however, is 1-dimensional, so it can barely convey the complete meaning unless it is written perfectly. Voice could be 2-dimensional if we count voice level and tone, or 3-dimensional if we also count body language. Each time we add a dimension, it becomes easier to deliver an idea correctly.

Compression: I am struggling to come up with a satisfying title for this article. It was hard enough to translate my multi-dimensional idea into text in the first place, let alone to set a title that represents it in fewer than 10 words. Such a compressed representation will not show the complete idea, but it will save time. It is like watching a movie review or explanation video: you get a distorted idea of the story, but in a fraction of the time.
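The trade-off can be sketched with a toy example. It assumes an idea is a small vector and that "compressing" it means keeping only its strongest components. This is only an analogy, not the compression method of any specific model; the vector values and the helper names `compress` and `retained_fraction` are hypothetical.

```python
def compress(vector, keep):
    """Keep the `keep` largest-magnitude components and zero out the rest."""
    ranked = sorted(range(len(vector)), key=lambda i: abs(vector[i]), reverse=True)
    kept = set(ranked[:keep])
    return [v if i in kept else 0.0 for i, v in enumerate(vector)]

def retained_fraction(original, compressed):
    """Share of the original squared length that survives compression."""
    total = sum(v * v for v in original)
    return sum(v * v for v in compressed) / total

# A made-up 6-dimensional "idea" (illustrative values only).
idea = [0.9, 0.1, 0.6, 0.3, 0.7, 0.2]

title = compress(idea, keep=2)    # very short: like a title
summary = compress(idea, keep=4)  # longer: like a review or summary

print(f"title keeps   {retained_fraction(idea, title):.0%} of the idea")
print(f"summary keeps {retained_fraction(idea, summary):.0%} of the idea")
```

The shorter the representation, the less of the original it keeps, and what remains is a distorted version of the full idea that is much cheaper to take in.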

