Sentence embeddings have a problem: why Dall-E2 sometimes fails
I got access to Dall-E2 a couple of weeks back and I have been amazed by the awesome images it creates. But while playing with Dall-E2 I couldn’t shake off the feeling that something was missing. Sometimes images come out exactly as if the model read your mind. Other times it just would not create the images you wanted.
My doubts increased when I saw the tweet below.
From the Dall-E2 paper we can clearly see that text encoding is the first step. The images are then generated with a diffusion step.
The diffusion step takes an embedding as its starting point and generates an image. That part looks like it works fine.
What if, in the scenarios where Dall-E2 is not able to produce the proper results, the issue is with the text encoding?
To test this hypothesis, I tried the following steps:
- Pick some problematic phrases (for example, “People playing cricket with a tennis bat”)
- Run variations of the phrases through a sentence similarity comparison using a Hugging Face model (https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
- If two sentences with different meanings get a very high similarity score, that means the sentence similarity model is not able to capture the essence of the sentences.
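The steps above can be sketched roughly as follows. This is my reconstruction, not the exact script behind the experiments: the helper names (`cosine_similarity`, `compare_pairs`, `load_minilm_encoder`, `main`) are made up for illustration, and I am assuming the `sentence-transformers` package is used to load the model linked above. The encoder is passed in as a plain function so the comparison logic can be tried without downloading anything.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Encoder = Callable[[List[str]], Sequence[Vector]]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(float(x) * float(y) for x, y in zip(a, b))
    norm_a = math.sqrt(sum(float(x) ** 2 for x in a))
    norm_b = math.sqrt(sum(float(y) ** 2 for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def compare_pairs(
    encode: Encoder, pairs: List[Tuple[str, str]]
) -> List[Tuple[str, str, float]]:
    """Embed both sentences of every pair and return their similarity."""
    results = []
    for first, second in pairs:
        vec_first, vec_second = encode([first, second])
        results.append((first, second, cosine_similarity(vec_first, vec_second)))
    return results


def load_minilm_encoder() -> Encoder:
    """Load the model from the steps above (downloads weights on first use)."""
    # Imported lazily so the helpers above work without the package installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    return model.encode


def main() -> None:
    # Sentence pair taken from the post; add your own variations here.
    pairs = [
        (
            "People playing cricket with a tennis bat",
            "People playing tennis with a cricket bat",
        ),
    ]
    encode = load_minilm_encoder()
    for first, second, score in compare_pairs(encode, pairs):
        print(f"{score:.3f}  {first!r} vs {second!r}")
```

Calling `main()` needs the `sentence-transformers` package installed and network access for the first model download. A score close to 1.0 for a pair that clearly means different things is the symptom I am looking for.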
From my simple experiments, it turns out that sentence embeddings have a huge problem. Sentences like “People playing cricket with a tennis bat” and “People playing tennis with a cricket bat” have very high similarity scores even though we know that they mean different things.
Several others have also found issues with negated sentences.
It appears that the sentence embeddings are not capturing negation in sentences. Negations don’t seem to affect sentence similarity scores, and since sentence embeddings are the core first step for all other tasks, the tasks built on top of them fail too.
Some more examples
So what meaning exactly are the vectors embedding?
It looks like the sentence embeddings capture only a very high-level meaning of the sentence, like “people playing something with something”. They do not capture information like what the people are playing and with what.
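One way to build intuition for this is a toy model, which I am adding purely as an illustration. Assume (and this is an assumption, not a description of how all-MiniLM-L6-v2 really works) that a sentence vector is just the average of fixed per-word vectors. The word vectors below are random stand-ins, and the function names are hypothetical. Under that assumption, word order is thrown away completely, so the two cricket/tennis sentences end up with the same vector. Real encoders are more sophisticated, so treat this as a sketch of the failure mode rather than an explanation of it.

```python
import math
import random
from typing import Dict, List


def toy_word_vectors(words: List[str], dim: int = 8, seed: int = 0) -> Dict[str, List[float]]:
    """Give each distinct word a fixed random vector (stand-in for learned embeddings)."""
    rng = random.Random(seed)
    return {
        word: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        for word in sorted(set(words))
    }


def mean_pooled_embedding(sentence: str, vectors: Dict[str, List[float]]) -> List[float]:
    """Average the word vectors of a sentence, ignoring word order entirely."""
    tokens = sentence.lower().split()
    dim = len(next(iter(vectors.values())))
    # math.fsum keeps the sum independent of the order of the terms.
    return [
        math.fsum(vectors[token][i] for token in tokens) / len(tokens)
        for i in range(dim)
    ]


sentence_a = "people playing cricket with a tennis bat"
sentence_b = "people playing tennis with a cricket bat"

vectors = toy_word_vectors((sentence_a + " " + sentence_b).split())
embedding_a = mean_pooled_embedding(sentence_a, vectors)
embedding_b = mean_pooled_embedding(sentence_b, vectors)

# Both sentences contain exactly the same words, only in a different order.
same_words = sorted(sentence_a.split()) == sorted(sentence_b.split())
same_vector = all(
    math.isclose(x, y, abs_tol=1e-12) for x, y in zip(embedding_a, embedding_b)
)
print("same words:", same_words)
print("same toy embedding:", same_vector)
```

In this toy setup, “what they are playing” and “with what” are simply not represented anywhere in the vector, which matches the behaviour I saw in the similarity scores.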
Unless the sentence embedding problem is solved, systems like Dall-E2 built on top of embeddings will have problems.