Simplifying Image Fusion with GPT-4o

alex buzunov
CodeX
Published in
5 min readJun 20, 2024

Image Fusion: A general term used in image processing where multiple images are combined into one. In the context of descriptions, it implies merging the descriptive content.

This article delves into my experience integrating a state-of-the-art vision-language model (VLM) with a rudimentary wxPython GUI to create a multi-modal image analysis tool. The interface facilitates the upload of multiple images and leverages the GPT-4o architecture to generate detailed, context-aware descriptions. This implementation showcases the potential of combining large language models (LLMs) with computer vision for advanced image understanding tasks.

The application’s GUI

While rudimentary, is bifurcated into two primary components:

  1. Query and Response Pane (Left): This section handles the natural language processing (NLP) aspect, allowing users to input prompts and view the model’s text generations. It serves as the interface for the language modeling component of the VLM.
  2. Image Viewer (Right): Featuring a tabbed interface, this pane manages the computer vision input. It supports the upload of multiple images, essentially preparing the visual embeddings for the model’s encoder.

Fusing 10 images

Creative spin:

Prompt: Give detailed creative mix of all image descriptions as one (do not split in two images). be as creative and weird as possible.

10 images demo

Listing objects

Model output:

**Image 1:**

- Broken glass shards
- Ukrainian flag
- Person wearing traditional clothing
- Floral headpiece
- Sunglasses

**Image ID:** 1

**Image 2:**

- Person in elaborate robe
- Cat
- Ribbons
- Crown or headdress
- Birds

**Image ID:** 2

**Image 3:**

- Hand holding an egg
- Decorated eggs
- Candles
- Flowers
- Lace tablecloth

**Image ID:** 3

**Image 4:**

- Person with blue and yellow balloons
- Floral headpiece
- Kittens

**Image ID:** 4

**Image 5:**

- Person in military uniform
- St. Basil’s Cathedral
- Rubble
- Rifle
- Fire

**Image ID:** 5

**Image 6:**

- Person with painted face
- Ukrainian flag
- Field

**Image ID:** 6

**Image 7:**

- Person with fairy wings
- Decorated eggs
- Sun
- Birds
- Butterflies

**Image ID:** 7

**Image 8:**

- Person with flowers on hair
- Vine

**Image ID:** 8

**Image 9:**

- Person holding a flag
- Fire
- Military uniform

**Image ID:** 9

**Image 10:**

- Ukrainian coat of arms (trident)
- Flag

Result is almost to the point (missed a pig in one of the images) and order in which I attached it to payload is different from resulting image order.

20 images

I know it’s going to cost me but let’s try 20 images.

Let’s list all the objects

Image 1: Objects — flowers (Image ID: 1)
Image 2: Objects — buildings, fire, rifle, skulls, birds (Image ID: 2)
Image 3: Objects — globe, military equipment (Image ID: 3)
Image 4: Objects — truck, missile launcher, shield (Image ID: 4)
Image 5: Objects — flowers, ornate dress, headdress (Image ID: 5)
Image 6: Objects — pig, building, dress (Image ID: 6)
Image 7: Objects — Statue of Liberty (Image ID: 7)
Image 8: Objects — rifle (Image ID: 8)
Image 9: Objects — rifle, military equipment, bra (Image ID: 9)
Image 10: Objects — chariot, horses, flowers, sun (Image ID: 10)
Image 11: Objects — ceramic rabbit figurine, ceramic sunflower figurine (Image ID: 11)
Image 12: Objects — rifle, military equipment (Image ID: 12)
Image 13: Objects — buildings, fire, rifle, skulls, birds (Image ID: 13)
Image 14: Objects — rifle (Image ID: 14)
Image 15: Objects — cello, bow (Image ID: 15)
Image 16: Objects — flower crown (Image ID: 16)
Image 17: Objects — flowers, dress with floral pattern

It found the pig this time. but it’s under #6 not #14

Also list of items is shorter — it skipped 3 images.

Testing 30

Ok, “appetite comes after …” you know that saying, let’s break the bank and test 30.

Got the answer. Format is different and it’s limited to 17 again.

1. Image ID 1: Sunflowers, sky, clouds
2. Image ID 2: Fire, shield, missile launcher, soldier, explosion, clouds
3. Image ID 3: Kiss booth, people, sign, lips, bucket
4. Image ID 4: Person, boxing gloves, boxing shorts, tattoos
5. Image ID 5: Person, traditional clothing, flowers, map, trident symbol
6. Image ID 6: Smoke, flag, silhouette, debris
7. Image ID 7: Person, dress, water, sunflower petals
8. Image ID 8: Waterfall, person, cloak, rocks
9. Image ID 9: Person, ice cream, dress, floral headpiece
10. Image ID 10: Soldier, building, smoke, fire, flag, cityscape, birds
11. Image ID 11: Person, shield, flowing fabric
12. Image ID 12: Skeleton, wings, skull, child, ruins
13. Image ID 13: Pigeons, buildings, street, cars, traffic lights, crosswalk
14. Image ID 14: Person, giant sunflower, clouds, wheat field
15. Image ID 15: Person, flag, camouflage bikini, weapon
16. Image ID 16: Flag, water splash
17. Image ID 17: Person, sunflowers, traditional clothing, shorts, earrings, sunset

I increased max_tokens to 1000 and it went up to 20

1. Image ID: 1
— Gun
— Bikini

2. Image ID: 2
— Dress

3. Image ID: 3
— Mug
— Book
— Dog

4. Image ID: 4
— Wheat field
— Sunflower

5. Image ID: 5
— Wheat field
— Sunflower

6. Image ID: 6
— Sunflowers

7. Image ID: 7
— Mirrors
— Fragmented pieces

8. Image ID: 8
— Military truck

9. Image ID: 9
— Lights
— Dress

10. Image ID: 10
— Ruins

11. Image ID: 11
— Boxing gloves

12. Image ID: 12
— Sunflowers
— Lightning

13. Image ID: 13
— Shield

14. Image ID: 14
— Pigeons

15. Image ID: 15
— Waterfall

16. Image ID: 16
— Skeleton
— Wings
— Baby doll

17. Image ID: 17
— Wire halo

18. Image ID: 18
— Flag

19. Image ID: 19
— Gun
— Bikini

20. Image ID: 20
— Booth

I tried different values for max_tokens but it list never went to 30 images.

It was 15, then 17 etc.

Conclusion

Gpt4-Vision serves as an excellent example of how powerful AI models can be integrated into desktop applications. It bridges the gap between cutting-edge AI technology and everyday users, providing a tool that can be valuable in various fields such as research, education, and creative industries.

While not revolutionary, Gpt4-Vision represents a practical step forward in making AI image analysis more accessible. As AI continues to advance, tools like this that simplify interaction with complex models will play a crucial role in broadening the adoption and understanding of AI capabilities.

For developers and enthusiasts interested in the intersection of AI and desktop application development, Gpt4-Vision serves as an inspiring example of what can be achieved with wxPython and modern AI models.

--

--