Simplifying Image Fusion with GPT-4o
Image Fusion: A general term used in image processing where multiple images are combined into one. In the context of descriptions, it implies merging the descriptive content.
This article delves into my experience integrating a state-of-the-art vision-language model (VLM) with a rudimentary wxPython GUI to create a multi-modal image analysis tool. The interface facilitates the upload of multiple images and leverages the GPT-4o architecture to generate detailed, context-aware descriptions. This implementation showcases the potential of combining large language models (LLMs) with computer vision for advanced image understanding tasks.
The application’s GUI
While rudimentary, is bifurcated into two primary components:
- Query and Response Pane (Left): This section handles the natural language processing (NLP) aspect, allowing users to input prompts and view the model’s text generations. It serves as the interface for the language modeling component of the VLM.
- Image Viewer (Right): Featuring a tabbed interface, this pane manages the computer vision input. It supports the upload of multiple images, essentially preparing the visual embeddings for the model’s encoder.
Fusing 10 images
Creative spin:
Prompt: Give detailed creative mix of all image descriptions as one (do not split in two images). be as creative and weird as possible.
Listing objects
Model output:
**Image 1:**
- Broken glass shards
- Ukrainian flag
- Person wearing traditional clothing
- Floral headpiece
- Sunglasses**Image ID:** 1
**Image 2:**
- Person in elaborate robe
- Cat
- Ribbons
- Crown or headdress
- Birds**Image ID:** 2
**Image 3:**
- Hand holding an egg
- Decorated eggs
- Candles
- Flowers
- Lace tablecloth**Image ID:** 3
**Image 4:**
- Person with blue and yellow balloons
- Floral headpiece
- Kittens**Image ID:** 4
**Image 5:**
- Person in military uniform
- St. Basil’s Cathedral
- Rubble
- Rifle
- Fire**Image ID:** 5
**Image 6:**
- Person with painted face
- Ukrainian flag
- Field**Image ID:** 6
**Image 7:**
- Person with fairy wings
- Decorated eggs
- Sun
- Birds
- Butterflies**Image ID:** 7
**Image 8:**
- Person with flowers on hair
- Vine**Image ID:** 8
**Image 9:**
- Person holding a flag
- Fire
- Military uniform**Image ID:** 9
**Image 10:**
- Ukrainian coat of arms (trident)
- Flag
Result is almost to the point (missed a pig in one of the images) and order in which I attached it to payload is different from resulting image order.
20 images
I know it’s going to cost me but let’s try 20 images.
Let’s list all the objects
Image 1: Objects — flowers (Image ID: 1)
Image 2: Objects — buildings, fire, rifle, skulls, birds (Image ID: 2)
Image 3: Objects — globe, military equipment (Image ID: 3)
Image 4: Objects — truck, missile launcher, shield (Image ID: 4)
Image 5: Objects — flowers, ornate dress, headdress (Image ID: 5)
Image 6: Objects — pig, building, dress (Image ID: 6)
Image 7: Objects — Statue of Liberty (Image ID: 7)
Image 8: Objects — rifle (Image ID: 8)
Image 9: Objects — rifle, military equipment, bra (Image ID: 9)
Image 10: Objects — chariot, horses, flowers, sun (Image ID: 10)
Image 11: Objects — ceramic rabbit figurine, ceramic sunflower figurine (Image ID: 11)
Image 12: Objects — rifle, military equipment (Image ID: 12)
Image 13: Objects — buildings, fire, rifle, skulls, birds (Image ID: 13)
Image 14: Objects — rifle (Image ID: 14)
Image 15: Objects — cello, bow (Image ID: 15)
Image 16: Objects — flower crown (Image ID: 16)
Image 17: Objects — flowers, dress with floral pattern
It found the pig this time. but it’s under #6 not #14
Also list of items is shorter — it skipped 3 images.
Testing 30
Ok, “appetite comes after …” you know that saying, let’s break the bank and test 30.
Got the answer. Format is different and it’s limited to 17 again.
1. Image ID 1: Sunflowers, sky, clouds
2. Image ID 2: Fire, shield, missile launcher, soldier, explosion, clouds
3. Image ID 3: Kiss booth, people, sign, lips, bucket
4. Image ID 4: Person, boxing gloves, boxing shorts, tattoos
5. Image ID 5: Person, traditional clothing, flowers, map, trident symbol
6. Image ID 6: Smoke, flag, silhouette, debris
7. Image ID 7: Person, dress, water, sunflower petals
8. Image ID 8: Waterfall, person, cloak, rocks
9. Image ID 9: Person, ice cream, dress, floral headpiece
10. Image ID 10: Soldier, building, smoke, fire, flag, cityscape, birds
11. Image ID 11: Person, shield, flowing fabric
12. Image ID 12: Skeleton, wings, skull, child, ruins
13. Image ID 13: Pigeons, buildings, street, cars, traffic lights, crosswalk
14. Image ID 14: Person, giant sunflower, clouds, wheat field
15. Image ID 15: Person, flag, camouflage bikini, weapon
16. Image ID 16: Flag, water splash
17. Image ID 17: Person, sunflowers, traditional clothing, shorts, earrings, sunset
I increased max_tokens to 1000 and it went up to 20
1. Image ID: 1
— Gun
— Bikini2. Image ID: 2
— Dress3. Image ID: 3
— Mug
— Book
— Dog4. Image ID: 4
— Wheat field
— Sunflower5. Image ID: 5
— Wheat field
— Sunflower6. Image ID: 6
— Sunflowers7. Image ID: 7
— Mirrors
— Fragmented pieces8. Image ID: 8
— Military truck9. Image ID: 9
— Lights
— Dress
10. Image ID: 10
— Ruins11. Image ID: 11
— Boxing gloves12. Image ID: 12
— Sunflowers
— Lightning13. Image ID: 13
— Shield14. Image ID: 14
— Pigeons15. Image ID: 15
— Waterfall16. Image ID: 16
— Skeleton
— Wings
— Baby doll17. Image ID: 17
— Wire halo18. Image ID: 18
— Flag19. Image ID: 19
— Gun
— Bikini
20. Image ID: 20
— Booth
I tried different values for max_tokens but it list never went to 30 images.
It was 15, then 17 etc.
Conclusion
Gpt4-Vision serves as an excellent example of how powerful AI models can be integrated into desktop applications. It bridges the gap between cutting-edge AI technology and everyday users, providing a tool that can be valuable in various fields such as research, education, and creative industries.
While not revolutionary, Gpt4-Vision represents a practical step forward in making AI image analysis more accessible. As AI continues to advance, tools like this that simplify interaction with complex models will play a crucial role in broadening the adoption and understanding of AI capabilities.
For developers and enthusiasts interested in the intersection of AI and desktop application development, Gpt4-Vision serves as an inspiring example of what can be achieved with wxPython and modern AI models.