Develop an Application Using FastAPI, Groq, and the Vision Models Llama 3.2 11B and LLaVA v1.5 7B
Introduction
Meta has come out with its latest release of Llama models, Llama 3.2. The key takeaways are:
- It includes small and medium-sized text-only models, in 1B and 3B sizes, that are multilingual.
- These models are lightweight enough to fit on edge and mobile devices, and ship in both pretrained and instruction-tuned versions.
- The text-only models support a context length of 128K tokens and excel at summarization, instruction following, and rewriting tasks.
- The 11B and 90B sizes take both text and image inputs and output text. They are drop-in replacements for their corresponding text-model equivalents, while exceeding closed models such as Claude 3 Haiku on image understanding tasks.
- The 11B and 90B vision models support image reasoning use cases such as document-level understanding (including charts and graphs), image captioning, and directionally pinpointing objects in images based on natural language descriptions.
Here we will experiment with the Llama 3.2 11B vision model and compare its performance with the LLaVA model.
Technology Stack Used
- Groq Vision Models
- Cursor AI (used to develop the UI look and feel)
- FastAPI
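The code below pulls in a handful of Python packages. Assuming a fresh environment, they can be installed with the following (python-multipart is required by FastAPI for the file-upload and form fields, and jinja2 for the HTML template):
pip install fastapi uvicorn jinja2 python-multipart python-dotenv pillow requests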
Workflow Design Steps
1. User Interface Initialization:
- The web page loads, displaying the “ANALYZE IMAGE APPLICATION” header.
- Two main sections are shown: “Upload Image” and “Ask Question”.
- Two response sections for different models are initially hidden.
2. Image Upload:
- User clicks on the file input in the “Upload Image” section.
- User selects an image file from their device.
- The selected image is immediately displayed on the page.
3. Question Input:
- User types a question about the uploaded image in the textarea.
4. Query Submission:
- User clicks the “Submit Query” button.
- Frontend validates that both an image is uploaded and a question is entered.
- If validation fails, an error message is displayed.
5. Frontend Processing:
- The submit button is disabled and shows “Processing…”.
- A FormData object is created with the image file and question text.
- An asynchronous POST request is sent to the ‘/upload_and_query’ endpoint (a standalone sketch of this request appears right after this list).
6. Backend Processing:
- The server receives the POST request with the image and query.
- The image is read and encoded to base64.
- The image format is validated.
- Two separate API requests are made to the Groq API:
a) One for the “llama-3.2-11b-vision-preview” model.
b) Another for the “llava-v1.5-7b-4096-preview” model.
- Responses from both models are processed and logged.
7. Server Response:
- The server sends a JSON response back to the frontend with answers from both models.
8. Frontend Response Handling:
- The frontend receives the JSON response.
- If the response is successful:
- The answers are parsed as markdown and displayed in their respective sections.
- Both response containers are made visible.
- If there’s an error:
- An error message is displayed.
9. UI Update:
- The submit button is re-enabled and reverts to “Submit Query”.
- The responses are displayed in markdown format in their respective sections.
10. Error Handling:
- If any errors occur during the process, they are caught and displayed to the user.
- The error message is shown in a dedicated error container.
- Response containers are hidden if there’s an error.
11. Repeat:
- The user can upload a new image and/or ask a new question, repeating the process from step 2.
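Once the server described below is running, the endpoint can be exercised directly from Python. Here is a minimal sketch, assuming a local image named chart.jpg (both the file name and the question are placeholders):
import requests

# The endpoint and the 'image'/'query' field names match main.py below.
with open("chart.jpg", "rb") as f:
    response = requests.post(
        "http://localhost:8000/upload_and_query",
        files={"image": ("chart.jpg", f, "image/jpeg")},
        data={"query": "What is the contribution of fats in a balanced diet?"},
    )

# The JSON body has the shape {"llama": "<answer>", "llava": "<answer>"}
print(response.json())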
Code Implementation
script.js → all credits to Cursor AI
document.addEventListener('DOMContentLoaded', function() {
    const imageUpload = document.getElementById('image-upload');
    const displayImage = document.getElementById('display-image');
    const imageContainer = document.getElementById('image-container');
    const queryInput = document.getElementById('query-input');
    const submitQuery = document.getElementById('submit-query');
    const responseLlamaContainer = document.getElementById('response-container-llama');
    const responseLlamaText = document.getElementById('response-text-llama');
    const responseLlavaContainer = document.getElementById('response-container-llava');
    const responseLlavaText = document.getElementById('response-text-llava');
    const errorContainer = document.getElementById('error-container');
    const errorText = document.getElementById('error-text');

    // Image upload and display
    imageUpload.addEventListener('change', (event) => {
        const file = event.target.files[0];
        if (file) {
            const reader = new FileReader();
            reader.onload = (e) => {
                displayImage.src = e.target.result;
                imageContainer.classList.remove('hidden');
            };
            reader.readAsDataURL(file);
        }
    });

    // Submit query
    submitQuery.addEventListener('click', async () => {
        const image = imageUpload.files[0];
        const query = queryInput.value;
        if (!image || !query) {
            showError('Please upload an image and enter a query.');
            return;
        }
        const formData = new FormData();
        formData.append('image', image);
        formData.append('query', query);
        try {
            submitQuery.disabled = true;
            submitQuery.textContent = 'Processing...';
            submitQuery.classList.add('loading');
            const response = await fetch('/upload_and_query', {
                method: 'POST',
                body: formData
            });
            const result = await response.json();
            if (!response.ok) {
                throw new Error(result.detail || 'An error occurred while processing your request.');
            }
            responseLlamaText.innerHTML = marked.parse(result.llama);
            responseLlamaContainer.classList.remove('hidden');
            responseLlavaText.innerHTML = marked.parse(result.llava);
            responseLlavaContainer.classList.remove('hidden');
            errorContainer.classList.add('hidden');
        } catch (error) {
            console.error('Error:', error);
            showError(error.message);
        } finally {
            submitQuery.disabled = false;
            submitQuery.textContent = 'Submit Query';
            submitQuery.classList.remove('loading');
        }
    });

    function showError(message) {
        errorText.textContent = message;
        errorContainer.classList.remove('hidden');
        responseLlamaContainer.classList.add('hidden');
        responseLlavaContainer.classList.add('hidden');
    }
});
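A note on this script: marked.parse does not sanitize its output, and here the markdown comes from an external API, so a hardened version would pass the generated HTML through a sanitizer such as DOMPurify before assigning it to innerHTML.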
style.css → all credits to Cursor AI
body {
    font-family: Arial, sans-serif;
    line-height: 1.6;
    margin: 0;
    padding: 0;
    background-color: #121212;
    color: #e0e0e0;
}

.container {
    width: 90%;
    max-width: 1200px;
    margin: auto;
    padding: 20px;
}

h1, h2 {
    color: #bb86fc;
    text-align: center;
}

.main-content {
    display: flex;
    gap: 20px;
    flex-wrap: wrap;
    margin-bottom: 20px;
}

.image-section, .query-section {
    flex: 1;
    min-width: 300px;
    background-color: #1f1f1f;
    padding: 20px;
    border-radius: 5px;
    box-shadow: 0 2px 5px rgba(0, 0, 0, 0.3);
}

input[type="file"], textarea, button {
    width: 100%;
    padding: 10px;
    margin-bottom: 10px;
    background-color: #2f2f2f;
    border: 1px solid #3d3d3d;
    color: #e0e0e0;
}

button {
    background-color: #3700b3;
    color: white;
    border: none;
    cursor: pointer;
}

button:hover {
    background-color: #5600e8;
}

#display-image {
    max-width: 100%;
    height: auto;
    border: 1px solid #3d3d3d;
}

#response-container {
    background-color: #1f1f1f;
    padding: 20px;
    border-radius: 5px;
    box-shadow: 0 2px 5px rgba(0, 0, 0, 0.3);
}

#response-text {
    background-color: #2f2f2f;
    padding: 10px;
    border-radius: 5px;
    min-height: 50px;
    border: 1px solid #3d3d3d;
}

::-webkit-file-upload-button {
    background-color: #3700b3;
    color: white;
    border: none;
    padding: 10px;
    cursor: pointer;
}

::-webkit-file-upload-button:hover {
    background-color: #5600e8;
}

.hidden {
    display: none;
}
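One observation: the #response-container and #response-text rules target IDs that do not appear in the final markup (the page uses #response-container-llama, #response-text-llama, and their llava counterparts), so those two blocks are effectively unused; the visible styling of the response cards comes from the daisyUI/Tailwind classes in index.html.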
index.html → all credits to Cursor AI
<!DOCTYPE html>
<html lang="en" data-theme="dark">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>Analyze Image Application</title>
    <link href="https://cdn.jsdelivr.net/npm/daisyui@3.7.3/dist/full.css" rel="stylesheet" type="text/css" />
    <script src="https://cdn.tailwindcss.com"></script>
    <script src="https://kit.fontawesome.com/your-fontawesome-kit.js" crossorigin="anonymous"></script>
    <script src="https://cdn.jsdelivr.net/npm/marked/marked.min.js"></script>
    <style>
        body {
            background-image: linear-gradient(to bottom right, #1a202c, #2d3748);
        }
        .logo {
            font-size: 3rem;
            color: #a0aec0;
        }
        .header-text {
            font-size: 3rem;
            font-weight: bold;
            color: #bb86fc;
            text-shadow: 2px 2px 4px rgba(0,0,0,0.5);
        }
        .response-header {
            font-size: 1.5rem;
            font-weight: bold;
            color: #03dac6;
        }
        .markdown-content {
            font-size: 1rem;
            line-height: 1.6;
        }
        .markdown-content h1, .markdown-content h2, .markdown-content h3 {
            color: #bb86fc;
            margin-top: 1em;
            margin-bottom: 0.5em;
        }
        .markdown-content p {
            margin-bottom: 1em;
        }
        .markdown-content ul, .markdown-content ol {
            margin-left: 1.5em;
            margin-bottom: 1em;
        }
        .markdown-content code {
            background-color: #2d3748;
            padding: 0.2em 0.4em;
            border-radius: 3px;
        }
    </style>
</head>
<body class="min-h-screen">
    <div class="container mx-auto p-4">
        <div class="flex items-center justify-center mb-12">
            <i class="fas fa-camera logo mr-6"></i>
            <h1 class="header-text text-center">ANALYZE IMAGE APPLICATION</h1>
        </div>
        <div class="grid grid-cols-1 lg:grid-cols-2 gap-8 mb-8">
            <div class="card bg-base-200 shadow-xl">
                <div class="card-body">
                    <h2 class="card-title text-secondary">Upload Image</h2>
                    <input type="file" id="image-upload" accept="image/*" class="file-input file-input-bordered file-input-primary w-full" />
                    <div id="image-container" class="hidden mt-4">
                        <img id="display-image" src="" alt="Uploaded image" class="w-full rounded-lg shadow-lg" />
                    </div>
                </div>
            </div>
            <div class="card bg-base-200 shadow-xl">
                <div class="card-body">
                    <h2 class="card-title text-secondary">Ask Question</h2>
                    <textarea id="query-input" rows="4" placeholder="Enter your question about the image" class="textarea textarea-bordered textarea-primary w-full mb-4"></textarea>
                    <button id="submit-query" class="btn btn-primary mb-4">Submit Query</button>
                </div>
            </div>
        </div>
        <div class="grid grid-cols-1 lg:grid-cols-2 gap-8">
            <div class="card bg-base-200 shadow-xl">
                <div class="card-body">
                    <h2 class="card-title text-secondary response-header">llama-3.2-11b-vision-preview Response</h2>
                    <div id="response-container-llama" class="hidden">
                        <div id="response-text-llama" class="text-base-content bg-base-300 p-4 rounded-lg markdown-content"></div>
                    </div>
                </div>
            </div>
            <div class="card bg-base-200 shadow-xl">
                <div class="card-body">
                    <h2 class="card-title text-secondary response-header">llava-v1.5-7b-4096-preview Response</h2>
                    <div id="response-container-llava" class="hidden">
                        <div id="response-text-llava" class="text-base-content bg-base-300 p-4 rounded-lg markdown-content"></div>
                    </div>
                </div>
            </div>
        </div>
        <div id="error-container" class="mt-8 hidden">
            <div class="alert alert-error">
                <svg xmlns="http://www.w3.org/2000/svg" class="stroke-current shrink-0 h-6 w-6" fill="none" viewBox="0 0 24 24"><path stroke-linecap="round" stroke-linejoin="round" stroke-width="2" d="M10 14l2-2m0 0l2-2m-2 2l-2-2m2 2l2 2m7-2a9 9 0 11-18 0 9 9 0 0118 0z" /></svg>
                <span id="error-text"></span>
            </div>
        </div>
    </div>
    <script src="/static/script.js"></script>
</body>
</html>
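Note that the FontAwesome script tag above contains a placeholder kit URL (your-fontawesome-kit.js); substitute your own kit ID, or drop the tag along with the camera icon, which is the only element on the page that depends on it.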
Create a .env file to set up the Groq API key.
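The file needs a single line with your key (the value shown is a placeholder):
GROQ_API_KEY=your_groq_api_key_here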
main processing script → main.py
from fastapi import FastAPI, File, UploadFile, Form, Request, HTTPException
from fastapi.templating import Jinja2Templates
from fastapi.staticfiles import StaticFiles
from fastapi.responses import HTMLResponse, JSONResponse
import base64
import requests
import io
from PIL import Image
from dotenv import load_dotenv
import os
import logging

# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Load environment variables from .env file
load_dotenv()

app = FastAPI()
app.mount("/static", StaticFiles(directory="static"), name="static")
templates = Jinja2Templates(directory="templates")

GROQ_API_URL = "https://api.groq.com/openai/v1/chat/completions"
GROQ_API_KEY = os.getenv("GROQ_API_KEY")
if not GROQ_API_KEY:
    raise ValueError("GROQ_API_KEY is not set in the .env file")

@app.get("/", response_class=HTMLResponse)
async def read_root(request: Request):
    return templates.TemplateResponse("index.html", {"request": request})

@app.post("/upload_and_query")
async def upload_and_query(image: UploadFile = File(...), query: str = Form(...)):
    try:
        # Read and encode the image
        image_content = await image.read()
        if not image_content:
            raise HTTPException(status_code=400, detail="Empty file")
        encoded_image = base64.b64encode(image_content).decode("utf-8")

        # Validate image format
        try:
            img = Image.open(io.BytesIO(image_content))
            img.verify()
        except Exception as e:
            logger.error(f"Invalid image format: {str(e)}")
            raise HTTPException(status_code=400, detail=f"Invalid image format: {str(e)}")

        # Prepare the messages for the Groq API
        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": query},
                    {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{encoded_image}"}}
                ]
            }
        ]

        # Function to make an API request for a given model
        def make_api_request(model):
            response = requests.post(
                GROQ_API_URL,
                json={
                    "model": model,
                    "messages": messages,
                    "max_tokens": 1000
                },
                headers={
                    "Authorization": f"Bearer {GROQ_API_KEY}",
                    "Content-Type": "application/json"
                },
                timeout=30
            )
            return response

        # Make requests to both models
        llama_response = make_api_request("llama-3.2-11b-vision-preview")
        llava_response = make_api_request("llava-v1.5-7b-4096-preview")

        # Process responses
        responses = {}
        for model, response in [("llama", llama_response), ("llava", llava_response)]:
            if response.status_code == 200:
                result = response.json()
                answer = result["choices"][0]["message"]["content"]
                logger.info(f"Processed response from {model} API: {answer[:100]}...")
                responses[model] = answer
            else:
                logger.error(f"Error from {model} API: {response.status_code} - {response.text}")
                responses[model] = f"Error from {model} API: {response.status_code}"

        return JSONResponse(status_code=200, content=responses)

    except HTTPException as he:
        logger.error(f"HTTP Exception: {str(he)}")
        raise he
    except Exception as e:
        logger.error(f"An unexpected error occurred: {str(e)}")
        raise HTTPException(status_code=500, detail=f"An unexpected error occurred: {str(e)}")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
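One design caveat: make_api_request calls the blocking requests library from inside an async endpoint, and the two model calls run one after the other. Here is a minimal sketch of issuing both calls concurrently instead, assuming httpx is added as a dependency and that GROQ_API_URL and GROQ_API_KEY are the same module-level values defined above:
import asyncio
import httpx

# Sketch only: fire both Groq requests in parallel instead of sequentially.
async def query_both_models(messages):
    headers = {"Authorization": f"Bearer {GROQ_API_KEY}", "Content-Type": "application/json"}
    async with httpx.AsyncClient(timeout=30) as client:
        llama_task = client.post(
            GROQ_API_URL,
            json={"model": "llama-3.2-11b-vision-preview", "messages": messages, "max_tokens": 1000},
            headers=headers,
        )
        llava_task = client.post(
            GROQ_API_URL,
            json={"model": "llava-v1.5-7b-4096-preview", "messages": messages, "max_tokens": 1000},
            headers=headers,
        )
        # Both requests are awaited together, so total latency is roughly the slower of the two.
        return await asyncio.gather(llama_task, llava_task)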
Run the webapp → python main.py
INFO: Started server process [33732]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
Go to http://localhost:8000 → the page shown below loads.
Interact with Charts
Question: What is the contribution of fats in a balanced diet?
llama-3.2-11b-vision-preview Response
The pie chart illustrates the recommended balanced diet, which is divided into four distinct sections:
- Fruits and vegetables (40%)
- Protein (25%)
- Good carbs (25%)
- Fats (10%)
This breakdown is essential for maintaining a healthy lifestyle. Incorporating a sufficient amount of each category helps ensure that the body receives the necessary nutrients.
Answer: 10%
llava-v1.5-7b-4096-preview Response
A balanced diet should include appropriate intake of different food groups, such as fruits, vegetables, grains, proteins, and fats. It is essential to consume fats in moderation, as not only are they vital for various bodily needs, but it is also a source of energy. Fat sources, like olive oil and nuts or seeds, help maintain proper functioning and overall health. They also contribute to the nutritional value of meals when combined with other sources of essential nutrients. Overconsumption of saturated fats, however, can lead to weight gain, liability, and increase the risk of heart diseases, hence the importance of moderation in consuming fats.
Interact with an Image
Question: Write a prompt to generate the uploaded image.
llama-3.2-11b-vision-preview Response
Here is a concise response:
The image depicts a woman wearing a futuristic-style visor that covers her forehead, with a glowing pink and blue light emanating from it. She has dark hair tied back in a ponytail and is facing slightly to the right, wearing a sleeveless top with thin straps and small hoop earrings. The background is blurred, but it appears to feature a cityscape with vibrant blue and pink lights shining through the windows of tall buildings and light bulbs lining the window frame.
llava-v1.5-7b-4096-preview Response
Generate a glamour shot of a woman in the light of an eyelash curler.
Question: Specify measures to make the image aesthetically pleasing.
llama-3.2-11b-vision-preview Response
Here is a rephrased version of the caption:
To create an aesthetically pleasing image, consider the following measures:
- Balance and Composition: Ensure the face is placed centrally and harmoniously within the frame, using elements to guide the eye and create a sense of balance.
- Color Harmony: Employ a consistent and pleasing color scheme that enhances the image’s aesthetic appeal. This can include using a limited palette to create a cohesive and cohesive appearance.
- Lighting: Use soft, flattering lighting to enhance the model’s features and create a warm, inviting atmosphere. Consider using diffuse lighting or softbox setups for a more natural look.
- Expression and Emotion: Capture the model’s expression in a way that conveys a sense of approachability, confidence, or happiness. A genuine smile or gentle gaze can add depth and emotional resonance to the image.
- Post-processing: Carefully edit the image to enhance its aesthetic appeal. This may involve adjusting exposure, contrast, and saturation to create a cohesive and visually appealing look.
- Focus and Depth of Field: Ensure the model’s face is in sharp focus, while the background is blurred to create a sense of depth and dimensionality.
- Texture and Pattern: Incorporate subtle texture or pattern to add visual interest and depth to the image. This can include the model’s clothing, accessories, or the background environment.
By considering these measures, you can create an aesthetically pleasing image that draws the viewer’s attention and creates a lasting impression.
llava-v1.5-7b-4096-preview Response
To make the image aesthetically pleasing, adjust the lighting conditions, composition, or camera settings. By using a high-resolution camera with a good focus, employing an interesting background, or increasing the saturation, contrast, and brightness, the photograph can create a visually appealing image. Experimenting with angles, framing, and post-processing techniques can also contribute to improving the visual quality of the image. Additionally, the right lighting conditions, such as using an enclosed hotspot of light or natural light, can emphasize the woman’s facial features and outfit. To make the whole scene more vivid and engaging, including her head band, do appropriate post-processing, and keep an eye on the outfit’s design as well. This combination will create a look that is memorable and eye-catching.
Question: Write a prompt to enhance the uploaded image.
llama-3.2-11b-vision-preview Response
Here is a rephrased caption that meets the requirements:
This image showcase photo has been enhanced to reflect a sense of wonder and joy. Through subtle brightness and saturation adjustments, the vivid red hood and furry trim have been infused with a warm, inviting glow, evoking a feeling of comfort and luxury. The blurred foreground, with its subtle changes in contrast and texture, adds depth and dimensionality to the image, drawing the viewer’s eye to the enchanting creature.
The combination of these enhancements has resulted in an overall effect that is both captivating and emotionally engaging. The image now exudes a sense of enchantment and wonder, inviting the viewer to step into its magical world and experience the joy and comfort that it embodies.
llava-v1.5-7b-4096-preview Response
To enhance the image, you can use a software tool like Adobe Photoshop to reduce the amount of noise present in the background and adjust the colors to create a richer, more vivid experience. The software offers options for various filter effects that can be applied to the image to enhance the appearance of the lights and the woman’s skin tone, while maintaining the sense of a neon-lit environment. Additionally, you can adjust the overall composition of the image, such as the perspective and framing, for a more aesthetically pleasing and clearer scene.
Conclusion
Comparing the responses from both vision models, the Llama 3.2 11B model clearly outperforms the LLaVA model.