(Part 2) Getting Hands-On with MiniCPM-V: Code, Setup, and Demos!
Hey there, AI adventurers! 🤖 This is the second part of our three-part series on MiniCPM-V, where we dive into the technical setup, installation, and usage of these amazing models. If you missed the first part, feel free to check it out on my profile for a high-level overview. Now, let’s get into the fun stuff!
Chat with the Demo on Gradio
Let’s start with something exciting: chatting with MiniCPM-V using Gradio. Gradio provides an easy-to-use interface for creating web-based demos, and since MiniCPM-V is a vision-language model, the demo takes an image plus a question and returns an answer. Here’s how you can set it up:
import gradio as gr
import torch
from transformers import AutoModel, AutoTokenizer
# Load model and tokenizer (trust_remote_code=True is required for MiniCPM-V)
model_name = "openbmb/MiniCPM-Llama3-V-2_5"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.float16).to("cuda").eval()
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Define the chat function: answer a question about an uploaded image
def chat(image, question):
    msgs = [{"role": "user", "content": question}]
    return model.chat(image=image, msgs=msgs, tokenizer=tokenizer)
# Create Gradio interface (image + text in, text out)
iface = gr.Interface(
    fn=chat,
    inputs=[gr.Image(type="pil"), gr.Textbox(label="Question")],
    outputs="text",
    title="Chat with MiniCPM-V",
)
iface.launch()
Just run the above code, and voilà! You can upload an image and chat with MiniCPM-V about it right from your browser.
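If you want to show the demo to someone outside your machine, Gradio can generate a temporary public link through its tunneling service; just pass share=True when launching:
iface.launch(share=True)  # creates a temporary public URL for the demo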
Setting Up MiniCPM-V: Installation Guide
Before we can start using MiniCPM-V, we need to install it. Here’s a step-by-step guide to get you started:
1. Install Dependencies: Make sure you have Python and pip installed. Then, install the required packages:
pip install torch transformers gradio pillow accelerate
2. Clone the Repository: Get the latest version of MiniCPM-V from GitHub:
git clone https://github.com/OpenBMB/MiniCPM-V.git
cd MiniCPM-V
3. Install the Requirements: From the project directory, install the repo’s Python requirements (a quick sanity check to verify the setup follows below):
pip install -r requirements.txt
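With the dependencies in place, a minimal sanity check confirms the environment is ready. It only prints library versions and whether a GPU is visible:
import torch, transformers, gradio
# Print library versions and GPU availability to confirm the setup
print("torch:", torch.__version__)
print("transformers:", transformers.__version__)
print("gradio:", gradio.__version__)
print("CUDA available:", torch.cuda.is_available())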
Let’s Talk Inference: Making Predictions with MiniCPM-V
Once you have MiniCPM-V installed, you can start making predictions. Because MiniCPM-V is a vision-language model, a basic inference example pairs an image with a question:
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
# Load the model and tokenizer (trust_remote_code=True is required for MiniCPM-V)
model_name = "openbmb/MiniCPM-Llama3-V-2_5"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, torch_dtype=torch.float16).to("cuda").eval()
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
# Load an image and define your question
image = Image.open("example.jpg").convert("RGB")
question = "Describe what you see in this image."
msgs = [{"role": "user", "content": question}]
# Generate and print the result
answer = model.chat(image=image, msgs=msgs, tokenizer=tokenizer)
print(answer)
Run this code to see MiniCPM-V in action, describing whatever image you point it at. Pretty neat, right? 😊
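If you want more varied answers, the chat call also accepts sampling arguments, following the usage shown on the official model card; a temperature around 0.7 is a reasonable starting point:
answer = model.chat(image=image, msgs=msgs, tokenizer=tokenizer, sampling=True, temperature=0.7)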
Discover the Model Zoo: Exploring Available Models
The MiniCPM-V family includes several model versions with different sizes and capabilities. Here’s how you can load them from the Hugging Face Hub:
from transformers import AutoModel, AutoTokenizer
# Hugging Face repo IDs for the MiniCPM-V family
model_names = ["openbmb/MiniCPM-V", "openbmb/MiniCPM-V-2", "openbmb/MiniCPM-Llama3-V-2_5"]
for model_name in model_names:
    model = AutoModel.from_pretrained(model_name, trust_remote_code=True)
    tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
    print(f"Loaded {model_name} successfully!")
Multi-Turn Conversations: Chat Like a Pro
Multi-turn conversations make interactions more dynamic. With MiniCPM-V, you simply keep appending to the message list so each new turn sees the full history (this reuses the model and tokenizer loaded in the inference example above):
from PIL import Image
# Keep the running conversation as a list of role/content messages
image = Image.open("example.jpg").convert("RGB")
chat_history = []
def chat(input_text):
    chat_history.append({"role": "user", "content": input_text})
    response = model.chat(image=image, msgs=chat_history, tokenizer=tokenizer)
    # Append the model's reply so the next turn sees the full conversation
    chat_history.append({"role": "assistant", "content": response})
    return response
print(chat("What do you see in this image?"))
print(chat("Can you tell me a joke about it?"))
Speed Up with Multi-GPU Inference
Leverage multiple GPUs to speed up inference. One straightforward option is to let transformers shard the model across all visible GPUs with device_map="auto" (this requires the accelerate package):
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
model_name = "openbmb/MiniCPM-Llama3-V-2_5"
# device_map="auto" spreads the model's layers across every visible GPU
model = AutoModel.from_pretrained(
    model_name, trust_remote_code=True, torch_dtype=torch.float16, device_map="auto"
).eval()
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
image = Image.open("example.jpg").convert("RGB")
msgs = [{"role": "user", "content": "What landmark is shown in this image?"}]
print(model.chat(image=image, msgs=msgs, tokenizer=tokenizer))
Running Inference on a Mac
Mac users, we’ve got you covered! Here’s how to run inference on your Mac:
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer
# Use MPS (Metal Performance Shaders) if available, otherwise fall back to the CPU
device = torch.device("mps") if torch.backends.mps.is_available() else torch.device("cpu")
dtype = torch.float16 if device.type == "mps" else torch.float32
model_name = "openbmb/MiniCPM-Llama3-V-2_5"
model = AutoModel.from_pretrained(model_name, trust_remote_code=True, torch_dtype=dtype).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
image = Image.open("example.jpg").convert("RGB")
question = "Explain what is shown in this image in simple terms."
msgs = [{"role": "user", "content": question}]
print(model.chat(image=image, msgs=msgs, tokenizer=tokenizer))
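One practical tip: if an operator isn’t implemented for MPS yet, PyTorch can fall back to the CPU for that single op. Set this environment variable before launching your script (run_minicpm.py is just a placeholder name here):
PYTORCH_ENABLE_MPS_FALLBACK=1 python run_minicpm.py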
Deploying on Mobile Phones
Deploying MiniCPM-V on mobile phones is exciting! The route is llama.cpp: MiniCPM-Llama3-V-2.5 support lives in OpenBMB’s fork, and you need GGUF conversions of both the language model and the vision projector (the mmproj file). Here’s a basic sketch; the exact binary and flag names depend on your llama.cpp version, so check the repo’s llama.cpp docs for details:
git clone https://github.com/OpenBMB/llama.cpp
cd llama.cpp
make
./minicpmv-cli -m models/MiniCPM-Llama3-V-2_5-Q4_K_M.gguf --mmproj models/mmproj-model-f16.gguf --image demo.jpg -p "What is in this image?"
Advanced Inference with vLLM
vLLM offers an advanced way to handle inference efficiently, with features like continuous batching and paged attention. Here’s an example of the basic API (MiniCPM-V support in vLLM has evolved over releases, and image inputs go through vLLM’s multimodal interface, so check the vLLM docs for your version):
from vllm import LLM, SamplingParams
# trust_remote_code is needed so vLLM can load the MiniCPM-V model definition
llm = LLM(model="openbmb/MiniCPM-Llama3-V-2_5", trust_remote_code=True)
prompt = "Write a short story about a dragon and a knight."
sampling_params = SamplingParams(top_p=0.9, max_tokens=512)
outputs = llm.generate([prompt], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
Ready for More?
If you’re excited about fine-tuning these models, check out the third part of this series, where we dive deep into fine-tuning MiniCPM-V with detailed code examples. 📘
Don’t forget to check out the first blog for a basic understanding of MiniCPM-V. If you have any questions or comments, feel free to drop them below. Thanks for reading! 😊
For more details about the MiniCPM-V models, visit the MiniCPM-V GitHub repository.