Multimodal Support#
This document describes how to use ms-agent for multimodal conversations, including image understanding and analysis capabilities.
Overview#
ms-agent supports multimodal models such as Alibaba Cloud’s qwen3.5-plus. Multimodal models can:
Analyze image content
Recognize objects, scenes, and text in images
Engage in conversations based on image content
Prerequisites#
1. Install Dependencies#
Ensure the required packages are installed:
pip install openai
2. Configure API Key#
(Using qwen3.5-plus as an example) Obtain a DashScope API Key and set the environment variable:
export DASHSCOPE_API_KEY='your-dashscope-api-key'
Or set dashscope_api_key directly in the configuration file.
Configure Multimodal Models#
Multimodal functionality depends on two factors:
Choose a model that supports multimodal input (e.g.
qwen3.5-plus)Use the correct message format (containing
image_urlblocks)
You can dynamically modify the model configuration in code on top of an existing config:
from ms_agent.config import Config
from ms_agent import LLMAgent
import os
# Use an existing configuration file (e.g. ms_agent/agent/agent.yaml)
config = Config.from_task('ms_agent/agent/agent.yaml')
# Override configuration for multimodal model
config.llm.model = 'qwen3.5-plus'
config.llm.service = 'dashscope'
config.llm.dashscope_api_key = os.environ.get('DASHSCOPE_API_KEY', '')
config.llm.modelscope_base_url = 'https://dashscope.aliyuncs.com/compatible-mode/v1'
# Create LLMAgent
agent = LLMAgent(config=config)
Using LLMAgent for Multimodal Conversations#
Using LLMAgent for multimodal conversations is recommended, as it provides more complete features including memory management, tool calling, and callback support.
Basic Usage#
import asyncio
import os
from ms_agent import LLMAgent
from ms_agent.config import Config
from ms_agent.llm.utils import Message
async def multimodal_chat():
# Create configuration
config = Config.from_task('ms_agent/agent/agent.yaml')
config.llm.model = 'qwen3.5-plus'
config.llm.service = 'dashscope'
config.llm.dashscope_api_key = os.environ.get('DASHSCOPE_API_KEY', '')
config.llm.modelscope_base_url = 'https://dashscope.aliyuncs.com/compatible-mode/v1'
# Create LLMAgent
agent = LLMAgent(config=config)
# Build multimodal message
multimodal_content = [
{"type": "text", "text": "Please describe this image."},
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
]
# Call the agent
response = await agent.run(messages=[Message(role="user", content=multimodal_content)])
print(response[-1].content)
asyncio.run(multimodal_chat())
Non-Stream Mode#
# Disable stream in configuration
config.generation_config.stream = False
agent = LLMAgent(config=config)
multimodal_content = [
{"type": "text", "text": "Please describe this image."},
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
]
# Non-stream mode: returns complete response directly
response = await agent.run(messages=[Message(role="user", content=multimodal_content)])
print(f"[Response] {response[-1].content}")
print(f"[Token Usage] Input: {response[-1].prompt_tokens}, Output: {response[-1].completion_tokens}")
Stream Mode#
# Enable stream in configuration
config.generation_config.stream = True
agent = LLMAgent(config=config)
multimodal_content = [
{"type": "text", "text": "Please describe this image."},
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
]
# Stream mode: returns a generator
generator = await agent.run(
messages=[Message(role="user", content=multimodal_content)],
stream=True
)
full_response = ""
async for response_chunk in generator:
if response_chunk and len(response_chunk) > 0:
last_msg = response_chunk[-1]
if last_msg.content:
# Stream output of new content
print(last_msg.content[len(full_response):], end='', flush=True)
full_response = last_msg.content
print(f"\n[Full Response] {full_response}")
Multi-Turn Conversations#
LLMAgent supports multi-turn conversations, allowing you to mix images and text:
agent = LLMAgent(config=config, tag="multimodal_conversation")
# Turn 1: Send an image
multimodal_content = [
{"type": "text", "text": "How many people are in this image?"},
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
]
messages = [Message(role="user", content=multimodal_content)]
response = await agent.run(messages=messages)
print(f"[Turn 1 Response] {response[-1].content}")
# Turn 2: Follow-up question (text only, preserving context)
messages = response # Use previous response as context
messages.append(Message(role="user", content="What are they doing?"))
response = await agent.run(messages=messages)
print(f"[Turn 2 Response] {response[-1].content}")
Multimodal Message Format#
ms-agent uses the OpenAI-compatible multimodal message format. Images can be provided in three ways:
1. Image URL#
from ms_agent.llm.utils import Message
multimodal_content = [
{"type": "text", "text": "Please describe this image."},
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
]
messages = [
Message(role="user", content=multimodal_content)
]
response = llm.generate(messages=messages)
2. Base64 Encoding#
import base64
# Read and encode the image
with open('image.jpg', 'rb') as f:
image_data = base64.b64encode(f.read()).decode('utf-8')
multimodal_content = [
{"type": "text", "text": "What is this?"},
{
"type": "image_url",
"image_url": {
"url": f"data:image/jpeg;base64,{image_data}"
}
}
]
messages = [Message(role="user", content=multimodal_content)]
response = llm.generate(messages=messages)
3. Local File Path#
import base64
import os
image_path = 'path/to/image.png'
# Get MIME type
ext = os.path.splitext(image_path)[1].lower()
mime_type = {
'.png': 'image/png',
'.jpg': 'image/jpeg',
'.jpeg': 'image/jpeg',
'.gif': 'image/gif',
'.webp': 'image/webp'
}.get(ext, 'image/png')
# Read and encode
with open(image_path, 'rb') as f:
image_data = base64.b64encode(f.read()).decode('utf-8')
multimodal_content = [
{"type": "text", "text": "Describe this image."},
{
"type": "image_url",
"image_url": {
"url": f"data:{mime_type};base64,{image_data}"
}
}
]
messages = [Message(role="user", content=multimodal_content)]
response = llm.generate(messages=messages)
Running Examples#
Running the Agent Example#
# Run the complete test suite (including stream and non-stream modes)
python examples/agent/test_llm_agent_multimodal.py
FAQ#
Q: Are there image size limits?#
A: Yes, different models have different limits:
qwen3.5-plus: Recommended image size under 4MB
Recommended resolution not exceeding 2048x2048
Q: What image formats are supported?#
A: Commonly supported formats:
JPEG / JPG
PNG
GIF
WebP
Q: Can I send multiple images at once?#
A: Yes, you can add multiple image_url blocks in a single message:
multimodal_content = [
{"type": "text", "text": "Compare these two images."},
{"type": "image_url", "image_url": {"url": "https://example.com/img1.jpg"}},
{"type": "image_url", "image_url": {"url": "https://example.com/img2.jpg"}}
]
Q: Is streaming output supported?#
A: Yes, multimodal conversations support streaming output. Set stream: true:
config.generation_config.stream = True
response = llm.generate(messages=messages, stream=True)