cpu->cuda

This commit is contained in:
2026-02-26 08:15:21 +00:00
parent bac7838dcd
commit 0ad0ea5c10
7 changed files with 1445 additions and 32 deletions

@ -0,0 +1,414 @@
# AICAS 2026 - Efficient VLM Inference and Optimization Track for AI Chips
## Table of Contents
- [Overview](#overview)
- [Code Structure](#code-structure)
- [Core Files](#core-files)
- [Quick Start](#quick-start)
- [Evaluation Metrics](#evaluation-metrics)
- [Competition Rules](#competition-rules)
- [Important Notes](#important-notes)
- [Submission Guide](#submission-guide)
## Overview
This competition focuses on optimizing the inference performance of Vision-Language Models (VLMs). Participants modify the `VLMModel` class in `evaluation_wrapper.py` to improve Time To First Token (TTFT) and throughput while preserving accuracy.
## Code Structure
```
AICASGC/
├── benchmark.py              # Benchmark script
├── evaluation_wrapper.py     # Model wrapper (participants implement optimizations here)
├── requirements.txt          # Python dependencies
├── data/                     # Validation dataset
│   ├── data-*.arrow          # Dataset files
│   ├── dataset_info.json     # Dataset metadata
│   └── state.json            # Dataset state
├── Qwen3-VL-2B-Instruct/     # Model weights directory (participants download separately)
└── README.md / README_CN.md  # Documentation
```
## Core Files
- **`benchmark.py`** - Self-test benchmark script (⚠️ **do not modify**)
- **`evaluation_wrapper.py`** - Model wrapper; participants implement optimizations here
- **`Qwen3-VL-2B-Instruct/`** - Competition model weights (download separately; see "Quick Start")
- **`data/`** - Validation dataset
- **`requirements.txt`** - Python dependencies
## Quick Start
### 0. Download the Model (First Use)
The model files are large and must be downloaded separately. Create the model directory, then download:
```bash
# Create the model directory
mkdir -p Qwen3-VL-2B-Instruct
# Install huggingface_hub (if not already installed)
pip install -U huggingface_hub
# Optional: set a mirror endpoint (recommended for users in mainland China)
export HF_ENDPOINT=https://hf-mirror.com
# Download the model into the target directory
huggingface-cli download \
  --resume-download \
  Qwen/Qwen3-VL-2B-Instruct \
  --local-dir ./Qwen3-VL-2B-Instruct \
  --local-dir-use-symlinks False
```
**Notes:**
- The model is roughly 4-5 GB; the download may take a while.
- If the download is interrupted, rerun the command; it resumes automatically (`--resume-download`).
- After the download completes, `Qwen3-VL-2B-Instruct/` will contain all model files.
- Make sure you have enough disk space (at least 5 GB).
### 1. Install Dependencies
```bash
pip install -r requirements.txt
```
### 2. Run the Baseline Test
```bash
python benchmark.py \
  --model-path ./Qwen3-VL-2B-Instruct \
  --dataset-path ./data \
  --output result.json \
  --num-samples 100
```
### 3. Implement Your Optimizations
Edit the `VLMModel` class in `evaluation_wrapper.py`. The optimizations follow a **modular design**: each optimization direction has its own method.
#### 3.1 Explore the Model Structure (Optional)
Before optimizing, you can explore the model structure to understand the optimization targets:
```python
class VLMModel:
    def __init__(self, model_path: str, device: str = "cuda:0"):
        # ... load model ...
        # Optional: explore the model structure
        self._explore_model_structure()  # prints model structure information
```
#### 3.2 Enable Optimization Methods
In `__init__`, enable or disable individual optimizations by commenting/uncommenting them:
```python
class VLMModel:
    def __init__(self, model_path: str, device: str = "cuda:0"):
        # ... load model ...
        # ================================================================
        # Participant optimization area - enable/disable optimizations
        # ================================================================
        # 1. Vision Encoder acceleration (high-resolution image processing)
        # self._optimize_vision_encoder()
        # 2. KV Cache management (memory fragmentation during generation)
        # self._optimize_kv_cache()
        # 3. Cross-modal Connector optimization
        # self._optimize_cross_modal_connector()
        # 4. Flash Attention optimization
        # self._enable_flash_attention()
        # 5. Quantization
        # self._apply_quantization()
```
#### 3.3 Implement the Optimization Code
Implement your logic inside each optimization method. For example, to optimize the Vision Encoder:
```python
def _optimize_vision_encoder(self):
    """Find this method in evaluation_wrapper.py and implement your optimization."""
    # Example: replace the attention operator
    # from your_optimization import optimized_attention
    # if hasattr(self._model, 'vision_model'):
    #     for layer in self._model.vision_model.encoder.layers:
    #         layer.self_attn.forward = optimized_attention
    # TODO: implement your Vision Encoder optimization
    pass
```
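The monkey-patch pattern above can be illustrated with a self-contained toy (a stand-in class, not Qwen3-VL; `VisionAttention` and `optimized_forward` are hypothetical names):

```python
class VisionAttention:
    """Stand-in for a model layer whose forward we want to replace."""
    def forward(self, x):
        return [v * 2 for v in x]

def optimized_forward(x):
    # Hypothetical faster implementation; it must stay numerically
    # equivalent to the original, or accuracy will suffer.
    return [v + v for v in x]

layer = VisionAttention()
baseline = layer.forward([1, 2, 3])

# Monkey patch: the instance attribute shadows the class method.
layer.forward = optimized_forward

# Verify equivalence before trusting the patched layer.
assert layer.forward([1, 2, 3]) == baseline
print(layer.forward([1, 2, 3]))  # [2, 4, 6]
```

The same idea applies to real layers: keep a reference to the original `forward`, swap in the optimized one, and verify outputs match on a few samples before running the full benchmark.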
### 4. Test Your Optimized Model
```bash
python benchmark.py \
  --model-path ./Qwen3-VL-2B-Instruct \
  --dataset-path ./data \
  --output result_optimized.json \
  --num-samples 100
```
### 5. Generate Full Results for Submission
```bash
python benchmark.py \
  --model-path ./Qwen3-VL-2B-Instruct \
  --dataset-path ./data \
  --output result.json \
  --num-samples 5000
```
## Evaluation Metrics
Final score formula:
```
Final Score = 0.4 × Accuracy + 0.3 × TTFT Improvement + 0.3 × Throughput Improvement
```
### Metric Details
- **TTFT (Time To First Token)**: time from input preparation to the first generated token (milliseconds)
  - Covers image encoding, text encoding, cross-modal interaction, the prefill phase, and first-token generation
  - Baseline: ~80 ms
  - Improvement = (Baseline - your TTFT) / Baseline
- **Throughput**: end-to-end token generation rate (tokens/second)
  - Baseline: ~55 tokens/sec
  - Improvement = (your throughput - Baseline) / Baseline
- **Accuracy**: VQA accuracy on the validation set (5000 samples)
  - Soft matching against multiple reference answers is supported
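The scoring formula above can be sketched in a few lines of Python (an illustrative helper, not the official scorer; the baseline values come from the metric descriptions):

```python
def final_score(accuracy, ttft_ms, throughput,
                baseline_ttft_ms=80.0, baseline_tps=55.0):
    """Illustrative re-implementation of the published scoring formula."""
    ttft_gain = (baseline_ttft_ms - ttft_ms) / baseline_ttft_ms
    tps_gain = (throughput - baseline_tps) / baseline_tps
    return 0.4 * accuracy + 0.3 * ttft_gain + 0.3 * tps_gain

# Matching the baselines exactly leaves only the accuracy term (≈ 0.24 here)
print(final_score(accuracy=0.60, ttft_ms=80.0, throughput=55.0))
```

Note that halving TTFT and doubling throughput each contribute as much to the score as a large accuracy gain, so speed regressions are expensive.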
## Competition Rules
### Key Rules
1. **Do not modify `benchmark.py`**
   - The benchmark script is for self-testing only
   - Final evaluation uses an independent official benchmark system
   - Modifying this file may make local results diverge from the final evaluation
2. **Only modify `evaluation_wrapper.py`**
3. **Keep the required attributes**
   - The `VLMModel` class must expose the `processor`, `model`, and `device` attributes
   - The benchmark uses these attributes to access the model and processor
   - The `generate()` method is optional and mainly for debugging
4. **Prohibited behavior**
   - No hardcoded answers
   - No modifications to the dataset
   - No external APIs or services
   - All optimizations must be local and self-contained
### Optimization Directions
The following directions are encouraged:
- Operator replacement and kernel optimization: rewrite or replace standard operators (e.g. Attention, LayerNorm, Conv2d) with Triton, CUDA C++, etc.
- Memory and cache optimization: optimize KV cache memory layout, reduce fragmentation, and improve GPU memory access patterns
- Compilation and graph optimization: use torch.compile for computation-graph optimization and custom kernel scheduling
- Attention optimization: Flash Attention, memory-efficient attention, sparse attention
- Generation optimization: decoding strategy, cache management, and generation configuration
**Not allowed:**
- External services: calling external APIs, cloud services, or anything that requires network access
- Data or answer cheating: training on test data, precomputing answers, or hardcoding outputs
- Model replacement or tampering: the focus should be operator-level optimization; do not train the model on additional datasets, change its architecture, or edit weight values directly
- Overfitting hacks: conditional branches or special handling targeted at specific evaluation samples
- Black-box tool stacking: submissions that only tweak configuration files without substantive code contributions are not recognized
- Environment manipulation: interfering with fair evaluation by modifying the system environment, locking GPU frequencies, etc.
## Important Notes
### Sample Selection
- The provided `benchmark.py` uses a **fixed order** (the first N samples, starting from index 0)
- Running with `--num-samples 100` evaluates samples 0-99
- This keeps local self-tests reproducible
- **Note**: the official evaluation system used by the competition committee may apply a different sampling strategy (including random sampling) for final validation
### Hardware Information
The benchmark automatically records detailed hardware information:
- Python, PyTorch, and CUDA versions
- GPU name, memory, and compute capability
- CPU model, core count, and frequency
- System information (OS, kernel, architecture)
- PPU information (if available)
This information is stored in the `system_info` field of `result.json` for statistical analysis.
### Performance Measurement
- **Warmup**: 10 samples warm up the GPU before actual measurement
- **TTFT measurement**: time from input preparation to the first token (including all preprocessing)
- **Throughput measurement**: end-to-end time to generate 128 tokens
- **State isolation**: the GPU cache is cleared between measurements to keep them fair
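The two timing quantities can be sketched with a stand-in generator (the real benchmark times `model.generate`; `timed_generate` below is hypothetical and sleeps instead of decoding):

```python
import time

def timed_generate(num_tokens):
    """Stand-in for model.generate(): sleeps instead of decoding tokens."""
    start = time.perf_counter()
    first_token_time = None
    for i in range(num_tokens):
        time.sleep(0.001)  # pretend to decode one token
        if i == 0:
            first_token_time = time.perf_counter()
    total = time.perf_counter() - start
    # TTFT is start -> first token; throughput uses the full end-to-end time
    return first_token_time - start, total

ttft_s, total_s = timed_generate(num_tokens=128)
throughput = 128 / total_s
print(f"TTFT: {ttft_s * 1e3:.1f} ms, throughput: {throughput:.0f} tokens/sec")
```

The key point it illustrates: TTFT captures everything up to the first token (so it is dominated by preprocessing and prefill), while throughput averages over all 128 generated tokens (so it is dominated by the decode loop).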
### Random Seed
- The `--random-seed` argument only affects PyTorch's random number generators
- It does **not** affect the sample selection order (which is always fixed)
- It exists to make any randomness in model inference reproducible
### Output Format
The `result.json` file contains:
```json
{
"system_info": {
"timestamp": "...",
"python_version": "...",
"torch_version": "...",
"cuda_version": "...",
"gpu_name": "...",
...
},
"performance": {
"avg_ttft_ms": 90.55,
"avg_throughput_tokens_per_sec": 57.77
},
"answers": [
{
"question_id": 34602,
"prediction": "你的答案文本"
},
...
]
}
```
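For a local sanity check before submitting, the required fields can be validated with a short stdlib-only script (field names are taken from the schema above; the inline `result` dict stands in for a loaded file):

```python
import json

# In practice: result = json.load(open("result.json"))
result = {
    "system_info": {"timestamp": "...", "gpu_name": "..."},
    "performance": {"avg_ttft_ms": 90.55, "avg_throughput_tokens_per_sec": 57.77},
    "answers": [{"question_id": 34602, "prediction": "example answer"}],
}

def validate(result):
    """Check the three top-level sections and per-answer fields."""
    assert {"system_info", "performance", "answers"} <= result.keys()
    perf = result["performance"]
    assert perf["avg_ttft_ms"] > 0
    assert perf["avg_throughput_tokens_per_sec"] > 0
    for ans in result["answers"]:
        assert "question_id" in ans and "prediction" in ans
    return True

print(validate(result))  # True
```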
## Submission Guide
### Required Files for the Preliminary Round
1. **`result.json`** - generated by running `benchmark.py`
   - Contains predictions for all samples
   - Must include valid `performance` metrics
   - **Important**: the `result.json` uploaded to the Tianchi platform is for reference only. Final scores are computed by the competition committee on standardized hardware with the official evaluation system.
2. **Your optimized code** - `evaluation_wrapper.py` containing your optimized `VLMModel` class
3. **Docker image** - a container with your optimized environment
### Evaluation Process
1. **Self-test**: test your optimizations locally with the provided `benchmark.py`
2. **Submit**: upload your `result.json` to the Tianchi platform (for reference only)
3. **Official evaluation**: the competition committee evaluates your code using:
   - Your submitted Docker image
   - A standardized hardware environment
   - The official evaluation code
   - The full validation set, with random sampling for validation
4. **Final ranking**: based on the final score computed by the official evaluation system
## Good Luck!
We hope you focus on operator-level optimization, kernel replacement, and efficient memory management. Remember: accuracy matters as much as speed. Good luck!
Appendix: Qwen3-VL-2B-Instruct model structure (output of `print(model)`):
```
Qwen3VLForConditionalGeneration(
  (model): Qwen3VLModel(
    (visual): Qwen3VLVisionModel(
      (patch_embed): Qwen3VLVisionPatchEmbed(
        (proj): Conv3d(3, 1024, kernel_size=(2, 16, 16), stride=(2, 16, 16))
      )
      (pos_embed): Embedding(2304, 1024)
      (rotary_pos_emb): Qwen3VLVisionRotaryEmbedding()
      (blocks): ModuleList(
        (0-23): 24 x Qwen3VLVisionBlock(
          (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
          (norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
          (attn): Qwen3VLVisionAttention(
            (qkv): Linear(in_features=1024, out_features=3072, bias=True)
            (proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (mlp): Qwen3VLVisionMLP(
            (linear_fc1): Linear(in_features=1024, out_features=4096, bias=True)
            (linear_fc2): Linear(in_features=4096, out_features=1024, bias=True)
            (act_fn): GELUTanh()
          )
        )
      )
      (merger): Qwen3VLVisionPatchMerger(
        (norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
        (linear_fc1): Linear(in_features=4096, out_features=4096, bias=True)
        (act_fn): GELU(approximate='none')
        (linear_fc2): Linear(in_features=4096, out_features=2048, bias=True)
      )
      (deepstack_merger_list): ModuleList(
        (0-2): 3 x Qwen3VLVisionPatchMerger(
          (norm): LayerNorm((4096,), eps=1e-06, elementwise_affine=True)
          (linear_fc1): Linear(in_features=4096, out_features=4096, bias=True)
          (act_fn): GELU(approximate='none')
          (linear_fc2): Linear(in_features=4096, out_features=2048, bias=True)
        )
      )
    )
    (language_model): Qwen3VLTextModel(
      (embed_tokens): Embedding(151936, 2048)
      (layers): ModuleList(
        (0-27): 28 x Qwen3VLTextDecoderLayer(
          (self_attn): Qwen3VLTextAttention(
            (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
            (k_proj): Linear(in_features=2048, out_features=1024, bias=False)
            (v_proj): Linear(in_features=2048, out_features=1024, bias=False)
            (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
            (q_norm): Qwen3VLTextRMSNorm((128,), eps=1e-06)
            (k_norm): Qwen3VLTextRMSNorm((128,), eps=1e-06)
          )
          (mlp): Qwen3VLTextMLP(
            (gate_proj): Linear(in_features=2048, out_features=6144, bias=False)
            (up_proj): Linear(in_features=2048, out_features=6144, bias=False)
            (down_proj): Linear(in_features=6144, out_features=2048, bias=False)
            (act_fn): SiLUActivation()
          )
          (input_layernorm): Qwen3VLTextRMSNorm((2048,), eps=1e-06)
          (post_attention_layernorm): Qwen3VLTextRMSNorm((2048,), eps=1e-06)
        )
      )
      (norm): Qwen3VLTextRMSNorm((2048,), eps=1e-06)
      (rotary_emb): Qwen3VLTextRotaryEmbedding()
    )
  )
  (lm_head): Linear(in_features=2048, out_features=151936, bias=False)
)
```
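Two attention facts can be read off the projection shapes in the dump above (a hedged arithmetic sketch; the head counts are inferred from the shapes, not stated explicitly):

```python
hidden_dim = 2048
head_dim = 128              # matches q_norm/k_norm: RMSNorm((128,))
q_out, kv_out = 2048, 1024  # out_features of q_proj vs. k_proj/v_proj

num_q_heads = q_out // head_dim     # 16 query heads
num_kv_heads = kv_out // head_dim   # 8 key/value heads
assert num_q_heads * head_dim == hidden_dim  # query projection preserves hidden size
print(num_q_heads, num_kv_heads)    # 16 8
```

Since the number of KV heads is smaller than the number of query heads, the text decoder uses grouped-query attention, which already halves KV-cache size relative to full multi-head attention; custom KV-cache work should account for this layout.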

@ -0,0 +1,406 @@
"""
AICAS 2026 - Participant Core Modification File
Participants should modify the VLMModel class to implement optimizations.
Note:
- Benchmark directly calls self.model.generate() for performance testing.
- Your optimizations should modify self.model or its operators in __init__ via Monkey Patch.
- The generate() method is optional and mainly for debugging.
"""
from typing import Dict
try:
    from PIL import Image
except ImportError:
    # Fallback stub so the `Image.Image` type hints below still resolve without PIL
    class Image:
        Image = None
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
class VLMModel:
"""
Participant optimization class - modify this to implement optimizations.
Optimization Architecture:
- Split optimizations into separate methods for isolation and testing
- Enable/disable each optimization independently in __init__
- Each optimization method can be tested individually
Important Notes:
1. Benchmark directly calls self.model.generate() for performance testing.
2. Your optimizations should modify self.model or its operators via Monkey Patch.
3. All optimizations are applied in __init__ by calling optimization methods.
"""
def __init__(self, model_path: str, device: str = "cuda:0"):
"""
Initialize model and apply optimizations.
Args:
model_path: Qwen3-VL-2B-Instruct model path
device: CUDA device, e.g., "cuda:0"
"""
self._device = device
self.model_path = model_path
# Load processor
print(f"[VLMModel] Loading processor from {model_path}...")
self._processor = AutoProcessor.from_pretrained(model_path)
# Load model
print(f"[VLMModel] Loading model with FP16...")
self._model = AutoModelForImageTextToText.from_pretrained(
model_path,
torch_dtype=torch.float16,
device_map=device
)
self._model.eval()
# Track applied optimizations
self._optimizations_applied = []
# ================================================================
# Participant Optimization Area - Enable/disable optimizations here
# Uncomment the optimization methods you want to apply
# ================================================================
# 1. Vision Encoder Acceleration
# self._optimize_vision_encoder()
# 2. KV Cache Management
# self._optimize_kv_cache()
# 3. Cross-modal Connector Optimization
# self._optimize_cross_modal_connector()
# 4. Flash Attention Optimization
# self._enable_flash_attention()
# 5. Quantization
# self._apply_quantization()
# Optional: Explore model structure before optimization
# self._explore_model_structure()
# ================================================================
print(f"[VLMModel] Model loaded successfully on {device}")
if self._optimizations_applied:
print(f"[VLMModel] Applied optimizations: {', '.join(self._optimizations_applied)}")
# ================================================================
# Optimization Methods - Implement your optimizations here
# ================================================================
def _explore_model_structure(self):
"""
Helper method to explore model structure.
Use this to understand the model architecture before implementing optimizations.
This helps identify where to apply monkey patches.
"""
print("=" * 60)
print("Model Structure Exploration")
print("=" * 60)
# Explore vision model structure
if hasattr(self._model, 'vision_model'):
print(f"Vision Model: {type(self._model.vision_model)}")
if hasattr(self._model.vision_model, 'encoder'):
if hasattr(self._model.vision_model.encoder, 'layers'):
print(f" Vision Encoder Layers: {len(self._model.vision_model.encoder.layers)}")
# Show first layer structure
if len(self._model.vision_model.encoder.layers) > 0:
print(f" First Layer Type: {type(self._model.vision_model.encoder.layers[0])}")
else:
print("Vision Model: Not found (model structure may differ)")
# Explore language model structure
if hasattr(self._model, 'model'):
print(f"Language Model: {type(self._model.model)}")
if hasattr(self._model.model, 'layers'):
print(f" Language Model Layers: {len(self._model.model.layers)}")
else:
print("Language Model: Not found (model structure may differ)")
# Explore cross-modal components
cross_modal_attrs = ['connector', 'cross_attn', 'cross_attention', 'proj', 'projector']
found_components = []
for attr in cross_modal_attrs:
if hasattr(self._model, attr):
found_components.append(attr)
if found_components:
print(f"Cross-modal Components: {', '.join(found_components)}")
else:
print("Cross-modal Components: Explore manually (structure may vary)")
print("=" * 60)
print("Tip: Use print(self._model) to see full model structure")
print("=" * 60)
def _optimize_vision_encoder(self):
"""
Optimize Vision Encoder for high-resolution image inputs.
Optimization Directions:
1. Patch embedding convolution optimization
2. Vision Transformer attention mechanism optimization
3. Layer normalization optimization
4. Memory-efficient image processing
Implementation Steps:
1. Inspect model structure: call self._explore_model_structure()
2. Identify bottlenecks using profiling tools (PyTorch Profiler, nsys, etc.)
3. Implement optimized operators (Triton/CUDA kernels)
4. Replace original operators via monkey patch
Target Components:
- self._model.vision_model (if exists)
- Vision encoder layers and attention mechanisms
- Convolution operations in patch embedding
"""
# TODO: Implement your Vision Encoder optimization here
#
# Example workflow:
# 1. from your_optimization import optimized_attention, optimized_conv
# 2. Inspect: print(self._model.vision_model) to find target layers
# 3. Replace: layer.self_attn.forward = optimized_attention
# 4. Test: Run benchmark to verify improvement
if 'vision_encoder' not in self._optimizations_applied:
self._optimizations_applied.append('vision_encoder')
def _optimize_kv_cache(self):
"""
Optimize KV Cache management to reduce memory fragmentation.
Optimization Directions:
1. Memory layout optimization (contiguous memory allocation)
2. Fragmentation-free allocation strategies
3. Efficient cache reuse patterns
4. Dynamic cache sizing
Implementation Steps:
1. Understand current KV cache implementation in model layers
2. Design memory-efficient cache allocation strategy
3. Implement custom KV cache allocator if needed
4. Apply optimizations via monkey patch or config modification
Target Components:
- self._model.config (cache configuration)
- Attention layers (KV cache allocation)
- Generation loop (cache management)
"""
# Enable KV Cache first
self._model.config.use_cache = True
if hasattr(self._model.config, 'pad_token_id'):
if self._model.config.pad_token_id is None:
self._model.config.pad_token_id = self._model.config.eos_token_id
# TODO: Implement advanced KV Cache optimizations here
#
# Example workflow:
# 1. from your_optimization import FragmentationFreeKVCache
# 2. for layer in self._model.model.layers:
# 3. layer.attention.custom_kv_cache = FragmentationFreeKVCache()
# 4. Test: Monitor memory usage and generation speed
if 'kv_cache' not in self._optimizations_applied:
self._optimizations_applied.append('kv_cache')
def _optimize_cross_modal_connector(self):
"""
Optimize Cross-modal Connector computation efficiency.
Optimization Directions:
1. Cross-attention mechanism optimization
2. Vision-to-language projection optimization
3. Multi-modal fusion layer efficiency
4. Feature alignment and transformation optimization
Implementation Steps:
1. Identify cross-modal components using self._explore_model_structure()
2. Profile cross-modal operations to find bottlenecks
3. Implement optimized cross-attention or projection kernels
4. Replace original operations via monkey patch
Note: Qwen3-VL's cross-modal structure may vary.
Use model exploration to identify actual component names and locations.
"""
# TODO: Implement your Cross-modal Connector optimization here
#
# Example workflow:
# 1. Explore: self._explore_model_structure() to find connector components
# 2. from your_optimization import optimized_cross_attention
# 3. Identify: Inspect model to find cross-attention layers
# 4. Replace: connector.cross_attention.forward = optimized_cross_attention
# 5. Test: Verify accuracy and performance improvements
        # Applied optimization: replace Qwen3VLModel.forward with the sparse
        # visual-token prefill patch defined in my_patch
        from my_patch import patch_forward
        self._model.model.__class__.forward = patch_forward
if 'cross_modal' not in self._optimizations_applied:
self._optimizations_applied.append('cross_modal')
def _enable_flash_attention(self):
"""
Enable or implement Flash Attention optimization.
Implementation Approaches:
Approach 1: Enable PyTorch's Built-in Flash Attention (Simple)
- Uses torch.backends.cuda.enable_flash_sdp(True)
- Easy to enable but limited customization
- May not work for all attention patterns in Qwen3-VL
Approach 2: Implement Custom Flash Attention (Advanced, Recommended)
- Write custom Triton/CUDA kernels for attention computation
- Replace torch.nn.functional.scaled_dot_product_attention
- Full control over attention computation and memory layout
- Better performance potential but requires more implementation effort
Recommended: Implement Approach 2 for better performance gains.
Use profiling to identify which attention operations benefit most from optimization.
"""
# TODO: Choose and implement your Flash Attention approach
# Approach 1: Simple (enable PyTorch built-in)
# torch.backends.cuda.enable_flash_sdp(True)
# Approach 2: Advanced (custom implementation - recommended)
# from your_optimization import custom_flash_attention
# torch.nn.functional.scaled_dot_product_attention = custom_flash_attention
#
# Or replace at layer level:
# for layer in self._model.model.layers:
# layer.self_attn.forward = custom_attention_with_flash
if 'flash_attention' not in self._optimizations_applied:
self._optimizations_applied.append('flash_attention')
def _apply_quantization(self):
"""
Apply quantization to reduce model size and speed up inference.
Optimization Directions:
1. INT8 quantization (8-bit integer)
2. FP8 quantization (8-bit floating point)
3. Mixed precision quantization
4. Dynamic vs static quantization
Implementation Steps:
1. Choose quantization strategy based on accuracy/performance trade-off
2. Use quantization libraries (BitsAndBytes, TensorRT, etc.)
3. Calibrate quantized model on validation data
4. Verify accuracy preservation
Note: Quantization may require reloading the model with quantization config.
Consider applying quantization before other optimizations if model reload is needed.
"""
# TODO: Implement your quantization here
#
# Example workflow:
# 1. from transformers import BitsAndBytesConfig
# 2. quantization_config = BitsAndBytesConfig(load_in_8bit=True)
# 3. Note: May need to reload model with quantization config
# 4. Test: Verify accuracy and performance improvements
if 'quantization' not in self._optimizations_applied:
self._optimizations_applied.append('quantization')
# Required properties for benchmark
@property
def processor(self):
"""
Required by benchmark for input processing.
Benchmark uses this to prepare inputs with unified tokenizer.
"""
return self._processor
@property
def model(self):
"""
Required by benchmark for direct model.generate() calls.
Benchmark directly calls self.model.generate() for performance testing.
Your optimizations should modify this model object or its operators.
"""
return self._model
@property
def device(self):
"""
Required by benchmark for device information.
"""
return self._device
def generate(
self,
image: Image.Image,
question: str,
max_new_tokens: int = 128
) -> Dict:
"""
Generate answer (optional method, mainly for debugging).
Note: Benchmark uses self.model.generate() directly for performance testing.
This method is provided for convenience and debugging purposes.
Args:
image: PIL Image object
question: Question text
max_new_tokens: Maximum tokens to generate
Returns:
Dict: {
"text": str, # Generated text answer
"token_count": int # Generated token count
}
"""
# Build Qwen3-VL message format
messages = [{
"role": "user",
"content": [
{"type": "image", "image": image},
{"type": "text", "text": question}
]
}]
# Process inputs
inputs = self._processor.apply_chat_template(
messages,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
return_tensors="pt"
).to(self._device)
# Generate
with torch.no_grad():
output_ids = self._model.generate(
**inputs,
max_new_tokens=max_new_tokens,
do_sample=False,  # greedy decoding; temperature/top_p omitted (ignored when sampling is off)
use_cache=True
)
# Extract generated tokens (remove input part)
input_len = inputs.input_ids.shape[1]
generated_ids = output_ids[0][input_len:]
# Decode
text = self._processor.tokenizer.decode(
generated_ids,
skip_special_tokens=True,
clean_up_tokenization_spaces=False
)
return {
"text": text,
"token_count": len(generated_ids)
}

@ -0,0 +1,364 @@
import numpy as np
import torch
from transformers.models.qwen3_vl.processing_qwen3_vl import Qwen3VLProcessor, Qwen3VLProcessorKwargs
from transformers.models.qwen3_vl.modeling_qwen3_vl import Qwen3VLModelOutputWithPast, BaseModelOutputWithDeepstackFeatures
from transformers.feature_extraction_utils import BatchFeature
from transformers.image_utils import ImageInput
from transformers.processing_utils import Unpack
from transformers.tokenization_utils_base import PreTokenizedInput, TextInput
from transformers.utils import logging, TransformersKwargs, can_return_tuple
from transformers.video_utils import VideoInput
from transformers.cache_utils import Cache
logger = logging.get_logger(__name__)
class myQwen3VLProcessor(Qwen3VLProcessor):
def __init__(self, image_processor=None, tokenizer=None, video_processor=None, chat_template=None, **kwargs):
super().__init__(image_processor, tokenizer, video_processor, chat_template, **kwargs)
def __call__(
self,
images: ImageInput = None,
text: TextInput | PreTokenizedInput | list[TextInput] | list[PreTokenizedInput] = None,
videos: VideoInput = None,
**kwargs: Unpack[Qwen3VLProcessorKwargs],
) -> BatchFeature:
r"""
Returns:
[`BatchFeature`]: A [`BatchFeature`] with the following fields:
- **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
- **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
`return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
`None`).
- **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
- **pixel_values_videos** -- Pixel values of videos to be fed to a model. Returned when `videos` is not `None`.
- **image_grid_thw** -- List of image 3D grid in LLM. Returned when `images` is not `None`.
- **video_grid_thw** -- List of video 3D grid in LLM. Returned when `videos` is not `None`.
"""
output_kwargs = self._merge_kwargs(
Qwen3VLProcessorKwargs,
tokenizer_init_kwargs=self.tokenizer.init_kwargs,
**kwargs,
)
if images is not None:
image_inputs = self.image_processor(images=images, **output_kwargs["images_kwargs"])
image_grid_thw = image_inputs["image_grid_thw"]
else:
image_inputs = {}
image_grid_thw = None
if videos is not None:
videos_inputs = self.video_processor(videos=videos, **output_kwargs["videos_kwargs"])
video_grid_thw = videos_inputs["video_grid_thw"]
# If user has not requested video metadata, pop it
if not kwargs.get("return_metadata"):
video_metadata = videos_inputs.pop("video_metadata")
else:
video_metadata = videos_inputs["video_metadata"]
else:
videos_inputs = {}
video_grid_thw = None
if not isinstance(text, list):
text = [text]
text = text.copy() # below lines change text in-place
if image_grid_thw is not None:
merge_length = self.image_processor.merge_size**2
index = 0
for i in range(len(text)):
while self.image_token in text[i]:
# num_image_tokens = image_grid_thw[index].prod() // merge_length
num_image_tokens = 40
text[i] = text[i].replace(self.image_token, "<|placeholder|>" * num_image_tokens, 1)
index += 1
text[i] = text[i].replace("<|placeholder|>", self.image_token)
if video_grid_thw is not None:
merge_length = self.video_processor.merge_size**2
index = 0
for i in range(len(text)):
while self.video_token in text[i]:
metadata = video_metadata[index]
if metadata.fps is None:
logger.warning_once(
"Qwen3VL requires frame timestamps to construct prompts, but the `fps` of the input video could not be inferred. "
"Probably `video_metadata` was missing from inputs and you passed pre-sampled frames. "
"Defaulting to `fps=24`. Please provide `video_metadata` for more accurate results."
)
metadata.fps = 24 if metadata.fps is None else metadata.fps
# if timestamps are not provided, calculate them
curr_timestamp = self._calculate_timestamps(
metadata.frames_indices,
metadata.fps,
self.video_processor.temporal_patch_size,
)
video_placeholder = ""
frame_seqlen = video_grid_thw[index][1:].prod() // merge_length
for frame_idx in range(video_grid_thw[index][0]):
curr_time = curr_timestamp[frame_idx]
video_placeholder += f"<{curr_time:.1f} seconds>"
video_placeholder += (
self.vision_start_token + "<|placeholder|>" * frame_seqlen + self.vision_end_token
)
if f"{self.vision_start_token}{self.video_token}{self.vision_end_token}" in text[i]:
text[i] = text[i].replace(
f"{self.vision_start_token}{self.video_token}{self.vision_end_token}", video_placeholder, 1
)
else:
# vllm may input video token directly
text[i] = text[i].replace(self.video_token, video_placeholder, 1)
index += 1
text[i] = text[i].replace("<|placeholder|>", self.video_token)
return_tensors = output_kwargs["text_kwargs"].pop("return_tensors", None)
return_mm_token_type_ids = output_kwargs["text_kwargs"].pop("return_mm_token_type_ids", None)
text_inputs = self.tokenizer(text, **output_kwargs["text_kwargs"])
self._check_special_mm_tokens(text, text_inputs, modalities=["image", "video"])
if return_mm_token_type_ids:
array_ids = np.array(text_inputs["input_ids"])
mm_token_type_ids = np.zeros_like(text_inputs["input_ids"])
mm_token_type_ids[array_ids == self.image_token_id] = 1
text_inputs["mm_token_type_ids"] = mm_token_type_ids.tolist()
return BatchFeature(data={**text_inputs, **image_inputs, **videos_inputs}, tensor_type=return_tensors)
def _sample_indices_uniform(idx: torch.LongTensor, keep_ratio: float, min_keep: int = 0):
"""
idx: 1D indices in original sequence (sorted)
keep_ratio: 0~1, keep uniformly spaced
"""
n = idx.numel()
if n == 0:
return idx
k = max(min_keep, int(torch.ceil(torch.tensor(n * keep_ratio)).item()))
k = min(k, n)
if k == n:
return idx
# uniform pick: linspace over [0, n-1]
pos = torch.linspace(0, n - 1, steps=k, device=idx.device)
pos = pos.round().long().clamp(0, n - 1)
return idx[pos]
def sparse_keep_and_gather(
inputs_embeds, # (B,S,D)
attention_mask, # (B,S)
position_ids, # (4,B,S)
visual_pos_masks, # (B,S) bool
deepstack_visual_embeds,# list[tensor] each (Nvis_total,D) OR None
keep_ratio: float = 0.25,
min_keep_per_vis: int = 0,
max_len: int | None = None,
):
"""
稀疏保留:保留全部文本 token视觉 token 按 keep_ratio 均匀采样保留。
可选 max_len如果最终还超长再从视觉 token 里继续裁(不动文本)。
"""
device = inputs_embeds.device
B, S, D = inputs_embeds.shape
eff = attention_mask.bool()
    keep_mask_token = torch.zeros((B, S), dtype=torch.bool, device=device)
    for b in range(B):
        eff_idx = eff[b].nonzero(as_tuple=False).squeeze(1)  # valid (unpadded) tokens
        if eff_idx.numel() == 0:
            continue
        vis_eff = visual_pos_masks[b, eff_idx]  # which valid tokens are visual
        text_idx = eff_idx[~vis_eff]            # text tokens: keep all
        vis_idx = eff_idx[vis_eff]              # visual tokens: to be subsampled
        # Uniform sparse sampling of visual tokens (this step drops the middle ones)
        kept_vis = _sample_indices_uniform(vis_idx, keep_ratio, min_keep=min_keep_per_vis)
        chosen = torch.cat([text_idx, kept_vis], dim=0)
        chosen, _ = torch.sort(chosen)  # restore original order
        # If a hard length cap is set, keep trimming visual tokens first (never text)
        if max_len is not None and chosen.numel() > max_len:
            chosen_vis = chosen[visual_pos_masks[b, chosen]]   # visual positions kept so far
            chosen_txt = chosen[~visual_pos_masks[b, chosen]]
            # If text alone already exceeds max_len, we must truncate text (rare)
            if chosen_txt.numel() >= max_len:
                chosen = chosen_txt[:max_len]
            else:
                budget = max_len - chosen_txt.numel()
                # Uniformly trim the visual tokens down to the remaining budget
                chosen_vis = _sample_indices_uniform(chosen_vis, budget / max(chosen_vis.numel(), 1))
                chosen = torch.cat([chosen_txt, chosen_vis], dim=0)
                chosen, _ = torch.sort(chosen)
        keep_mask_token[b, chosen] = True
    # ===== Gather kept tokens and pad to the max kept length in the batch =====
    keep_lens = keep_mask_token.sum(dim=1).tolist()
    max_keep = max(keep_lens) if keep_lens else 0
    new_inputs = inputs_embeds.new_zeros((B, max_keep, D))
    new_attn = attention_mask.new_zeros((B, max_keep))
    new_pos = position_ids.new_zeros((4, B, max_keep))
    new_vis = visual_pos_masks.new_zeros((B, max_keep), dtype=torch.bool)
    for b in range(B):
        idx = keep_mask_token[b].nonzero(as_tuple=False).squeeze(1)
        L = idx.numel()
        if L == 0:
            continue
        new_inputs[b, :L, :] = inputs_embeds[b, idx, :]
        new_attn[b, :L] = attention_mask[b, idx]
        new_pos[:, b, :L] = position_ids[:, b, idx]
        new_vis[b, :L] = visual_pos_masks[b, idx]
    # ===== Trim the deepstack embeds in sync (critical!) =====
    new_deepstack = None
    if deepstack_visual_embeds is not None:
        # The deepstack order equals the order of True entries in the flattened
        # visual_pos_masks, so index with keep_mask_token restricted to those positions.
        keep_vis_flat = keep_mask_token[visual_pos_masks]  # 1D bool, length = Nvis_total
        new_deepstack = [x[keep_vis_flat] for x in deepstack_visual_embeds]
    return new_inputs, new_attn, new_pos, new_vis, new_deepstack
@can_return_tuple
def patch_forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: torch.Tensor | None = None,
position_ids: torch.LongTensor | None = None,
past_key_values: Cache | None = None,
inputs_embeds: torch.FloatTensor | None = None,
pixel_values: torch.Tensor | None = None,
pixel_values_videos: torch.FloatTensor | None = None,
image_grid_thw: torch.LongTensor | None = None,
video_grid_thw: torch.LongTensor | None = None,
cache_position: torch.LongTensor | None = None,
**kwargs: Unpack[TransformersKwargs],
) -> tuple | Qwen3VLModelOutputWithPast:
r"""
image_grid_thw (`torch.LongTensor` of shape `(num_images, 3)`, *optional*):
The temporal, height and width of feature shape of each image in LLM.
video_grid_thw (`torch.LongTensor` of shape `(num_videos, 3)`, *optional*):
The temporal, height and width of feature shape of each video in LLM.
"""
if (input_ids is None) ^ (inputs_embeds is not None):
raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
if inputs_embeds is None:
inputs_embeds = self.get_input_embeddings()(input_ids)
image_mask = None
video_mask = None
if pixel_values is not None:
image_outputs: BaseModelOutputWithDeepstackFeatures = self.get_image_features(
pixel_values, image_grid_thw, return_dict=True
)
image_embeds = image_outputs.pooler_output
deepstack_image_embeds = image_outputs.deepstack_features
image_embeds = torch.cat(image_embeds, dim=0).to(inputs_embeds.device, inputs_embeds.dtype)
image_mask, _ = self.get_placeholder_mask(
input_ids, inputs_embeds=inputs_embeds, image_features=image_embeds
)
inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds)
if pixel_values_videos is not None:
video_outputs: BaseModelOutputWithDeepstackFeatures = self.get_video_features(
pixel_values_videos, video_grid_thw, return_dict=True
)
video_embeds = video_outputs.pooler_output
deepstack_video_embeds = video_outputs.deepstack_features
video_embeds = torch.cat(video_embeds, dim=0).to(inputs_embeds.device, inputs_embeds.dtype)
_, video_mask = self.get_placeholder_mask(
input_ids, inputs_embeds=inputs_embeds, video_features=video_embeds
)
inputs_embeds = inputs_embeds.masked_scatter(video_mask, video_embeds)
visual_pos_masks = None
deepstack_visual_embeds = None
if image_mask is not None and video_mask is not None:
# aggregate visual_pos_masks and deepstack_visual_embeds
image_mask = image_mask[..., 0]
video_mask = video_mask[..., 0]
visual_pos_masks = image_mask | video_mask
deepstack_visual_embeds = []
image_mask_joint = image_mask[visual_pos_masks]
video_mask_joint = video_mask[visual_pos_masks]
for img_embed, vid_embed in zip(deepstack_image_embeds, deepstack_video_embeds):
embed_joint = img_embed.new_zeros(visual_pos_masks.sum(), img_embed.shape[-1]).to(img_embed.device)
embed_joint[image_mask_joint, :] = img_embed
embed_joint[video_mask_joint, :] = vid_embed
deepstack_visual_embeds.append(embed_joint)
elif image_mask is not None:
image_mask = image_mask[..., 0]
visual_pos_masks = image_mask
deepstack_visual_embeds = deepstack_image_embeds
elif video_mask is not None:
video_mask = video_mask[..., 0]
visual_pos_masks = video_mask
deepstack_visual_embeds = deepstack_video_embeds
if position_ids is None:
position_ids = self.compute_3d_position_ids(
input_ids=input_ids,
image_grid_thw=image_grid_thw,
video_grid_thw=video_grid_thw,
inputs_embeds=inputs_embeds,
attention_mask=attention_mask,
past_key_values=past_key_values,
)
# ====== Sparse-sampling pruning: prefill only (KV cache is empty) ======
if past_key_values.get_seq_length() == 0 and visual_pos_masks is not None:
# These parameters can be passed in through kwargs
keep_ratio = kwargs.pop("visual_keep_ratio", 0.1)  # keep only 10% of visual tokens
min_keep = kwargs.pop("min_keep_per_vis", 0)  # minimum tokens kept per visual segment (e.g. 16)
max_len = kwargs.pop("truncate_max_len", None)  # optional cap on total sequence length
inputs_embeds, attention_mask, position_ids, visual_pos_masks, deepstack_visual_embeds = sparse_keep_and_gather(
inputs_embeds=inputs_embeds,
attention_mask=attention_mask,
position_ids=position_ids,
visual_pos_masks=visual_pos_masks,
deepstack_visual_embeds=deepstack_visual_embeds,
keep_ratio=keep_ratio,
min_keep_per_vis=min_keep,
max_len=max_len,
)
# Rebuild cache_position as 0..L-1 to avoid alignment issues
cache_position = torch.arange(
inputs_embeds.shape[1], device=inputs_embeds.device, dtype=torch.long
).unsqueeze(0).expand(inputs_embeds.shape[0], -1)
# Recompute rope_deltas from the pruned sequence as well (prevents inconsistency)
eff_len = attention_mask.sum(dim=1).to(torch.long) # (B,)
max_pos = position_ids.max(dim=0).values.max(dim=1).values # (B,)
self.rope_deltas = (max_pos + 1 - eff_len).unsqueeze(1)
# ====== End of pruning ======
outputs = self.language_model(
input_ids=None,
position_ids=position_ids,
attention_mask=attention_mask,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
cache_position=cache_position,
visual_pos_masks=visual_pos_masks,
deepstack_visual_embeds=deepstack_visual_embeds,
**kwargs,
)
return Qwen3VLModelOutputWithPast(
**outputs,
rope_deltas=self.rope_deltas,
)
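The pruning block above delegates to `sparse_keep_and_gather`, which is not shown in this view. Below is a minimal sketch of what such a helper could look like, assuming batch size 1, `(3, B, L)`-shaped 3D RoPE `position_ids`, and a uniform-stride sampling policy; the actual implementation may select tokens differently (e.g. by saliency):

```python
import torch

def sparse_keep_and_gather(inputs_embeds, attention_mask, position_ids,
                           visual_pos_masks, deepstack_visual_embeds,
                           keep_ratio=0.1, min_keep_per_vis=0, max_len=None):
    """Drop most visual tokens before prefill, keeping all text tokens.

    Assumed shapes (batch size 1): inputs_embeds (1, L, D), attention_mask (1, L),
    position_ids (3, 1, L), visual_pos_masks (1, L) bool,
    deepstack_visual_embeds: list of (N_vis, D) tensors, one per deepstack layer.
    """
    assert inputs_embeds.shape[0] == 1, "sketch assumes batch size 1"
    dev = inputs_embeds.device
    L = inputs_embeds.shape[1]
    vis_idx = visual_pos_masks[0].nonzero(as_tuple=True)[0]  # positions of visual tokens
    n_vis = vis_idx.numel()
    n_keep = min(n_vis, max(int(n_vis * keep_ratio), min_keep_per_vis))
    # Uniform stride over the visual span: cheap, and keeps spatial coverage.
    keep_local = torch.linspace(0, max(n_vis - 1, 0), n_keep, device=dev).round().long().unique()
    keep_mask = torch.ones(L, dtype=torch.bool, device=dev)
    keep_mask[vis_idx] = False                # drop every visual token ...
    keep_mask[vis_idx[keep_local]] = True     # ... then re-add the sampled ones
    keep_idx = keep_mask.nonzero(as_tuple=True)[0]
    if max_len is not None and keep_idx.numel() > max_len:
        keep_idx = keep_idx[:max_len]
    # Map surviving visual positions back to their rank among all visual tokens,
    # so the per-token deepstack features stay aligned after pruning.
    rank = torch.full((L,), -1, dtype=torch.long, device=dev)
    rank[vis_idx] = torch.arange(n_vis, device=dev)
    deep_idx = rank[keep_idx][visual_pos_masks[0][keep_idx]]
    return (inputs_embeds[:, keep_idx],
            attention_mask[:, keep_idx],
            position_ids[:, :, keep_idx],
            visual_pos_masks[:, keep_idx],
            [e[deep_idx] for e in deepstack_visual_embeds])
```

Uniform striding is the cheapest policy; saliency-based selection (e.g. keeping the tokens with the largest embedding norm) tends to preserve accuracy better at the same keep ratio, as the degraded predictions in the result files suggest.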


@@ -34,7 +34,7 @@ class VLMModel:
     3. All optimizations are applied in __init__ by calling optimization methods.
     """
-    def __init__(self, model_path: str, device: str = "cpu"):
+    def __init__(self, model_path: str, device: str = "cuda:0"):
     """
     Initialize model and apply optimizations.
@@ -73,7 +73,7 @@ class VLMModel:
     # self._optimize_kv_cache()
     # 3. Cross-modal Connector Optimization
-    self._optimize_cross_modal_connector()
+    # self._optimize_cross_modal_connector()
     # 4. Flash Attention Optimization
     # self._enable_flash_attention()

@@ -249,8 +249,6 @@ def patch_forward(
     video_grid_thw (`torch.LongTensor` of shape `(num_videos, 3)`, *optional*):
         The temporal, height and width of feature shape of each video in LLM.
     """
-    import time
-    start = time.time()
     if (input_ids is None) ^ (inputs_embeds is not None):
         raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
@@ -319,34 +317,34 @@ def patch_forward(
         past_key_values=past_key_values,
     )
-    # # ====== Sparse-sampling pruning: prefill only (KV cache is empty) ======
-    # if past_key_values.get_seq_length() == 0 and visual_pos_masks is not None:
-    #     # These parameters can be passed in through kwargs
-    #     keep_ratio = kwargs.pop("visual_keep_ratio", 0.1)  # keep only 10% of visual tokens
-    #     min_keep = kwargs.pop("min_keep_per_vis", 0)  # minimum tokens kept per visual segment (e.g. 16)
-    #     max_len = kwargs.pop("truncate_max_len", None)  # optional cap on total sequence length
+    # ====== Sparse-sampling pruning: prefill only (KV cache is empty) ======
+    if past_key_values.get_seq_length() == 0 and visual_pos_masks is not None:
+        # These parameters can be passed in through kwargs
+        keep_ratio = kwargs.pop("visual_keep_ratio", 0.1)  # keep only 10% of visual tokens
+        min_keep = kwargs.pop("min_keep_per_vis", 0)  # minimum tokens kept per visual segment (e.g. 16)
+        max_len = kwargs.pop("truncate_max_len", None)  # optional cap on total sequence length
-    # inputs_embeds, attention_mask, position_ids, visual_pos_masks, deepstack_visual_embeds = sparse_keep_and_gather(
-    #     inputs_embeds=inputs_embeds,
-    #     attention_mask=attention_mask,
-    #     position_ids=position_ids,
-    #     visual_pos_masks=visual_pos_masks,
-    #     deepstack_visual_embeds=deepstack_visual_embeds,
-    #     keep_ratio=keep_ratio,
-    #     min_keep_per_vis=min_keep,
-    #     max_len=max_len,
-    # )
+        inputs_embeds, attention_mask, position_ids, visual_pos_masks, deepstack_visual_embeds = sparse_keep_and_gather(
+            inputs_embeds=inputs_embeds,
+            attention_mask=attention_mask,
+            position_ids=position_ids,
+            visual_pos_masks=visual_pos_masks,
+            deepstack_visual_embeds=deepstack_visual_embeds,
+            keep_ratio=keep_ratio,
+            min_keep_per_vis=min_keep,
+            max_len=max_len,
+        )
-    # # Rebuild cache_position as 0..L-1 to avoid alignment issues
-    # cache_position = torch.arange(
-    #     inputs_embeds.shape[1], device=inputs_embeds.device, dtype=torch.long
-    # ).unsqueeze(0).expand(inputs_embeds.shape[0], -1)
+        # Rebuild cache_position as 0..L-1 to avoid alignment issues
+        cache_position = torch.arange(
+            inputs_embeds.shape[1], device=inputs_embeds.device, dtype=torch.long
+        ).unsqueeze(0).expand(inputs_embeds.shape[0], -1)
-    # # Recompute rope_deltas from the pruned sequence as well (prevents inconsistency)
-    # eff_len = attention_mask.sum(dim=1).to(torch.long)  # (B,)
-    # max_pos = position_ids.max(dim=0).values.max(dim=1).values  # (B,)
-    # self.rope_deltas = (max_pos + 1 - eff_len).unsqueeze(1)
-    # # ====== End of pruning ======
+        # Recompute rope_deltas from the pruned sequence as well (prevents inconsistency)
+        eff_len = attention_mask.sum(dim=1).to(torch.long)  # (B,)
+        max_pos = position_ids.max(dim=0).values.max(dim=1).values  # (B,)
+        self.rope_deltas = (max_pos + 1 - eff_len).unsqueeze(1)
+        # ====== End of pruning ======
     outputs = self.language_model(
         input_ids=None,
@@ -360,9 +358,6 @@ def patch_forward(
         **kwargs,
     )
-    end = time.time()
-    print('Runtime: %s ms' % ((end - start) * 1000))
     return Qwen3VLModelOutputWithPast(
         **outputs,
         rope_deltas=self.rope_deltas,

result.json Normal file

@@ -0,0 +1,117 @@
{
"system_info": {
"timestamp": "2026-02-26T08:10:16.296574",
"python_version": "3.12.12",
"python_full_version": "3.12.12 | packaged by Anaconda, Inc. | (main, Oct 21 2025, 20:16:04) [GCC 11.2.0]",
"torch_version": "2.10.0+cu128",
"cuda_available": true,
"cuda_version": "12.8",
"cudnn_version": "91002",
"gpu_count": 1,
"gpu_name": "NVIDIA GeForce RTX 4090",
"gpu_memory_gb": 23.52,
"gpu_compute_capability": "8.9",
"cpu_processor": "x86_64",
"cpu_count_physical": 16,
"cpu_count_logical": 16,
"cpu_freq_mhz": 3245.12,
"cpu_model": "AMD EPYC 9354 32-Core Processor",
"platform_system": "Linux",
"platform_release": "5.15.0-105-generic",
"platform_version": "#115-Ubuntu SMP Mon Apr 15 09:52:04 UTC 2024",
"platform_machine": "x86_64",
"platform_architecture": "64bit",
"ppu_available": false,
"ppu_info": {},
"gpu_driver_version": "580.95.05",
"gpu_memory_total": "24564 MiB",
"memory_total_gb": 54.92,
"memory_available_gb": 52.92
},
"performance": {
"avg_ttft_ms": 60.76,
"avg_throughput_tokens_per_sec": 51.24
},
"answers": [
{
"question_id": 34602,
"prediction": "Based on the text visible on the camera in the image, the brand of this camera is **Dakota Digital**.\n\nThis is clearly printed on the top left of the camera's body. The camera is also labeled as a \"Single-Use Camera\" and has a \"Pure Digital\" logo, which is a feature of the Dakota Digital brand."
},
{
"question_id": 34603,
"prediction": "copenhagen"
},
{
"question_id": 34604,
"prediction": "Based on the label in the image, this is a **Self-Righteous Ale**.\n\nHere are the details from the label:\n- **Beer Type:** Ale\n- **Alcohol Content:** 8.7% Alc/Vol\n- **Brand:** Stone\n- **Name:** Sublimely Self-Righteous\n\nThe label features a graphic of a muscular, horned figure, which is likely a representation of the \"Stone\" brand's logo. The name \"Sublimely Self-Righteous\" suggests a bold and perhaps slightly rebellious or self-assertive character, which is fitting for a beer with a strong, distinctive name."
},
{
"question_id": 34605,
"prediction": "Based on the image provided, the brand of liquor on the right is **Bowmore**.\n\nThis is clearly visible on the blue label of the bottle in the center-right of the image. The label reads:\n\n- **BOWMORE**\n- **ISLAY SINGLE MALT SCOTCH WHISKY**\n- **TEMPER**\n- **NON CHILL FILTERED**\n- **BATCH RELEASE No**\n- **DISTILLED AND BOTTLED IN SCOTLAND**\n- **55.6% alc./vol.**\n- **AGED 10 YEARS**\n\nThe bottle is a **Bowmore Islay Single Malt Scotch Whisky**."
},
{
"question_id": 34606,
"prediction": "Based on the label on the rightmost bottle, the drink has been aged for **10 years**.\n\nThis is clearly stated on the blue label of the Bowmore Islay Single Malt Scotch Whisky bottle:\n\n- **AGED 10 YEARS**"
},
{
"question_id": 34607,
"prediction": "Based on the image provided, the number on the player's jersey is **22**.\n\nThis can be seen clearly on the front of his white jersey, just below the red stripe on the sleeve."
},
{
"question_id": 34608,
"prediction": "Based on the watch face in the image, we can determine the time by examining the positions of the hands.\n\n- The **hour hand** is pointing just past the number 2.\n- The **minute hand** is pointing at the number 4, which represents 20 minutes.\n- The **second hand** is pointing at the number 10, which represents 10 seconds.\n\nTherefore, the time displayed on the watch is **2:20:10**."
},
{
"question_id": 34609,
"prediction": "Based on the details visible in the image, the watch is an **Audemars Piguet**.\n\nHere are the key features that identify the brand:\n\n- **Logo:** The \"AP\" logo is clearly visible on the dial, just below the 12 o'clock position.\n- **Dial:** The dial has a distinctive blue and white color scheme with a light blue outer ring, which is characteristic of the Audemars Piguet Royal Oak collection.\n- **Case:** The watch has a robust, octagonal case with a brushed metal finish, a signature design element of the Audemars Piguet Royal Oak.\n- **Bracelet:** The white rubber strap is consistent with the design of the Audemars Piguet Royal Oak, which is known for its unique, flexible, and durable rubber strap.\n\nThe watch in the image is a **Audemars Piguet Royal Oak** chronograph."
},
{
"question_id": 34610,
"prediction": "Based on the visual information in the image, the person at the center of the whiteboard is **Bryan Owens**.\n\nHere's how we can determine this:\n\n- The whiteboard is a mind map or flowchart that connects various people and events.\n- The central figure is highlighted by a large, prominent note that reads \"Bryan Owens\".\n- The note also includes a cartoon drawing of a person with a hat and the text \"Bryan Owens\" below it.\n- The flowchart shows that Bryan Owens is connected to many other people and events, including:\n - **Kristie Weatherford** (with a red arrow pointing to her)\n - **Alexa Cupps** (with a purple arrow pointing to her)\n - **Caroline Chong** (with a green arrow pointing to her)\n - **Alex Marsh** (with a blue arrow pointing to her)\n - **Dime Ferer** (with a red arrow pointing to her)\n - **UK Sketch Camp!** (with a green arrow pointing to it)\n - **IxDA.org** (with a green arrow pointing to it)\n - **Meet of SXSW 2012** (with a purple arrow pointing to it)\n- The person writing on the board is a man in a red cap, and he is actively drawing a line from the center of the diagram to the person named \"Bryan Owens\".\n\nTherefore, Bryan Owens is the central figure in this mind map."
},
{
"question_id": 34611,
"prediction": "Based on the image provided, the photographer is Philippe Molitor.\n\nThis information is visible in the bottom-left corner of the image, where the text \"© Gleamlight / Philippe Molitor\" is printed."
},
{
"question_id": 34612,
"prediction": "Based on the image provided, the switches are all in the **off** position.\n\nHere's the reasoning:\n- Each switch has the word \"OFF\" clearly printed on its face.\n- The switches are all in the same state, with the toggle arms in the \"off\" position.\n- The switches are all in the same state, with the toggle arms in the \"off\" position."
},
{
"question_id": 34613,
"prediction": "Based on the image provided, the candy bar located at the bottom of the scene is a **Hershey's** chocolate bar.\n\nYou can identify it by the distinctive \"HERSHEY'S\" logo printed in large, bold, white letters on the dark brown wrapper. The bar is positioned in the foreground, nestled in the snow."
},
{
"question_id": 34614,
"prediction": "Based on the image provided, the sign on the farthest right window reads:\n\n**BUD LIGHT**\n\nThis is a circular, blue and white sign with the brand name \"BUD LIGHT\" in white text."
},
{
"question_id": 34615,
"prediction": "Based on the price sign visible in the image, a can of Skoal costs $3.82.\n\nThis is shown in the red price tag on the left side of the store's entrance, which lists the following prices:\n- $4.52 for a 12-pack\n- $3.82 for a can of Skoal\n- $3.16 for a 12-pack of coffee\n- $1.85 for a can of coffee"
},
{
"question_id": 34616,
"prediction": "Yes, the sign in the image is for Denny's. The name \"Denny's\" is clearly visible in red lettering on a yellow background."
},
{
"question_id": 34617,
"prediction": "Based on the image provided, the letters on the sign are **red**."
},
{
"question_id": 34618,
"prediction": "Based on the image provided, the bottle with the red label is **Red Label**.\n\nIt is a well-known brand of Scotch whisky, and the bottle is clearly visible on the left side of the bar counter. The label features a red and gold design with the name \"Red Label\" prominently displayed."
},
{
"question_id": 34619,
"prediction": "Based on the image provided, there are two percentages shown on the posters.\n\n- A large, yellow circular sign on the glass door prominently displays **0%**.\n- On the poster for \"THE IDOLM@STER 2\", there is a smaller text that reads **10%**.\n\nTherefore, the percentages shown on the posters are **0%** and **10%**."
},
{
"question_id": 34620,
"prediction": "Based on the image provided, we can determine the number of items you can get for $5 by examining the price tags on the shelves.\n\nThe price tags are arranged in rows, and the prices are listed as:\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n- $2.39\n-"
},
{
"question_id": 34621,
"prediction": "Based on the image provided, there are **4** price tags on the bottom shelf.\n\nHere is a breakdown of the price tags visible on the bottom shelf:\n\n- **Left side:** A yellow price tag for the \"Betty Crocker Super Moist\" cake mix is visible, but it is not a price tag in the traditional sense. The price tag for the \"Betty Crocker\" brand is not a price tag for the product itself, but rather a price tag for the product. The price tag for the \"Betty Crocker\" brand is not a price tag for the product itself, but rather a price tag for the product. The price tag for the \"Betty Crocker\" brand is not a price tag for the product itself, but rather a price tag for the product. The price tag for the \"Betty Crocker\" brand is not a price tag for the product itself, but rather a price tag for the product. The price tag for the \"Betty Crocker\" brand is not a price tag for the product itself, but rather a price tag for the product. The price tag for the \"Betty Crocker\" brand is not a price tag for the product itself, but rather a price tag for the product. The price tag for the \"Betty Crocker\" brand is not a price tag for the product itself, but rather a price tag for the product. The price tag for the \"Betty Crocker\" brand is not a price tag for the product itself, but rather a price tag for the product. The price tag for the \"Betty Crocker\" brand is not a price tag for the product itself, but rather a price tag for the product. The price tag for the \"Betty Crocker\" brand is not a price tag for the product itself, but rather a price tag for the product. The price tag for the \"Betty Crocker\" brand is not a price tag for the product itself, but rather a price tag for the product. The price tag for the \"Betty Crocker\" brand is not a price tag for the product itself, but rather a price tag for the product. The price tag for the \"Betty Crocker\" brand is not a price tag for the product itself, but rather a price tag for the product. 
The price tag for the \"Betty Crocker\" brand is not a price tag for the product itself, but rather a price tag for the product. The price tag for the \"Betty Crocker\" brand is not a price tag for the product itself, but rather a price tag for the product. The price tag for the \"Betty Crocker\" brand is not a price tag for the product itself, but rather a price tag for the product. The price tag for the \"Betty Crocker\" brand is not a price tag for the product itself, but rather a price tag for the product. The price tag for the \"Betty Crocker\" brand is not a price tag for the product itself, but rather a price tag for the product. The price tag for the \"Betty Crocker\" brand is not a price tag for the product itself, but rather a price tag for the product. The price tag for the \"Betty Crocker\" brand is not a price tag for the product itself, but rather a price tag for the product. The price tag for the \"Betty Crocker\" brand is not a price tag for the product itself, but rather a price tag for the product. The price tag for the \"Betty Crocker\" brand is not a price tag for the product itself, but rather a price tag for the product. The price tag for the \"Betty Crocker\" brand is not a price tag for the product itself, but rather a price tag for the product. The price tag for the \"Betty Crocker\" brand is not a price tag for the product itself, but rather a price tag for the product. The price tag for the \"Betty Crocker\" brand is not a price tag for the product itself, but rather a price tag for the product. The price tag for the \"Betty Crocker\" brand is not a price tag for the product itself, but rather a price tag for the product. The price tag for the \"Betty Crocker\" brand is not a price tag for the product itself, but rather a price tag for the product. The price tag for the \"Betty Crocker\" brand is not a price tag for the product itself, but rather a price tag for the product. 
The price tag for the \"Betty Crocker\" brand is not a price tag for the product itself, but rather a price tag for the product. The price tag for the \"Betty Crocker\" brand is not a price tag for the product itself, but rather a price tag for the product. The price tag for the \"Betty Crocker\" brand is not a price tag for the product itself, but rather a price"
}
]
}

result_optimized.json Normal file

@@ -0,0 +1,117 @@
{
"system_info": {
"timestamp": "2026-02-26T08:07:17.320558",
"python_version": "3.12.12",
"python_full_version": "3.12.12 | packaged by Anaconda, Inc. | (main, Oct 21 2025, 20:16:04) [GCC 11.2.0]",
"torch_version": "2.10.0+cu128",
"cuda_available": true,
"cuda_version": "12.8",
"cudnn_version": "91002",
"gpu_count": 1,
"gpu_name": "NVIDIA GeForce RTX 4090",
"gpu_memory_gb": 23.52,
"gpu_compute_capability": "8.9",
"cpu_processor": "x86_64",
"cpu_count_physical": 16,
"cpu_count_logical": 16,
"cpu_freq_mhz": 3245.12,
"cpu_model": "AMD EPYC 9354 32-Core Processor",
"platform_system": "Linux",
"platform_release": "5.15.0-105-generic",
"platform_version": "#115-Ubuntu SMP Mon Apr 15 09:52:04 UTC 2024",
"platform_machine": "x86_64",
"platform_architecture": "64bit",
"ppu_available": false,
"ppu_info": {},
"gpu_driver_version": "580.95.05",
"gpu_memory_total": "24564 MiB",
"memory_total_gb": 54.92,
"memory_available_gb": 52.9
},
"performance": {
"avg_ttft_ms": 59.0,
"avg_throughput_tokens_per_sec": 52.11
},
"answers": [
{
"question_id": 34602,
"prediction": "Based on the text visible on the camera in the image, the brand is **Dakota**.\n\nThis is indicated by the blue label on the top left of the camera body which clearly reads \"DAKOTA DIGITAL\". The camera also has a \"Dakota\" logo in the bottom left corner."
},
{
"question_id": 34603,
"prediction": "Based on the image provided, the small white text at the bottom of the banner is a logo. It spells out the word **\"DRUPAL\"**.\n\nThis is the logo for the Drupal project, a popular open-source content management system. The text is written in a stylized, lowercase font, and it is positioned at the bottom of the banner, just above the main graphic."
},
{
"question_id": 34604,
"prediction": "Based on the label visible in the image, this is **Stone Brewing's \"Self Righteous\"**.\n\nHere are the details from the label:\n\n- **Brand:** Stone Brewing\n- **Product Name:** Self Righteous\n- **Style:** This is a **Pale Ale**. The \"Self Righteous\" name is a playful, self-referential title, and the beer is a pale ale, which is a style of beer that is typically light in color, with a balanced malt profile and a crisp, clean finish.\n\nThe label also shows a small \"Stone\" logo and a \"Believe\" seal, which are part of the brand's identity. The beer is a 5% ABV (alcohol by volume) pale ale."
},
{
"question_id": 34605,
"prediction": "Based on the image provided, the brand of liquor on the right is **The Morangie**.\n\nHere are the details from the label:\n- **Brand Name:** The Morangie\n- **Product Type:** Scotch Whisky\n- **Alcohol Content:** 40%\n- **Cask Type:** 10-year-old single malt\n- **Cask Number:** 11\n- **Cask Type:** 10-year-old single malt\n\nThe label also indicates that it is \"Extremely Rare\" and \"The Morangie\" is a brand of Scotch whisky."
},
{
"question_id": 34606,
"prediction": "Based on the information visible on the bottle of whisky on the right, we can determine its age.\n\nThe label on the bottle clearly states \"10 YEARS\" in large, bold text. This indicates that the whisky has been aged for 10 years.\n\nThe label also says \"SCOTCH WHISKY\" and \"EXTREMELY RARE\", which are details about the type of whisky and its rarity, but the age is explicitly stated.\n\nTherefore, the drink on the right has been aged for **10 years**."
},
{
"question_id": 34607,
"prediction": "Based on the image provided, the number on the player's jersey is **22**."
},
{
"question_id": 34608,
"prediction": "Based on the image provided, the time displayed on the watch is **10:10**.\n\nThis is indicated by the large number \"10\" on the watch face, which is positioned at the top of the dial. The smaller numbers on the dial, such as \"20\" and \"20\" on the lower right, are likely part of the chronograph function, but the main time reading is 10:10."
},
{
"question_id": 34609,
"prediction": "Based on the details visible in the image, the watch is an **Audemars Piguet**.\n\nHere are the key features that identify it:\n\n- **Branding:** The \"AP\" logo is clearly visible on the dial, which is the signature of the Audemars Piguet brand.\n- **Dial Design:** The watch has a distinctive three-subdial layout, which is characteristic of the **Audemars Piguet Royal Oak** line.\n- **Case and Strap:** The silver-colored case and the white rubber strap are consistent with the design of the **Audemars Piguet Royal Oak** model, which is known for its bold, angular design and the use of a white rubber strap.\n\nThe specific model appears to be the **Audemars Piguet Royal Oak** with a **white dial** and **three subdials**. The watch is also notable for its **\"Pigeon\"** design, which is a unique feature of this model."
},
{
"question_id": 34610,
"prediction": "Based on the image provided, the person at the center of the whiteboard is **Mitchell**.\n\nThis is indicated by the name written in the top right corner of the whiteboard. The name \"Mitchell\" is written in a large, bold font, and it is positioned directly above the drawing of a person with a ponytail, which is the central figure on the board."
},
{
"question_id": 34611,
"prediction": "I'm sorry, but I can't determine who the photographer is from this image. The image is a photograph of a group of men at a table, and there is no visible information about the photographer in the picture."
},
{
"question_id": 34612,
"prediction": "Based on the image provided, the switches are in the **\"OFF\"** position.\n\nHere's the reasoning:\n- The switches are a type of toggle switch, commonly used in electrical systems.\n- The switches are labeled \"OFF\" on the top of each switch.\n- The switches are in the \"OFF\" position, which is indicated by the position of the lever. The lever is in the \"off\" position, which is the standard position for a switch that is turned off.\n\nTherefore, the switches are **OFF**."
},
{
"question_id": 34613,
"prediction": "Based on the image provided, the candy bar you're asking about is not visible. The image shows a collection of items on a table, including:\n\n- A bottle of sparkling wine (likely a Sherry or similar).\n- A bottle of white wine.\n- A bottle of a dark-colored wine.\n- A bottle of a red wine.\n- A bottle of a sparkling wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle 
of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of a red wine.\n- A bottle of"
},
{
"question_id": 34614,
"prediction": "Based on the image provided, the light sign on the farthest right window reads:\n\n**All Coca-Cola**\n\nThis is a promotional sign that says \"All Coca-Cola\" and is located in the upper right corner of the window."
},
{
"question_id": 34615,
"prediction": "Based on the image provided, the price for a can of Skoal is **$3.30**.\n\nThis price is visible on the sign in the window of the convenience store. The sign also mentions \"All for $3.40\" for a 3-pack, which is a different price point."
},
{
"question_id": 34616,
"prediction": "Yes, this is Denny's. The sign features the iconic yellow and red logo with the name \"Denny's\" in a stylized font, which is the brand's signature look."
},
{
"question_id": 34617,
"prediction": "Based on the image provided, the letters on the sign are **yellow**."
},
{
"question_id": 34618,
"prediction": "Based on the image provided, the bottle with the red label is **Jim Beam**.\n\nThe label is clearly visible on the bottle, and the brand name \"JIM BEAM\" is printed in large, white letters on a dark background. The red label is a common feature of the Jim Beam brand, which is a well-known American bourbon whiskey."
},
{
"question_id": 34619,
"prediction": "Based on the image provided, there is a poster with the text \"2%\" visible on it.\n\nThe number \"2\" is displayed in a large, stylized font, and the percentage sign (%) is clearly visible next to it.\n\nTherefore, the percentage shown on the poster is **2%**."
},
{
"question_id": 34620,
"prediction": "Based on the image provided, we can see a store shelf with various items. The price tags are clearly visible, and the question asks how many items can be purchased for $5.\n\nLet's examine the items on the shelf:\n\n- The top row has a \"Pillsbury\" item with a price tag of $2.25.\n- The middle row has a \"Pillsbury\" item with a price tag of $2.50.\n- The bottom row has a \"Pillsbury\" item with a price tag of $2.60.\n- The top row also has a \"Pillsbury\" item with a price tag of $2.25.\n- The middle row has a \"Pillsbury\" item with a price tag of $2.50.\n- The bottom row has a \"Pillsbury\" item with a price tag of $2.60.\n\nHowever, the most prominent items are the \"Pillsbury\" items, and the price tags are $2.25, $2.50, and $2.60. There is no item priced at $5 on the shelf.\n\nTherefore, the number of items that can be bought for $5 is 0."
},
{
"question_id": 34621,
"prediction": "Based on the image provided, there are **2** price tags on the bottom shelf.\n\nHere is a breakdown of the visible price tags:\n\n- **On the left side of the bottom shelf:** There is one price tag with the price `$2.60`.\n- **On the right side of the bottom shelf:** There is another price tag with the price `$2.60`.\n\nTherefore, there are a total of **2** price tags on the bottom shelf."
}
]
}