.gitignore

This commit is contained in:
2026-02-27 01:41:52 +00:00
parent 2f4420bb2d
commit 9556909e78
4 changed files with 2 additions and 1185 deletions

.gitignore vendored

@@ -2,3 +2,4 @@ data/*
Qwen3-VL-2B-Instruct/*
__pycache__/
.vscode/
.ipynb_checkpoints/


@@ -1,414 +0,0 @@
# AICAS 2026 - Efficient VLM Inference and Optimization Track for AI Chips
## Table of Contents
- [Overview](#overview)
- [Code Structure](#code-structure)
- [Core Files](#core-files)
- [Quick Start](#quick-start)
- [Evaluation Metrics](#evaluation-metrics)
- [Competition Rules](#competition-rules)
- [Important Notes](#important-notes)
- [Submission Guide](#submission-guide)
## Overview
This competition focuses on optimizing the inference performance of a vision-language model (VLM). Participants modify the `VLMModel` class in `evaluation_wrapper.py` to improve time to first token (TTFT) and throughput while preserving accuracy.
## Code Structure
```
AICASGC/
├── benchmark.py              # Benchmark script
├── evaluation_wrapper.py     # Model wrapper (participants implement optimizations here)
├── requirements.txt          # Python dependencies
├── data/                     # Validation dataset
│   ├── data-*.arrow          # Dataset files
│   ├── dataset_info.json     # Dataset metadata
│   └── state.json            # Dataset state
├── Qwen3-VL-2B-Instruct/     # Model weights (participants download these themselves)
└── README.md / README_CN.md  # Documentation
```
## Core Files
- **`benchmark.py`** - self-test benchmark script (⚠️ **modifying is not recommended**)
- **`evaluation_wrapper.py`** - model wrapper; participants implement their optimizations here
- **`Qwen3-VL-2B-Instruct/`** - competition model weights (download them yourself; see the "Quick Start" section)
- **`data/`** - validation dataset
- **`requirements.txt`** - Python dependencies
## Quick Start
### 0. Download the Model (first run only)
The model files are large and must be downloaded separately. Create the model directory first, then download the model:
```bash
# Create the model directory
mkdir -p Qwen3-VL-2B-Instruct
# Install huggingface_hub (if not already installed)
pip install -U huggingface_hub
# Set a mirror endpoint (recommended for users in mainland China; speeds up downloads)
export HF_ENDPOINT=https://hf-mirror.com
# Download the model into the target directory
huggingface-cli download \
  --resume-download \
  Qwen/Qwen3-VL-2B-Instruct \
  --local-dir ./Qwen3-VL-2B-Instruct \
  --local-dir-use-symlinks False
```
**Notes:**
- The model is roughly 4-5 GB, so the download may take a while
- If the download is interrupted, rerun the command; it resumes automatically (`--resume-download`)
- After the download completes, the `Qwen3-VL-2B-Instruct/` folder contains all model files
- Make sure you have enough disk space (at least 5 GB)
### 1. Install Dependencies
```bash
pip install -r requirements.txt
```
### 2. Run the Baseline Test
```bash
python benchmark.py \
  --model-path ./Qwen3-VL-2B-Instruct \
  --dataset-path ./data \
  --output result.json \
  --num-samples 100
```
### 3. Implement Your Optimizations
Edit the `VLMModel` class in `evaluation_wrapper.py`. The optimizations follow a **modular design**: each optimization direction has its own method.
#### 3.1 Explore the Model Structure (optional)
Before optimizing, you can explore the model structure to understand the optimization targets:
```python
class VLMModel:
    def __init__(self, model_path: str, device: str = "cuda:0"):
        # ... load the model ...
        # Optional: explore the model structure
        self._explore_model_structure()  # prints model structure information
```
#### 3.2 Enable Optimization Methods
In `__init__`, enable or disable individual optimizations by commenting/uncommenting them:
```python
class VLMModel:
    def __init__(self, model_path: str, device: str = "cuda:0"):
        # ... load the model ...
        # ================================================================
        # Participant optimization area - enable/disable optimization methods
        # ================================================================
        # 1. Vision Encoder acceleration (optimize high-resolution image processing)
        # self._optimize_vision_encoder()
        # 2. KV Cache management (reduce memory fragmentation during generation)
        # self._optimize_kv_cache()
        # 3. Cross-modal fusion layer optimization (optimize the cross-modal connector)
        # self._optimize_cross_modal_connector()
        # 4. Flash Attention optimization
        # self._enable_flash_attention()
        # 5. Quantization
        # self._apply_quantization()
```
#### 3.3 Implement the Optimization Code
Implement your optimization logic inside the individual methods. For example, to optimize the Vision Encoder:
```python
def _optimize_vision_encoder(self):
    """Find this method in evaluation_wrapper.py and implement your optimization."""
    # Example: replace the attention operator
    # from your_optimization import optimized_attention
    # if hasattr(self._model, 'vision_model'):
    #     for layer in self._model.vision_model.encoder.layers:
    #         layer.self_attn.forward = optimized_attention
    # TODO: implement your Vision Encoder optimization
    pass
```
### 4. Test Your Optimized Model
```bash
python benchmark.py \
  --model-path ./Qwen3-VL-2B-Instruct \
  --dataset-path ./data \
  --output result_optimized.json \
  --num-samples 100
```
### 5. Generate the Full Results for Submission
```bash
python benchmark.py \
  --model-path ./Qwen3-VL-2B-Instruct \
  --dataset-path ./data \
  --output result.json \
  --num-samples 5000
```
## Evaluation Metrics
The final score is computed as:
```
Final Score = 0.4 × Accuracy + 0.3 × TTFT Improvement + 0.3 × Throughput Improvement
```
### Metric Details
- **TTFT (Time To First Token)**: time from input preparation to the first generated token, in milliseconds
  - Includes image encoding, text encoding, cross-modal interaction, the prefill phase, and generation of the first token
  - Baseline: ~80 ms
  - Improvement = (Baseline - your TTFT) / Baseline
- **Throughput**: end-to-end token generation rate (tokens/sec)
  - Baseline: ~55 tokens/sec
  - Improvement = (your throughput - Baseline) / Baseline
- **Accuracy**: VQA accuracy on the validation set (5000 samples)
  - Soft matching against multiple reference answers is supported
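Plugging the baseline numbers quoted above into the formula gives a quick sanity check of the scoring. A minimal sketch (the 80 ms / 55 tokens/sec defaults are the approximate baselines from this README; the example inputs are made up):

```python
def final_score(accuracy, ttft_ms, throughput_tps,
                baseline_ttft_ms=80.0, baseline_tps=55.0):
    """Combine the three metrics with the 0.4 / 0.3 / 0.3 weights."""
    ttft_gain = (baseline_ttft_ms - ttft_ms) / baseline_ttft_ms
    tp_gain = (throughput_tps - baseline_tps) / baseline_tps
    return 0.4 * accuracy + 0.3 * ttft_gain + 0.3 * tp_gain

# e.g. 70% accuracy, 60 ms TTFT, 70 tokens/sec
print(round(final_score(0.70, 60.0, 70.0), 4))  # 0.4368
```

Note that matching the baselines exactly yields zero improvement terms, so the score is then 0.4 × accuracy.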
## Competition Rules
### Key Rules
1. **Do not modify `benchmark.py`**
   - This benchmark script is for self-testing only
   - Final evaluation uses an independent official benchmark system
   - Modifying this file may make your local results diverge from the final evaluation
2. **Only modify `evaluation_wrapper.py`**
3. **Keep the required attributes**
   - The `VLMModel` class must expose the `processor`, `model`, and `device` attributes
   - The benchmark uses these attributes to access the model and processor
   - The `generate()` method is optional and mainly for debugging
4. **Prohibited behavior**
   - No hard-coded answers
   - No modifications to the dataset
   - No external APIs or services
   - All optimizations must be local and self-contained
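A quick way to confirm your wrapper still satisfies rule 3 is an attribute smoke test before submitting. A pure-Python sketch (the `FakeWrapper` stand-in below is illustrative, not the real `VLMModel`):

```python
REQUIRED_ATTRS = ("processor", "model", "device")

def check_interface(obj):
    """Raise TypeError if a required benchmark attribute is missing."""
    missing = [a for a in REQUIRED_ATTRS if not hasattr(obj, a)]
    if missing:
        raise TypeError(f"VLMModel wrapper is missing: {missing}")
    return True

# Illustrative stand-in for the real VLMModel instance
class FakeWrapper:
    processor = object()
    model = object()
    device = "cuda:0"

check_interface(FakeWrapper())  # passes silently when all attributes exist
```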
### Optimization Directions
- Operator replacement and kernel optimization: rewrite or replace standard operators (e.g. Attention, LayerNorm, Conv2d) with Triton, CUDA C++, etc.
- Memory and cache optimization: optimize KV Cache memory layout, reduce memory fragmentation, improve GPU memory access patterns
- Compilation and graph optimization: use torch.compile for computation-graph optimization and custom kernel scheduling
- Attention mechanism optimization: implement Flash Attention, memory-efficient attention, or sparse attention
- Generation-process optimization: tune decoding strategy, cache management, and generation configuration
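Most of these directions boil down to the same mechanism: binding a faster implementation over a module's `forward` (the monkey-patch pattern this repo's `evaluation_wrapper.py` relies on). A minimal pure-Python sketch of that pattern, with illustrative class and function names:

```python
class VisionAttention:
    """Stand-in for a model submodule whose forward we want to replace."""
    def forward(self, x):
        return [v * 2 for v in x]  # reference implementation

def fused_forward(self, x):
    # Hypothetical "optimized" drop-in: must return identical results
    return [v + v for v in x]

layer = VisionAttention()
layer.forward = fused_forward.__get__(layer)  # bind to this instance (monkey patch)
print(layer.forward([1, 2, 3]))  # [2, 4, 6] via the patched code path
```

Patching the instance (or, as in `_optimize_cross_modal_connector`, the class) swaps the code path without touching weights, which keeps the optimization within the allowed rules.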
**Not allowed:**
- External services: calling external APIs, cloud services, or anything requiring network access is forbidden
- Data or answer cheating: training on the test data, precomputing answers, or hard-coding outputs is forbidden
- Model replacement or tampering: the focus should be operator-level optimization; do not train the model on extra datasets, change the model architecture, or edit weight values directly
- Overfitting to the benchmark: conditional branches or special handling targeting specific evaluation samples are forbidden
- Black-box tooling: submissions that only change configuration files without substantive code contributions will not be recognized
- Environment manipulation: interfering with fair evaluation by modifying the system environment, locking GPU frequencies, etc., is forbidden
## Important Notes
### Sample Selection
- The provided `benchmark.py` uses a **fixed order** (the first N samples starting from index 0)
- Running with `--num-samples 100` evaluates samples 0-99
- This makes local self-testing reproducible
- **Note**: the official evaluation system used by the competition committee may use a different sampling strategy (including random sampling) for final validation
### Hardware Information
The benchmark automatically records detailed hardware information:
- Python, PyTorch, and CUDA versions
- GPU name, memory, and compute capability
- CPU model, core count, and frequency
- System information (OS, kernel, architecture)
- PPU information (if available)
This information is stored in the `system_info` field of `result.json` for statistical analysis.
### Performance Measurement
- **Warmup**: 10 samples are used to warm up the GPU before actual measurement
- **TTFT measurement**: time from input preparation to the first token (including all preprocessing)
- **Throughput measurement**: end-to-end time to generate 128 tokens
- **State isolation**: the GPU cache is cleared between measurements to ensure fairness
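The warmup-then-measure pattern described above can be sketched in pure Python (in the real benchmark the timed call is `model.generate`, and a `torch.cuda.synchronize()` should precede each timer read so queued GPU work is included; the helper name here is illustrative):

```python
import time

def timed_avg(fn, warmup=10, runs=100):
    """Run fn a few times to warm caches/clocks, then return avg seconds per call."""
    for _ in range(warmup):
        fn()  # warmup runs are not measured
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs

avg_s = timed_avg(lambda: sum(range(10_000)))
print(f"{avg_s * 1e3:.4f} ms per call")
```

Without the warmup loop, the first iterations would include one-off costs (allocator growth, kernel compilation) and bias the average upward.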
### Random Seed
- The `--random-seed` argument only affects PyTorch's random number generators
- It does **not** affect the sample selection order (which is always fixed)
- It exists to make any randomness in model inference reproducible
### Output Format
The `result.json` file contains:
```json
{
"system_info": {
"timestamp": "...",
"python_version": "...",
"torch_version": "...",
"cuda_version": "...",
"gpu_name": "...",
...
},
"performance": {
"avg_ttft_ms": 90.55,
"avg_throughput_tokens_per_sec": 57.77
},
"answers": [
{
"question_id": 34602,
"prediction": "你的答案文本"
},
...
]
}
```
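Before uploading, it is worth sanity-checking that `result.json` carries the required fields. A small validator sketch (field names taken from the sample above; the helper name is illustrative):

```python
import json

REQUIRED_PERF_KEYS = {"avg_ttft_ms", "avg_throughput_tokens_per_sec"}

def validate_result(path):
    """Raise AssertionError if result.json is missing required fields."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    assert "performance" in data and "answers" in data, "missing top-level keys"
    assert REQUIRED_PERF_KEYS <= data["performance"].keys(), "missing performance metrics"
    for ans in data["answers"]:
        assert "question_id" in ans and "prediction" in ans, "malformed answer entry"
    return len(data["answers"])  # number of predictions in the file
```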
## Submission Guide
### Required Files for the Preliminary Round
1. **`result.json`** - generated by running `benchmark.py`
   - Contains predictions for all samples
   - Must include valid `performance` metrics
   - **Important**: the `result.json` uploaded to the Tianchi platform is for reference only. Final scores are computed by the competition committee on standardized hardware with the official evaluation system.
2. **Your optimization code** - `evaluation_wrapper.py` containing your optimized `VLMModel` class
3. **Docker image** - a container with your optimized environment
### Evaluation Process
1. **Self-test**: test your optimizations locally with the provided `benchmark.py`
2. **Submit**: upload your `result.json` to the Tianchi platform (for reference only)
3. **Official evaluation**: the competition committee evaluates your code using:
   - Your submitted Docker image
   - A standardized hardware environment
   - The official evaluation code
   - The full validation set, with random sampling for verification
4. **Final ranking**: based on the final score computed by the official evaluation system
## Good Luck!
We hope you will focus on operator-level optimization, kernel replacement, and efficient memory management. Remember: accuracy matters as much as speed. Good luck!
Qwen3VLForConditionalGeneration(
(model): Qwen3VLModel(
(visual): Qwen3VLVisionModel(
(patch_embed): Qwen3VLVisionPatchEmbed(
(proj): Conv3d(3, 1024, kernel_size=(2, 16, 16), stride=(2, 16, 16))
)
(pos_embed): Embedding(2304, 1024)
(rotary_pos_emb): Qwen3VLVisionRotaryEmbedding()
(blocks): ModuleList(
(0-23): 24 x Qwen3VLVisionBlock(
(norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
(norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
(attn): Qwen3VLVisionAttention(
(qkv): Linear(in_features=1024, out_features=3072, bias=True)
(proj): Linear(in_features=1024, out_features=1024, bias=True)
)
(mlp): Qwen3VLVisionMLP(
(linear_fc1): Linear(in_features=1024, out_features=4096, bias=True)
(linear_fc2): Linear(in_features=4096, out_features=1024, bias=True)
(act_fn): GELUTanh()
)
)
)
(merger): Qwen3VLVisionPatchMerger(
(norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
(linear_fc1): Linear(in_features=4096, out_features=4096, bias=True)
(act_fn): GELU(approximate='none')
(linear_fc2): Linear(in_features=4096, out_features=2048, bias=True)
)
(deepstack_merger_list): ModuleList(
(0-2): 3 x Qwen3VLVisionPatchMerger(
(norm): LayerNorm((4096,), eps=1e-06, elementwise_affine=True)
(linear_fc1): Linear(in_features=4096, out_features=4096, bias=True)
(act_fn): GELU(approximate='none')
(linear_fc2): Linear(in_features=4096, out_features=2048, bias=True)
)
)
)
(language_model): Qwen3VLTextModel(
(embed_tokens): Embedding(151936, 2048)
(layers): ModuleList(
(0-27): 28 x Qwen3VLTextDecoderLayer(
(self_attn): Qwen3VLTextAttention(
(q_proj): Linear(in_features=2048, out_features=2048, bias=False)
(k_proj): Linear(in_features=2048, out_features=1024, bias=False)
(v_proj): Linear(in_features=2048, out_features=1024, bias=False)
(o_proj): Linear(in_features=2048, out_features=2048, bias=False)
(q_norm): Qwen3VLTextRMSNorm((128,), eps=1e-06)
(k_norm): Qwen3VLTextRMSNorm((128,), eps=1e-06)
)
(mlp): Qwen3VLTextMLP(
(gate_proj): Linear(in_features=2048, out_features=6144, bias=False)
(up_proj): Linear(in_features=2048, out_features=6144, bias=False)
(down_proj): Linear(in_features=6144, out_features=2048, bias=False)
(act_fn): SiLUActivation()
)
(input_layernorm): Qwen3VLTextRMSNorm((2048,), eps=1e-06)
(post_attention_layernorm): Qwen3VLTextRMSNorm((2048,), eps=1e-06)
)
)
(norm): Qwen3VLTextRMSNorm((2048,), eps=1e-06)
(rotary_emb): Qwen3VLTextRotaryEmbedding()
)
)
(lm_head): Linear(in_features=2048, out_features=151936, bias=False)
)


@@ -1,406 +0,0 @@
"""
AICAS 2026 - Participant Core Modification File
Participants should modify the VLMModel class to implement optimizations.
Note:
- Benchmark directly calls self.model.generate() for performance testing.
- Your optimizations should modify self.model or its operators in __init__ via Monkey Patch.
- The generate() method is optional and mainly for debugging.
"""
from typing import Dict
try:
from PIL import Image
except ImportError:
# Fallback stub so the `Image.Image` type hint still resolves without PIL
class Image:
Image = object
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
class VLMModel:
"""
Participant optimization class - modify this to implement optimizations.
Optimization Architecture:
- Split optimizations into separate methods for isolation and testing
- Enable/disable each optimization independently in __init__
- Each optimization method can be tested individually
Important Notes:
1. Benchmark directly calls self.model.generate() for performance testing.
2. Your optimizations should modify self.model or its operators via Monkey Patch.
3. All optimizations are applied in __init__ by calling optimization methods.
"""
def __init__(self, model_path: str, device: str = "cuda:0"):
"""
Initialize model and apply optimizations.
Args:
model_path: Qwen3-VL-2B-Instruct model path
device: CUDA device, e.g., "cuda:0"
"""
self._device = device
self.model_path = model_path
# Load processor
print(f"[VLMModel] Loading processor from {model_path}...")
self._processor = AutoProcessor.from_pretrained(model_path)
# Load model
print(f"[VLMModel] Loading model with FP16...")
self._model = AutoModelForImageTextToText.from_pretrained(
model_path,
torch_dtype=torch.float16,
device_map=device
)
self._model.eval()
# Track applied optimizations
self._optimizations_applied = []
# ================================================================
# Participant Optimization Area - Enable/disable optimizations here
# Uncomment the optimization methods you want to apply
# ================================================================
# 1. Vision Encoder Acceleration
# self._optimize_vision_encoder()
# 2. KV Cache Management
# self._optimize_kv_cache()
# 3. Cross-modal Connector Optimization
# self._optimize_cross_modal_connector()
# 4. Flash Attention Optimization
# self._enable_flash_attention()
# 5. Quantization
# self._apply_quantization()
# Optional: Explore model structure before optimization
# self._explore_model_structure()
# ================================================================
print(f"[VLMModel] Model loaded successfully on {device}")
if self._optimizations_applied:
print(f"[VLMModel] Applied optimizations: {', '.join(self._optimizations_applied)}")
# ================================================================
# Optimization Methods - Implement your optimizations here
# ================================================================
def _explore_model_structure(self):
"""
Helper method to explore model structure.
Use this to understand the model architecture before implementing optimizations.
This helps identify where to apply monkey patches.
"""
print("=" * 60)
print("Model Structure Exploration")
print("=" * 60)
# Explore vision model structure
if hasattr(self._model, 'vision_model'):
print(f"Vision Model: {type(self._model.vision_model)}")
if hasattr(self._model.vision_model, 'encoder'):
if hasattr(self._model.vision_model.encoder, 'layers'):
print(f" Vision Encoder Layers: {len(self._model.vision_model.encoder.layers)}")
# Show first layer structure
if len(self._model.vision_model.encoder.layers) > 0:
print(f" First Layer Type: {type(self._model.vision_model.encoder.layers[0])}")
else:
print("Vision Model: Not found (model structure may differ)")
# Explore language model structure
if hasattr(self._model, 'model'):
print(f"Language Model: {type(self._model.model)}")
if hasattr(self._model.model, 'layers'):
print(f" Language Model Layers: {len(self._model.model.layers)}")
else:
print("Language Model: Not found (model structure may differ)")
# Explore cross-modal components
cross_modal_attrs = ['connector', 'cross_attn', 'cross_attention', 'proj', 'projector']
found_components = []
for attr in cross_modal_attrs:
if hasattr(self._model, attr):
found_components.append(attr)
if found_components:
print(f"Cross-modal Components: {', '.join(found_components)}")
else:
print("Cross-modal Components: Explore manually (structure may vary)")
print("=" * 60)
print("Tip: Use print(self._model) to see full model structure")
print("=" * 60)
def _optimize_vision_encoder(self):
"""
Optimize Vision Encoder for high-resolution image inputs.
Optimization Directions:
1. Patch embedding convolution optimization
2. Vision Transformer attention mechanism optimization
3. Layer normalization optimization
4. Memory-efficient image processing
Implementation Steps:
1. Inspect model structure: call self._explore_model_structure()
2. Identify bottlenecks using profiling tools (PyTorch Profiler, nsys, etc.)
3. Implement optimized operators (Triton/CUDA kernels)
4. Replace original operators via monkey patch
Target Components:
- self._model.vision_model (if exists)
- Vision encoder layers and attention mechanisms
- Convolution operations in patch embedding
"""
# TODO: Implement your Vision Encoder optimization here
#
# Example workflow:
# 1. from your_optimization import optimized_attention, optimized_conv
# 2. Inspect: print(self._model.vision_model) to find target layers
# 3. Replace: layer.self_attn.forward = optimized_attention
# 4. Test: Run benchmark to verify improvement
if 'vision_encoder' not in self._optimizations_applied:
self._optimizations_applied.append('vision_encoder')
def _optimize_kv_cache(self):
"""
Optimize KV Cache management to reduce memory fragmentation.
Optimization Directions:
1. Memory layout optimization (contiguous memory allocation)
2. Fragmentation-free allocation strategies
3. Efficient cache reuse patterns
4. Dynamic cache sizing
Implementation Steps:
1. Understand current KV cache implementation in model layers
2. Design memory-efficient cache allocation strategy
3. Implement custom KV cache allocator if needed
4. Apply optimizations via monkey patch or config modification
Target Components:
- self._model.config (cache configuration)
- Attention layers (KV cache allocation)
- Generation loop (cache management)
"""
# Enable KV Cache first
self._model.config.use_cache = True
if hasattr(self._model.config, 'pad_token_id'):
if self._model.config.pad_token_id is None:
self._model.config.pad_token_id = self._model.config.eos_token_id
# TODO: Implement advanced KV Cache optimizations here
#
# Example workflow:
# 1. from your_optimization import FragmentationFreeKVCache
# 2. for layer in self._model.model.layers:
# 3. layer.attention.custom_kv_cache = FragmentationFreeKVCache()
# 4. Test: Monitor memory usage and generation speed
if 'kv_cache' not in self._optimizations_applied:
self._optimizations_applied.append('kv_cache')
def _optimize_cross_modal_connector(self):
"""
Optimize Cross-modal Connector computation efficiency.
Optimization Directions:
1. Cross-attention mechanism optimization
2. Vision-to-language projection optimization
3. Multi-modal fusion layer efficiency
4. Feature alignment and transformation optimization
Implementation Steps:
1. Identify cross-modal components using self._explore_model_structure()
2. Profile cross-modal operations to find bottlenecks
3. Implement optimized cross-attention or projection kernels
4. Replace original operations via monkey patch
Note: Qwen3-VL's cross-modal structure may vary.
Use model exploration to identify actual component names and locations.
"""
# TODO: Implement your Cross-modal Connector optimization here
#
# Example workflow:
# 1. Explore: self._explore_model_structure() to find connector components
# 2. from your_optimization import optimized_cross_attention
# 3. Identify: Inspect model to find cross-attention layers
# 4. Replace: connector.cross_attention.forward = optimized_cross_attention
# 5. Test: Verify accuracy and performance improvements
# Apply the patched forward from my_patch (prefill-time visual token pruning)
from my_patch import patch_forward
self._model.model.__class__.forward = patch_forward
if 'cross_modal' not in self._optimizations_applied:
self._optimizations_applied.append('cross_modal')
def _enable_flash_attention(self):
"""
Enable or implement Flash Attention optimization.
Implementation Approaches:
Approach 1: Enable PyTorch's Built-in Flash Attention (Simple)
- Uses torch.backends.cuda.enable_flash_sdp(True)
- Easy to enable but limited customization
- May not work for all attention patterns in Qwen3-VL
Approach 2: Implement Custom Flash Attention (Advanced, Recommended)
- Write custom Triton/CUDA kernels for attention computation
- Replace torch.nn.functional.scaled_dot_product_attention
- Full control over attention computation and memory layout
- Better performance potential but requires more implementation effort
Recommended: Implement Approach 2 for better performance gains.
Use profiling to identify which attention operations benefit most from optimization.
"""
# TODO: Choose and implement your Flash Attention approach
# Approach 1: Simple (enable PyTorch built-in)
# torch.backends.cuda.enable_flash_sdp(True)
# Approach 2: Advanced (custom implementation - recommended)
# from your_optimization import custom_flash_attention
# torch.nn.functional.scaled_dot_product_attention = custom_flash_attention
#
# Or replace at layer level:
# for layer in self._model.model.layers:
# layer.self_attn.forward = custom_attention_with_flash
if 'flash_attention' not in self._optimizations_applied:
self._optimizations_applied.append('flash_attention')
def _apply_quantization(self):
"""
Apply quantization to reduce model size and speed up inference.
Optimization Directions:
1. INT8 quantization (8-bit integer)
2. FP8 quantization (8-bit floating point)
3. Mixed precision quantization
4. Dynamic vs static quantization
Implementation Steps:
1. Choose quantization strategy based on accuracy/performance trade-off
2. Use quantization libraries (BitsAndBytes, TensorRT, etc.)
3. Calibrate quantized model on validation data
4. Verify accuracy preservation
Note: Quantization may require reloading the model with quantization config.
Consider applying quantization before other optimizations if model reload is needed.
"""
# TODO: Implement your quantization here
#
# Example workflow:
# 1. from transformers import BitsAndBytesConfig
# 2. quantization_config = BitsAndBytesConfig(load_in_8bit=True)
# 3. Note: May need to reload model with quantization config
# 4. Test: Verify accuracy and performance improvements
if 'quantization' not in self._optimizations_applied:
self._optimizations_applied.append('quantization')
# Required properties for benchmark
@property
def processor(self):
"""
Required by benchmark for input processing.
Benchmark uses this to prepare inputs with unified tokenizer.
"""
return self._processor
@property
def model(self):
"""
Required by benchmark for direct model.generate() calls.
Benchmark directly calls self.model.generate() for performance testing.
Your optimizations should modify this model object or its operators.
"""
return self._model
@property
def device(self):
"""
Required by benchmark for device information.
"""
return self._device
def generate(
self,
image: Image.Image,
question: str,
max_new_tokens: int = 128
) -> Dict:
"""
Generate answer (optional method, mainly for debugging).
Note: Benchmark uses self.model.generate() directly for performance testing.
This method is provided for convenience and debugging purposes.
Args:
image: PIL Image object
question: Question text
max_new_tokens: Maximum tokens to generate
Returns:
Dict: {
"text": str, # Generated text answer
"token_count": int # Generated token count
}
"""
# Build Qwen3-VL message format
messages = [{
"role": "user",
"content": [
{"type": "image", "image": image},
{"type": "text", "text": question}
]
}]
# Process inputs
inputs = self._processor.apply_chat_template(
messages,
tokenize=True,
add_generation_prompt=True,
return_dict=True,
return_tensors="pt"
).to(self._device)
# Generate
with torch.no_grad():
output_ids = self._model.generate(
**inputs,
max_new_tokens=max_new_tokens,
do_sample=False,
temperature=0.0,
top_p=1.0,
use_cache=True
)
# Extract generated tokens (remove input part)
input_len = inputs.input_ids.shape[1]
generated_ids = output_ids[0][input_len:]
# Decode
text = self._processor.tokenizer.decode(
generated_ids,
skip_special_tokens=True,
clean_up_tokenization_spaces=False
)
return {
"text": text,
"token_count": len(generated_ids)
}


@@ -1,364 +0,0 @@
import numpy as np
import torch
from transformers.models.qwen3_vl.processing_qwen3_vl import Qwen3VLProcessor, Qwen3VLProcessorKwargs
from transformers.models.qwen3_vl.modeling_qwen3_vl import Qwen3VLModelOutputWithPast, BaseModelOutputWithDeepstackFeatures
from transformers.feature_extraction_utils import BatchFeature
from transformers.image_utils import ImageInput
from transformers.processing_utils import Unpack
from transformers.tokenization_utils_base import PreTokenizedInput, TextInput
from transformers.utils import logging, TransformersKwargs, can_return_tuple
from transformers.video_utils import VideoInput
from transformers.cache_utils import Cache
logger = logging.get_logger(__name__)
class myQwen3VLProcessor(Qwen3VLProcessor):
def __init__(self, image_processor=None, tokenizer=None, video_processor=None, chat_template=None, **kwargs):
super().__init__(image_processor, tokenizer, video_processor, chat_template, **kwargs)
def __call__(
self,
images: ImageInput = None,
text: TextInput | PreTokenizedInput | list[TextInput] | list[PreTokenizedInput] = None,
videos: VideoInput = None,
**kwargs: Unpack[Qwen3VLProcessorKwargs],
) -> BatchFeature:
r"""
Returns:
[`BatchFeature`]: A [`BatchFeature`] with the following fields:
- **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
- **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
`return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
`None`).
- **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
- **pixel_values_videos** -- Pixel values of videos to be fed to a model. Returned when `videos` is not `None`.
- **image_grid_thw** -- List of image 3D grid in LLM. Returned when `images` is not `None`.
- **video_grid_thw** -- List of video 3D grid in LLM. Returned when `videos` is not `None`.
"""
output_kwargs = self._merge_kwargs(
Qwen3VLProcessorKwargs,
tokenizer_init_kwargs=self.tokenizer.init_kwargs,
**kwargs,
)
if images is not None:
image_inputs = self.image_processor(images=images, **output_kwargs["images_kwargs"])
image_grid_thw = image_inputs["image_grid_thw"]
else:
image_inputs = {}
image_grid_thw = None
if videos is not None:
videos_inputs = self.video_processor(videos=videos, **output_kwargs["videos_kwargs"])
video_grid_thw = videos_inputs["video_grid_thw"]
# If user has not requested video metadata, pop it
if not kwargs.get("return_metadata"):
video_metadata = videos_inputs.pop("video_metadata")
else:
video_metadata = videos_inputs["video_metadata"]
else:
videos_inputs = {}
video_grid_thw = None
if not isinstance(text, list):
text = [text]
text = text.copy() # below lines change text in-place
if image_grid_thw is not None:
merge_length = self.image_processor.merge_size**2
index = 0
for i in range(len(text)):
while self.image_token in text[i]:
# num_image_tokens = image_grid_thw[index].prod() // merge_length  # original per-grid computation
num_image_tokens = 40  # fixed per-image token budget (participant optimization)
text[i] = text[i].replace(self.image_token, "<|placeholder|>" * num_image_tokens, 1)
index += 1
text[i] = text[i].replace("<|placeholder|>", self.image_token)
if video_grid_thw is not None:
merge_length = self.video_processor.merge_size**2
index = 0
for i in range(len(text)):
while self.video_token in text[i]:
metadata = video_metadata[index]
if metadata.fps is None:
logger.warning_once(
"Qwen3VL requires frame timestamps to construct prompts, but the `fps` of the input video could not be inferred. "
"Probably `video_metadata` was missing from inputs and you passed pre-sampled frames. "
"Defaulting to `fps=24`. Please provide `video_metadata` for more accurate results."
)
metadata.fps = 24 if metadata.fps is None else metadata.fps
# if timestamps are not provided, calculate them
curr_timestamp = self._calculate_timestamps(
metadata.frames_indices,
metadata.fps,
self.video_processor.temporal_patch_size,
)
video_placeholder = ""
frame_seqlen = video_grid_thw[index][1:].prod() // merge_length
for frame_idx in range(video_grid_thw[index][0]):
curr_time = curr_timestamp[frame_idx]
video_placeholder += f"<{curr_time:.1f} seconds>"
video_placeholder += (
self.vision_start_token + "<|placeholder|>" * frame_seqlen + self.vision_end_token
)
if f"{self.vision_start_token}{self.video_token}{self.vision_end_token}" in text[i]:
text[i] = text[i].replace(
f"{self.vision_start_token}{self.video_token}{self.vision_end_token}", video_placeholder, 1
)
else:
# vllm may input video token directly
text[i] = text[i].replace(self.video_token, video_placeholder, 1)
index += 1
text[i] = text[i].replace("<|placeholder|>", self.video_token)
return_tensors = output_kwargs["text_kwargs"].pop("return_tensors", None)
return_mm_token_type_ids = output_kwargs["text_kwargs"].pop("return_mm_token_type_ids", None)
text_inputs = self.tokenizer(text, **output_kwargs["text_kwargs"])
self._check_special_mm_tokens(text, text_inputs, modalities=["image", "video"])
if return_mm_token_type_ids:
array_ids = np.array(text_inputs["input_ids"])
mm_token_type_ids = np.zeros_like(text_inputs["input_ids"])
mm_token_type_ids[array_ids == self.image_token_id] = 1
text_inputs["mm_token_type_ids"] = mm_token_type_ids.tolist()
return BatchFeature(data={**text_inputs, **image_inputs, **videos_inputs}, tensor_type=return_tensors)
def _sample_indices_uniform(idx: torch.LongTensor, keep_ratio: float, min_keep: int = 0):
"""
idx: 1D indices in original sequence (sorted)
keep_ratio: 0~1, keep uniformly spaced
"""
n = idx.numel()
if n == 0:
return idx
k = max(min_keep, int(torch.ceil(torch.tensor(n * keep_ratio)).item()))
k = min(k, n)
if k == n:
return idx
# uniform pick: linspace over [0, n-1]
pos = torch.linspace(0, n - 1, steps=k, device=idx.device)
pos = pos.round().long().clamp(0, n - 1)
return idx[pos]
def sparse_keep_and_gather(
inputs_embeds, # (B,S,D)
attention_mask, # (B,S)
position_ids, # (4,B,S)
visual_pos_masks, # (B,S) bool
deepstack_visual_embeds,# list[tensor] each (Nvis_total,D) OR None
keep_ratio: float = 0.25,
min_keep_per_vis: int = 0,
max_len: int | None = None,
):
"""
Sparse keep: retain all text tokens; uniformly subsample visual tokens by keep_ratio.
Optional max_len: if the result is still too long, trim more visual tokens (text is never dropped).
"""
device = inputs_embeds.device
B, S, D = inputs_embeds.shape
eff = attention_mask.bool()
keep_mask_token = torch.zeros((B, S), dtype=torch.bool, device=device)
for b in range(B):
eff_idx = eff[b].nonzero(as_tuple=False).squeeze(1) # valid (unpadded) tokens
if eff_idx.numel() == 0:
continue
vis_eff = visual_pos_masks[b, eff_idx] # which valid tokens are visual
text_idx = eff_idx[~vis_eff] # text tokens: keep all
vis_idx = eff_idx[vis_eff] # visual tokens: to be subsampled
# uniform sparse sampling of visual tokens (this is the step that drops the middle)
kept_vis = _sample_indices_uniform(vis_idx, keep_ratio, min_keep=min_keep_per_vis)
chosen = torch.cat([text_idx, kept_vis], dim=0)
chosen, _ = torch.sort(chosen) # restore original order
# if a max length is enforced, keep trimming visual tokens first (never text)
if max_len is not None and chosen.numel() > max_len:
# visual positions that are currently kept
chosen_vis = chosen[visual_pos_masks[b, chosen]]
chosen_txt = chosen[~visual_pos_masks[b, chosen]]
# if the text alone already exceeds max_len, truncate text (rare)
if chosen_txt.numel() >= max_len:
chosen = chosen_txt[:max_len]
else:
budget = max_len - chosen_txt.numel()
# uniformly trim the visual tokens down to the remaining budget
chosen_vis = _sample_indices_uniform(chosen_vis, budget / max(chosen_vis.numel(), 1))
chosen = torch.cat([chosen_txt, chosen_vis], dim=0)
chosen, _ = torch.sort(chosen)
keep_mask_token[b, chosen] = True
# ===== gather + pad to the longest kept sequence in the batch =====
keep_lens = keep_mask_token.sum(dim=1).tolist()
max_keep = max(keep_lens) if keep_lens else 0
new_inputs = inputs_embeds.new_zeros((B, max_keep, D))
new_attn = attention_mask.new_zeros((B, max_keep))
new_pos = position_ids.new_zeros((4, B, max_keep))
new_vis = visual_pos_masks.new_zeros((B, max_keep), dtype=torch.bool)
for b in range(B):
idx = keep_mask_token[b].nonzero(as_tuple=False).squeeze(1)
L = idx.numel()
if L == 0:
continue
new_inputs[b, :L, :] = inputs_embeds[b, idx, :]
new_attn[b, :L] = attention_mask[b, idx]
new_pos[:, b, :L] = position_ids[:, b, idx]
new_vis[b, :L] = visual_pos_masks[b, idx]
# ===== trim the deepstack embeddings in sync (crucial!) =====
new_deepstack = None
if deepstack_visual_embeds is not None:
# deepstack order = the order of True positions in the flattened visual_pos_masks,
# so index with keep_mask_token at those positions
keep_vis_flat = keep_mask_token[visual_pos_masks] # 1D bool, length = Nvis_total
new_deepstack = [x[keep_vis_flat] for x in deepstack_visual_embeds]
return new_inputs, new_attn, new_pos, new_vis, new_deepstack
@can_return_tuple
def patch_forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: torch.Tensor | None = None,
position_ids: torch.LongTensor | None = None,
past_key_values: Cache | None = None,
inputs_embeds: torch.FloatTensor | None = None,
pixel_values: torch.Tensor | None = None,
pixel_values_videos: torch.FloatTensor | None = None,
image_grid_thw: torch.LongTensor | None = None,
video_grid_thw: torch.LongTensor | None = None,
cache_position: torch.LongTensor | None = None,
**kwargs: Unpack[TransformersKwargs],
) -> tuple | Qwen3VLModelOutputWithPast:
r"""
image_grid_thw (`torch.LongTensor` of shape `(num_images, 3)`, *optional*):
The temporal, height and width of feature shape of each image in LLM.
video_grid_thw (`torch.LongTensor` of shape `(num_videos, 3)`, *optional*):
The temporal, height and width of feature shape of each video in LLM.
"""
if (input_ids is None) ^ (inputs_embeds is not None):
raise ValueError("You must specify exactly one of input_ids or inputs_embeds")
if inputs_embeds is None:
inputs_embeds = self.get_input_embeddings()(input_ids)
image_mask = None
video_mask = None
if pixel_values is not None:
image_outputs: BaseModelOutputWithDeepstackFeatures = self.get_image_features(
pixel_values, image_grid_thw, return_dict=True
)
image_embeds = image_outputs.pooler_output
deepstack_image_embeds = image_outputs.deepstack_features
image_embeds = torch.cat(image_embeds, dim=0).to(inputs_embeds.device, inputs_embeds.dtype)
image_mask, _ = self.get_placeholder_mask(
input_ids, inputs_embeds=inputs_embeds, image_features=image_embeds
)
inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds)
if pixel_values_videos is not None:
video_outputs: BaseModelOutputWithDeepstackFeatures = self.get_video_features(
pixel_values_videos, video_grid_thw, return_dict=True
)
video_embeds = video_outputs.pooler_output
deepstack_video_embeds = video_outputs.deepstack_features
video_embeds = torch.cat(video_embeds, dim=0).to(inputs_embeds.device, inputs_embeds.dtype)
_, video_mask = self.get_placeholder_mask(
input_ids, inputs_embeds=inputs_embeds, video_features=video_embeds
)
inputs_embeds = inputs_embeds.masked_scatter(video_mask, video_embeds)
visual_pos_masks = None
deepstack_visual_embeds = None
if image_mask is not None and video_mask is not None:
# aggregate visual_pos_masks and deepstack_visual_embeds
image_mask = image_mask[..., 0]
video_mask = video_mask[..., 0]
visual_pos_masks = image_mask | video_mask
deepstack_visual_embeds = []
image_mask_joint = image_mask[visual_pos_masks]
video_mask_joint = video_mask[visual_pos_masks]
for img_embed, vid_embed in zip(deepstack_image_embeds, deepstack_video_embeds):
embed_joint = img_embed.new_zeros(visual_pos_masks.sum(), img_embed.shape[-1]).to(img_embed.device)
embed_joint[image_mask_joint, :] = img_embed
embed_joint[video_mask_joint, :] = vid_embed
deepstack_visual_embeds.append(embed_joint)
elif image_mask is not None:
image_mask = image_mask[..., 0]
visual_pos_masks = image_mask
deepstack_visual_embeds = deepstack_image_embeds
elif video_mask is not None:
video_mask = video_mask[..., 0]
visual_pos_masks = video_mask
deepstack_visual_embeds = deepstack_video_embeds
if position_ids is None:
position_ids = self.compute_3d_position_ids(
input_ids=input_ids,
image_grid_thw=image_grid_thw,
video_grid_thw=video_grid_thw,
inputs_embeds=inputs_embeds,
attention_mask=attention_mask,
past_key_values=past_key_values,
)
# ====== Sparse-sampling truncation: applied only during prefill (empty KV cache) ======
if (past_key_values is None or past_key_values.get_seq_length() == 0) and visual_pos_masks is not None:
# These parameters can be passed in via kwargs
keep_ratio = kwargs.pop("visual_keep_ratio", 0.1) # keep only 10% of visual tokens by default
min_keep = kwargs.pop("min_keep_per_vis", 0) # minimum visual tokens kept per segment (e.g. 16)
max_len = kwargs.pop("truncate_max_len", None) # optional cap on total sequence length
inputs_embeds, attention_mask, position_ids, visual_pos_masks, deepstack_visual_embeds = sparse_keep_and_gather(
inputs_embeds=inputs_embeds,
attention_mask=attention_mask,
position_ids=position_ids,
visual_pos_masks=visual_pos_masks,
deepstack_visual_embeds=deepstack_visual_embeds,
keep_ratio=keep_ratio,
min_keep_per_vis=min_keep,
max_len=max_len,
)
# rebuild cache_position as 0..L-1 to avoid misalignment after truncation
cache_position = torch.arange(
inputs_embeds.shape[1], device=inputs_embeds.device, dtype=torch.long
).unsqueeze(0).expand(inputs_embeds.shape[0], -1)
# also recompute rope_deltas for the truncated sequence (keeps positions consistent)
eff_len = attention_mask.sum(dim=1).to(torch.long) # (B,)
max_pos = position_ids.max(dim=0).values.max(dim=1).values # (B,)
self.rope_deltas = (max_pos + 1 - eff_len).unsqueeze(1)
# ====== end of truncation ======
outputs = self.language_model(
input_ids=None,
position_ids=position_ids,
attention_mask=attention_mask,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
cache_position=cache_position,
visual_pos_masks=visual_pos_masks,
deepstack_visual_embeds=deepstack_visual_embeds,
**kwargs,
)
return Qwen3VLModelOutputWithPast(
**outputs,
rope_deltas=self.rope_deltas,
)