.gitignore (vendored)

@@ -1,4 +1,5 @@
 data/*
 Qwen3-VL-2B-Instruct/*
 __pycache__/
 .vscode/
+.ipynb_checkpoints/
@@ -1,414 +0,0 @@
# AICAS 2026 - VLM Efficient Inference and Optimization Track for AI Chips

## Table of Contents

- [Overview](#overview)
- [Code Structure](#code-structure)
- [Core Files](#core-files)
- [Quick Start](#quick-start)
- [Evaluation Metrics](#evaluation-metrics)
- [Competition Rules](#competition-rules)
- [Important Notes](#important-notes)
- [Submission Guide](#submission-guide)

## Overview

This competition focuses on optimizing the inference performance of a vision-language model (VLM). Participants modify the `VLMModel` class in `evaluation_wrapper.py` to improve Time To First Token (TTFT) and throughput while preserving accuracy.

## Code Structure

```
AICASGC/
├── benchmark.py              # Benchmark script
├── evaluation_wrapper.py     # Model wrapper (implement your optimizations here)
├── requirements.txt          # Python dependencies
├── data/                     # Validation dataset
│   ├── data-*.arrow          # Dataset files
│   ├── dataset_info.json     # Dataset metadata
│   └── state.json            # Dataset state
├── Qwen3-VL-2B-Instruct/     # Model weights directory (participants download separately)
└── README.md / README_CN.md  # Documentation
```

## Core Files

- **`benchmark.py`** - Self-test benchmark script (⚠️ **modification is not recommended**)
- **`evaluation_wrapper.py`** - Model wrapper; participants implement their optimizations here
- **`Qwen3-VL-2B-Instruct/`** - Competition model weights (download separately; see "Quick Start")
- **`data/`** - Validation dataset
- **`requirements.txt`** - Python dependencies

## Quick Start

### 0. Download the Model (first-time setup)

The model files are large and must be downloaded separately. Create the model directory first, then download the weights:

```bash
# Create the model directory
mkdir -p Qwen3-VL-2B-Instruct

# Install huggingface_hub (if not already installed)
pip install -U huggingface_hub

# Set a mirror endpoint (recommended for users in mainland China, speeds up the download)
export HF_ENDPOINT=https://hf-mirror.com

# Download the model into the target directory
huggingface-cli download \
    --resume-download \
    Qwen/Qwen3-VL-2B-Instruct \
    --local-dir ./Qwen3-VL-2B-Instruct \
    --local-dir-use-symlinks False
```

**Notes:**
- The model is roughly 4-5 GB, so the download may take a while
- If the download is interrupted, rerun the command; it resumes automatically (`--resume-download`)
- After the download completes, the `Qwen3-VL-2B-Instruct/` folder contains all model files
- Make sure you have enough disk space (at least 5 GB)
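
As a quick sanity check after downloading, you can verify that the key files landed in the target directory. A minimal sketch; the `REQUIRED` list below is an illustrative subset, not the full snapshot contents:

```python
from pathlib import Path

# Illustrative subset of the snapshot; the real download contains more files
REQUIRED = ["config.json", "tokenizer_config.json", "generation_config.json"]

def missing_files(model_dir: str, required=None) -> list:
    """Return the names from `required` that are absent from the model directory."""
    required = REQUIRED if required is None else required
    root = Path(model_dir)
    return [name for name in required if not (root / name).exists()]

if __name__ == "__main__":
    missing = missing_files("./Qwen3-VL-2B-Instruct")
    print("Download looks complete." if not missing else f"Missing files: {missing}")
```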

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Run the Benchmark

```bash
python benchmark.py \
    --model-path ./Qwen3-VL-2B-Instruct \
    --dataset-path ./data \
    --output result.json \
    --num-samples 100
```

### 3. Implement Your Optimizations

Edit the `VLMModel` class in `evaluation_wrapper.py`. The optimizations follow a **modular design**: each optimization direction has its own method.

#### 3.1 Explore the Model Structure (optional)

Before optimizing, you can explore the model structure to understand the optimization targets:

```python
class VLMModel:
    def __init__(self, model_path: str, device: str = "cuda:0"):
        # ... load the model ...

        # Optional: explore the model structure
        self._explore_model_structure()  # prints model structure information
```

#### 3.2 Enable Optimization Methods

In `__init__`, enable or disable the different optimizations by uncommenting or commenting them out:

```python
class VLMModel:
    def __init__(self, model_path: str, device: str = "cuda:0"):
        # ... load the model ...

        # ================================================================
        # Participant optimization area - enable/disable optimizations
        # ================================================================

        # 1. Vision Encoder acceleration (speeds up high-resolution image processing)
        # self._optimize_vision_encoder()

        # 2. KV Cache management (reduces memory fragmentation during generation)
        # self._optimize_kv_cache()

        # 3. Cross-modal fusion layer optimization (optimizes the cross-modal connector)
        # self._optimize_cross_modal_connector()

        # 4. Flash Attention optimization
        # self._enable_flash_attention()

        # 5. Quantization
        # self._apply_quantization()
```

#### 3.3 Implement the Optimization Code

Implement your logic inside each optimization method. For example, to optimize the Vision Encoder:

```python
def _optimize_vision_encoder(self):
    """Find this method in evaluation_wrapper.py and implement your optimization."""

    # Example: replace the attention operator
    # from your_optimization import optimized_attention
    # if hasattr(self._model, 'vision_model'):
    #     for layer in self._model.vision_model.encoder.layers:
    #         layer.self_attn.forward = optimized_attention

    # TODO: implement your Vision Encoder optimization
    pass
```
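
One pitfall with the instance-level patch sketched above: assigning a plain function to `layer.self_attn.forward` does not bind `self`, so the replacement cannot see the layer's weights unless it is bound as a method. A framework-free sketch of the binding pattern (the `FakeAttention` class is purely illustrative):

```python
import types

class FakeAttention:
    def __init__(self, scale):
        self.scale = scale

    def forward(self, x):
        return x * self.scale

def optimized_forward(self, x):
    # Stand-in for an optimized kernel; still has access to self.scale
    return x * self.scale + 1

layer = FakeAttention(scale=2)
assert layer.forward(3) == 6

# Bind the replacement as a method so `self` is passed correctly
layer.forward = types.MethodType(optimized_forward, layer)
assert layer.forward(3) == 7
```

Patching at the class level (`FakeAttention.forward = optimized_forward`) avoids the binding issue but affects every instance of the class.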

### 4. Test Your Optimized Model

```bash
python benchmark.py \
    --model-path ./Qwen3-VL-2B-Instruct \
    --dataset-path ./data \
    --output result_optimized.json \
    --num-samples 100
```

### 5. Generate the Full Results for Submission

```bash
python benchmark.py \
    --model-path ./Qwen3-VL-2B-Instruct \
    --dataset-path ./data \
    --output result.json \
    --num-samples 5000
```

## Evaluation Metrics

Final score formula:

```
Final Score = 0.4 × Accuracy + 0.3 × TTFT Improvement + 0.3 × Throughput Improvement
```

### Metric Details

- **TTFT (Time To First Token)**: Time from input preparation to the first generated token (milliseconds)
  - Includes: image encoding, text encoding, cross-modal interaction, the prefill stage, and generating the first token
  - Baseline: ~80 ms
  - Improvement = (Baseline - your TTFT) / Baseline

- **Throughput**: End-to-end token generation rate (tokens/second)
  - Baseline: ~55 tokens/sec
  - Improvement = (your throughput - Baseline) / Baseline

- **Accuracy**: VQA accuracy on the validation set (5000 samples)
  - Soft matching against multiple reference answers is supported
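
Putting the formula and the baseline numbers together, the score can be computed as follows. This sketch applies the stated formula literally; whether the official system clamps negative improvement rates is not specified here:

```python
BASELINE_TTFT_MS = 80.0
BASELINE_THROUGHPUT = 55.0

def final_score(accuracy: float, ttft_ms: float, throughput: float) -> float:
    """Final Score = 0.4 * Accuracy + 0.3 * TTFT improvement + 0.3 * Throughput improvement."""
    ttft_gain = (BASELINE_TTFT_MS - ttft_ms) / BASELINE_TTFT_MS
    throughput_gain = (throughput - BASELINE_THROUGHPUT) / BASELINE_THROUGHPUT
    return 0.4 * accuracy + 0.3 * ttft_gain + 0.3 * throughput_gain

# Matching the baseline exactly contributes only the accuracy term
print(round(final_score(accuracy=0.80, ttft_ms=80.0, throughput=55.0), 4))  # → 0.32
```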

## Competition Rules

### Key Rules

1. **Do not modify `benchmark.py`**
   - This benchmark script is for self-testing only
   - Final evaluation uses an independent official benchmark system
   - Modifying this file may make your local results diverge from the final evaluation results

2. **Modify only `evaluation_wrapper.py`**

3. **Keep the required attributes**
   - The `VLMModel` class must expose the `processor`, `model`, and `device` attributes
   - The benchmark uses these attributes to access the model and processor
   - The `generate()` method is optional and mainly for debugging

4. **Prohibited behavior**
   - Hardcoding answers
   - Modifying the dataset
   - Using external APIs or services
   - All optimizations must be local and self-contained

### Optimization Directions

**Encouraged:**
- Operator replacement and kernel optimization: rewrite or replace standard operator implementations (e.g., Attention, LayerNorm, Conv2d) using Triton, CUDA C++, etc.
- Memory and cache optimization: optimize the KV Cache memory layout, reduce memory fragmentation, improve GPU memory access patterns
- Compilation and graph optimization: use torch.compile for computation-graph optimization and custom kernel scheduling
- Attention mechanism optimization: implement Flash Attention, memory-efficient attention, or sparse attention
- Generation-process optimization: optimize decoding strategies, cache management, and generation configuration parameters

**Not allowed:**
- External services: calling external APIs, cloud services, or any functionality requiring a network connection
- Data or answer cheating: training on test data, precomputing answers, hardcoding outputs
- Model replacement or tampering: the focus should be on operator-level optimization; do not train the model on extra datasets, change the model architecture, or directly modify weight values
- Overfitting optimizations: conditional branches or special handling targeting specific evaluation samples
- Black-box tool application: submissions that only change configuration files without substantive code contributions are not recognized
- Environment manipulation: interfering with fair evaluation by modifying the system environment, locking GPU frequencies, etc.

## Important Notes

### Sample Selection

- The provided `benchmark.py` uses a **fixed order** (the first N samples, starting from index 0)
- Running with `--num-samples 100` evaluates samples 0-99
- This keeps local self-tests reproducible
- **Note**: the official evaluation system used by the competition committee may use a different sampling strategy (including random sampling) for final validation
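
The fixed-order behavior described above amounts to taking a prefix of the dataset. A one-line sketch of the documented selection (the function name is illustrative, not part of the benchmark API):

```python
def select_samples(dataset_size: int, num_samples: int) -> list:
    """Fixed-order selection: the first N sample indices, starting at 0."""
    return list(range(min(num_samples, dataset_size)))

print(select_samples(5000, 100)[:5])  # → [0, 1, 2, 3, 4]
```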

### Hardware Information

The benchmark automatically records detailed hardware information:
- Python version, PyTorch version, CUDA version
- GPU name, memory, compute capability
- CPU model, core count, frequency
- System information (OS, kernel, architecture)
- PPU information (if available)

This information is stored in the `system_info` field of `result.json` for statistical analysis.

### Performance Measurement

- **Warm-up**: 10 samples are used to warm up the GPU before actual measurement
- **TTFT measurement**: time from input preparation to the first token (including all preprocessing)
- **Throughput measurement**: end-to-end time to generate 128 tokens
- **State isolation**: GPU caches are cleared between measurements to ensure fairness
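
The TTFT measurement described above can be sketched, framework-free, by timing up to the first produced token. The `fake_generate_stream` generator is a stand-in for real streaming model inference:

```python
import time

def fake_generate_stream(prompt):
    # Stand-in for streaming generation: yields tokens one at a time
    for tok in prompt.split():
        yield tok

def measure_ttft_ms(prompt: str) -> float:
    """Time from input preparation to the first token, in milliseconds."""
    start = time.perf_counter()
    stream = fake_generate_stream(prompt)
    next(stream)  # first token produced
    return (time.perf_counter() - start) * 1000.0

ttft = measure_ttft_ms("describe the image")
print(f"TTFT: {ttft:.3f} ms")
```

In the real benchmark the timer would wrap preprocessing plus the prefill and first decode step, matching the definition above.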

### Random Seed

- The `--random-seed` argument only affects PyTorch's random number generators
- It does **not** affect the sample selection order (which is always fixed)
- It is used to make any inference-time randomness reproducible

### Output Format

The `result.json` file contains:

```json
{
  "system_info": {
    "timestamp": "...",
    "python_version": "...",
    "torch_version": "...",
    "cuda_version": "...",
    "gpu_name": "...",
    ...
  },
  "performance": {
    "avg_ttft_ms": 90.55,
    "avg_throughput_tokens_per_sec": 57.77
  },
  "answers": [
    {
      "question_id": 34602,
      "prediction": "your answer text"
    },
    ...
  ]
}
```
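
Before submitting, a structural self-check of the file can catch obvious problems. A minimal sketch based on the schema shown above; the official validator may check more:

```python
import json

def validate_result(payload: str) -> list:
    """Return a list of problems found in a result.json payload (empty if OK)."""
    problems = []
    data = json.loads(payload)
    for key in ("system_info", "performance", "answers"):
        if key not in data:
            problems.append(f"missing top-level key: {key}")
    perf = data.get("performance", {})
    for key in ("avg_ttft_ms", "avg_throughput_tokens_per_sec"):
        if not isinstance(perf.get(key), (int, float)):
            problems.append(f"performance.{key} missing or not numeric")
    for i, ans in enumerate(data.get("answers", [])):
        if "question_id" not in ans or "prediction" not in ans:
            problems.append(f"answers[{i}] missing question_id/prediction")
    return problems

sample = ('{"system_info": {}, "performance": {"avg_ttft_ms": 90.55, '
          '"avg_throughput_tokens_per_sec": 57.77}, '
          '"answers": [{"question_id": 34602, "prediction": "cat"}]}')
print(validate_result(sample))  # → []
```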

## Submission Guide

### Required Files for the Preliminary Round

1. **`result.json`** - Generated by running `benchmark.py`
   - Contains predictions for all samples
   - Must include valid `performance` metrics
   - **Important**: the `result.json` uploaded to the Tianchi platform is for reference only. Final scores are computed by the competition committee on standardized hardware with the official evaluation system.

2. **Your optimized code** - `evaluation_wrapper.py` containing your optimized `VLMModel` class

3. **Docker image** - A container with your optimized environment

### Evaluation Process

1. **Self-test**: test your optimizations locally with the provided `benchmark.py`
2. **Submit**: upload your `result.json` to the Tianchi platform (for reference only)
3. **Official evaluation**: the competition committee evaluates your code using:
   - The submitted Docker image
   - A standardized hardware environment
   - The official evaluation code
   - The full validation set, with random sampling for verification
4. **Final ranking**: based on the final score computed by the official evaluation system

## Good Luck!

We hope you focus on operator-level optimization, kernel replacement, and efficient memory management. Remember: accuracy matters as much as speed. Good luck!

```
Qwen3VLForConditionalGeneration(
  (model): Qwen3VLModel(
    (visual): Qwen3VLVisionModel(
      (patch_embed): Qwen3VLVisionPatchEmbed(
        (proj): Conv3d(3, 1024, kernel_size=(2, 16, 16), stride=(2, 16, 16))
      )
      (pos_embed): Embedding(2304, 1024)
      (rotary_pos_emb): Qwen3VLVisionRotaryEmbedding()
      (blocks): ModuleList(
        (0-23): 24 x Qwen3VLVisionBlock(
          (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
          (norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
          (attn): Qwen3VLVisionAttention(
            (qkv): Linear(in_features=1024, out_features=3072, bias=True)
            (proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (mlp): Qwen3VLVisionMLP(
            (linear_fc1): Linear(in_features=1024, out_features=4096, bias=True)
            (linear_fc2): Linear(in_features=4096, out_features=1024, bias=True)
            (act_fn): GELUTanh()
          )
        )
      )
      (merger): Qwen3VLVisionPatchMerger(
        (norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
        (linear_fc1): Linear(in_features=4096, out_features=4096, bias=True)
        (act_fn): GELU(approximate='none')
        (linear_fc2): Linear(in_features=4096, out_features=2048, bias=True)
      )
      (deepstack_merger_list): ModuleList(
        (0-2): 3 x Qwen3VLVisionPatchMerger(
          (norm): LayerNorm((4096,), eps=1e-06, elementwise_affine=True)
          (linear_fc1): Linear(in_features=4096, out_features=4096, bias=True)
          (act_fn): GELU(approximate='none')
          (linear_fc2): Linear(in_features=4096, out_features=2048, bias=True)
        )
      )
    )
    (language_model): Qwen3VLTextModel(
      (embed_tokens): Embedding(151936, 2048)
      (layers): ModuleList(
        (0-27): 28 x Qwen3VLTextDecoderLayer(
          (self_attn): Qwen3VLTextAttention(
            (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
            (k_proj): Linear(in_features=2048, out_features=1024, bias=False)
            (v_proj): Linear(in_features=2048, out_features=1024, bias=False)
            (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
            (q_norm): Qwen3VLTextRMSNorm((128,), eps=1e-06)
            (k_norm): Qwen3VLTextRMSNorm((128,), eps=1e-06)
          )
          (mlp): Qwen3VLTextMLP(
            (gate_proj): Linear(in_features=2048, out_features=6144, bias=False)
            (up_proj): Linear(in_features=2048, out_features=6144, bias=False)
            (down_proj): Linear(in_features=6144, out_features=2048, bias=False)
            (act_fn): SiLUActivation()
          )
          (input_layernorm): Qwen3VLTextRMSNorm((2048,), eps=1e-06)
          (post_attention_layernorm): Qwen3VLTextRMSNorm((2048,), eps=1e-06)
        )
      )
      (norm): Qwen3VLTextRMSNorm((2048,), eps=1e-06)
      (rotary_emb): Qwen3VLTextRotaryEmbedding()
    )
  )
  (lm_head): Linear(in_features=2048, out_features=151936, bias=False)
)
```
@@ -1,406 +0,0 @@
"""
AICAS 2026 - Participant Core Modification File

Participants should modify the VLMModel class to implement optimizations.

Note:
- Benchmark directly calls self.model.generate() for performance testing.
- Your optimizations should modify self.model or its operators in __init__ via Monkey Patch.
- The generate() method is optional and mainly for debugging.
"""
from typing import Dict

try:
    from PIL import Image
except ImportError:
    # Fallback stub so the `Image.Image` annotation still resolves without PIL (testing only)
    class Image:
        class Image:
            pass

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor


class VLMModel:
    """
    Participant optimization class - modify this to implement optimizations.

    Optimization Architecture:
    - Split optimizations into separate methods for isolation and testing
    - Enable/disable each optimization independently in __init__
    - Each optimization method can be tested individually

    Important Notes:
    1. Benchmark directly calls self.model.generate() for performance testing.
    2. Your optimizations should modify self.model or its operators via Monkey Patch.
    3. All optimizations are applied in __init__ by calling optimization methods.
    """

    def __init__(self, model_path: str, device: str = "cuda:0"):
        """
        Initialize model and apply optimizations.

        Args:
            model_path: Qwen3-VL-2B-Instruct model path
            device: CUDA device, e.g., "cuda:0"
        """
        self._device = device
        self.model_path = model_path

        # Load processor
        print(f"[VLMModel] Loading processor from {model_path}...")
        self._processor = AutoProcessor.from_pretrained(model_path)

        # Load model
        print("[VLMModel] Loading model with FP16...")
        self._model = AutoModelForImageTextToText.from_pretrained(
            model_path,
            torch_dtype=torch.float16,
            device_map=device
        )
        self._model.eval()

        # Track applied optimizations
        self._optimizations_applied = []

        # ================================================================
        # Participant Optimization Area - Enable/disable optimizations here
        # Uncomment the optimization methods you want to apply
        # ================================================================

        # 1. Vision Encoder Acceleration
        # self._optimize_vision_encoder()

        # 2. KV Cache Management
        # self._optimize_kv_cache()

        # 3. Cross-modal Connector Optimization
        # self._optimize_cross_modal_connector()

        # 4. Flash Attention Optimization
        # self._enable_flash_attention()

        # 5. Quantization
        # self._apply_quantization()

        # Optional: Explore model structure before optimization
        # self._explore_model_structure()

        # ================================================================

        print(f"[VLMModel] Model loaded successfully on {device}")
        if self._optimizations_applied:
            print(f"[VLMModel] Applied optimizations: {', '.join(self._optimizations_applied)}")

    # ================================================================
    # Optimization Methods - Implement your optimizations here
    # ================================================================

    def _explore_model_structure(self):
        """
        Helper method to explore model structure.

        Use this to understand the model architecture before implementing optimizations.
        This helps identify where to apply monkey patches.
        """
        print("=" * 60)
        print("Model Structure Exploration")
        print("=" * 60)

        # Explore vision model structure
        # (Note: per the printed structure above, the Qwen3-VL vision tower
        # actually lives at self._model.model.visual, so this hasattr check
        # may not find it; adapt the attribute path as needed.)
        if hasattr(self._model, 'vision_model'):
            print(f"Vision Model: {type(self._model.vision_model)}")
            if hasattr(self._model.vision_model, 'encoder'):
                if hasattr(self._model.vision_model.encoder, 'layers'):
                    print(f"  Vision Encoder Layers: {len(self._model.vision_model.encoder.layers)}")
                    # Show first layer structure
                    if len(self._model.vision_model.encoder.layers) > 0:
                        print(f"  First Layer Type: {type(self._model.vision_model.encoder.layers[0])}")
        else:
            print("Vision Model: Not found (model structure may differ)")

        # Explore language model structure
        if hasattr(self._model, 'model'):
            print(f"Language Model: {type(self._model.model)}")
            if hasattr(self._model.model, 'layers'):
                print(f"  Language Model Layers: {len(self._model.model.layers)}")
        else:
            print("Language Model: Not found (model structure may differ)")

        # Explore cross-modal components
        cross_modal_attrs = ['connector', 'cross_attn', 'cross_attention', 'proj', 'projector']
        found_components = []
        for attr in cross_modal_attrs:
            if hasattr(self._model, attr):
                found_components.append(attr)
        if found_components:
            print(f"Cross-modal Components: {', '.join(found_components)}")
        else:
            print("Cross-modal Components: Explore manually (structure may vary)")

        print("=" * 60)
        print("Tip: Use print(self._model) to see full model structure")
        print("=" * 60)

    def _optimize_vision_encoder(self):
        """
        Optimize Vision Encoder for high-resolution image inputs.

        Optimization Directions:
        1. Patch embedding convolution optimization
        2. Vision Transformer attention mechanism optimization
        3. Layer normalization optimization
        4. Memory-efficient image processing

        Implementation Steps:
        1. Inspect model structure: call self._explore_model_structure()
        2. Identify bottlenecks using profiling tools (PyTorch Profiler, nsys, etc.)
        3. Implement optimized operators (Triton/CUDA kernels)
        4. Replace original operators via monkey patch

        Target Components:
        - self._model.vision_model (if exists)
        - Vision encoder layers and attention mechanisms
        - Convolution operations in patch embedding
        """
        # TODO: Implement your Vision Encoder optimization here
        #
        # Example workflow:
        # 1. from your_optimization import optimized_attention, optimized_conv
        # 2. Inspect: print(self._model.vision_model) to find target layers
        # 3. Replace: layer.self_attn.forward = optimized_attention
        # 4. Test: Run benchmark to verify improvement

        if 'vision_encoder' not in self._optimizations_applied:
            self._optimizations_applied.append('vision_encoder')

    def _optimize_kv_cache(self):
        """
        Optimize KV Cache management to reduce memory fragmentation.

        Optimization Directions:
        1. Memory layout optimization (contiguous memory allocation)
        2. Fragmentation-free allocation strategies
        3. Efficient cache reuse patterns
        4. Dynamic cache sizing

        Implementation Steps:
        1. Understand current KV cache implementation in model layers
        2. Design memory-efficient cache allocation strategy
        3. Implement custom KV cache allocator if needed
        4. Apply optimizations via monkey patch or config modification

        Target Components:
        - self._model.config (cache configuration)
        - Attention layers (KV cache allocation)
        - Generation loop (cache management)
        """
        # Enable KV Cache first
        self._model.config.use_cache = True
        if hasattr(self._model.config, 'pad_token_id'):
            if self._model.config.pad_token_id is None:
                self._model.config.pad_token_id = self._model.config.eos_token_id

        # TODO: Implement advanced KV Cache optimizations here
        #
        # Example workflow:
        # 1. from your_optimization import FragmentationFreeKVCache
        # 2. for layer in self._model.model.layers:
        # 3.     layer.attention.custom_kv_cache = FragmentationFreeKVCache()
        # 4. Test: Monitor memory usage and generation speed

        if 'kv_cache' not in self._optimizations_applied:
            self._optimizations_applied.append('kv_cache')

    def _optimize_cross_modal_connector(self):
        """
        Optimize Cross-modal Connector computation efficiency.

        Optimization Directions:
        1. Cross-attention mechanism optimization
        2. Vision-to-language projection optimization
        3. Multi-modal fusion layer efficiency
        4. Feature alignment and transformation optimization

        Implementation Steps:
        1. Identify cross-modal components using self._explore_model_structure()
        2. Profile cross-modal operations to find bottlenecks
        3. Implement optimized cross-attention or projection kernels
        4. Replace original operations via monkey patch

        Note: Qwen3-VL's cross-modal structure may vary.
        Use model exploration to identify actual component names and locations.
        """
        # TODO: Implement your Cross-modal Connector optimization here
        #
        # Example workflow:
        # 1. Explore: self._explore_model_structure() to find connector components
        # 2. from your_optimization import optimized_cross_attention
        # 3. Identify: Inspect model to find cross-attention layers
        # 4. Replace: connector.cross_attention.forward = optimized_cross_attention
        # 5. Test: Verify accuracy and performance improvements

        # Apply the patched forward from my_patch (class-level monkey patch
        # replacing Qwen3VLModel.forward)
        from my_patch import patch_forward
        self._model.model.__class__.forward = patch_forward

        if 'cross_modal' not in self._optimizations_applied:
            self._optimizations_applied.append('cross_modal')

    def _enable_flash_attention(self):
        """
        Enable or implement Flash Attention optimization.

        Implementation Approaches:

        Approach 1: Enable PyTorch's Built-in Flash Attention (Simple)
        - Uses torch.backends.cuda.enable_flash_sdp(True)
        - Easy to enable but limited customization
        - May not work for all attention patterns in Qwen3-VL

        Approach 2: Implement Custom Flash Attention (Advanced, Recommended)
        - Write custom Triton/CUDA kernels for attention computation
        - Replace torch.nn.functional.scaled_dot_product_attention
        - Full control over attention computation and memory layout
        - Better performance potential but requires more implementation effort

        Recommended: Implement Approach 2 for better performance gains.
        Use profiling to identify which attention operations benefit most from optimization.
        """
        # TODO: Choose and implement your Flash Attention approach

        # Approach 1: Simple (enable PyTorch built-in)
        # torch.backends.cuda.enable_flash_sdp(True)

        # Approach 2: Advanced (custom implementation - recommended)
        # from your_optimization import custom_flash_attention
        # torch.nn.functional.scaled_dot_product_attention = custom_flash_attention
        #
        # Or replace at layer level:
        # for layer in self._model.model.layers:
        #     layer.self_attn.forward = custom_attention_with_flash

        if 'flash_attention' not in self._optimizations_applied:
            self._optimizations_applied.append('flash_attention')

    def _apply_quantization(self):
        """
        Apply quantization to reduce model size and speed up inference.

        Optimization Directions:
        1. INT8 quantization (8-bit integer)
        2. FP8 quantization (8-bit floating point)
        3. Mixed precision quantization
        4. Dynamic vs static quantization

        Implementation Steps:
        1. Choose quantization strategy based on accuracy/performance trade-off
        2. Use quantization libraries (BitsAndBytes, TensorRT, etc.)
        3. Calibrate quantized model on validation data
        4. Verify accuracy preservation

        Note: Quantization may require reloading the model with quantization config.
        Consider applying quantization before other optimizations if model reload is needed.
        """
        # TODO: Implement your quantization here
        #
        # Example workflow:
        # 1. from transformers import BitsAndBytesConfig
        # 2. quantization_config = BitsAndBytesConfig(load_in_8bit=True)
        # 3. Note: May need to reload model with quantization config
        # 4. Test: Verify accuracy and performance improvements

        if 'quantization' not in self._optimizations_applied:
            self._optimizations_applied.append('quantization')

    # Required properties for benchmark
    @property
    def processor(self):
        """
        Required by benchmark for input processing.

        Benchmark uses this to prepare inputs with unified tokenizer.
        """
        return self._processor

    @property
    def model(self):
        """
        Required by benchmark for direct model.generate() calls.

        Benchmark directly calls self.model.generate() for performance testing.
        Your optimizations should modify this model object or its operators.
        """
        return self._model

    @property
    def device(self):
        """
        Required by benchmark for device information.
        """
        return self._device

    def generate(
        self,
        image: Image.Image,
        question: str,
        max_new_tokens: int = 128
    ) -> Dict:
        """
        Generate answer (optional method, mainly for debugging).

        Note: Benchmark uses self.model.generate() directly for performance testing.
        This method is provided for convenience and debugging purposes.

        Args:
            image: PIL Image object
            question: Question text
            max_new_tokens: Maximum tokens to generate

        Returns:
            Dict: {
                "text": str,        # Generated text answer
                "token_count": int  # Generated token count
            }
        """
        # Build Qwen3-VL message format
        messages = [{
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": question}
            ]
        }]

        # Process inputs
        inputs = self._processor.apply_chat_template(
            messages,
            tokenize=True,
            add_generation_prompt=True,
            return_dict=True,
            return_tensors="pt"
        ).to(self._device)

        # Generate
        with torch.no_grad():
            output_ids = self._model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=False,
                temperature=0.0,
                top_p=1.0,
                use_cache=True
            )

        # Extract generated tokens (remove input part)
        input_len = inputs.input_ids.shape[1]
        generated_ids = output_ids[0][input_len:]

        # Decode
        text = self._processor.tokenizer.decode(
            generated_ids,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False
        )

        return {
            "text": text,
            "token_count": len(generated_ids)
        }
@@ -1,364 +0,0 @@
import numpy as np
import torch

from transformers.models.qwen3_vl.processing_qwen3_vl import Qwen3VLProcessor, Qwen3VLProcessorKwargs
from transformers.models.qwen3_vl.modeling_qwen3_vl import Qwen3VLModelOutputWithPast, BaseModelOutputWithDeepstackFeatures
from transformers.feature_extraction_utils import BatchFeature
from transformers.image_utils import ImageInput
from transformers.processing_utils import Unpack
from transformers.tokenization_utils_base import PreTokenizedInput, TextInput
from transformers.utils import logging, TransformersKwargs, can_return_tuple
from transformers.video_utils import VideoInput
from transformers.cache_utils import Cache

logger = logging.get_logger(__name__)

class myQwen3VLProcessor(Qwen3VLProcessor):
    def __init__(self, image_processor=None, tokenizer=None, video_processor=None, chat_template=None, **kwargs):
        super().__init__(image_processor, tokenizer, video_processor, chat_template, **kwargs)

    def __call__(
        self,
        images: ImageInput = None,
        text: TextInput | PreTokenizedInput | list[TextInput] | list[PreTokenizedInput] = None,
        videos: VideoInput = None,
        **kwargs: Unpack[Qwen3VLProcessorKwargs],
    ) -> BatchFeature:
        r"""
        Returns:
            [`BatchFeature`]: A [`BatchFeature`] with the following fields:

            - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
              `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
              `None`).
            - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
            - **pixel_values_videos** -- Pixel values of videos to be fed to a model. Returned when `videos` is not `None`.
            - **image_grid_thw** -- List of image 3D grid in LLM. Returned when `images` is not `None`.
            - **video_grid_thw** -- List of video 3D grid in LLM. Returned when `videos` is not `None`.
        """
        output_kwargs = self._merge_kwargs(
            Qwen3VLProcessorKwargs,
            tokenizer_init_kwargs=self.tokenizer.init_kwargs,
            **kwargs,
        )

        if images is not None:
            image_inputs = self.image_processor(images=images, **output_kwargs["images_kwargs"])
            image_grid_thw = image_inputs["image_grid_thw"]
        else:
            image_inputs = {}
            image_grid_thw = None

        if videos is not None:
            videos_inputs = self.video_processor(videos=videos, **output_kwargs["videos_kwargs"])
            video_grid_thw = videos_inputs["video_grid_thw"]
            # If user has not requested video metadata, pop it
            if not kwargs.get("return_metadata"):
                video_metadata = videos_inputs.pop("video_metadata")
            else:
                video_metadata = videos_inputs["video_metadata"]
        else:
            videos_inputs = {}
            video_grid_thw = None

        if not isinstance(text, list):
            text = [text]

        text = text.copy()  # below lines change text in-place
        if image_grid_thw is not None:
            merge_length = self.image_processor.merge_size**2
            index = 0
            for i in range(len(text)):
                while self.image_token in text[i]:
                    # num_image_tokens = image_grid_thw[index].prod() // merge_length
                    num_image_tokens = 40  # fixed image-token budget (replaces the grid-based count above)
                    text[i] = text[i].replace(self.image_token, "<|placeholder|>" * num_image_tokens, 1)
                    index += 1
                text[i] = text[i].replace("<|placeholder|>", self.image_token)

        if video_grid_thw is not None:
            merge_length = self.video_processor.merge_size**2
            index = 0
            for i in range(len(text)):
                while self.video_token in text[i]:
                    metadata = video_metadata[index]
                    if metadata.fps is None:
                        logger.warning_once(
                            "Qwen3VL requires frame timestamps to construct prompts, but the `fps` of the input video could not be inferred. "
                            "Probably `video_metadata` was missing from inputs and you passed pre-sampled frames. "
                            "Defaulting to `fps=24`. Please provide `video_metadata` for more accurate results."
                        )
                    metadata.fps = 24 if metadata.fps is None else metadata.fps

                    # if timestamps are not provided, calculate them
                    curr_timestamp = self._calculate_timestamps(
                        metadata.frames_indices,
                        metadata.fps,
                        self.video_processor.temporal_patch_size,
                    )

                    video_placeholder = ""
                    frame_seqlen = video_grid_thw[index][1:].prod() // merge_length
                    for frame_idx in range(video_grid_thw[index][0]):
                        curr_time = curr_timestamp[frame_idx]
                        video_placeholder += f"<{curr_time:.1f} seconds>"
                        video_placeholder += (
                            self.vision_start_token + "<|placeholder|>" * frame_seqlen + self.vision_end_token
                        )
                    if f"{self.vision_start_token}{self.video_token}{self.vision_end_token}" in text[i]:
                        text[i] = text[i].replace(
                            f"{self.vision_start_token}{self.video_token}{self.vision_end_token}", video_placeholder, 1
                        )
                    else:
                        # vllm may input video token directly
                        text[i] = text[i].replace(self.video_token, video_placeholder, 1)
                    index += 1

                text[i] = text[i].replace("<|placeholder|>", self.video_token)

        return_tensors = output_kwargs["text_kwargs"].pop("return_tensors", None)
        return_mm_token_type_ids = output_kwargs["text_kwargs"].pop("return_mm_token_type_ids", None)
        text_inputs = self.tokenizer(text, **output_kwargs["text_kwargs"])
        self._check_special_mm_tokens(text, text_inputs, modalities=["image", "video"])

        if return_mm_token_type_ids:
            array_ids = np.array(text_inputs["input_ids"])
            mm_token_type_ids = np.zeros_like(text_inputs["input_ids"])
            mm_token_type_ids[array_ids == self.image_token_id] = 1
            text_inputs["mm_token_type_ids"] = mm_token_type_ids.tolist()

        return BatchFeature(data={**text_inputs, **image_inputs, **videos_inputs}, tensor_type=return_tensors)

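To make the fixed-budget expansion above concrete, here is a minimal standalone sketch of the two-step replace; the `<|image_pad|>` string is an assumed stand-in for `self.image_token`, and no model objects are needed:

```python
# Illustrative only: mirrors the processor's two-step token expansion,
# with an assumed image token string and the fixed 40-token budget.
image_token = "<|image_pad|>"  # assumption: stand-in for self.image_token
num_image_tokens = 40          # fixed budget instead of grid_thw-derived count
text = f"Describe this image: {image_token}"
# step 1: expand the single marker into 40 temporary placeholders
text = text.replace(image_token, "<|placeholder|>" * num_image_tokens, 1)
# step 2: turn the placeholders back into real image tokens
text = text.replace("<|placeholder|>", image_token)
```

The temporary `<|placeholder|>` indirection is what lets the `while self.image_token in text[i]` loop terminate: the freshly inserted tokens are not re-matched until the final replace.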
def _sample_indices_uniform(idx: torch.LongTensor, keep_ratio: float, min_keep: int = 0):
    """
    idx: 1D indices into the original sequence (sorted)
    keep_ratio: in [0, 1]; keep uniformly spaced indices
    """
    n = idx.numel()
    if n == 0:
        return idx
    k = max(min_keep, int(torch.ceil(torch.tensor(n * keep_ratio)).item()))
    k = min(k, n)
    if k == n:
        return idx
    # uniform pick: linspace over [0, n-1]
    pos = torch.linspace(0, n - 1, steps=k, device=idx.device)
    pos = pos.round().long().clamp(0, n - 1)
    return idx[pos]

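A quick standalone sanity check of the uniform pick used by `_sample_indices_uniform`, inlined here so it runs without the rest of the file (torch only; the numbers are illustrative):

```python
import torch

# 100 visual positions, keep_ratio = 0.25 -> k = ceil(100 * 0.25) = 25
idx = torch.arange(100, dtype=torch.long)
k = 25
pos = torch.linspace(0, idx.numel() - 1, steps=k).round().long().clamp(0, idx.numel() - 1)
kept = idx[pos]
# kept is 25 evenly spaced indices, endpoints 0 and 99 included
```

Because `linspace` always includes both endpoints, the first and last visual tokens of each segment survive pruning, which preserves the segment boundaries the timestamps anchor to.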
def sparse_keep_and_gather(
    inputs_embeds,            # (B, S, D)
    attention_mask,           # (B, S)
    position_ids,             # (4, B, S)
    visual_pos_masks,         # (B, S) bool
    deepstack_visual_embeds,  # list[Tensor], each (Nvis_total, D), or None
    keep_ratio: float = 0.25,
    min_keep_per_vis: int = 0,
    max_len: int | None = None,
):
    """
    Sparse retention: keep all text tokens; uniformly subsample visual tokens
    down to keep_ratio. Optional max_len: if the result is still too long,
    trim further from the visual tokens only (text is never dropped).
    """
    device = inputs_embeds.device
    B, S, D = inputs_embeds.shape
    eff = attention_mask.bool()

    keep_mask_token = torch.zeros((B, S), dtype=torch.bool, device=device)

    for b in range(B):
        eff_idx = eff[b].nonzero(as_tuple=False).squeeze(1)  # valid (non-padding) tokens
        if eff_idx.numel() == 0:
            continue

        vis_eff = visual_pos_masks[b, eff_idx]  # which valid tokens are visual
        text_idx = eff_idx[~vis_eff]            # kept in full
        vis_idx = eff_idx[vis_eff]              # candidates for subsampling

        # uniform visual subsampling (this step is what drops the middle tokens)
        kept_vis = _sample_indices_uniform(vis_idx, keep_ratio, min_keep=min_keep_per_vis)

        chosen = torch.cat([text_idx, kept_vis], dim=0)
        chosen, _ = torch.sort(chosen)  # preserve original order

        # optional hard length cap: trim visual tokens first, never text
        if max_len is not None and chosen.numel() > max_len:
            # visual positions already kept
            chosen_vis = chosen[visual_pos_masks[b, chosen]]
            chosen_txt = chosen[~visual_pos_masks[b, chosen]]
            # if text alone already exceeds max_len, truncate text (rare)
            if chosen_txt.numel() >= max_len:
                chosen = chosen_txt[:max_len]
            else:
                budget = max_len - chosen_txt.numel()
                # uniformly trim the remaining visual tokens down to the budget
                chosen_vis = _sample_indices_uniform(chosen_vis, budget / max(chosen_vis.numel(), 1))
                chosen = torch.cat([chosen_txt, chosen_vis], dim=0)
                chosen, _ = torch.sort(chosen)

        keep_mask_token[b, chosen] = True

    # ===== gather + pad to the max kept length in the batch =====
    keep_lens = keep_mask_token.sum(dim=1).tolist()
    max_keep = max(keep_lens) if keep_lens else 0

    new_inputs = inputs_embeds.new_zeros((B, max_keep, D))
    new_attn = attention_mask.new_zeros((B, max_keep))
    new_pos = position_ids.new_zeros((4, B, max_keep))
    new_vis = visual_pos_masks.new_zeros((B, max_keep), dtype=torch.bool)

    for b in range(B):
        idx = keep_mask_token[b].nonzero(as_tuple=False).squeeze(1)
        L = idx.numel()
        if L == 0:
            continue
        new_inputs[b, :L, :] = inputs_embeds[b, idx, :]
        new_attn[b, :L] = attention_mask[b, idx]
        new_pos[:, b, :L] = position_ids[:, b, idx]
        new_vis[b, :L] = visual_pos_masks[b, idx]

    # ===== prune deepstack features in sync (critical!) =====
    new_deepstack = None
    if deepstack_visual_embeds is not None:
        # deepstack rows are ordered like the True entries of the flattened
        # visual_pos_masks, so index them with keep_mask_token at those positions
        keep_vis_flat = keep_mask_token[visual_pos_masks]  # 1D bool, length = Nvis_total
        new_deepstack = [x[keep_vis_flat] for x in deepstack_visual_embeds]

    return new_inputs, new_attn, new_pos, new_vis, new_deepstack

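The deepstack pruning relies on one invariant: the rows of each `deepstack_visual_embeds` tensor are ordered exactly like the `True` entries of `visual_pos_masks` when flattened. A tiny standalone check of that indexing, with made-up shapes:

```python
import torch

vis = torch.tensor([[False, True, True, False, True]])    # visual_pos_masks, (B=1, S=5)
keep = torch.tensor([[False, True, False, False, True]])  # keep_mask_token after sampling
deepstack = torch.arange(12, dtype=torch.float32).view(3, 4)  # (Nvis_total=3, D=4)

# boolean-index the keep mask at visual positions: one flag per deepstack row
keep_vis_flat = keep[vis]            # one entry per True in vis, in flatten order
pruned = deepstack[keep_vis_flat]    # rows whose token was kept survive
```

If the two boolean masks ever disagreed on ordering (e.g. deepstack rows grouped per image rather than per flattened position), this slice would silently attach the wrong features to surviving tokens, so the invariant is worth keeping in mind when batching images and videos together.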
@can_return_tuple
def patch_forward(
    self,
    input_ids: torch.LongTensor = None,
    attention_mask: torch.Tensor | None = None,
    position_ids: torch.LongTensor | None = None,
    past_key_values: Cache | None = None,
    inputs_embeds: torch.FloatTensor | None = None,
    pixel_values: torch.Tensor | None = None,
    pixel_values_videos: torch.FloatTensor | None = None,
    image_grid_thw: torch.LongTensor | None = None,
    video_grid_thw: torch.LongTensor | None = None,
    cache_position: torch.LongTensor | None = None,
    **kwargs: Unpack[TransformersKwargs],
) -> tuple | Qwen3VLModelOutputWithPast:
    r"""
    image_grid_thw (`torch.LongTensor` of shape `(num_images, 3)`, *optional*):
        The temporal, height and width of feature shape of each image in LLM.
    video_grid_thw (`torch.LongTensor` of shape `(num_videos, 3)`, *optional*):
        The temporal, height and width of feature shape of each video in LLM.
    """
    if (input_ids is None) ^ (inputs_embeds is not None):
        raise ValueError("You must specify exactly one of input_ids or inputs_embeds")

    if inputs_embeds is None:
        inputs_embeds = self.get_input_embeddings()(input_ids)

    image_mask = None
    video_mask = None

    if pixel_values is not None:
        image_outputs: BaseModelOutputWithDeepstackFeatures = self.get_image_features(
            pixel_values, image_grid_thw, return_dict=True
        )
        image_embeds = image_outputs.pooler_output
        deepstack_image_embeds = image_outputs.deepstack_features
        image_embeds = torch.cat(image_embeds, dim=0).to(inputs_embeds.device, inputs_embeds.dtype)
        image_mask, _ = self.get_placeholder_mask(
            input_ids, inputs_embeds=inputs_embeds, image_features=image_embeds
        )
        inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds)

    if pixel_values_videos is not None:
        video_outputs: BaseModelOutputWithDeepstackFeatures = self.get_video_features(
            pixel_values_videos, video_grid_thw, return_dict=True
        )
        video_embeds = video_outputs.pooler_output
        deepstack_video_embeds = video_outputs.deepstack_features
        video_embeds = torch.cat(video_embeds, dim=0).to(inputs_embeds.device, inputs_embeds.dtype)
        _, video_mask = self.get_placeholder_mask(
            input_ids, inputs_embeds=inputs_embeds, video_features=video_embeds
        )
        inputs_embeds = inputs_embeds.masked_scatter(video_mask, video_embeds)

    visual_pos_masks = None
    deepstack_visual_embeds = None
    if image_mask is not None and video_mask is not None:
        # aggregate visual_pos_masks and deepstack_visual_embeds
        image_mask = image_mask[..., 0]
        video_mask = video_mask[..., 0]
        visual_pos_masks = image_mask | video_mask
        deepstack_visual_embeds = []
        image_mask_joint = image_mask[visual_pos_masks]
        video_mask_joint = video_mask[visual_pos_masks]
        for img_embed, vid_embed in zip(deepstack_image_embeds, deepstack_video_embeds):
            embed_joint = img_embed.new_zeros(visual_pos_masks.sum(), img_embed.shape[-1]).to(img_embed.device)
            embed_joint[image_mask_joint, :] = img_embed
            embed_joint[video_mask_joint, :] = vid_embed
            deepstack_visual_embeds.append(embed_joint)
    elif image_mask is not None:
        image_mask = image_mask[..., 0]
        visual_pos_masks = image_mask
        deepstack_visual_embeds = deepstack_image_embeds
    elif video_mask is not None:
        video_mask = video_mask[..., 0]
        visual_pos_masks = video_mask
        deepstack_visual_embeds = deepstack_video_embeds

    if position_ids is None:
        position_ids = self.compute_3d_position_ids(
            input_ids=input_ids,
            image_grid_thw=image_grid_thw,
            video_grid_thw=video_grid_thw,
            inputs_embeds=inputs_embeds,
            attention_mask=attention_mask,
            past_key_values=past_key_values,
        )

    # ====== sparse-sampling pruning: prefill only (no cached KV yet) ======
    is_prefill = past_key_values is None or past_key_values.get_seq_length() == 0
    if is_prefill and visual_pos_masks is not None:
        # these knobs can be supplied via kwargs
        keep_ratio = kwargs.pop("visual_keep_ratio", 0.1)  # keep only 10% of visual tokens by default
        min_keep = kwargs.pop("min_keep_per_vis", 0)       # minimum tokens kept per visual segment (e.g. 16)
        max_len = kwargs.pop("truncate_max_len", None)     # optional cap on total sequence length

        inputs_embeds, attention_mask, position_ids, visual_pos_masks, deepstack_visual_embeds = sparse_keep_and_gather(
            inputs_embeds=inputs_embeds,
            attention_mask=attention_mask,
            position_ids=position_ids,
            visual_pos_masks=visual_pos_masks,
            deepstack_visual_embeds=deepstack_visual_embeds,
            keep_ratio=keep_ratio,
            min_keep_per_vis=min_keep,
            max_len=max_len,
        )

        # rebuild cache_position as 0..L-1 to avoid alignment issues
        cache_position = torch.arange(
            inputs_embeds.shape[1], device=inputs_embeds.device, dtype=torch.long
        ).unsqueeze(0).expand(inputs_embeds.shape[0], -1)

        # recompute rope_deltas for the pruned sequence to keep decode positions consistent
        eff_len = attention_mask.sum(dim=1).to(torch.long)           # (B,)
        max_pos = position_ids.max(dim=0).values.max(dim=1).values   # (B,)
        self.rope_deltas = (max_pos + 1 - eff_len).unsqueeze(1)
    # ====== end of pruning ======

    outputs = self.language_model(
        input_ids=None,
        position_ids=position_ids,
        attention_mask=attention_mask,
        past_key_values=past_key_values,
        inputs_embeds=inputs_embeds,
        cache_position=cache_position,
        visual_pos_masks=visual_pos_masks,
        deepstack_visual_embeds=deepstack_visual_embeds,
        **kwargs,
    )

    return Qwen3VLModelOutputWithPast(
        **outputs,
        rope_deltas=self.rope_deltas,
    )
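A small worked example of the `rope_deltas` recomputation above (delta = max 3D position + 1 - effective length, per batch element); the position numbers are made up:

```python
import torch

# position_ids has shape (4, B, S); here B=1, S=4, with one axis running ahead
position_ids = torch.stack([
    torch.tensor([[0, 1, 2, 3]]),
    torch.tensor([[0, 2, 4, 6]]),  # e.g. a spatial axis with gaps after pruning
    torch.tensor([[0, 1, 2, 3]]),
    torch.tensor([[0, 1, 2, 3]]),
])
attention_mask = torch.ones(1, 4)

eff_len = attention_mask.sum(dim=1).to(torch.long)          # effective length per sample
max_pos = position_ids.max(dim=0).values.max(dim=1).values  # largest position over all axes
rope_deltas = (max_pos + 1 - eff_len).unsqueeze(1)
```

The delta is what the decode phase adds to `cache_position` to resume rotary positions where the (pruned) prefill left off, which is why it must be recomputed whenever the sequence is shortened.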