.gitignore (vendored)

@@ -1,4 +1,5 @@
 data/*
 Qwen3-VL-2B-Instruct/*
 __pycache__/
 .vscode/
+.ipynb_checkpoints/
@@ -1,414 +0,0 @@
# AICAS 2026 - VLM Efficient Inference and Optimization Track for AI Chips

## Table of Contents

- [Overview](#overview)
- [Code Structure](#code-structure)
- [Core Files](#core-files)
- [Quick Start](#quick-start)
- [Evaluation Metrics](#evaluation-metrics)
- [Competition Rules](#competition-rules)
- [Important Notes](#important-notes)
- [Submission Guide](#submission-guide)

## Overview

This competition focuses on optimizing the inference performance of a vision-language model (VLM). Participants modify the `VLMModel` class in `evaluation_wrapper.py` to improve Time To First Token (TTFT) and throughput while preserving accuracy.

## Code Structure

```
AICASGC/
├── benchmark.py              # Benchmark script
├── evaluation_wrapper.py     # Model wrapper (implement your optimizations here)
├── requirements.txt          # Python dependencies
├── data/                     # Validation dataset
│   ├── data-*.arrow          # Dataset files
│   ├── dataset_info.json     # Dataset metadata
│   └── state.json            # Dataset state
├── Qwen3-VL-2B-Instruct/     # Model weights directory (participants download separately)
└── README.md / README_CN.md  # Documentation
```

## Core Files

- **`benchmark.py`** - Self-test benchmark script (⚠️ **modification is not recommended**)
- **`evaluation_wrapper.py`** - Model wrapper; participants implement their optimizations here
- **`Qwen3-VL-2B-Instruct/`** - Competition model weights (download separately; see "Quick Start")
- **`data/`** - Validation dataset
- **`requirements.txt`** - Python dependencies

## Quick Start

### 0. Download the Model (first-time setup)

The model files are large and must be downloaded separately. Create the model directory first, then download the weights:

```bash
# Create the model directory
mkdir -p Qwen3-VL-2B-Instruct

# Install huggingface_hub (if not already installed)
pip install -U huggingface_hub

# Set a mirror endpoint (recommended for users in mainland China, speeds up the download)
export HF_ENDPOINT=https://hf-mirror.com

# Download the model into the target directory
huggingface-cli download \
    --resume-download \
    Qwen/Qwen3-VL-2B-Instruct \
    --local-dir ./Qwen3-VL-2B-Instruct \
    --local-dir-use-symlinks False
```

**Notes:**
- The model is roughly 4-5 GB, so the download may take a while
- If the download is interrupted, rerun the command; it resumes automatically (`--resume-download`)
- After the download completes, the `Qwen3-VL-2B-Instruct/` folder contains all model files
- Make sure you have enough disk space (at least 5 GB)
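
As a quick sanity check after downloading, you can verify that the key files landed in the target directory. A minimal sketch; the `REQUIRED` list below is an illustrative subset, not the full snapshot contents:

```python
from pathlib import Path

# Illustrative subset of the snapshot; the real download contains more files
REQUIRED = ["config.json", "tokenizer_config.json", "generation_config.json"]

def missing_files(model_dir: str, required=None) -> list:
    """Return the names from `required` that are absent from the model directory."""
    required = REQUIRED if required is None else required
    root = Path(model_dir)
    return [name for name in required if not (root / name).exists()]

if __name__ == "__main__":
    missing = missing_files("./Qwen3-VL-2B-Instruct")
    print("Download looks complete." if not missing else f"Missing files: {missing}")
```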

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Run the Benchmark

```bash
python benchmark.py \
    --model-path ./Qwen3-VL-2B-Instruct \
    --dataset-path ./data \
    --output result.json \
    --num-samples 100
```

### 3. Implement Your Optimizations

Edit the `VLMModel` class in `evaluation_wrapper.py`. The optimizations follow a **modular design**: each optimization direction has its own method.

#### 3.1 Explore the Model Structure (optional)

Before optimizing, you can explore the model structure to understand the optimization targets:

```python
class VLMModel:
    def __init__(self, model_path: str, device: str = "cuda:0"):
        # ... load the model ...

        # Optional: explore the model structure
        self._explore_model_structure()  # prints model structure information
```

#### 3.2 Enable Optimization Methods

In `__init__`, enable or disable the different optimizations by uncommenting or commenting them out:

```python
class VLMModel:
    def __init__(self, model_path: str, device: str = "cuda:0"):
        # ... load the model ...

        # ================================================================
        # Participant optimization area - enable/disable optimizations
        # ================================================================

        # 1. Vision Encoder acceleration (speeds up high-resolution image processing)
        # self._optimize_vision_encoder()

        # 2. KV Cache management (reduces memory fragmentation during generation)
        # self._optimize_kv_cache()

        # 3. Cross-modal fusion layer optimization (optimizes the cross-modal connector)
        # self._optimize_cross_modal_connector()

        # 4. Flash Attention optimization
        # self._enable_flash_attention()

        # 5. Quantization
        # self._apply_quantization()
```

#### 3.3 Implement the Optimization Code

Implement your logic inside each optimization method. For example, to optimize the Vision Encoder:

```python
def _optimize_vision_encoder(self):
    """Find this method in evaluation_wrapper.py and implement your optimization."""

    # Example: replace the attention operator
    # from your_optimization import optimized_attention
    # if hasattr(self._model, 'vision_model'):
    #     for layer in self._model.vision_model.encoder.layers:
    #         layer.self_attn.forward = optimized_attention

    # TODO: implement your Vision Encoder optimization
    pass
```
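
One pitfall with the instance-level patch sketched above: assigning a plain function to `layer.self_attn.forward` does not bind `self`, so the replacement cannot see the layer's weights unless it is bound as a method. A framework-free sketch of the binding pattern (the `FakeAttention` class is purely illustrative):

```python
import types

class FakeAttention:
    def __init__(self, scale):
        self.scale = scale

    def forward(self, x):
        return x * self.scale

def optimized_forward(self, x):
    # Stand-in for an optimized kernel; still has access to self.scale
    return x * self.scale + 1

layer = FakeAttention(scale=2)
assert layer.forward(3) == 6

# Bind the replacement as a method so `self` is passed correctly
layer.forward = types.MethodType(optimized_forward, layer)
assert layer.forward(3) == 7
```

Patching at the class level (`FakeAttention.forward = optimized_forward`) avoids the binding issue but affects every instance of the class.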

### 4. Test Your Optimized Model

```bash
python benchmark.py \
    --model-path ./Qwen3-VL-2B-Instruct \
    --dataset-path ./data \
    --output result_optimized.json \
    --num-samples 100
```

### 5. Generate the Full Results for Submission

```bash
python benchmark.py \
    --model-path ./Qwen3-VL-2B-Instruct \
    --dataset-path ./data \
    --output result.json \
    --num-samples 5000
```

## Evaluation Metrics

Final score formula:

```
Final Score = 0.4 × Accuracy + 0.3 × TTFT Improvement + 0.3 × Throughput Improvement
```

### Metric Details

- **TTFT (Time To First Token)**: Time from input preparation to the first generated token (milliseconds)
  - Includes: image encoding, text encoding, cross-modal interaction, the prefill stage, and generating the first token
  - Baseline: ~80 ms
  - Improvement = (Baseline - your TTFT) / Baseline

- **Throughput**: End-to-end token generation rate (tokens/second)
  - Baseline: ~55 tokens/sec
  - Improvement = (your throughput - Baseline) / Baseline

- **Accuracy**: VQA accuracy on the validation set (5000 samples)
  - Soft matching against multiple reference answers is supported
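
Putting the formula and the baseline numbers together, the score can be computed as follows. This sketch applies the stated formula literally; whether the official system clamps negative improvement rates is not specified here:

```python
BASELINE_TTFT_MS = 80.0
BASELINE_THROUGHPUT = 55.0

def final_score(accuracy: float, ttft_ms: float, throughput: float) -> float:
    """Final Score = 0.4 * Accuracy + 0.3 * TTFT improvement + 0.3 * Throughput improvement."""
    ttft_gain = (BASELINE_TTFT_MS - ttft_ms) / BASELINE_TTFT_MS
    throughput_gain = (throughput - BASELINE_THROUGHPUT) / BASELINE_THROUGHPUT
    return 0.4 * accuracy + 0.3 * ttft_gain + 0.3 * throughput_gain

# Matching the baseline exactly contributes only the accuracy term
print(round(final_score(accuracy=0.80, ttft_ms=80.0, throughput=55.0), 4))  # → 0.32
```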

## Competition Rules

### Key Rules

1. **Do not modify `benchmark.py`**
   - This benchmark script is for self-testing only
   - Final evaluation uses an independent official benchmark system
   - Modifying this file may make your local results diverge from the final evaluation results

2. **Modify only `evaluation_wrapper.py`**

3. **Keep the required attributes**
   - The `VLMModel` class must expose the `processor`, `model`, and `device` attributes
   - The benchmark uses these attributes to access the model and processor
   - The `generate()` method is optional and mainly for debugging

4. **Prohibited behavior**
   - Hardcoding answers
   - Modifying the dataset
   - Using external APIs or services
   - All optimizations must be local and self-contained

### Optimization Directions

**Encouraged:**
- Operator replacement and kernel optimization: rewrite or replace standard operator implementations (e.g., Attention, LayerNorm, Conv2d) using Triton, CUDA C++, etc.
- Memory and cache optimization: optimize the KV Cache memory layout, reduce memory fragmentation, improve GPU memory access patterns
- Compilation and graph optimization: use torch.compile for computation-graph optimization and custom kernel scheduling
- Attention mechanism optimization: implement Flash Attention, memory-efficient attention, or sparse attention
- Generation-process optimization: optimize decoding strategies, cache management, and generation configuration parameters

**Not allowed:**
- External services: calling external APIs, cloud services, or any functionality requiring a network connection
- Data or answer cheating: training on test data, precomputing answers, hardcoding outputs
- Model replacement or tampering: the focus should be on operator-level optimization; do not train the model on extra datasets, change the model architecture, or directly modify weight values
- Overfitting optimizations: conditional branches or special handling targeting specific evaluation samples
- Black-box tool application: submissions that only change configuration files without substantive code contributions are not recognized
- Environment manipulation: interfering with fair evaluation by modifying the system environment, locking GPU frequencies, etc.

## Important Notes

### Sample Selection

- The provided `benchmark.py` uses a **fixed order** (the first N samples, starting from index 0)
- Running with `--num-samples 100` evaluates samples 0-99
- This keeps local self-tests reproducible
- **Note**: the official evaluation system used by the competition committee may use a different sampling strategy (including random sampling) for final validation
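
The fixed-order behavior described above amounts to taking a prefix of the dataset. A one-line sketch of the documented selection (the function name is illustrative, not part of the benchmark API):

```python
def select_samples(dataset_size: int, num_samples: int) -> list:
    """Fixed-order selection: the first N sample indices, starting at 0."""
    return list(range(min(num_samples, dataset_size)))

print(select_samples(5000, 100)[:5])  # → [0, 1, 2, 3, 4]
```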

### Hardware Information

The benchmark automatically records detailed hardware information:
- Python version, PyTorch version, CUDA version
- GPU name, memory, compute capability
- CPU model, core count, frequency
- System information (OS, kernel, architecture)
- PPU information (if available)

This information is stored in the `system_info` field of `result.json` for statistical analysis.

### Performance Measurement

- **Warm-up**: 10 samples are used to warm up the GPU before actual measurement
- **TTFT measurement**: time from input preparation to the first token (including all preprocessing)
- **Throughput measurement**: end-to-end time to generate 128 tokens
- **State isolation**: GPU caches are cleared between measurements to ensure fairness
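
The TTFT measurement described above can be sketched, framework-free, by timing up to the first produced token. The `fake_generate_stream` generator is a stand-in for real streaming model inference:

```python
import time

def fake_generate_stream(prompt):
    # Stand-in for streaming generation: yields tokens one at a time
    for tok in prompt.split():
        yield tok

def measure_ttft_ms(prompt: str) -> float:
    """Time from input preparation to the first token, in milliseconds."""
    start = time.perf_counter()
    stream = fake_generate_stream(prompt)
    next(stream)  # first token produced
    return (time.perf_counter() - start) * 1000.0

ttft = measure_ttft_ms("describe the image")
print(f"TTFT: {ttft:.3f} ms")
```

In the real benchmark the timer would wrap preprocessing plus the prefill and first decode step, matching the definition above.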

### Random Seed

- The `--random-seed` argument only affects PyTorch's random number generators
- It does **not** affect the sample selection order (which is always fixed)
- It is used to make any inference-time randomness reproducible

### Output Format

The `result.json` file contains:

```json
{
  "system_info": {
    "timestamp": "...",
    "python_version": "...",
    "torch_version": "...",
    "cuda_version": "...",
    "gpu_name": "...",
    ...
  },
  "performance": {
    "avg_ttft_ms": 90.55,
    "avg_throughput_tokens_per_sec": 57.77
  },
  "answers": [
    {
      "question_id": 34602,
      "prediction": "your answer text"
    },
    ...
  ]
}
```
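
Before submitting, a structural self-check of the file can catch obvious problems. A minimal sketch based on the schema shown above; the official validator may check more:

```python
import json

def validate_result(payload: str) -> list:
    """Return a list of problems found in a result.json payload (empty if OK)."""
    problems = []
    data = json.loads(payload)
    for key in ("system_info", "performance", "answers"):
        if key not in data:
            problems.append(f"missing top-level key: {key}")
    perf = data.get("performance", {})
    for key in ("avg_ttft_ms", "avg_throughput_tokens_per_sec"):
        if not isinstance(perf.get(key), (int, float)):
            problems.append(f"performance.{key} missing or not numeric")
    for i, ans in enumerate(data.get("answers", [])):
        if "question_id" not in ans or "prediction" not in ans:
            problems.append(f"answers[{i}] missing question_id/prediction")
    return problems

sample = ('{"system_info": {}, "performance": {"avg_ttft_ms": 90.55, '
          '"avg_throughput_tokens_per_sec": 57.77}, '
          '"answers": [{"question_id": 34602, "prediction": "cat"}]}')
print(validate_result(sample))  # → []
```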

## Submission Guide

### Required Files for the Preliminary Round

1. **`result.json`** - Generated by running `benchmark.py`
   - Contains predictions for all samples
   - Must include valid `performance` metrics
   - **Important**: the `result.json` uploaded to the Tianchi platform is for reference only. Final scores are computed by the competition committee on standardized hardware with the official evaluation system.

2. **Your optimized code** - `evaluation_wrapper.py` containing your optimized `VLMModel` class

3. **Docker image** - A container with your optimized environment

### Evaluation Process

1. **Self-test**: test your optimizations locally with the provided `benchmark.py`
2. **Submit**: upload your `result.json` to the Tianchi platform (for reference only)
3. **Official evaluation**: the competition committee evaluates your code using:
   - The submitted Docker image
   - A standardized hardware environment
   - The official evaluation code
   - The full validation set, with random sampling for verification
4. **Final ranking**: based on the final score computed by the official evaluation system

## Good Luck!

We hope you focus on operator-level optimization, kernel replacement, and efficient memory management. Remember: accuracy matters as much as speed. Good luck!

```
Qwen3VLForConditionalGeneration(
  (model): Qwen3VLModel(
    (visual): Qwen3VLVisionModel(
      (patch_embed): Qwen3VLVisionPatchEmbed(
        (proj): Conv3d(3, 1024, kernel_size=(2, 16, 16), stride=(2, 16, 16))
      )
      (pos_embed): Embedding(2304, 1024)
      (rotary_pos_emb): Qwen3VLVisionRotaryEmbedding()
      (blocks): ModuleList(
        (0-23): 24 x Qwen3VLVisionBlock(
          (norm1): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
          (norm2): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
          (attn): Qwen3VLVisionAttention(
            (qkv): Linear(in_features=1024, out_features=3072, bias=True)
            (proj): Linear(in_features=1024, out_features=1024, bias=True)
          )
          (mlp): Qwen3VLVisionMLP(
            (linear_fc1): Linear(in_features=1024, out_features=4096, bias=True)
            (linear_fc2): Linear(in_features=4096, out_features=1024, bias=True)
            (act_fn): GELUTanh()
          )
        )
      )
      (merger): Qwen3VLVisionPatchMerger(
        (norm): LayerNorm((1024,), eps=1e-06, elementwise_affine=True)
        (linear_fc1): Linear(in_features=4096, out_features=4096, bias=True)
        (act_fn): GELU(approximate='none')
        (linear_fc2): Linear(in_features=4096, out_features=2048, bias=True)
      )
      (deepstack_merger_list): ModuleList(
        (0-2): 3 x Qwen3VLVisionPatchMerger(
          (norm): LayerNorm((4096,), eps=1e-06, elementwise_affine=True)
          (linear_fc1): Linear(in_features=4096, out_features=4096, bias=True)
          (act_fn): GELU(approximate='none')
          (linear_fc2): Linear(in_features=4096, out_features=2048, bias=True)
        )
      )
    )
    (language_model): Qwen3VLTextModel(
      (embed_tokens): Embedding(151936, 2048)
      (layers): ModuleList(
        (0-27): 28 x Qwen3VLTextDecoderLayer(
          (self_attn): Qwen3VLTextAttention(
            (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
            (k_proj): Linear(in_features=2048, out_features=1024, bias=False)
            (v_proj): Linear(in_features=2048, out_features=1024, bias=False)
            (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
            (q_norm): Qwen3VLTextRMSNorm((128,), eps=1e-06)
            (k_norm): Qwen3VLTextRMSNorm((128,), eps=1e-06)
          )
          (mlp): Qwen3VLTextMLP(
            (gate_proj): Linear(in_features=2048, out_features=6144, bias=False)
            (up_proj): Linear(in_features=2048, out_features=6144, bias=False)
            (down_proj): Linear(in_features=6144, out_features=2048, bias=False)
            (act_fn): SiLUActivation()
          )
          (input_layernorm): Qwen3VLTextRMSNorm((2048,), eps=1e-06)
          (post_attention_layernorm): Qwen3VLTextRMSNorm((2048,), eps=1e-06)
        )
      )
      (norm): Qwen3VLTextRMSNorm((2048,), eps=1e-06)
      (rotary_emb): Qwen3VLTextRotaryEmbedding()
    )
  )
  (lm_head): Linear(in_features=2048, out_features=151936, bias=False)
)
```
@@ -1,406 +0,0 @@
"""
AICAS 2026 - Participant Core Modification File

Participants should modify the VLMModel class to implement optimizations.

Note:
- Benchmark directly calls self.model.generate() for performance testing.
- Your optimizations should modify self.model or its operators in __init__ via Monkey Patch.
- The generate() method is optional and mainly for debugging.
"""
from typing import Dict

try:
    from PIL import Image
except ImportError:
    # Fallback stub so the `Image.Image` annotation still resolves without PIL (testing only)
    class Image:
        class Image:
            pass

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor


class VLMModel:
    """
    Participant optimization class - modify this to implement optimizations.

    Optimization Architecture:
    - Split optimizations into separate methods for isolation and testing
    - Enable/disable each optimization independently in __init__
    - Each optimization method can be tested individually

    Important Notes:
    1. Benchmark directly calls self.model.generate() for performance testing.
    2. Your optimizations should modify self.model or its operators via Monkey Patch.
    3. All optimizations are applied in __init__ by calling optimization methods.
    """

    def __init__(self, model_path: str, device: str = "cuda:0"):
        """
        Initialize model and apply optimizations.

        Args:
            model_path: Qwen3-VL-2B-Instruct model path
            device: CUDA device, e.g., "cuda:0"
        """
        self._device = device
        self.model_path = model_path

        # Load processor
        print(f"[VLMModel] Loading processor from {model_path}...")
        self._processor = AutoProcessor.from_pretrained(model_path)

        # Load model
        print("[VLMModel] Loading model with FP16...")
        self._model = AutoModelForImageTextToText.from_pretrained(
            model_path,
            torch_dtype=torch.float16,
            device_map=device
        )
        self._model.eval()

        # Track applied optimizations
        self._optimizations_applied = []

        # ================================================================
        # Participant Optimization Area - Enable/disable optimizations here
        # Uncomment the optimization methods you want to apply
        # ================================================================

        # 1. Vision Encoder Acceleration
        # self._optimize_vision_encoder()

        # 2. KV Cache Management
        # self._optimize_kv_cache()

        # 3. Cross-modal Connector Optimization
        # self._optimize_cross_modal_connector()

        # 4. Flash Attention Optimization
        # self._enable_flash_attention()

        # 5. Quantization
        # self._apply_quantization()

        # Optional: Explore model structure before optimization
        # self._explore_model_structure()

        # ================================================================

        print(f"[VLMModel] Model loaded successfully on {device}")
        if self._optimizations_applied:
            print(f"[VLMModel] Applied optimizations: {', '.join(self._optimizations_applied)}")

    # ================================================================
    # Optimization Methods - Implement your optimizations here
    # ================================================================

    def _explore_model_structure(self):
        """
        Helper method to explore model structure.

        Use this to understand the model architecture before implementing optimizations.
        This helps identify where to apply monkey patches.
        """
        print("=" * 60)
        print("Model Structure Exploration")
        print("=" * 60)

        # Explore vision model structure
        # (Note: per the printed structure above, the Qwen3-VL vision tower
        # actually lives at self._model.model.visual, so this hasattr check
        # may not find it; adapt the attribute path as needed.)
        if hasattr(self._model, 'vision_model'):
            print(f"Vision Model: {type(self._model.vision_model)}")
            if hasattr(self._model.vision_model, 'encoder'):
                if hasattr(self._model.vision_model.encoder, 'layers'):
                    print(f"  Vision Encoder Layers: {len(self._model.vision_model.encoder.layers)}")
                    # Show first layer structure
                    if len(self._model.vision_model.encoder.layers) > 0:
                        print(f"  First Layer Type: {type(self._model.vision_model.encoder.layers[0])}")
        else:
            print("Vision Model: Not found (model structure may differ)")

        # Explore language model structure
        if hasattr(self._model, 'model'):
            print(f"Language Model: {type(self._model.model)}")
            if hasattr(self._model.model, 'layers'):
                print(f"  Language Model Layers: {len(self._model.model.layers)}")
        else:
            print("Language Model: Not found (model structure may differ)")

        # Explore cross-modal components
        cross_modal_attrs = ['connector', 'cross_attn', 'cross_attention', 'proj', 'projector']
        found_components = []
        for attr in cross_modal_attrs:
            if hasattr(self._model, attr):
                found_components.append(attr)
        if found_components:
            print(f"Cross-modal Components: {', '.join(found_components)}")
        else:
            print("Cross-modal Components: Explore manually (structure may vary)")

        print("=" * 60)
        print("Tip: Use print(self._model) to see full model structure")
        print("=" * 60)

    def _optimize_vision_encoder(self):
        """
        Optimize Vision Encoder for high-resolution image inputs.

        Optimization Directions:
        1. Patch embedding convolution optimization
        2. Vision Transformer attention mechanism optimization
        3. Layer normalization optimization
        4. Memory-efficient image processing

        Implementation Steps:
        1. Inspect model structure: call self._explore_model_structure()
        2. Identify bottlenecks using profiling tools (PyTorch Profiler, nsys, etc.)
        3. Implement optimized operators (Triton/CUDA kernels)
        4. Replace original operators via monkey patch

        Target Components:
        - self._model.vision_model (if exists)
        - Vision encoder layers and attention mechanisms
        - Convolution operations in patch embedding
        """
        # TODO: Implement your Vision Encoder optimization here
        #
        # Example workflow:
        # 1. from your_optimization import optimized_attention, optimized_conv
        # 2. Inspect: print(self._model.vision_model) to find target layers
        # 3. Replace: layer.self_attn.forward = optimized_attention
        # 4. Test: Run benchmark to verify improvement

        if 'vision_encoder' not in self._optimizations_applied:
            self._optimizations_applied.append('vision_encoder')

    def _optimize_kv_cache(self):
        """
        Optimize KV Cache management to reduce memory fragmentation.

        Optimization Directions:
        1. Memory layout optimization (contiguous memory allocation)
        2. Fragmentation-free allocation strategies
        3. Efficient cache reuse patterns
        4. Dynamic cache sizing

        Implementation Steps:
        1. Understand current KV cache implementation in model layers
        2. Design memory-efficient cache allocation strategy
        3. Implement custom KV cache allocator if needed
        4. Apply optimizations via monkey patch or config modification

        Target Components:
        - self._model.config (cache configuration)
        - Attention layers (KV cache allocation)
        - Generation loop (cache management)
        """
        # Enable KV Cache first
        self._model.config.use_cache = True
        if hasattr(self._model.config, 'pad_token_id'):
            if self._model.config.pad_token_id is None:
                self._model.config.pad_token_id = self._model.config.eos_token_id

        # TODO: Implement advanced KV Cache optimizations here
        #
        # Example workflow:
        # 1. from your_optimization import FragmentationFreeKVCache
        # 2. for layer in self._model.model.layers:
        # 3.     layer.attention.custom_kv_cache = FragmentationFreeKVCache()
        # 4. Test: Monitor memory usage and generation speed

        if 'kv_cache' not in self._optimizations_applied:
            self._optimizations_applied.append('kv_cache')

    def _optimize_cross_modal_connector(self):
        """
        Optimize Cross-modal Connector computation efficiency.

        Optimization Directions:
        1. Cross-attention mechanism optimization
        2. Vision-to-language projection optimization
        3. Multi-modal fusion layer efficiency
        4. Feature alignment and transformation optimization

        Implementation Steps:
        1. Identify cross-modal components using self._explore_model_structure()
        2. Profile cross-modal operations to find bottlenecks
        3. Implement optimized cross-attention or projection kernels
        4. Replace original operations via monkey patch

        Note: Qwen3-VL's cross-modal structure may vary.
        Use model exploration to identify actual component names and locations.
        """
        # TODO: Implement your Cross-modal Connector optimization here
        #
        # Example workflow:
        # 1. Explore: self._explore_model_structure() to find connector components
        # 2. from your_optimization import optimized_cross_attention
        # 3. Identify: Inspect model to find cross-attention layers
        # 4. Replace: connector.cross_attention.forward = optimized_cross_attention
        # 5. Test: Verify accuracy and performance improvements

        # Apply the patched forward from my_patch (class-level monkey patch
        # replacing Qwen3VLModel.forward)
        from my_patch import patch_forward
        self._model.model.__class__.forward = patch_forward

        if 'cross_modal' not in self._optimizations_applied:
            self._optimizations_applied.append('cross_modal')

    def _enable_flash_attention(self):
        """
        Enable or implement Flash Attention optimization.

        Implementation Approaches:

        Approach 1: Enable PyTorch's Built-in Flash Attention (Simple)
        - Uses torch.backends.cuda.enable_flash_sdp(True)
        - Easy to enable but limited customization
        - May not work for all attention patterns in Qwen3-VL

        Approach 2: Implement Custom Flash Attention (Advanced, Recommended)
        - Write custom Triton/CUDA kernels for attention computation
        - Replace torch.nn.functional.scaled_dot_product_attention
        - Full control over attention computation and memory layout
        - Better performance potential but requires more implementation effort

        Recommended: Implement Approach 2 for better performance gains.
        Use profiling to identify which attention operations benefit most from optimization.
        """
        # TODO: Choose and implement your Flash Attention approach

        # Approach 1: Simple (enable PyTorch built-in)
        # torch.backends.cuda.enable_flash_sdp(True)

        # Approach 2: Advanced (custom implementation - recommended)
        # from your_optimization import custom_flash_attention
        # torch.nn.functional.scaled_dot_product_attention = custom_flash_attention
        #
        # Or replace at layer level:
        # for layer in self._model.model.layers:
        #     layer.self_attn.forward = custom_attention_with_flash

        if 'flash_attention' not in self._optimizations_applied:
            self._optimizations_applied.append('flash_attention')

    def _apply_quantization(self):
        """
        Apply quantization to reduce model size and speed up inference.

        Optimization Directions:
        1. INT8 quantization (8-bit integer)
        2. FP8 quantization (8-bit floating point)
        3. Mixed precision quantization
        4. Dynamic vs static quantization

        Implementation Steps:
        1. Choose quantization strategy based on accuracy/performance trade-off
        2. Use quantization libraries (BitsAndBytes, TensorRT, etc.)
        3. Calibrate quantized model on validation data
        4. Verify accuracy preservation

        Note: Quantization may require reloading the model with quantization config.
        Consider applying quantization before other optimizations if model reload is needed.
        """
        # TODO: Implement your quantization here
        #
        # Example workflow:
        # 1. from transformers import BitsAndBytesConfig
        # 2. quantization_config = BitsAndBytesConfig(load_in_8bit=True)
        # 3. Note: May need to reload model with quantization config
        # 4. Test: Verify accuracy and performance improvements

        if 'quantization' not in self._optimizations_applied:
            self._optimizations_applied.append('quantization')

    # Required properties for benchmark
    @property
    def processor(self):
        """
        Required by benchmark for input processing.

        Benchmark uses this to prepare inputs with unified tokenizer.
        """
        return self._processor

    @property
    def model(self):
        """
        Required by benchmark for direct model.generate() calls.

        Benchmark directly calls self.model.generate() for performance testing.
        Your optimizations should modify this model object or its operators.
        """
        return self._model

    @property
    def device(self):
        """
        Required by benchmark for device information.
        """
        return self._device

    def generate(
        self,
        image: Image.Image,
        question: str,
        max_new_tokens: int = 128
    ) -> Dict:
        """
        Generate answer (optional method, mainly for debugging).

        Note: Benchmark uses self.model.generate() directly for performance testing.
        This method is provided for convenience and debugging purposes.

        Args:
            image: PIL Image object
            question: Question text
            max_new_tokens: Maximum tokens to generate

        Returns:
            Dict: {
                "text": str,        # Generated text answer
                "token_count": int  # Generated token count
            }
        """
        # Build Qwen3-VL message format
        messages = [{
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": question}
            ]
        }]

        # Process inputs
        inputs = self._processor.apply_chat_template(
            messages,
            tokenize=True,
            add_generation_prompt=True,
            return_dict=True,
            return_tensors="pt"
        ).to(self._device)

        # Generate
        with torch.no_grad():
            output_ids = self._model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=False,
                temperature=0.0,
                top_p=1.0,
                use_cache=True
            )

        # Extract generated tokens (remove input part)
        input_len = inputs.input_ids.shape[1]
        generated_ids = output_ids[0][input_len:]

        # Decode
        text = self._processor.tokenizer.decode(
            generated_ids,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False
        )

        return {
            "text": text,
            "token_count": len(generated_ids)
        }
@@ -1,364 +0,0 @@
import numpy as np
import torch

from transformers.models.qwen3_vl.processing_qwen3_vl import Qwen3VLProcessor, Qwen3VLProcessorKwargs
from transformers.models.qwen3_vl.modeling_qwen3_vl import Qwen3VLModelOutputWithPast, BaseModelOutputWithDeepstackFeatures
from transformers.feature_extraction_utils import BatchFeature
from transformers.image_utils import ImageInput
from transformers.processing_utils import Unpack
from transformers.tokenization_utils_base import PreTokenizedInput, TextInput
from transformers.utils import logging, TransformersKwargs, can_return_tuple
from transformers.video_utils import VideoInput
from transformers.cache_utils import Cache

logger = logging.get_logger(__name__)

class myQwen3VLProcessor(Qwen3VLProcessor):
    def __init__(self, image_processor=None, tokenizer=None, video_processor=None, chat_template=None, **kwargs):
        super().__init__(image_processor, tokenizer, video_processor, chat_template, **kwargs)

    def __call__(
        self,
        images: ImageInput = None,
        text: TextInput | PreTokenizedInput | list[TextInput] | list[PreTokenizedInput] = None,
        videos: VideoInput = None,
        **kwargs: Unpack[Qwen3VLProcessorKwargs],
    ) -> BatchFeature:
        r"""
        Returns:
            [`BatchFeature`]: A [`BatchFeature`] with the following fields:

            - **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
            - **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
              `return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
              `None`).
            - **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
            - **pixel_values_videos** -- Pixel values of videos to be fed to a model. Returned when `videos` is not `None`.
            - **image_grid_thw** -- List of image 3D grid in LLM. Returned when `images` is not `None`.
            - **video_grid_thw** -- List of video 3D grid in LLM. Returned when `videos` is not `None`.
        """
        output_kwargs = self._merge_kwargs(
            Qwen3VLProcessorKwargs,
            tokenizer_init_kwargs=self.tokenizer.init_kwargs,
            **kwargs,
        )

        if images is not None:
            image_inputs = self.image_processor(images=images, **output_kwargs["images_kwargs"])
            image_grid_thw = image_inputs["image_grid_thw"]
        else:
            image_inputs = {}
            image_grid_thw = None

        if videos is not None:
            videos_inputs = self.video_processor(videos=videos, **output_kwargs["videos_kwargs"])
            video_grid_thw = videos_inputs["video_grid_thw"]
            # If user has not requested video metadata, pop it
            if not kwargs.get("return_metadata"):
                video_metadata = videos_inputs.pop("video_metadata")
            else:
                video_metadata = videos_inputs["video_metadata"]
        else:
            videos_inputs = {}
            video_grid_thw = None

        if not isinstance(text, list):
            text = [text]

        text = text.copy()  # below lines change text in-place
        if image_grid_thw is not None:
            merge_length = self.image_processor.merge_size**2
            index = 0
            for i in range(len(text)):
                while self.image_token in text[i]:
                    # num_image_tokens = image_grid_thw[index].prod() // merge_length
                    num_image_tokens = 40  # fixed image-token budget (replaces the grid-based count above)
                    text[i] = text[i].replace(self.image_token, "<|placeholder|>" * num_image_tokens, 1)
                    index += 1
                text[i] = text[i].replace("<|placeholder|>", self.image_token)

        if video_grid_thw is not None:
            merge_length = self.video_processor.merge_size**2
            index = 0
            for i in range(len(text)):
                while self.video_token in text[i]:
                    metadata = video_metadata[index]
                    if metadata.fps is None:
                        logger.warning_once(
                            "Qwen3VL requires frame timestamps to construct prompts, but the `fps` of the input video could not be inferred. "
                            "Probably `video_metadata` was missing from inputs and you passed pre-sampled frames. "
                            "Defaulting to `fps=24`. Please provide `video_metadata` for more accurate results."
                        )
                    metadata.fps = 24 if metadata.fps is None else metadata.fps

                    # if timestamps are not provided, calculate them
                    curr_timestamp = self._calculate_timestamps(
                        metadata.frames_indices,
                        metadata.fps,
                        self.video_processor.temporal_patch_size,
                    )

                    video_placeholder = ""
                    frame_seqlen = video_grid_thw[index][1:].prod() // merge_length
                    for frame_idx in range(video_grid_thw[index][0]):
                        curr_time = curr_timestamp[frame_idx]
                        video_placeholder += f"<{curr_time:.1f} seconds>"
                        video_placeholder += (
                            self.vision_start_token + "<|placeholder|>" * frame_seqlen + self.vision_end_token
                        )
                    if f"{self.vision_start_token}{self.video_token}{self.vision_end_token}" in text[i]:
                        text[i] = text[i].replace(
                            f"{self.vision_start_token}{self.video_token}{self.vision_end_token}", video_placeholder, 1
                        )
                    else:
                        # vllm may input video token directly
                        text[i] = text[i].replace(self.video_token, video_placeholder, 1)
                    index += 1

                text[i] = text[i].replace("<|placeholder|>", self.video_token)

        return_tensors = output_kwargs["text_kwargs"].pop("return_tensors", None)
        return_mm_token_type_ids = output_kwargs["text_kwargs"].pop("return_mm_token_type_ids", None)
        text_inputs = self.tokenizer(text, **output_kwargs["text_kwargs"])
        self._check_special_mm_tokens(text, text_inputs, modalities=["image", "video"])

        if return_mm_token_type_ids:
            array_ids = np.array(text_inputs["input_ids"])
            mm_token_type_ids = np.zeros_like(text_inputs["input_ids"])
            mm_token_type_ids[array_ids == self.image_token_id] = 1
            text_inputs["mm_token_type_ids"] = mm_token_type_ids.tolist()

        return BatchFeature(data={**text_inputs, **image_inputs, **videos_inputs}, tensor_type=return_tensors)

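To make the fixed-budget expansion above concrete, here is a minimal standalone sketch of the two-step replace; the `<|image_pad|>` string is an assumed stand-in for `self.image_token`, and no model objects are needed:

```python
# Illustrative only: mirrors the processor's two-step token expansion,
# with an assumed image token string and the fixed 40-token budget.
image_token = "<|image_pad|>"  # assumption: stand-in for self.image_token
num_image_tokens = 40          # fixed budget instead of grid_thw-derived count
text = f"Describe this image: {image_token}"
# step 1: expand the single marker into 40 temporary placeholders
text = text.replace(image_token, "<|placeholder|>" * num_image_tokens, 1)
# step 2: turn the placeholders back into real image tokens
text = text.replace("<|placeholder|>", image_token)
```

The temporary `<|placeholder|>` indirection is what lets the `while self.image_token in text[i]` loop terminate: the freshly inserted tokens are not re-matched until the final replace.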
def _sample_indices_uniform(idx: torch.LongTensor, keep_ratio: float, min_keep: int = 0):
    """
    idx: 1D indices into the original sequence (sorted)
    keep_ratio: in [0, 1]; keep uniformly spaced indices
    """
    n = idx.numel()
    if n == 0:
        return idx
    k = max(min_keep, int(torch.ceil(torch.tensor(n * keep_ratio)).item()))
    k = min(k, n)
    if k == n:
        return idx
    # uniform pick: linspace over [0, n-1]
    pos = torch.linspace(0, n - 1, steps=k, device=idx.device)
    pos = pos.round().long().clamp(0, n - 1)
    return idx[pos]

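A quick standalone sanity check of the uniform pick used by `_sample_indices_uniform`, inlined here so it runs without the rest of the file (torch only; the numbers are illustrative):

```python
import torch

# 100 visual positions, keep_ratio = 0.25 -> k = ceil(100 * 0.25) = 25
idx = torch.arange(100, dtype=torch.long)
k = 25
pos = torch.linspace(0, idx.numel() - 1, steps=k).round().long().clamp(0, idx.numel() - 1)
kept = idx[pos]
# kept is 25 evenly spaced indices, endpoints 0 and 99 included
```

Because `linspace` always includes both endpoints, the first and last visual tokens of each segment survive pruning, which preserves the segment boundaries the timestamps anchor to.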
def sparse_keep_and_gather(
    inputs_embeds,            # (B, S, D)
    attention_mask,           # (B, S)
    position_ids,             # (4, B, S)
    visual_pos_masks,         # (B, S) bool
    deepstack_visual_embeds,  # list[Tensor], each (Nvis_total, D), or None
    keep_ratio: float = 0.25,
    min_keep_per_vis: int = 0,
    max_len: int | None = None,
):
    """
    Sparse retention: keep all text tokens; uniformly subsample visual tokens
    down to keep_ratio. Optional max_len: if the result is still too long,
    trim further from the visual tokens only (text is never dropped).
    """
    device = inputs_embeds.device
    B, S, D = inputs_embeds.shape
    eff = attention_mask.bool()

    keep_mask_token = torch.zeros((B, S), dtype=torch.bool, device=device)

    for b in range(B):
        eff_idx = eff[b].nonzero(as_tuple=False).squeeze(1)  # valid (non-padding) tokens
        if eff_idx.numel() == 0:
            continue

        vis_eff = visual_pos_masks[b, eff_idx]  # which valid tokens are visual
        text_idx = eff_idx[~vis_eff]            # kept in full
        vis_idx = eff_idx[vis_eff]              # candidates for subsampling

        # uniform visual subsampling (this step is what drops the middle tokens)
        kept_vis = _sample_indices_uniform(vis_idx, keep_ratio, min_keep=min_keep_per_vis)

        chosen = torch.cat([text_idx, kept_vis], dim=0)
        chosen, _ = torch.sort(chosen)  # preserve original order

        # optional hard length cap: trim visual tokens first, never text
        if max_len is not None and chosen.numel() > max_len:
            # visual positions already kept
            chosen_vis = chosen[visual_pos_masks[b, chosen]]
            chosen_txt = chosen[~visual_pos_masks[b, chosen]]
            # if text alone already exceeds max_len, truncate text (rare)
            if chosen_txt.numel() >= max_len:
                chosen = chosen_txt[:max_len]
            else:
                budget = max_len - chosen_txt.numel()
                # uniformly trim the remaining visual tokens down to the budget
                chosen_vis = _sample_indices_uniform(chosen_vis, budget / max(chosen_vis.numel(), 1))
                chosen = torch.cat([chosen_txt, chosen_vis], dim=0)
                chosen, _ = torch.sort(chosen)

        keep_mask_token[b, chosen] = True

    # ===== gather + pad to the max kept length in the batch =====
    keep_lens = keep_mask_token.sum(dim=1).tolist()
    max_keep = max(keep_lens) if keep_lens else 0

    new_inputs = inputs_embeds.new_zeros((B, max_keep, D))
    new_attn = attention_mask.new_zeros((B, max_keep))
    new_pos = position_ids.new_zeros((4, B, max_keep))
    new_vis = visual_pos_masks.new_zeros((B, max_keep), dtype=torch.bool)

    for b in range(B):
        idx = keep_mask_token[b].nonzero(as_tuple=False).squeeze(1)
        L = idx.numel()
        if L == 0:
            continue
        new_inputs[b, :L, :] = inputs_embeds[b, idx, :]
        new_attn[b, :L] = attention_mask[b, idx]
        new_pos[:, b, :L] = position_ids[:, b, idx]
        new_vis[b, :L] = visual_pos_masks[b, idx]

    # ===== prune deepstack features in sync (critical!) =====
    new_deepstack = None
    if deepstack_visual_embeds is not None:
        # deepstack rows are ordered like the True entries of the flattened
        # visual_pos_masks, so index them with keep_mask_token at those positions
        keep_vis_flat = keep_mask_token[visual_pos_masks]  # 1D bool, length = Nvis_total
        new_deepstack = [x[keep_vis_flat] for x in deepstack_visual_embeds]

    return new_inputs, new_attn, new_pos, new_vis, new_deepstack

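The deepstack pruning relies on one invariant: the rows of each `deepstack_visual_embeds` tensor are ordered exactly like the `True` entries of `visual_pos_masks` when flattened. A tiny standalone check of that indexing, with made-up shapes:

```python
import torch

vis = torch.tensor([[False, True, True, False, True]])    # visual_pos_masks, (B=1, S=5)
keep = torch.tensor([[False, True, False, False, True]])  # keep_mask_token after sampling
deepstack = torch.arange(12, dtype=torch.float32).view(3, 4)  # (Nvis_total=3, D=4)

# boolean-index the keep mask at visual positions: one flag per deepstack row
keep_vis_flat = keep[vis]            # one entry per True in vis, in flatten order
pruned = deepstack[keep_vis_flat]    # rows whose token was kept survive
```

If the two boolean masks ever disagreed on ordering (e.g. deepstack rows grouped per image rather than per flattened position), this slice would silently attach the wrong features to surviving tokens, so the invariant is worth keeping in mind when batching images and videos together.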
@can_return_tuple
def patch_forward(
    self,
    input_ids: torch.LongTensor = None,
    attention_mask: torch.Tensor | None = None,
    position_ids: torch.LongTensor | None = None,
    past_key_values: Cache | None = None,
    inputs_embeds: torch.FloatTensor | None = None,
    pixel_values: torch.Tensor | None = None,
    pixel_values_videos: torch.FloatTensor | None = None,
    image_grid_thw: torch.LongTensor | None = None,
    video_grid_thw: torch.LongTensor | None = None,
    cache_position: torch.LongTensor | None = None,
    **kwargs: Unpack[TransformersKwargs],
) -> tuple | Qwen3VLModelOutputWithPast:
    r"""
    image_grid_thw (`torch.LongTensor` of shape `(num_images, 3)`, *optional*):
        The temporal, height and width of feature shape of each image in LLM.
    video_grid_thw (`torch.LongTensor` of shape `(num_videos, 3)`, *optional*):
        The temporal, height and width of feature shape of each video in LLM.
    """
    if (input_ids is None) ^ (inputs_embeds is not None):
        raise ValueError("You must specify exactly one of input_ids or inputs_embeds")

    if inputs_embeds is None:
        inputs_embeds = self.get_input_embeddings()(input_ids)

    image_mask = None
    video_mask = None

    if pixel_values is not None:
        image_outputs: BaseModelOutputWithDeepstackFeatures = self.get_image_features(
            pixel_values, image_grid_thw, return_dict=True
        )
        image_embeds = image_outputs.pooler_output
        deepstack_image_embeds = image_outputs.deepstack_features
        image_embeds = torch.cat(image_embeds, dim=0).to(inputs_embeds.device, inputs_embeds.dtype)
        image_mask, _ = self.get_placeholder_mask(
            input_ids, inputs_embeds=inputs_embeds, image_features=image_embeds
        )
        inputs_embeds = inputs_embeds.masked_scatter(image_mask, image_embeds)

    if pixel_values_videos is not None:
        video_outputs: BaseModelOutputWithDeepstackFeatures = self.get_video_features(
            pixel_values_videos, video_grid_thw, return_dict=True
        )
        video_embeds = video_outputs.pooler_output
        deepstack_video_embeds = video_outputs.deepstack_features
        video_embeds = torch.cat(video_embeds, dim=0).to(inputs_embeds.device, inputs_embeds.dtype)
        _, video_mask = self.get_placeholder_mask(
            input_ids, inputs_embeds=inputs_embeds, video_features=video_embeds
        )
        inputs_embeds = inputs_embeds.masked_scatter(video_mask, video_embeds)

    visual_pos_masks = None
    deepstack_visual_embeds = None
    if image_mask is not None and video_mask is not None:
        # aggregate visual_pos_masks and deepstack_visual_embeds
        image_mask = image_mask[..., 0]
        video_mask = video_mask[..., 0]
        visual_pos_masks = image_mask | video_mask
        deepstack_visual_embeds = []
        image_mask_joint = image_mask[visual_pos_masks]
        video_mask_joint = video_mask[visual_pos_masks]
        for img_embed, vid_embed in zip(deepstack_image_embeds, deepstack_video_embeds):
            embed_joint = img_embed.new_zeros(visual_pos_masks.sum(), img_embed.shape[-1]).to(img_embed.device)
            embed_joint[image_mask_joint, :] = img_embed
            embed_joint[video_mask_joint, :] = vid_embed
            deepstack_visual_embeds.append(embed_joint)
    elif image_mask is not None:
        image_mask = image_mask[..., 0]
        visual_pos_masks = image_mask
        deepstack_visual_embeds = deepstack_image_embeds
    elif video_mask is not None:
        video_mask = video_mask[..., 0]
        visual_pos_masks = video_mask
        deepstack_visual_embeds = deepstack_video_embeds

    if position_ids is None:
        position_ids = self.compute_3d_position_ids(
            input_ids=input_ids,
            image_grid_thw=image_grid_thw,
            video_grid_thw=video_grid_thw,
            inputs_embeds=inputs_embeds,
            attention_mask=attention_mask,
            past_key_values=past_key_values,
        )

    # ====== sparse-sampling pruning: prefill only (no cached KV yet) ======
    is_prefill = past_key_values is None or past_key_values.get_seq_length() == 0
    if is_prefill and visual_pos_masks is not None:
        # these knobs can be supplied via kwargs
        keep_ratio = kwargs.pop("visual_keep_ratio", 0.1)  # keep only 10% of visual tokens by default
        min_keep = kwargs.pop("min_keep_per_vis", 0)       # minimum tokens kept per visual segment (e.g. 16)
        max_len = kwargs.pop("truncate_max_len", None)     # optional cap on total sequence length

        inputs_embeds, attention_mask, position_ids, visual_pos_masks, deepstack_visual_embeds = sparse_keep_and_gather(
            inputs_embeds=inputs_embeds,
            attention_mask=attention_mask,
            position_ids=position_ids,
            visual_pos_masks=visual_pos_masks,
            deepstack_visual_embeds=deepstack_visual_embeds,
            keep_ratio=keep_ratio,
            min_keep_per_vis=min_keep,
            max_len=max_len,
        )

        # rebuild cache_position as 0..L-1 to avoid alignment issues
        cache_position = torch.arange(
            inputs_embeds.shape[1], device=inputs_embeds.device, dtype=torch.long
        ).unsqueeze(0).expand(inputs_embeds.shape[0], -1)

        # recompute rope_deltas for the pruned sequence to keep decode positions consistent
        eff_len = attention_mask.sum(dim=1).to(torch.long)           # (B,)
        max_pos = position_ids.max(dim=0).values.max(dim=1).values   # (B,)
        self.rope_deltas = (max_pos + 1 - eff_len).unsqueeze(1)
    # ====== end of pruning ======

    outputs = self.language_model(
        input_ids=None,
        position_ids=position_ids,
        attention_mask=attention_mask,
        past_key_values=past_key_values,
        inputs_embeds=inputs_embeds,
        cache_position=cache_position,
        visual_pos_masks=visual_pos_masks,
        deepstack_visual_embeds=deepstack_visual_embeds,
        **kwargs,
    )

    return Qwen3VLModelOutputWithPast(
        **outputs,
        rope_deltas=self.rope_deltas,
    )
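A small worked example of the `rope_deltas` recomputation above (delta = max 3D position + 1 - effective length, per batch element); the position numbers are made up:

```python
import torch

# position_ids has shape (4, B, S); here B=1, S=4, with one axis running ahead
position_ids = torch.stack([
    torch.tensor([[0, 1, 2, 3]]),
    torch.tensor([[0, 2, 4, 6]]),  # e.g. a spatial axis with gaps after pruning
    torch.tensor([[0, 1, 2, 3]]),
    torch.tensor([[0, 1, 2, 3]]),
])
attention_mask = torch.ones(1, 4)

eff_len = attention_mask.sum(dim=1).to(torch.long)          # effective length per sample
max_pos = position_ids.max(dim=0).values.max(dim=1).values  # largest position over all axes
rope_deltas = (max_pos + 1 - eff_len).unsqueeze(1)
```

The delta is what the decode phase adds to `cache_position` to resume rotary positions where the (pruned) prefill left off, which is why it must be recomputed whenever the sequence is shortened.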