init commit
.gitignore (vendored, Normal file, 2 lines)
@@ -0,0 +1,2 @@
data/*
Qwen3-VL-2B-Instruct/*
README.md (Executable file, 342 lines)
@@ -0,0 +1,342 @@
# AICAS 2026 - Vision-Language Model Optimization Competition

## Table of Contents

- [Overview](#overview)
- [Code Structure](#code-structure)
- [Core Files](#core-files)
- [Quick Start](#quick-start)
- [Evaluation Metrics](#evaluation-metrics)
- [Competition Rules](#competition-rules)
- [Important Notes](#important-notes)
- [Submission Guidelines](#submission-guidelines)

## Overview

This competition focuses on optimizing Vision-Language Model (VLM) inference performance. Participants are required to modify the `VLMModel` class in `evaluation_wrapper.py` to achieve better Time-To-First-Token (TTFT) and Throughput while maintaining accuracy.
## Code Structure

```
AICASGC/
├── benchmark.py              # Benchmark script (not recommended to modify)
├── evaluation_wrapper.py     # Model wrapper (participants implement optimizations here)
├── requirements.txt          # Python dependencies
├── data/                     # Validation dataset
│   ├── data-*.arrow          # Dataset files
│   ├── dataset_info.json     # Dataset metadata
│   └── state.json            # Dataset state
├── Qwen3-VL-2B-Instruct/     # Model weights directory (participants need to download)
└── README.md / README_CN.md  # Documentation
```
## Core Files

- **`benchmark.py`** - Self-testing benchmark script (⚠️ **Not recommended to modify**)
- **`evaluation_wrapper.py`** - Model wrapper where participants implement optimizations
- **`Qwen3-VL-2B-Instruct/`** - Competition model weights (must be downloaded; see "Quick Start")
- **`data/`** - Validation dataset
- **`requirements.txt`** - Python dependencies
## Quick Start

### 0. Download Model (First Time)

The model files are large and must be downloaded separately. Create the model directory first, then download the model:

```bash
# Create model directory
mkdir -p Qwen3-VL-2B-Instruct

# Install huggingface_hub (if not installed)
pip install -U huggingface_hub

# Set mirror endpoint (recommended for users in China; faster download)
export HF_ENDPOINT=https://hf-mirror.com

# Download model to the specified directory
huggingface-cli download \
    --resume-download \
    Qwen/Qwen3-VL-2B-Instruct \
    --local-dir ./Qwen3-VL-2B-Instruct \
    --local-dir-use-symlinks False
```

**Note:**
- The model is approximately 4-5 GB, so the download may take some time
- If the download is interrupted, rerun the command and it will resume automatically (`--resume-download`)
- After the download completes, the `Qwen3-VL-2B-Instruct/` folder will contain all model files
- Ensure you have sufficient disk space (at least 5 GB)
### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Run Test

```bash
python benchmark.py \
    --model-path ./Qwen3-VL-2B-Instruct \
    --dataset-path ./data \
    --output result.json \
    --num-samples 100
```
### 3. Implement Your Optimizations

Edit the `VLMModel` class in `evaluation_wrapper.py`. The optimization architecture uses a **modular design**: each optimization direction corresponds to an independent method.

#### 3.1 Explore Model Structure (Optional)

Before starting optimizations, you can explore the model structure to understand the optimization targets:

```python
class VLMModel:
    def __init__(self, model_path: str, device: str = "cuda:0"):
        # ... load model ...

        # Optional: explore the model structure
        self._explore_model_structure()  # Prints model structure information
```
#### 3.2 Enable Optimization Methods

In the `__init__` method, enable/disable different optimizations by uncommenting/commenting the corresponding calls:

```python
class VLMModel:
    def __init__(self, model_path: str, device: str = "cuda:0"):
        # ... load model ...

        # ================================================================
        # Participant Optimization Area - Enable/disable optimization methods
        # ================================================================

        # 1. Vision Encoder Acceleration (optimize high-resolution image processing)
        # self._optimize_vision_encoder()

        # 2. KV Cache Management (optimize memory fragmentation during generation)
        # self._optimize_kv_cache()

        # 3. Cross-modal Connector Optimization
        # self._optimize_cross_modal_connector()

        # 4. Flash Attention Optimization
        # self._enable_flash_attention()

        # 5. Quantization Optimization
        # self._apply_quantization()
```
#### 3.3 Implement Optimization Code

Implement your optimization logic in each optimization method. For example, to optimize the Vision Encoder:

```python
def _optimize_vision_encoder(self):
    """Find this method in evaluation_wrapper.py and implement your optimization"""

    # Example: replace the attention operator
    # from your_optimization import optimized_attention
    # if hasattr(self._model, 'vision_model'):
    #     for layer in self._model.vision_model.encoder.layers:
    #         layer.self_attn.forward = optimized_attention

    # TODO: Implement your Vision Encoder optimization
    pass
```
**Important Notes:**
- The benchmark calls `self.model.generate()` directly for performance testing
- Your optimizations should modify `self.model` or its operators via monkey patching inside the optimization methods
- All optimization methods are called in `__init__`, so optimizations take effect automatically
- The `generate()` method is optional and mainly for debugging
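The monkey-patching idea can be sketched in plain Python. This is an illustrative toy (the `TinyAttention` class and `fast_forward` function are ours, not part of the competition code): the patched method must produce the same outputs as the original, only faster.

```python
import types

class TinyAttention:
    """Stand-in for a model submodule; illustrative only."""
    def forward(self, x):
        return [v * 2 for v in x]  # "slow" reference path

def fast_forward(self, x):
    # Hypothetical optimized kernel; must return identical results
    return [v + v for v in x]

layer = TinyAttention()
baseline = layer.forward([1, 2, 3])

# Monkey patch: bind the optimized function onto the existing instance
layer.forward = types.MethodType(fast_forward, layer)
optimized = layer.forward([1, 2, 3])

assert optimized == baseline  # correctness must be preserved
```

The same pattern applies to real submodules: keep a reference to the original `forward` if you need a fallback, and verify outputs match before relying on the patched path.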
### 4. Test Your Optimized Model

```bash
python benchmark.py \
    --model-path ./Qwen3-VL-2B-Instruct \
    --dataset-path ./data \
    --output result_optimized.json \
    --num-samples 100
```

### 5. Generate Full Results for Submission

```bash
python benchmark.py \
    --model-path ./Qwen3-VL-2B-Instruct \
    --dataset-path ./data \
    --output result.json \
    --num-samples 5000
```
## Evaluation Metrics

The final score is calculated as:

```
Final Score = 0.4 × Accuracy + 0.3 × TTFT_Improvement + 0.3 × Throughput_Improvement
```

### Metrics Explained

- **TTFT (Time To First Token)**: Time from input preparation to first token generation (in milliseconds)
  - Includes: image encoding, text encoding, cross-modal interaction, prefill stage, first token generation
  - Baseline: ~80 ms
  - Improvement = (Baseline - Your_TTFT) / Baseline

- **Throughput**: End-to-end token generation rate (tokens per second)
  - Baseline: ~55 tokens/sec
  - Improvement = (Your_Throughput - Baseline) / Baseline

- **Accuracy**: VQA accuracy on the validation set (5000 samples)
  - Soft matching against multiple ground-truth answers
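As a sketch, the scoring formula above can be computed directly from the baseline values in this section (the function name is ours; the official scorer may normalize differently):

```python
TTFT_BASELINE_MS = 80.0      # baseline TTFT from the metrics section
THROUGHPUT_BASELINE = 55.0   # baseline tokens/sec

def final_score(accuracy: float, ttft_ms: float, throughput: float) -> float:
    """Final Score = 0.4*Accuracy + 0.3*TTFT_Improvement + 0.3*Throughput_Improvement."""
    ttft_improvement = (TTFT_BASELINE_MS - ttft_ms) / TTFT_BASELINE_MS
    throughput_improvement = (throughput - THROUGHPUT_BASELINE) / THROUGHPUT_BASELINE
    return 0.4 * accuracy + 0.3 * ttft_improvement + 0.3 * throughput_improvement

# Example: 70% accuracy, TTFT improved to 60 ms, throughput raised to 66 tok/s
score = final_score(0.70, 60.0, 66.0)  # 0.28 + 0.075 + 0.06 = 0.415
```

Note that a TTFT slower than baseline or a throughput below baseline yields a negative improvement term, so regressions directly reduce the score.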
## Competition Rules

### Critical Rules

1. **Do not modify `benchmark.py`**
   - This benchmark script is for self-testing only
   - Final evaluation will use a separate official benchmark system
   - Modifying this file may lead to inconsistencies between your local results and the final evaluation results

2. **Only modify `evaluation_wrapper.py`**

3. **Maintain required properties**
   - The `VLMModel` class must expose `processor`, `model`, and `device` properties
   - The benchmark uses these properties to access the model and processor
   - The `generate()` method is optional and mainly for debugging

4. **Prohibited behaviors**
   - Do not hardcode answers
   - Do not modify the dataset
   - Do not use external APIs or services
   - All optimizations must be local and self-contained

### Optimization Directions

**Encouraged:**
- Operator replacement and kernel optimization: rewrite or replace standard operator implementations (such as Attention, LayerNorm, Conv2d) using Triton, CUDA C++, etc.
- Memory and cache optimization: optimize the KV Cache memory layout, reduce memory fragmentation, optimize GPU memory access patterns
- Compilation and graph optimization: use torch.compile for computation-graph optimization and custom kernel scheduling
- Attention mechanism optimization: implement Flash Attention, memory-efficient attention, sparse attention
- Generation process optimization: optimize decoding strategies, cache management, and generation configuration parameters

**Not Permitted:**
- Using external services: calling external APIs, cloud services, or any functionality requiring a network connection is prohibited
- Data and answer cheating: training on test data, pre-computing answers, and hardcoding outputs are prohibited
- Model replacement and tampering: participants should focus on operator-level optimization; do not train the model on additional datasets, change the model architecture, or directly modify weight values
- Overfitting optimization: conditional branches or special handling for specific evaluation samples are prohibited
- Black-box tool application: merely modifying configuration files without substantive code contributions is not recognized
- Environment manipulation: interfering with fair evaluation by modifying the system environment, locking the GPU frequency, etc. is prohibited
## Important Notes

### Sample Selection

- The provided `benchmark.py` uses a **fixed order** (the first N samples, starting from index 0)
- When you run `--num-samples 100`, it evaluates samples 0-99
- This ensures reproducibility for local self-testing
- **Note**: The official evaluation system used by the competition committee may employ different sampling strategies (including random sampling) for final verification
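The fixed-order selection described above amounts to a plain prefix slice. A minimal sketch (plain Python standing in for the actual dataset call; `select_samples` is our name):

```python
def select_samples(dataset, num_samples):
    """Fixed-order selection: always the first num_samples items from index 0."""
    return dataset[:num_samples]

# With --num-samples 100 this always picks samples 0..99,
# regardless of any random seed
samples = select_samples(list(range(5000)), 100)
```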
### Hardware Information

The benchmark automatically records detailed hardware information:
- Python version, PyTorch version, CUDA version
- GPU name, memory, compute capability
- CPU model, cores, frequency
- System information (OS, kernel, architecture)
- PPU information (if available)

This information is saved in `result.json` under `system_info` for statistical analysis.
### Performance Measurement

- **Warmup**: 10 samples are used for GPU warmup before the actual measurement
- **TTFT Measurement**: time from input preparation to the first token (includes all preprocessing)
- **Throughput Measurement**: end-to-end generation time for 128 tokens
- **State Isolation**: the GPU cache is cleared between measurements to ensure fairness
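The two timed phases can be sketched as follows. This is a simplified stand-in for the real benchmark: `run_model` is a placeholder for the generate call, and the real code additionally calls `torch.cuda.synchronize()` around each timing so GPU work is fully included.

```python
import time

def timed(fn):
    """Wall-clock a single call and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

def run_model(max_new_tokens):
    # Placeholder for model.generate(...); sleeps to simulate work
    time.sleep(0.001 * max_new_tokens)
    return list(range(max_new_tokens))  # pretend these are generated tokens

# Phase 1: TTFT = time to generate a single token (includes prefill)
_, ttft = timed(lambda: run_model(1))

# Phase 2: throughput = tokens generated / end-to-end time for 128 tokens
tokens, total = timed(lambda: run_model(128))
throughput = len(tokens) / total
```

Measuring the two phases as separate calls (with the cache cleared in between) is what keeps TTFT from contaminating the throughput number, and vice versa.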
### Random Seed

- The `--random-seed` parameter only affects PyTorch's random number generator
- It does **NOT** affect the sample selection order (which is always fixed)
- Use it to reproduce any randomness in model inference
### Output Format

The `result.json` file contains:
```json
{
  "system_info": {
    "timestamp": "...",
    "python_version": "...",
    "torch_version": "...",
    "cuda_version": "...",
    "gpu_name": "...",
    ...
  },
  "performance": {
    "avg_ttft_ms": 90.55,
    "avg_throughput_tokens_per_sec": 57.77
  },
  "answers": [
    {
      "question_id": 34602,
      "prediction": "your answer text here"
    },
    ...
  ]
}
```
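Before submitting, it may help to sanity-check that your `result.json` has this structure. A minimal sketch (the required keys are taken from the format above; the helper name `check_result` is ours):

```python
import json

REQUIRED_TOP_LEVEL = {"system_info", "performance", "answers"}
REQUIRED_PERF = {"avg_ttft_ms", "avg_throughput_tokens_per_sec"}

def check_result(result: dict) -> bool:
    """Return True if the result dict has the fields the submission format requires."""
    if not REQUIRED_TOP_LEVEL <= result.keys():
        return False
    if not REQUIRED_PERF <= result["performance"].keys():
        return False
    return all({"question_id", "prediction"} <= a.keys() for a in result["answers"])

sample = {
    "system_info": {"timestamp": "2026-01-01T00:00:00"},
    "performance": {"avg_ttft_ms": 90.55, "avg_throughput_tokens_per_sec": 57.77},
    "answers": [{"question_id": 34602, "prediction": "a cat"}],
}
# Round-trip through JSON, as the file on disk would be
ok = check_result(json.loads(json.dumps(sample)))
```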
## Submission Guidelines

### Required Files for Preliminary Submission

1. **`result.json`** - Generated by running `benchmark.py`
   - Contains predictions for all samples
   - Must include valid `performance` metrics
   - **Important**: The `result.json` uploaded to the Tianchi platform is for reference only. Final scores will be computed by the competition committee using standardized hardware and the official evaluation system.

2. **Your optimized code** - `evaluation_wrapper.py` containing your optimized `VLMModel` class

3. **Docker image** - A container with your optimized environment

### Evaluation Process

1. **Self-Testing**: Use the provided `benchmark.py` to test your optimizations locally
2. **Submission**: Upload your `result.json` to the Tianchi platform (for reference only)
3. **Official Evaluation**: The competition committee will evaluate your code using:
   - Your submitted Docker image
   - A standardized hardware environment
   - The official evaluation code
   - The full validation set, with random sampling for verification
4. **Final Ranking**: Based on the final score calculated by the official evaluation system

## Good Luck!

We hope you will focus on operator-level optimization, kernel replacement, and efficient memory management. Remember: accuracy and speed are equally important! Good luck!
README_CN.md (Executable file, 348 lines)
@@ -0,0 +1,348 @@
# AICAS 2026 - Efficient VLM Inference and Optimization on AI Chips Track

## Table of Contents
- [Overview](#overview)
- [Code Structure](#code-structure)
- [Core Files](#core-files)
- [Quick Start](#quick-start)
- [Evaluation Metrics](#evaluation-metrics)
- [Competition Rules](#competition-rules)
- [Important Notes](#important-notes)
- [Submission Guidelines](#submission-guidelines)

## Overview

This competition focuses on optimizing Vision-Language Model (VLM) inference performance. Participants are required to modify the `VLMModel` class in `evaluation_wrapper.py` to improve Time-To-First-Token (TTFT) and Throughput while maintaining accuracy.

## Code Structure

```
AICASGC/
├── benchmark.py              # Benchmark script
├── evaluation_wrapper.py     # Model wrapper (participants implement optimizations here)
├── requirements.txt          # Python dependencies
├── data/                     # Validation dataset
│   ├── data-*.arrow          # Dataset files
│   ├── dataset_info.json     # Dataset metadata
│   └── state.json            # Dataset state
├── Qwen3-VL-2B-Instruct/     # Model weights directory (participants need to download)
└── README.md / README_CN.md  # Documentation
```

## Core Files

- **`benchmark.py`** - Self-testing benchmark script (⚠️ **Not recommended to modify**)
- **`evaluation_wrapper.py`** - Model wrapper where participants implement optimizations
- **`Qwen3-VL-2B-Instruct/`** - Competition model weights (must be downloaded; see "Quick Start")
- **`data/`** - Validation dataset
- **`requirements.txt`** - Python dependencies

## Quick Start

### 0. Download Model (First Time)

The model files are large and must be downloaded separately. Create the model directory first, then download the model:

```bash
# Create model directory
mkdir -p Qwen3-VL-2B-Instruct

# Install huggingface_hub (if not installed)
pip install -U huggingface_hub

# Set mirror endpoint (recommended for users in China; faster download)
export HF_ENDPOINT=https://hf-mirror.com

# Download model to the specified directory
huggingface-cli download \
    --resume-download \
    Qwen/Qwen3-VL-2B-Instruct \
    --local-dir ./Qwen3-VL-2B-Instruct \
    --local-dir-use-symlinks False
```

**Note:**
- The model is approximately 4-5 GB, so the download may take some time
- If the download is interrupted, rerun the command and it will resume automatically (`--resume-download`)
- After the download completes, the `Qwen3-VL-2B-Instruct/` folder will contain all model files
- Ensure you have sufficient disk space (at least 5 GB)

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Run Test

```bash
python benchmark.py \
    --model-path ./Qwen3-VL-2B-Instruct \
    --dataset-path ./data \
    --output result.json \
    --num-samples 100
```

### 3. Implement Your Optimizations

Edit the `VLMModel` class in `evaluation_wrapper.py`. The optimizations use a **modular design**: each optimization direction corresponds to an independent method.

#### 3.1 Explore Model Structure (Optional)

Before starting optimizations, you can explore the model structure to understand the optimization targets:

```python
class VLMModel:
    def __init__(self, model_path: str, device: str = "cuda:0"):
        # ... load model ...

        # Optional: explore the model structure
        self._explore_model_structure()  # Prints model structure information
```

#### 3.2 Enable Optimization Methods

In the `__init__` method, enable/disable different optimizations by uncommenting/commenting the corresponding calls:

```python
class VLMModel:
    def __init__(self, model_path: str, device: str = "cuda:0"):
        # ... load model ...

        # ================================================================
        # Participant Optimization Area - Enable/disable optimization methods
        # ================================================================

        # 1. Vision Encoder Acceleration (optimize high-resolution image processing)
        # self._optimize_vision_encoder()

        # 2. KV Cache Management (optimize memory fragmentation during generation)
        # self._optimize_kv_cache()

        # 3. Cross-modal Connector Optimization
        # self._optimize_cross_modal_connector()

        # 4. Flash Attention Optimization
        # self._enable_flash_attention()

        # 5. Quantization Optimization
        # self._apply_quantization()
```

#### 3.3 Implement Optimization Code

Implement your optimization logic in each optimization method. For example, to optimize the Vision Encoder:

```python
def _optimize_vision_encoder(self):
    """Find this method in evaluation_wrapper.py and implement your optimization"""

    # Example: replace the attention operator
    # from your_optimization import optimized_attention
    # if hasattr(self._model, 'vision_model'):
    #     for layer in self._model.vision_model.encoder.layers:
    #         layer.self_attn.forward = optimized_attention

    # TODO: Implement your Vision Encoder optimization
    pass
```

### 4. Test Your Optimized Model

```bash
python benchmark.py \
    --model-path ./Qwen3-VL-2B-Instruct \
    --dataset-path ./data \
    --output result_optimized.json \
    --num-samples 100
```

### 5. Generate Full Results for Submission

```bash
python benchmark.py \
    --model-path ./Qwen3-VL-2B-Instruct \
    --dataset-path ./data \
    --output result.json \
    --num-samples 5000
```

## Evaluation Metrics

The final score is calculated as:

```
Final Score = 0.4 × Accuracy + 0.3 × TTFT_Improvement + 0.3 × Throughput_Improvement
```

### Metrics Explained

- **TTFT (Time To First Token)**: Time from input preparation to first token generation (in milliseconds)
  - Includes: image encoding, text encoding, cross-modal interaction, prefill stage, first token generation
  - Baseline: ~80 ms
  - Improvement = (Baseline - Your_TTFT) / Baseline

- **Throughput**: End-to-end token generation rate (tokens per second)
  - Baseline: ~55 tokens/sec
  - Improvement = (Your_Throughput - Baseline) / Baseline

- **Accuracy**: VQA accuracy on the validation set (5000 samples)
  - Soft matching against multiple ground-truth answers

## Competition Rules

### Critical Rules

1. **Do not modify `benchmark.py`**
   - This benchmark script is for self-testing only
   - Final evaluation will use a separate official benchmark system
   - Modifying this file may lead to inconsistencies between your local results and the final evaluation results

2. **Only modify `evaluation_wrapper.py`**

3. **Maintain required properties**
   - The `VLMModel` class must expose `processor`, `model`, and `device` properties
   - The benchmark uses these properties to access the model and processor
   - The `generate()` method is optional and mainly for debugging

4. **Prohibited behaviors**
   - Do not hardcode answers
   - Do not modify the dataset
   - Do not use external APIs or services
   - All optimizations must be local and self-contained

### Optimization Directions

**Encouraged:**
- Operator replacement and kernel optimization: rewrite or replace standard operator implementations (such as Attention, LayerNorm, Conv2d) using Triton, CUDA C++, etc.
- Memory and cache optimization: optimize the KV Cache memory layout, reduce memory fragmentation, optimize GPU memory access patterns
- Compilation and graph optimization: use torch.compile for computation-graph optimization and custom kernel scheduling
- Attention mechanism optimization: implement Flash Attention, memory-efficient attention, sparse attention
- Generation process optimization: optimize decoding strategies, cache management, and generation configuration parameters

**Not Permitted:**
- Using external services: calling external APIs, cloud services, or any functionality requiring a network connection is prohibited
- Data and answer cheating: training on test data, pre-computing answers, and hardcoding outputs are prohibited
- Model replacement and tampering: participants should focus on operator-level optimization; do not train the model on additional datasets, change the model architecture, or directly modify weight values
- Overfitting optimization: conditional branches or special handling for specific evaluation samples are prohibited
- Black-box tool application: merely modifying configuration files without substantive code contributions is not recognized
- Environment manipulation: interfering with fair evaluation by modifying the system environment, locking the GPU frequency, etc. is prohibited

## Important Notes

### Sample Selection

- The provided `benchmark.py` uses a **fixed order** (the first N samples, starting from index 0)
- When you run `--num-samples 100`, it evaluates samples 0-99
- This ensures reproducibility for local self-testing
- **Note**: The official evaluation system used by the competition committee may employ different sampling strategies (including random sampling) for final verification

### Hardware Information

The benchmark automatically records detailed hardware information:
- Python version, PyTorch version, CUDA version
- GPU name, memory, compute capability
- CPU model, cores, frequency
- System information (OS, kernel, architecture)
- PPU information (if available)

This information is saved in `result.json` under `system_info` for statistical analysis.

### Performance Measurement

- **Warmup**: 10 samples are used for GPU warmup before the actual measurement
- **TTFT Measurement**: time from input preparation to the first token (includes all preprocessing)
- **Throughput Measurement**: end-to-end generation time for 128 tokens
- **State Isolation**: the GPU cache is cleared between measurements to ensure fairness

### Random Seed

- The `--random-seed` parameter only affects PyTorch's random number generator
- It does **NOT** affect the sample selection order (which is always fixed)
- Use it to reproduce any randomness in model inference

### Output Format

The `result.json` file contains:
```json
{
  "system_info": {
    "timestamp": "...",
    "python_version": "...",
    "torch_version": "...",
    "cuda_version": "...",
    "gpu_name": "...",
    ...
  },
  "performance": {
    "avg_ttft_ms": 90.55,
    "avg_throughput_tokens_per_sec": 57.77
  },
  "answers": [
    {
      "question_id": 34602,
      "prediction": "your answer text here"
    },
    ...
  ]
}
```

## Submission Guidelines

### Required Files for Preliminary Submission

1. **`result.json`** - Generated by running `benchmark.py`
   - Contains predictions for all samples
   - Must include valid `performance` metrics
   - **Important**: The `result.json` uploaded to the Tianchi platform is for reference only. Final scores will be computed by the competition committee using standardized hardware and the official evaluation system.

2. **Your optimized code** - `evaluation_wrapper.py` containing your optimized `VLMModel` class

3. **Docker image** - A container with your optimized environment

### Evaluation Process

1. **Self-Testing**: Use the provided `benchmark.py` to test your optimizations locally
2. **Submission**: Upload your `result.json` to the Tianchi platform (for reference only)
3. **Official Evaluation**: The competition committee will evaluate your code using:
   - Your submitted Docker image
   - A standardized hardware environment
   - The official evaluation code
   - The full validation set, with random sampling for verification
4. **Final Ranking**: Based on the final score calculated by the official evaluation system

## Good Luck!

We hope you will focus on operator-level optimization, kernel replacement, and efficient memory management. Remember: accuracy and speed are equally important! Good luck!
benchmark.py (Executable file, 613 lines)
@@ -0,0 +1,613 @@
#!/usr/bin/env python3
"""
AICAS 2026 - Self-Testing Benchmark Tool

Measures TTFT and Throughput, generates result.json for self-testing.

Note: It is recommended not to modify this file. This benchmark is intended for
self-testing purposes only. The final evaluation will be conducted using a
separate official benchmark system on standardized hardware by the competition
committee.
"""
import sys
import json
import time
import argparse
import platform
import subprocess
from datetime import datetime
from pathlib import Path

import torch
from PIL import Image
from datasets import load_from_disk
from tqdm import tqdm

try:
    import psutil
    HAS_PSUTIL = True
except ImportError:
    HAS_PSUTIL = False

from evaluation_wrapper import VLMModel

# Fixed parameters - not recommended to modify
MAX_NEW_TOKENS = 128          # Token length for performance testing
ACCURACY_MAX_TOKENS = 1024    # Token length for accuracy testing
WARMUP_SAMPLES = 10           # Warmup samples for GPU stabilization
PERFORMANCE_SAMPLES = None    # Performance test samples (None = all samples)
VAL_SAMPLES = 5000            # Total validation samples
def get_system_info() -> dict:
    """Collect system information (hardware and software environment)"""
    info = {
        "timestamp": datetime.now().isoformat(),
    }

    # Python environment
    info["python_version"] = sys.version.split()[0]
    info["python_full_version"] = sys.version

    # PyTorch information
    info["torch_version"] = torch.__version__

    # CUDA information
    if torch.cuda.is_available():
        info["cuda_available"] = True
        info["cuda_version"] = torch.version.cuda if hasattr(torch.version, 'cuda') else "N/A"
        try:
            if torch.backends.cudnn.is_available():
                info["cudnn_version"] = str(torch.backends.cudnn.version())
            else:
                info["cudnn_version"] = "N/A"
        except Exception:
            info["cudnn_version"] = "N/A"

        # GPU information
        info["gpu_count"] = torch.cuda.device_count()
        info["gpu_name"] = torch.cuda.get_device_name(0)

        # GPU memory
        try:
            gpu_memory = torch.cuda.get_device_properties(0).total_memory / (1024**3)  # GB
            info["gpu_memory_gb"] = round(gpu_memory, 2)
        except Exception:
            info["gpu_memory_gb"] = "N/A"

        # GPU compute capability
        try:
            props = torch.cuda.get_device_properties(0)
            info["gpu_compute_capability"] = f"{props.major}.{props.minor}"
        except Exception:
            info["gpu_compute_capability"] = "N/A"
    else:
        info["cuda_available"] = False
        info["cuda_version"] = "N/A"
        info["gpu_count"] = 0
        info["gpu_name"] = "N/A"

    # CPU information
    info["cpu_processor"] = platform.processor() or "N/A"

    if HAS_PSUTIL:
        try:
            info["cpu_count_physical"] = psutil.cpu_count(logical=False)
            info["cpu_count_logical"] = psutil.cpu_count(logical=True)
            cpu_freq = psutil.cpu_freq()
            if cpu_freq:
                info["cpu_freq_mhz"] = round(cpu_freq.current, 2) if cpu_freq.current else "N/A"
            else:
                info["cpu_freq_mhz"] = "N/A"
        except Exception:
            info["cpu_count_physical"] = "N/A"
            info["cpu_count_logical"] = "N/A"
            info["cpu_freq_mhz"] = "N/A"
    else:
        info["cpu_count_physical"] = "N/A"
        info["cpu_count_logical"] = "N/A"
        info["cpu_freq_mhz"] = "N/A"

    # Try to get the CPU model from /proc/cpuinfo (Linux)
    try:
        if platform.system() == "Linux":
            with open("/proc/cpuinfo", "r") as f:
                for line in f:
                    if "model name" in line.lower():
                        info["cpu_model"] = line.split(":")[1].strip()
                        break
                    elif "Processor" in line and ":" in line:
                        info["cpu_model"] = line.split(":")[1].strip()
                        break
    except Exception:
        pass

    if "cpu_model" not in info:
        info["cpu_model"] = platform.processor() or "N/A"

    # System information
    info["platform_system"] = platform.system()
    info["platform_release"] = platform.release()
    info["platform_version"] = platform.version()
    info["platform_machine"] = platform.machine()
    info["platform_architecture"] = platform.architecture()[0]

    # PPU information (if available)
    info["ppu_available"] = False
    info["ppu_info"] = {}

    # Check for PPU-related devices
    try:
        if torch.cuda.is_available():
            gpu_name = torch.cuda.get_device_name(0).lower()
            if "ppu" in gpu_name or "pu" in gpu_name:
                info["ppu_available"] = True
                info["ppu_info"] = {
                    "name": torch.cuda.get_device_name(0),
                    "type": "detected_from_gpu_name"
                }
    except Exception:
        pass

    # Try to get detailed GPU info via nvidia-smi (if available)
    if torch.cuda.is_available() and platform.system() == "Linux":
        try:
            result = subprocess.run(
                ["nvidia-smi", "--query-gpu=name,driver_version,memory.total", "--format=csv,noheader"],
                capture_output=True,
                text=True,
                timeout=5
            )
            if result.returncode == 0:
                lines = result.stdout.strip().split("\n")
                if lines:
                    parts = lines[0].split(",")
                    if len(parts) >= 3:
                        info["gpu_driver_version"] = parts[1].strip()
                        info["gpu_memory_total"] = parts[2].strip()
        except Exception:
            pass

    # Memory information
    if HAS_PSUTIL:
        try:
            mem = psutil.virtual_memory()
            info["memory_total_gb"] = round(mem.total / (1024**3), 2)
            info["memory_available_gb"] = round(mem.available / (1024**3), 2)
        except Exception:
            pass

    return info
def measure_performance(model: VLMModel, image: Image.Image, question: str) -> tuple:
|
||||
"""
|
||||
Measure performance metrics (TTFT and Throughput)
|
||||
|
||||
TTFT measurement: Full model call time (generating 1 token)
|
||||
Includes: image encoding, text encoding, cross-modal interaction, prefill, first token generation
|
||||
|
||||
Args:
|
||||
model: VLMModel instance (must expose processor and model attributes)
|
||||
image: PIL Image
|
||||
question: Question text
|
||||
|
||||
Returns:
|
||||
tuple: (ttft, throughput, token_count)
|
||||
"""
|
||||
if not hasattr(model, 'processor') or not hasattr(model, 'model'):
|
||||
raise AttributeError("Model must expose 'processor' and 'model' attributes")
|
||||
|
||||
processor = model.processor
|
||||
device = model.device
|
||||
model_obj = model.model
|
||||
|
||||
# Clear GPU state
|
||||
if torch.cuda.is_available():
|
||||
torch.cuda.empty_cache()
|
||||
torch.cuda.synchronize()
|
||||
|
||||
# Prepare inputs
|
||||
messages = [{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "image", "image": image},
|
||||
{"type": "text", "text": question}
|
||||
]
|
||||
}]
|
||||
|
||||
inputs = processor.apply_chat_template(
|
||||
messages,
|
||||
tokenize=True,
|
||||
add_generation_prompt=True,
|
||||
return_dict=True,
|
||||
return_tensors="pt"
|
||||
).to(device)
|
||||
|
||||
input_len = inputs.input_ids.shape[1]
|
||||
|
||||
# Step 1: Measure TTFT (generate 1 token, includes all preprocessing)
|
||||
try:
|
||||
torch.cuda.synchronize()
|
||||
start_ttft = time.perf_counter()
|
||||
|
||||
# Direct call to underlying model
|
||||
with torch.no_grad():
|
||||
output_ids_ttft = model_obj.generate(
|
||||
**inputs,
|
||||
max_new_tokens=1,
|
||||
do_sample=False,
|
||||
temperature=0.0,
|
||||
use_cache=True
|
||||
)
|
||||
|
||||
torch.cuda.synchronize()
|
||||
ttft = time.perf_counter() - start_ttft
|
||||
|
||||
except torch.cuda.OutOfMemoryError as e:
|
||||
if torch.cuda.is_available():
|
||||
torch.cuda.empty_cache()
|
||||
torch.cuda.synchronize()
|
||||
print(f"[Error] OOM during TTFT measurement: {e}")
|
||||
return float('inf'), 0.0, 0
|
||||
except Exception as e:
|
||||
print(f"[Error] Error during TTFT measurement: {e}")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
return float('inf'), 0.0, 0
|
||||
|
||||
# Clear state
|
||||
if torch.cuda.is_available():
|
||||
torch.cuda.empty_cache()
|
||||
torch.cuda.synchronize()
|
||||
time.sleep(0.005) # Ensure state reset
|
||||
|
||||
    # Step 2: Measure full generation (for Throughput)
    try:
        torch.cuda.synchronize()
        start_full = time.perf_counter()

        # Direct call to underlying model
        with torch.no_grad():
            output_ids = model_obj.generate(
                **inputs,
                max_new_tokens=MAX_NEW_TOKENS,
                do_sample=False,
                temperature=0.0,
                use_cache=True
            )

        torch.cuda.synchronize()
        total_time = time.perf_counter() - start_full

        # Extract generated tokens
        generated_ids = output_ids[0][input_len:]
        token_count = len(generated_ids)

    except torch.cuda.OutOfMemoryError as e:
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            torch.cuda.synchronize()
        print(f"[Error] OOM during full generation: {e}")
        return ttft, 0.0, 0
    except Exception as e:
        print(f"[Error] Error during full generation: {e}")
        import traceback
        traceback.print_exc()
        return ttft, 0.0, 0

    # Calculate throughput
    if total_time > 0.001 and token_count > 0:
        throughput = token_count / total_time
    else:
        throughput = 0.0

    return ttft, throughput, token_count

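# The per-sample metrics above are later averaged into the reported numbers
# (mean TTFT in ms, mean throughput in tokens/sec). A minimal standalone
# sketch of that aggregation; `summarize_metrics` and the mock timings are
# illustrative only, not part of the benchmark code:
def summarize_metrics(ttfts_s, throughputs):
    """Average per-sample TTFTs (seconds) and throughputs (tokens/sec)."""
    if not ttfts_s:
        return None, 0.0
    avg_ttft_ms = sum(ttfts_s) / len(ttfts_s) * 1000  # convert seconds -> ms
    avg_tps = sum(throughputs) / len(throughputs)
    return round(avg_ttft_ms, 2), round(avg_tps, 2)

# Example with three mock samples:
# summarize_metrics([0.120, 0.100, 0.110], [40.0, 50.0, 45.0]) -> (110.0, 45.0)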
def generate_answer(model: VLMModel, image: Image.Image, question: str, max_new_tokens: int = ACCURACY_MAX_TOKENS) -> dict:
    """
    Generate full answer (for accuracy evaluation)

    Args:
        model: VLMModel instance
        image: PIL Image
        question: Question text
        max_new_tokens: Maximum tokens to generate

    Returns:
        dict: {"text": str, "token_count": int}
    """
    if not hasattr(model, 'processor') or not hasattr(model, 'model'):
        # Fallback: use the model's own generate method
        return model.generate(image, question, max_new_tokens=max_new_tokens)

    processor = model.processor
    device = model.device
    model_obj = model.model

    # Prepare inputs
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": question}
        ]
    }]

    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt"
    ).to(device)

    input_len = inputs.input_ids.shape[1]

    # Generate answer using underlying model
    with torch.no_grad():
        output_ids = model_obj.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            temperature=0.0,
            use_cache=True
        )

    # Extract generated tokens
    generated_ids = output_ids[0][input_len:]
    text = processor.tokenizer.decode(
        generated_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )

    return {
        "text": text,
        "token_count": len(generated_ids)
    }

def run_benchmark(
    model_class,
    model_path: str,
    dataset_path: str,
    output_path: str,
    num_samples: int = None,
    random_seed: int = None
):
    """
    Run benchmark evaluation

    Process:
    1. Load participant model
    2. Measure TTFT and Throughput
    3. Generate answers
    4. Calculate statistics
    5. Save results

    Args:
        random_seed: Random seed for reproducibility
    """
    # Set random seed (if provided)
    if random_seed is not None:
        import random
        import numpy as np
        random.seed(random_seed)
        np.random.seed(random_seed)
        torch.manual_seed(random_seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(random_seed)

    # Clear GPU cache
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

    # Load dataset
    print("=" * 60)
    print("AICAS 2026 Benchmark Tool")
    print("=" * 60)
    print(f"\nLoading dataset from: {dataset_path}")

    dataset = load_from_disk(dataset_path)
    total_samples = num_samples or min(VAL_SAMPLES, len(dataset))

    # Performance test samples
    if PERFORMANCE_SAMPLES is None:
        perf_samples = total_samples  # Test all samples
    else:
        perf_samples = min(PERFORMANCE_SAMPLES, total_samples)

    print(f"Total samples: {total_samples}")
    print(f"Performance test samples: {perf_samples}")

    # Prepare samples (fixed order: first N samples)
    samples = []
    for i in range(total_samples):
        item = dataset[i]
        samples.append({
            "question_id": item.get("question_id", i),
            "image": item["image"],
            "question": item["question"],
        })

    results = {
        "system_info": get_system_info(),
        "performance": {},
        "answers": []
    }

    # Load and test participant model
    print("\n" + "=" * 60)
    print("Running Model Benchmark")
    print("=" * 60)

    model = model_class(model_path)

    # Warmup
    print(f"\nWarming up ({WARMUP_SAMPLES} samples)...")
    for i in range(min(WARMUP_SAMPLES, len(samples))):
        try:
            generate_answer(model, samples[i]["image"], samples[i]["question"], max_new_tokens=10)
        except Exception as e:
            print(f"[Warning] Warmup sample {i} failed: {e}")

    # Clear state after warmup
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

    # Performance testing + answer generation
    ttfts = []
    throughputs = []
    predictions = []

    print("\nMeasuring performance & generating answers...")

    # Performance test samples: measure performance + generate full answers
    for sample in tqdm(samples[:perf_samples], desc="Performance"):
        # Clear state before each measurement for fairness
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            torch.cuda.synchronize()

        try:
            # Step 1: Measure performance
            ttft, throughput, token_count = measure_performance(
                model, sample["image"], sample["question"]
            )

            # Check for failures
            if ttft == float('inf') or throughput == 0.0:
                print(f"[Warning] Sample {sample['question_id']} failed (TTFT={ttft}, Throughput={throughput})")
            else:
                ttfts.append(ttft)
                throughputs.append(throughput)

            # Clear state again before generating full answer
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
                torch.cuda.synchronize()

            # Step 2: Generate full answer (for accuracy evaluation)
            try:
                result_full = generate_answer(
                    model,
                    sample["image"],
                    sample["question"],
                    max_new_tokens=ACCURACY_MAX_TOKENS
                )
                predictions.append({
                    "question_id": sample["question_id"],
                    "prediction": result_full["text"]
                })
            except Exception as e:
                print(f"[Error] Error generating full answer for sample {sample['question_id']}: {e}")
                predictions.append({
                    "question_id": sample["question_id"],
                    "prediction": ""
                })

        except Exception as e:
            print(f"[Error] Sample {sample['question_id']} failed: {e}")
            predictions.append({
                "question_id": sample["question_id"],
                "prediction": ""
            })
            continue

    # If there are remaining samples, only generate answers
    if total_samples > perf_samples:
        for sample in tqdm(samples[perf_samples:], desc="Accuracy"):
            try:
                result = generate_answer(
                    model,
                    sample["image"],
                    sample["question"],
                    max_new_tokens=ACCURACY_MAX_TOKENS
                )
                predictions.append({
                    "question_id": sample["question_id"],
                    "prediction": result["text"]
                })
            except Exception as e:
                print(f"[Error] Error generating answer for sample {sample['question_id']}: {e}")
                predictions.append({
                    "question_id": sample["question_id"],
                    "prediction": ""
                })

    # Calculate statistics
    if len(ttfts) > 0:
        avg_ttft = sum(ttfts) / len(ttfts) * 1000  # Convert to ms
        avg_throughput = sum(throughputs) / len(throughputs)
    else:
        avg_ttft = float('inf')
        avg_throughput = 0.0

    # Build performance results
    performance = {
        "avg_ttft_ms": round(avg_ttft, 2) if avg_ttft != float('inf') else None,
        "avg_throughput_tokens_per_sec": round(avg_throughput, 2),
    }

    results["performance"] = performance
    results["answers"] = predictions

    # Print summary
    if len(ttfts) > 0:
        print(f"\n✓ TTFT: {avg_ttft:.2f} ms")
        print(f"✓ Throughput: {avg_throughput:.2f} tokens/sec")
    else:
        print("\n✗ All samples failed!")

    # Save results
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)

    print("\n" + "=" * 60)
    print("Benchmark Complete!")
    print("=" * 60)
    print("\n📊 Results Summary:")
    if len(ttfts) > 0:
        print(f"  TTFT: {avg_ttft:.2f} ms")
        print(f"  Throughput: {avg_throughput:.2f} tokens/sec")
    else:
        print("  ⚠ All samples failed!")
    print(f"  Samples evaluated: {total_samples}")
    print(f"\n💾 Results saved to: {output_path}")

    return results

def main():
    parser = argparse.ArgumentParser(description="AICAS 2026 Benchmark Tool")
    parser.add_argument("--model-path", type=str, default="./Qwen3-VL-2B-Instruct", help="Path to model weights")
    parser.add_argument("--dataset-path", type=str, default="./data", help="Path to validation dataset")
    parser.add_argument("--output", type=str, default="result.json", help="Output JSON file path")
    parser.add_argument("--num-samples", type=int, default=None, help="Number of samples to evaluate (default: all)")
    parser.add_argument("--random-seed", type=int, default=None, help="Random seed for reproducibility")

    args = parser.parse_args()

    # Use VLMModel (participants modify this class in evaluation_wrapper.py)
    print("=" * 60)
    print("Using VLMModel (modify evaluation_wrapper.py to add optimizations)")
    print("=" * 60)

    # Run benchmark
    run_benchmark(
        model_class=VLMModel,
        model_path=args.model_path,
        dataset_path=args.dataset_path,
        output_path=args.output,
        num_samples=args.num_samples,
        random_seed=args.random_seed
    )


if __name__ == "__main__":
    main()
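# Example invocation, using the flags defined above (paths assume the repo
# layout described in the README):
#
#   python benchmark.py --model-path ./Qwen3-VL-2B-Instruct \
#       --dataset-path ./data --output result.json \
#       --num-samples 50 --random-seed 42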
403
evaluation_wrapper.py
Executable file
@ -0,0 +1,403 @@
"""
AICAS 2026 - Participant Core Modification File

Participants should modify the VLMModel class to implement optimizations.

Note:
- The benchmark calls self.model.generate() directly for performance testing.
- Your optimizations should modify self.model or its operators in __init__ via monkey patching.
- The generate() method is optional and mainly for debugging.
"""
from typing import Dict

try:
    from PIL import Image
except ImportError:
    # For testing without PIL
    class Image:
        pass

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor


class VLMModel:
    """
    Participant optimization class - modify this to implement optimizations.

    Optimization architecture:
    - Split optimizations into separate methods for isolation and testing
    - Enable/disable each optimization independently in __init__
    - Each optimization method can be tested individually

    Important notes:
    1. The benchmark calls self.model.generate() directly for performance testing.
    2. Your optimizations should modify self.model or its operators via monkey patching.
    3. All optimizations are applied in __init__ by calling optimization methods.
    """

    def __init__(self, model_path: str, device: str = "cuda:0"):
        """
        Initialize the model and apply optimizations.

        Args:
            model_path: Qwen3-VL-2B-Instruct model path
            device: CUDA device, e.g., "cuda:0"
        """
        self._device = device
        self.model_path = model_path

        # Load processor
        print(f"[VLMModel] Loading processor from {model_path}...")
        self._processor = AutoProcessor.from_pretrained(model_path)

        # Load model
        print("[VLMModel] Loading model with FP16...")
        self._model = AutoModelForImageTextToText.from_pretrained(
            model_path,
            torch_dtype=torch.float16,
            device_map=device
        )
        self._model.eval()

        # Track applied optimizations
        self._optimizations_applied = []

        # ================================================================
        # Participant Optimization Area - Enable/disable optimizations here
        # Uncomment the optimization methods you want to apply
        # ================================================================

        # 1. Vision Encoder Acceleration
        # self._optimize_vision_encoder()

        # 2. KV Cache Management
        # self._optimize_kv_cache()

        # 3. Cross-modal Connector Optimization
        # self._optimize_cross_modal_connector()

        # 4. Flash Attention Optimization
        # self._enable_flash_attention()

        # 5. Quantization
        # self._apply_quantization()

        # Optional: Explore model structure before optimization
        # self._explore_model_structure()

        # ================================================================

        print(f"[VLMModel] Model loaded successfully on {device}")
        if self._optimizations_applied:
            print(f"[VLMModel] Applied optimizations: {', '.join(self._optimizations_applied)}")

    # ================================================================
    # Optimization Methods - Implement your optimizations here
    # ================================================================

    def _explore_model_structure(self):
        """
        Helper method to explore the model structure.

        Use this to understand the model architecture before implementing
        optimizations; it helps identify where to apply monkey patches.
        """
        print("=" * 60)
        print("Model Structure Exploration")
        print("=" * 60)

        # Explore vision model structure
        if hasattr(self._model, 'vision_model'):
            print(f"Vision Model: {type(self._model.vision_model)}")
            if hasattr(self._model.vision_model, 'encoder'):
                if hasattr(self._model.vision_model.encoder, 'layers'):
                    print(f"  Vision Encoder Layers: {len(self._model.vision_model.encoder.layers)}")
                    # Show first layer structure
                    if len(self._model.vision_model.encoder.layers) > 0:
                        print(f"  First Layer Type: {type(self._model.vision_model.encoder.layers[0])}")
        else:
            print("Vision Model: Not found (model structure may differ)")

        # Explore language model structure
        if hasattr(self._model, 'model'):
            print(f"Language Model: {type(self._model.model)}")
            if hasattr(self._model.model, 'layers'):
                print(f"  Language Model Layers: {len(self._model.model.layers)}")
        else:
            print("Language Model: Not found (model structure may differ)")

        # Explore cross-modal components
        cross_modal_attrs = ['connector', 'cross_attn', 'cross_attention', 'proj', 'projector']
        found_components = []
        for attr in cross_modal_attrs:
            if hasattr(self._model, attr):
                found_components.append(attr)
        if found_components:
            print(f"Cross-modal Components: {', '.join(found_components)}")
        else:
            print("Cross-modal Components: Explore manually (structure may vary)")

        print("=" * 60)
        print("Tip: Use print(self._model) to see the full model structure")
        print("=" * 60)

    def _optimize_vision_encoder(self):
        """
        Optimize the Vision Encoder for high-resolution image inputs.

        Optimization directions:
        1. Patch embedding convolution optimization
        2. Vision Transformer attention mechanism optimization
        3. Layer normalization optimization
        4. Memory-efficient image processing

        Implementation steps:
        1. Inspect the model structure: call self._explore_model_structure()
        2. Identify bottlenecks using profiling tools (PyTorch Profiler, nsys, etc.)
        3. Implement optimized operators (Triton/CUDA kernels)
        4. Replace the original operators via monkey patch

        Target components:
        - self._model.vision_model (if it exists)
        - Vision encoder layers and attention mechanisms
        - Convolution operations in patch embedding
        """
        # TODO: Implement your Vision Encoder optimization here
        #
        # Example workflow:
        # 1. from your_optimization import optimized_attention, optimized_conv
        # 2. Inspect: print(self._model.vision_model) to find target layers
        # 3. Replace: layer.self_attn.forward = optimized_attention
        # 4. Test: Run benchmark to verify improvement

        if 'vision_encoder' not in self._optimizations_applied:
            self._optimizations_applied.append('vision_encoder')

    def _optimize_kv_cache(self):
        """
        Optimize KV Cache management to reduce memory fragmentation.

        Optimization directions:
        1. Memory layout optimization (contiguous memory allocation)
        2. Fragmentation-free allocation strategies
        3. Efficient cache reuse patterns
        4. Dynamic cache sizing

        Implementation steps:
        1. Understand the current KV cache implementation in the model layers
        2. Design a memory-efficient cache allocation strategy
        3. Implement a custom KV cache allocator if needed
        4. Apply optimizations via monkey patch or config modification

        Target components:
        - self._model.config (cache configuration)
        - Attention layers (KV cache allocation)
        - Generation loop (cache management)
        """
        # Enable KV Cache first
        self._model.config.use_cache = True
        if hasattr(self._model.config, 'pad_token_id'):
            if self._model.config.pad_token_id is None:
                self._model.config.pad_token_id = self._model.config.eos_token_id

        # TODO: Implement advanced KV Cache optimizations here
        #
        # Example workflow:
        # 1. from your_optimization import FragmentationFreeKVCache
        # 2. for layer in self._model.model.layers:
        # 3.     layer.attention.custom_kv_cache = FragmentationFreeKVCache()
        # 4. Test: Monitor memory usage and generation speed

        if 'kv_cache' not in self._optimizations_applied:
            self._optimizations_applied.append('kv_cache')

    def _optimize_cross_modal_connector(self):
        """
        Optimize Cross-modal Connector computation efficiency.

        Optimization directions:
        1. Cross-attention mechanism optimization
        2. Vision-to-language projection optimization
        3. Multi-modal fusion layer efficiency
        4. Feature alignment and transformation optimization

        Implementation steps:
        1. Identify cross-modal components using self._explore_model_structure()
        2. Profile cross-modal operations to find bottlenecks
        3. Implement optimized cross-attention or projection kernels
        4. Replace the original operations via monkey patch

        Note: Qwen3-VL's cross-modal structure may vary. Use model exploration
        to identify the actual component names and locations.
        """
        # TODO: Implement your Cross-modal Connector optimization here
        #
        # Example workflow:
        # 1. Explore: self._explore_model_structure() to find connector components
        # 2. from your_optimization import optimized_cross_attention
        # 3. Identify: Inspect the model to find cross-attention layers
        # 4. Replace: connector.cross_attention.forward = optimized_cross_attention
        # 5. Test: Verify accuracy and performance improvements

        if 'cross_modal' not in self._optimizations_applied:
            self._optimizations_applied.append('cross_modal')

    def _enable_flash_attention(self):
        """
        Enable or implement a Flash Attention optimization.

        Implementation approaches:

        Approach 1: Enable PyTorch's built-in Flash Attention (simple)
        - Uses torch.backends.cuda.enable_flash_sdp(True)
        - Easy to enable but offers limited customization
        - May not work for all attention patterns in Qwen3-VL

        Approach 2: Implement custom Flash Attention (advanced, recommended)
        - Write custom Triton/CUDA kernels for the attention computation
        - Replace torch.nn.functional.scaled_dot_product_attention
        - Full control over the attention computation and memory layout
        - Better performance potential but requires more implementation effort

        Recommended: Implement Approach 2 for larger performance gains.
        Use profiling to identify which attention operations benefit most
        from optimization.
        """
        # TODO: Choose and implement your Flash Attention approach

        # Approach 1: Simple (enable PyTorch built-in)
        # torch.backends.cuda.enable_flash_sdp(True)

        # Approach 2: Advanced (custom implementation - recommended)
        # from your_optimization import custom_flash_attention
        # torch.nn.functional.scaled_dot_product_attention = custom_flash_attention
        #
        # Or replace at the layer level:
        # for layer in self._model.model.layers:
        #     layer.self_attn.forward = custom_attention_with_flash

        if 'flash_attention' not in self._optimizations_applied:
            self._optimizations_applied.append('flash_attention')

    def _apply_quantization(self):
        """
        Apply quantization to reduce model size and speed up inference.

        Optimization directions:
        1. INT8 quantization (8-bit integer)
        2. FP8 quantization (8-bit floating point)
        3. Mixed precision quantization
        4. Dynamic vs static quantization

        Implementation steps:
        1. Choose a quantization strategy based on the accuracy/performance trade-off
        2. Use quantization libraries (BitsAndBytes, TensorRT, etc.)
        3. Calibrate the quantized model on validation data
        4. Verify accuracy preservation

        Note: Quantization may require reloading the model with a quantization
        config. Consider applying quantization before other optimizations if a
        model reload is needed.
        """
        # TODO: Implement your quantization here
        #
        # Example workflow:
        # 1. from transformers import BitsAndBytesConfig
        # 2. quantization_config = BitsAndBytesConfig(load_in_8bit=True)
        # 3. Note: May need to reload model with quantization config
        # 4. Test: Verify accuracy and performance improvements

        if 'quantization' not in self._optimizations_applied:
            self._optimizations_applied.append('quantization')

    # Required properties for benchmark
    @property
    def processor(self):
        """
        Required by the benchmark for input processing.

        The benchmark uses this to prepare inputs with a unified tokenizer.
        """
        return self._processor

    @property
    def model(self):
        """
        Required by the benchmark for direct model.generate() calls.

        The benchmark calls self.model.generate() directly for performance
        testing; your optimizations should modify this model object or its
        operators.
        """
        return self._model

    @property
    def device(self):
        """
        Required by the benchmark for device information.
        """
        return self._device

    def generate(
        self,
        image: Image.Image,
        question: str,
        max_new_tokens: int = 128
    ) -> Dict:
        """
        Generate answer (optional method, mainly for debugging).

        Note: The benchmark uses self.model.generate() directly for performance
        testing. This method is provided for convenience and debugging purposes.

        Args:
            image: PIL Image object
            question: Question text
            max_new_tokens: Maximum tokens to generate

        Returns:
            Dict: {
                "text": str,        # Generated text answer
                "token_count": int  # Generated token count
            }
        """
        # Build Qwen3-VL message format
        messages = [{
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": question}
            ]
        }]

        # Process inputs
        inputs = self._processor.apply_chat_template(
            messages,
            tokenize=True,
            add_generation_prompt=True,
            return_dict=True,
            return_tensors="pt"
        ).to(self._device)

        # Generate
        with torch.no_grad():
            output_ids = self._model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=False,
                temperature=0.0,
                top_p=1.0,
                use_cache=True
            )

        # Extract generated tokens (remove input part)
        input_len = inputs.input_ids.shape[1]
        generated_ids = output_ids[0][input_len:]

        # Decode
        text = self._processor.tokenizer.decode(
            generated_ids,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False
        )

        return {
            "text": text,
            "token_count": len(generated_ids)
        }
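# A minimal, self-contained sketch of the monkey-patching pattern the class
# docstrings describe: wrap a module's forward to instrument it without
# changing its output. `DummyLayer` and `patch_with_timer` are illustrative
# names only, not part of the competition code.
import time

class DummyLayer:
    def forward(self, x):
        return x * 2

def patch_with_timer(layer):
    original = layer.forward
    def timed_forward(x):
        start = time.perf_counter()
        out = original(x)  # call the original implementation unchanged
        timed_forward.last_ms = (time.perf_counter() - start) * 1000
        return out
    timed_forward.last_ms = None
    layer.forward = timed_forward  # monkey patch: behavior preserved, now timed
    return layer

# Usage: layer = patch_with_timer(DummyLayer()); layer.forward(3) still
# returns 6, and layer.forward.last_ms holds the elapsed milliseconds.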
31
requirements.txt
Executable file
@ -0,0 +1,31 @@
|
||||
# AICAS 2026 - 环境依赖
|
||||
# ============================================
|
||||
|
||||
# 核心框架
|
||||
torch>=2.0.0
|
||||
transformers>=4.40.0
|
||||
accelerate>=0.25.0
|
||||
|
||||
# 数据处理
|
||||
datasets>=2.14.0
|
||||
Pillow>=9.0.0
|
||||
|
||||
# 进度条
|
||||
tqdm>=4.65.0
|
||||
|
||||
# 系统信息(可选,用于获取详细的硬件信息)
|
||||
psutil>=5.9.0
|
||||
|
||||
# 可选:Triton 算子开发
|
||||
triton>=2.1.0
|
||||
|
||||
# 可选:Flash Attention(需要 CUDA 编译)
|
||||
# flash-attn>=2.0.0
|
||||
|
||||
# 可选:量化工具
|
||||
# bitsandbytes>=0.41.0
|
||||
# auto-gptq>=0.5.0
|
||||
|
||||
# 可选:Profiling
|
||||
# tensorboard>=2.14.0
|
||||
|
||||