init commit
.gitignore (vendored, Normal file, 2 lines)
@@ -0,0 +1,2 @@
data/*
Qwen3-VL-2B-Instruct/*
README.md (Executable file, 342 lines)
@@ -0,0 +1,342 @@
# AICAS 2026 - Vision-Language Model Optimization Competition

## Table of Contents

- [Overview](#overview)
- [Code Structure](#code-structure)
- [Core Files](#core-files)
- [Quick Start](#quick-start)
- [Evaluation Metrics](#evaluation-metrics)
- [Competition Rules](#competition-rules)
- [Important Notes](#important-notes)
- [Submission Guidelines](#submission-guidelines)

## Overview

This competition focuses on optimizing Vision-Language Model (VLM) inference performance. Participants are required to modify the `VLMModel` class in `evaluation_wrapper.py` to achieve better Time-To-First-Token (TTFT) and Throughput while maintaining accuracy.
## Code Structure

```
AICASGC/
├── benchmark.py              # Benchmark script (not recommended to modify)
├── evaluation_wrapper.py     # Model wrapper (participants implement optimizations here)
├── requirements.txt          # Python dependencies
├── data/                     # Validation dataset
│   ├── data-*.arrow          # Dataset files
│   ├── dataset_info.json     # Dataset metadata
│   └── state.json            # Dataset state
├── Qwen3-VL-2B-Instruct/     # Model weights directory (participants need to download)
└── README.md / README_CN.md  # Documentation
```
## Core Files

- **`benchmark.py`** - Self-testing benchmark script (⚠️ **Not recommended to modify**)
- **`evaluation_wrapper.py`** - Model wrapper where participants implement optimizations
- **`Qwen3-VL-2B-Instruct/`** - Competition model weights (must be downloaded; see "Quick Start")
- **`data/`** - Validation dataset
- **`requirements.txt`** - Python dependencies
## Quick Start

### 0. Download Model (First Time)

The model files are large and must be downloaded separately. Create the model directory first, then download the model:

```bash
# Create model directory
mkdir -p Qwen3-VL-2B-Instruct

# Install huggingface_hub (if not installed)
pip install -U huggingface_hub

# Set mirror endpoint (recommended for users in China; faster download)
export HF_ENDPOINT=https://hf-mirror.com

# Download model to the specified directory
huggingface-cli download \
    --resume-download \
    Qwen/Qwen3-VL-2B-Instruct \
    --local-dir ./Qwen3-VL-2B-Instruct \
    --local-dir-use-symlinks False
```

**Note:**
- The model is approximately 4-5 GB, so the download may take some time
- If the download is interrupted, rerun the command and it will resume automatically (`--resume-download`)
- After the download completes, the `Qwen3-VL-2B-Instruct/` folder will contain all model files
- Ensure you have sufficient disk space (at least 5 GB)
### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Run Test

```bash
python benchmark.py \
    --model-path ./Qwen3-VL-2B-Instruct \
    --dataset-path ./data \
    --output result.json \
    --num-samples 100
```
### 3. Implement Your Optimizations

Edit the `VLMModel` class in `evaluation_wrapper.py`. The optimization architecture uses a **modular design**: each optimization direction corresponds to an independent method.

#### 3.1 Explore Model Structure (Optional)

Before starting optimizations, you can explore the model structure to understand the optimization targets:

```python
class VLMModel:
    def __init__(self, model_path: str, device: str = "cuda:0"):
        # ... load model ...

        # Optional: explore the model structure
        self._explore_model_structure()  # Prints model structure information
```
#### 3.2 Enable Optimization Methods

In the `__init__` method, enable/disable different optimizations by uncommenting/commenting the corresponding calls:

```python
class VLMModel:
    def __init__(self, model_path: str, device: str = "cuda:0"):
        # ... load model ...

        # ================================================================
        # Participant Optimization Area - Enable/disable optimization methods
        # ================================================================

        # 1. Vision Encoder Acceleration (optimize high-resolution image processing)
        # self._optimize_vision_encoder()

        # 2. KV Cache Management (optimize memory fragmentation during generation)
        # self._optimize_kv_cache()

        # 3. Cross-modal Connector Optimization
        # self._optimize_cross_modal_connector()

        # 4. Flash Attention Optimization
        # self._enable_flash_attention()

        # 5. Quantization Optimization
        # self._apply_quantization()
```
#### 3.3 Implement Optimization Code

Implement your optimization logic in each optimization method. For example, to optimize the Vision Encoder:

```python
def _optimize_vision_encoder(self):
    """Find this method in evaluation_wrapper.py and implement your optimization"""

    # Example: replace the attention operator
    # from your_optimization import optimized_attention
    # if hasattr(self._model, 'vision_model'):
    #     for layer in self._model.vision_model.encoder.layers:
    #         layer.self_attn.forward = optimized_attention

    # TODO: Implement your Vision Encoder optimization
    pass
```
**Important Notes:**
- The benchmark calls `self.model.generate()` directly for performance testing
- Your optimizations should modify `self.model` or its operators via monkey patching inside the optimization methods
- All optimization methods are called in `__init__`, so optimizations take effect automatically
- The `generate()` method is optional and mainly for debugging
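The monkey-patching idea can be sketched in plain Python. This is an illustrative toy (the `TinyAttention` class and `fast_forward` function are ours, not part of the competition code): the patched method must produce the same outputs as the original, only faster.

```python
import types

class TinyAttention:
    """Stand-in for a model submodule; illustrative only."""
    def forward(self, x):
        return [v * 2 for v in x]  # "slow" reference path

def fast_forward(self, x):
    # Hypothetical optimized kernel; must return identical results
    return [v + v for v in x]

layer = TinyAttention()
baseline = layer.forward([1, 2, 3])

# Monkey patch: bind the optimized function onto the existing instance
layer.forward = types.MethodType(fast_forward, layer)
optimized = layer.forward([1, 2, 3])

assert optimized == baseline  # correctness must be preserved
```

The same pattern applies to real submodules: keep a reference to the original `forward` if you need a fallback, and verify outputs match before relying on the patched path.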
### 4. Test Your Optimized Model

```bash
python benchmark.py \
    --model-path ./Qwen3-VL-2B-Instruct \
    --dataset-path ./data \
    --output result_optimized.json \
    --num-samples 100
```

### 5. Generate Full Results for Submission

```bash
python benchmark.py \
    --model-path ./Qwen3-VL-2B-Instruct \
    --dataset-path ./data \
    --output result.json \
    --num-samples 5000
```
## Evaluation Metrics

The final score is calculated as:

```
Final Score = 0.4 × Accuracy + 0.3 × TTFT_Improvement + 0.3 × Throughput_Improvement
```

### Metrics Explained

- **TTFT (Time To First Token)**: Time from input preparation to first token generation (in milliseconds)
  - Includes: image encoding, text encoding, cross-modal interaction, prefill stage, first token generation
  - Baseline: ~80 ms
  - Improvement = (Baseline - Your_TTFT) / Baseline

- **Throughput**: End-to-end token generation rate (tokens per second)
  - Baseline: ~55 tokens/sec
  - Improvement = (Your_Throughput - Baseline) / Baseline

- **Accuracy**: VQA accuracy on the validation set (5000 samples)
  - Soft matching against multiple ground-truth answers
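As a sketch, the scoring formula above can be computed directly from the baseline values in this section (the function name is ours; the official scorer may normalize differently):

```python
TTFT_BASELINE_MS = 80.0      # baseline TTFT from the metrics section
THROUGHPUT_BASELINE = 55.0   # baseline tokens/sec

def final_score(accuracy: float, ttft_ms: float, throughput: float) -> float:
    """Final Score = 0.4*Accuracy + 0.3*TTFT_Improvement + 0.3*Throughput_Improvement."""
    ttft_improvement = (TTFT_BASELINE_MS - ttft_ms) / TTFT_BASELINE_MS
    throughput_improvement = (throughput - THROUGHPUT_BASELINE) / THROUGHPUT_BASELINE
    return 0.4 * accuracy + 0.3 * ttft_improvement + 0.3 * throughput_improvement

# Example: 70% accuracy, TTFT improved to 60 ms, throughput raised to 66 tok/s
score = final_score(0.70, 60.0, 66.0)  # 0.28 + 0.075 + 0.06 = 0.415
```

Note that a TTFT slower than baseline or a throughput below baseline yields a negative improvement term, so regressions directly reduce the score.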
## Competition Rules

### Critical Rules

1. **Do not modify `benchmark.py`**
   - This benchmark script is for self-testing only
   - Final evaluation will use a separate official benchmark system
   - Modifying this file may lead to inconsistencies between your local results and the final evaluation results

2. **Only modify `evaluation_wrapper.py`**

3. **Maintain required properties**
   - The `VLMModel` class must expose `processor`, `model`, and `device` properties
   - The benchmark uses these properties to access the model and processor
   - The `generate()` method is optional and mainly for debugging

4. **Prohibited behaviors**
   - Do not hardcode answers
   - Do not modify the dataset
   - Do not use external APIs or services
   - All optimizations must be local and self-contained

### Optimization Directions

**Encouraged:**
- Operator replacement and kernel optimization: rewrite or replace standard operator implementations (such as Attention, LayerNorm, Conv2d) using Triton, CUDA C++, etc.
- Memory and cache optimization: optimize the KV Cache memory layout, reduce memory fragmentation, optimize GPU memory access patterns
- Compilation and graph optimization: use torch.compile for computation-graph optimization and custom kernel scheduling
- Attention mechanism optimization: implement Flash Attention, memory-efficient attention, sparse attention
- Generation process optimization: optimize decoding strategies, cache management, and generation configuration parameters

**Not Permitted:**
- Using external services: calling external APIs, cloud services, or any functionality requiring a network connection is prohibited
- Data and answer cheating: training on test data, pre-computing answers, and hardcoding outputs are prohibited
- Model replacement and tampering: participants should focus on operator-level optimization; do not train the model on additional datasets, change the model architecture, or directly modify weight values
- Overfitting optimization: conditional branches or special handling for specific evaluation samples are prohibited
- Black-box tool application: merely modifying configuration files without substantive code contributions is not recognized
- Environment manipulation: interfering with fair evaluation by modifying the system environment, locking the GPU frequency, etc. is prohibited
## Important Notes

### Sample Selection

- The provided `benchmark.py` uses a **fixed order** (the first N samples, starting from index 0)
- When you run `--num-samples 100`, it evaluates samples 0-99
- This ensures reproducibility for local self-testing
- **Note**: The official evaluation system used by the competition committee may employ different sampling strategies (including random sampling) for final verification
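The fixed-order selection described above amounts to a plain prefix slice. A minimal sketch (plain Python standing in for the actual dataset call; `select_samples` is our name):

```python
def select_samples(dataset, num_samples):
    """Fixed-order selection: always the first num_samples items from index 0."""
    return dataset[:num_samples]

# With --num-samples 100 this always picks samples 0..99,
# regardless of any random seed
samples = select_samples(list(range(5000)), 100)
```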
### Hardware Information

The benchmark automatically records detailed hardware information:
- Python version, PyTorch version, CUDA version
- GPU name, memory, compute capability
- CPU model, cores, frequency
- System information (OS, kernel, architecture)
- PPU information (if available)

This information is saved in `result.json` under `system_info` for statistical analysis.
### Performance Measurement

- **Warmup**: 10 samples are used for GPU warmup before the actual measurement
- **TTFT Measurement**: time from input preparation to the first token (includes all preprocessing)
- **Throughput Measurement**: end-to-end generation time for 128 tokens
- **State Isolation**: the GPU cache is cleared between measurements to ensure fairness
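The two timed phases can be sketched as follows. This is a simplified stand-in for the real benchmark: `run_model` is a placeholder for the generate call, and the real code additionally calls `torch.cuda.synchronize()` around each timing so GPU work is fully included.

```python
import time

def timed(fn):
    """Wall-clock a single call and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

def run_model(max_new_tokens):
    # Placeholder for model.generate(...); sleeps to simulate work
    time.sleep(0.001 * max_new_tokens)
    return list(range(max_new_tokens))  # pretend these are generated tokens

# Phase 1: TTFT = time to generate a single token (includes prefill)
_, ttft = timed(lambda: run_model(1))

# Phase 2: throughput = tokens generated / end-to-end time for 128 tokens
tokens, total = timed(lambda: run_model(128))
throughput = len(tokens) / total
```

Measuring the two phases as separate calls (with the cache cleared in between) is what keeps TTFT from contaminating the throughput number, and vice versa.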
### Random Seed

- The `--random-seed` parameter only affects PyTorch's random number generator
- It does **NOT** affect the sample selection order (which is always fixed)
- Use it to reproduce any randomness in model inference
### Output Format

The `result.json` file contains:
```json
{
  "system_info": {
    "timestamp": "...",
    "python_version": "...",
    "torch_version": "...",
    "cuda_version": "...",
    "gpu_name": "...",
    ...
  },
  "performance": {
    "avg_ttft_ms": 90.55,
    "avg_throughput_tokens_per_sec": 57.77
  },
  "answers": [
    {
      "question_id": 34602,
      "prediction": "your answer text here"
    },
    ...
  ]
}
```
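Before submitting, it may help to sanity-check that your `result.json` has this structure. A minimal sketch (the required keys are taken from the format above; the helper name `check_result` is ours):

```python
import json

REQUIRED_TOP_LEVEL = {"system_info", "performance", "answers"}
REQUIRED_PERF = {"avg_ttft_ms", "avg_throughput_tokens_per_sec"}

def check_result(result: dict) -> bool:
    """Return True if the result dict has the fields the submission format requires."""
    if not REQUIRED_TOP_LEVEL <= result.keys():
        return False
    if not REQUIRED_PERF <= result["performance"].keys():
        return False
    return all({"question_id", "prediction"} <= a.keys() for a in result["answers"])

sample = {
    "system_info": {"timestamp": "2026-01-01T00:00:00"},
    "performance": {"avg_ttft_ms": 90.55, "avg_throughput_tokens_per_sec": 57.77},
    "answers": [{"question_id": 34602, "prediction": "a cat"}],
}
# Round-trip through JSON, as the file on disk would be
ok = check_result(json.loads(json.dumps(sample)))
```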
## Submission Guidelines

### Required Files for Preliminary Submission

1. **`result.json`** - Generated by running `benchmark.py`
   - Contains predictions for all samples
   - Must include valid `performance` metrics
   - **Important**: The `result.json` uploaded to the Tianchi platform is for reference only. Final scores will be computed by the competition committee using standardized hardware and the official evaluation system.

2. **Your optimized code** - `evaluation_wrapper.py` containing your optimized `VLMModel` class

3. **Docker image** - A container with your optimized environment

### Evaluation Process

1. **Self-Testing**: Use the provided `benchmark.py` to test your optimizations locally
2. **Submission**: Upload your `result.json` to the Tianchi platform (for reference only)
3. **Official Evaluation**: The competition committee will evaluate your code using:
   - Your submitted Docker image
   - A standardized hardware environment
   - The official evaluation code
   - The full validation set, with random sampling for verification
4. **Final Ranking**: Based on the final score calculated by the official evaluation system

## Good Luck!

We hope you will focus on operator-level optimization, kernel replacement, and efficient memory management. Remember: accuracy and speed are equally important! Good luck!
README_CN.md (Executable file, 348 lines)
@@ -0,0 +1,348 @@
# AICAS 2026 - Efficient VLM Inference and Optimization on AI Chips Track

## Table of Contents
- [Overview](#overview)
- [Code Structure](#code-structure)
- [Core Files](#core-files)
- [Quick Start](#quick-start)
- [Evaluation Metrics](#evaluation-metrics)
- [Competition Rules](#competition-rules)
- [Important Notes](#important-notes)
- [Submission Guidelines](#submission-guidelines)

## Overview

This competition focuses on optimizing Vision-Language Model (VLM) inference performance. Participants are required to modify the `VLMModel` class in `evaluation_wrapper.py` to improve Time-To-First-Token (TTFT) and Throughput while maintaining accuracy.

## Code Structure

```
AICASGC/
├── benchmark.py              # Benchmark script
├── evaluation_wrapper.py     # Model wrapper (participants implement optimizations here)
├── requirements.txt          # Python dependencies
├── data/                     # Validation dataset
│   ├── data-*.arrow          # Dataset files
│   ├── dataset_info.json     # Dataset metadata
│   └── state.json            # Dataset state
├── Qwen3-VL-2B-Instruct/     # Model weights directory (participants need to download)
└── README.md / README_CN.md  # Documentation
```

## Core Files

- **`benchmark.py`** - Self-testing benchmark script (⚠️ **Not recommended to modify**)
- **`evaluation_wrapper.py`** - Model wrapper where participants implement optimizations
- **`Qwen3-VL-2B-Instruct/`** - Competition model weights (must be downloaded; see "Quick Start")
- **`data/`** - Validation dataset
- **`requirements.txt`** - Python dependencies

## Quick Start

### 0. Download Model (First Time)

The model files are large and must be downloaded separately. Create the model directory first, then download the model:

```bash
# Create model directory
mkdir -p Qwen3-VL-2B-Instruct

# Install huggingface_hub (if not installed)
pip install -U huggingface_hub

# Set mirror endpoint (recommended for users in China; faster download)
export HF_ENDPOINT=https://hf-mirror.com

# Download model to the specified directory
huggingface-cli download \
    --resume-download \
    Qwen/Qwen3-VL-2B-Instruct \
    --local-dir ./Qwen3-VL-2B-Instruct \
    --local-dir-use-symlinks False
```

**Note:**
- The model is approximately 4-5 GB, so the download may take some time
- If the download is interrupted, rerun the command and it will resume automatically (`--resume-download`)
- After the download completes, the `Qwen3-VL-2B-Instruct/` folder will contain all model files
- Ensure you have sufficient disk space (at least 5 GB)

### 1. Install Dependencies

```bash
pip install -r requirements.txt
```

### 2. Run Test

```bash
python benchmark.py \
    --model-path ./Qwen3-VL-2B-Instruct \
    --dataset-path ./data \
    --output result.json \
    --num-samples 100
```

### 3. Implement Your Optimizations

Edit the `VLMModel` class in `evaluation_wrapper.py`. The optimizations use a **modular design**: each optimization direction corresponds to an independent method.

#### 3.1 Explore Model Structure (Optional)

Before starting optimizations, you can explore the model structure to understand the optimization targets:

```python
class VLMModel:
    def __init__(self, model_path: str, device: str = "cuda:0"):
        # ... load model ...

        # Optional: explore the model structure
        self._explore_model_structure()  # Prints model structure information
```

#### 3.2 Enable Optimization Methods

In the `__init__` method, enable/disable different optimizations by uncommenting/commenting the corresponding calls:

```python
class VLMModel:
    def __init__(self, model_path: str, device: str = "cuda:0"):
        # ... load model ...

        # ================================================================
        # Participant Optimization Area - Enable/disable optimization methods
        # ================================================================

        # 1. Vision Encoder Acceleration (optimize high-resolution image processing)
        # self._optimize_vision_encoder()

        # 2. KV Cache Management (optimize memory fragmentation during generation)
        # self._optimize_kv_cache()

        # 3. Cross-modal Connector Optimization
        # self._optimize_cross_modal_connector()

        # 4. Flash Attention Optimization
        # self._enable_flash_attention()

        # 5. Quantization Optimization
        # self._apply_quantization()
```

#### 3.3 Implement Optimization Code

Implement your optimization logic in each optimization method. For example, to optimize the Vision Encoder:

```python
def _optimize_vision_encoder(self):
    """Find this method in evaluation_wrapper.py and implement your optimization"""

    # Example: replace the attention operator
    # from your_optimization import optimized_attention
    # if hasattr(self._model, 'vision_model'):
    #     for layer in self._model.vision_model.encoder.layers:
    #         layer.self_attn.forward = optimized_attention

    # TODO: Implement your Vision Encoder optimization
    pass
```

### 4. Test Your Optimized Model

```bash
python benchmark.py \
    --model-path ./Qwen3-VL-2B-Instruct \
    --dataset-path ./data \
    --output result_optimized.json \
    --num-samples 100
```

### 5. Generate Full Results for Submission

```bash
python benchmark.py \
    --model-path ./Qwen3-VL-2B-Instruct \
    --dataset-path ./data \
    --output result.json \
    --num-samples 5000
```

## Evaluation Metrics

The final score is calculated as:

```
Final Score = 0.4 × Accuracy + 0.3 × TTFT_Improvement + 0.3 × Throughput_Improvement
```

### Metrics Explained

- **TTFT (Time To First Token)**: Time from input preparation to first token generation (in milliseconds)
  - Includes: image encoding, text encoding, cross-modal interaction, prefill stage, first token generation
  - Baseline: ~80 ms
  - Improvement = (Baseline - Your_TTFT) / Baseline

- **Throughput**: End-to-end token generation rate (tokens per second)
  - Baseline: ~55 tokens/sec
  - Improvement = (Your_Throughput - Baseline) / Baseline

- **Accuracy**: VQA accuracy on the validation set (5000 samples)
  - Soft matching against multiple ground-truth answers

## Competition Rules

### Critical Rules

1. **Do not modify `benchmark.py`**
   - This benchmark script is for self-testing only
   - Final evaluation will use a separate official benchmark system
   - Modifying this file may lead to inconsistencies between your local results and the final evaluation results

2. **Only modify `evaluation_wrapper.py`**

3. **Maintain required properties**
   - The `VLMModel` class must expose `processor`, `model`, and `device` properties
   - The benchmark uses these properties to access the model and processor
   - The `generate()` method is optional and mainly for debugging

4. **Prohibited behaviors**
   - Do not hardcode answers
   - Do not modify the dataset
   - Do not use external APIs or services
   - All optimizations must be local and self-contained

### Optimization Directions

**Encouraged:**
- Operator replacement and kernel optimization: rewrite or replace standard operator implementations (such as Attention, LayerNorm, Conv2d) using Triton, CUDA C++, etc.
- Memory and cache optimization: optimize the KV Cache memory layout, reduce memory fragmentation, optimize GPU memory access patterns
- Compilation and graph optimization: use torch.compile for computation-graph optimization and custom kernel scheduling
- Attention mechanism optimization: implement Flash Attention, memory-efficient attention, sparse attention
- Generation process optimization: optimize decoding strategies, cache management, and generation configuration parameters

**Not Permitted:**
- Using external services: calling external APIs, cloud services, or any functionality requiring a network connection is prohibited
- Data and answer cheating: training on test data, pre-computing answers, and hardcoding outputs are prohibited
- Model replacement and tampering: participants should focus on operator-level optimization; do not train the model on additional datasets, change the model architecture, or directly modify weight values
- Overfitting optimization: conditional branches or special handling for specific evaluation samples are prohibited
- Black-box tool application: merely modifying configuration files without substantive code contributions is not recognized
- Environment manipulation: interfering with fair evaluation by modifying the system environment, locking the GPU frequency, etc. is prohibited

## Important Notes

### Sample Selection

- The provided `benchmark.py` uses a **fixed order** (the first N samples, starting from index 0)
- When you run `--num-samples 100`, it evaluates samples 0-99
- This ensures reproducibility for local self-testing
- **Note**: The official evaluation system used by the competition committee may employ different sampling strategies (including random sampling) for final verification

### Hardware Information

The benchmark automatically records detailed hardware information:
- Python version, PyTorch version, CUDA version
- GPU name, memory, compute capability
- CPU model, cores, frequency
- System information (OS, kernel, architecture)
- PPU information (if available)

This information is saved in `result.json` under `system_info` for statistical analysis.

### Performance Measurement

- **Warmup**: 10 samples are used for GPU warmup before the actual measurement
- **TTFT Measurement**: time from input preparation to the first token (includes all preprocessing)
- **Throughput Measurement**: end-to-end generation time for 128 tokens
- **State Isolation**: the GPU cache is cleared between measurements to ensure fairness

### Random Seed

- The `--random-seed` parameter only affects PyTorch's random number generator
- It does **NOT** affect the sample selection order (which is always fixed)
- Use it to reproduce any randomness in model inference

### Output Format

The `result.json` file contains:
```json
{
  "system_info": {
    "timestamp": "...",
    "python_version": "...",
    "torch_version": "...",
    "cuda_version": "...",
    "gpu_name": "...",
    ...
  },
  "performance": {
    "avg_ttft_ms": 90.55,
    "avg_throughput_tokens_per_sec": 57.77
  },
  "answers": [
    {
      "question_id": 34602,
      "prediction": "your answer text here"
    },
    ...
  ]
}
```

## Submission Guidelines

### Required Files for Preliminary Submission

1. **`result.json`** - Generated by running `benchmark.py`
   - Contains predictions for all samples
   - Must include valid `performance` metrics
   - **Important**: The `result.json` uploaded to the Tianchi platform is for reference only. Final scores will be computed by the competition committee using standardized hardware and the official evaluation system.

2. **Your optimized code** - `evaluation_wrapper.py` containing your optimized `VLMModel` class

3. **Docker image** - A container with your optimized environment

### Evaluation Process

1. **Self-Testing**: Use the provided `benchmark.py` to test your optimizations locally
2. **Submission**: Upload your `result.json` to the Tianchi platform (for reference only)
3. **Official Evaluation**: The competition committee will evaluate your code using:
   - Your submitted Docker image
   - A standardized hardware environment
   - The official evaluation code
   - The full validation set, with random sampling for verification
4. **Final Ranking**: Based on the final score calculated by the official evaluation system

## Good Luck!

We hope you will focus on operator-level optimization, kernel replacement, and efficient memory management. Remember: accuracy and speed are equally important! Good luck!
benchmark.py (Executable file, 613 lines)
@@ -0,0 +1,613 @@
#!/usr/bin/env python3
"""
AICAS 2026 - Self-Testing Benchmark Tool

Measures TTFT and Throughput, generates result.json for self-testing.

Note: It is recommended not to modify this file. This benchmark is intended for
self-testing purposes only. The final evaluation will be conducted using a
separate official benchmark system on standardized hardware by the competition
committee.
"""
import sys
import json
import time
import argparse
import platform
import subprocess
from datetime import datetime
from pathlib import Path

import torch
from PIL import Image
from datasets import load_from_disk
from tqdm import tqdm

try:
    import psutil
    HAS_PSUTIL = True
except ImportError:
    HAS_PSUTIL = False

from evaluation_wrapper import VLMModel

# Fixed parameters - not recommended to modify
MAX_NEW_TOKENS = 128          # Token length for performance testing
ACCURACY_MAX_TOKENS = 1024    # Token length for accuracy testing
WARMUP_SAMPLES = 10           # Warmup samples for GPU stabilization
PERFORMANCE_SAMPLES = None    # Performance test samples (None = all samples)
VAL_SAMPLES = 5000            # Total validation samples
def get_system_info() -> dict:
    """Collect system information (hardware and software environment)"""
    info = {
        "timestamp": datetime.now().isoformat(),
    }

    # Python environment
    info["python_version"] = sys.version.split()[0]
    info["python_full_version"] = sys.version

    # PyTorch information
    info["torch_version"] = torch.__version__

    # CUDA information
    if torch.cuda.is_available():
        info["cuda_available"] = True
        info["cuda_version"] = torch.version.cuda if hasattr(torch.version, 'cuda') else "N/A"
        try:
            if torch.backends.cudnn.is_available():
                info["cudnn_version"] = str(torch.backends.cudnn.version())
            else:
                info["cudnn_version"] = "N/A"
        except Exception:
            info["cudnn_version"] = "N/A"

        # GPU information
        info["gpu_count"] = torch.cuda.device_count()
        info["gpu_name"] = torch.cuda.get_device_name(0)

        # GPU memory
        try:
            gpu_memory = torch.cuda.get_device_properties(0).total_memory / (1024**3)  # GB
            info["gpu_memory_gb"] = round(gpu_memory, 2)
        except Exception:
            info["gpu_memory_gb"] = "N/A"

        # GPU compute capability
        try:
            props = torch.cuda.get_device_properties(0)
            info["gpu_compute_capability"] = f"{props.major}.{props.minor}"
        except Exception:
            info["gpu_compute_capability"] = "N/A"
    else:
        info["cuda_available"] = False
        info["cuda_version"] = "N/A"
        info["gpu_count"] = 0
        info["gpu_name"] = "N/A"

    # CPU information
    info["cpu_processor"] = platform.processor() or "N/A"

    if HAS_PSUTIL:
        try:
            info["cpu_count_physical"] = psutil.cpu_count(logical=False)
            info["cpu_count_logical"] = psutil.cpu_count(logical=True)
            cpu_freq = psutil.cpu_freq()
            if cpu_freq:
                info["cpu_freq_mhz"] = round(cpu_freq.current, 2) if cpu_freq.current else "N/A"
            else:
                info["cpu_freq_mhz"] = "N/A"
        except Exception:
            info["cpu_count_physical"] = "N/A"
            info["cpu_count_logical"] = "N/A"
            info["cpu_freq_mhz"] = "N/A"
    else:
        info["cpu_count_physical"] = "N/A"
        info["cpu_count_logical"] = "N/A"
        info["cpu_freq_mhz"] = "N/A"

    # Try to get the CPU model from /proc/cpuinfo (Linux)
    try:
        if platform.system() == "Linux":
            with open("/proc/cpuinfo", "r") as f:
                for line in f:
                    if "model name" in line.lower():
                        info["cpu_model"] = line.split(":")[1].strip()
                        break
                    elif "Processor" in line and ":" in line:
                        info["cpu_model"] = line.split(":")[1].strip()
                        break
    except Exception:
        pass

    if "cpu_model" not in info:
        info["cpu_model"] = platform.processor() or "N/A"

    # System information
    info["platform_system"] = platform.system()
    info["platform_release"] = platform.release()
    info["platform_version"] = platform.version()
    info["platform_machine"] = platform.machine()
    info["platform_architecture"] = platform.architecture()[0]

    # PPU information (if available)
    info["ppu_available"] = False
    info["ppu_info"] = {}

    # Check for PPU-related devices
    try:
        if torch.cuda.is_available():
            gpu_name = torch.cuda.get_device_name(0).lower()
            if "ppu" in gpu_name or "pu" in gpu_name:
                info["ppu_available"] = True
                info["ppu_info"] = {
                    "name": torch.cuda.get_device_name(0),
                    "type": "detected_from_gpu_name"
                }
    except Exception:
        pass

    # Try to get detailed GPU info via nvidia-smi (if available)
    if torch.cuda.is_available() and platform.system() == "Linux":
        try:
            result = subprocess.run(
                ["nvidia-smi", "--query-gpu=name,driver_version,memory.total", "--format=csv,noheader"],
                capture_output=True,
                text=True,
                timeout=5
            )
            if result.returncode == 0:
                lines = result.stdout.strip().split("\n")
                if lines:
                    parts = lines[0].split(",")
                    if len(parts) >= 3:
                        info["gpu_driver_version"] = parts[1].strip()
                        info["gpu_memory_total"] = parts[2].strip()
        except Exception:
            pass

    # Memory information
    if HAS_PSUTIL:
        try:
            mem = psutil.virtual_memory()
            info["memory_total_gb"] = round(mem.total / (1024**3), 2)
            info["memory_available_gb"] = round(mem.available / (1024**3), 2)
        except Exception:
            pass

    return info
def measure_performance(model: VLMModel, image: Image.Image, question: str) -> tuple:
|
||||
"""
|
||||
Measure performance metrics (TTFT and Throughput)
|
||||
|
||||
TTFT measurement: Full model call time (generating 1 token)
|
||||
Includes: image encoding, text encoding, cross-modal interaction, prefill, first token generation
|
||||
|
||||
Args:
|
||||
model: VLMModel instance (must expose processor and model attributes)
|
||||
image: PIL Image
|
||||
question: Question text
|
||||
|
||||
Returns:
|
||||
tuple: (ttft, throughput, token_count)
|
||||
"""
|
||||
if not hasattr(model, 'processor') or not hasattr(model, 'model'):
|
||||
raise AttributeError("Model must expose 'processor' and 'model' attributes")
|
||||
|
||||
processor = model.processor
|
||||
device = model.device
|
||||
model_obj = model.model
|
||||
|
||||
# Clear GPU state
|
||||
if torch.cuda.is_available():
|
||||
torch.cuda.empty_cache()
|
||||
torch.cuda.synchronize()
|
||||
|
||||
# Prepare inputs
|
||||
messages = [{
|
||||
"role": "user",
|
||||
"content": [
|
||||
{"type": "image", "image": image},
|
||||
{"type": "text", "text": question}
|
||||
]
|
||||
}]
|
||||
|
||||
inputs = processor.apply_chat_template(
|
||||
messages,
|
||||
tokenize=True,
|
||||
add_generation_prompt=True,
|
||||
return_dict=True,
|
||||
return_tensors="pt"
|
||||
).to(device)
|
||||
|
||||
input_len = inputs.input_ids.shape[1]
|
||||
|
||||
# Step 1: Measure TTFT (generate 1 token, includes all preprocessing)
|
||||
try:
|
||||
torch.cuda.synchronize()
|
||||
start_ttft = time.perf_counter()
|
||||
|
||||
# Direct call to underlying model
|
||||
with torch.no_grad():
|
||||
output_ids_ttft = model_obj.generate(
|
||||
**inputs,
|
||||
max_new_tokens=1,
|
||||
do_sample=False,
|
||||
temperature=0.0,
|
||||
use_cache=True
|
||||
)
|
||||
|
||||
torch.cuda.synchronize()
|
||||
ttft = time.perf_counter() - start_ttft
|
||||
|
||||
except torch.cuda.OutOfMemoryError as e:
|
||||
if torch.cuda.is_available():
|
||||
torch.cuda.empty_cache()
|
||||
torch.cuda.synchronize()
|
||||
print(f"[Error] OOM during TTFT measurement: {e}")
|
||||
return float('inf'), 0.0, 0
|
||||
except Exception as e:
|
||||
print(f"[Error] Error during TTFT measurement: {e}")
|
||||
import traceback
|
||||
traceback.print_exc()
|
||||
return float('inf'), 0.0, 0
|
||||
|
||||
# Clear state
|
||||
if torch.cuda.is_available():
|
||||
torch.cuda.empty_cache()
|
||||
torch.cuda.synchronize()
|
||||
time.sleep(0.005) # Ensure state reset
|
||||
|
||||
    # Step 2: Measure full generation (for Throughput)
    try:
        torch.cuda.synchronize()
        start_full = time.perf_counter()

        # Direct call to underlying model
        with torch.no_grad():
            output_ids = model_obj.generate(
                **inputs,
                max_new_tokens=MAX_NEW_TOKENS,
                do_sample=False,
                temperature=0.0,
                use_cache=True
            )

        torch.cuda.synchronize()
        total_time = time.perf_counter() - start_full

        # Extract generated tokens
        generated_ids = output_ids[0][input_len:]
        token_count = len(generated_ids)

    except torch.cuda.OutOfMemoryError as e:
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            torch.cuda.synchronize()
        print(f"[Error] OOM during full generation: {e}")
        return ttft, 0.0, 0
    except Exception as e:
        print(f"[Error] Error during full generation: {e}")
        import traceback
        traceback.print_exc()
        return ttft, 0.0, 0

    # Calculate throughput
    if total_time > 0.001 and token_count > 0:
        throughput = token_count / total_time
    else:
        throughput = 0.0

    return ttft, throughput, token_count

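# The per-sample metrics above are later averaged into the reported numbers
# (mean TTFT in ms, mean throughput in tokens/sec). A minimal standalone
# sketch of that aggregation; `summarize_metrics` and the mock timings are
# illustrative only, not part of the benchmark code:
def summarize_metrics(ttfts_s, throughputs):
    """Average per-sample TTFTs (seconds) and throughputs (tokens/sec)."""
    if not ttfts_s:
        return None, 0.0
    avg_ttft_ms = sum(ttfts_s) / len(ttfts_s) * 1000  # convert seconds -> ms
    avg_tps = sum(throughputs) / len(throughputs)
    return round(avg_ttft_ms, 2), round(avg_tps, 2)

# Example with three mock samples:
# summarize_metrics([0.120, 0.100, 0.110], [40.0, 50.0, 45.0]) -> (110.0, 45.0)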
def generate_answer(model: VLMModel, image: Image.Image, question: str, max_new_tokens: int = ACCURACY_MAX_TOKENS) -> dict:
    """
    Generate full answer (for accuracy evaluation)

    Args:
        model: VLMModel instance
        image: PIL Image
        question: Question text
        max_new_tokens: Maximum tokens to generate

    Returns:
        dict: {"text": str, "token_count": int}
    """
    if not hasattr(model, 'processor') or not hasattr(model, 'model'):
        # Fallback: use the model's own generate method
        return model.generate(image, question, max_new_tokens=max_new_tokens)

    processor = model.processor
    device = model.device
    model_obj = model.model

    # Prepare inputs
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": question}
        ]
    }]

    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt"
    ).to(device)

    input_len = inputs.input_ids.shape[1]

    # Generate answer using underlying model
    with torch.no_grad():
        output_ids = model_obj.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            do_sample=False,
            temperature=0.0,
            use_cache=True
        )

    # Extract generated tokens
    generated_ids = output_ids[0][input_len:]
    text = processor.tokenizer.decode(
        generated_ids,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )

    return {
        "text": text,
        "token_count": len(generated_ids)
    }

def run_benchmark(
    model_class,
    model_path: str,
    dataset_path: str,
    output_path: str,
    num_samples: int = None,
    random_seed: int = None
):
    """
    Run benchmark evaluation

    Process:
    1. Load participant model
    2. Measure TTFT and Throughput
    3. Generate answers
    4. Calculate statistics
    5. Save results

    Args:
        random_seed: Random seed for reproducibility
    """
    # Set random seed (if provided)
    if random_seed is not None:
        import random
        import numpy as np
        random.seed(random_seed)
        np.random.seed(random_seed)
        torch.manual_seed(random_seed)
        if torch.cuda.is_available():
            torch.cuda.manual_seed_all(random_seed)

    # Clear GPU cache
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

    # Load dataset
    print("=" * 60)
    print("AICAS 2026 Benchmark Tool")
    print("=" * 60)
    print(f"\nLoading dataset from: {dataset_path}")

    dataset = load_from_disk(dataset_path)
    total_samples = num_samples or min(VAL_SAMPLES, len(dataset))

    # Performance test samples
    if PERFORMANCE_SAMPLES is None:
        perf_samples = total_samples  # Test all samples
    else:
        perf_samples = min(PERFORMANCE_SAMPLES, total_samples)

    print(f"Total samples: {total_samples}")
    print(f"Performance test samples: {perf_samples}")

    # Prepare samples (fixed order: first N samples)
    samples = []
    for i in range(total_samples):
        item = dataset[i]
        samples.append({
            "question_id": item.get("question_id", i),
            "image": item["image"],
            "question": item["question"],
        })

    results = {
        "system_info": get_system_info(),
        "performance": {},
        "answers": []
    }

    # Load and test participant model
    print("\n" + "=" * 60)
    print("Running Model Benchmark")
    print("=" * 60)

    model = model_class(model_path)

    # Warmup
    print(f"\nWarming up ({WARMUP_SAMPLES} samples)...")
    for i in range(min(WARMUP_SAMPLES, len(samples))):
        try:
            generate_answer(model, samples[i]["image"], samples[i]["question"], max_new_tokens=10)
        except Exception as e:
            print(f"[Warning] Warmup sample {i} failed: {e}")

    # Clear state after warmup
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
        torch.cuda.synchronize()

    # Performance testing + answer generation
    ttfts = []
    throughputs = []
    predictions = []

    print("\nMeasuring performance & generating answers...")

    # Performance test samples: measure performance + generate full answers
    for sample in tqdm(samples[:perf_samples], desc="Performance"):
        # Clear state before each measurement for fairness
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
            torch.cuda.synchronize()

        try:
            # Step 1: Measure performance
            ttft, throughput, token_count = measure_performance(
                model, sample["image"], sample["question"]
            )

            # Check for failures
            if ttft == float('inf') or throughput == 0.0:
                print(f"[Warning] Sample {sample['question_id']} failed (TTFT={ttft}, Throughput={throughput})")
            else:
                ttfts.append(ttft)
                throughputs.append(throughput)

            # Clear state again before generating full answer
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
                torch.cuda.synchronize()

            # Step 2: Generate full answer (for accuracy evaluation)
            try:
                result_full = generate_answer(
                    model,
                    sample["image"],
                    sample["question"],
                    max_new_tokens=ACCURACY_MAX_TOKENS
                )
                predictions.append({
                    "question_id": sample["question_id"],
                    "prediction": result_full["text"]
                })
            except Exception as e:
                print(f"[Error] Error generating full answer for sample {sample['question_id']}: {e}")
                predictions.append({
                    "question_id": sample["question_id"],
                    "prediction": ""
                })

        except Exception as e:
            print(f"[Error] Sample {sample['question_id']} failed: {e}")
            predictions.append({
                "question_id": sample["question_id"],
                "prediction": ""
            })
            continue

    # If there are remaining samples, only generate answers
    if total_samples > perf_samples:
        for sample in tqdm(samples[perf_samples:], desc="Accuracy"):
            try:
                result = generate_answer(
                    model,
                    sample["image"],
                    sample["question"],
                    max_new_tokens=ACCURACY_MAX_TOKENS
                )
                predictions.append({
                    "question_id": sample["question_id"],
                    "prediction": result["text"]
                })
            except Exception as e:
                print(f"[Error] Error generating answer for sample {sample['question_id']}: {e}")
                predictions.append({
                    "question_id": sample["question_id"],
                    "prediction": ""
                })

    # Calculate statistics
    if len(ttfts) > 0:
        avg_ttft = sum(ttfts) / len(ttfts) * 1000  # Convert to ms
        avg_throughput = sum(throughputs) / len(throughputs)
    else:
        avg_ttft = float('inf')
        avg_throughput = 0.0

    # Build performance results
    performance = {
        "avg_ttft_ms": round(avg_ttft, 2) if avg_ttft != float('inf') else None,
        "avg_throughput_tokens_per_sec": round(avg_throughput, 2),
    }

    results["performance"] = performance
    results["answers"] = predictions

    # Print summary
    if len(ttfts) > 0:
        print(f"\n✓ TTFT: {avg_ttft:.2f} ms")
        print(f"✓ Throughput: {avg_throughput:.2f} tokens/sec")
    else:
        print("\n✗ All samples failed!")

    # Save results
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2, ensure_ascii=False)

    print("\n" + "=" * 60)
    print("Benchmark Complete!")
    print("=" * 60)
    print("\n📊 Results Summary:")
    if len(ttfts) > 0:
        print(f"  TTFT: {avg_ttft:.2f} ms")
        print(f"  Throughput: {avg_throughput:.2f} tokens/sec")
    else:
        print("  ⚠ All samples failed!")
    print(f"  Samples evaluated: {total_samples}")
    print(f"\n💾 Results saved to: {output_path}")

    return results

def main():
    parser = argparse.ArgumentParser(description="AICAS 2026 Benchmark Tool")
    parser.add_argument("--model-path", type=str, default="./Qwen3-VL-2B-Instruct", help="Path to model weights")
    parser.add_argument("--dataset-path", type=str, default="./data", help="Path to validation dataset")
    parser.add_argument("--output", type=str, default="result.json", help="Output JSON file path")
    parser.add_argument("--num-samples", type=int, default=None, help="Number of samples to evaluate (default: all)")
    parser.add_argument("--random-seed", type=int, default=None, help="Random seed for reproducibility")

    args = parser.parse_args()

    # Use VLMModel (participants modify this class in evaluation_wrapper.py)
    print("=" * 60)
    print("Using VLMModel (modify evaluation_wrapper.py to add optimizations)")
    print("=" * 60)

    # Run benchmark
    run_benchmark(
        model_class=VLMModel,
        model_path=args.model_path,
        dataset_path=args.dataset_path,
        output_path=args.output,
        num_samples=args.num_samples,
        random_seed=args.random_seed
    )


if __name__ == "__main__":
    main()
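# Example invocation, using the flags defined above (paths assume the repo
# layout described in the README):
#
#   python benchmark.py --model-path ./Qwen3-VL-2B-Instruct \
#       --dataset-path ./data --output result.json \
#       --num-samples 50 --random-seed 42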
403
evaluation_wrapper.py
Executable file
@ -0,0 +1,403 @@
"""
AICAS 2026 - Participant Core Modification File

Participants should modify the VLMModel class to implement optimizations.

Note:
- The benchmark calls self.model.generate() directly for performance testing.
- Your optimizations should modify self.model or its operators in __init__ via monkey patching.
- The generate() method is optional and mainly for debugging.
"""
from typing import Dict

try:
    from PIL import Image
except ImportError:
    # For testing without PIL
    class Image:
        pass

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor


class VLMModel:
    """
    Participant optimization class - modify this to implement optimizations.

    Optimization architecture:
    - Split optimizations into separate methods for isolation and testing
    - Enable/disable each optimization independently in __init__
    - Each optimization method can be tested individually

    Important notes:
    1. The benchmark calls self.model.generate() directly for performance testing.
    2. Your optimizations should modify self.model or its operators via monkey patching.
    3. All optimizations are applied in __init__ by calling optimization methods.
    """

    def __init__(self, model_path: str, device: str = "cuda:0"):
        """
        Initialize the model and apply optimizations.

        Args:
            model_path: Qwen3-VL-2B-Instruct model path
            device: CUDA device, e.g., "cuda:0"
        """
        self._device = device
        self.model_path = model_path

        # Load processor
        print(f"[VLMModel] Loading processor from {model_path}...")
        self._processor = AutoProcessor.from_pretrained(model_path)

        # Load model
        print("[VLMModel] Loading model with FP16...")
        self._model = AutoModelForImageTextToText.from_pretrained(
            model_path,
            torch_dtype=torch.float16,
            device_map=device
        )
        self._model.eval()

        # Track applied optimizations
        self._optimizations_applied = []

        # ================================================================
        # Participant Optimization Area - Enable/disable optimizations here
        # Uncomment the optimization methods you want to apply
        # ================================================================

        # 1. Vision Encoder Acceleration
        # self._optimize_vision_encoder()

        # 2. KV Cache Management
        # self._optimize_kv_cache()

        # 3. Cross-modal Connector Optimization
        # self._optimize_cross_modal_connector()

        # 4. Flash Attention Optimization
        # self._enable_flash_attention()

        # 5. Quantization
        # self._apply_quantization()

        # Optional: Explore model structure before optimization
        # self._explore_model_structure()

        # ================================================================

        print(f"[VLMModel] Model loaded successfully on {device}")
        if self._optimizations_applied:
            print(f"[VLMModel] Applied optimizations: {', '.join(self._optimizations_applied)}")

    # ================================================================
    # Optimization Methods - Implement your optimizations here
    # ================================================================

    def _explore_model_structure(self):
        """
        Helper method to explore the model structure.

        Use this to understand the model architecture before implementing
        optimizations; it helps identify where to apply monkey patches.
        """
        print("=" * 60)
        print("Model Structure Exploration")
        print("=" * 60)

        # Explore vision model structure
        if hasattr(self._model, 'vision_model'):
            print(f"Vision Model: {type(self._model.vision_model)}")
            if hasattr(self._model.vision_model, 'encoder'):
                if hasattr(self._model.vision_model.encoder, 'layers'):
                    print(f"  Vision Encoder Layers: {len(self._model.vision_model.encoder.layers)}")
                    # Show first layer structure
                    if len(self._model.vision_model.encoder.layers) > 0:
                        print(f"  First Layer Type: {type(self._model.vision_model.encoder.layers[0])}")
        else:
            print("Vision Model: Not found (model structure may differ)")

        # Explore language model structure
        if hasattr(self._model, 'model'):
            print(f"Language Model: {type(self._model.model)}")
            if hasattr(self._model.model, 'layers'):
                print(f"  Language Model Layers: {len(self._model.model.layers)}")
        else:
            print("Language Model: Not found (model structure may differ)")

        # Explore cross-modal components
        cross_modal_attrs = ['connector', 'cross_attn', 'cross_attention', 'proj', 'projector']
        found_components = []
        for attr in cross_modal_attrs:
            if hasattr(self._model, attr):
                found_components.append(attr)
        if found_components:
            print(f"Cross-modal Components: {', '.join(found_components)}")
        else:
            print("Cross-modal Components: Explore manually (structure may vary)")

        print("=" * 60)
        print("Tip: Use print(self._model) to see the full model structure")
        print("=" * 60)

    def _optimize_vision_encoder(self):
        """
        Optimize the Vision Encoder for high-resolution image inputs.

        Optimization directions:
        1. Patch embedding convolution optimization
        2. Vision Transformer attention mechanism optimization
        3. Layer normalization optimization
        4. Memory-efficient image processing

        Implementation steps:
        1. Inspect the model structure: call self._explore_model_structure()
        2. Identify bottlenecks using profiling tools (PyTorch Profiler, nsys, etc.)
        3. Implement optimized operators (Triton/CUDA kernels)
        4. Replace the original operators via monkey patch

        Target components:
        - self._model.vision_model (if it exists)
        - Vision encoder layers and attention mechanisms
        - Convolution operations in patch embedding
        """
        # TODO: Implement your Vision Encoder optimization here
        #
        # Example workflow:
        # 1. from your_optimization import optimized_attention, optimized_conv
        # 2. Inspect: print(self._model.vision_model) to find target layers
        # 3. Replace: layer.self_attn.forward = optimized_attention
        # 4. Test: Run benchmark to verify improvement

        if 'vision_encoder' not in self._optimizations_applied:
            self._optimizations_applied.append('vision_encoder')

    def _optimize_kv_cache(self):
        """
        Optimize KV Cache management to reduce memory fragmentation.

        Optimization directions:
        1. Memory layout optimization (contiguous memory allocation)
        2. Fragmentation-free allocation strategies
        3. Efficient cache reuse patterns
        4. Dynamic cache sizing

        Implementation steps:
        1. Understand the current KV cache implementation in the model layers
        2. Design a memory-efficient cache allocation strategy
        3. Implement a custom KV cache allocator if needed
        4. Apply optimizations via monkey patch or config modification

        Target components:
        - self._model.config (cache configuration)
        - Attention layers (KV cache allocation)
        - Generation loop (cache management)
        """
        # Enable KV Cache first
        self._model.config.use_cache = True
        if hasattr(self._model.config, 'pad_token_id'):
            if self._model.config.pad_token_id is None:
                self._model.config.pad_token_id = self._model.config.eos_token_id

        # TODO: Implement advanced KV Cache optimizations here
        #
        # Example workflow:
        # 1. from your_optimization import FragmentationFreeKVCache
        # 2. for layer in self._model.model.layers:
        # 3.     layer.attention.custom_kv_cache = FragmentationFreeKVCache()
        # 4. Test: Monitor memory usage and generation speed

        if 'kv_cache' not in self._optimizations_applied:
            self._optimizations_applied.append('kv_cache')

    def _optimize_cross_modal_connector(self):
        """
        Optimize Cross-modal Connector computation efficiency.

        Optimization directions:
        1. Cross-attention mechanism optimization
        2. Vision-to-language projection optimization
        3. Multi-modal fusion layer efficiency
        4. Feature alignment and transformation optimization

        Implementation steps:
        1. Identify cross-modal components using self._explore_model_structure()
        2. Profile cross-modal operations to find bottlenecks
        3. Implement optimized cross-attention or projection kernels
        4. Replace the original operations via monkey patch

        Note: Qwen3-VL's cross-modal structure may vary. Use model exploration
        to identify the actual component names and locations.
        """
        # TODO: Implement your Cross-modal Connector optimization here
        #
        # Example workflow:
        # 1. Explore: self._explore_model_structure() to find connector components
        # 2. from your_optimization import optimized_cross_attention
        # 3. Identify: Inspect the model to find cross-attention layers
        # 4. Replace: connector.cross_attention.forward = optimized_cross_attention
        # 5. Test: Verify accuracy and performance improvements

        if 'cross_modal' not in self._optimizations_applied:
            self._optimizations_applied.append('cross_modal')

    def _enable_flash_attention(self):
        """
        Enable or implement a Flash Attention optimization.

        Implementation approaches:

        Approach 1: Enable PyTorch's built-in Flash Attention (simple)
        - Uses torch.backends.cuda.enable_flash_sdp(True)
        - Easy to enable but offers limited customization
        - May not work for all attention patterns in Qwen3-VL

        Approach 2: Implement custom Flash Attention (advanced, recommended)
        - Write custom Triton/CUDA kernels for the attention computation
        - Replace torch.nn.functional.scaled_dot_product_attention
        - Full control over the attention computation and memory layout
        - Better performance potential but requires more implementation effort

        Recommended: Implement Approach 2 for larger performance gains.
        Use profiling to identify which attention operations benefit most
        from optimization.
        """
        # TODO: Choose and implement your Flash Attention approach

        # Approach 1: Simple (enable PyTorch built-in)
        # torch.backends.cuda.enable_flash_sdp(True)

        # Approach 2: Advanced (custom implementation - recommended)
        # from your_optimization import custom_flash_attention
        # torch.nn.functional.scaled_dot_product_attention = custom_flash_attention
        #
        # Or replace at the layer level:
        # for layer in self._model.model.layers:
        #     layer.self_attn.forward = custom_attention_with_flash

        if 'flash_attention' not in self._optimizations_applied:
            self._optimizations_applied.append('flash_attention')

    def _apply_quantization(self):
        """
        Apply quantization to reduce model size and speed up inference.

        Optimization directions:
        1. INT8 quantization (8-bit integer)
        2. FP8 quantization (8-bit floating point)
        3. Mixed precision quantization
        4. Dynamic vs static quantization

        Implementation steps:
        1. Choose a quantization strategy based on the accuracy/performance trade-off
        2. Use quantization libraries (BitsAndBytes, TensorRT, etc.)
        3. Calibrate the quantized model on validation data
        4. Verify accuracy preservation

        Note: Quantization may require reloading the model with a quantization
        config. Consider applying quantization before other optimizations if a
        model reload is needed.
        """
        # TODO: Implement your quantization here
        #
        # Example workflow:
        # 1. from transformers import BitsAndBytesConfig
        # 2. quantization_config = BitsAndBytesConfig(load_in_8bit=True)
        # 3. Note: May need to reload model with quantization config
        # 4. Test: Verify accuracy and performance improvements

        if 'quantization' not in self._optimizations_applied:
            self._optimizations_applied.append('quantization')

    # Required properties for benchmark
    @property
    def processor(self):
        """
        Required by the benchmark for input processing.

        The benchmark uses this to prepare inputs with a unified tokenizer.
        """
        return self._processor

    @property
    def model(self):
        """
        Required by the benchmark for direct model.generate() calls.

        The benchmark calls self.model.generate() directly for performance
        testing; your optimizations should modify this model object or its
        operators.
        """
        return self._model

    @property
    def device(self):
        """
        Required by the benchmark for device information.
        """
        return self._device

    def generate(
        self,
        image: Image.Image,
        question: str,
        max_new_tokens: int = 128
    ) -> Dict:
        """
        Generate answer (optional method, mainly for debugging).

        Note: The benchmark uses self.model.generate() directly for performance
        testing. This method is provided for convenience and debugging purposes.

        Args:
            image: PIL Image object
            question: Question text
            max_new_tokens: Maximum tokens to generate

        Returns:
            Dict: {
                "text": str,        # Generated text answer
                "token_count": int  # Generated token count
            }
        """
        # Build Qwen3-VL message format
        messages = [{
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": question}
            ]
        }]

        # Process inputs
        inputs = self._processor.apply_chat_template(
            messages,
            tokenize=True,
            add_generation_prompt=True,
            return_dict=True,
            return_tensors="pt"
        ).to(self._device)

        # Generate
        with torch.no_grad():
            output_ids = self._model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=False,
                temperature=0.0,
                top_p=1.0,
                use_cache=True
            )

        # Extract generated tokens (remove input part)
        input_len = inputs.input_ids.shape[1]
        generated_ids = output_ids[0][input_len:]

        # Decode
        text = self._processor.tokenizer.decode(
            generated_ids,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False
        )

        return {
            "text": text,
            "token_count": len(generated_ids)
        }
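# A minimal, self-contained sketch of the monkey-patching pattern the class
# docstrings describe: wrap a module's forward to instrument it without
# changing its output. `DummyLayer` and `patch_with_timer` are illustrative
# names only, not part of the competition code.
import time

class DummyLayer:
    def forward(self, x):
        return x * 2

def patch_with_timer(layer):
    original = layer.forward
    def timed_forward(x):
        start = time.perf_counter()
        out = original(x)  # call the original implementation unchanged
        timed_forward.last_ms = (time.perf_counter() - start) * 1000
        return out
    timed_forward.last_ms = None
    layer.forward = timed_forward  # monkey patch: behavior preserved, now timed
    return layer

# Usage: layer = patch_with_timer(DummyLayer()); layer.forward(3) still
# returns 6, and layer.forward.last_ms holds the elapsed milliseconds.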
31
requirements.txt
Executable file
@ -0,0 +1,31 @@
|
||||
# AICAS 2026 - 环境依赖
|
||||
# ============================================
|
||||
|
||||
# 核心框架
|
||||
torch>=2.0.0
|
||||
transformers>=4.40.0
|
||||
accelerate>=0.25.0
|
||||
|
||||
# 数据处理
|
||||
datasets>=2.14.0
|
||||
Pillow>=9.0.0
|
||||
|
||||
# 进度条
|
||||
tqdm>=4.65.0
|
||||
|
||||
# 系统信息(可选,用于获取详细的硬件信息)
|
||||
psutil>=5.9.0
|
||||
|
||||
# 可选:Triton 算子开发
|
||||
triton>=2.1.0
|
||||
|
||||
# 可选:Flash Attention(需要 CUDA 编译)
|
||||
# flash-attn>=2.0.0
|
||||
|
||||
# 可选:量化工具
|
||||
# bitsandbytes>=0.41.0
|
||||
# auto-gptq>=0.5.0
|
||||
|
||||
# 可选:Profiling
|
||||
# tensorboard>=2.14.0
|
||||
|
||||