Inference Framework Performance Analysis
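Definition of a benchmark: a standardized test procedure and set of metrics for measuring the performance of a system, algorithm, hardware, or software. It must produce repeatable, comparable, quantitative results that reveal a technology's capability limits, efficiency, and room for optimization.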
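Why benchmarking and performance analysis are needed: to move from "it works" to "it works well"
- Problem localization
- Resource optimization
- Stability assurance: verify that the system stays stable under peak load
- Response-time optimization: optimize the critical path
- Early detection of latent problems
- Technology comparison: validate inference performance across different AI models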
SGLang Benchmarks
- Benchmark the latency of running a single static batch without a server (single static-batch latency, a probe of the theoretical limit). The arguments are the same as for launch_server.py. Note that this is a simplified test script without a dynamic batching server, so it may run out of memory for a batch size that a real server can handle. A real server truncates the prefill into several batches, while this simplified script does not.
  python -m sglang.bench_one_batch --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch 32 --input-len 256 --output-len 32
- Benchmark offline processing (offline throughput test: batch-oriented optimization with resources fully utilized). This script will start an offline engine and run the benchmark.
  python3 -m sglang.bench_offline_throughput --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --num-prompts 10
- Benchmark online serving (validation before going live). Please use sglang.launch_server to launch a server first and run the following command.
  python3 -m sglang.bench_serving --backend sglang --num-prompt 10
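The benchmarks above report latency and throughput figures. As a rough sketch of how such figures can be derived from raw per-request measurements: this is not SGLang's own code, and the `summarize` function, its inputs, and the sample numbers are all hypothetical.

```python
import math
import statistics


def summarize(latencies_s, output_tokens, wall_time_s):
    """Summarize one benchmark run.

    latencies_s   -- end-to-end latency of each request, in seconds
    output_tokens -- number of generated tokens for each request
    wall_time_s   -- total wall-clock duration of the run, in seconds
    """
    ordered = sorted(latencies_s)

    def percentile(p):
        # Nearest-rank percentile: the smallest sample such that
        # at least p percent of all samples are at or below it.
        rank = max(1, math.ceil(p / 100 * len(ordered)))
        return ordered[rank - 1]

    return {
        "request_throughput": len(ordered) / wall_time_s,             # requests/s
        "output_token_throughput": sum(output_tokens) / wall_time_s,  # tokens/s
        "mean_latency_s": statistics.mean(ordered),
        "p50_latency_s": percentile(50),
        "p99_latency_s": percentile(99),
    }


# Made-up measurements, for illustration only.
stats = summarize([0.8, 1.0, 1.2, 2.0], [32, 32, 32, 32], wall_time_s=2.0)
print(stats)
```

Tail percentiles such as p99 matter for online serving because a mean can hide a small number of very slow requests.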
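Startup stages:
- Initialize the PyTorch distributed environment
- Load the model; compare GPU memory before and after loading
- Allocate the KV cache: it holds the K and V tensors that have already been computed (note the K size and the V size), which avoids recomputation
- Capture CUDA graphs: this reduces inference overhead by recording the work as a directed acyclic graph (DAG), a static execution plan that can be replayed repeatedly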
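The K size and V size in the KV allocation step can be estimated by hand. The sketch below assumes a plain per-token layout (one K and one V vector per layer per KV head) and assumes the Llama-3.1-8B shape of 32 layers, 8 KV heads, head dimension 128, and 2-byte bf16 values. The function names and the 16 GiB budget are hypothetical, and a real allocator may reserve memory differently.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Return (k_bytes, v_bytes) needed to cache a single token."""
    # Each layer stores one K vector and one V vector per KV head.
    k_bytes = num_layers * num_kv_heads * head_dim * dtype_bytes
    v_bytes = k_bytes  # V has the same shape as K in this layout
    return k_bytes, v_bytes


def max_cached_tokens(kv_budget_bytes, k_bytes, v_bytes):
    """How many tokens fit in a given KV cache memory budget."""
    return kv_budget_bytes // (k_bytes + v_bytes)


# Assumed model shape: 32 layers, 8 KV heads, head_dim 128, bf16 (2 bytes).
k, v = kv_bytes_per_token(32, 8, 128, 2)

# Hypothetical budget: 16 GiB left for the KV cache after loading weights.
tokens = max_cached_tokens(16 * 1024**3, k, v)
print(k, v, tokens)
```

Under these assumptions each token costs 64 KiB for K and 64 KiB for V, so the memory left over after model loading directly bounds how many tokens the cache can hold.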
Inference runs in two phases: a warmup phase, then the actual measured phase.
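The warmup-then-measure pattern can be sketched as a small timing harness. This is a generic illustration, not SGLang's benchmark code: `benchmark` and `fake_inference` are hypothetical names, and the dummy workload stands in for a real model call.

```python
import statistics
import time


def benchmark(fn, warmup=3, runs=10):
    """Call fn `warmup` times untimed, then return `runs` timed samples in seconds."""
    # Warmup calls absorb one-time costs (lazy initialization, caches, compilation)
    # so that they do not distort the measured samples.
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return samples


calls = []


def fake_inference():
    # Stand-in workload; a real benchmark would run a model forward pass here.
    calls.append(1)
    sum(range(10_000))


samples = benchmark(fake_inference, warmup=2, runs=5)
median_s = statistics.median(samples)
print(f"median latency: {median_s:.6f} s over {len(samples)} runs")
```

Reporting the median rather than a single run makes the result less sensitive to one-off scheduling noise.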
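max_total_num_tokens: upper limit on the total capacity of the KV cache, in tokens
chunked_prefill_size: when the input length exceeds this value (2048 here), the prefill is split into chunks to reduce GPU memory usage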
max_prefill_tokens
max_running_requests
context_len
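The idea behind chunked_prefill_size can be illustrated by splitting a long input into fixed-size token ranges. This is only a sketch of the chunking arithmetic, not SGLang's scheduler: `split_prefill` is a hypothetical helper, and 2048 is the chunk size mentioned above.

```python
def split_prefill(num_input_tokens, chunk_size):
    """Return (start, end) token ranges, each covering at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [
        (start, min(start + chunk_size, num_input_tokens))
        for start in range(0, num_input_tokens, chunk_size)
    ]


# A 5000-token prompt with a chunk size of 2048 is processed in three pieces.
chunks = split_prefill(5000, 2048)
print(chunks)

# A prompt shorter than the chunk size is processed in a single piece.
short = split_prefill(256, 2048)
print(short)
```

Processing one chunk at a time bounds the activation memory needed for prefill by the chunk size rather than by the full prompt length.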