Inference Framework Performance Analysis

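Definition of a benchmark: a standardized test procedure and set of metrics for measuring the performance of a system, algorithm, piece of hardware, or piece of software. It must produce repeatable, comparable, quantitative results that help establish a technology's capability limits, efficiency, and room for optimization.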

Why benchmarking and performance analysis are necessary: they take a system from "it works" to "it works well".

Benchmark the latency of running a single static batch without a server. The arguments are the same as for `launch_server.py`. Note that this is a simplified test script without a dynamic batching server, so it may run out of memory at a batch size that a real server can handle: a real server splits the prefill into several batches, while this simplified script does not. (Single static-batch latency: a probe of the theoretical limit.)
```
python -m sglang.bench_one_batch --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --batch 32 --input-len 256 --output-len 32
```
Benchmark offline processing. This script will start an offline engine and run the benchmark. (Offline throughput test: batch-optimized, with resources fully saturated.)
```
python3 -m sglang.bench_offline_throughput --model-path meta-llama/Meta-Llama-3.1-8B-Instruct --num-prompts 10
```
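
As a rough illustration of what a throughput figure means, the sketch below derives tokens-per-second numbers from token counts and elapsed wall-clock time. The function name `throughput` and the sample numbers are hypothetical; this is not SGLang's own reporting code, only an assumption about how such metrics are commonly computed.

```python
from typing import Dict


def throughput(num_input_tokens: int, num_output_tokens: int, elapsed_s: float) -> Dict[str, float]:
    """Turn raw token counts and wall-clock time into tokens-per-second figures.

    Hypothetical helper for illustration; not part of SGLang.
    """
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return {
        "input_tok_per_s": num_input_tokens / elapsed_s,
        "output_tok_per_s": num_output_tokens / elapsed_s,
        "total_tok_per_s": (num_input_tokens + num_output_tokens) / elapsed_s,
    }


# Made-up example: 10 prompts of 256 input tokens and 32 output tokens each,
# finished in 2 seconds.
stats = throughput(num_input_tokens=10 * 256, num_output_tokens=10 * 32, elapsed_s=2.0)
print(stats)
```

Reporting input and output throughput separately is useful because prefill and decode stress the hardware differently.
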
Benchmark online serving. Launch a server with `sglang.launch_server` first, then run the following command. (Benchmarks the live service: validation before going to production.)
```
python3 -m sglang.bench_serving --backend sglang --num-prompts 10
```

A benchmark run has two phases: warm-up inference, then the measured inference.
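
The warm-up and measured phases can be sketched as a small timing harness. This is a minimal sketch, not how SGLang's scripts are written: `run_benchmark` and `fake_inference` are hypothetical names, and the workload is a stand-in for a real model call.

```python
import statistics
import time
from typing import Callable, List


def run_benchmark(fn: Callable[[], object], warmup: int = 2, iters: int = 5) -> List[float]:
    """Run `fn` untimed `warmup` times, then return `iters` timed latencies in seconds."""
    # Warm-up phase: results are discarded so one-time costs
    # (lazy initialization, caches, compilation) do not skew the numbers.
    for _ in range(warmup):
        fn()

    # Measured phase: only these runs are recorded.
    latencies = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies


def fake_inference() -> int:
    # Stand-in workload; a real benchmark would call the model here.
    return sum(i * i for i in range(10_000))


latencies = run_benchmark(fake_inference)
print(f"median latency: {statistics.median(latencies) * 1e3:.3f} ms over {len(latencies)} runs")
```

Reporting the median rather than the mean keeps a single slow outlier from dominating the result.
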

max_total_num_tokens: the total capacity limit of the KV cache, in tokens
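
To give a feel for how a token capacity relates to memory, here is a back-of-the-envelope estimate. The source does not show how the framework arrives at its value, so this is only an assumption-laden sketch: the helper names are hypothetical, the model shape (32 layers, 8 KV heads, head dimension 128, 2-byte elements) is assumed for Llama-3.1-8B, and the 40 GiB budget is made up.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache needed to hold one token across all layers."""
    # The factor 2 accounts for storing one key and one value vector per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_tokens_for_budget(kv_budget_bytes: int, bytes_per_token: int) -> int:
    """How many tokens fit into a given KV cache memory budget."""
    return kv_budget_bytes // bytes_per_token


# Assumed shape for an 8B Llama-style model with 16-bit KV entries.
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(per_token)  # 131072 bytes, i.e. 128 KiB per token

# Hypothetical budget: 40 GiB of GPU memory reserved for the KV cache.
print(max_tokens_for_budget(40 * 1024**3, per_token))  # 327680
```
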

chunked_prefill_size: when the input length exceeds 2048, the prefill is split into chunks to optimize GPU memory usage
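
The chunking idea can be illustrated with a minimal sketch that splits a prompt's token IDs into fixed-size pieces. `split_prefill` is a hypothetical helper, not SGLang's scheduler code, and the 2048 default simply mirrors the value mentioned above.

```python
from typing import List


def split_prefill(token_ids: List[int], chunk_size: int = 2048) -> List[List[int]]:
    """Split a prompt into consecutive chunks of at most `chunk_size` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]


# A 5000-token prompt is processed as three prefill chunks.
prompt = list(range(5000))
chunks = split_prefill(prompt)
print([len(c) for c in chunks])  # [2048, 2048, 904]
```

Processing one chunk at a time bounds the peak activation memory of the prefill step, at the cost of extra scheduling steps for long prompts.
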

max_prefill_tokens

max_running_requests

context_len


https://lihuigu.cn//archives/tui-li-kuang-jia-xing-neng-fen-xi

Author: lihuigu

Published: 2025-04-30

Updated: 2025-04-30
