EAGLE-3: Principles and Implementation Explained
SGLang already integrates EAGLE-3, which makes it easy to measure the method's effect on serving performance.
The EAGLE authors provide a draft model:
jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B
Launch command (EAGLE-3 backend):
python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --speculative-algorithm EAGLE3 \
--speculative-draft-model-path jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B --speculative-num-steps 5 \
--speculative-eagle-topk 8 --speculative-num-draft-tokens 32 --mem-fraction 0.6 \
--cuda-graph-max-bs 2 --dtype float16
Launch command for the plain backend (no speculative decoding), used as the baseline:
python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct
This brings the server up on SGLang; the first command above starts it with the EAGLE-3 backend.
Benchmark command:
python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --num-prompts 500
Results (EAGLE-3 backend):
Backend: sglang
Traffic request rate: inf
Max request concurrency: not set
Successful requests: 500
Benchmark duration (s): 260.11
Total input tokens: 153555
Total generated tokens: 91009
Total generated tokens (retokenized): 90988
Request throughput (req/s): 1.92
Input token throughput (tok/s): 590.34
Output token throughput (tok/s): 349.88
Total token throughput (tok/s): 940.22
Concurrency: 222.31
Accept length: 2.80
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 115649.44
Median E2E Latency (ms): 116933.77
---------------Time to First Token----------------
Mean TTFT (ms): 95819.59
Median TTFT (ms): 99399.37
P99 TTFT (ms): 197118.52
---------------Inter-Token Latency----------------
Mean ITL (ms): 109.55
Median ITL (ms): 77.51
P95 ITL (ms): 293.54
P99 ITL (ms): 433.00
Max ITL (ms): 3256.63
Results (plain backend):
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: not set
Successful requests: 500
Benchmark duration (s): 77.51
Total input tokens: 153555
Total generated tokens: 91009
Total generated tokens (retokenized): 90997
Request throughput (req/s): 6.45
Input token throughput (tok/s): 1981.20
Output token throughput (tok/s): 1174.22
Total token throughput (tok/s): 3155.43
Concurrency: 215.58
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 33416.85
Median E2E Latency (ms): 31023.27
---------------Time to First Token----------------
Mean TTFT (ms): 13226.07
Median TTFT (ms): 12372.11
P99 TTFT (ms): 23864.97
---------------Inter-Token Latency----------------
Mean ITL (ms): 111.55
Median ITL (ms): 56.46
P95 ITL (ms): 114.25
P99 ITL (ms): 476.27
Max ITL (ms): 19294.95
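To make the two reports easier to compare, the short Python sketch below divides each EAGLE-3 metric by the matching plain-backend metric. The numbers are copied from the two reports above; the dictionary names and the `ratio_table` helper are illustrative only and are not part of SGLang. Both runs used an infinite request rate with roughly 220 concurrent requests, so the ratios describe a heavily loaded server and may not carry over to low-concurrency serving.

```python
# Metrics copied from the two bench_serving reports above.
eagle3 = {
    "Benchmark duration (s)": 260.11,
    "Output token throughput (tok/s)": 349.88,
    "Total token throughput (tok/s)": 940.22,
    "Mean E2E Latency (ms)": 115649.44,
    "Mean TTFT (ms)": 95819.59,
    "Mean ITL (ms)": 109.55,
    "Median ITL (ms)": 77.51,
    "Accept length": 2.80,  # reported only for the EAGLE-3 run
}
baseline = {
    "Benchmark duration (s)": 77.51,
    "Output token throughput (tok/s)": 1174.22,
    "Total token throughput (tok/s)": 3155.43,
    "Mean E2E Latency (ms)": 33416.85,
    "Mean TTFT (ms)": 13226.07,
    "Mean ITL (ms)": 111.55,
    "Median ITL (ms)": 56.46,
}


def ratio_table(spec, base):
    """Return spec/base for every metric present in both reports (hypothetical helper)."""
    return {name: spec[name] / base[name] for name in spec if name in base}


ratios = ratio_table(eagle3, baseline)
for name, value in ratios.items():
    print(f"{name:34s} EAGLE-3 / plain = {value:.2f}x")
```

For throughput metrics a ratio below 1 means EAGLE-3 was slower in this run; for duration and latency metrics a ratio above 1 means the same. Accept length is skipped because the plain backend does not report it.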