EAGLE-3: Principles and Implementation in Detail

SGLang already integrates EAGLE-3, so the performance effect of the method is easy to measure.

The EAGLE project provides a draft model:

jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B

Launch command for the EAGLE-3 backend:

python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct --speculative-algorithm EAGLE3 \
    --speculative-draft-model-path jamesliu1/sglang-EAGLE3-Llama-3.1-Instruct-8B --speculative-num-steps 5 \
    --speculative-eagle-topk 8 --speculative-num-draft-tokens 32 --mem-fraction 0.6 \
    --cuda-graph-max-bs 2 --dtype float16

Launch command for the baseline backend (no speculative decoding):

python3 -m sglang.launch_server --model meta-llama/Llama-3.1-8B-Instruct

Running the EAGLE-3 command above is all it takes to bring up the EAGLE-3 backend on SGLang.

Benchmark command (run against each backend in turn):

python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --num-prompts 500

Results (EAGLE-3 backend):
Backend: sglang
Traffic request rate: inf
Max request concurrency: not set
Successful requests: 500
Benchmark duration (s): 260.11
Total input tokens: 153555
Total generated tokens: 91009
Total generated tokens (retokenized): 90988
Request throughput (req/s): 1.92
Input token throughput (tok/s): 590.34
Output token throughput (tok/s): 349.88
Total token throughput (tok/s): 940.22
Concurrency: 222.31
Accept length: 2.80
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 115649.44
Median E2E Latency (ms): 116933.77
---------------Time to First Token----------------
Mean TTFT (ms): 95819.59
Median TTFT (ms): 99399.37
P99 TTFT (ms): 197118.52
---------------Inter-Token Latency----------------
Mean ITL (ms): 109.55
Median ITL (ms): 77.51
P95 ITL (ms): 293.54
P99 ITL (ms): 433.00
Max ITL (ms): 3256.63
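
To put the reported accept length in context, the sketch below relates it to the launch flags. It rests on two assumptions that the post itself does not state: that "Accept length" is the average number of tokens emitted per verification pass of the target model, and that each pass scores `--speculative-num-draft-tokens` candidate positions per request. The variable names are illustrative only.

```python
# Figures copied from the EAGLE-3 benchmark output above.
generated_tokens = 91009
accept_length = 2.80       # reported "Accept length"

# Value copied from the EAGLE-3 launch command.
num_draft_tokens = 32      # --speculative-num-draft-tokens

# Assumption: accept length = average tokens emitted per verification pass.
verify_passes = generated_tokens / accept_length

# Assumption: each pass scores num_draft_tokens candidate positions per request.
verified_positions = verify_passes * num_draft_tokens

# Target-model positions scored per emitted token (1.0 for plain decoding).
positions_per_token = num_draft_tokens / accept_length

print(f"verification passes (summed over requests): {verify_passes:.0f}")
print(f"candidate positions scored: {verified_positions:.0f}")
print(f"positions scored per emitted token: {positions_per_token:.2f}")
```

Under these assumptions the target model runs fewer passes per request but scores several candidate positions for every token it keeps, which matters once the GPU is already saturated by many concurrent requests.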

Results (baseline backend):

============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: not set
Successful requests: 500
Benchmark duration (s): 77.51
Total input tokens: 153555
Total generated tokens: 91009
Total generated tokens (retokenized): 90997
Request throughput (req/s): 6.45
Input token throughput (tok/s): 1981.20
Output token throughput (tok/s): 1174.22
Total token throughput (tok/s): 3155.43
Concurrency: 215.58
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 33416.85
Median E2E Latency (ms): 31023.27
---------------Time to First Token----------------
Mean TTFT (ms): 13226.07
Median TTFT (ms): 12372.11
P99 TTFT (ms): 23864.97
---------------Inter-Token Latency----------------
Mean ITL (ms): 111.55
Median ITL (ms): 56.46
P95 ITL (ms): 114.25
P99 ITL (ms): 476.27
Max ITL (ms): 19294.95
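
The two result blocks are easier to compare as ratios. The following is a minimal sketch that uses only numbers copied from the outputs above; the dictionary layout and the helper `ratios` are illustrative and are not part of SGLang.

```python
# Metrics copied from the two benchmark outputs above.
eagle3 = {
    "Benchmark duration (s)": 260.11,
    "Output token throughput (tok/s)": 349.88,
    "Mean E2E Latency (ms)": 115649.44,
    "Mean TTFT (ms)": 95819.59,
    "Median ITL (ms)": 77.51,
}
baseline = {
    "Benchmark duration (s)": 77.51,
    "Output token throughput (tok/s)": 1174.22,
    "Mean E2E Latency (ms)": 33416.85,
    "Mean TTFT (ms)": 13226.07,
    "Median ITL (ms)": 56.46,
}


def ratios(a, b):
    """Return a[k] / b[k] for every metric present in both dicts."""
    return {k: a[k] / b[k] for k in a if k in b}


eagle3_vs_baseline = ratios(eagle3, baseline)
for name, r in eagle3_vs_baseline.items():
    print(f"{name}: EAGLE-3 / baseline = {r:.2f}x")
```

In this particular run (request rate `inf`, roughly 220 requests in flight), the EAGLE-3 backend delivers about 0.30x the output throughput of the baseline and takes about 3.4x as long to finish. Speculative decoding is generally expected to help most at small batch sizes, so a rerun at lower concurrency would be needed before drawing a conclusion about the method itself.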

EAGLE3: The Current SOTA Approach to Speculative Decoding

https://lihuigu.cn//archives/eagle3-mu-qian-de-spec-decoding-sotafang-an

Author

lihuigu

Published

2025-05-13

Updated

2025-05-14

License