Step-3.7-Flash MTP draft (for the NVFP4 checkpoint)

A tiny Multi-Token-Prediction (MTP / nextn) draft for stepfun-ai/Step-3.7-Flash-NVFP4, so you can run speculative decoding on the NVFP4 checkpoint in vLLM.

Why this exists: the official Step-3.7-Flash-NVFP4 checkpoint declares num_nextn_predict_layers: 3 in its config but ships no MTP weights. The 3 nextn layers were dropped during quantization, and the per-layer config arrays were truncated to length 45, so indexing layers 45-47 would raise an IndexError even if the weights were present. The BF16 and FP8 releases keep the MTP weights, but the NVFP4 release (the smallest, SM120-friendly one) cannot do speculative decoding out of the box. This repo is the missing piece: the 3 MTP layers extracted from the BF16 release, kept in BF16 (they're tiny), and packaged as a vLLM-loadable draft.
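
To see the mismatch described above in a config of your own, a small check along these lines may help. It is a sketch only: `num_hidden_layers` is assumed to be the field holding the main-layer count (the usual Hugging Face name), the "any list at least that long is a per-layer array" rule is a heuristic, and the toy config values are placeholders rather than the real file.

```python
def find_truncated_layer_arrays(config: dict) -> dict:
    """Return {field: length} for per-layer lists too short to cover the nextn layers."""
    n_main = config["num_hidden_layers"]  # assumed field name for the main-layer count
    n_nextn = config.get("num_nextn_predict_layers", 0)
    needed = n_main + n_nextn
    short = {}
    for key, value in config.items():
        # Heuristic: a list with at least n_main entries is treated as a per-layer array.
        if isinstance(value, list) and n_main <= len(value) < needed:
            short[key] = len(value)
    return short


# Toy config shaped like the situation described above (values are placeholders).
toy_config = {
    "num_hidden_layers": 45,
    "num_nextn_predict_layers": 3,
    "layer_types": ["placeholder"] * 45,
    "partial_rotary_factors": [1.0] * 45,
}
truncated = find_truncated_layer_arrays(toy_config)
print(truncated)
```

An empty result would mean every per-layer array is long enough to index the nextn layers.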

~5.9 GB, BF16. Base = NVFP4 (mixed precision is fine; the draft is small).
Lossless in the speculative sense: vLLM's rejection sampling provably matches the target distribution; at temperature=0 it follows the target's greedy tokens.
Drop-in: point vLLM's --speculative-config at this directory.
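
The "lossless" claim above rests on the standard speculative-sampling accept/reject rule. The sketch below works on a toy vocabulary and is not vLLM's batched implementation; the function names and the example distributions are made up for illustration. It computes the exact output distribution of one step so it can be compared with the target.

```python
import random


def speculative_step(p, q, rng):
    """Emit one token: the draft proposes x ~ q, the target p accepts or resamples."""
    x = rng.choices(range(len(q)), weights=q)[0]
    if rng.random() < min(1.0, p[x] / q[x]):
        return x  # accepted draft token
    # Rejected: resample from the normalized residual max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0]


def output_distribution(p, q):
    """Exact distribution of the token emitted by speculative_step."""
    accept = [min(pi, qi) for pi, qi in zip(p, q)]  # q[x] * min(1, p[x] / q[x])
    reject_mass = 1.0 - sum(accept)
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(residual)
    return [
        a + (reject_mass * r / total if total > 0 else 0.0)
        for a, r in zip(accept, residual)
    ]


target = [0.6, 0.3, 0.1]   # illustrative target distribution
draft = [0.2, 0.5, 0.3]    # illustrative draft distribution
print(output_distribution(target, draft))

greedy_target = [0.0, 1.0, 0.0]  # temperature=0 behaves like a one-hot target
print(speculative_step(greedy_target, draft, random.Random(0)))
```

With a one-hot target, every path through `speculative_step` ends on the target's argmax, which is the temperature=0 behavior the bullet describes.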

Usage (vLLM: the stepfun37 image, or any build that includes `Step3p5MTP`)

The draft is auto-routed to vLLM's native Step3p5MTP + Step3p5MTPProposer because its config is model_type: step3p7 with num_nextn_predict_layers > 0.

docker run -d --gpus all --ipc=host --shm-size=64g --network host \
  -v /path/to/Step-3.7-Flash-NVFP4:/model:ro \
  -v /path/to/Step-3.7-Flash-MTP-draft:/draft:ro \
  vllm/vllm-openai:stepfun37 \
  /model \
    --served-model-name step3p7 --port 8000 \
    --trust-remote-code --tensor-parallel-size 2 --enable-expert-parallel \
    --quantization modelopt --kv-cache-dtype fp8 \
    --max-model-len 262144 --gpu-memory-utilization 0.92 --async-scheduling \
    --speculative-config '{"method":"mtp","model":"/draft","num_speculative_tokens":1}'

JSON for --speculative-config must have no spaces (brace-expansion safety). num_speculative_tokens: 1 (K=1) is the sweet spot — see below.
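
If you generate the launch command from a script, one way to get a space-free JSON string is Python's `json.dumps` with compact separators. This is a small sketch; the dictionary mirrors the flag value shown in the command above.

```python
import json

spec_config = {"method": "mtp", "model": "/draft", "num_speculative_tokens": 1}

# separators=(",", ":") drops the spaces json.dumps inserts by default.
spec_arg = json.dumps(spec_config, separators=(",", ":"))
print(spec_arg)
```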

Benchmarks (2× RTX PRO 6000 Blackwell, SM120, TP=2)

Measured on the NVFP4 base + this draft, K=1, vs. NVFP4 with speculation off. per_req = decode tok/s a single user feels (prefill excluded). Acceptance ≈ 0.80 in production traffic.

Single-stream decode (short context):

workload	base (tok/s)	+ MTP K=1 (tok/s)	speedup	accept
free-form	106.8	125.5	+17.5%	0.77
code	106.7	133.7	+25.3%	0.88
Japanese	107.0	129.3	+20.9%	0.80
tool-call	106.9	135.4	+26.6%	0.90
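
As a rough cross-check of the table, the sketch below recomputes the speedup column and backs out how much more a speculative step would have to cost than a base step. It assumes a simple model that is not stated in the original: with K=1, each step emits 1 + accept tokens on average, so step cost ratio = (1 + accept) / speedup. Recomputed percentages can differ from the table in the last digit because the tok/s inputs are already rounded.

```python
# (workload, base tok/s, MTP K=1 tok/s, acceptance), copied from the table above.
ROWS = [
    ("free-form", 106.8, 125.5, 0.77),
    ("code", 106.7, 133.7, 0.88),
    ("Japanese", 107.0, 129.3, 0.80),
    ("tool-call", 106.9, 135.4, 0.90),
]


def speedup(base, mtp):
    """Throughput ratio of the MTP run over the base run."""
    return mtp / base


def implied_step_cost(base, mtp, accept):
    """Step cost ratio under the assumed model: tokens per step = 1 + accept."""
    return (1.0 + accept) / speedup(base, mtp)


for name, base, mtp, accept in ROWS:
    gain = speedup(base, mtp) - 1.0
    cost = implied_step_cost(base, mtp, accept)
    print(f"{name:10s} speedup={gain:+.1%} implied_step_cost={cost:.2f}x")
```

Under that assumption, all four rows imply a speculative step costing roughly 1.5× a base step, which is one way to read why higher acceptance translates into a larger net win.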

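Decode speedup over the speculation-off base, by context length and concurrency C (number of concurrent requests). At C=1 and C=2 the speedup grows with context length (longer KV → base is more memory-bound → bigger speculative win); at C=4 and C=8 the trend is not monotonic, and the 128K, C=8 cell is empty: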

context	C=1	C=2	C=4	C=8
1K	+20%	+8%	+32%	+34%
8K	+22%	+24%	+25%	+44%
32K	+22%	+26%	+20%	+17%
128K	+28%	+33%	+38%	—

Net-positive across the whole concurrency range we tested (MoE stays memory-bound to high batch). Best K: K=1 (K=2/K=3 lose to draft cost — later positions have lower acceptance and add forward cost). NaN-free on SM120 (Gate0 5/5).
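
The K trade-off above can be illustrated with a toy model. Everything here other than the 0.80 first-position acceptance is hypothetical: the later-position acceptance values, the per-draft cost, and the assumption that step cost grows linearly with K are chosen only to show the shape of the argument, not measured on this model.

```python
def expected_tokens_per_step(accept_by_position):
    """1 guaranteed token plus each draft position that survives all earlier ones."""
    total, survive = 1.0, 1.0
    for a in accept_by_position:
        survive *= a
        total += survive
    return total


def modeled_speedup(accept_by_position, cost_per_draft):
    """Tokens per step divided by an assumed step cost of 1 + K * cost_per_draft."""
    k = len(accept_by_position)
    return expected_tokens_per_step(accept_by_position) / (1.0 + k * cost_per_draft)


ACCEPT = [0.80, 0.60, 0.40]  # 0.80 is from the text; 0.60 and 0.40 are hypothetical
COST_PER_DRAFT = 0.5         # hypothetical relative cost of each extra draft position

results = {k: modeled_speedup(ACCEPT[:k], COST_PER_DRAFT) for k in (1, 2, 3)}
best_k = max(results, key=results.get)
for k, s in results.items():
    print(f"K={k}: modeled speedup {s:.2f}x")
print("best K under these assumptions:", best_k)
```

With falling acceptance at later positions, each extra draft token adds less expected output than it adds cost, so the modeled optimum stays at a small K.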

How it was built (reproducible)

The draft is not retrained — it's the original StepFun MTP layers, extracted verbatim:

From stepfun-ai/Step-3.7-Flash (BF16), take 52 tensors: the 51 tensors of model.layers.{45,46,47}.* (the 3 nextn layers, dense-MLP, 17 tensors each) plus model.embed_tokens.weight. They all live in one shard (model-00024.safetensors).
Keep the original BF16 weight names — vLLM's Step3p5MTP loader does its own renaming (.transformer. strip, shared_head.output→head, .mtp_block. insert).
config.json = the BF16 original config (NOT the NVFP4 one): its per-layer arrays (layer_types, partial_rotary_factors, …) are length 48 and cover the MTP layer indices 45-47. Strip quantization_config so the draft loads as BF16.
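
The three steps above can be sketched as follows. This is not the repo's build_draft.py; the function names are illustrative, the tensor I/O part assumes the `safetensors` and `torch` packages are installed, and the example tensor names used to exercise the filter are hypothetical.

```python
import re

NEXTN_LAYERS = (45, 46, 47)  # MTP layer indices named in the steps above
_LAYER_RE = re.compile(r"^model\.layers\.(\d+)\.")


def is_draft_tensor(name: str) -> bool:
    """True for the embedding table and for any tensor of a nextn layer."""
    if name == "model.embed_tokens.weight":
        return True
    m = _LAYER_RE.match(name)
    return bool(m) and int(m.group(1)) in NEXTN_LAYERS


def make_draft_config(bf16_config: dict) -> dict:
    """Copy the BF16 config and strip quantization_config if it is present."""
    cfg = dict(bf16_config)
    cfg.pop("quantization_config", None)
    return cfg


def extract_draft(shard_path: str, out_path: str) -> int:
    """Copy matching tensors under their original names; returns how many were kept."""
    # Imported here so the helpers above work without safetensors/torch installed.
    from safetensors import safe_open
    from safetensors.torch import save_file

    tensors = {}
    with safe_open(shard_path, framework="pt") as f:
        for name in f.keys():
            if is_draft_tensor(name):
                tensors[name] = f.get_tensor(name)
    save_file(tensors, out_path, metadata={"format": "pt"})
    return len(tensors)
```

A call such as `extract_draft("model-00024.safetensors", "model.safetensors")` would be expected to report 52 tensors if the shard layout matches the description; the output file name here is a placeholder.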

Full scripts + benchmark harness: GitHub repo (build_draft.py, launch_mtp.sh, eval_mtp.py, bench_matrix.py).

License & attribution

Apache-2.0, inherited from the base model stepfun-ai/Step-3.7-Flash. These are StepFun's weights, redistributed unchanged (only re-sharded/re-packaged as a draft). All credit for the model and the MTP layers goes to StepFun.

Model size: 3B params (Safetensors, tensor type BF16)
Model tree for Hikari07jp/Step-3.7-Flash-MTP-draft: base model stepfun-ai/Step-3.7-Flash