Step-3.7-Flash MTP draft (for the NVFP4 checkpoint)

A tiny Multi-Token-Prediction (MTP / nextn) draft for stepfun-ai/Step-3.7-Flash-NVFP4, so you can run speculative decoding on the NVFP4 checkpoint in vLLM.

Why this exists: the official Step-3.7-Flash-NVFP4 checkpoint declares num_nextn_predict_layers: 3 in its config but ships zero MTP weights. The 3 nextn layers were dropped during quantization, and the per-layer config arrays were truncated to 45 entries, so indexing layers 45-47 would raise an IndexError even if the weights were present. The BF16 and FP8 releases keep the MTP weights, but the NVFP4 release, which is the smallest and the SM120-friendly one, cannot do speculative decoding out of the box. This repo is the missing piece: the 3 MTP layers extracted from the BF16 release, kept in BF16 (they're tiny), packaged as a vLLM-loadable draft.
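To see the gap described above on a given checkpoint, a minimal sketch like the following can help. It assumes the config exposes num_hidden_layers (a common Hugging Face key, expected to be 45 here) and that the tensor names come from the checkpoint's weight index; the function names and the toy tensor name are hypothetical, not part of vLLM or the StepFun release.

```python
import re

_LAYER_RE = re.compile(r"^model\.layers\.(\d+)\.")

def missing_nextn_layers(config: dict, weight_names) -> list[int]:
    """Return nextn layer indices the config declares but no tensor backs."""
    n_main = config["num_hidden_layers"]  # assumed key; 45 for this model
    n_nextn = config.get("num_nextn_predict_layers", 0)
    present = set()
    for name in weight_names:
        m = _LAYER_RE.match(name)
        if m:
            present.add(int(m.group(1)))
    return [i for i in range(n_main, n_main + n_nextn) if i not in present]

def truncated_arrays(config: dict,
                     keys=("layer_types", "partial_rotary_factors")) -> list[str]:
    """Return per-layer arrays too short to index the nextn layers."""
    need = config["num_hidden_layers"] + config.get("num_nextn_predict_layers", 0)
    return [k for k in keys if k in config and len(config[k]) < need]

# Toy stand-in for the NVFP4 checkpoint (tensor name and values are made up).
cfg = {"num_hidden_layers": 45, "num_nextn_predict_layers": 3,
       "layer_types": ["x"] * 45}
names = [f"model.layers.{i}.self_attn.q_proj.weight" for i in range(45)]
print(missing_nextn_layers(cfg, names))  # [45, 46, 47]
print(truncated_arrays(cfg))             # ['layer_types']
```

On a real checkpoint the names would come from the weight_map of the safetensors index file rather than from a generated list.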

  • ~5.9 GB, BF16. The target stays NVFP4; pairing a BF16 draft with an NVFP4 target is fine, and the draft is small.
  • Lossless in the speculative sense: vLLM's rejection sampling provably matches the target distribution; at temperature=0 it follows the target's greedy tokens.
  • Drop-in: point vLLM's --speculative-config at this directory.
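The "lossless" bullet above rests on the standard speculative-sampling acceptance rule. The sketch below illustrates that rule for a single drafted token in plain Python; it is not vLLM's implementation, and the function name and toy distributions are assumptions made for illustration.

```python
import random

def verify_draft_token(p_target, q_draft, draft_token, rng=random):
    """Accept or replace one drafted token so the result follows p_target.

    p_target, q_draft: probability lists over the same vocabulary.
    draft_token: index that was sampled from q_draft.
    Returns (token, accepted).
    """
    p, q = p_target[draft_token], q_draft[draft_token]
    # Accept the draft with probability min(1, p/q).
    if q > 0 and rng.random() < min(1.0, p / q):
        return draft_token, True
    # On rejection, resample from the residual max(0, p - q), renormalized.
    residual = [max(0.0, pt - qd) for pt, qd in zip(p_target, q_draft)]
    weights = residual if sum(residual) > 0.0 else p_target
    token = rng.choices(range(len(p_target)), weights=weights)[0]
    return token, False

# Greedy-like target (all mass on token 1): the output is always token 1,
# whatever the draft proposed.
print(verify_draft_token([0.0, 1.0, 0.0], [0.6, 0.3, 0.1], draft_token=0))
# (1, False)
```

With a one-hot target, a wrong draft has acceptance probability 0 and the residual puts all its mass on the target's token, which is the temperature=0 behaviour described in the bullet.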

Usage (vLLM: the stepfun37 image, or any build that includes Step3p5MTP)

The draft is auto-routed to vLLM's native Step3p5MTP + Step3p5MTPProposer because its config is model_type: step3p7 with num_nextn_predict_layers > 0.

docker run -d --gpus all --ipc=host --shm-size=64g --network host \
  -v /path/to/Step-3.7-Flash-NVFP4:/model:ro \
  -v /path/to/Step-3.7-Flash-MTP-draft:/draft:ro \
  vllm/vllm-openai:stepfun37 \
  /model \
    --served-model-name step3p7 --port 8000 \
    --trust-remote-code --tensor-parallel-size 2 --enable-expert-parallel \
    --quantization modelopt --kv-cache-dtype fp8 \
    --max-model-len 262144 --gpu-memory-utilization 0.92 --async-scheduling \
    --speculative-config '{"method":"mtp","model":"/draft","num_speculative_tokens":1}'

The JSON passed to --speculative-config must contain no spaces (brace-expansion safety). num_speculative_tokens: 1 (K=1) is the sweet spot; see the benchmarks below.
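If the launch command is generated by a script, one way to get space-free JSON is to serialize it with compact separators. This is a small sketch; the helper name is hypothetical, and the keys are the ones used in the command above.

```python
import json

def speculative_config_arg(draft_path: str, k: int = 1) -> str:
    """Build the --speculative-config value as compact JSON with no spaces."""
    cfg = {"method": "mtp", "model": draft_path, "num_speculative_tokens": k}
    # separators=(",", ":") drops the spaces json.dumps inserts by default.
    return json.dumps(cfg, separators=(",", ":"))

print(speculative_config_arg("/draft"))
# {"method":"mtp","model":"/draft","num_speculative_tokens":1}
```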

Benchmarks (2× RTX PRO 6000 Blackwell, SM120, TP=2)

Measured on the NVFP4 base + this draft at K=1, against the same NVFP4 base with speculation off. per_req is the decode rate in tok/s that a single request sees (prefill excluded). Acceptance ≈ 0.80 in production traffic.

Single-stream decode (short context):

workload    base (tok/s)   + MTP K=1 (tok/s)   speedup   accept
free-form   106.8          125.5               +17.5%    0.77
code        106.7          133.7               +25.3%    0.88
Japanese    107.0          129.3               +20.9%    0.80
tool-call   106.9          135.4               +26.6%    0.90
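The speedup column can be recomputed from the two throughput columns, and the acceptance column gives a rough feel for what a speculative step costs. The sketch below does both. The step-cost estimate rests on an assumption that is not stated in the benchmark: that each K=1 step emits 1 + accept tokens on average. Treat it as a back-of-the-envelope reading, not a measurement.

```python
# (base tok/s, MTP K=1 tok/s, acceptance), copied from the table above.
ROWS = {
    "free-form": (106.8, 125.5, 0.77),
    "code":      (106.7, 133.7, 0.88),
    "Japanese":  (107.0, 129.3, 0.80),
    "tool-call": (106.9, 135.4, 0.90),
}

def speedup_pct(base: float, mtp: float) -> float:
    """Relative decode speedup of MTP over the base, in percent."""
    return (mtp / base - 1.0) * 100.0

def implied_step_cost(base: float, mtp: float, accept: float) -> float:
    """Time of one speculative step relative to one plain decode step.

    Assumption: a K=1 step yields 1 + accept tokens on average, so
    mtp = (1 + accept) / t_spec and base = 1 / t_base.
    """
    return (1.0 + accept) * base / mtp

for name, (base, mtp, accept) in ROWS.items():
    print(f"{name:10s} speedup {speedup_pct(base, mtp):+.1f}%  "
          f"step cost x{implied_step_cost(base, mtp, accept):.2f}")
```

Because the throughputs in the table are rounded to one decimal, a recomputed percentage can differ from the listed one by about 0.1 point.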

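Decode speedup by context length and concurrency C. At C=1 and C=2 the speedup grows with context length (longer KV → base is more memory-bound → bigger speculative win); at C=4 and C=8 the trend is less regular: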

context   C=1    C=2    C=4    C=8
1K        +20%   +8%    +32%   +34%
8K        +22%   +24%   +25%   +44%
32K       +22%   +26%   +20%   +17%
128K      +28%   +33%   +38%

Speculation was net-positive across the whole concurrency range we tested (the MoE stays memory-bound up to high batch sizes). The best setting is K=1: K=2 and K=3 lose to draft cost, because later positions have lower acceptance and each one adds forward cost. Output was NaN-free on SM120 (Gate0 5/5).
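The K=1 result can be reasoned about with a simple throughput model: a drafted position only counts if every earlier position was accepted, while each extra position adds cost to the step. The sketch below uses that model. The per-position acceptances and the per-draft cost are HYPOTHETICAL values chosen for illustration; they were not measured in these benchmarks.

```python
def expected_tokens_per_step(accept_by_pos):
    """One token from the target plus every drafted position that survives.

    accept_by_pos[k] is the acceptance of position k given that all
    earlier positions were accepted.
    """
    total, alive = 1.0, 1.0
    for a in accept_by_pos:
        alive *= a
        total += alive
    return total

def relative_throughput(accept_by_pos, cost_per_draft):
    """Throughput relative to plain decoding, with step cost 1 + K * cost."""
    k = len(accept_by_pos)
    return expected_tokens_per_step(accept_by_pos) / (1.0 + k * cost_per_draft)

def best_k(accept_by_pos, cost_per_draft):
    """Return (best K, {K: relative throughput}) for K = 0..len(accept_by_pos)."""
    scores = {k: relative_throughput(accept_by_pos[:k], cost_per_draft)
              for k in range(len(accept_by_pos) + 1)}
    return max(scores, key=scores.get), scores

# HYPOTHETICAL inputs: acceptance falls at later positions, and each
# drafted position costs half a plain decode step.
k, scores = best_k([0.80, 0.55, 0.40], cost_per_draft=0.5)
print(k, {key: round(val, 3) for key, val in scores.items()})
```

Under these made-up inputs K=1 comes out ahead, and the relative throughput drops below 1.0 at K=3. Real values of both inputs depend on the workload and hardware.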

How it was built (reproducible)

The draft is not retrained — it's the original StepFun MTP layers, extracted verbatim:

  1. From stepfun-ai/Step-3.7-Flash (BF16), take 52 tensors: the 51 tensors under model.layers.{45,46,47}.* (the 3 nextn layers, dense-MLP, 17 tensors each) plus model.embed_tokens.weight. They all live in one shard (model-00024.safetensors).
  2. Keep the original BF16 weight names — vLLM's Step3p5MTP loader does its own renaming (.transformer. strip, shared_head.output→head, .mtp_block. insert).
  3. config.json = the BF16 original config (NOT the NVFP4 one): its per-layer arrays (layer_types, partial_rotary_factors, …) are length 48 and cover the MTP layer indices 45-47. Strip quantization_config so the draft loads as BF16.
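The selection and config steps above can be sketched without touching any weights. This is not the repo's build_draft.py: the function names are hypothetical, and the actual tensor copy (reading and writing safetensors files) is left out to keep the example dependency-free.

```python
import copy
import re

NEXTN_LAYERS = (45, 46, 47)
_LAYER_RE = re.compile(r"^model\.layers\.(\d+)\.")

def select_draft_tensors(names):
    """Step 1: keep the nextn layers and the embedding, names unchanged."""
    keep = []
    for name in names:
        m = _LAYER_RE.match(name)
        in_nextn = m is not None and int(m.group(1)) in NEXTN_LAYERS
        if in_nextn or name == "model.embed_tokens.weight":
            keep.append(name)
    return keep

def make_draft_config(bf16_config: dict) -> dict:
    """Step 3: start from the BF16 config and drop quantization_config."""
    cfg = copy.deepcopy(bf16_config)
    cfg.pop("quantization_config", None)  # no-op if the key is absent
    # Per-layer arrays must reach the highest nextn index (47).
    if len(cfg["layer_types"]) <= max(NEXTN_LAYERS):
        raise ValueError("per-layer arrays do not cover the nextn layers")
    return cfg
```

A quick sanity check after selection is that the result has 52 names (3 layers of 17 tensors plus the embedding), matching step 1.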

Full scripts + benchmark harness: GitHub repo (build_draft.py, launch_mtp.sh, eval_mtp.py, bench_matrix.py).

License & attribution

Apache-2.0, inherited from the base model stepfun-ai/Step-3.7-Flash. These are StepFun's weights, redistributed unchanged (only re-sharded/re-packaged as a draft). All credit for the model and the MTP layers goes to StepFun.
