Running command?

#2
by bhat1 - opened

What settings are you using to get the best tokens/s?

Running with --enforce-eager I was only getting 8 tokens/s.


What's your hardware and your startup command?

RTX Pro 6000 Blackwell.

vllm serve Sehyo/Qwen3.5-122B-A10B-NVFP4 --trust-remote-code --enforce-eager

If I don't use --enforce-eager, I get CUDA graph errors.


I can't get this to work with 6000 Pro Blackwell + CUDA 13.0 + vllm/vllm-openai:nightly + huggingface/transformers.git. Could you share more details on your environment? Thanks.

vllm | (EngineCore_DP0 pid=126) 2026-02-28 10:52:44,381 - WARNING - autotuner.py:496 - flashinfer.jit: [Autotuner]: Skipping tactic <flashinfer.fused_moe.core.get_cutlass_fused_moe_module..MoERunner object at 0x7f1fe45ed9a0> 14, due to failure while profiling: [TensorRT-LLM][ERROR] Assertion failed: Failed to initialize cutlass TMA WS grouped gemm. Error: Error Internal (/workspace/build/aot/generated/cutlass_instantiations/120/gemm_grouped/120/cutlass_kernel_file_gemm_grouped_sm120_M128_BS_group2.generated.cu:60)

I got it running just fine (up to 113 t/s generation) with this in my bash script:

cd ~/vllm/vllm-p2p
source ~/vllm/vllm-p2p/.venv/bin/activate
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1
export HF_HOME=~/.cache/huggingface
export HUGGINGFACE_HUB_CACHE=$HF_HOME/hub
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export SAFETENSORS_FAST_GPU=1
export VLLM_DISABLE_PYNCCL=1
export NCCL_IB_DISABLE=1
export VLLM_SLEEP_WHEN_IDLE=1
export OMP_NUM_THREADS=12
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export VLLM_NVFP4_GEMM_BACKEND=cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0


python -m vllm.entrypoints.openai.api_server \
  --model Sehyo/Qwen3.5-122B-A10B-NVFP4 \
  --served-model-name "Qwen3.5-122B" \
  --host 0.0.0.0 \
  --port 8000 \
  --trust-remote-code \
  --enable-prefix-caching \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.94 \
  --max-model-len auto \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 32 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --enable-chunked-prefill \
  --swap-space=32 \
  --tool-call-parser qwen3_coder 

FWIW, you may not need all of these parameters; this is adapted from my MiniMax M2.5 run config. "vllm-p2p" refers to https://huggingface.co/lukealonso/MiniMax-M2.5-NVFP4/discussions/1#6990a4203cb24455b38036b8, which improves Blackwell P2P.

Edit: 0.95 memory utilization / 16 seqs works for TP=1.
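Once the server is up, a quick smoke test against the OpenAI-compatible endpoint (a sketch — it assumes the host, port, and served model name from the command above; adjust if yours differ):

```shell
# Assumes the vLLM server above is listening on localhost:8000 and was
# started with --served-model-name "Qwen3.5-122B".
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3.5-122B", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 32}'
```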

What's the max context you can get on one RTX 6000 Pro?

According to the vLLM startup log, GPU KV cache capacity is about 138k tokens with default settings, 150k with --language-model-only, or 275k with --kv-cache-dtype fp8_e4m3.

I didn't test behavior near the limits of the cache, prompting with images, or precision. Enabling MTP on the updated version caused tool calls to start failing immediately (I never seem to get anything out of those settings anyway).
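The roughly 2x jump with the fp8 KV cache lines up with halving the per-token cache cost. A back-of-envelope sketch (the layer/head/dim numbers below are illustrative placeholders, not the real Qwen3.5-122B config — read the actual values from the model's config.json):

```shell
# Per-token KV bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_elem.
# All model numbers here are assumptions for illustration, not the actual config.
layers=94; kv_heads=4; head_dim=128
bytes=2                                  # 2 for a bf16 KV cache, 1 for fp8
per_token=$((2 * layers * kv_heads * head_dim * bytes))
free_bytes=$((40 * 1024 * 1024 * 1024))  # assume ~40 GiB left over for KV cache
echo "per-token: $per_token bytes, capacity: $((free_bytes / per_token)) tokens"
```

Halving `bytes` to 1 (fp8) doubles the token capacity, which is consistent with the 138k vs. 275k figures above.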

Thank you. Based on the information above, I successfully deployed it using Docker.
I'm using a Blackwell RTX Pro 6000 96GB Workstation.

Generation throughput: 80–126 tokens/s
Prompt throughput: 400–1,300 tokens/s

Here is my successful Docker Compose file for reference.

services:
  vllm:
    image: vllm/vllm-openai:nightly
    container_name: vllm
    restart: unless-stopped
    ports:
      - "8000:8000"
    volumes:
      - /mnt/data/vllm:/root/.cache/huggingface
      - ./no_thought.jinja:/app/no_thought.jinja
    environment:
      - OMP_NUM_THREADS=4
      - PYTORCH_ALLOC_CONF=expandable_segments:True
      - VLLM_WORKER_MULTIPROC_METHOD=spawn
      - SAFETENSORS_FAST_GPU=1
      - VLLM_DISABLE_PYNCCL=1
      - NCCL_IB_DISABLE=1
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - VLLM_NVFP4_GEMM_BACKEND=cutlass
      - VLLM_USE_FLASHINFER_MOE_FP4=0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
    command: >
      Sehyo/Qwen3.5-122B-A10B-NVFP4
      --trust-remote-code
      --enable-prefix-caching
      --kv-cache-dtype fp8
      --max-model-len auto
      --max-num-seqs 4
      --gpu-memory-utilization 0.94
      --reasoning-parser qwen3
      --enable-auto-tool-choice
      --tool-call-parser hermes
      --enable-chunked-prefill
      --swap-space=32
      --chat-template /app/no_thought.jinja
    networks:
      - docker-net

networks:
  docker-net:
    external: true
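To bring it up and watch for the KV-cache line in the startup log (a sketch — assumes the compose file is saved as docker-compose.yml in the current directory; the exact log wording varies by vLLM version):

```shell
docker compose up -d
# Follow the logs until the cache-size line appears; wording varies by version.
docker logs -f vllm 2>&1 | grep -m1 -i "KV cache"
```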
