Running command?

#2
by bhat1 - opened

What settings are you using to get the best tokens/s?

Running with --enforce-eager I was only getting 8 tokens/s.


What's your hardware and your startup command?

RTX Pro 6000 Blackwell.

vllm serve Sehyo/Qwen3.5-122B-A10B-NVFP4 --trust-remote-code --enforce-eager

If I don't use --enforce-eager, I get CUDA graph errors.


I can't get this to work with 6000 Pro Blackwell + CUDA 13.0 + vllm/vllm-openai:nightly + huggingface/transformers.git. Could you share more details on your environment? Thanks.

vllm | (EngineCore_DP0 pid=126) 2026-02-28 10:52:44,381 - WARNING - autotuner.py:496 - flashinfer.jit: [Autotuner]: Skipping tactic <flashinfer.fused_moe.core.get_cutlass_fused_moe_module..MoERunner object at 0x7f1fe45ed9a0> 14, due to failure while profiling: [TensorRT-LLM][ERROR] Assertion failed: Failed to initialize cutlass TMA WS grouped gemm. Error: Error Internal (/workspace/build/aot/generated/cutlass_instantiations/120/gemm_grouped/120/cutlass_kernel_file_gemm_grouped_sm120_M128_BS_group2.generated.cu:60)

I got it running just fine (up to 113 t/s generation) with this in my bash script:

cd ~/vllm/vllm-p2p
source ~/vllm/vllm-p2p/.venv/bin/activate
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0,1
export HF_HOME=~/.cache/huggingface
export HUGGINGFACE_HUB_CACHE=$HF_HOME/hub
export VLLM_WORKER_MULTIPROC_METHOD=spawn
export SAFETENSORS_FAST_GPU=1
export VLLM_DISABLE_PYNCCL=1
export NCCL_IB_DISABLE=1
export VLLM_SLEEP_WHEN_IDLE=1
export OMP_NUM_THREADS=12
export VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
export VLLM_NVFP4_GEMM_BACKEND=cutlass
export VLLM_USE_FLASHINFER_MOE_FP4=0


python -m vllm.entrypoints.openai.api_server \
  --model Sehyo/Qwen3.5-122B-A10B-NVFP4 \
  --served-model-name "Qwen3.5-122B" \
  --host 0.0.0.0 \
  --port 8000 \
  --trust-remote-code \
  --enable-prefix-caching \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.94 \
  --max-model-len auto \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 32 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --enable-chunked-prefill \
  --swap-space=32 \
  --tool-call-parser qwen3_coder 

FWIW, you may not need all of these parameters; this is adapted from my MiniMax M2.5 run config. "vllm-p2p" refers to https://huggingface.co/lukealonso/MiniMax-M2.5-NVFP4/discussions/1#6990a4203cb24455b38036b8, which improves Blackwell P2P.

Edit: 0.95 memory utilization / 16 seqs works for TP=1.
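Once the server is up, a quick smoke test against the OpenAI-compatible endpoint (a sketch — it assumes the host, port, and served model name from the command above; adjust if yours differ):

```shell
# Assumes the vLLM server above is listening on localhost:8000 and was
# started with --served-model-name "Qwen3.5-122B".
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen3.5-122B", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 32}'
```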

What's the max context you can get on one RTX 6000 Pro?

According to the vLLM startup log, GPU KV cache capacity is about 138k tokens with default settings, 150k with --language-model-only, or 275k with --kv-cache-dtype fp8_e4m3.

I didn't test behavior near the limits of the cache, prompting with images, or precision. Enabling MTP on the updated version caused tool calls to start failing immediately (I never seem to get anything out of those settings anyway).
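The roughly 2x jump with the fp8 KV cache lines up with halving the per-token cache cost. A back-of-envelope sketch (the layer/head/dim numbers below are illustrative placeholders, not the real Qwen3.5-122B config — read the actual values from the model's config.json):

```shell
# Per-token KV bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes_per_elem.
# All model numbers here are assumptions for illustration, not the actual config.
layers=94; kv_heads=4; head_dim=128
bytes=2                                  # 2 for a bf16 KV cache, 1 for fp8
per_token=$((2 * layers * kv_heads * head_dim * bytes))
free_bytes=$((40 * 1024 * 1024 * 1024))  # assume ~40 GiB left over for KV cache
echo "per-token: $per_token bytes, capacity: $((free_bytes / per_token)) tokens"
```

Halving `bytes` to 1 (fp8) doubles the token capacity, which is consistent with the 138k vs. 275k figures above.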

Thank you. Based on the information above, I successfully deployed it using Docker.
I'm using a Blackwell RTX Pro 6000 96GB Workstation.

Generation throughput: 80–126 tokens/s
Prompt throughput: 400–1,300 tokens/s

Here is my successful Docker Compose file for reference.

services:
  vllm:
    image: vllm/vllm-openai:nightly
    container_name: vllm
    restart: unless-stopped
    ports:
      - "8000:8000"
    volumes:
      - /mnt/data/vllm:/root/.cache/huggingface
      - ./no_thought.jinja:/app/no_thought.jinja
    environment:
      - OMP_NUM_THREADS=4
      - PYTORCH_ALLOC_CONF=expandable_segments:True
      - VLLM_WORKER_MULTIPROC_METHOD=spawn
      - SAFETENSORS_FAST_GPU=1
      - VLLM_DISABLE_PYNCCL=1
      - NCCL_IB_DISABLE=1
      - VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
      - VLLM_NVFP4_GEMM_BACKEND=cutlass
      - VLLM_USE_FLASHINFER_MOE_FP4=0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              capabilities: [gpu]
    command: >
      Sehyo/Qwen3.5-122B-A10B-NVFP4
      --trust-remote-code
      --enable-prefix-caching
      --kv-cache-dtype fp8
      --max-model-len auto
      --max-num-seqs 4
      --gpu-memory-utilization 0.94
      --reasoning-parser qwen3
      --enable-auto-tool-choice
      --tool-call-parser hermes
      --enable-chunked-prefill
      --swap-space=32
      --chat-template /app/no_thought.jinja
    networks:
      - docker-net

networks:
  docker-net:
    external: true
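To bring it up and watch for the KV-cache line in the startup log (a sketch — assumes the compose file is saved as docker-compose.yml in the current directory; the exact log wording varies by vLLM version):

```shell
docker compose up -d
# Follow the logs until the cache-size line appears; wording varies by version.
docker logs -f vllm 2>&1 | grep -m1 -i "KV cache"
```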
