Instructions to use thoughtworks/MiniMax-M2.5-Eagle3 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use thoughtworks/MiniMax-M2.5-Eagle3 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="thoughtworks/MiniMax-M2.5-Eagle3")

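# Load model directly
# Note: LlamaForCausalLMEagle3 appears to be the architecture name declared by this
# checkpoint. If your installed transformers version does not export it, this import fails.
from transformers import AutoTokenizer, LlamaForCausalLMEagle3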

tokenizer = AutoTokenizer.from_pretrained("thoughtworks/MiniMax-M2.5-Eagle3")
model = LlamaForCausalLMEagle3.from_pretrained("thoughtworks/MiniMax-M2.5-Eagle3")

Local Apps

vLLM

How to use thoughtworks/MiniMax-M2.5-Eagle3 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "thoughtworks/MiniMax-M2.5-Eagle3"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "thoughtworks/MiniMax-M2.5-Eagle3",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
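
The same request can be sent from Python. This is a minimal sketch using only the standard library, assuming the server started above is reachable at `http://localhost:8000` and exposes the OpenAI-compatible `/v1/completions` route shown in the curl call. The helper names `build_completion_request` and `complete` are illustrative, not part of vLLM.

```python
import json
import urllib.request


def build_completion_request(base_url, model, prompt, max_tokens=512, temperature=0.5):
    """Build (but do not send) a POST request mirroring the curl example."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url, model, prompt, timeout=60, **kwargs):
    """Send the request and return the first completion's text.

    Assumes the OpenAI-style response shape: {"choices": [{"text": ...}]}.
    """
    request = build_completion_request(base_url, model, prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With a server running, `complete("http://localhost:8000", "thoughtworks/MiniMax-M2.5-Eagle3", "Once upon a time,")` would return the generated text. Pointing `base_url` at port 30000 should work the same way for the SGLang server, since it serves the same route.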

Use Docker

docker model run hf.co/thoughtworks/MiniMax-M2.5-Eagle3

SGLang

How to use thoughtworks/MiniMax-M2.5-Eagle3 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "thoughtworks/MiniMax-M2.5-Eagle3" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "thoughtworks/MiniMax-M2.5-Eagle3",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "thoughtworks/MiniMax-M2.5-Eagle3" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "thoughtworks/MiniMax-M2.5-Eagle3",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'


Minimax 2.7

by dustinogle1 - opened Apr 12

Discussion

dustinogle1

Apr 12

Does this work with MiniMax M2.7? And is it possible to use it with something like LM Studio on MLX?

Or is it vLLM only? If I can run vLLM on a Mac, would that work? I know there is an xmlx project built on vLLM.

jpsequeira

Apr 13

It does work with M2.7, with an acceptance rate of around 25% and a speedup of roughly 16-17%. I don't know about the rest.

jpsequeira

Apr 13

This is actually much better than I thought...
I'm getting upwards of a 50% uptick in generation throughput. Still investigating the right balance.
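
For anyone trying to reason about that balance, here is a rough back-of-the-envelope sketch. It uses the standard speculative-decoding estimate for a single chain of `k` draft tokens, assuming each draft token is accepted independently with probability `alpha` and that one draft step costs a fraction `draft_cost` of a target-model step. All three names and the independence assumption are mine, not measurements from this thread, and a server's reported "acceptance rate" may be defined differently, so treat the numbers as illustrative only.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model verification step.

    Assumes a chain of k draft tokens, each accepted independently with
    probability alpha; the target model always contributes one extra token.
    This is the geometric sum 1 + alpha + ... + alpha**k.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha: float, k: int, draft_cost: float) -> float:
    """Tokens per step divided by the relative cost of one step.

    One step = k draft forward passes (each costing draft_cost target
    passes, an assumed ratio) plus one target verification pass.
    """
    return expected_tokens_per_step(alpha, k) / (1.0 + k * draft_cost)


def best_draft_length(alpha: float, draft_cost: float, max_k: int = 8) -> int:
    """Draft length in [0, max_k] that maximizes the estimated speedup."""
    return max(range(max_k + 1), key=lambda k: estimated_speedup(alpha, k, draft_cost))
```

Under this simplified model, a low acceptance probability favors short drafts and a higher one favors longer drafts, which is one way to frame the tuning. It ignores tree drafting, batching, and memory-bandwidth effects, so it will not reproduce measured throughput.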

scottgl

Apr 14

It's possible to fine-tune the M2.7 EAGLE3 head starting from the M2.5 one, which would take significantly less time than training it from scratch.

scottgl

Apr 15

> It does work with M2.7 with an acceptance rate of around 25%, roughly 16/17% speedup. Don't know about the rest.

Did you get this result using the standard https://huggingface.co/MiniMaxAI/MiniMax-M2.7, or are you able to run it with a quantized model, such as NVFP4?

jpsequeira

Apr 15

Base. I tried NVFP4, but at the moment with voipmonitor:cu130, NVFP4 requires Spec V2, which limits top_k to 1, and this hurts the draft model's acceptance rate.
