MIMI Pro

MIMI Pro is a 4-billion-parameter AI agent model optimized for structured tool calling and autonomous task execution, designed to run entirely on-device, in the browser, with zero cloud dependencies.

Part of the MIMI Model Family by Mimi Tech AI.

🔬 V1: Experimental Release. This model is fine-tuned for the MIMI Agent's custom tool-calling format. For standard tool calling, the base Qwen3-4B may perform equally well or better with native <tool_call> prompting. V2, with official BFCL scores and Qwen3-native format support, is in development.

Performance

BFCL V4 Benchmark (Partial: single-turn categories only; test counts shown per cell)

| Category | MIMI Pro V1 | Base Qwen3-4B | Notes |
|---|---|---|---|
| Simple Python | 60.8% (400 tests) | 80.0% (20 tests) | Base outperforms |
| Simple Java | 21.0% (100 tests) | 60.0% (20 tests) | Base outperforms |
| Multiple (Sequential) | 57.5% (200 tests) | 75.0% (20 tests) | Base outperforms |
| Parallel | 2.0% (200 tests) | 75.0% (20 tests) | Fine-tune degraded |
| Irrelevance | 90.0% (20 tests) | 100.0% (20 tests) | Both strong |
| Live Simple | n/a | 90.0% (20 tests) | Base only |

โš ๏ธ Important Context: The previously reported "97.7% accuracy" was a training validation metric (token-level accuracy on the training/eval split), not a standardized benchmark score. The table above shows actual BFCL V4 results. We are working on a full official evaluation.

Training Metrics (Internal)

| Metric | Value |
|---|---|
| Training Token Accuracy | 97.66% |
| Eval Token Accuracy | 97.29% |
| Training Loss | 0.084 |
| Parameters | 4.02 billion |
| Quantized Size | 2.3 GB (Q4_K_M) |

Architecture

MIMI Pro is built on Qwen3-4B, fine-tuned with LoRA (rank=64, alpha=128) on 1,610 curated tool-calling examples using Unsloth on NVIDIA DGX Spark.

Key Design Decisions:

  • Custom tool-calling format optimized for the MIMI Agent browser environment
  • 19 tool types covering web search, code execution, file operations, browser automation
  • Trained on NVIDIA DGX Spark (Grace Blackwell GB10, 128 GB unified memory)

Known Limitations of V1:

  • Fine-tuning with aggressive hyperparameters (LoRA r=64, 3 epochs, LR 2e-4) caused some capability degradation vs. the base model, particularly for parallel tool calling
  • The custom {"tool": ..., "parameters": ...} format diverges from Qwen3's native <tool_call> format
  • V2 will address these issues with conservative fine-tuning and Qwen3-native format support

Supported Tools

| Category | Tools |
|---|---|
| 🌐 Web | web_search, browse_url, browser_action |
| 💻 Code | execute_python, create_file, edit_file |
| 🔬 Research | deep_research, generate_document |
| 📁 System | read_file, list_directory, run_terminal |
| 🧠 Reasoning | Multi-step orchestration |
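On the host side, tool names like these are typically wired to handlers through a dispatch table keyed by the `"tool"` field of each call. A minimal sketch, assuming hypothetical stand-in handlers (the real MIMI Agent runtime is not shown in this card):

```python
# Minimal tool dispatcher for MIMI-style tool calls.
# The handlers below are illustrative stand-ins, not the MIMI Agent's
# actual implementations.

def web_search(query, limit=5):
    # Placeholder: a real handler would query a search backend.
    return {"query": query, "results": []}

def read_file(path):
    # Placeholder: a real handler would read from the sandboxed filesystem.
    return {"path": path, "content": ""}

TOOL_REGISTRY = {
    "web_search": web_search,
    "read_file": read_file,
    # ...register the remaining tools the same way
}

def dispatch(call):
    """Execute one {"tool": ..., "parameters": ...} call via the registry."""
    fn = TOOL_REGISTRY.get(call["tool"])
    if fn is None:
        raise ValueError(f"unknown tool: {call['tool']}")
    return fn(**call["parameters"])
```

Unknown tool names raise rather than silently no-op, which makes format drift from the model easy to detect.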

Quick Start

Browser (wllama/WebAssembly)

import { Wllama } from '@wllama/wllama';

const wllama = new Wllama(wasmPaths);
await wllama.loadModelFromUrl(
  'https://huggingface.co/MimiTechAI/mimi-pro/resolve/main/mimi-qwen3-4b-q4km.gguf',
  { n_ctx: 4096 }
);

const response = await wllama.createChatCompletion([
  { role: 'system', content: 'You are MIMI, an AI agent with tool access.' },
  { role: 'user', content: 'Search for the latest AI news and summarize it' }
]);

llama.cpp

./llama-cli -m mimi-qwen3-4b-q4km.gguf \
  -p "<|im_start|>system\nYou are MIMI, an AI agent with tool access.<|im_end|>\n<|im_start|>user\nSearch for the latest AI news<|im_end|>\n<|im_start|>assistant\n" \
  -n 512 --temp 0.6

Python

from llama_cpp import Llama

llm = Llama(model_path="mimi-qwen3-4b-q4km.gguf", n_ctx=4096)
output = llm.create_chat_completion(messages=[
    {"role": "system", "content": "You are MIMI, an AI agent with tool access."},
    {"role": "user", "content": "Search for the latest AI news"},
])
print(output["choices"][0]["message"]["content"])  # model reply (may be a tool call)
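When driving a raw completion endpoint instead of the chat API, the ChatML-style prompt from the llama.cpp example can be assembled programmatically. A minimal sketch; the helper name is ours, not part of any MIMI API:

```python
def build_chatml_prompt(messages):
    """Build a Qwen/ChatML prompt string from role/content message dicts."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    # Leave the assistant turn open so the model generates the reply.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = build_chatml_prompt([
    {"role": "system", "content": "You are MIMI, an AI agent with tool access."},
    {"role": "user", "content": "Search for the latest AI news"},
])
```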

Output Format

MIMI Pro V1 uses a custom format (V2 will support Qwen3-native <tool_call> format):

{"tool": "web_search", "parameters": {"query": "latest AI news March 2026", "limit": 5}}

The MIMI Model Family

| Model | Parameters | Size | Target Device | Status |
|---|---|---|---|---|
| MIMI Nano | 0.6B | ~400 MB | Any device, IoT | 🔜 Coming |
| MIMI Small | 1.7B | ~1.0 GB | Mobile & tablets | 🔜 Coming |
| MIMI Pro | 4.02B | 2.3 GB | Desktop & laptop | ✅ Available |
| MIMI Max | 8B | ~4.5 GB | Workstations | 🔜 Coming |

All models share the same tool-calling format, are quantized to GGUF Q4_K_M, and run in the browser via WebAssembly.

Training Details

method: LoRA (PEFT) via Unsloth
base_model: Qwen/Qwen3-4B
lora_rank: 64
lora_alpha: 128
lora_dropout: 0.05
target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]
learning_rate: 2.0e-04
epochs: 3
effective_batch_size: 8
max_seq_length: 2048
optimizer: adamw_8bit
precision: bf16
gradient_checkpointing: true
packing: true
dataset: 1,610 curated tool-calling examples (178K tokens)
hardware: NVIDIA DGX Spark (GB10 Grace Blackwell, 128 GB unified memory)
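The config above roughly corresponds to the following Unsloth PEFT setup. This is a sketch reconstructed from the listed hyperparameters; dataset loading and trainer wiring are omitted, and exact argument values are an approximation of the real run:

```python
# Approximate fine-tuning setup implied by the config above (Unsloth PEFT).
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="Qwen/Qwen3-4B",
    max_seq_length=2048,
    dtype=None,          # auto-selects bf16 on supported hardware
)
model = FastLanguageModel.get_peft_model(
    model,
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    use_gradient_checkpointing=True,
)
```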

Why MIMI?

  • 🔒 Privacy First: your data never leaves your device. Period.
  • 💰 Zero Cost: no API keys, no subscriptions, no per-token billing.
  • ⚡ Fast: runs at native speed via WebAssembly, no server round-trips.
  • 🌍 Works Offline: once downloaded, no internet required.
  • 🔧 Tool Native: purpose-built for autonomous tool calling.

Limitations

  • V1 uses a custom tool-calling format (not Qwen3-native <tool_call>)
  • Parallel tool calling (multiple simultaneous calls) is degraded vs. base model
  • Context window: 4,096 tokens (training config). Base architecture supports 32K.
  • Requires ~3 GB RAM for inference in browser.
  • Q4_K_M quantization trades minimal quality for 3.5x size reduction.

Roadmap

  • V1: custom format, 19 tools, browser-optimized (current release)
  • V2: Qwen3-native <tool_call> format, official BFCL V4 scores, conservative fine-tuning
  • Model Family: Nano (0.6B), Small (1.7B), Max (8B) releases
  • Multi-Turn: agentic conversation chains with tool result feedback

About Mimi Tech AI

Mimi Tech AI builds on-device AI: no cloud, no data leaks, full user control.

License

Apache 2.0: free for commercial and personal use.

Citation

@misc{mimitechai2026mimi,
  title={MIMI Pro: On-Device AI Agent Model for Browser-Based Tool Calling},
  author={Bemler, Michael and Soppa, Michael},
  year={2026},
  publisher={Mimi Tech AI},
  url={https://huggingface.co/MimiTechAI/mimi-pro}
}