# voyage-4-nano-ONNX

ONNX export of `voyageai/voyage-4-nano`, optimized for CPU inference with sentence-transformers and Vespa.
> **Note:** This is an ONNX-only model. Ignore the auto-generated snippet above and use the code below with `backend="onnx"`.
## Model Variants

| Variant | File | Size | Notes |
|---|---|---|---|
| FP32 | `onnx/model.onnx` + `model.onnx.data` | ~1.3GB | Highest accuracy, external data file |
| INT8 | `onnx/model_qint8_avx512.onnx` | 332MB | **Recommended** - 4x smaller, cosine sim ~0.96 vs FP32 |
## Specifications

- Embedding dimension: 2048
- Max sequence length: 32K tokens
- Pooling: mean pooling
- Normalization: L2-normalized embeddings
- Model inputs: `input_ids`, `attention_mask` (no `position_ids` required)
## Quick Start with sentence-transformers

```python
from sentence_transformers import SentenceTransformer

# Load the ONNX model
model = SentenceTransformer(
    "thomasht86/voyage-4-nano-ONNX",
    backend="onnx",
    model_kwargs={"file_name": "onnx/model_qint8_avx512.onnx"},
)

# For documents (no prefix needed)
docs = ["The quick brown fox jumps over the lazy dog."]
doc_embeddings = model.encode(docs)

# For queries (use the prompt name)
queries = ["What animal jumps over the dog?"]
query_embeddings = model.encode(queries, prompt_name="query")

print(f"Embedding shape: {doc_embeddings.shape}")  # (1, 2048)
```
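Because the embeddings are L2-normalized, query-document relevance reduces to a plain dot product. A minimal ranking sketch, using random unit vectors as stand-ins for the `(n, 2048)` arrays that `model.encode` would return:

```python
import numpy as np

# Stand-ins for model.encode output: random unit vectors (hypothetical data)
rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(3, 2048))
doc_embeddings /= np.linalg.norm(doc_embeddings, axis=1, keepdims=True)
query_embedding = rng.normal(size=(1, 2048))
query_embedding /= np.linalg.norm(query_embedding, axis=1, keepdims=True)

# For unit vectors, dot product == cosine similarity
scores = query_embedding @ doc_embeddings.T  # shape: (1, 3)
ranking = np.argsort(-scores[0])             # best match first
print(ranking, scores[0][ranking])
```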
## ONNX Runtime Direct Usage

```python
import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("thomasht86/voyage-4-nano-ONNX")
session = ort.InferenceSession("onnx/model_qint8_avx512.onnx")

# Tokenize
text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True)

# Run inference (only input_ids and attention_mask are needed)
outputs = session.run(
    None,
    {
        "input_ids": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"],
    },
)

# outputs[0] is token embeddings, outputs[1] is the sentence embedding (pooled + normalized)
token_embeddings = outputs[0]    # shape: (batch, seq_len, 2048)
sentence_embedding = outputs[1]  # shape: (batch, 2048)
print(f"Sentence embedding shape: {sentence_embedding.shape}")
```
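The second output can be reproduced from the first: a masked mean over the token embeddings followed by L2 normalization. A sketch with synthetic data standing in for the model outputs:

```python
import numpy as np

# Synthetic stand-ins for the model's token embeddings and attention mask
token_embeddings = np.random.rand(1, 5, 2048).astype(np.float32)  # (batch, seq_len, dim)
attention_mask = np.array([[1, 1, 1, 0, 0]], dtype=np.float32)    # 3 real tokens, 2 padding

# Masked mean pooling: average only over non-padding tokens
mask = attention_mask[:, :, None]               # (batch, seq_len, 1)
summed = (token_embeddings * mask).sum(axis=1)  # (batch, 2048)
pooled = summed / mask.sum(axis=1)              # divide by real-token count

# L2 normalization
sentence_embedding = pooled / np.linalg.norm(pooled, axis=1, keepdims=True)
print(np.linalg.norm(sentence_embedding, axis=1))  # ~1.0
```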
## Query Prefix Convention

For retrieval tasks, prepend queries with the instruction prefix:

```python
query_prefix = "Represent the query for retrieving supporting documents: "
query = query_prefix + "What is the capital of France?"
```

Documents should be encoded without any prefix.
## Vespa Configuration

Use this model with Vespa's `hugging-face-embedder`:

### services.xml

```xml
<component id="voyage" type="hugging-face-embedder">
  <transformer-model url="https://huggingface.co/thomasht86/voyage-4-nano-ONNX/resolve/main/onnx/model_qint8_avx512.onnx"/>
  <tokenizer-model url="https://huggingface.co/thomasht86/voyage-4-nano-ONNX/raw/main/tokenizer.json"/>
  <pooling-strategy>mean</pooling-strategy>
  <normalize>true</normalize>
  <prepend>
    <query>Represent the query for retrieving supporting documents: </query>
  </prepend>
</component>
```
### Schema

```
schema doc {
    document doc {
        field text type string {
            indexing: summary | index
        }
    }
    field embedding type tensor<float>(x[2048]) {
        indexing: input text | embed voyage | attribute | index
        attribute {
            distance-metric: angular
        }
        index {
            hnsw {
                max-links-per-node: 16
                neighbors-to-explore-at-insert: 200
            }
        }
    }
}
```
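A retrieval query against this schema could use Vespa's `nearestNeighbor` operator, letting the embedder component encode the query text (the `prepend` configuration applies the query prefix automatically). A hypothetical request body - it assumes a rank-profile named `semantic` ranking by `closeness(field, embedding)`, which is not defined in the schema above:

```json
{
  "yql": "select * from doc where {targetHits: 10}nearestNeighbor(embedding, q)",
  "input.query(q)": "embed(voyage, @text)",
  "text": "What is the capital of France?",
  "ranking": "semantic"
}
```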
## Benchmarks

Cosine similarity between FP32 and INT8 embeddings on sample texts:

| Text Type | Cosine Similarity |
|---|---|
| Short sentences | ~0.96 |
| Long paragraphs | ~0.96 |

The INT8 quantized model maintains high fidelity to the FP32 model while being 4x smaller.
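Fidelity numbers like these can be reproduced by encoding the same text with both variants and comparing the results. A minimal cosine-similarity helper, shown here on stand-in vectors rather than actual FP32/INT8 outputs:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins: an "FP32" vector and a slightly perturbed "INT8" vector
fp32_vec = np.random.default_rng(1).normal(size=2048)
int8_vec = fp32_vec + 0.05 * np.random.default_rng(2).normal(size=2048)

print(f"cosine similarity: {cosine_similarity(fp32_vec, int8_vec):.4f}")
```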
## License

Apache 2.0 - see the original voyage-4-nano model card for details.
## Acknowledgments

- Original model by Voyage AI
- ONNX export using sentence-transformers and optimum