voyage-4-nano-ONNX

ONNX export of voyageai/voyage-4-nano optimized for CPU inference with sentence-transformers and Vespa.

Note: This is an ONNX-only model. Ignore the auto-generated snippet above; use the code below with backend="onnx".

Model Variants

| Variant | File | Size | Notes |
|---------|------|------|-------|
| FP32 | onnx/model.onnx + model.onnx.data | ~1.3GB | Highest accuracy, external data file |
| INT8 | onnx/model_qint8_avx512.onnx | 332MB | Recommended; 4x smaller, cosine sim ~0.96 vs FP32 |

Specifications

  • Embedding dimension: 2048
  • Max sequence length: 32K tokens
  • Pooling: Mean pooling
  • Normalization: L2 normalized embeddings
  • Model inputs: input_ids, attention_mask (no position_ids required)

Quick Start with sentence-transformers

from sentence_transformers import SentenceTransformer

# Load the ONNX model
model = SentenceTransformer(
    "thomasht86/voyage-4-nano-ONNX",
    backend="onnx",
    model_kwargs={"file_name": "onnx/model_qint8_avx512.onnx"},
)

# For documents (no prefix needed)
docs = ["The quick brown fox jumps over the lazy dog."]
doc_embeddings = model.encode(docs)

# For queries (use the prompt name)
queries = ["What animal jumps over the dog?"]
query_embeddings = model.encode(queries, prompt_name="query")

print(f"Embedding shape: {doc_embeddings.shape}")  # (1, 2048)
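Because the embeddings are L2-normalized, cosine similarity between queries and documents reduces to a plain dot product. A minimal sketch with toy unit vectors (`cosine_scores` is a hypothetical helper, not part of sentence-transformers; in practice you would pass the `query_embeddings` and `doc_embeddings` from `model.encode` above):

```python
import numpy as np

def cosine_scores(query_embs, doc_embs):
    # For L2-normalized rows, the dot product equals cosine similarity.
    return np.asarray(query_embs) @ np.asarray(doc_embs).T

# Toy unit vectors standing in for real embeddings
q = np.array([[1.0, 0.0]])
d = np.array([[1.0, 0.0], [0.0, 1.0]])
print(cosine_scores(q, d))  # [[1. 0.]]
```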

ONNX Runtime Direct Usage

import onnxruntime as ort
import numpy as np
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

# Load the tokenizer and download the quantized ONNX file from the Hub
tokenizer = AutoTokenizer.from_pretrained("thomasht86/voyage-4-nano-ONNX")
model_path = hf_hub_download("thomasht86/voyage-4-nano-ONNX", "onnx/model_qint8_avx512.onnx")
session = ort.InferenceSession(model_path)

# Tokenize
text = "The quick brown fox jumps over the lazy dog."
inputs = tokenizer(text, return_tensors="np", padding=True, truncation=True)

# Run inference (only input_ids and attention_mask needed)
outputs = session.run(
    None,
    {
        "input_ids": inputs["input_ids"],
        "attention_mask": inputs["attention_mask"],
    },
)

# outputs[0] is token embeddings, outputs[1] is sentence embedding (pooled + normalized)
token_embeddings = outputs[0]  # Shape: (batch, seq_len, 2048)
sentence_embedding = outputs[1]  # Shape: (batch, 2048)

print(f"Sentence embedding shape: {sentence_embedding.shape}")
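The pooled output should match what you get by mean-pooling the token embeddings under the attention mask and L2-normalizing the result. A sketch of that computation in plain numpy, using toy values rather than real model output (`mean_pool_normalize` is a hypothetical helper):

```python
import numpy as np

def mean_pool_normalize(token_embs, attention_mask):
    """Masked mean pooling over the sequence axis, then L2 normalization."""
    mask = attention_mask[..., None].astype(np.float32)      # (batch, seq, 1)
    summed = (token_embs * mask).sum(axis=1)                 # (batch, dim)
    pooled = summed / np.clip(mask.sum(axis=1), 1e-9, None)  # masked mean
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

# Toy batch: two tokens, the second masked out as padding
tok = np.array([[[3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 0]])
print(mean_pool_normalize(tok, mask))  # [[0.6 0.8]]
```

Comparing this against `outputs[1]` for real inputs is a quick sanity check that the exported graph applies the pooling and normalization described in the specifications.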

Query Prefix Convention

For retrieval tasks, prepend queries with the instruction prefix:

query_prefix = "Represent the query for retrieving supporting documents: "
query = query_prefix + "What is the capital of France?"

Documents should be encoded without any prefix.
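With the raw ONNX Runtime path, the prefix has to be applied manually before tokenizing (sentence-transformers does this for you via `prompt_name="query"`). A minimal sketch; `prepare_texts` is a hypothetical helper:

```python
QUERY_PREFIX = "Represent the query for retrieving supporting documents: "

def prepare_texts(texts, is_query):
    # Prefix queries only; documents are encoded verbatim.
    return [QUERY_PREFIX + t if is_query else t for t in texts]

print(prepare_texts(["What is the capital of France?"], is_query=True)[0])
# Represent the query for retrieving supporting documents: What is the capital of France?
```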

Vespa Configuration

Use this model with Vespa's hugging-face-embedder:

services.xml

<component id="voyage" type="hugging-face-embedder">
    <transformer-model url="https://huggingface.co/thomasht86/voyage-4-nano-ONNX/resolve/main/onnx/model_qint8_avx512.onnx"/>
    <tokenizer-model url="https://huggingface.co/thomasht86/voyage-4-nano-ONNX/raw/main/tokenizer.json"/>
    <pooling-strategy>mean</pooling-strategy>
    <normalize>true</normalize>
    <prepend>
        <query>Represent the query for retrieving supporting documents: </query>
    </prepend>
</component>

Schema

schema doc {
    document doc {
        field text type string {
            indexing: summary | index
        }
    }

    field embedding type tensor<float>(x[2048]) {
        indexing: input text | embed voyage | attribute | index
        attribute {
            distance-metric: angular
        }
        index {
            hnsw {
                max-links-per-node: 16
                neighbors-to-explore-at-insert: 200
            }
        }
    }
}
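To run a nearest-neighbor search against this field, the schema also needs a rank-profile declaring the query tensor. A sketch under assumed names (the `semantic` profile and `q` tensor are not part of the card):

```
rank-profile semantic {
    inputs {
        query(q) tensor<float>(x[2048])
    }
    first-phase {
        expression: closeness(field, embedding)
    }
}
```

At query time, the text can then be embedded server-side with the `voyage` embedder configured above, e.g. `input.query(q)=embed(voyage, @query_text)` together with a `nearestNeighbor(embedding, q)` YQL clause.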

Benchmarks

Cosine similarity between FP32 and INT8 embeddings on sample texts:

| Text Type | Cosine Similarity |
|-----------|-------------------|
| Short sentences | ~0.96 |
| Long paragraphs | ~0.96 |

The INT8 quantized model maintains high fidelity to the FP32 model while being 4x smaller.
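The fidelity numbers above can be reproduced by encoding the same texts with both variants and comparing matched rows. The helper below does the comparison given two embedding matrices (`pairwise_cosine` is a hypothetical name, not a library function; in practice `a` and `b` would come from `model.encode` on the FP32 and INT8 models):

```python
import numpy as np

def pairwise_cosine(a, b):
    """Cosine similarity between matched rows of two embedding matrices."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a * b).sum(axis=1)

# e.g. pairwise_cosine(fp32_model.encode(texts), int8_model.encode(texts))
# Identical inputs give similarity 1.0:
e = np.array([[1.0, 2.0], [3.0, 4.0]])
print(pairwise_cosine(e, e))  # [1. 1.]
```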

License

Apache 2.0. See the original voyage-4-nano model card for details.
