ColBERT-AZ

A late-interaction retrieval model for Azerbaijani built on top of mmBERT-base-en-az. Trained via cross-encoder distillation from bge-reranker-v2-m3 on a mix of native Azerbaijani and translated retrieval data.

ColBERT-AZ uses late interaction (token-level MaxSim scoring) rather than single-vector dense retrieval, which yields higher retrieval precision than bi-encoder models of similar or larger size.
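As a toy illustration of MaxSim (the vectors here are made up, not the model's actual embeddings): each query token is matched to its best-scoring document token, and the per-token maxima are summed.

```python
# Toy MaxSim with plain Python lists standing in for token embeddings.
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def maxsim(query_tokens, doc_tokens):
    # For each query token, take the best-matching document token, then sum.
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)

q = [[1.0, 0.0], [0.0, 1.0]]               # 2 query token embeddings
d = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]]   # 3 document token embeddings
print(round(maxsim(q, d), 3))  # 1.7 (0.9 from the first query token, 0.8 from the second)
```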

Model Details

| Property | Value |
|---|---|
| Parameters | 165M |
| Embedding dim | 128 (per token) |
| Backbone | mmBERT-base-en-az (ModernBERT) |
| Architecture | Late interaction (ColBERT) |
| Query max length | 32 tokens |
| Document max length | 256 tokens |
| Languages | Azerbaijani, English |
| Training epochs | 1 |

Training

Data

ColBERT-AZ was trained on 3 million triplets sampled from a weighted mix of four reranked datasets. All datasets include relevance scores from bge-reranker-v2-m3, which serve as the teacher signal for knowledge distillation.

Recipe

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 1e-6 |
| Weight decay | 0.01 |
| Warmup ratio | 0.10 |
| Schedule | Cosine |
| Batch size | 16 (effective 32 via gradient accumulation) |
| Negatives per query (K) | 8 |
| False negative filter threshold | 0.9 × pos_score |
| Distillation alpha (KL weight) | 0.7 |
| Contrastive temperature | 0.05 |
| Teacher temperature | 1.0 |
| Mixed precision | BF16 |
| Epochs | 1 |
| Hardware | NVIDIA RTX 5090 (32GB) |
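The false negative filter drops any mined negative whose teacher score reaches 90% of the positive's teacher score, on the assumption that such passages are likely unlabeled positives. A minimal sketch (the helper name and scores are hypothetical):

```python
def filter_false_negatives(pos_score, neg_scores, threshold_ratio=0.9):
    """Drop negatives the teacher scores nearly as high as the positive.

    pos_score: teacher (bge-reranker-v2-m3) score for the positive passage.
    neg_scores: teacher scores for mined negative candidates.
    """
    cutoff = threshold_ratio * pos_score
    return [s for s in neg_scores if s < cutoff]

# Negatives scoring >= 0.9 * 0.95 = 0.855 are treated as likely false negatives.
print(filter_false_negatives(0.95, [0.92, 0.40, 0.88, 0.10]))  # [0.4, 0.1]
```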

Loss

Combined KL distillation + InfoNCE:

L = α × KL(softmax(student_scores) || softmax(teacher_scores)) + (1 − α) × InfoNCE

where α = 0.7 and student scores are computed via MaxSim over [pos, neg_1, ..., neg_K].
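A minimal plain-Python sketch of this objective, assuming the positive sits at index 0, the contrastive temperature (0.05) is applied to the student distribution, and the KL direction is exactly as written above; the actual training code may place temperatures differently:

```python
import math

def softmax(xs, temperature=1.0):
    exps = [math.exp(x / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def colbert_az_loss(student_scores, teacher_scores,
                    alpha=0.7, tau_student=0.05, tau_teacher=1.0):
    """Loss for one query's candidates [pos, neg_1, ..., neg_K] (positive at index 0)."""
    p_student = softmax(student_scores, tau_student)  # assumption: contrastive temperature here
    p_teacher = softmax(teacher_scores, tau_teacher)
    # KL(student || teacher), matching the formula as written above
    kl = sum(s * math.log(s / t) for s, t in zip(p_student, p_teacher))
    # InfoNCE: negative log-likelihood of the positive under the student distribution
    info_nce = -math.log(p_student[0])
    return alpha * kl + (1 - alpha) * info_nce
```

A student that ranks the positive first and matches the teacher incurs a lower loss than one that prefers a negative.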

Evaluation

Held-out validation

Evaluated on 4,500 held-out triplets (1,500 per native source). Each query is ranked among 1 positive and 8 hard negatives.

| Source | R@1 | R@3 | MRR | NDCG@10 |
|---|---|---|---|---|
| Books | 0.5387 | 0.7693 | 0.6821 | 0.7584 |
| Legislation | 0.6633 | 0.8433 | 0.7679 | 0.8234 |
| Retriever (general) | 0.8340 | 0.9327 | 0.8901 | 0.9167 |
| Macro average | 0.6787 | 0.8484 | 0.7800 | 0.8328 |
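With a single relevant passage per query, R@k and MRR reduce to simple functions of the positive's rank among the 9 candidates. A sketch with hypothetical ranks:

```python
def recall_at_k(rank, k):
    # rank: 1-based position of the single relevant passage in the ranked candidates
    return 1.0 if rank <= k else 0.0

def mean_reciprocal_rank(ranks):
    return sum(1.0 / r for r in ranks) / len(ranks)

# Hypothetical ranks of the positive (out of 9 candidates) for four queries
ranks = [1, 3, 1, 2]
r_at_1 = sum(recall_at_k(r, 1) for r in ranks) / len(ranks)  # 0.5
r_at_3 = sum(recall_at_k(r, 3) for r in ranks) / len(ranks)  # 1.0
mrr = mean_reciprocal_rank(ranks)                            # 17/24 ≈ 0.708
```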

AZ-MIRAGE benchmark

Evaluated on the AZ-MIRAGE retrieval benchmark (7,373 queries over a pool of 40,448 documents):

| Metric | Score |
|---|---|
| P@1 | 0.3058 |
| R@5 | 0.7518 |
| R@10 | 0.8054 |
| NDCG@5 | 0.5528 |
| NDCG@10 | 0.5704 |
| MRR@10 | 0.4930 |
| F1@10 | 0.1464 |

Comparison with bi-encoder models on AZ-MIRAGE:

| Model | Params | NDCG@10 | MRR@10 | P@1 |
|---|---|---|---|---|
| ColBERT-AZ (this model) | 165M | 0.5704 | 0.4930 | 0.3058 |
| BAAI/bge-m3 | 568M | 0.5079 | 0.4204 | 0.2310 |
| google/gemini-embedding-2-preview | API | 0.5309 | 0.4372 | 0.2338 |
| perplexity/pplx-embed-v1-4b | API | 0.5225 | 0.4361 | 0.2470 |
| microsoft/harrier-oss-v1-0.6b | 600M | 0.5168 | 0.4321 | 0.2535 |
| intfloat/multilingual-e5-large | 560M | 0.4875 | 0.4043 | 0.2264 |
| intfloat/multilingual-e5-base | 278M | 0.4672 | 0.3852 | 0.2116 |
| sentence-transformers/LaBSE | 471M | 0.2472 | 0.1944 | 0.0943 |

Usage

This repository contains:

  • config.json, model.safetensors, tokenizer.* — encoder backbone (mmBERT-base-en-az)
  • projection.pt — ColBERT linear projection layer (768 → 128, no bias)

ColBERT requires both the backbone and the projection layer for correct inference.

Loading the model

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

class ColBERT(nn.Module):
    def __init__(self, model_name: str, embedding_dim: int = 128):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(model_name)
        self.projection = nn.Linear(self.backbone.config.hidden_size, embedding_dim, bias=False)

    @torch.no_grad()
    def encode(self, input_ids, attention_mask, keep_mask=None):
        out = self.backbone(input_ids=input_ids, attention_mask=attention_mask, return_dict=True)
        emb = self.projection(out.last_hidden_state)
        emb = F.normalize(emb, p=2, dim=-1)
        eff_mask = attention_mask if keep_mask is None else attention_mask * keep_mask
        emb = emb * eff_mask.unsqueeze(-1).float()
        return emb, eff_mask

# Load
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("LocalDoc/colbert-az")

# Add ColBERT special tokens
tokenizer.add_special_tokens({"additional_special_tokens": ["[Q]", "[D]"]})

model = ColBERT("LocalDoc/colbert-az")
model.backbone.resize_token_embeddings(len(tokenizer))

# Load projection layer
from huggingface_hub import hf_hub_download
proj_path = hf_hub_download(repo_id="LocalDoc/colbert-az", filename="projection.pt")
model.projection.load_state_dict(torch.load(proj_path, map_location="cpu"))

model = model.to(device).eval()

Encoding queries and documents

# Tokenization helpers
def tokenize_query(text: str, max_len: int = 32):
    text = f"[Q] {text}"
    enc = tokenizer(text, padding="max_length", truncation=True,
                    max_length=max_len, return_tensors="pt")
    # ColBERT trick: replace pad with mask for query expansion
    pad_mask = enc["input_ids"] == tokenizer.pad_token_id
    enc["input_ids"][pad_mask] = tokenizer.mask_token_id
    enc["attention_mask"] = torch.ones_like(enc["input_ids"])
    return enc

def tokenize_doc(text: str, max_len: int = 256):
    text = f"[D] {text}"
    return tokenizer(text, padding=True, truncation=True,
                     max_length=max_len, return_tensors="pt")

# Compute MaxSim score between query and a single document
def maxsim_score(query: str, document: str) -> float:
    q_enc = {k: v.to(device) for k, v in tokenize_query(query).items()}
    d_enc = {k: v.to(device) for k, v in tokenize_doc(document).items()}

    q_emb, _ = model.encode(q_enc["input_ids"], q_enc["attention_mask"])
    d_emb, d_mask = model.encode(d_enc["input_ids"], d_enc["attention_mask"])

    # MaxSim: for each query token, take max similarity over doc tokens, then sum
    sim = torch.einsum("qld,bnd->qlbn", q_emb, d_emb)
    sim = sim.masked_fill(~d_mask.unsqueeze(0).unsqueeze(0).bool(), float("-inf"))
    max_per_token, _ = sim.max(dim=-1)
    score = max_per_token.sum(dim=1).item()
    return score

# Example
query = "Azərbaycan mədəniyyətinin tarixi"
doc = "Azərbaycan mədəniyyəti zəngin tarixə malikdir və qədim dövrlərdən başlayaraq inkişaf edib."
print(f"Score: {maxsim_score(query, doc):.4f}")

Recommended retrieval pipeline

For production retrieval, use ColBERT-AZ with a proper indexing library that supports late interaction:

  • PLAID — official ColBERT indexing
  • pylate — modern ColBERT framework

These libraries handle efficient indexing, scalable MaxSim retrieval, and quantization for production deployment.

Citation

@misc{colbert-az-2026,
  title  = {ColBERT-AZ: Late-Interaction Retrieval for Azerbaijani},
  author = {LocalDoc},
  year   = {2026},
  url    = {https://huggingface.co/LocalDoc/colbert-az}
}

License

Apache 2.0
