How to use NbAiLab/bifrost-translation-source-classifier with Transformers:
    # Use a pipeline as a high-level helper
    from transformers import pipeline

    pipe = pipeline("text-classification", model="NbAiLab/bifrost-translation-source-classifier")

    # Load model directly
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    tokenizer = AutoTokenizer.from_pretrained("NbAiLab/bifrost-translation-source-classifier")
    model = AutoModelForSequenceClassification.from_pretrained("NbAiLab/bifrost-translation-source-classifier")
metadata
language: en
license: apache-2.0
library_name: transformers
tags:
- text-classification
- translation-source
- bifrost
datasets:
- HuggingFaceFW/finetranslations
- HuggingFaceFW/fineweb
base_model: jhu-clsp/mmBERT-base
pipeline_tag: text-classification
pretty_name: Bifrost Translation-Source Classifier
model-index:
- name: bifrost-translation-source-classifier
  results:
  - task:
      type: text-classification
      name: Translation Source Classification
    metrics:
    - type: accuracy
      value: 63.0%
      name: Test Accuracy
    - type: loss
      value: 1.4607
      name: Test Loss
Bifrost Translation-Source Classifier
Predicts which language an English text was originally translated from, or whether it was written natively in English. Given English text, the model detects cultural and stylistic traces of the original source language.
Intended Use
This classifier is part of the Bifrost pipeline. It identifies culturally relevant content for translation into target languages.
Training
- Base model: jhu-clsp/mmBERT-base
- Frozen base: True (only classification head trained)
- Training samples per language: 10,000
- Validation samples per language: 1,000
- Max sequence length: 512
- Learning rate: 0.001
- Epochs: 20 (with early stopping, patience 3)
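The early-stopping rule in the list above (at most 20 epochs, patience 3) can be sketched in plain Python. This is a minimal illustration, not the training script used for this model: the function name `run_early_stopping` and the example loss values are hypothetical, and it assumes that "patience 3" means stopping after three consecutive epochs without a new best validation loss.

```python
def run_early_stopping(val_losses, max_epochs=20, patience=3):
    """Replay per-epoch validation losses and report where training would stop.

    Returns (best_epoch, best_loss, epochs_run), with epochs counted from 1.
    Assumption: an epoch "improves" only if its loss is strictly below the best so far.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    epochs_run = 0

    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        epochs_run = epoch
        if loss < best_loss:
            # New best checkpoint: reset the patience counter.
            best_loss = loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # patience exhausted, stop training early

    return best_epoch, best_loss, epochs_run


# Hypothetical validation losses, invented for illustration only.
example_losses = [2.0, 1.8, 1.7, 1.75, 1.72, 1.71, 1.6]
best_epoch, best_loss, epochs_run = run_early_stopping(example_losses)
print(best_epoch, best_loss, epochs_run)
```

With these made-up losses, the best epoch is the third and training stops after the sixth, so the later improvement at epoch seven is never reached. That is the usual trade-off of a small patience value.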
During training, a random 512-token window is sampled from each document, exposing the model to different parts of longer texts across epochs. Validation uses a deterministic window per document for comparable losses.
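The windowing scheme described above can be sketched as follows. The helper names `sample_window` and `deterministic_window` are hypothetical, and two details are assumptions rather than facts from this card: the validation window is taken to be the first 512 tokens, and special tokens added by the tokenizer are ignored.

```python
import random

MAX_LEN = 512  # max sequence length from the training configuration


def sample_window(token_ids, max_len=MAX_LEN, rng=None):
    """Training: return a random contiguous window of at most max_len tokens."""
    if len(token_ids) <= max_len:
        return list(token_ids)  # short documents are used whole
    if rng is None:
        rng = random
    # Pick a start offset so that the window fits entirely inside the document.
    start = rng.randrange(len(token_ids) - max_len + 1)
    return list(token_ids[start:start + max_len])


def deterministic_window(token_ids, max_len=MAX_LEN):
    """Validation: always return the same window for a given document.

    Assumption: the fixed window is the document prefix.
    """
    return list(token_ids[:max_len])


# A stand-in "document" of 2,000 token ids.
doc = list(range(2000))
train_window = sample_window(doc, rng=random.Random(0))
val_window = deterministic_window(doc)
print(len(train_window), len(val_window))
```

Because the training window is resampled every epoch, a long document contributes different text each time it is seen, while the fixed validation window keeps losses comparable from one epoch to the next.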
Performance (held-out test set)
- Test loss: 1.4607
- Test accuracy: 63.0%
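To put these numbers in context, it helps to compare them with a classifier that knows nothing. The short calculation below assumes the reported loss is mean cross-entropy in nats (the usual convention, but not stated in this card) and uses the 180 classes listed under Labels.

```python
import math

NUM_CLASSES = 180       # number of labels listed in this card
TEST_LOSS = 1.4607      # reported test loss
TEST_ACCURACY = 0.630   # reported test accuracy

# Guessing uniformly at random is right 1 time in NUM_CLASSES on average.
chance_accuracy = 1 / NUM_CLASSES

# Predicting a uniform distribution over all classes gives cross-entropy ln(NUM_CLASSES).
uniform_loss = math.log(NUM_CLASSES)

print(f"chance accuracy: {chance_accuracy:.2%}")
print(f"uniform-prediction loss: {uniform_loss:.3f} nats")
print(f"accuracy relative to chance: {TEST_ACCURACY / chance_accuracy:.0f}x")
```

Under these assumptions, chance accuracy is roughly 0.56% and the uniform-prediction loss is roughly 5.19 nats, so the reported 63.0% accuracy and 1.4607 loss are far better than chance.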
Labels (180 classes)
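aeb, afr, als, amh, anp, apc, arb, arg, ars, ary, arz, asm, ast, azb, azj, bak, bar, bel, ben, bew, bho, bod, bos, bul, cat, ceb, ces, che, chv, ckb, cmn, cnh, cos, crh, cym, dan, deu, div, dzo, ekk, ell, eng, epo, eus, fao, fas, fij, fil, fin, fra, fry, fur, gaz, gla, gle, glg, glk, grc, gsw, guj, hac, hat, hau, haw, hbo, heb, hif, hil, hin, hne, hrv, hsb, hun, hye, hyw, iba, ibo, ilo, ind, isl, ita, jav, jpn, kal, kan, kat, kaz, kha, khk, khm, kin, kir, kiu, kmr, kor, lao, lat, lim, lin, lit, ltz, lug, lus, lvs, mai, mal, mar, mhr, mkd, mlt, mri, mww, mya, nap, nde, nds, new, nld, nno, nob, npi, nrm, nya, oci, ory, oss, pan, pap, pbt, plt, pnb, pol, por, roh, ron, rue, run, rus, sah, san, scn, sdh, sin, slk, slv, sme, smo, sna, snd, som, sot, spa, srd, srp, sun, swe, swh, tam, tat, tel, tgk, tha, tir, tuk, tur, tyv, udm, uig, ukr, urd, uzn, uzs, vie, xho, ydd, yor, yue, zea, zsm, zul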
Training Data
Built from HuggingFaceFW/finetranslations (translated texts) and HuggingFaceFW/fineweb (native English). Each language contributes 10,000 training samples and 1,000 validation samples.
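A balanced per-language split of this kind could be built roughly as shown below. This is a sketch under stated assumptions, not the actual data-preparation code: the function `split_per_language`, the `docs_by_lang` mapping, and the fixed seed are all hypothetical, and how documents are loaded from the two datasets is left out.

```python
import random

TRAIN_PER_LANG = 10_000  # training samples per language, from this card
VAL_PER_LANG = 1_000     # validation samples per language, from this card


def split_per_language(docs_by_lang, n_train=TRAIN_PER_LANG, n_val=VAL_PER_LANG, seed=0):
    """Draw disjoint, equally sized train/val samples for every language.

    docs_by_lang maps a language code (e.g. "nob", "eng") to a list of English texts.
    Returns two lists of (text, language_code) pairs.
    """
    rng = random.Random(seed)  # fixed seed (an assumption) for a reproducible split
    train, val = [], []
    for lang, docs in sorted(docs_by_lang.items()):
        needed = n_train + n_val
        if len(docs) < needed:
            raise ValueError(f"{lang}: need {needed} documents, found {len(docs)}")
        # Sampling without replacement keeps train and val disjoint.
        picked = rng.sample(docs, needed)
        train.extend((text, lang) for text in picked[:n_train])
        val.extend((text, lang) for text in picked[n_train:])
    return train, val


# Tiny toy corpus with hypothetical contents, using small sizes for illustration.
toy = {
    "eng": [f"native text {i}" for i in range(20)],
    "nob": [f"translated text {i}" for i in range(20)],
}
train, val = split_per_language(toy, n_train=10, n_val=2)
print(len(train), len(val))
```

Giving every language the same number of samples keeps the classes balanced, so the classifier cannot score well simply by favouring high-resource languages.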