---
language: en
license: apache-2.0
library_name: transformers
tags:
- text-classification
- translation-source
- bifrost
datasets:
- HuggingFaceFW/finetranslations
- HuggingFaceFW/fineweb
base_model: jhu-clsp/mmBERT-base
pipeline_tag: text-classification
pretty_name: "Bifrost Translation-Source Classifier"
model-index:
- name: bifrost-translation-source-classifier
  results:
  - task:
      type: text-classification
      name: Translation Source Classification
    metrics:
    - type: accuracy
      value: 63.0%
      name: Test Accuracy
    - type: loss
      value: 1.4607
      name: Test Loss
---

# Bifrost Translation-Source Classifier

Predicts which language an English text was originally translated from.
Given English text, the model detects cultural and stylistic traces of the
original source language.

How to use NbAiLab/bifrost-translation-source-classifier with Transformers:

```python
# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-classification", model="NbAiLab/bifrost-translation-source-classifier")

# Load model directly
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("NbAiLab/bifrost-translation-source-classifier")
model = AutoModelForSequenceClassification.from_pretrained("NbAiLab/bifrost-translation-source-classifier")
```
## Intended Use

This classifier is part of the [Bifrost](https://github.com/NationalLibraryOfNorway/bifrost) pipeline,
where it identifies culturally relevant content for translation into target languages.
## Training

- **Base model**: [`jhu-clsp/mmBERT-base`](https://huggingface.co/jhu-clsp/mmBERT-base)
- **Frozen base**: True (only classification head trained)
- **Training samples per language**: 10,000
- **Validation samples per language**: 1,000
- **Max sequence length**: 512
- **Learning rate**: 0.001
- **Epochs**: 20 (with early stopping, patience 3)
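
The epoch budget and patience listed above can be read as the usual early-stopping rule. The following is a minimal sketch, not the Bifrost training code: it assumes the monitored quantity is validation loss and that only a strictly lower loss counts as an improvement. The function name `epochs_run` is hypothetical.

```python
def epochs_run(val_losses, max_epochs=20, patience=3):
    """Return how many epochs run before training ends.

    val_losses: validation loss per epoch, in order (illustrative input).
    Assumption: training stops after `patience` consecutive epochs
    without a new best validation loss, or after `max_epochs`.
    """
    best = float("inf")
    stale = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return min(len(val_losses), max_epochs)


# The best loss is reached in epoch 2; three stale epochs follow.
print(epochs_run([1.0, 0.9, 0.95, 0.96, 0.97, 0.5]))  # 5
```

Under this rule the later loss of 0.5 is never seen, because training has already stopped at epoch 5.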

During training, a random 512-token window is sampled from each document,
exposing the model to different parts of longer texts across epochs.
Validation uses a deterministic window per document so that losses are comparable across epochs.
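
The two windowing modes can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the card only says the validation window is deterministic per document, so deriving the offset from a CRC32 of a document id is an assumption, and the names `train_window`, `val_window`, and `doc_id` are hypothetical.

```python
import random
import zlib

MAX_LEN = 512  # max sequence length from the training settings


def train_window(token_ids, rng, max_len=MAX_LEN):
    """Pick a random contiguous window of at most max_len tokens."""
    if len(token_ids) <= max_len:
        return list(token_ids)
    start = rng.randrange(len(token_ids) - max_len + 1)
    return list(token_ids[start:start + max_len])


def val_window(token_ids, doc_id, max_len=MAX_LEN):
    """Pick the same window for a given document on every call.

    Assumption: the offset comes from a stable hash of the document id.
    """
    if len(token_ids) <= max_len:
        return list(token_ids)
    start = zlib.crc32(doc_id.encode("utf-8")) % (len(token_ids) - max_len + 1)
    return list(token_ids[start:start + max_len])


doc = list(range(2000))  # stand-in for a tokenized document
rng = random.Random(0)
print(len(train_window(doc, rng)))                               # 512
print(val_window(doc, "doc-1") == val_window(doc, "doc-1"))      # True
```

Documents shorter than the window are returned unchanged; padding is left to the tokenizer or collator.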
## Performance (held-out test set)

- **Test loss**: 1.4607
- **Test accuracy**: 63.0%
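
For scale, these numbers can be compared with a uniform guess over the 180 labels. The short calculation below assumes that the test set is balanced across classes and that the reported loss is cross-entropy with natural logarithms; the card states neither explicitly.

```python
import math

NUM_CLASSES = 180      # number of labels listed in this card
TEST_ACCURACY = 0.630  # reported test accuracy
TEST_LOSS = 1.4607     # reported test loss

# Assumption: every class is equally frequent in the test set.
chance_accuracy = 1 / NUM_CLASSES
# Cross-entropy of a uniform prediction over all classes (natural log).
uniform_loss = math.log(NUM_CLASSES)

print(f"chance accuracy:     {chance_accuracy:.2%}")                    # 0.56%
print(f"uniform-guess loss:  {uniform_loss:.3f}")                       # 5.193
print(f"accuracy vs. chance: {TEST_ACCURACY / chance_accuracy:.1f}x")   # 113.4x
```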
## Labels (180 classes)

- `aeb`
- `afr`
- `als`
- `amh`
- `anp`
- `apc`
- `arb`
- `arg`
- `ars`
- `ary`
- `arz`
- `asm`
- `ast`
- `azb`
- `azj`
- `bak`
- `bar`
- `bel`
- `ben`
- `bew`
- `bho`
- `bod`
- `bos`
- `bul`
- `cat`
- `ceb`
- `ces`
- `che`
- `chv`
- `ckb`
- `cmn`
- `cnh`
- `cos`
- `crh`
- `cym`
- `dan`
- `deu`
- `div`
- `dzo`
- `ekk`
- `ell`
- `eng`
- `epo`
- `eus`
- `fao`
- `fas`
- `fij`
- `fil`
- `fin`
- `fra`
- `fry`
- `fur`
- `gaz`
- `gla`
- `gle`
- `glg`
- `glk`
- `grc`
- `gsw`
- `guj`
- `hac`
- `hat`
- `hau`
- `haw`
- `hbo`
- `heb`
- `hif`
- `hil`
- `hin`
- `hne`
- `hrv`
- `hsb`
- `hun`
- `hye`
- `hyw`
- `iba`
- `ibo`
- `ilo`
- `ind`
- `isl`
- `ita`
- `jav`
- `jpn`
- `kal`
- `kan`
- `kat`
- `kaz`
- `kha`
- `khk`
- `khm`
- `kin`
- `kir`
- `kiu`
- `kmr`
- `kor`
- `lao`
- `lat`
- `lim`
- `lin`
- `lit`
- `ltz`
- `lug`
- `lus`
- `lvs`
- `mai`
- `mal`
- `mar`
- `mhr`
- `mkd`
- `mlt`
- `mri`
- `mww`
- `mya`
- `nap`
- `nde`
- `nds`
- `new`
- `nld`
- `nno`
- `nob`
- `npi`
- `nrm`
- `nya`
- `oci`
- `ory`
- `oss`
- `pan`
- `pap`
- `pbt`
- `plt`
- `pnb`
- `pol`
- `por`
- `roh`
- `ron`
- `rue`
- `run`
- `rus`
- `sah`
- `san`
- `scn`
- `sdh`
- `sin`
- `slk`
- `slv`
- `sme`
- `smo`
- `sna`
- `snd`
- `som`
- `sot`
- `spa`
- `srd`
- `srp`
- `sun`
- `swe`
- `swh`
- `tam`
- `tat`
- `tel`
- `tgk`
- `tha`
- `tir`
- `tuk`
- `tur`
- `tyv`
- `udm`
- `uig`
- `ukr`
- `urd`
- `uzn`
- `uzs`
- `vie`
- `xho`
- `ydd`
- `yor`
- `yue`
- `zea`
- `zsm`
- `zul`
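
When the model is loaded directly rather than through a pipeline, its raw outputs have to be mapped back to these label codes. The sketch below assumes the classification head emits one logit per label and that a softmax over those logits gives the class probabilities. The helper `top_k_labels` and the three-label toy mapping are hypothetical; a real call would pass all 180 logits together with the model's own `config.id2label` mapping.

```python
import math


def top_k_labels(logits, id2label, k=3):
    """Return the k most probable (label, probability) pairs."""
    # Numerically stable softmax: subtract the max logit before exponentiating.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return [(id2label[i], probs[i]) for i in ranked[:k]]


# Toy example using three of the labels above (illustrative logits).
toy_id2label = {0: "deu", 1: "eng", 2: "nob"}
for label, prob in top_k_labels([0.2, 1.5, 3.0], toy_id2label, k=2):
    print(f"{label}: {prob:.3f}")
```

Looking at several top candidates rather than only the best one can help with closely related source languages, where the probability mass may be split across neighbouring labels.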
## Training Data

Built from [HuggingFaceFW/finetranslations](https://huggingface.co/datasets/HuggingFaceFW/finetranslations)
(translated texts) and [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb)
(native English). Each language contributes 10,000 training and 1,000 validation samples.
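
The per-language quotas imply the following overall split sizes. This is a back-of-the-envelope calculation that assumes every one of the 180 labels, including `eng`, contributes its full quota; the size of the held-out test set is not stated in this card and is left out.

```python
NUM_LABELS = 180             # classes listed under "Labels"
TRAIN_PER_LANGUAGE = 10_000  # training samples per language
VAL_PER_LANGUAGE = 1_000     # validation samples per language

# Assumption: no label falls short of its quota.
total_train = NUM_LABELS * TRAIN_PER_LANGUAGE
total_val = NUM_LABELS * VAL_PER_LANGUAGE

print(f"training samples:   {total_train:,}")  # 1,800,000
print(f"validation samples: {total_val:,}")    # 180,000
```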