NbAiLab/bifrost-translation-source-classifier



	---
	language: en
	license: apache-2.0
	library_name: transformers
	tags:
	- text-classification
	- translation-source
	- bifrost
	datasets:
	- HuggingFaceFW/finetranslations
	- HuggingFaceFW/fineweb
	base_model: jhu-clsp/mmBERT-base
	pipeline_tag: text-classification
	pretty_name: "Bifrost Translation-Source Classifier"
	model-index:
	- name: bifrost-translation-source-classifier
	  results:
	  - task:
	      type: text-classification
	      name: Translation Source Classification
	    metrics:
	    - type: accuracy
	      value: 63.0%
	      name: Test Accuracy
	    - type: loss
	      value: 1.4607
	      name: Test Loss
	---

	# Bifrost Translation-Source Classifier

	Predicts which language an English text was originally translated from.
	Given English text, the model detects cultural and stylistic traces of the
	original source language.
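
A minimal usage sketch, assuming the checkpoint loads through the standard `transformers` text-classification pipeline (as the `library_name` and `pipeline_tag` metadata suggest) and that the repository id is `NbAiLab/bifrost-translation-source-classifier`. The helper names `rank_sources` and `predict_sources` are illustrative and not part of the model or of Bifrost.

```python
from typing import List

# Assumed repository id on the Hugging Face Hub.
MODEL_ID = "NbAiLab/bifrost-translation-source-classifier"


def rank_sources(scores: List[dict], k: int = 3) -> List[dict]:
    """Return the k highest-scoring {"label", "score"} entries."""
    return sorted(scores, key=lambda s: s["score"], reverse=True)[:k]


def predict_sources(text: str, k: int = 3) -> List[dict]:
    """Score one English text against every source-language label."""
    # Imported lazily so rank_sources works without transformers installed.
    from transformers import pipeline

    clf = pipeline("text-classification", model=MODEL_ID)
    # top_k=None requests scores for all labels; truncation keeps the
    # input within the 512-token limit used during training.
    scores = clf(text, top_k=None, truncation=True, max_length=512)
    return rank_sources(scores, k)
```

Calling `predict_sources("Some English paragraph ...")` downloads the model on first use and returns the most likely source-language codes with their scores.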

	## Intended Use

	This classifier is part of the [Bifrost](https://github.com/NationalLibraryOfNorway/bifrost) pipeline.
	It identifies culturally relevant content for translation into target languages.

	## Training

	- Base model: [`jhu-clsp/mmBERT-base`](https://huggingface.co/jhu-clsp/mmBERT-base)
	- Frozen base: True (only the classification head is trained)
	- Training samples per language: 10,000
	- Validation samples per language: 1,000
	- Max sequence length: 512
	- Learning rate: 0.001
	- Epochs: 20 (with early stopping, patience 3)
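
The frozen base and the early-stopping rule listed above can be sketched as follows. This is an illustration under stated assumptions, not the actual Bifrost training code: `freeze_base` assumes a `transformers`-style model exposing `base_model`, and `early_stopping_epoch` assumes that "patience 3" means stopping after three consecutive epochs without a new best validation loss. Both function names are hypothetical.

```python
from typing import Iterable, Optional, Tuple


def freeze_base(model) -> int:
    """Disable gradients for the encoder; return the number of frozen tensors.

    Parameters outside model.base_model (the classification head)
    keep requires_grad=True and remain trainable.
    """
    frozen = 0
    for param in model.base_model.parameters():
        param.requires_grad = False
        frozen += 1
    return frozen


def early_stopping_epoch(
    val_losses: Iterable[float], patience: int = 3, max_epochs: int = 20
) -> Tuple[int, Optional[int]]:
    """Return (epochs_run, best_epoch) for per-epoch validation losses."""
    best_loss = float("inf")
    best_epoch = None
    stale = 0  # consecutive epochs without improvement
    epochs_run = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if epoch > max_epochs:
            break
        epochs_run = epoch
        if loss < best_loss:
            best_loss, best_epoch, stale = loss, epoch, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return epochs_run, best_epoch
```

With only the head trainable, a comparatively high learning rate such as the listed 0.001 is a common choice, since no pretrained encoder weights are being updated.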

	During training, a random 512-token window is sampled from each document,
	exposing the model to different parts of longer texts across epochs.
	Validation uses a deterministic window per document so that validation losses are comparable across epochs.
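
The windowing scheme can be sketched in plain Python. Two assumptions are made here: the deterministic validation window is taken to be the first 512 tokens (the card only says it is deterministic), and special tokens are ignored. `sample_window` is an illustrative name.

```python
import random
from typing import List, Optional, Sequence

MAX_LEN = 512  # max sequence length from the training configuration


def sample_window(
    token_ids: Sequence[int],
    max_len: int = MAX_LEN,
    rng: Optional[random.Random] = None,
) -> List[int]:
    """Pick a contiguous window of at most max_len tokens.

    With rng: random start offset (training).
    Without rng: start at offset 0 (assumed validation behaviour).
    """
    if len(token_ids) <= max_len:
        return list(token_ids)  # short documents are used whole
    last_start = len(token_ids) - max_len
    start = rng.randint(0, last_start) if rng is not None else 0
    return list(token_ids[start:start + max_len])
```

Passing a fresh or advancing `random.Random` each epoch gives a different slice of long documents per epoch, while omitting it always yields the same slice.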

	## Performance (held-out test set)

	- Test loss: 1.4607
	- Test accuracy: 63.0%
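
To put these numbers in context, the short calculation below compares them with uniform guessing over the 180 labels. It assumes the reported loss is a mean cross-entropy in nats (the PyTorch default) and that the test set is balanced across labels; neither is stated explicitly in this card.

```python
import math

NUM_CLASSES = 180
TEST_LOSS = 1.4607
TEST_ACCURACY = 0.630

# Accuracy of uniform guessing over balanced classes.
chance_accuracy = 1 / NUM_CLASSES
# Cross-entropy of a uniform prediction, in nats.
uniform_loss = math.log(NUM_CLASSES)
# Perplexity implied by the reported loss: the effective number of
# equally likely candidates the model is choosing among.
effective_choices = math.exp(TEST_LOSS)

print(f"chance accuracy:   {chance_accuracy:.2%}")
print(f"uniform loss:      {uniform_loss:.3f}")
print(f"effective choices: {effective_choices:.2f}")
```

Under these assumptions, chance accuracy is about 0.56% and the uniform loss about 5.19 nats, so the reported 63.0% and 1.4607 are well above the trivial baseline.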

	## Labels (180 classes)

	- `aeb`
	- `afr`
	- `als`
	- `amh`
	- `anp`
	- `apc`
	- `arb`
	- `arg`
	- `ars`
	- `ary`
	- `arz`
	- `asm`
	- `ast`
	- `azb`
	- `azj`
	- `bak`
	- `bar`
	- `bel`
	- `ben`
	- `bew`
	- `bho`
	- `bod`
	- `bos`
	- `bul`
	- `cat`
	- `ceb`
	- `ces`
	- `che`
	- `chv`
	- `ckb`
	- `cmn`
	- `cnh`
	- `cos`
	- `crh`
	- `cym`
	- `dan`
	- `deu`
	- `div`
	- `dzo`
	- `ekk`
	- `ell`
	- `eng`
	- `epo`
	- `eus`
	- `fao`
	- `fas`
	- `fij`
	- `fil`
	- `fin`
	- `fra`
	- `fry`
	- `fur`
	- `gaz`
	- `gla`
	- `gle`
	- `glg`
	- `glk`
	- `grc`
	- `gsw`
	- `guj`
	- `hac`
	- `hat`
	- `hau`
	- `haw`
	- `hbo`
	- `heb`
	- `hif`
	- `hil`
	- `hin`
	- `hne`
	- `hrv`
	- `hsb`
	- `hun`
	- `hye`
	- `hyw`
	- `iba`
	- `ibo`
	- `ilo`
	- `ind`
	- `isl`
	- `ita`
	- `jav`
	- `jpn`
	- `kal`
	- `kan`
	- `kat`
	- `kaz`
	- `kha`
	- `khk`
	- `khm`
	- `kin`
	- `kir`
	- `kiu`
	- `kmr`
	- `kor`
	- `lao`
	- `lat`
	- `lim`
	- `lin`
	- `lit`
	- `ltz`
	- `lug`
	- `lus`
	- `lvs`
	- `mai`
	- `mal`
	- `mar`
	- `mhr`
	- `mkd`
	- `mlt`
	- `mri`
	- `mww`
	- `mya`
	- `nap`
	- `nde`
	- `nds`
	- `new`
	- `nld`
	- `nno`
	- `nob`
	- `npi`
	- `nrm`
	- `nya`
	- `oci`
	- `ory`
	- `oss`
	- `pan`
	- `pap`
	- `pbt`
	- `plt`
	- `pnb`
	- `pol`
	- `por`
	- `roh`
	- `ron`
	- `rue`
	- `run`
	- `rus`
	- `sah`
	- `san`
	- `scn`
	- `sdh`
	- `sin`
	- `slk`
	- `slv`
	- `sme`
	- `smo`
	- `sna`
	- `snd`
	- `som`
	- `sot`
	- `spa`
	- `srd`
	- `srp`
	- `sun`
	- `swe`
	- `swh`
	- `tam`
	- `tat`
	- `tel`
	- `tgk`
	- `tha`
	- `tir`
	- `tuk`
	- `tur`
	- `tyv`
	- `udm`
	- `uig`
	- `ukr`
	- `urd`
	- `uzn`
	- `uzs`
	- `vie`
	- `xho`
	- `ydd`
	- `yor`
	- `yue`
	- `zea`
	- `zsm`
	- `zul`

	## Training Data

	Built from [HuggingFaceFW/finetranslations](https://huggingface.co/datasets/HuggingFaceFW/finetranslations)
	(translated texts) and [HuggingFaceFW/fineweb](https://huggingface.co/datasets/HuggingFaceFW/fineweb)
	(native English), with 10,000 training and 1,000 validation samples per language.
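
A per-language balanced split of this kind could be built roughly as follows. This is a sketch, not the actual data-preparation code: `balanced_split` is a hypothetical helper, the input is assumed to be `(text, label)` pairs already drawn from the two datasets, and raising an error for under-represented languages is a design choice made here for illustration.

```python
import random
from collections import defaultdict
from typing import Iterable, List, Tuple

Example = Tuple[str, str]  # (text, source-language label)


def balanced_split(
    examples: Iterable[Example],
    n_train: int = 10_000,
    n_val: int = 1_000,
    seed: int = 0,
) -> Tuple[List[Example], List[Example]]:
    """Draw disjoint, equally sized train/val samples for every label."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append(text)

    rng = random.Random(seed)
    train: List[Example] = []
    val: List[Example] = []
    for label in sorted(by_label):  # sorted for reproducibility
        texts = by_label[label]
        if len(texts) < n_train + n_val:
            raise ValueError(f"not enough samples for label {label!r}")
        rng.shuffle(texts)
        train += [(t, label) for t in texts[:n_train]]
        val += [(t, label) for t in texts[n_train:n_train + n_val]]
    return train, val
```

If every one of the 180 labels reaches its quota, this yields 1,800,000 training and 180,000 validation examples in total.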