Add `"add_prefix_space": true,`; this allows for much stronger token-level performance (e.g. NER, ColBERT)

#10

by tomaarsen HF Staff - opened Jan 9, 2025

base: refs/heads/main ← from: refs/pr/10


tomaarsen

Answer.AI org Jan 9, 2025

Hello!

Pull Request overview

Add `"add_prefix_space": true,` to the tokenizer config
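As a rough illustration of the change, the sketch below adds the key to a tokenizer config held in memory. The starting JSON is a hypothetical one-key excerpt (the real config has more entries), and nothing is written to disk.

```python
import json

# Hypothetical excerpt of a tokenizer config; the real file contains more keys.
original = '{"tokenizer_class": "PreTrainedTokenizerFast"}'

config = json.loads(original)
# The single setting this PR adds.
config["add_prefix_space"] = True

# Serialize back to JSON; Python's True becomes JSON's true.
updated = json.dumps(config, indent=2)
print(updated)
```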

Details

This allows for much stronger token-level performance (e.g. NER, ColBERT). Without it, the tokenizer does not prepend a space to each token, whereas our model was trained on data in which each token is preceded by a space.
We will need to explain somewhere in the model card that users can set `add_prefix_space` to `False`.
cc @bclavie @bwarner @NohTow: could one of you take care of that?
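To make the mismatch concrete, here is a toy sketch in plain Python. It is not the real tokenizer: it only shows the string each pre-split word would present to the tokenizer, assuming a byte-level BPE style vocabulary in which a leading space is part of the token and is conventionally rendered as `Ġ`. The helper name `surface_forms` and the example words are made up for illustration.

```python
# Toy illustration only; this does not call the real tokenizer.
SPACE_MARKER = "Ġ"  # common byte-level BPE rendering of a leading space


def surface_forms(words, add_prefix_space):
    """Return the string each pre-split word presents to the tokenizer."""
    prefix = " " if add_prefix_space else ""
    return [(prefix + word).replace(" ", SPACE_MARKER) for word in words]


# Hypothetical pre-split input, as is typical for NER datasets.
words = ["Paris", "is", "nice"]

without_prefix = surface_forms(words, add_prefix_space=False)
with_prefix = surface_forms(words, add_prefix_space=True)

print(without_prefix)  # ['Paris', 'is', 'nice']
print(with_prefix)     # ['ĠParis', 'Ġis', 'Ġnice']
```

Under these assumptions, only the second form matches the space-prefixed text the model is described as having been trained on.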

P.S. Feel free to hold off on merging for now; this PR can also be used to run some tests first (with `revision="refs/pr/..."`).

Note that you need a version of transformers that includes https://github.com/huggingface/transformers/pull/35593.
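A minimal sketch of testing against this PR before it is merged, assuming the tokenizer is loaded through `AutoTokenizer.from_pretrained` with its `revision` argument. The helper names are hypothetical, the repository id is a placeholder to fill in, and the PR number 10 is taken from this PR's `refs/pr/10` ref.

```python
def pr_revision(pr_number: int) -> str:
    """Build the Hub git ref for a pull request, e.g. refs/pr/10."""
    return f"refs/pr/{pr_number}"


def load_tokenizer_from_pr(model_id: str, pr_number: int):
    """Load the tokenizer as it exists on the given PR branch."""
    # Imported lazily so pr_revision works without transformers installed.
    from transformers import AutoTokenizer

    return AutoTokenizer.from_pretrained(model_id, revision=pr_revision(pr_number))


revision = pr_revision(10)
print(revision)

# Requires network access and a suitable transformers version:
# tokenizer = load_tokenizer_from_pr("<org>/<model>", 10)  # placeholder repo id
```

Once the PR is merged, the `revision` argument can be dropped so that the default branch is used.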

Tom Aarsen

Add `"add_prefix_space": true,`; this allows for much stronger token-level performance (e.g. NER, ColBERT) 8ae8af35

bclavie changed pull request status to merged Jan 11, 2025
