# Multistral Tokenizer
Training completed successfully!
## Configuration
- Vocabulary size: 127,989
- Special tokens: 13
- Min frequency: 32
- Training samples: up to 1,000,000
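
As a rough illustration of what a minimum-frequency setting means, the sketch below drops candidates seen fewer than a threshold number of times. It is not the trainer used for this tokenizer: the helper `filter_by_min_frequency` and the toy corpus are hypothetical, and an actual trainer may apply the threshold to merge pairs rather than whole words.

```python
from collections import Counter


def filter_by_min_frequency(candidates, min_frequency):
    """Keep candidates that occur at least min_frequency times (hypothetical helper)."""
    counts = Counter(candidates)
    return {item: n for item, n in counts.items() if n >= min_frequency}


# Toy corpus. The configuration above uses 32; a threshold of 2 keeps the example short.
words = "the cat sat on the mat the cat".split()
kept = filter_by_min_frequency(words, min_frequency=2)
print(kept)
```

A higher threshold generally trades coverage of rare strings for a vocabulary with fewer one-off entries.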
## Dataset
- Source: dataset/
## Special Tokens
<|begin|>, <|return|>, <|pad|>, <|start|>, <|channel|>, <|end|>, <|message|>, <|image|>, <|video|>, <|audio|>, <|call|>, <|constrain|>, <|unknown|>
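
Special tokens are usually kept whole rather than broken into smaller pieces. The following is a minimal sketch of that idea using only the standard library; how this tokenizer actually handles them internally is not stated here. The helper `split_on_special_tokens` is hypothetical, and the sample string is purely illustrative, not a documented prompt format.

```python
import re

# Special tokens as listed in this README.
SPECIAL_TOKENS = [
    "<|begin|>", "<|return|>", "<|pad|>", "<|start|>", "<|channel|>",
    "<|end|>", "<|message|>", "<|image|>", "<|video|>", "<|audio|>",
    "<|call|>", "<|constrain|>", "<|unknown|>",
]

# Escape each token so "|" is matched literally. The capturing group makes
# re.split keep the matched tokens in its output.
_SPECIAL_RE = re.compile("(" + "|".join(re.escape(t) for t in SPECIAL_TOKENS) + ")")


def split_on_special_tokens(text):
    """Split text into special tokens and the plain-text pieces between them."""
    return [piece for piece in _SPECIAL_RE.split(text) if piece]


# Illustrative input only; the real message layout is an assumption.
pieces = split_on_special_tokens("<|start|>user<|message|>Hello<|end|>")
print(pieces)
```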
## Enforced Vocabulary
analysis, assistant, commentary, developer, final, json, system, tool, toon, user, yaml
## Usage
```python
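# The tokenizer class ships with this repository's `multistral` package.
from multistral.multistraltokenizer import MultistralTokenizer
# Load the trained tokenizer from its local directory.
tokenizer = MultistralTokenizer.from_pretrained("models/multistral-tokenizer")
# Encode text into tokens, then decode the tokens back into text.
tokens = tokenizer.encode("Your text here")
text = tokenizer.decode(tokens)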
```
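
A quick sanity check after loading a tokenizer is to confirm that encoding and then decoding returns the original text. The sketch below assumes only the `encode`/`decode` pair shown in the usage snippet. The helper `find_round_trip_failures` is hypothetical, and `ByteStandIn` is a stand-in so the example runs on its own; it is not the Multistral tokenizer.

```python
def find_round_trip_failures(tokenizer, samples):
    """Return the samples for which decode(encode(s)) differs from s."""
    return [s for s in samples if tokenizer.decode(tokenizer.encode(s)) != s]


class ByteStandIn:
    """Stand-in with the same encode/decode shape; NOT the Multistral tokenizer."""

    def encode(self, text):
        # One token per UTF-8 byte.
        return list(text.encode("utf-8"))

    def decode(self, tokens):
        return bytes(tokens).decode("utf-8")


failures = find_round_trip_failures(ByteStandIn(), ["Your text here", "naïve café"])
print(failures)
```

With the real tokenizer, pass the loaded `tokenizer` object in place of `ByteStandIn()`. A tokenizer that normalizes its input may legitimately fail an exact round trip, so a non-empty result is a prompt to inspect the samples, not necessarily a bug.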