# Multistral Tokenizer

Training completed successfully!

## Configuration

- Vocabulary size: 127,989
- Special tokens: 13
- Min frequency: 32
- Training samples: up to 1,000,000

## Dataset

- Source: dataset/

## Special Tokens

`<|begin|>`, `<|return|>`, `<|pad|>`, `<|start|>`, `<|channel|>`, `<|end|>`, `<|message|>`, `<|image|>`, `<|video|>`, `<|audio|>`, `<|call|>`, `<|constrain|>`, `<|unknown|>`

## Enforced Vocabulary

analysis, assistant, commentary, developer, final, json, system, tool, toon, user, yaml

## Usage

```python
from multistral.multistraltokenizer import MultistralTokenizer

# Load the trained tokenizer from its local directory
tokenizer = MultistralTokenizer.from_pretrained("models/multistral-tokenizer")

# Encode text to tokens, then decode the tokens back to text
tokens = tokenizer.encode("Your text here")
text = tokenizer.decode(tokens)
```
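
Tokenizers commonly treat special tokens as atomic units rather than splitting them into subwords. This card does not say how `MultistralTokenizer` handles them, so the following is only a minimal sketch, under that assumption, of how input could be separated into special-token and ordinary-text pieces using the 13 tokens listed above. `split_on_special` is a hypothetical helper, not part of the `multistral` package, and the sample string is arbitrary; it does not imply any particular chat template.

```python
import re

# Special tokens copied from the list in this card.
SPECIAL_TOKENS = [
    "<|begin|>", "<|return|>", "<|pad|>", "<|start|>", "<|channel|>",
    "<|end|>", "<|message|>", "<|image|>", "<|video|>", "<|audio|>",
    "<|call|>", "<|constrain|>", "<|unknown|>",
]

# One alternation pattern; re.escape is required because "|" is a regex metacharacter.
# The capturing group makes re.split keep the matched tokens in its output.
_SPECIAL_RE = re.compile("(" + "|".join(re.escape(t) for t in SPECIAL_TOKENS) + ")")


def split_on_special(text):
    """Hypothetical helper: split text into (piece, is_special) pairs."""
    pieces = []
    for piece in _SPECIAL_RE.split(text):
        if piece:  # re.split yields empty strings around adjacent matches
            pieces.append((piece, piece in SPECIAL_TOKENS))
    return pieces


if __name__ == "__main__":
    # Arbitrary sample input, used only to show the splitting behaviour.
    for piece, is_special in split_on_special("<|start|>user<|message|>Hello<|end|>"):
        print(repr(piece), is_special)
```

A helper like this can be useful for checking that the number of special tokens matches the configured count, or for inspecting which parts of a prompt would be handled as ordinary text.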