damfle
/

multistral-tokenizer

Model card Files Files and versions

multistral-tokenizer / README.md

damfle

mod: improve compression and remove garbage

a2385fe unverified 3 months ago

|

history blame contribute delete

733 Bytes

	# Multistral Tokenizer

	Training completed successfully!

	## Configuration
	- Vocabulary size: 127,989
	- Special tokens: 13
	- Min frequency: 32
	- Training samples: up to 1,000,000

	## Dataset
	- Source: dataset/

	## Special Tokens
	<\|begin\|>, <\|return\|>, <\|pad\|>, <\|start\|>, <\|channel\|>, <\|end\|>, <\|message\|>, <\|image\|>, <\|video\|>, <\|audio\|>, <\|call\|>, <\|constrain\|>, <\|unknown\|>

	## Enforced Vocabulary
	analysis, assistant, commentary, developer, final, json, system, tool, toon, user, yaml

	## Usage

	```python
	from multistral.multistraltokenizer import MultistralTokenizer

	tokenizer = MultistralTokenizer.from_pretrained("models/multistral-tokenizer")
	tokens = tokenizer.encode("Your text here")
	text = tokenizer.decode(tokens)
	```