Dataset: angeluriot/chess_games
A Byte Pair Encoding (BPE) tokenizer trained on chess moves in a custom notation format (e.g. `w.♘g1♘f3..`).

The tokenizer is trained on a custom chess move notation:
| Component | Description | Example |
|---|---|---|
| Player prefix | `w.` (white) or `b.` (black) | `w.` |
| Piece + Source | Unicode piece + square | `♘g1` |
| Piece + Destination | Unicode piece + square | `♘f3` |
| Flags | `.x.` (capture), `..+` (check), `..#` (checkmate) | `..` |

| Move | Meaning |
|---|---|
| `w.♘g1♘f3..` | White knight from g1 to f3 |
| `b.♟c7♟c5..` | Black pawn from c7 to c5 |
| `b.♟c5♟d4.x.` | Black pawn captures on d4 |
| `w.♔e1♔g1♖h1♖f1..` | White kingside castle |
| `b.♛d7♛d5..+` | Black queen to d5 with check |

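
The examples above suggest a regular structure that can be parsed mechanically. The following is a minimal sketch, not part of this repository: `parse_move` is a hypothetical helper, and the grammar it uses (player prefix, one or more piece+square pairs, then `.`, an optional `x`, `.`, and an optional `+` or `#`) is inferred from the tables and may not cover every move type, such as promotions.

```python
import re

# Unicode chess symbols occupy U+2654 (white king) through U+265F (black pawn).
PIECES = "\u2654-\u265F"

# Assumed grammar, inferred from the documented examples:
#   <w|b> "." (<piece><square>)+ "." [x] "." [+|#]
MOVE_RE = re.compile(
    r"^(?P<player>[wb])\."
    rf"(?P<body>(?:[{PIECES}][a-h][1-8])+)"
    r"\.(?P<capture>x?)\.(?P<check>[+#]?)$"
)
STEP_RE = re.compile(rf"([{PIECES}])([a-h][1-8])")


def parse_move(move: str) -> dict:
    """Split a move string into player, (piece, square) steps, and flags."""
    m = MOVE_RE.match(move)
    if m is None:
        raise ValueError(f"not a recognised move: {move!r}")
    return {
        "player": "white" if m["player"] == "w" else "black",
        "steps": STEP_RE.findall(m["body"]),  # list of (piece, square) tuples
        "capture": m["capture"] == "x",
        "check": m["check"] == "+",
        "checkmate": m["check"] == "#",
    }


if __name__ == "__main__":
    # Black pawn captures on d4 (b.♟c5♟d4.x.)
    print(parse_move("b.\u265Fc5\u265Fd4.x."))
```

A castling move simply yields four steps (king source, king destination, rook source, rook destination) under this reading.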
| White | Black | Piece |
|---|---|---|
| ♔ | ♚ | King |
| ♕ | ♛ | Queen |
| ♖ | ♜ | Rook |
| ♗ | ♝ | Bishop |
| ♘ | ♞ | Knight |
| ♙ | ♟ | Pawn |

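
To make the notation concrete, here is a small sketch that builds a move string from its parts. `SYMBOLS` and `format_move` are hypothetical names introduced for illustration. The sketch assumes the standard Unicode code points for the twelve chess glyphs and that the flag suffix is `.`, an optional `x`, `.`, and an optional `+` or `#`, which is consistent with the documented flags but is not confirmed by the source.

```python
# Map (player, piece name) to the Unicode glyph used in the notation.
# Assumption: the standard Unicode chess code points are used.
SYMBOLS = {
    ("w", "king"): "\u2654", ("b", "king"): "\u265A",
    ("w", "queen"): "\u2655", ("b", "queen"): "\u265B",
    ("w", "rook"): "\u2656", ("b", "rook"): "\u265C",
    ("w", "bishop"): "\u2657", ("b", "bishop"): "\u265D",
    ("w", "knight"): "\u2658", ("b", "knight"): "\u265E",
    ("w", "pawn"): "\u2659", ("b", "pawn"): "\u265F",
}


def format_move(player, piece, src, dst, capture=False, check=False, checkmate=False):
    """Build '<player>.<piece><src><piece><dst><flags>' for a single-piece move."""
    symbol = SYMBOLS[(player, piece)]
    # Assumed flag layout: "." [x] "." [+|#]
    suffix = "." + ("x" if capture else "") + "."
    suffix += "#" if checkmate else "+" if check else ""
    return f"{player}.{symbol}{src}{symbol}{dst}{suffix}"


if __name__ == "__main__":
    print(format_move("w", "knight", "g1", "f3"))
    print(format_move("b", "pawn", "c5", "d4", capture=True))
```

Castling involves two pieces in one move string, so it would need a separate code path that this sketch does not attempt.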
Install the dependencies:

    pip install rustbpe huggingface_hub

Download and inspect the tokenizer files:

    import json
    from huggingface_hub import hf_hub_download

    # Download tokenizer files (replace YOUR_USERNAME with the repository owner)
    vocab_path = hf_hub_download(repo_id="YOUR_USERNAME/chess-bpe-tokenizer", filename="vocab.json")
    config_path = hf_hub_download(repo_id="YOUR_USERNAME/chess-bpe-tokenizer", filename="tokenizer_config.json")

    # Load vocabulary and configuration; the files contain Unicode chess
    # symbols, so read them explicitly as UTF-8
    with open(vocab_path, "r", encoding="utf-8") as f:
        vocab = json.load(f)
    with open(config_path, "r", encoding="utf-8") as f:
        config = json.load(f)

    print(f"Vocab size: {len(vocab)}")
    print(f"Pattern: {config['pattern']}")

    import rustbpe
    # Note: the rustbpe tokenizer needs to be retrained or loaded from merges.
    # See the training script for details.

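
For readers unfamiliar with what "loaded from merges" involves, the sketch below shows how a learned merge table is applied to encode text in a generic byte-level BPE. It is an illustration only: `apply_merges` is a hypothetical helper, the merge-table format (a dict from an ID pair to a new ID, in learned order) is an assumption and not the format of this repository's files, and the regex pre-splitting step that a real tokenizer performs with its pattern is skipped.

```python
def apply_merges(token_ids, merges):
    """Greedily apply BPE merges to a sequence of token IDs.

    merges: dict mapping (left_id, right_id) -> new_id, inserted in the
    order the merges were learned (earlier entries have higher priority).
    """
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    ids = list(token_ids)
    while len(ids) >= 2:
        # Pick the adjacent pair that was learned earliest.
        pairs = set(zip(ids, ids[1:]))
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no applicable merge remains
        new_id = merges[best]
        merged, i = [], 0
        while i < len(ids):
            if i + 1 < len(ids) and (ids[i], ids[i + 1]) == best:
                merged.append(new_id)
                i += 2
            else:
                merged.append(ids[i])
                i += 1
        ids = merged
    return ids


if __name__ == "__main__":
    # Toy merge table: "w" + "." -> 256, then 256 + 256 -> 257.
    toy_merges = {(119, 46): 256, (256, 256): 257}
    print(apply_merges(list("w.w.".encode("utf-8")), toy_merges))
```

The input is the raw UTF-8 byte sequence, so each Unicode chess symbol starts out as three byte-level tokens before any merges are applied.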
Train the tokenizer and push it to the Hub:

    from bpess.main import train_chess_tokenizer, push_to_hub

    # Train
    tokenizer = train_chess_tokenizer(
        vocab_size=4096,
        dataset_fraction="train",
        moves_key="moves_custom",
    )

    # Push to Hugging Face
    push_to_hub(
        tokenizer=tokenizer,
        repo_id="your-username/chess-bpe-tokenizer",
        config={
            "vocab_size": 4096,
            "dataset_fraction": "train",
            "moves_key": "moves_custom",
        },
    )

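
As background on what BPE training does, the following toy sketch learns merges from a small corpus of move strings. It is not the implementation behind `train_chess_tokenizer` or rustbpe: `train_bpe` and `merge_pair` are hypothetical names, and the sketch assumes a byte-level base vocabulary of 256 tokens with no regex pre-splitting and no special tokens.

```python
from collections import Counter


def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `seq` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from an iterable of strings."""
    seqs = [list(text.encode("utf-8")) for text in corpus]
    merges = {}
    next_id = 256  # IDs 0..255 are reserved for raw bytes
    for _ in range(num_merges):
        counts = Counter()
        for seq in seqs:
            counts.update(zip(seq, seq[1:]))
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break  # nothing repeats, so further merges would not compress
        merges[pair] = next_id
        seqs = [merge_pair(seq, pair, next_id) for seq in seqs]
        next_id += 1
    return merges


if __name__ == "__main__":
    toy_corpus = ["w.\u2658g1\u2658f3..", "b.\u265Fc7\u265Fc5..", "w.\u2658b1\u2658c3.."]
    print(train_bpe(toy_corpus, num_merges=5))
```

Under these assumptions, a vocabulary of 4096 tokens would correspond to 4096 - 256 = 3840 learned merges.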
This tokenizer is designed for:
MIT License