# NedoTurkishTokenizer

Self-contained Turkish morphological tokenizer.

Zero external dependencies — tokenizes Turkish text into morphologically meaningful units using a candidate-based segmentation engine, a bundled TDK dictionary, and 260+ suffix patterns.
This is a standalone tokenizer. It does not wrap `turkish-tokenizer`, `zemberek-python`, `requests`, or `transformers`. There are no hidden fallbacks or optional dependency paths. Install and use immediately.
## Installation

```bash
pip install .
```

No additional packages required. Everything is bundled.
## Quick Start

```python
from nedo_turkish_tokenizer import NedoTurkishTokenizer

tok = NedoTurkishTokenizer()
tokens = tok.tokenize("İstanbul'da toplantıya katılamadım")
for t in tokens:
    print(f"{t['token']:15s} {t['token_type']:10s} pos={t['morph_pos']}")
```
Output:

```
İstanbul        ROOT       pos=0
'               PUNCT      pos=0
da              SUFFIX     pos=1
toplantı        ROOT       pos=0
ya              SUFFIX     pos=1
katıl           ROOT       pos=0
a               SUFFIX     pos=1
ma              SUFFIX     pos=2
dım             SUFFIX     pos=3
```
## API

### NedoTurkishTokenizer()

```python
tok = NedoTurkishTokenizer()

# Single text
tokens = tok.tokenize("Merhaba dünya")

# Callable shorthand
tokens = tok("Merhaba dünya")

# Batch (parallel, uses multiprocessing)
results = tok.batch_tokenize(["text1", "text2", "text3"])

# Statistics
stats = tok.stats(tokens)
```
## Token Output Format

Each token is a dict with these guaranteed fields:

| Field | Type | Description |
|---|---|---|
| `token` | `str` | Clean token text — no leading/trailing whitespace. |
| `token_type` | `str` | One of the types below. |
| `morph_pos` | `int` | 0 = root/word-initial, 1+ = suffix position. |

Token text does not encode spacing. The `token` field contains only the clean surface form. Whether a token starts a new word is indicated by `morph_pos == 0`, not by whitespace in the string.
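Because spacing is not encoded in the token strings, rejoining tokens into surface words means grouping on `morph_pos`. A minimal sketch — the grouping rule below (start a new word at each non-`PUNCT` token with `morph_pos == 0`) is an assumption for illustration, not a function the library ships:

```python
def join_tokens(tokens):
    """Rebuild space-separated surface words from tokenizer output.

    Naive rule: a non-PUNCT token with morph_pos == 0 starts a new
    word; suffixes and punctuation are appended to the current word.
    """
    words, current = [], ""
    for t in tokens:
        if t["morph_pos"] == 0 and t["token_type"] != "PUNCT" and current:
            words.append(current)
            current = ""
        current += t["token"]
    if current:
        words.append(current)
    return " ".join(words)

# The token stream from the Quick Start example above
sample = [
    {"token": "İstanbul", "token_type": "ROOT",   "morph_pos": 0},
    {"token": "'",        "token_type": "PUNCT",  "morph_pos": 0},
    {"token": "da",       "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "toplantı", "token_type": "ROOT",   "morph_pos": 0},
    {"token": "ya",       "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "katıl",    "token_type": "ROOT",   "morph_pos": 0},
    {"token": "a",        "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "ma",       "token_type": "SUFFIX", "morph_pos": 2},
    {"token": "dım",      "token_type": "SUFFIX", "morph_pos": 3},
]
print(join_tokens(sample))  # İstanbul'da toplantıya katılamadım
```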
## Token Types

| Type | Description | Example |
|---|---|---|
| `ROOT` | Turkish word root | `ev`, `gel`, `kitap` |
| `SUFFIX` | Morphological suffix | `de`, `ler`, `yor` |
| `FOREIGN` | Non-Turkish word | `meeting`, `cloud` |
| `PUNCT` | Punctuation | `.`, `,`, `'` |
| `NUM` | Number | `42`, `%85`, `3.14` |
| `DATE` | Date | `14.03.2026` |
| `UNIT` | Unit | `kg`, `km`, `TL` |
| `URL` | URL | `https://...` |
| `MENTION` | Social mention | `@user` |
| `HASHTAG` | Hashtag | `#topic` |
| `EMOJI` | Emoji | 😀, `:)` |
| `ACRONYM` | Acronym | `NATO`, `TBMM` |
## Optional Metadata Fields

Tokens may include `_`-prefixed metadata fields:

| Field | Type | Description |
|---|---|---|
| `_suffix_label` | `str` | Morphological label (e.g. `-LOC`, `-PL`, `-PST`) |
| `_canonical` | `str` | Canonical morpheme (e.g. `LOC`, `PL`, `PAST`) |
| `_caps` | `bool` | Word was originally ALL CAPS |
| `_foreign` | `bool` | Word detected as foreign |
| `_acronym` | `bool` | Token is an acronym |
| `_expansion` | `str` | Acronym expansion (e.g. NATO → Kuzey Atlantik...) |
| `_compound` | `bool` | Root is a compound word |
| `_parts` | `list[str]` | Compound decomposition |
| `_apo_suffix` | `bool` | Suffix follows an apostrophe |
| `_domain` | `bool` | Root from domain vocabulary |
## Architecture

The tokenizer uses a candidate-based segmentation pipeline:

```
Text → Normalize → Special Spans → Word Split → Per-Word Segmentation → Annotate → Strip
                                                          │
                                                 Generate Candidates
                                                          │
                                                 Score & Select Best
```
For each word, the engine:
- Generates 2–5 segmentation candidates (whole ROOT, suffix chains, foreign)
- Scores each candidate deterministically (TDK validation, root length, suffix recognition)
- Selects the highest-scoring segmentation
- Strips internal whitespace markers from the output
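The candidate step can be sketched as recursive suffix stripping against the bundled suffix table. This is a simplified illustration under assumptions — the function name, the tiny suffix set, and the `min_root` cutoff are invented here; the real engine's matching (apostrophes, vowel harmony, foreign candidates) is richer:

```python
def generate_candidates(word, suffixes, min_root=2):
    """Enumerate (root, suffix_chain) segmentations of a word.

    Recursively peels recognised suffixes off the right edge,
    never leaving a root shorter than min_root. The whole word
    as an unsplit ROOT is always a candidate.
    """
    found = [(word, [])]

    def peel(root, chain):
        for s in suffixes:
            if root.endswith(s) and len(root) - len(s) >= min_root:
                peel(root[: -len(s)], [s] + chain)
        if chain:
            found.append((root, chain))

    peel(word, [])
    return found

# Tiny illustrative suffix set (the library bundles 260+ patterns)
cands = generate_candidates("evlerde", {"ler", "de", "ya"})
print(cands)  # includes ('evlerde', []), ('evler', ['de']), ('ev', ['ler', 'de'])
```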
## Scoring Rules
| Factor | Score |
|---|---|
| Root in TDK dictionary | +10 |
| Whole word in TDK (unsplit) | +5 bonus |
| Root in domain vocabulary | +8 |
| Root length | +2 per character |
| Each recognised suffix | +2 |
| Short root penalty (≤2 chars) | −4 |
| Foreign root (fallback) | +3 base |
| Unknown root | +1 base |
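The table above can be replayed as a small scoring function. This is a sketch: the weights come from the table, but the toy dictionary, the suffix set, and the function itself are assumptions for illustration (the foreign-root fallback is omitted):

```python
TDK = {"ev", "gel", "kitap"}      # toy stand-in for tdk_words.txt
DOMAIN = set()                    # toy stand-in for the domain vocabulary
SUFFIXES = {"ler", "de", "ya"}    # toy stand-in for the suffix table

def score(word, root, chain):
    """Deterministic candidate score per the rules table."""
    s = 0
    if root in TDK:
        s += 10                               # root in TDK dictionary
        if not chain and word in TDK:
            s += 5                            # whole-word-unsplit bonus
    elif root in DOMAIN:
        s += 8                                # root in domain vocabulary
    else:
        s += 1                                # unknown root base
    s += 2 * len(root)                        # root length
    s += 2 * sum(1 for x in chain if x in SUFFIXES)  # recognised suffixes
    if len(root) <= 2:
        s -= 4                                # short root penalty
    return s

# "ev" + "ler" beats keeping "evler" whole, because "ev" is in TDK:
print(score("evler", "ev", ["ler"]))   # 10 + 4 - 4 + 2 = 12
print(score("evler", "evler", []))     # 1 + 10 = 11
```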
## Known-Intact Words

A curated set of common Turkish words (inflected forms of *demek*, *yemek*, and discourse particles) bypass candidate generation entirely and are always kept whole. This prevents false splits like `dedi` → `de` + `di`, where the root `de` is a valid TDK conjunction.
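The bypass amounts to a set lookup before candidate generation. A sketch — the set name, its contents here, and the `None` fall-through convention are illustrative assumptions, not the package's internals:

```python
# Illustrative subset of the curated known-intact list
KNOWN_INTACT = {"dedi", "dedim", "yedi", "ama", "yani"}

def segment(word):
    """Return a single-ROOT segmentation for known-intact words;
    return None to signal that candidate generation should run."""
    if word.lower() in KNOWN_INTACT:
        return [{"token": word, "token_type": "ROOT", "morph_pos": 0}]
    return None  # caller falls through to the candidate engine

print(segment("dedi"))   # kept whole, never split into de + di
print(segment("evler"))  # None — goes through candidate scoring
```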
## Bundled Resources

| Resource | Size | Purpose |
|---|---|---|
| `tdk_words.txt` | ~746 KB | TDK dictionary (64K+ lemmas + derived verb stems) |
| `turkish_proper_nouns.txt` | ~1 KB | Proper nouns (cities, regions, names) |
| Suffix table | 260+ entries | Turkish suffix patterns with morphological labels |
| Acronym table | 80+ entries | Acronym → Turkish expansion mappings |
| Domain vocabulary | 200+ entries | Medical, sports, tourism terms |
## Known Limitations

- Not a full morphological analyzer. This is a heuristic segmenter, not a Zemberek/TRMorph replacement. Morphological labels may be incorrect for ambiguous suffixes (e.g. `-ACC` vs `-GEN` for "in").
- No disambiguation. The tokenizer does not use sentence-level context to resolve ambiguous words (e.g. "gelir" = income vs. aorist).
- Verb stem derivation is simple. Only `-mak`/`-mek` infinitive stripping (for stems ≥3 characters) is used; vowel harmony alternations in stems are not modelled.
- TDK dictionary coverage. The bundled TDK list has ~64K entries. Words absent from TDK default to ROOT (unknown) or FOREIGN.
- Not backward-compatible with v1.x. The old wrapper relied on `turkish-tokenizer` BPE and `zemberek-python`. Output format and token boundaries will differ.
## Running Tests

```bash
pip install -e ".[dev]"
pytest tests/ -v
```

## License

MIT