
NedoTurkishTokenizer

Self-contained Turkish morphological tokenizer.
Zero external dependencies — tokenizes Turkish text into morphologically meaningful units using a candidate-based segmentation engine, a bundled TDK dictionary, and 260+ suffix patterns.

This is a standalone tokenizer. It does not wrap turkish-tokenizer, zemberek-python, requests, or transformers. There are no hidden fallbacks or optional dependency paths. Install and use immediately.

Installation

pip install .

No additional packages required. Everything is bundled.

Quick Start

from nedo_turkish_tokenizer import NedoTurkishTokenizer

tok = NedoTurkishTokenizer()

tokens = tok.tokenize("İstanbul'da toplantıya katılamadım")
for t in tokens:
    print(f"{t['token']:15s}  {t['token_type']:10s}  pos={t['morph_pos']}")

Output:

İstanbul        ROOT        pos=0
'               PUNCT       pos=0
da              SUFFIX      pos=1
toplantı        ROOT        pos=0
ya              SUFFIX      pos=1
katıl           ROOT        pos=0
a               SUFFIX      pos=1
ma              SUFFIX      pos=2
dım             SUFFIX      pos=3

API

NedoTurkishTokenizer()

tok = NedoTurkishTokenizer()

# Single text
tokens = tok.tokenize("Merhaba dünya")

# Callable shorthand
tokens = tok("Merhaba dünya")

# Batch (parallel, uses multiprocessing)
results = tok.batch_tokenize(["text1", "text2", "text3"])

# Statistics
stats = tok.stats(tokens)

Token Output Format

Each token is a dict with these guaranteed fields:

| Field | Type | Description |
|---|---|---|
| token | str | Clean token text (no leading/trailing whitespace). |
| token_type | str | One of the types below. |
| morph_pos | int | 0 = root/word-initial, 1+ = suffix position. |

Token text does not encode spacing. The token field contains only the clean surface form. Whether a token starts a new word is indicated by morph_pos == 0, not by whitespace in the string.
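
Because spacing is carried by morph_pos rather than by the token strings, the original surface text can be rebuilt from a token list. A minimal sketch under the rule implied by the Quick Start output (a token starts a new word when morph_pos == 0, except punctuation, which attaches to the preceding token); detokenize is a hypothetical helper, not part of the package API:

```python
def detokenize(tokens):
    """Rejoin token dicts into surface text (illustrative sketch).

    Heuristic: a token starts a new word when morph_pos == 0, except
    punctuation, which attaches to the preceding token. Opening quotes
    and similar cases are not handled.
    """
    parts = []
    for t in tokens:
        new_word = t["morph_pos"] == 0 and t["token_type"] != "PUNCT"
        if parts and new_word:
            parts.append(" ")
        parts.append(t["token"])
    return "".join(parts)

# The token list from the Quick Start example, written out by hand:
tokens = [
    {"token": "İstanbul", "token_type": "ROOT", "morph_pos": 0},
    {"token": "'", "token_type": "PUNCT", "morph_pos": 0},
    {"token": "da", "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "toplantı", "token_type": "ROOT", "morph_pos": 0},
    {"token": "ya", "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "katıl", "token_type": "ROOT", "morph_pos": 0},
    {"token": "a", "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "ma", "token_type": "SUFFIX", "morph_pos": 2},
    {"token": "dım", "token_type": "SUFFIX", "morph_pos": 3},
]
print(detokenize(tokens))  # İstanbul'da toplantıya katılamadım
```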

Token Types

| Type | Description | Example |
|---|---|---|
| ROOT | Turkish word root | ev, gel, kitap |
| SUFFIX | Morphological suffix | de, ler, yor |
| FOREIGN | Non-Turkish word | meeting, cloud |
| PUNCT | Punctuation | . , ' |
| NUM | Number | 42, %85, 3.14 |
| DATE | Date | 14.03.2026 |
| UNIT | Unit | kg, km, TL |
| URL | URL | https://... |
| MENTION | Social mention | @user |
| HASHTAG | Hashtag | #topic |
| EMOJI | Emoji | 😀, :) |
| ACRONYM | Acronym | NATO, TBMM |

Optional Metadata Fields

Tokens may include _-prefixed metadata fields:

| Field | Type | Description |
|---|---|---|
| _suffix_label | str | Morphological label (e.g. -LOC, -PL, -PST) |
| _canonical | str | Canonical morpheme (e.g. LOC, PL, PAST) |
| _caps | bool | Word was originally ALL CAPS |
| _foreign | bool | Word detected as foreign |
| _acronym | bool | Token is an acronym |
| _expansion | str | Acronym expansion (e.g. NATO → Kuzey Atlantik...) |
| _compound | bool | Root is a compound word |
| _parts | list[str] | Compound decomposition |
| _apo_suffix | bool | Suffix follows an apostrophe |
| _domain | bool | Root from domain vocabulary |
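
Since these fields are optional, consumers should read them with dict.get() rather than indexing. A small sketch with a hypothetical describe helper (not part of the package API):

```python
def describe(token):
    """Render one token with its optional metadata, if present.

    Underscore-prefixed fields may be absent, so dict.get() is used
    instead of token["..."] indexing.
    """
    desc = f"{token['token']} [{token['token_type']}]"
    label = token.get("_suffix_label")
    if label:
        desc += f" {label}"
    if token.get("_foreign"):
        desc += " (foreign)"
    return desc

print(describe({"token": "da", "token_type": "SUFFIX", "morph_pos": 1,
                "_suffix_label": "-LOC"}))       # da [SUFFIX] -LOC
print(describe({"token": "cloud", "token_type": "FOREIGN", "morph_pos": 0,
                "_foreign": True}))              # cloud [FOREIGN] (foreign)
```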

Architecture

The tokenizer uses a candidate-based segmentation pipeline:

Text → Normalize → Special Spans → Word Split → Per-Word Segmentation → Annotate → Strip
                                                      │
                                              Generate Candidates
                                              Score & Select Best

For each word, the engine:

  1. Generates 2–5 segmentation candidates (whole ROOT, suffix chains, foreign)
  2. Scores each candidate deterministically (TDK validation, root length, suffix recognition)
  3. Selects the highest-scoring segmentation
  4. Strips internal whitespace markers from the output
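
Step 1 can be sketched with a toy suffix set. This is illustrative only: the real engine uses the bundled 260+ pattern table and explores richer suffix-chain alternatives than the greedy longest-match split shown here.

```python
def split_suffix_chain(tail, suffixes):
    """Greedily split tail into known suffixes, longest match first."""
    chain = []
    while tail:
        for s in sorted(suffixes, key=len, reverse=True):
            if tail.startswith(s):
                chain.append(s)
                tail = tail[len(s):]
                break
        else:
            return None  # tail cannot be covered by known suffixes
    return chain

def candidates(word, suffixes):
    """Generate segmentation candidates for one word (toy sketch)."""
    out = [[word]]  # candidate 1: keep the whole word as a ROOT
    for i in range(2, len(word)):          # root must be at least 2 chars
        root, tail = word[:i], word[i:]
        chain = split_suffix_chain(tail, suffixes)
        if chain:
            out.append([root] + chain)
    return out

print(candidates("evlerde", {"ler", "de", "e"}))
# [['evlerde'], ['ev', 'ler', 'de'], ['evler', 'de'], ['evlerd', 'e']]
```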

Scoring Rules

| Factor | Score |
|---|---|
| Root in TDK dictionary | +10 |
| Whole word in TDK (unsplit) | +5 bonus |
| Root in domain vocabulary | +8 |
| Root length | +2 per character |
| Each recognised suffix | +2 |
| Short root penalty (≤2 chars) | −4 |
| Foreign root (fallback) | +3 base |
| Unknown root | +1 base |
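
The table can be turned into a small scoring function. A sketch only: the foreign-root fallback and the engine's exact tie-breaking are omitted, and tdk/domain are toy stand-ins for the bundled resources.

```python
def score(root, suffixes, tdk, domain):
    """Score one segmentation candidate per the table above (sketch)."""
    s = 0
    if root in tdk:
        s += 10
        if not suffixes:       # whole word kept unsplit and in TDK
            s += 5
    elif root in domain:
        s += 8
    else:
        s += 1                 # unknown root base score
    s += 2 * len(root)         # root length
    s += 2 * len(suffixes)     # each recognised suffix
    if len(root) <= 2:
        s -= 4                 # short-root penalty
    return s

tdk = {"ev"}  # toy dictionary
# "ev" + "ler" + "de":           10 + 2*2 + 2*2 - 4 = 14
# "evler" + "de" (unknown root):  1 + 2*5 + 2*1     = 13
print(score("ev", ["ler", "de"], tdk, set()))   # 14, the winning split
print(score("evler", ["de"], tdk, set()))       # 13
```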

Known-Intact Words

A curated set of common Turkish words (inflected forms of demek, yemek, and discourse particles) bypasses candidate generation entirely and is always kept whole. This prevents false splits like dedi → de + di, where the root de is a valid TDK conjunction.
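
The bypass amounts to a set lookup ahead of segmentation. In this sketch, KNOWN_INTACT is a toy subset of the curated list and engine stands for any segmentation callable:

```python
KNOWN_INTACT = {"dedi", "yedi", "diye"}  # toy subset of the curated list

def segment(word, engine):
    """Segment one word, bypassing the engine for known-intact words."""
    if word.lower() in KNOWN_INTACT:
        return [word]      # always kept whole; no candidates generated
    return engine(word)

# Toy engine that naively splits off the last two characters:
naive = lambda w: [w[:-2], w[-2:]]
print(segment("dedi", naive))  # ['dedi'], not ['de', 'di']
print(segment("evde", naive))  # ['ev', 'de']
```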

Bundled Resources

| Resource | Size | Purpose |
|---|---|---|
| tdk_words.txt | ~746 KB | TDK dictionary (64K+ lemmas plus derived verb stems) |
| turkish_proper_nouns.txt | ~1 KB | Proper nouns (cities, regions, names) |
| Suffix table | 260+ entries | Turkish suffix patterns with morphological labels |
| Acronym table | 80+ entries | Acronym → Turkish expansion mappings |
| Domain vocabulary | 200+ entries | Medical, sports, and tourism terms |

Known Limitations

  • Not a full morphological analyzer. This is a heuristic segmenter, not a Zemberek/TRMorph replacement. Morphological labels may be incorrect for ambiguous suffixes (e.g. -ACC vs -GEN for "in").
  • No disambiguation. The tokenizer does not use sentence-level context to resolve ambiguous words (e.g. "gelir" = income vs. aorist).
  • Verb stem derivation is simple. Only -mak/-mek infinitive stripping (for stems ≥3 characters) is used; vowel harmony alternations in stems are not modelled.
  • TDK dictionary coverage. The bundled TDK list has ~64K entries. Words absent from TDK default to ROOT (unknown) or FOREIGN.
  • Not backward-compatible with v1.x. The old wrapper relied on turkish-tokenizer BPE and zemberek-python. Output format and token boundaries will differ.
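
The -mak/-mek stripping described above can be sketched as follows; verb_stem is a hypothetical name, not the package's internal function:

```python
def verb_stem(lemma):
    """Derive a verb stem by stripping a -mak/-mek infinitive ending.

    Only applies when the remaining stem is at least 3 characters;
    vowel-harmony alternations in the stem are not modelled.
    """
    if lemma.endswith(("mak", "mek")) and len(lemma) - 3 >= 3:
        return lemma[:-3]
    return None  # no stem derived

print(verb_stem("katılmak"))  # katıl
print(verb_stem("demek"))     # None: stem "de" is shorter than 3 chars
```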

Running Tests

pip install -e ".[dev]"
pytest tests/ -v

License

MIT
