
NedoTurkishTokenizer

Self-contained Turkish morphological tokenizer.
Zero external dependencies — tokenizes Turkish text into morphologically meaningful units using a candidate-based segmentation engine, a bundled TDK dictionary, and 260+ suffix patterns.

This is a standalone tokenizer. It does not wrap turkish-tokenizer, zemberek-python, requests, or transformers. There are no hidden fallbacks or optional dependency paths. Install and use immediately.

Installation

pip install .

No additional packages required. Everything is bundled.

Quick Start

from nedo_turkish_tokenizer import NedoTurkishTokenizer

tok = NedoTurkishTokenizer()

tokens = tok.tokenize("İstanbul'da toplantıya katılamadım")
for t in tokens:
    print(f"{t['token']:15s}  {t['token_type']:10s}  pos={t['morph_pos']}")

Output:

İstanbul        ROOT        pos=0
'               PUNCT       pos=0
da              SUFFIX      pos=1
toplantı        ROOT        pos=0
ya              SUFFIX      pos=1
katıl           ROOT        pos=0
a               SUFFIX      pos=1
ma              SUFFIX      pos=2
dım             SUFFIX      pos=3

API

NedoTurkishTokenizer()

tok = NedoTurkishTokenizer()

# Single text
tokens = tok.tokenize("Merhaba dünya")

# Callable shorthand
tokens = tok("Merhaba dünya")

# Batch (parallel, uses multiprocessing)
results = tok.batch_tokenize(["text1", "text2", "text3"])

# Statistics
stats = tok.stats(tokens)

Token Output Format

Each token is a dict with these guaranteed fields:

| Field | Type | Description |
|---|---|---|
| token | str | Clean token text (no leading/trailing whitespace). |
| token_type | str | One of the types below. |
| morph_pos | int | 0 = root/word-initial, 1+ = suffix position. |

Token text does not encode spacing. The token field contains only the clean surface form. Whether a token starts a new word is indicated by morph_pos == 0, not by whitespace in the string.
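
Because spacing is carried by morph_pos rather than by the token strings, the original surface text can be rebuilt from a token list. A minimal sketch under the rule implied by the Quick Start output (a token starts a new word when morph_pos == 0, except punctuation, which attaches to the preceding token); detokenize is a hypothetical helper, not part of the package API:

```python
def detokenize(tokens):
    """Rejoin token dicts into surface text (illustrative sketch).

    Heuristic: a token starts a new word when morph_pos == 0, except
    punctuation, which attaches to the preceding token. Opening quotes
    and similar cases are not handled.
    """
    parts = []
    for t in tokens:
        new_word = t["morph_pos"] == 0 and t["token_type"] != "PUNCT"
        if parts and new_word:
            parts.append(" ")
        parts.append(t["token"])
    return "".join(parts)

# The token list from the Quick Start example, written out by hand:
tokens = [
    {"token": "İstanbul", "token_type": "ROOT", "morph_pos": 0},
    {"token": "'", "token_type": "PUNCT", "morph_pos": 0},
    {"token": "da", "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "toplantı", "token_type": "ROOT", "morph_pos": 0},
    {"token": "ya", "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "katıl", "token_type": "ROOT", "morph_pos": 0},
    {"token": "a", "token_type": "SUFFIX", "morph_pos": 1},
    {"token": "ma", "token_type": "SUFFIX", "morph_pos": 2},
    {"token": "dım", "token_type": "SUFFIX", "morph_pos": 3},
]
print(detokenize(tokens))  # İstanbul'da toplantıya katılamadım
```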

Token Types

| Type | Description | Example |
|---|---|---|
| ROOT | Turkish word root | ev, gel, kitap |
| SUFFIX | Morphological suffix | de, ler, yor |
| FOREIGN | Non-Turkish word | meeting, cloud |
| PUNCT | Punctuation | . , ' |
| NUM | Number | 42, %85, 3.14 |
| DATE | Date | 14.03.2026 |
| UNIT | Unit | kg, km, TL |
| URL | URL | https://... |
| MENTION | Social mention | @user |
| HASHTAG | Hashtag | #topic |
| EMOJI | Emoji | 😀, :) |
| ACRONYM | Acronym | NATO, TBMM |

Optional Metadata Fields

Tokens may include _-prefixed metadata fields:

| Field | Type | Description |
|---|---|---|
| _suffix_label | str | Morphological label (e.g. -LOC, -PL, -PST) |
| _canonical | str | Canonical morpheme (e.g. LOC, PL, PAST) |
| _caps | bool | Word was originally ALL CAPS |
| _foreign | bool | Word detected as foreign |
| _acronym | bool | Token is an acronym |
| _expansion | str | Acronym expansion (e.g. NATO → Kuzey Atlantik...) |
| _compound | bool | Root is a compound word |
| _parts | list[str] | Compound decomposition |
| _apo_suffix | bool | Suffix follows an apostrophe |
| _domain | bool | Root from domain vocabulary |
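
Since these fields are optional, consumers should read them with dict.get() rather than indexing. A small sketch with a hypothetical describe helper (not part of the package API):

```python
def describe(token):
    """Render one token with its optional metadata, if present.

    Underscore-prefixed fields may be absent, so dict.get() is used
    instead of token["..."] indexing.
    """
    desc = f"{token['token']} [{token['token_type']}]"
    label = token.get("_suffix_label")
    if label:
        desc += f" {label}"
    if token.get("_foreign"):
        desc += " (foreign)"
    return desc

print(describe({"token": "da", "token_type": "SUFFIX", "morph_pos": 1,
                "_suffix_label": "-LOC"}))       # da [SUFFIX] -LOC
print(describe({"token": "cloud", "token_type": "FOREIGN", "morph_pos": 0,
                "_foreign": True}))              # cloud [FOREIGN] (foreign)
```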

Architecture

The tokenizer uses a candidate-based segmentation pipeline:

Text → Normalize → Special Spans → Word Split → Per-Word Segmentation → Annotate → Strip
                                                      │
                                              Generate Candidates
                                              Score & Select Best

For each word, the engine:

  1. Generates 2–5 segmentation candidates (whole ROOT, suffix chains, foreign)
  2. Scores each candidate deterministically (TDK validation, root length, suffix recognition)
  3. Selects the highest-scoring segmentation
  4. Strips internal whitespace markers from the output
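
Step 1 can be sketched with a toy suffix set. This is illustrative only: the real engine uses the bundled 260+ pattern table and explores richer suffix-chain alternatives than the greedy longest-match split shown here.

```python
def split_suffix_chain(tail, suffixes):
    """Greedily split tail into known suffixes, longest match first."""
    chain = []
    while tail:
        for s in sorted(suffixes, key=len, reverse=True):
            if tail.startswith(s):
                chain.append(s)
                tail = tail[len(s):]
                break
        else:
            return None  # tail cannot be covered by known suffixes
    return chain

def candidates(word, suffixes):
    """Generate segmentation candidates for one word (toy sketch)."""
    out = [[word]]  # candidate 1: keep the whole word as a ROOT
    for i in range(2, len(word)):          # root must be at least 2 chars
        root, tail = word[:i], word[i:]
        chain = split_suffix_chain(tail, suffixes)
        if chain:
            out.append([root] + chain)
    return out

print(candidates("evlerde", {"ler", "de", "e"}))
# [['evlerde'], ['ev', 'ler', 'de'], ['evler', 'de'], ['evlerd', 'e']]
```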

Scoring Rules

| Factor | Score |
|---|---|
| Root in TDK dictionary | +10 |
| Whole word in TDK (unsplit) | +5 bonus |
| Root in domain vocabulary | +8 |
| Root length | +2 per character |
| Each recognised suffix | +2 |
| Short root penalty (≤2 chars) | −4 |
| Foreign root (fallback) | +3 base |
| Unknown root | +1 base |
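
The table can be turned into a small scoring function. A sketch only: the foreign-root fallback and the engine's exact tie-breaking are omitted, and tdk/domain are toy stand-ins for the bundled resources.

```python
def score(root, suffixes, tdk, domain):
    """Score one segmentation candidate per the table above (sketch)."""
    s = 0
    if root in tdk:
        s += 10
        if not suffixes:       # whole word kept unsplit and in TDK
            s += 5
    elif root in domain:
        s += 8
    else:
        s += 1                 # unknown root base score
    s += 2 * len(root)         # root length
    s += 2 * len(suffixes)     # each recognised suffix
    if len(root) <= 2:
        s -= 4                 # short-root penalty
    return s

tdk = {"ev"}  # toy dictionary
# "ev" + "ler" + "de":           10 + 2*2 + 2*2 - 4 = 14
# "evler" + "de" (unknown root):  1 + 2*5 + 2*1     = 13
print(score("ev", ["ler", "de"], tdk, set()))   # 14, the winning split
print(score("evler", ["de"], tdk, set()))       # 13
```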

Known-Intact Words

A curated set of common Turkish words (inflected forms of demek, yemek, and discourse particles) bypasses candidate generation entirely and is always kept whole. This prevents false splits like dedi → de + di, where the root de is a valid TDK conjunction.
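
The bypass amounts to a set lookup ahead of segmentation. In this sketch, KNOWN_INTACT is a toy subset of the curated list and engine stands for any segmentation callable:

```python
KNOWN_INTACT = {"dedi", "yedi", "diye"}  # toy subset of the curated list

def segment(word, engine):
    """Segment one word, bypassing the engine for known-intact words."""
    if word.lower() in KNOWN_INTACT:
        return [word]      # always kept whole; no candidates generated
    return engine(word)

# Toy engine that naively splits off the last two characters:
naive = lambda w: [w[:-2], w[-2:]]
print(segment("dedi", naive))  # ['dedi'], not ['de', 'di']
print(segment("evde", naive))  # ['ev', 'de']
```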

Bundled Resources

| Resource | Size | Purpose |
|---|---|---|
| tdk_words.txt | ~746 KB | TDK dictionary (64K+ lemmas plus derived verb stems) |
| turkish_proper_nouns.txt | ~1 KB | Proper nouns (cities, regions, names) |
| Suffix table | 260+ entries | Turkish suffix patterns with morphological labels |
| Acronym table | 80+ entries | Acronym → Turkish expansion mappings |
| Domain vocabulary | 200+ entries | Medical, sports, and tourism terms |

Known Limitations

  • Not a full morphological analyzer. This is a heuristic segmenter, not a Zemberek/TRMorph replacement. Morphological labels may be incorrect for ambiguous suffixes (e.g. -ACC vs -GEN for "in").
  • No disambiguation. The tokenizer does not use sentence-level context to resolve ambiguous words (e.g. "gelir" = income vs. aorist).
  • Verb stem derivation is simple. Only -mak/-mek infinitive stripping (for stems ≥3 characters) is used; vowel harmony alternations in stems are not modelled.
  • TDK dictionary coverage. The bundled TDK list has ~64K entries. Words absent from TDK default to ROOT (unknown) or FOREIGN.
  • Not backward-compatible with v1.x. The old wrapper relied on turkish-tokenizer BPE and zemberek-python. Output format and token boundaries will differ.
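
The -mak/-mek stripping described above can be sketched as follows; verb_stem is a hypothetical name, not the package's internal function:

```python
def verb_stem(lemma):
    """Derive a verb stem by stripping a -mak/-mek infinitive ending.

    Only applies when the remaining stem is at least 3 characters;
    vowel-harmony alternations in the stem are not modelled.
    """
    if lemma.endswith(("mak", "mek")) and len(lemma) - 3 >= 3:
        return lemma[:-3]
    return None  # no stem derived

print(verb_stem("katılmak"))  # katıl
print(verb_stem("demek"))     # None: stem "de" is shorter than 3 chars
```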

Running Tests

pip install -e ".[dev]"
pytest tests/ -v

License

MIT
