AI & ML interests

Arabic language, Semitic NLP, low-resource dialects, machine translation, language models, linguistic resources

ArabicNLPWorld – Arabic MSA, Dialects & Low‑Resource NLP Research Hub

ArabicNLPWorld is a research organization dedicated to natural language processing for Modern Standard Arabic (MSA) — a well‑resourced language — as well as under‑resourced Arabic dialects, low‑resource language pairs involving Arabic, Islamic religious texts, and Arabic–Russian translation. We develop and share open‑source models, datasets, and educational tools to bridge the digital divide across all varieties and modalities of Arabic.

📌 This is an organization card. Our models, datasets, and demos are available on our Hugging Face Organization Page.


🎯 Our Mission

  • Build state‑of‑the‑art language models for Modern Standard Arabic (MSA) — leveraging its rich existing resources.
  • Create resources and models for under‑resourced Arabic dialects (Egyptian, Levantine, Gulf, Maghrebi, Sudanese, etc.).
  • Advance Arabic–Russian machine translation using our parallel corpus of 15.8M sentence pairs.
  • Support low‑resource language pairs where Arabic is one side (e.g., Arabic ↔ Tatar, Arabic ↔ Chechen, Arabic ↔ Bashkir, Arabic ↔ Hausa, Arabic ↔ Somali).
  • Develop specialised NLP tools for Islamic religious texts:
    • The Quran with Russian translation (Elmir Kuliev)
    • Sahih al-Bukhari — the most authentic hadith collection
    • Sahih Muslim — the second most authentic collection
    • 40 Hadith of al-Nawawi (the collection contains 42 hadith despite its title)
    • Kutub al-Sittah (The Six Major Hadith Collections) — including Sunan Abu Dawud, Jami` at-Tirmidhi, Sunan an-Nasa'i, and Sunan Ibn Majah
  • Foster a community of researchers, developers, native speakers, dialect speakers, and Islamic scholars working together on inclusive Arabic NLP.

🧠 Clarification: MSA vs. Dialects vs. Low‑Resource

| Variety / Pair | Resource Status | Description |
| --- | --- | --- |
| Modern Standard Arabic (MSA) | ✅ Well‑resourced | Hundreds of billions of tokens, many pretrained models (AraBERT, MARBERT, AraT5, CAMeLBERT), large parallel corpora with English and other major languages. |
| Arabic dialects (Egyptian, Levantine, Gulf, Maghrebi, etc.) | ⚠️ Under‑resourced to low‑resource | Limited annotated data, few pretrained models, scarce parallel corpora with MSA or English. Egyptian is best‑resourced among dialects but still far behind MSA. |
| Arabic ↔ Russian translation | 🔄 Mid‑resource | Our 15.8M corpus is the largest publicly available for this pair, but still modest compared to English‑Arabic (100M+). |
| Low‑resource pairs (Arabic ↔ Turkic, Caucasian, African languages) | ❌ Low‑resource | Very few (often zero) parallel datasets; requires transfer learning, data augmentation, and zero‑shot techniques. |
| Islamic religious texts | 📖 Domain‑specific | Rich but specialised vocabulary (classical Arabic). Includes Quran, Sahih al-Bukhari, Sahih Muslim, 40 Hadith of al-Nawawi, and Kutub al-Sittah with curated parallel translations. |

🚀 Interactive Demos

Explore our live Hugging Face Spaces and try out our models directly in your browser:

  • 🔤 Language Models
  • 🌐 Machine Translation
  • 📚 Linguistic Tools
  • 📊 Data & Benchmarks

Click on any demo to start experimenting; no installation is required.


🧠 Research Focus Areas

🇸🇦 Modern Standard Arabic (MSA) – Well‑Resourced

  • Continued pretraining and fine‑tuning of MSA models (AraBERT, AraT5, MARBERT)
  • Benchmarking on standard tasks (POS, NER, sentiment, QA)
  • Leveraging MSA as a source for transfer learning to dialects

🗣️ Arabic Dialects – Under‑Resourced to Low‑Resource

Focus on (ISO 639-3 codes in parentheses): Egyptian (arz), Levantine (apc), Gulf (afb), Maghrebi (ary, the code for Moroccan Arabic), Sudanese (apd)

Challenges we address:

  • Lack of annotated data → data augmentation, semi‑supervised learning
  • Few parallel corpora (dialect ↔ MSA, dialect ↔ English)
  • Absence of dialect‑specific pretrained models

Our approach:

  • Cross‑lingual transfer from MSA to dialects
  • Few‑shot and zero‑shot learning for dialect tasks
  • Crowdsourced annotation and validation with native speakers

🔄 Arabic–Russian Bilingual NLP – Mid‑Resource

  • 15,801,992 parallel sentences (our flagship corpus)
  • Sources: OPUS, TED, Baranov dictionary, Borisov dictionary, Sahih al-Bukhari, Sahih Muslim, 40 Hadith, Quran (Kuliev), phrasebook, Tatoeba
  • Length correlation: 0.925
  • Applications: translation, cross‑lingual retrieval, bilingual lexicography
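
This card does not state how the length correlation is computed. One common reading is the Pearson correlation between source and target sentence lengths, measured in whitespace tokens. The sketch below follows that assumption; the function name `length_correlation` and the toy sentence pairs are illustrative and are not taken from the corpus.

```python
import math

def length_correlation(pairs):
    """Pearson correlation between source and target lengths (whitespace tokens)."""
    xs = [len(src.split()) for src, _ in pairs]
    ys = [len(tgt.split()) for _, tgt in pairs]
    n = len(pairs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when all lengths are equal")
    return cov / math.sqrt(var_x * var_y)

# Toy Arabic-Russian pairs (invented for illustration)
toy_pairs = [
    ("مرحبا", "Привет"),
    ("كيف حالك", "Как дела"),
    ("أنا أدرس اللغة العربية", "Я изучаю арабский язык"),
    ("هذا كتاب جديد ومفيد جدا", "Это новая и очень полезная книга"),
]
r = length_correlation(toy_pairs)
print(round(r, 3))
```

A value close to 1 suggests that long source sentences are paired with long target sentences, which is a cheap sanity check for alignment quality. It says nothing about whether the content of each pair actually matches.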

🌍 Low‑Resource Pairs Involving Arabic – Low‑Resource

We focus on language pairs with minimal or no parallel data:

| Pair | Resource Status | Our Work |
| --- | --- | --- |
| Arabic ↔ Tatar | Very low | Data collection, transfer learning from Arabic–Russian + Russian–Tatar |
| Arabic ↔ Chechen | Extremely low | Zero‑shot translation via English or Russian pivot |
| Arabic ↔ Bashkir | Extremely low | Cross‑lingual embeddings |
| Arabic ↔ Hausa | Very low | Leveraging NLLB model |
| Arabic ↔ Somali | Very low | Data collection and annotation |
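
Pivot translation, as mentioned for the Russian pivot above, chains two systems: source to pivot, then pivot to target. The sketch below shows only the composition. Toy word lookup tables stand in for real translation models, and the names `make_lookup_translator`, `pivot_translate`, `ar_to_ru`, and `ru_to_tt` are hypothetical. Tatar is used as the target purely for illustration.

```python
from typing import Callable

Translator = Callable[[str], str]

def make_lookup_translator(table: dict[str, str]) -> Translator:
    """Build a toy word-by-word translator; unknown words pass through unchanged."""
    def translate(text: str) -> str:
        return " ".join(table.get(word, word) for word in text.split())
    return translate

def pivot_translate(text: str, src_to_pivot: Translator, pivot_to_tgt: Translator) -> str:
    """Translate source -> pivot -> target by composing two translators."""
    return pivot_to_tgt(src_to_pivot(text))

# Stand-ins for an Arabic->Russian and a Russian->Tatar system (assumptions)
ar_to_ru = make_lookup_translator({"كتاب": "книга", "ماء": "вода"})
ru_to_tt = make_lookup_translator({"книга": "китап", "вода": "су"})

print(pivot_translate("كتاب", ar_to_ru, ru_to_tt))  # китап
```

Because the second system only sees the output of the first, errors compound across the two hops. This is one reason direct data collection remains worthwhile even when a pivot path exists.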

🕌 Islamic Religious Texts – Domain‑Specific

We provide digitised, aligned, and machine‑readable versions of major Islamic texts:

| Text | Description | Parallel Translation |
| --- | --- | --- |
| The Quran | The holy book of Islam, 114 surahs | Russian (Elmir Kuliev), English (Sahih International) |
| Sahih al-Bukhari | Most authentic hadith collection (c. 7,000+ hadith) | Russian translation |
| Sahih Muslim | Second most authentic collection (c. 7,000+ hadith) | Russian translation |
| 40 Hadith of al-Nawawi | Concise collection of essential hadith (42 in total, despite the title) | Russian translation |
| Sunan Abu Dawud | One of the six major collections (Kutub al-Sittah) | Russian (in progress) |
| Jami` at-Tirmidhi | One of the six major collections | Russian (in progress) |
| Sunan an-Nasa'i | One of the six major collections | Russian (in progress) |
| Sunan Ibn Majah | One of the six major collections | Russian (in progress) |

Applications:

  • Semantic search over hadith corpora
  • Question answering on Islamic texts
  • Classical Arabic morphological analysis
  • Cross‑collection hadith matching (e.g., finding the same hadith in Bukhari and Muslim)
  • Alignment of multiple translations for linguistic study
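
Cross‑collection matching usually starts with orthographic normalization, because the same text may appear with or without vowel marks. The sketch below is one simple baseline and not a description of our production pipeline: it strips diacritics and tatweel, unifies alef variants, and scores two passages by token‑set Jaccard overlap. The function names and sample strings are illustrative assumptions.

```python
import re

# Arabic short-vowel marks (tashkeel), dagger alif, and tatweel (kashida)
_DIACRITICS = re.compile("[\u064B-\u0652\u0670\u0640]")
# Alef with madda, hamza above, hamza below
_ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")

def normalize_arabic(text: str) -> str:
    """Remove diacritics and tatweel, map alef variants to bare alef, squeeze spaces."""
    text = _DIACRITICS.sub("", text)
    text = _ALEF_VARIANTS.sub("\u0627", text)
    return " ".join(text.split())

def jaccard(a: str, b: str) -> float:
    """Token-set overlap of two passages after normalization (0.0 to 1.0)."""
    set_a = set(normalize_arabic(a).split())
    set_b = set(normalize_arabic(b).split())
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)

# Vocalized and unvocalized spellings of the same short phrase
vocalized = "إِنَّمَا الْأَعْمَالُ بِالنِّيَّاتِ"
plain = "إنما الأعمال بالنيات"
other = "الدين النصيحة"

print(jaccard(vocalized, plain))
print(jaccard(plain, other))
```

Token overlap is cheap but sensitive to wording differences between narrations. Sentence embeddings or edit distance over normalized text are natural next steps when variants differ by more than spelling.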

📖 Lexicographic Resources

  • Arabic‑Russian Dictionary – Kh.K. Baranov (latest edition) – digitised and aligned
  • Russian‑Arabic Dictionary – V.M. Borisov (latest edition) – bidirectional coverage
  • Machine‑readable formats for NLP integration
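
As a sketch of what a machine‑readable dictionary entry could look like, the example below stores one entry per line as JSON Lines and builds a reverse index for Russian‑to‑Arabic lookup. The field names (`lemma`, `pos`, `translations`) and the helper functions are hypothetical. They do not describe the actual schema of the digitised dictionaries.

```python
import json
from collections import defaultdict

# Hypothetical JSON Lines records; field names are illustrative assumptions
jsonl = "\n".join([
    json.dumps({"lemma": "كتاب", "pos": "noun", "translations": ["книга"]}, ensure_ascii=False),
    json.dumps({"lemma": "ماء", "pos": "noun", "translations": ["вода"]}, ensure_ascii=False),
])

def load_entries(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

def build_reverse_index(entries):
    """Map each translation back to the Arabic lemmas that list it."""
    index = defaultdict(list)
    for entry in entries:
        for translation in entry["translations"]:
            index[translation].append(entry["lemma"])
    return dict(index)

entries = load_entries(jsonl)
reverse = build_reverse_index(entries)
print(reverse["книга"])  # ['كتاب']
```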

📚 Educational Resources

We believe in open education and reproducible research. All our tutorials and teaching materials are freely available.

  • Interactive Notebooks – Arabic NLP, dialect processing, Arabic–Russian MT, low‑resource techniques (in Python, using Hugging Face libraries)
  • Video Lectures – Recorded talks on Arabic morphology, dialect identification, and Islamic text processing
  • Course Materials – Slides, readings, and assignments from our university courses
  • Blog Posts – Deep dives into challenges and solutions for Arabic dialects and low‑resource pairs

🤝 Get Involved

We welcome contributions from the community – researchers, developers, students, native speakers, dialect speakers, and Islamic scholars.

For Researchers

  • Use our models and datasets (and cite us!)
  • Collaborate on dialect annotation or low‑resource pair projects
  • Contribute new benchmarks for dialects or Arabic–Russian MT

For Developers

  • Integrate our models into translation, search, or chatbot applications
  • Report bugs or suggest improvements via GitHub Issues
  • Submit pull requests to our open‑source repositories

For Native & Dialect Speakers

  • Help us validate dialect annotations and translations
  • Share dialect texts (with permission) to enrich our data
  • Provide feedback on model outputs to reduce errors

For Islamic Scholars & Students

  • Help verify Quranic verse alignments and hadith translations
  • Suggest improvements for religious text processing
  • Use our tools for digital Islamic studies

For Students

  • Use our demos and tutorials for learning
  • Participate in our mentorship program or summer schools
  • Start your own research project with our support

📊 Corpus Highlights

Our flagship resource – the Arabic–Russian Translation Corpus:

| Statistic | Value |
| --- | --- |
| Total pairs | 15,801,992 |
| Length correlation | 0.925 |
| Arabic tokens | 357.7M |
| Russian tokens | 366.0M |
| Unique Arabic tokens | 1,848,317 |
| Unique Russian tokens | 933,467 |
| Sources | OPUS, TED, Baranov, Borisov, Sahih al-Bukhari, Sahih Muslim, 40 Hadith, Quran (Kuliev), phrasebook, Tatoeba |

Most frequent Arabic words: في (13.68M), من (8.45M), على (5.59M)

Most frequent Russian words: и (15.88M), в (15.52M), по (5.38M)
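
Statistics of this kind can be reproduced on any sample with a token counter. The sketch below assumes whitespace tokenization, which may differ from the tokenization used for the figures above. The function `corpus_stats` and the toy sentences are illustrative. The last two lines only divide the reported token totals by the reported number of pairs.

```python
from collections import Counter

def corpus_stats(sentences, top_k=3):
    """Total tokens, unique tokens, and the top_k most frequent tokens."""
    counts = Counter(token for s in sentences for token in s.split())
    return {
        "total_tokens": sum(counts.values()),
        "unique_tokens": len(counts),
        "most_common": counts.most_common(top_k),
    }

# Toy Russian sentences (invented for illustration)
toy_russian = [
    "я живу в городе",
    "он работает в школе и в музее",
    "мы идём по улице и по парку",
]
stats = corpus_stats(toy_russian)
print(stats["total_tokens"], stats["unique_tokens"])  # 18 14
print(stats["most_common"][0])  # ('в', 3)

# Average tokens per sentence, derived from the reported totals
print(round(357.7e6 / 15_801_992, 1))  # 22.6 (Arabic)
print(round(366.0e6 / 15_801_992, 1))  # 23.2 (Russian)
```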



🔄 Ecosystem Integration

Our work is integrated with the broader Hugging Face ecosystem:

  • Models on the Hub with easy‑to‑use pipelines
  • Datasets with streaming and evaluation scripts
  • Spaces for interactive demos and educational tools
  • Gradio apps for user‑friendly interfaces

Empowering Arabic MSA, dialects, low‑resource pairs, and Islamic texts through open science and community collaboration.

© 2026 ArabicNLPWorld – Open science for Arabic, dialects, low‑resource pairs, and beyond.
