AI & ML interests
Arabic language, Semitic NLP, low-resource dialects, machine translation, language models, linguistic resources
ArabicNLPWorld: Arabic MSA, Dialects & Low-Resource NLP Research Hub
ArabicNLPWorld is a research organization dedicated to natural language processing for Modern Standard Arabic (MSA), a well-resourced language, as well as under-resourced Arabic dialects, low-resource language pairs involving Arabic, Islamic religious texts, and Arabic-Russian translation. We develop and share open-source models, datasets, and educational tools to bridge the digital divide across all varieties and modalities of Arabic.
This is an organization card. Our models, datasets, and demos are available on our Hugging Face Organization Page.
Our Mission
- Build state-of-the-art language models for Modern Standard Arabic (MSA), leveraging its rich existing resources.
- Create resources and models for under-resourced Arabic dialects (Egyptian, Levantine, Gulf, Maghrebi, Sudanese, etc.).
- Advance Arabic-Russian machine translation using our 15.8M parallel corpus.
- Support low-resource language pairs where Arabic is one side (e.g., Arabic ↔ Tatar, Arabic ↔ Chechen, Arabic ↔ Bashkir, Arabic ↔ Hausa, Arabic ↔ Somali).
- Develop specialised NLP tools for Islamic religious texts:
  - The Quran with Russian translation (Elmir Kuliev)
  - Sahih al-Bukhari: the most authentic hadith collection
  - Sahih Muslim: the second most authentic collection
  - 40 Hadith of al-Nawawi (41 in some editions)
  - Kutub al-Sittah (The Six Major Hadith Collections), including Sunan Abu Dawud, Jami` at-Tirmidhi, Sunan an-Nasa'i, and Sunan Ibn Majah
- Foster a community of researchers, developers, native speakers, dialect speakers, and Islamic scholars working together on inclusive Arabic NLP.
Clarification: MSA vs. Dialects vs. Low-Resource
| Variety / Pair | Resource Status | Description |
|---|---|---|
| Modern Standard Arabic (MSA) | Well-resourced | Hundreds of billions of tokens, many pretrained models (AraBERT, MARBERT, AraT5, CAMeLBERT), large parallel corpora with English and other major languages. |
| Arabic dialects (Egyptian, Levantine, Gulf, Maghrebi, etc.) | Under-resourced to low-resource | Limited annotated data, few pretrained models, scarce parallel corpora with MSA or English. Egyptian is best-resourced among dialects but still far behind MSA. |
| Arabic ↔ Russian translation | Mid-resource | Our 15.8M corpus is the largest publicly available for this pair, but still modest compared to English-Arabic (100M+). |
| Low-resource pairs (Arabic ↔ Turkic, Caucasian, African languages) | Low-resource | Very few (often zero) parallel datasets; requires transfer learning, data augmentation, and zero-shot techniques. |
| Islamic religious texts | Domain-specific | Rich but specialised vocabulary (classical Arabic). Includes Quran, Sahih al-Bukhari, Sahih Muslim, 40 Hadith of al-Nawawi, and Kutub al-Sittah with curated parallel translations. |
Interactive Demos
Explore our live Hugging Face Spaces and try out our models directly in your browser:
Language Models
- AraBERT Playground: Generate and analyze MSA text.
- DialectBERT Explorer: Pretrained model for Egyptian, Levantine, and Gulf Arabic.
- Arabic-Russian Embeddings: Cross-lingual word vectors for translation.
Machine Translation
- Arabic ↔ Russian Translator: Neural translation demo (15.8M parallel pairs).
- MSA ↔ Dialect Translator: Convert between Modern Standard Arabic and Egyptian/Levantine.
- Quran & Hadith Translation Explorer: Arabic originals with Russian (Kuliev) and English parallels.
Linguistic Tools
- Arabic Morphological Analyzer: Root-based segmentation and POS tagging.
- Dialect Identifier: Detect MSA vs. Egyptian, Levantine, Gulf, Maghrebi.
- Named Entity Recognition for Arabic: Identify persons, locations, organizations.
Data & Benchmarks
- Arabic-Russian Corpus Explorer: Browse 15.8M parallel sentences.
- Dialect NLP Leaderboard: Compare model performance on dialect tasks.
- Islamic Text Annotation Tool: Help us improve Quran/hadith alignments.
Click on any demo to start experimenting; no installation required!
Research Focus Areas
Modern Standard Arabic (MSA): Well-Resourced
- Continued pretraining and fine-tuning of MSA models (AraBERT, AraT5, MARBERT)
- Benchmarking on standard tasks (POS, NER, sentiment, QA)
- Leveraging MSA as a source for transfer learning to dialects
Arabic Dialects: Under-Resourced to Low-Resource
Focus on: Egyptian (arz), Levantine (apc), Gulf (afb), Maghrebi (ary), Sudanese (apd)
Challenges we address:
- Lack of annotated data → data augmentation, semi-supervised learning
- Few parallel corpora (dialect ↔ MSA, dialect ↔ English)
- Absence of dialect-specific pretrained models
Our approach:
- Cross-lingual transfer from MSA to dialects
- Few-shot and zero-shot learning for dialect tasks
- Crowdsourced annotation and validation with native speakers
Arabic-Russian Bilingual NLP: Mid-Resource
- 15,801,992 parallel sentences (our flagship corpus)
- Sources: OPUS, TED, Baranov dictionary, Borisov dictionary, Sahih al-Bukhari, Sahih Muslim, 40 Hadith, Quran (Kuliev), phrasebook, Tatoeba
- Length correlation: 0.925
- Applications: translation, cross-lingual retrieval, bilingual lexicography
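This card does not define how the length correlation is computed. One plausible reading is the Pearson correlation between the lengths of the Arabic and Russian sides of each pair. The sketch below works under that assumption and measures length in whitespace-separated tokens; the function name `length_correlation` and the toy pairs are illustrative and are not part of the corpus tooling.

```python
import math


def length_correlation(pairs):
    """Pearson correlation between source and target token counts.

    `pairs` is an iterable of (arabic_sentence, russian_sentence) strings.
    Length is measured in whitespace-separated tokens (an assumption).
    """
    src = []
    tgt = []
    for ar, ru in pairs:
        src.append(len(ar.split()))
        tgt.append(len(ru.split()))
    n = len(src)
    if n < 2:
        raise ValueError("need at least two sentence pairs")
    mean_s = sum(src) / n
    mean_t = sum(tgt) / n
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(src, tgt))
    var_s = sum((s - mean_s) ** 2 for s in src)
    var_t = sum((t - mean_t) ** 2 for t in tgt)
    if var_s == 0 or var_t == 0:
        raise ValueError("correlation is undefined when all lengths are equal")
    return cov / math.sqrt(var_s * var_t)


# Toy data for illustration only, not drawn from the corpus.
toy_pairs = [
    ("مرحبا", "привет"),
    ("كيف حالك", "как дела"),
    ("أنا أتعلم اللغة العربية", "я изучаю арабский язык"),
]
print(round(length_correlation(toy_pairs), 3))
```

A value near 1 suggests that long Arabic sentences are paired with long Russian sentences, which is a cheap sanity check for alignment quality but says nothing about whether the content matches.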
Low-Resource Pairs Involving Arabic: Low-Resource
We focus on language pairs with minimal or no parallel data:
| Pair | Resource Status | Our Work |
|---|---|---|
| Arabic ↔ Tatar | Very low | Data collection, transfer learning from Arabic-Russian + Russian-Tatar |
| Arabic ↔ Chechen | Extremely low | Zero-shot translation via English or Russian pivot |
| Arabic ↔ Bashkir | Extremely low | Cross-lingual embeddings |
| Arabic ↔ Hausa | Very low | Leveraging NLLB model |
| Arabic ↔ Somali | Very low | Data collection and annotation |
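The pivot strategy listed in the table can be pictured as two translation steps chained together. The sketch below is a minimal illustration that uses dictionary-backed stand-ins instead of real models; `make_lookup_translator`, `pivot_translate`, the toy lexicons, and the placeholder target tokens are all hypothetical and do not correspond to any released system.

```python
from typing import Callable, Dict

Translator = Callable[[str], str]


def make_lookup_translator(lexicon: Dict[str, str]) -> Translator:
    """Build a toy word-by-word translator; unknown words pass through unchanged."""
    def translate(text: str) -> str:
        return " ".join(lexicon.get(word, word) for word in text.split())
    return translate


def pivot_translate(text: str, src_to_pivot: Translator, pivot_to_tgt: Translator) -> str:
    """Translate source -> pivot -> target by chaining two systems."""
    return pivot_to_tgt(src_to_pivot(text))


# Toy stand-ins: Arabic -> Russian, then Russian -> placeholder target tokens.
ar_to_ru = make_lookup_translator({"كتاب": "книга", "جديد": "новая"})
ru_to_tgt = make_lookup_translator({"книга": "BOOK", "новая": "NEW"})

print(pivot_translate("كتاب جديد", ar_to_ru, ru_to_tgt))
```

With real models, each callable would wrap a separate translation system. Errors from the first hop carry into the second, so pivot output usually needs more careful evaluation than direct translation.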
Islamic Religious Texts: Domain-Specific
We provide digitised, aligned, and machine-readable versions of major Islamic texts:
| Text | Description | Parallel Translation |
|---|---|---|
| The Quran | The holy book of Islam, 114 surahs | Russian (Elmir Kuliev), English (Sahih International) |
| Sahih al-Bukhari | Most authentic hadith collection (c. 7,000+ hadith) | Russian translation |
| Sahih Muslim | Second most authentic collection (c. 7,000+ hadith) | Russian translation |
| 40 Hadith of al-Nawawi | Concise collection of 40 (or 41) essential hadith | Russian translation |
| Sunan Abu Dawud | One of the six major collections (Kutub al-Sittah) | Russian (in progress) |
| Jami` at-Tirmidhi | One of the six major collections | Russian (in progress) |
| Sunan an-Nasa'i | One of the six major collections | Russian (in progress) |
| Sunan Ibn Majah | One of the six major collections | Russian (in progress) |
Applications:
- Semantic search over hadith corpora
- Question answering on Islamic texts
- Classical Arabic morphological analysis
- Cross-collection hadith matching (e.g., finding the same hadith in Bukhari and Muslim)
- Alignment of multiple translations for linguistic study
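Cross-collection matching, as listed among the applications, can be approximated with a simple lexical baseline before reaching for embeddings. The sketch below normalizes Arabic text (strips diacritics and tatweel, folds alef variants) and compares token sets with Jaccard similarity. The function names, the threshold, and the entry ids `a-1`, `b-1`, `b-2` are hypothetical and do not reflect the numbering of any published collection.

```python
import re

# Harakat (U+064B to U+0652), superscript alef (U+0670), and tatweel (U+0640).
_MARKS = re.compile("[\u064B-\u0652\u0670\u0640]")
# Alef with madda, hamza above, and hamza below, all folded to bare alef.
_ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")


def normalize(text: str) -> str:
    """Strip diacritics and tatweel, and map alef variants to bare alef."""
    text = _MARKS.sub("", text)
    return _ALEF_VARIANTS.sub("\u0627", text)


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity of two normalized texts."""
    ta, tb = set(normalize(a).split()), set(normalize(b).split())
    if not ta and not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def match_collections(left, right, threshold=0.6):
    """Return (left_id, right_id, score) for the best match of each left entry."""
    matches = []
    for lid, ltext in left.items():
        best = max(
            ((rid, jaccard(ltext, rtext)) for rid, rtext in right.items()),
            key=lambda item: item[1],
            default=None,
        )
        if best is not None and best[1] >= threshold:
            matches.append((lid, best[0], best[1]))
    return matches


# Toy entries with hypothetical ids: one vocalized, the others unvocalized.
collection_a = {"a-1": "إِنَّمَا الْأَعْمَالُ بِالنِّيَّاتِ"}
collection_b = {"b-1": "الدين النصيحة", "b-2": "إنما الأعمال بالنيات"}
print(match_collections(collection_a, collection_b))
```

Token overlap only catches near-verbatim parallels. Narrations that differ in wording would need fuzzier or embedding-based comparison, plus review by someone qualified to judge the match.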
Lexicographic Resources
- Arabic-Russian Dictionary by Kh.K. Baranov (latest edition): digitised and aligned
- Russian-Arabic Dictionary by V.M. Borisov (latest edition): bidirectional coverage
- Machine-readable formats for NLP integration
Educational Resources
We believe in open education and reproducible research. All our tutorials and teaching materials are freely available.
- Interactive Notebooks: Arabic NLP, dialect processing, Arabic-Russian MT, low-resource techniques (in Python, using Hugging Face libraries)
- Video Lectures: Recorded talks on Arabic morphology, dialect identification, and Islamic text processing
- Course Materials: Slides, readings, and assignments from our university courses
- Blog Posts: Deep dives into challenges and solutions for Arabic dialects and low-resource pairs
Get Involved
We welcome contributions from the community: researchers, developers, students, native speakers, dialect speakers, and Islamic scholars.
For Researchers
- Use our models and datasets (and cite us!)
- Collaborate on dialect annotation or low-resource pair projects
- Contribute new benchmarks for dialects or Arabic-Russian MT
For Developers
- Integrate our models into translation, search, or chatbot applications
- Report bugs or suggest improvements via GitHub Issues
- Submit pull requests to our open-source repositories
For Native & Dialect Speakers
- Help us validate dialect annotations and translations
- Share dialect texts (with permission) to enrich our data
- Provide feedback on model outputs to reduce errors
For Islamic Scholars & Students
- Help verify Quranic verse alignments and hadith translations
- Suggest improvements for religious text processing
- Use our tools for digital Islamic studies
For Students
- Use our demos and tutorials for learning
- Participate in our mentorship program or summer schools
- Start your own research project with our support
Corpus Highlights
Our flagship resource, the Arabic-Russian Translation Corpus:
| Statistic | Value |
|---|---|
| Total pairs | 15,801,992 |
| Length correlation | 0.925 |
| Arabic tokens | 357.7M |
| Russian tokens | 366.0M |
| Unique Arabic tokens | 1,848,317 |
| Unique Russian tokens | 933,467 |
| Sources | OPUS, TED, Baranov, Borisov, Sahih al-Bukhari, Sahih Muslim, 40 Hadith, Quran (Kuliev), phrasebook, Tatoeba |
Most frequent Arabic words: في (13.68M), من (8.45M), على (5.59M)
Most frequent Russian words: и (15.88M), в (15.52M), по (5.38M)
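Figures like those above can be gathered in a single pass over the pairs. The card does not say which tokenizer produced the reported counts, so the sketch below assumes plain whitespace tokenization (with lowercasing on the Russian side) and an in-memory iterable of pairs; `corpus_stats` and the sample sentences are illustrative only.

```python
from collections import Counter


def corpus_stats(pairs, top_k=3):
    """Count tokens, unique tokens, and the most frequent words per side."""
    ar_counts, ru_counts = Counter(), Counter()
    total_pairs = 0
    for ar, ru in pairs:
        total_pairs += 1
        ar_counts.update(ar.split())
        ru_counts.update(ru.lower().split())
    return {
        "total_pairs": total_pairs,
        "arabic_tokens": sum(ar_counts.values()),
        "russian_tokens": sum(ru_counts.values()),
        "unique_arabic_tokens": len(ar_counts),
        "unique_russian_tokens": len(ru_counts),
        "top_arabic": ar_counts.most_common(top_k),
        "top_russian": ru_counts.most_common(top_k),
    }


# Toy sample for illustration, not drawn from the corpus.
sample = [
    ("هو في البيت", "он в доме"),
    ("هي في المدرسة", "она в школе"),
]
stats = corpus_stats(sample, top_k=1)
print(stats["arabic_tokens"], stats["top_arabic"], stats["top_russian"])
```

Because Arabic attaches clitics to words, whitespace counts will differ from counts produced after morphological segmentation, so any comparison with the table should use the same tokenization on both sides.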
Connect With Us
- Hugging Face: ArabicNLPWorld (models, datasets, and Spaces)
- Contact: arabicnlpworld@example.com
Ecosystem Integration
Our work is integrated with the broader Hugging Face ecosystem:
- Models on the Hub with easy-to-use pipelines
- Datasets with streaming and evaluation scripts
- Spaces for interactive demos and educational tools
- Gradio apps for user-friendly interfaces
Empowering Arabic MSA, dialects, low-resource pairs, and Islamic texts through open science and community collaboration.