ArabicMMLU: Assessing Massive Multitask Language Understanding in Arabic • Paper • 2402.12840 • Published Feb 20, 2024 • 2
The Good, the Bad and the Constructive: Automatically Measuring Peer Review's Utility for Authors • Paper • 2509.04484 • Published Aug 31, 2025 • 1
Instruction-Guided Poetry Generation in Arabic and Its Dialects • Paper • 2604.27766 • Published 6 days ago • 2
Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures • Paper • 2510.24081 • Published Oct 28, 2025 • 22
Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph • Paper • 2406.15627 • Published Jun 21, 2024