SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration Paper • 2603.03823 • Published 14 days ago • 6
DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning Paper • 2602.16742 • Published 28 days ago • 12
DeepMath Collection A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning • 5 items • Updated May 22, 2025 • 5
DeepVision-103K: A Visually Diverse, Broad-Coverage, and Verifiable Mathematical Dataset for Multimodal Reasoning Paper • 2602.16742 • Published 28 days ago • 12
HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam Paper • 2602.13964 • Published about 1 month ago • 10
HLE-Verified: A Systematic Verification and Structured Revision of Humanity's Last Exam Paper • 2602.13964 • Published about 1 month ago • 10