---
title: >-
  BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol
  Understanding and Reasoning
license: cc-by-4.0
task_categories:
  - text-generation
  - question-answering
language:
  - en
tags:
  - biology
  - protocol
  - benchmark
  - ai4science
---
# BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning
## Latest News
- [2026-03-31] **Data split update!** We have officially released the train/test splits for each task (PQA, ORD, ERR, GEN, REA), making it easier for the community to train and evaluate models consistently.
- [2026-03-18] Our BioProAgent is now live on AI4S LAB! Try it out and order wet-lab experiments here.
- [2026-03-03] Our BioProAgent has been accepted at the ICLR 2026 LLA Workshop!
- [2026-01-21] The BioProBench paper has been updated with new experimental results on arXiv.
- [2025-12-01] Code and dataset (v1.0) are released on GitHub.
## Introduction
BioProBench is the first large-scale, integrated multi-task benchmark for biological protocol understanding and reasoning, specifically designed for Large Language Models (LLMs). It moves beyond simple QA to encompass a comprehensive suite of tasks critical for procedural text comprehension in life sciences.
Key Features:
- **Large-scale data:** Built upon 27K original biological protocols, yielding nearly 556K high-quality structured instances.
- **Comprehensive tasks:** Five core tasks: PQA (Question Answering), ORD (Step Ordering), ERR (Error Correction), GEN (Generation), and REA (Reasoning).
- **Broad domain coverage:** 16 biological subdomains from 6 major repositories.
- **Standardized evaluation:** A robust framework combining standard NLP metrics with novel domain-specific measures.
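To make the evaluation idea concrete, here is a minimal, illustrative sketch of one domain-relevant measure: scoring a predicted step ordering (as in the ORD task) against the ground-truth order with Kendall's tau. This is an assumption-laden example, not the benchmark's actual scoring code; the function name and input format are hypothetical.

```python
from itertools import combinations

def kendall_tau(pred: list[int], gold: list[int]) -> float:
    """Kendall's tau between two orderings of the same set of step IDs.

    Returns 1.0 for a perfect ordering and -1.0 for a fully reversed one.
    (Illustrative metric only; not BioProBench's official implementation.)
    """
    pos_pred = {step: i for i, step in enumerate(pred)}
    pos_gold = {step: i for i, step in enumerate(gold)}
    concordant = discordant = 0
    for a, b in combinations(gold, 2):
        # A pair is concordant if both orderings rank a and b the same way.
        if (pos_pred[a] - pos_pred[b]) * (pos_gold[a] - pos_gold[b]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0
print(kendall_tau([4, 3, 2, 1], [1, 2, 3, 4]))  # -1.0
```

Rank-correlation metrics like this reward partially correct orderings instead of treating any deviation as a total failure, which matters for long protocols with many steps.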
## Dataset Structure & Tasks
We provide standardized JSON files for each task, now including Train and Test splits:
| Task | Description | Files |
|---|---|---|
| PQA | Protocol Question Answering | PQA_train.json, PQA_test.json |
| ORD | Step Ordering | ORD_train.json, ORD_test.json |
| ERR | Error Correction | ERR_train.json, ERR_test.json |
| GEN | Protocol Generation | GEN_train.json, GEN_test.json |
| REA | Protocol Reasoning | REA_train.json, REA_test.json |
| Raw | Full Protocol Corpus | Bio-protocol.json, Protocol-io.json, etc. |
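Each split is a plain JSON file, so loading it needs only the standard library. The snippet below uses a hypothetical in-memory instance to stand in for a file like `PQA_train.json`; the field names (`id`, `question`, `options`, `answer`) are an assumed schema for illustration and may differ from the released files.

```python
import json

# Hypothetical mini-split standing in for PQA_train.json; the real field
# names may differ -- treat this schema as an assumption, not the spec.
mock_pqa_train = json.dumps([
    {"id": "pqa-0001",
     "question": "At what temperature should the lysate be incubated?",
     "options": ["4 C", "25 C", "37 C", "95 C"],
     "answer": "37 C"},
])

instances = json.loads(mock_pqa_train)  # for a real file: json.load(open(path))
print(f"loaded {len(instances)} instances")
print(instances[0]["question"])
```

For the actual dataset, replace the mock string with `json.load(open("PQA_train.json"))` and inspect a few instances to confirm the field names before writing training or evaluation code.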
## Useful Links
- Official Website: BioPro-Project Page
- GitHub Repository: bioprotocolbench (Code for evaluation & training)
## Key Findings
We evaluated 12 mainstream LLMs. Our findings reveal:
- Surface vs. deep understanding: Models perform well on protocol QA (~70% accuracy) but struggle with deep procedural logic.
- Reasoning Bottleneck: Performance drops significantly on Step Ordering and Protocol Generation (BLEU < 15%), highlighting the difficulty of managing temporal dependencies.
- Bio-specific Models: Interestingly, some bio-specific models lag behind general LLMs in capturing intricate procedural dependencies, suggesting a need for larger reasoning capacity.
## Contributing & Contact
We welcome contributions such as new protocol sources, additional domains, or novel tasks!
- Email: sunshineliuyuyang@gmail.com
- Issues: Feel free to open an issue on our GitHub.
## Citation
```bibtex
@misc{bioprotocolbench2025,
  title={BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning},
  author={Yuyang Liu and Liuzhenghao Lv and Xiancheng Zhang and Jingya Wang and Li Yuan and Yonghong Tian},
  year={2025},
  url={https://arxiv.org/pdf/2505.07889}
}
```