---
title: >-
  BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol
  Understanding and Reasoning
license: cc-by-4.0
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- biology
- protocol
- benchmark
- ai4science
---
<div align="center">
<img src="https://github.com/YuyangSunshine/bioprotocolbench/blob/main/figures/logo-v3.png?raw=true" alt="BioProBench Logo" width="300"/>
</div>

# BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning
[arXiv](https://arxiv.org/pdf/2505.07889)
[Hugging Face](https://huggingface.co/BioProBench)
[GitHub](https://github.com/YuyangSunshine/bioprotocolbench)
[Project Page](https://yuyangsunshine.github.io/BioPro-Project/)
[License: CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
---

## 📢 Latest News

* ✨ **[2026-03-31]** **Data Split Update!** We have officially released the **Train/Test splits** for each task (PQA, ORD, ERR, GEN, REA), making it easier for the community to train and evaluate models consistently.
* 🔥 **[2026-03-18]** Our **BioProAgent** is now live on AI4S LAB! [Try it out and order wet-lab experiments here](https://yuyangsunshine.github.io/BioPro-Project/).
* 🎉 **[2026-03-03]** Our BioProAgent has been accepted to the **ICLR 2026 LLA Workshop**!
* 📄 **[2026-01-21]** The BioProBench paper has been updated with new experimental results: [arXiv](https://arxiv.org/pdf/2505.07889).
* 🚀 **[2025-12-01]** Code and dataset (v1.0) released on GitHub.

---
## 📖 Introduction

**BioProBench** is the first large-scale, integrated multi-task benchmark for biological protocol understanding and reasoning, specifically designed for Large Language Models (LLMs). It moves beyond simple QA to encompass a comprehensive suite of tasks critical for procedural text comprehension in the life sciences.

<div align="center">
<img src="https://github.com/YuyangSunshine/bioprotocolbench/blob/main/figures/overview.png?raw=true" alt="BioProBench Overview" width="1000"/>
</div>
### Key Features

* 📊 **Large-scale Data:** Built upon **27K original biological protocols**, yielding nearly **556K high-quality structured instances**.
* 🎯 **Comprehensive Tasks:** Five core tasks: **PQA** (Question Answering), **ORD** (Step Ordering), **ERR** (Error Correction), **GEN** (Generation), and **REA** (Reasoning).
* 🧬 **Broad Domain Coverage:** Covers **16 biological subdomains** from 6 major repositories.
* 🔬 **Standardized Evaluation:** A robust framework combining NLP metrics with novel domain-specific measures.

---
## 📂 Dataset Structure & Tasks

<div align="center">
<img src="https://github.com/YuyangSunshine/bioprotocolbench/blob/main/figures/samples.jpg?raw=true" alt="BioProBench Samples" width="1000"/>
</div>

We provide standardized JSON files for each task, now including **Train** and **Test** splits:

| Task | Description | Files |
| :--- | :--- | :--- |
| **PQA** | Protocol Question Answering | `PQA_train.json`, `PQA_test.json` |
| **ORD** | Step Ordering | `ORD_train.json`, `ORD_test.json` |
| **ERR** | Error Correction | `ERR_train.json`, `ERR_test.json` |
| **GEN** | Protocol Generation | `GEN_train.json`, `GEN_test.json` |
| **REA** | Reasoning | `REA_train.json`, `REA_test.json` |
| **Raw** | Full Protocol Corpus | `Bio-protocol.json`, `Protocol-io.json`, etc. |
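Each split ships as a plain JSON file, so it can be inspected without any special tooling. The sketch below shows one way to load and count instances in a split; the field names (`question`, `answer`) and the assumption that each file is a top-level JSON array are illustrative, as the exact schema may differ per task:

```python
import json
from pathlib import Path

def load_split(path):
    """Load one BioProBench task split, assumed to be a JSON array of instances."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Write a tiny stand-in file so this sketch runs end to end; in practice,
# point load_split at the real PQA_test.json downloaded from this repo.
sample = [{"question": "...", "answer": "..."}]  # hypothetical field names
Path("PQA_test.json").write_text(json.dumps(sample), encoding="utf-8")

instances = load_split("PQA_test.json")
print(f"{len(instances)} instances loaded")
```

Because the splits are ordinary JSON, the same files can also be fed directly to most data-loading libraries without conversion.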
### 🔗 Useful Links

* **Official Website:** [BioPro-Project Page](https://yuyangsunshine.github.io/BioPro-Project/)
* **GitHub Repository:** [bioprotocolbench](https://github.com/YuyangSunshine/bioprotocolbench) (code for evaluation & training)

---
## 🔬 Key Findings

We evaluated 12 mainstream LLMs. Our findings reveal:

* **Surface vs. Deep Understanding:** Models perform well on QA (~70% accuracy) but struggle with deep procedural logic.
* **Reasoning Bottleneck:** Performance drops significantly on **Step Ordering** and **Protocol Generation** (BLEU < 15%), highlighting the difficulty of managing temporal dependencies.
* **Bio-specific Models:** Interestingly, some bio-specific models lag behind general LLMs in capturing intricate procedural dependencies, suggesting a need for larger reasoning capacity.

---
## 🤝 Contributing & Contact

We welcome contributions such as new protocol sources, additional domains, or novel tasks!

- **Email:** sunshineliuyuyang@gmail.com
- **Issues:** Feel free to open an issue on our [GitHub](https://github.com/YuyangSunshine/bioprotocolbench).
## 📝 Citation

```bibtex
@misc{bioprotocolbench2025,
  title={BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning},
  author={Yuyang Liu and Liuzhenghao Lv and Xiancheng Zhang and Jingya Wang and Li Yuan and Yonghong Tian},
  year={2025},
  url={https://arxiv.org/pdf/2505.07889}
}
```