---
title: >-
  BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol
  Understanding and Reasoning
license: cc-by-4.0
task_categories:
  - text-generation
  - question-answering
language:
  - en
tags:
  - biology
  - protocol
  - benchmark
  - ai4science
---
# BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning
## Latest News
- [2026-03-31] **Data split update!** We have officially released the train/test splits for each task (PQA, ORD, ERR, GEN, REA), making it easier for the community to train and evaluate models consistently.
- [2026-03-18] Our BioProAgent is now live on AI4S LAB! Try it out and order wet-lab experiments here.
- [2026-03-03] Our BioProAgent has been accepted at the ICLR 2026 LLA Workshop!
- [2026-01-21] The BioProBench paper has been updated with new experimental results on arXiv.
- [2025-12-01] Code and dataset (v1.0) are released on GitHub.
## Introduction
BioProBench is the first large-scale, integrated multi-task benchmark for biological protocol understanding and reasoning, specifically designed for Large Language Models (LLMs). It moves beyond simple QA to encompass a comprehensive suite of tasks critical for procedural text comprehension in life sciences.
Key Features:
- **Large-scale data:** Built upon 27K original biological protocols, yielding nearly 556K high-quality structured instances.
- **Comprehensive tasks:** Five core tasks: PQA (Question Answering), ORD (Step Ordering), ERR (Error Correction), GEN (Generation), and REA (Reasoning).
- **Broad domain coverage:** 16 biological subdomains from 6 major repositories.
- **Standardized evaluation:** A robust framework combining standard NLP metrics with novel domain-specific measures.
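To make the evaluation idea concrete, here is a minimal, illustrative sketch of one domain-relevant measure: scoring a predicted step ordering (as in the ORD task) against the ground-truth order with Kendall's tau. This is an assumption-laden example, not the benchmark's actual scoring code; the function name and input format are hypothetical.

```python
from itertools import combinations

def kendall_tau(pred: list[int], gold: list[int]) -> float:
    """Kendall's tau between two orderings of the same set of step IDs.

    Returns 1.0 for a perfect ordering and -1.0 for a fully reversed one.
    (Illustrative metric only; not BioProBench's official implementation.)
    """
    pos_pred = {step: i for i, step in enumerate(pred)}
    pos_gold = {step: i for i, step in enumerate(gold)}
    concordant = discordant = 0
    for a, b in combinations(gold, 2):
        # A pair is concordant if both orderings rank a and b the same way.
        if (pos_pred[a] - pos_pred[b]) * (pos_gold[a] - pos_gold[b]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

print(kendall_tau([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0
print(kendall_tau([4, 3, 2, 1], [1, 2, 3, 4]))  # -1.0
```

Rank-correlation metrics like this reward partially correct orderings instead of treating any deviation as a total failure, which matters for long protocols with many steps.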
## Dataset Structure & Tasks
We provide standardized JSON files for each task, now including Train and Test splits:
| Task | Description | Files |
|---|---|---|
| PQA | Protocol Question Answering | PQA_train.json, PQA_test.json |
| ORD | Step Ordering | ORD_train.json, ORD_test.json |
| ERR | Error Correction | ERR_train.json, ERR_test.json |
| GEN | Protocol Generation | GEN_train.json, GEN_test.json |
| REA | Protocol Reasoning | REA_train.json, REA_test.json |
| Raw | Full Protocol Corpus | Bio-protocol.json, Protocol-io.json, etc. |
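Each split is a plain JSON file, so loading it needs only the standard library. The snippet below uses a hypothetical in-memory instance to stand in for a file like `PQA_train.json`; the field names (`id`, `question`, `options`, `answer`) are an assumed schema for illustration and may differ from the released files.

```python
import json

# Hypothetical mini-split standing in for PQA_train.json; the real field
# names may differ -- treat this schema as an assumption, not the spec.
mock_pqa_train = json.dumps([
    {"id": "pqa-0001",
     "question": "At what temperature should the lysate be incubated?",
     "options": ["4 C", "25 C", "37 C", "95 C"],
     "answer": "37 C"},
])

instances = json.loads(mock_pqa_train)  # for a real file: json.load(open(path))
print(f"loaded {len(instances)} instances")
print(instances[0]["question"])
```

For the actual dataset, replace the mock string with `json.load(open("PQA_train.json"))` and inspect a few instances to confirm the field names before writing training or evaluation code.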
## Useful Links
- Official Website: BioPro-Project Page
- GitHub Repository: bioprotocolbench (Code for evaluation & training)
## Key Findings
We evaluated 12 mainstream LLMs. Our findings reveal:
- Surface vs. deep understanding: Models perform well on protocol QA (~70% accuracy) but struggle with deep procedural logic.
- Reasoning Bottleneck: Performance drops significantly on Step Ordering and Protocol Generation (BLEU < 15%), highlighting the difficulty of managing temporal dependencies.
- Bio-specific Models: Interestingly, some bio-specific models lag behind general LLMs in capturing intricate procedural dependencies, suggesting a need for larger reasoning capacity.
## Contributing & Contact
We welcome contributions such as new protocol sources, additional domains, or novel tasks!
- Email: sunshineliuyuyang@gmail.com
- Issues: Feel free to open an issue on our GitHub.
## Citation
```bibtex
@misc{bioprotocolbench2025,
  title={BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning},
  author={Yuyang Liu and Liuzhenghao Lv and Xiancheng Zhang and Jingya Wang and Li Yuan and Yonghong Tian},
  year={2025},
  url={https://arxiv.org/pdf/2505.07889}
}
```