---
title: >-
  BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol
  Understanding and Reasoning
license: cc-by-4.0
task_categories:
- text-generation
- question-answering
language:
- en
tags:
- biology
- protocol
- benchmark
- ai4science
---
<div align="center">
<img src="https://github.com/YuyangSunshine/bioprotocolbench/blob/main/figures/logo-v3.png?raw=true" alt="BioProBench Logo" width="300"/>
</div>

# BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning
[arXiv](https://arxiv.org/pdf/2505.07889)
[Hugging Face](https://huggingface.co/BioProBench)
[GitHub](https://github.com/YuyangSunshine/bioprotocolbench)
[Project Page](https://yuyangsunshine.github.io/BioPro-Project/)
[License: CC BY 4.0](https://creativecommons.org/licenses/by/4.0/)
---

## 📢 Latest News

* ✨ **[2026-03-31]** **Data Split Update!** We have officially released the **Train/Test splits** for each task (PQA, ORD, ERR, GEN, REA), making it easier for the community to train and evaluate models consistently.
* 🔥 **[2026-03-18]** Our **BioProAgent** is now live on AI4S LAB! [Try it out and order wet-lab experiments here](https://yuyangsunshine.github.io/BioPro-Project/).
* 🎉 **[2026-03-03]** Our BioProAgent has been accepted to the **ICLR 2026 LLA Workshop**!
* 📄 **[2026-01-21]** The BioProBench paper has been updated with new experimental results: [arXiv](https://arxiv.org/pdf/2505.07889).
* 🚀 **[2025-12-01]** Code and dataset (v1.0) released on GitHub.

---
## 📖 Introduction

**BioProBench** is the first large-scale, integrated multi-task benchmark for biological protocol understanding and reasoning, specifically designed for Large Language Models (LLMs). It moves beyond simple QA to encompass a comprehensive suite of tasks critical for procedural text comprehension in the life sciences.

<div align="center">
<img src="https://github.com/YuyangSunshine/bioprotocolbench/blob/main/figures/overview.png?raw=true" alt="BioProBench Overview" width="1000"/>
</div>
### Key Features

* 📊 **Large-scale Data:** Built upon **27K original biological protocols**, yielding nearly **556K high-quality structured instances**.
* 🎯 **Comprehensive Tasks:** Five core tasks: **PQA** (Question Answering), **ORD** (Step Ordering), **ERR** (Error Correction), **GEN** (Generation), and **REA** (Reasoning).
* 🧬 **Broad Domain Coverage:** Covers **16 biological subdomains** from 6 major repositories.
* 🔬 **Standardized Evaluation:** A robust framework combining NLP metrics with novel domain-specific measures.

---
## 📂 Dataset Structure & Tasks

<div align="center">
<img src="https://github.com/YuyangSunshine/bioprotocolbench/blob/main/figures/samples.jpg?raw=true" alt="BioProBench Samples" width="1000"/>
</div>

We provide standardized JSON files for each task, now including **Train** and **Test** splits:

| Task | Description | Files |
| :--- | :--- | :--- |
| **PQA** | Protocol Question Answering | `PQA_train.json`, `PQA_test.json` |
| **ORD** | Step Ordering | `ORD_train.json`, `ORD_test.json` |
| **ERR** | Error Correction | `ERR_train.json`, `ERR_test.json` |
| **GEN** | Protocol Generation | `GEN_train.json`, `GEN_test.json` |
| **REA** | Reasoning | `REA_train.json`, `REA_test.json` |
| **Raw** | Full Protocol Corpus | `Bio-protocol.json`, `Protocol-io.json`, etc. |
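Each split ships as a plain JSON file, so it can be inspected without any special tooling. The sketch below shows one way to load and count instances in a split; the field names (`question`, `answer`) and the assumption that each file is a top-level JSON array are illustrative, as the exact schema may differ per task:

```python
import json
from pathlib import Path

def load_split(path):
    """Load one BioProBench task split, assumed to be a JSON array of instances."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Write a tiny stand-in file so this sketch runs end to end; in practice,
# point load_split at the real PQA_test.json downloaded from this repo.
sample = [{"question": "...", "answer": "..."}]  # hypothetical field names
Path("PQA_test.json").write_text(json.dumps(sample), encoding="utf-8")

instances = load_split("PQA_test.json")
print(f"{len(instances)} instances loaded")
```

Because the splits are ordinary JSON, the same files can also be fed directly to most data-loading libraries without conversion.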
### 🔗 Useful Links

* **Official Website:** [BioPro-Project Page](https://yuyangsunshine.github.io/BioPro-Project/)
* **GitHub Repository:** [bioprotocolbench](https://github.com/YuyangSunshine/bioprotocolbench) (code for evaluation & training)

---
## 🔬 Key Findings

We evaluated 12 mainstream LLMs. Our findings reveal:

* **Surface vs. Deep Understanding:** Models perform well on QA (~70% accuracy) but struggle with deep procedural logic.
* **Reasoning Bottleneck:** Performance drops significantly on **Step Ordering** and **Protocol Generation** (BLEU < 15%), highlighting the difficulty of managing temporal dependencies.
* **Bio-specific Models:** Interestingly, some bio-specific models lag behind general LLMs in capturing intricate procedural dependencies, suggesting a need for larger reasoning capacity.

---
## 🤝 Contributing & Contact

We welcome contributions such as new protocol sources, additional domains, or novel tasks!

- **Email:** sunshineliuyuyang@gmail.com
- **Issues:** Feel free to open an issue on our [GitHub](https://github.com/YuyangSunshine/bioprotocolbench).
## 📝 Citation

```bibtex
@misc{bioprotocolbench2025,
  title={BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning},
  author={Yuyang Liu and Liuzhenghao Lv and Xiancheng Zhang and Jingya Wang and Li Yuan and Yonghong Tian},
  year={2025},
  url={https://arxiv.org/pdf/2505.07889}
}
```