---
title: >-
  BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol
  Understanding and Reasoning
license: cc-by-4.0
task_categories:
  - text-generation
  - question-answering
language:
  - en
tags:
  - biology
  - protocol
  - benchmark
  - ai4science
---

<div align="center">
  <img src="https://github.com/YuyangSunshine/bioprotocolbench/blob/main/figures/logo-v3.png?raw=true" alt="BioProBench Logo" width="300"/>
</div>

# BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning

[![ArXiv](https://img.shields.io/badge/ArXiv-paper-B31B1B.svg?logo=arXiv&logoColor=Red)](https://arxiv.org/pdf/2505.07889)
[![Hugging Face](https://img.shields.io/badge/Hugging%20Face-Dataset-FFD210.svg?logo=HuggingFace&logoColor=black)](https://huggingface.co/BioProBench)
[![GitHub](https://img.shields.io/badge/GitHub-Code-181717.svg?logo=github&logoColor=white)](https://github.com/YuyangSunshine/bioprotocolbench)
[![Project Page](https://img.shields.io/badge/Project-Page-blue.svg?logo=google-chrome&logoColor=white)](https://yuyangsunshine.github.io/BioPro-Project/)
[![License: CC BY 4.0](https://img.shields.io/badge/License-CC_BY_4.0-lightgrey.svg)](https://creativecommons.org/licenses/by/4.0/)

---

## πŸ“’ Latest News 
* ✨ **[2026-03-31]** **Data Split Update!** We have officially released the **Train/Test splits** for each task (PQA, ORD, ERR, GEN, REA), making it easier for the community to train and evaluate models consistently.
* πŸ”₯ **[2026-03-18]** Our **BioProAgent** is now live on AI4S LAB! [Try it out and order wet-lab experiments here](https://yuyangsunshine.github.io/BioPro-Project/).
* πŸŽ‰ **[2026-03-03]** Our BioProAgent has been accepted by the **ICLR 2026 LLA Workshop!**
* πŸ“ **[2026-01-21]** BioProBench paper has been updated with new experimental results.[arXiv](https://arxiv.org/pdf/2505.07889).
* πŸš€ **[2025-12-01]** Code and dataset (v1.0) are released on GitHub.

---

## 🌟 Introduction
**BioProBench** is the first large-scale, integrated multi-task benchmark for biological protocol understanding and reasoning, specifically designed for Large Language Models (LLMs). It moves beyond simple QA to encompass a comprehensive suite of tasks critical for procedural text comprehension in life sciences.

<div align="center">
  <img src="https://github.com/YuyangSunshine/bioprotocolbench/blob/main/figures/overview.png?raw=true" alt="BioProBench Overview" width="1000"/>
</div>

### Key Features:
* πŸ“š **Large-scale Data:** Built upon **27K original biological protocols**, yielding nearly **556K high-quality structured instances**.
* 🎯 **Comprehensive Tasks:** 5 core tasks: **PQA** (Question Answering), **ORD** (Step Ordering), **ERR** (Error Correction), **GEN** (Generation), and **REA** (Reasoning).
* 🧬 **Broad Domain Coverage:** Covers **16 biological subdomains** from 6 major repositories.
* πŸ”¬ **Standardized Evaluation:** A robust framework combining NLP metrics with novel domain-specific measures.

---

## πŸ“Š Dataset Structure & Tasks

<div align="center">
  <img src="https://github.com/YuyangSunshine/bioprotocolbench/blob/main/figures/samples.jpg?raw=true" alt="BioProBench Samples" width="1000"/>
</div>

We provide standardized JSON files for each task, now including **Train** and **Test** splits:

| Task | Description | Files |
| :--- | :--- | :--- |
| **PQA** | Protocol Question Answering | `PQA_train.json`, `PQA_test.json` |
| **ORD** | Step Ordering | `ORD_train.json`, `ORD_test.json` |
| **ERR** | Error Correction | `ERR_train.json`, `ERR_test.json` |
| **GEN** | Protocol Generation | `GEN_train.json`, `GEN_test.json` |
| **REA** | Reasoning | `REA_train.json`, `REA_test.json` |
| **Raw** | Full Protocol Corpus | `Bio-protocol.json`, `Protocol-io.json`, etc. |
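Each split is plain JSON, so it can be inspected with nothing beyond the standard library. A minimal sketch of loading a task file (the record fields shown here are illustrative assumptions, not the dataset's actual schema):

```python
import json
from pathlib import Path

# Write a tiny stand-in file so the snippet is self-contained;
# in practice you would download PQA_train.json from this repository.
# NOTE: the fields below are hypothetical, not the dataset's real schema.
sample_records = [
    {
        "question": "At what temperature is the overnight incubation performed?",
        "options": ["4 C", "25 C", "37 C", "95 C"],
        "answer": "37 C",
    }
]
path = Path("PQA_train.json")
path.write_text(json.dumps(sample_records), encoding="utf-8")

# Load a task split and iterate over its instances.
records = json.loads(path.read_text(encoding="utf-8"))
for rec in records:
    print(rec["question"], "->", rec["answer"])
```

Check the actual files (or the GitHub evaluation code) for the real per-task field names before building a loader.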

### πŸ”— Useful Links
* **Official Website:** [BioPro-Project Page](https://yuyangsunshine.github.io/BioPro-Project/)
* **GitHub Repository:** [bioprotocolbench](https://github.com/YuyangSunshine/bioprotocolbench) (Code for evaluation & training)

---

## πŸ”¬ Key Findings
We evaluated 12 mainstream LLMs. Our findings reveal:
* **Surface vs. Deep Understanding:** Models perform well on question answering (~70% accuracy) but struggle with deep procedural logic.
* **Reasoning Bottleneck:** Performance drops significantly on **Step Ordering** and **Protocol Generation** (BLEU < 15%), highlighting the difficulty of managing temporal dependencies.
* **Bio-specific Models:** Interestingly, some bio-specific models lag behind general LLMs in capturing intricate procedural dependencies, suggesting a need for larger reasoning capacity.

---

## 🀝 Contributing & Contact
We welcome contributions such as new protocol sources, additional domains, or novel tasks!
- **Email:** sunshineliuyuyang@gmail.com
- **Issues:** Feel free to open an issue on our [GitHub](https://github.com/YuyangSunshine/bioprotocolbench).

## πŸ“œ Citation
```bibtex
@misc{bioprotocolbench2025,
  title={BioProBench: Comprehensive Dataset and Benchmark in Biological Protocol Understanding and Reasoning},
  author={Yuyang Liu and Liuzhenghao Lv and Xiancheng Zhang and Jingya Wang and Li Yuan and Yonghong Tian},
  year={2025},
  url={https://arxiv.org/pdf/2505.07889}
}
```