ClinBench-HPB

Evaluating LLMs in Hepato-Pancreato-Biliary Diseases

About

ClinBench-HPB is a clinically oriented benchmark designed to assess the knowledge and practical diagnostic capabilities of large language models (LLMs) in Hepato-Pancreato-Biliary (HPB) diseases.

For more details about ClinBench-HPB, please refer to this paper:

Dataset

ClinBench-HPB contains 3535 closed-ended medical-exam multiple-choice questions and 337 open-ended clinical diagnosis cases, divided into five subsets:

1. CN-QA: 2000 Chinese multiple-choice questions.

2. EN-QA: 1535 English multiple-choice questions.

3. Journal: 120 clinical cases from medical journals.

4. Website: 167 clinical cases from case-sharing websites.

5. Hospital: 50 clinical cases from a collaborating hospital.

Please visit our GitHub repository to download the dataset:
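For orientation, the minimal sketch below shows one way the five subsets could be loaded for evaluation. The directory name clinbench_hpb/ and the one-JSON-file-per-subset layout are assumptions made purely for illustration; the actual file structure is described in the GitHub repository.

# Hypothetical loader for the five ClinBench-HPB subsets.
# File names and JSON layout are assumptions for illustration only.
import json
from pathlib import Path

SUBSETS = ["CN-QA", "EN-QA", "Journal", "Website", "Hospital"]

def load_subset(root: str, name: str):
    """Read one subset from an assumed <root>/<name>.json file."""
    with open(Path(root) / f"{name}.json", encoding="utf-8") as f:
        return json.load(f)

if __name__ == "__main__":
    data = {name: load_subset("clinbench_hpb", name) for name in SUBSETS}
    for name, items in data.items():
        # Assumes each file holds a list of question or case records.
        print(f"{name}: {len(items)} items")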

Submission

To submit your model, please follow the instructions in the GitHub repository.

Citation

If you use ClinBench-HPB in your research, please cite our paper as follows:

@misc{li2025clinbenchhpbclinicalbenchmarkevaluating,
  title={ClinBench-HPB: A Clinical Benchmark for Evaluating LLMs in Hepato-Pancreato-Biliary Diseases},
  author={Yuchong Li and Xiaojun Zeng and Chihua Fang and Jian Yang and Fucang Jia},
  year={2025},
  eprint={2506.00095},
  archivePrefix={arXiv},
  primaryClass={cs.CY},
  url={https://arxiv.org/abs/2506.00095}, 
}
Leaderboard

We use accuracy for the EN-QA and CN-QA subsets, and patient-level recall / disease-level recall for the Journal, Website, and Hospital subsets. Details of how each metric is computed and the exact model versions evaluated are given in the paper.
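To make the metric definitions concrete, here is a minimal Python sketch. It assumes each open-ended case provides a set of ground-truth diagnoses and a set of diagnoses parsed from the model's answer; the field names and matching rules are illustrative assumptions, not the official evaluation code from the paper or repository.

# Hypothetical sketch of the metrics described above (not the official code).
def accuracy(preds, golds):
    """Exact-match accuracy for the EN-QA / CN-QA multiple-choice subsets."""
    correct = sum(p.strip().upper() == g.strip().upper() for p, g in zip(preds, golds))
    return correct / len(golds)

def recall_metrics(cases):
    """Patient-level and disease-level recall for the open-ended case subsets.

    One plausible reading of the two metrics (an assumption, see the paper):
      - patient-level recall: fraction of cases whose ground-truth diagnoses
        are all recovered by the model;
      - disease-level recall: fraction of all ground-truth diagnoses, pooled
        over cases, that the model recovers.
    """
    hit_cases, hit_diseases, total_diseases = 0, 0, 0
    for case in cases:
        gold = set(case["diagnoses"])              # assumed field: ground-truth diagnoses
        pred = set(case["predicted_diagnoses"])    # assumed field: diagnoses parsed from the model answer
        if gold and gold <= pred:
            hit_cases += 1
        hit_diseases += len(gold & pred)
        total_diseases += len(gold)
    return hit_cases / len(cases), hit_diseases / total_diseases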

Commercial LLMs

Rank | Model            | EN-QA | CN-QA | Journal   | Website   | Hospital  | Overall
-    | OpenAI-o1        | 76.0  | 91.7  | 49.0/58.7 | 52.5/80.2 | 11.0/64.1 | 63.9
1    | DeepSeek-R1      | 79.9  | 89.3  | 43.3/55.3 | 51.2/78.5 | 10.8/64.8 | 63.2
2    | DeepSeekV3-0324  | 80.3  | 88.7  | 38.1/51.3 | 52.7/80.9 | 17.5/71.6 | 63.1
3    | Gemini2.5-pro    | 72.4  | 89.7  | 50.0/60.3 | 50.4/80.9 | 13.5/71.2 | 62.4
3    | OpenAI-o3mini    | 66.6  | 89.4  | 55.8/63.7 | 53.9/80.5 | 11.5/64.1 | 62.4
5    | DeepSeekV3-1226  | 68.2  | 88.1  | 32.7/47.6 | 54.8/82.6 | 22.0/74.0 | 59.5
6    | Claude3.5-sonnet | 62.3  | 87.8  | 39.4/51.4 | 54.6/82.8 | 5.0/57.0  | 57.6
7    | Qwen2.5-Max      | 68.7  | 87.6  | 32.9/46.0 | 42.1/76.2 | 13.0/68.7 | 55.7
8    | GPT-4o           | 59.4  | 88.9  | 31.9/44.3 | 37.9/72.0 | 9.0/59.8  | 51.8
"†" means the result on the sampled subset.

Open-source general-purpose LLMs

Rank | Model        | EN-QA | CN-QA | Journal   | Website   | Hospital  | Overall
1    | Llama3.1-70B | 75.5  | 87.9  | 35.6/48.3 | 50.9/80.7 | 10.0/67.3 | 60.1
2    | Qwen2.5-72B  | 65.5  | 86.7  | 30.4/43.9 | 47.5/77.2 | 36.9/69.1 | 55.8
3    | Qwen2.5-32B  | 60.2  | 86.5  | 26.3/42.1 | 48.4/78.8 | 9.0/64.6  | 53.1
4    | Llama3.1-8B  | 56.5  | 83.1  | 32.1/44.1 | 45.8/77.0 | 12.0/66.2 | 52.0
5    | Qwen2.5-7B   | 61.8  | 83.0  | 25.6/39.9 | 43.9/73.9 | 10.5/64.9 | 51.7
6    | Qwen2.5-14B  | 53.1  | 85.2  | 26.5/40.4 | 39.2/72.3 | 6.5/62.5  | 48.4

Medical LLMs

Rank | Model            | EN-QA | CN-QA | Journal   | Website   | Hospital  | Overall
1    | Baichuan-M1-14B  | 65.1  | 87.9  | 32.7/45.7 | 46.3/78.3 | 14.0/64.4 | 55.8
2    | HuatuoGPT-o1-70B | 69.9  | 87.2  | 36.5/47.0 | 25.6/60.5 | 5.5/50.8  | 52.0
3    | HuatuoGPT-o1-72B | 68.9  | 86.3  | 33.8/44.7 | 26.3/60.2 | 4.0/45.5  | 51.0
4    | HuatuoGPT-o1-7B  | 68.6  | 82.9  | 23.5/37.4 | 25.3/56.5 | 1.5/40.3  | 48.0
5    | HuatuoGPT-o1-8B  | 50.9  | 83.9  | 28.1/37.4 | 15.9/52.2 | 1.5/44.1  | 41.6

Reasoning-enhanced LLMs

Rank | Model           | EN-QA | CN-QA | Journal   | Website   | Hospital  | Overall
1    | QwQ-32B         | 72.5  | 86.7  | 41.2/52.9 | 51.1/80.7 | 10.9/65.6 | 60.1
2    | DsR1D-Qwen-32B  | 69.5  | 86.6  | 31.5/44.6 | 38.5/71.8 | 10.3/64.0 | 54.4
3    | DsR1D-Llama-70B | 69.9  | 88.9  | 34.7/46.8 | 33.8/69.2 | 6.3/59.3  | 54.1
4    | DsR1D-Qwen-14B  | 64.4  | 84.6  | 29.2/42.3 | 37.5/70.9 | 7.1/57.0  | 51.6
5    | DsR1D-Llama-8B  | 37.3  | 78.0  | 25.2/36.9 | 26.9/62.0 | 3.5/52.7  | 38.9
6    | DsR1D-Qwen-7B   | 25.0  | 66.1  | 20.1/32.4 | 23.2/57.3 | 1.9/43.8  | 30.9