About
ClinBench-HPB is a clinically oriented benchmark designed to assess the knowledge and practical diagnostic capabilities of large language models (LLMs) in Hepato-Pancreato-Biliary (HPB) diseases.
For more details about ClinBench-HPB, please refer to our paper: https://arxiv.org/abs/2506.00095
Dataset
ClinBench-HPB contains 3535 closed-ended medical-exam multiple-choice questions and 337 open-ended clinical diagnosis cases, divided into five subsets:
1. CN-QA: 2000 Chinese multiple-choice questions.
2. EN-QA: 1535 English multiple-choice questions.
3. Journal: 120 clinical cases from medical journals.
4. Website: 167 clinical cases from case-sharing websites.
5. Hospital: 50 clinical cases from a collaborating hospital.
Please visit our GitHub repository to download the dataset.
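Once downloaded, the subsets can be loaded with a few lines of Python. The sketch below is a minimal illustration only: the `data/` directory layout, the per-subset JSON file names, and the record structure are assumptions, not the official format, which is defined in the repository.

```python
import json
from pathlib import Path

# Hypothetical layout: one JSON file per subset under a local "data/" directory.
# File names and record fields are assumptions; see the GitHub repository for
# the actual dataset format.
SUBSETS = ["CN-QA", "EN-QA", "Journal", "Website", "Hospital"]

def load_subset(name: str, root: str = "data") -> list[dict]:
    """Load one ClinBench-HPB subset as a list of question/case records."""
    path = Path(root) / f"{name}.json"
    with path.open(encoding="utf-8") as f:
        return json.load(f)

if __name__ == "__main__":
    for name in SUBSETS:
        items = load_subset(name)
        print(f"{name}: {len(items)} items")
```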
Submission
To submit your model, please follow the instructions in the GitHub repository.
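For a rough sense of how the closed-ended subsets are scored, a minimal multiple-choice accuracy function is sketched below. The `id` and `answer` field names and the prediction format are assumptions for illustration, not the official submission schema; follow the repository's evaluation scripts for actual submissions.

```python
# Minimal sketch: accuracy over multiple-choice questions, assuming each gold
# record carries an "id" and a gold option letter in "answer", and predictions
# map question id -> predicted option letter. Not the official schema.
def mcq_accuracy(gold: list[dict], predictions: dict[str, str]) -> float:
    """Return the fraction of questions whose predicted option matches the key."""
    if not gold:
        return 0.0
    correct = sum(
        1
        for item in gold
        if predictions.get(item["id"], "").strip().upper()
        == item["answer"].strip().upper()
    )
    return correct / len(gold)

# Toy example: two questions, one answered correctly.
gold = [{"id": "q1", "answer": "B"}, {"id": "q2", "answer": "D"}]
preds = {"q1": "B", "q2": "A"}
print(f"Accuracy: {mcq_accuracy(gold, preds):.1%}")  # -> Accuracy: 50.0%
```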
Citation
If you use ClinBench-HPB in your research, please cite our paper:
@misc{li2025clinbenchhpbclinicalbenchmarkevaluating,
      title={ClinBench-HPB: A Clinical Benchmark for Evaluating LLMs in Hepato-Pancreato-Biliary Diseases},
      author={Yuchong Li and Xiaojun Zeng and Chihua Fang and Jian Yang and Fucang Jia},
      year={2025},
      eprint={2506.00095},
      archivePrefix={arXiv},
      primaryClass={cs.CY},
      url={https://arxiv.org/abs/2506.00095},
}
Commercial LLMs
Rank | Model | EN-QA | CN-QA | Journal | Website | Hospital | Overall
---|---|---|---|---|---|---|---
- | OpenAI-o1 | 76.0† | 91.7† | 49.0/58.7 | 52.5/80.2 | 11.0/64.1 | 63.9†
1 | DeepSeek-R1 | 79.9 | 89.3 | 43.3/55.3 | 51.2/78.5 | 10.8/64.8 | 63.2
2 | DeepSeekV3-0324 | 80.3 | 88.7 | 38.1/51.3 | 52.7/80.9 | 17.5/71.6 | 63.1
3 | Gemini2.5-pro | 72.4 | 89.7 | 50.0/60.3 | 50.4/80.9 | 13.5/71.2 | 62.4
3 | OpenAI-o3mini | 66.6 | 89.4 | 55.8/63.7 | 53.9/80.5 | 11.5/64.1 | 62.4
5 | DeepSeekV3-1226 | 68.2 | 88.1 | 32.7/47.6 | 54.8/82.6 | 22.0/74.0 | 59.5
6 | Claude3.5-sonnet | 62.3 | 87.8 | 39.4/51.4 | 54.6/82.8 | 5.0/57.0 | 57.6
7 | Qwen2.5-Max | 68.7 | 87.6 | 32.9/46.0 | 42.1/76.2 | 13.0/68.7 | 55.7
8 | GPT-4o | 59.4 | 88.9 | 31.9/44.3 | 37.9/72.0 | 9.0/59.8 | 51.8
Open-source general-purpose LLMs
Rank | Model | EN-QA | CN-QA | Journal | Website | Hospital | Overall
---|---|---|---|---|---|---|---
1 | Llama3.1-70B | 75.5 | 87.9 | 35.6/48.3 | 50.9/80.7 | 10.0/67.3 | 60.1
2 | Qwen2.5-72B | 65.5 | 86.7 | 30.4/43.9 | 47.5/77.2 | 36.9/69.1 | 55.8
3 | Qwen2.5-32B | 60.2 | 86.5 | 26.3/42.1 | 48.4/78.8 | 9.0/64.6 | 53.1
4 | Llama3.1-8B | 56.5 | 83.1 | 32.1/44.1 | 45.8/77.0 | 12.0/66.2 | 52.0
5 | Qwen2.5-7B | 61.8 | 83.0 | 25.6/39.9 | 43.9/73.9 | 10.5/64.9 | 51.7
6 | Qwen2.5-14B | 53.1 | 85.2 | 26.5/40.4 | 39.2/72.3 | 6.5/62.5 | 48.4
Medical LLMs
Rank | Model | EN-QA | CN-QA | Journal | Website | Hospital | Overall
---|---|---|---|---|---|---|---
1 | Baichuan-M1-14B | 65.1 | 87.9 | 32.7/45.7 | 46.3/78.3 | 14.0/64.4 | 55.8
2 | HuatuoGPT-o1-70B | 69.9 | 87.2 | 36.5/47.0 | 25.6/60.5 | 5.5/50.8 | 52.0
3 | HuatuoGPT-o1-72B | 68.9 | 86.3 | 33.8/44.7 | 26.3/60.2 | 4.0/45.5 | 51.0
4 | HuatuoGPT-o1-7B | 68.6 | 82.9 | 23.5/37.4 | 25.3/56.5 | 1.5/40.3 | 48.0
5 | HuatuoGPT-o1-8B | 50.9 | 83.9 | 28.1/37.4 | 15.9/52.2 | 1.5/44.1 | 41.6
Reasoning-enhanced LLMs
Rank | Model | EN-QA | CN-QA | Journal | Website | Hospital | Overall
---|---|---|---|---|---|---|---
1 | QwQ-32B | 72.5 | 86.7 | 41.2/52.9 | 51.1/80.7 | 10.9/65.6 | 60.1
2 | DsR1D-Qwen-32B | 69.5 | 86.6 | 31.5/44.6 | 38.5/71.8 | 10.3/64.0 | 54.4
3 | DsR1D-Llama-70B | 69.9 | 88.9 | 34.7/46.8 | 33.8/69.2 | 6.3/59.3 | 54.1
4 | DsR1D-Qwen-14B | 64.4 | 84.6 | 29.2/42.3 | 37.5/70.9 | 7.1/57.0 | 51.6
5 | DsR1D-Llama-8B | 37.3 | 78.0 | 25.2/36.9 | 26.9/62.0 | 3.5/52.7 | 38.9
6 | DsR1D-Qwen-7B | 25.0 | 66.1 | 20.1/32.4 | 23.2/57.3 | 1.9/43.8 | 30.9