About
ClinBench-HPB is a clinically oriented benchmark designed to assess the knowledge and practical diagnostic capabilities of large language models (LLMs) in Hepato-Pancreato-Biliary (HPB) diseases.
For more details about ClinBench-HPB, please refer to our paper: https://arxiv.org/abs/2506.00095
Dataset
ClinBench-HPB contains 3535 closed-ended medical-exam multiple-choice questions and 337 open-ended clinical diagnosis cases, divided into five subsets:
1. CN-QA: 2000 Chinese multiple-choice questions.
2. EN-QA: 1535 English multiple-choice questions.
3. Journal: 120 clinical cases from medical journals.
4. Website: 167 clinical cases from case-sharing websites.
5. Hospital: 50 clinical cases from a collaborating hospital.
Please visit our GitHub repository to download the dataset.
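Once downloaded, the subsets can be loaded with a few lines of Python. The sketch below is a minimal illustration only: the `data/` directory layout, the per-subset JSON file names, and the record structure are assumptions, not the official format, which is defined in the repository.

```python
import json
from pathlib import Path

# Hypothetical layout: one JSON file per subset under a local "data/" directory.
# File names and record fields are assumptions; see the GitHub repository for
# the actual dataset format.
SUBSETS = ["CN-QA", "EN-QA", "Journal", "Website", "Hospital"]

def load_subset(name: str, root: str = "data") -> list[dict]:
    """Load one ClinBench-HPB subset as a list of question/case records."""
    path = Path(root) / f"{name}.json"
    with path.open(encoding="utf-8") as f:
        return json.load(f)

if __name__ == "__main__":
    for name in SUBSETS:
        items = load_subset(name)
        print(f"{name}: {len(items)} items")
```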
Submission
To submit your model, please follow the instructions in the GitHub repository.
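For a rough sense of how the closed-ended subsets are scored, a minimal multiple-choice accuracy function is sketched below. The `id` and `answer` field names and the prediction format are assumptions for illustration, not the official submission schema; follow the repository's evaluation scripts for actual submissions.

```python
# Minimal sketch: accuracy over multiple-choice questions, assuming each gold
# record carries an "id" and a gold option letter in "answer", and predictions
# map question id -> predicted option letter. Not the official schema.
def mcq_accuracy(gold: list[dict], predictions: dict[str, str]) -> float:
    """Return the fraction of questions whose predicted option matches the key."""
    if not gold:
        return 0.0
    correct = sum(
        1
        for item in gold
        if predictions.get(item["id"], "").strip().upper()
        == item["answer"].strip().upper()
    )
    return correct / len(gold)

# Toy example: two questions, one answered correctly.
gold = [{"id": "q1", "answer": "B"}, {"id": "q2", "answer": "D"}]
preds = {"q1": "B", "q2": "A"}
print(f"Accuracy: {mcq_accuracy(gold, preds):.1%}")  # -> Accuracy: 50.0%
```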
Citation
If you use ClinBench-HPB in your research, please cite our paper:
@misc{li2025clinbenchhpbclinicalbenchmarkevaluating,
      title={ClinBench-HPB: A Clinical Benchmark for Evaluating LLMs in Hepato-Pancreato-Biliary Diseases},
      author={Yuchong Li and Xiaojun Zeng and Chihua Fang and Jian Yang and Fucang Jia},
      year={2025},
      eprint={2506.00095},
      archivePrefix={arXiv},
      primaryClass={cs.CY},
      url={https://arxiv.org/abs/2506.00095},
}
Commercial LLMs
Rank | Model | EN-QA | CN-QA | Journal | Website | Hospital | Overall
---|---|---|---|---|---|---|---
- | OpenAI-o1 | 76.0† | 91.7† | 49.0/58.7 | 52.5/80.2 | 11.0/64.1 | 63.9†
1 | DeepSeek-R1 | 79.9 | 89.3 | 43.3/55.3 | 51.2/78.5 | 10.8/64.8 | 63.2
2 | DeepSeekV3-0324 | 80.3 | 88.7 | 38.1/51.3 | 52.7/80.9 | 17.5/71.6 | 63.1
3 | Gemini2.5-pro | 72.4 | 89.7 | 50.0/60.3 | 50.4/80.9 | 13.5/71.2 | 62.4
3 | OpenAI-o3mini | 66.6 | 89.4 | 55.8/63.7 | 53.9/80.5 | 11.5/64.1 | 62.4
5 | DeepSeekV3-1226 | 68.2 | 88.1 | 32.7/47.6 | 54.8/82.6 | 22.0/74.0 | 59.5
6 | Claude3.5-sonnet | 62.3 | 87.8 | 39.4/51.4 | 54.6/82.8 | 5.0/57.0 | 57.6
7 | Qwen2.5-Max | 68.7 | 87.6 | 32.9/46.0 | 42.1/76.2 | 13.0/68.7 | 55.7
8 | GPT-4o | 59.4 | 88.9 | 31.9/44.3 | 37.9/72.0 | 9.0/59.8 | 51.8
Open-source general-purpose LLMs
Rank | Model | EN-QA | CN-QA | Journal | Website | Hospital | Overall
---|---|---|---|---|---|---|---
1 | Llama3.1-70B | 75.5 | 87.9 | 35.6/48.3 | 50.9/80.7 | 10.0/67.3 | 60.1
2 | Qwen2.5-72B | 65.5 | 86.7 | 30.4/43.9 | 47.5/77.2 | 36.9/69.1 | 55.8
3 | Qwen2.5-32B | 60.2 | 86.5 | 26.3/42.1 | 48.4/78.8 | 9.0/64.6 | 53.1
4 | Llama3.1-8B | 56.5 | 83.1 | 32.1/44.1 | 45.8/77.0 | 12.0/66.2 | 52.0
5 | Qwen2.5-7B | 61.8 | 83.0 | 25.6/39.9 | 43.9/73.9 | 10.5/64.9 | 51.7
6 | Qwen2.5-14B | 53.1 | 85.2 | 26.5/40.4 | 39.2/72.3 | 6.5/62.5 | 48.4
Medical LLMs
Rank | Model | EN-QA | CN-QA | Journal | Website | Hospital | Overall
---|---|---|---|---|---|---|---
1 | Baichuan-M1-14B | 65.1 | 87.9 | 32.7/45.7 | 46.3/78.3 | 14.0/64.4 | 55.8
2 | HuatuoGPT-o1-70B | 69.9 | 87.2 | 36.5/47.0 | 25.6/60.5 | 5.5/50.8 | 52.0
3 | HuatuoGPT-o1-72B | 68.9 | 86.3 | 33.8/44.7 | 26.3/60.2 | 4.0/45.5 | 51.0
4 | HuatuoGPT-o1-7B | 68.6 | 82.9 | 23.5/37.4 | 25.3/56.5 | 1.5/40.3 | 48.0
5 | HuatuoGPT-o1-8B | 50.9 | 83.9 | 28.1/37.4 | 15.9/52.2 | 1.5/44.1 | 41.6
Reasoning-enhanced LLMs
Rank | Model | EN-QA | CN-QA | Journal | Website | Hospital | Overall
---|---|---|---|---|---|---|---
1 | QwQ-32B | 72.5 | 86.7 | 41.2/52.9 | 51.1/80.7 | 10.9/65.6 | 60.1
2 | DsR1D-Qwen-32B | 69.5 | 86.6 | 31.5/44.6 | 38.5/71.8 | 10.3/64.0 | 54.4
3 | DsR1D-Llama-70B | 69.9 | 88.9 | 34.7/46.8 | 33.8/69.2 | 6.3/59.3 | 54.1
4 | DsR1D-Qwen-14B | 64.4 | 84.6 | 29.2/42.3 | 37.5/70.9 | 7.1/57.0 | 51.6
5 | DsR1D-Llama-8B | 37.3 | 78.0 | 25.2/36.9 | 26.9/62.0 | 3.5/52.7 | 38.9
6 | DsR1D-Qwen-7B | 25.0 | 66.1 | 20.1/32.4 | 23.2/57.3 | 1.9/43.8 | 30.9