ChemCoTBench

Beyond Chemical QA: Evaluating LLM's Chemical Reasoning with Modular Chemical Operations

Hao Li*, He Cao*, Bin Feng, Yanjun Shao, Xiangru Tang, Zhiyuan Yan, Li Yuan†, Yonghong Tian†, Yu Li†

Peking University, International Digital Economy Academy (IDEA), Yale University

*Equal Contributors, †Corresponding Authors
Contact: lihao1984@pku.edu.cn, caohe@idea.edu.cn, yuanli-ece@pku.edu.cn, liyu@idea.edu.cn

Sample overview of ChemCoTBench and the ChemCoT-Dataset, which contain high-quality chain-of-thought samples from two chemical foundation tasks (molecule understanding, molecule editing) and two application tasks (molecule optimization, reaction prediction).

🔔News

🚀[2025-06-09]: We release ChemCoTBench and report evaluation scores for open-source LLMs, commercial LLMs, and chemical LLMs on the Leaderboard!🌟

Introduction

Overview of ChemCoTBench and the ChemCoT-Dataset. ChemCoTBench presents four key strengths: 1) Novel Setup: beyond ChemQA, it is the first step-wise chemical reasoning benchmark; 2) Broad Coverage: it includes foundational tasks (molecule understanding, editing) and two critical applications, molecular optimization (for drug design) and reaction prediction (for organic synthesis); 3) Large Scale: it provides 1,495 samples across 22 subtasks for benchmarking, plus 17K high-quality chain-of-thought samples for model training; 4) High Quality: curated by 13 chemistry Ph.D.s from Tsinghua and Peking University, with strict prompt constraints to ensure CoT correctness.

ChemCoT-Benchmark

Overview

Despite recent advances in LLM reasoning capabilities, chemistry, a discipline fundamental to areas like drug discovery and materials science, still lacks a benchmark that assesses whether these improvements extend to its complex, domain-specific problem-solving needs. While several benchmarks have been proposed for LLMs in chemistry, they primarily focus on domain-specific question answering, which suffers from several key limitations:

1. Lack of Structured, Stepwise Reasoning and Real-World Relevance: Current evaluations often reduce chemistry assessment to factual recall (e.g., naming compounds or reactions), neglecting the need for operational reasoning akin to arithmetic or coding. Unlike mathematical problems, where solutions demand explicit, verifiable steps, chemistry QA tasks fail to simulate how experts decompose challenges. For instance, they don't capture the process of iteratively refining a molecule’s substructure to optimize properties, considering crucial real-world factors like synthesizability or toxicity, or deducing reaction mechanisms through intermediate transformations. This gap means we're not fully evaluating the analytical depth required in real-world chemistry. Therefore, evaluations must shift from these textbook-like problems to challenges that better reflect practical applications.

2. Ambiguous Skill Attribution in Hybrid Evaluations: Existing benchmarks often conflate reasoning, knowledge recall, and numerical computation into single "exam-style" metrics—for instance, asking LLMs to calculate reaction yields while simultaneously recalling reagent properties. This obscures whether strong performance stems from structured reasoning (e.g., analyzing reaction pathways) or memorized facts (e.g., solvent boiling points). Such ambiguity hinders targeted model improvement and misaligns evaluations with downstream tasks like drug discovery, where success depends on modular reasoning (e.g., decoupling molecular design from synthesizability checks) rather than monolithic problem-solving.


To address these limitations, we introduce ChemCoTBench, a step-by-step, application-oriented, and high-quality benchmark for evaluating LLM reasoning in chemical applications. A core innovation of ChemCoTBench is its formulation of complex chemical tasks, specifically targeting molecular modeling and design, into explicit sequences of verifiable modular chemical operations on SMILES structures (e.g., substructure addition, deletion, or substitution). This approach allows for a granular assessment of an LLM's ability to execute and chain together fundamental chemical transformations. The benchmark features progressively challenging tasks, spanning from basic molecular understanding and editing to property-guided structure optimization and complex multi-molecule chemical reactions. High-quality evaluation is ensured through a dual validation process combining LLM judgment with expert review from 13 chemists.
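
To make the notion of a verifiable modular operation concrete, below is a minimal sketch in Python using RDKit (not the benchmark's own tooling); the apply_edit helper, the SMARTS pattern, and the example molecules are illustrative assumptions, and verification is done by comparing canonical SMILES.

```python
# Illustrative sketch: applying and verifying a modular SMILES edit with RDKit.
# The apply_edit helper and the example molecules are assumptions for illustration,
# not ChemCoTBench's official evaluation code.
from rdkit import Chem

def apply_edit(smiles, operation, pattern, replacement=None):
    """Apply a single modular edit (delete / substitute) and return canonical SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    patt = Chem.MolFromSmarts(pattern)
    if mol is None or patt is None:
        return None
    if operation == "delete":
        edited = Chem.DeleteSubstructs(mol, patt)
    elif operation == "substitute":
        repl = Chem.MolFromSmiles(replacement)
        edited = Chem.ReplaceSubstructs(mol, patt, repl, replaceAll=True)[0]
    else:
        return None  # "add" needs an explicit attachment point; omitted for brevity
    try:
        Chem.SanitizeMol(edited)  # reject chemically invalid results
    except Exception:
        return None
    return Chem.MolToSmiles(edited)  # canonical form enables exact-match checking

# Example: replace aspirin's carboxylic acid with a methyl ester, then verify a
# model's answer by canonical-SMILES equality.
aspirin = "CC(=O)Oc1ccccc1C(=O)O"
edited = apply_edit(aspirin, "substitute", "C(=O)[OH]", "C(=O)OC")
model_answer = "COC(=O)c1ccccc1OC(C)=O"
print(edited == Chem.MolToSmiles(Chem.MolFromSmiles(model_answer)))  # True
```

Because every intermediate step yields a canonical SMILES string, each link in a model's chain of operations can be checked on its own rather than only grading the final answer.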

Pipeline and Statistics

Experiment Results

Leaderboard

Our evaluation covers three categories of LLMs: (1) Reasoning LLMs with explicit step-by-step reasoning, including DeepSeek-R1, o1-mini, o3-mini, Gemini-2.5-pro, Claude-3.7-Sonnet-thinking, Qwen-3-thinking, and Llama-Nemotron-thinking; (2) general-purpose non-reasoning LLMs without specialized reasoning mechanisms, including GPT-4o, Qwen-2.5/3, Llama-3.3, Gemma-2, Phi-4, and OLMo2; (3) biomolecular LLMs, including BioMedGPT, BioMistral, and Text+Chem T5. This comparison assesses whether reasoning-specific capabilities provide advantages over domain-specific models on challenging chemical reasoning tasks.


Molecule Understanding and Editing Evaluation

Name Type Version | Func-Group: FG↓ Ring↓ | Scaffold: Murcko↑ Ring-sys↑ | SMILES: Eq.↑ | Molecule-Edit: Add Delete Sub
W/ Thinking
Gemini-2.5-pro think 2.5 0.11 0.60 0.51 87.5 82 100 85 81.7
Claude3.7-sonnet think 3.7 0.21 1.60 0.40 80.0 84 85 80 83.4
DeepSeek-R1 think R1 0.27 1.55 0.34 45.0 65 70 70 68.3
o3-mini think 20250103 0.13 0.60 0.39 75.0 78 65 55 80.0
o1-mini think 20240912 0.21 1.25 0.25 61.7 66 55 80 58.3
Qwen3-235B-A22B think 3-235B 0.42 1.00 0.38 82.5 72 40 75 71.7
Qwen3-32B think 3-32B 0.25 0.95 0.21 75.0 68 20 55 20.0
Llama-Nemo-49B think 49B 0.80 1.90 0.09 86.8 46 0 80 8.0
W/o Thinking
GPT-4o base 20241120 0.17 1.35 0.21 80.0 72 80 80 65.0
Deepseek-V3 base V3 0.15 1.50 0.24 76.7 77 70 75 76.7
Gemini-2.0-flash base 2.0 0.19 1.65 0.43 75.0 76 65 75 66.7
Qwen3-235B-A22B base 3-235B 0.42 1.00 0.34 82.5 75 40 75 66.7
Qwen3-32B base 3-32B 0.26 0.95 0.22 68.3 67 30 55 25.0
Qwen2.5-72B-Instruct base 2.5-72B 0.26 0.60 0.24 70.0 61 70 80 56.7
Qwen2.5-32B-Instruct base 2.5-32B 0.36 0.65 0.12 53.3 62 50 50 48.3
Llama-3.1-70B-Instruct base 3.1-70B 0.52 1.80 0.12 68.3 67 60 80 50.0
Llama-Nemo-49B base 49B 0.72 1.77 0.11 65.0 54 30 55 30.5
Gemma-2-27b-it base 2-27b 0.19 1.65 0.43 66.7 76 75 70 35.0
Phi-4-14B base 4-14B 0.28 1.65 0.15 70.0 65 60 80 38.3
OLMo2-32B-Instruct base 2-32B 0.19 1.05 0.07 63.3 50 15 30 11.7
BioMedGPT-7B base 7B 1.6 2.43 0.18 53.3 39 10 12 10
BioMistral-7B base 7B 1.0 1.85 0.04 32.5 50 0 10 0

Overall results of different models on molecule understanding and editing tasks. The best-performing model in each category is in bold. ↓ indicates lower is better; ↑ indicates higher is better.
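
For readers unfamiliar with the scaffold- and SMILES-level columns, the sketch below shows the kind of RDKit checks they imply: Bemis-Murcko scaffold agreement and canonical-SMILES equivalence. The exact metric definitions follow the paper; the helper functions and example molecules here are illustrative assumptions.

```python
# Rough sketch of scaffold and SMILES-equivalence checks with RDKit
# (assumed metric form, not the benchmark's official scoring script).
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def same_murcko_scaffold(smiles_a, smiles_b):
    """True if two molecules share the same Bemis-Murcko scaffold."""
    return (MurckoScaffold.MurckoScaffoldSmiles(smiles=smiles_a)
            == MurckoScaffold.MurckoScaffoldSmiles(smiles=smiles_b))

def smiles_equivalent(smiles_a, smiles_b):
    """True if two SMILES strings denote the same molecule (canonical-form match)."""
    mol_a, mol_b = Chem.MolFromSmiles(smiles_a), Chem.MolFromSmiles(smiles_b)
    if mol_a is None or mol_b is None:
        return False
    return Chem.MolToSmiles(mol_a) == Chem.MolToSmiles(mol_b)

# Toluene vs. ethylbenzene: same benzene scaffold, but different molecules.
print(same_murcko_scaffold("Cc1ccccc1", "CCc1ccccc1"))  # True
print(smiles_equivalent("Cc1ccccc1", "CCc1ccccc1"))     # False
```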


Molecule Optimization Evaluation

Models LogP (Δ) Solubility (Δ) QED (Δ) DRD2 (Δ) JNK3 (Δ) GSK3-β (Δ)
W/ Thinking
Gemini-2.5-pro-think -0.2881 1.9192 0.2184 0.3574 -0.0435 0.0468
Claude3.7-sonnet-think 0.4181 0.5977 0.0973 0.1866 -0.0149 0.0157
DeepSeek-R1 0.3674 1.4897 0.0572 0.1062 -0.0629 -0.0241
o3-mini@20250103 0.2968 1.1585 0.1786 0.1869 -0.0823 -0.0345
o1-mini@20240912 -0.4252 1.7895 0.0770 -0.0337 -0.1015 -0.0831
Qwen3-235B-A22B-think -0.0141 0.2742 0.0124 0.0331 -0.0123 0.0131
Qwen3-32B-think 0.02 0.1123 0.0214 0.06 -0.026 -0.025
Llama-Nemo-49B-think -0.6424 0.2024 -0.1641 -0.0530 -0.157 -0.1211
W/o Thinking
GPT-4o@20241120 -0.2042 0.8280 0.0570 0.0548 -0.0530 -0.0439
DeepSeek-V3 0.0834 0.4793 0.0846 0.0228 0.018 0.029
Gemini-2.0-flash 0.3575 0.1954 0.1079 0.1563 0.0334 0.038
Qwen3-235B-A22B 0.0241 0.5145 0.0126 0.0131 -0.0123 0.034
Qwen3-32B -0.032 0.1723 0.0214 -0.016 -0.026 -0.025
Qwen2.5-72B-Instruct -0.1242 0.2860 0.0357 0.0440 -0.0226 -0.0140
Qwen2.5-32B-Instruct 0.0347 0.4266 -0.0154 0.0432 -0.0419 -0.0231
Llama-3.3-70B-Instruct -0.1642 0.6180 0.0761 -0.0231 -0.0430 -0.0240
Llama-Nemo-Super-49B -0.1427 0.3141 0.0250 -0.0218 -0.0416 -0.0327
Gemma-2-27b-it -0.0334 0.3466 0.0556 -0.0315 0.016 -0.0117
Phi-4-14B -0.1045 0.2854 0.1174 -0.0418 -0.0514 -0.0422
OLMo2-32B-Instruct -2.0322 1.0346 -0.1340 -0.117 -0.128 -0.1112

Results of different models on molecule optimization tasks. The best-performing model in each category is in bold.
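
The columns above report the change (Δ) in each property between the source molecule and the model-optimized molecule. As a minimal sketch, and assuming RDKit's standard descriptors stand in for the LogP and QED oracles (DRD2, JNK3, and GSK3-β require trained bioactivity predictors not shown here), a delta could be computed as follows; the property_delta helper is a hypothetical name.

```python
# Illustrative property-delta computation for LogP / QED with RDKit
# (assumed metric form, not the paper's exact evaluation protocol).
from rdkit import Chem
from rdkit.Chem import Descriptors, QED

def property_delta(src_smiles, opt_smiles, prop="logp"):
    """Return optimized-minus-source property value, or None for invalid SMILES."""
    src, opt = Chem.MolFromSmiles(src_smiles), Chem.MolFromSmiles(opt_smiles)
    if src is None or opt is None:
        return None
    score = Descriptors.MolLogP if prop == "logp" else QED.qed
    return score(opt) - score(src)

# Example: appending a propyl chain to benzene raises the computed LogP.
print(property_delta("c1ccccc1", "CCCc1ccccc1", prop="logp"))
```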

Chemical Reactions Evaluation

Name Type Version | Fwd-major: Top-1 FTS↑ | Fwd-by: Top-1 FTS↑ | Retro: Top-1 FTS↑ | Condition: Top-1 FTS↑ | NEPP: Top-1 FTS↑ | MechSel: Acc.↑
W/ Thinking
Gemini-2.5-pro think 2.5 0.72 0.89 0.20 0.51 0.20 0.45 0.20 0.33 0.58 0.53 0.62
Claude3.7-sonnet think 3.7 0.73 0.87 0.25 0.31 0.12 0.27 0.14 0.22 0.24 0.79 0.49
DeepSeek-R1 think R1 0.48 0.71 0.21 0.45 0.07 0.41 0.23 0.30 0.15 0.55 0.46
o3-mini think 20250103 0.52 0.71 0.20 0.27 0.11 0.39 0.19 0.19 0.18 0.58 0.49
o1-mini think 20240912 0.26 0.31 0.11 0.17 0.02 0.15 0.08 0.22 0.09 0.33 0.44
Qwen3-235B-A22B think 3-235B 0.03 0.54 0.0 0.07 0.01 0.42 0.20 0.27 0.09 0.63 0.41
Qwen3-32B think 3-32B 0.11 0.33 0.09 0.18 0.02 0.24 0.14 0.20 0.08 0.67 0.46
Llama-Nemo-49B think 49B 0.09 0.18 0.04 0.18 0.0 0.05 0.18 0.19 0.04 0.21 0.47
W/o Thinking
GPT-4o base 20241120 0.28 0.58 0.04 0.20 0.03 0.43 0.0 0.08 0.12 0.71 0.43
DeepSeek-V3 base V3 0.36 0.62 0.04 0.30 0.03 0.44 0.08 0.16 0.20 0.70 0.45
Gemini-2.0-flash base 2.0 0.19 0.56 0.01 0.07 0.05 0.41 0.07 0.08 0.13 0.68 0.53
Qwen3-235B-A22B base 3-235B 0.04 0.57 0.0 0.06 0.0 0.30 0.07 0.14 0.07 0.59 0.40
Qwen3-32B base 3-32B 0.06 0.57 0.0 0.13 0.0 0.43 0.01 0.10 0.08 0.67 0.46
Qwen2.5-72B-Instruct base 2.5-72B 0.04 0.49 0.0 0.13 0.01 0.35 0.01 0.07 0.06 0.60 0.46
Qwen2.5-32B-Instruct base 2.5-32B 0.01 0.43 0.0 0.12 0.0 0.29 0.02 0.10 0.05 0.50 0.45
Llama-3.3-70B-Instruct base 3.3-70B 0.02 0.35 0.0 0.08 0.0 0.34 0.06 0.13 0.06 0.41 0.39
Llama-Nemo-49B base 49B 0.04 0.40 0.0 0.08 0.0 0.30 0.03 0.05 0.05 0.41 0.46
Gemma-2-27b-it base 2-27b 0.01 0.55 0.0 0.04 0.0 0.48 0.03 0.10 0.04 0.53 0.43
Phi-4-14B base 4-14B 0.01 0.27 0.03 0.10 0.0 0.39 0.0 0.03 0.05 0.57 0.39
OLMo2-32B-Instruct base 2-32B 0.0 0.10 0.0 0.07 0.0 0.10 0.0 0.03 0.01 0.13 0.32
Text+Chem T5 special T5 0.44 0.74 0.0 0.07 0.06 0.24 0.0 0.09 0.0 0.0 0.10

Performance on chemical reaction tasks. The best-performing model in each category is in bold. ↑ indicates higher is better.
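
FTS is commonly read as fingerprint Tanimoto similarity between the predicted and reference molecules; under that assumption, a minimal scoring sketch with RDKit Morgan fingerprints looks like the following (the tanimoto_similarity helper and example SMILES are illustrative, not the benchmark's code).

```python
# Sketch of a fingerprint Tanimoto similarity (FTS) score with RDKit,
# assuming Morgan fingerprints; not the official evaluation script.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto_similarity(pred_smiles, ref_smiles, radius=2, n_bits=2048):
    """Morgan-fingerprint Tanimoto similarity between predicted and reference molecules."""
    pred, ref = Chem.MolFromSmiles(pred_smiles), Chem.MolFromSmiles(ref_smiles)
    if pred is None or ref is None:
        return 0.0  # an unparsable prediction scores zero
    fp_pred = AllChem.GetMorganFingerprintAsBitVect(pred, radius, nBits=n_bits)
    fp_ref = AllChem.GetMorganFingerprintAsBitVect(ref, radius, nBits=n_bits)
    return DataStructs.TanimotoSimilarity(fp_pred, fp_ref)

# Predicted ethyl acetate vs. reference methyl acetate: similar but not identical.
print(tanimoto_similarity("CCOC(C)=O", "COC(C)=O"))
```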

BibTeX


          @article{li2025beyond,
            title={Beyond Chemical QA: Evaluating LLM's Chemical Reasoning with Modular Chemical Operations},
            author={Li, Hao and Cao, He and Feng, Bin and Shao, Yanjun and Tang, Xiangru and Yan, Zhiyuan and Yuan, Li and Tian, Yonghong and Li, Yu},
            journal={arXiv preprint arXiv:2505.21318},
            year={2025}
          }