ChemCoTBench

Beyond Chemical QA: Evaluating LLM's Chemical Reasoning with Modular Chemical Operations

Hao Li*, He Cao*, Bin Feng, Yanjun Shao, Xiangru Tang, Zhiyuan Yan, Li Yuan†, Yonghong Tian†, Yu Li†

Peking University, International Digital Economy Academy (IDEA), Yale University

*Equal Contributors, †Corresponding Authors
Contact: lihao1984@pku.edu.cn, caohe@idea.edu.cn, yuanli-ece@pku.edu.cn, liyu@idea.edu.cn

Sample overview of ChemCoTBench and the ChemCoT-Dataset, which contain high-quality chain-of-thought samples from two chemical foundation tasks (molecule understanding, molecule editing) and two application tasks (molecule optimization, reaction prediction).

🔔News

🚀[2025-06-09]: We release ChemCoTBench and report evaluation scores for open-source LLMs, commercial LLMs, and chemical LLMs on the Leaderboard!🌟

Introduction

Overview of ChemCoTBench and the ChemCoT-Dataset. ChemCoTBench presents four key strengths: 1) Novel Setup: beyond ChemQA, it is the first step-wise chemical reasoning benchmark; 2) Broad Coverage: it includes foundational tasks (molecule understanding, editing) and two critical applications, molecular optimization (for drug design) and reaction prediction (for organic synthesis); 3) Large Scale: it provides 1,495 samples across 22 subtasks for benchmarking, plus 17K high-quality chain-of-thought samples for model training; 4) High Quality: curated by 13 chemistry Ph.D.s from Tsinghua and Peking University, with strict prompt constraints to ensure CoT correctness.

ChemCoT-Benchmark

Overview

Despite recent advances in LLM reasoning capabilities, chemistry, a discipline fundamental to areas like drug discovery and materials science, still lacks a benchmark that assesses whether these improvements extend to its complex, domain-specific problem-solving needs. While several benchmarks have been proposed for LLMs in chemistry, they primarily focus on domain-specific question answering, which suffers from several key limitations:

1. Lack of Structured, Stepwise Reasoning and Real-World Relevance: Current evaluations often reduce chemistry assessment to factual recall (e.g., naming compounds or reactions), neglecting the need for operational reasoning akin to arithmetic or coding. Unlike mathematical problems, where solutions demand explicit, verifiable steps, chemistry QA tasks fail to simulate how experts decompose challenges. For instance, they don't capture the process of iteratively refining a molecule’s substructure to optimize properties, considering crucial real-world factors like synthesizability or toxicity, or deducing reaction mechanisms through intermediate transformations. This gap means we're not fully evaluating the analytical depth required in real-world chemistry. Therefore, evaluations must shift from these textbook-like problems to challenges that better reflect practical applications.

2. Ambiguous Skill Attribution in Hybrid Evaluations: Existing benchmarks often conflate reasoning, knowledge recall, and numerical computation into single "exam-style" metrics—for instance, asking LLMs to calculate reaction yields while simultaneously recalling reagent properties. This obscures whether strong performance stems from structured reasoning (e.g., analyzing reaction pathways) or memorized facts (e.g., solvent boiling points). Such ambiguity hinders targeted model improvement and misaligns evaluations with downstream tasks like drug discovery, where success depends on modular reasoning (e.g., decoupling molecular design from synthesizability checks) rather than monolithic problem-solving.


To address these limitations, we introduce ChemCoTBench, a step-by-step, application-oriented, and high-quality benchmark for evaluating LLM reasoning in chemical applications. A core innovation of ChemCoTBench is its formulation of complex chemical tasks, specifically targeting molecular modeling and design, into explicit sequences of verifiable modular chemical operations on SMILES structures (e.g., substructure addition, deletion, or substitution). This approach allows for a granular assessment of an LLM's ability to execute and chain together fundamental chemical transformations. The benchmark features progressively challenging tasks, spanning from basic molecular understanding and editing to property-guided structure optimization and complex multi-molecule chemical reactions. High-quality evaluation is ensured through a dual validation process combining LLM judgment with expert review from 13 chemists.
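
To make the notion of a verifiable modular operation concrete, below is a minimal sketch in Python using RDKit (not the benchmark's own tooling); the apply_edit helper, the SMARTS pattern, and the example molecules are illustrative assumptions, and verification is done by comparing canonical SMILES.

```python
# Illustrative sketch: applying and verifying a modular SMILES edit with RDKit.
# The apply_edit helper and the example molecules are assumptions for illustration,
# not ChemCoTBench's official evaluation code.
from rdkit import Chem

def apply_edit(smiles, operation, pattern, replacement=None):
    """Apply a single modular edit (delete / substitute) and return canonical SMILES."""
    mol = Chem.MolFromSmiles(smiles)
    patt = Chem.MolFromSmarts(pattern)
    if mol is None or patt is None:
        return None
    if operation == "delete":
        edited = Chem.DeleteSubstructs(mol, patt)
    elif operation == "substitute":
        repl = Chem.MolFromSmiles(replacement)
        edited = Chem.ReplaceSubstructs(mol, patt, repl, replaceAll=True)[0]
    else:
        return None  # "add" needs an explicit attachment point; omitted for brevity
    try:
        Chem.SanitizeMol(edited)  # reject chemically invalid results
    except Exception:
        return None
    return Chem.MolToSmiles(edited)  # canonical form enables exact-match checking

# Example: replace aspirin's carboxylic acid with a methyl ester, then verify a
# model's answer by canonical-SMILES equality.
aspirin = "CC(=O)Oc1ccccc1C(=O)O"
edited = apply_edit(aspirin, "substitute", "C(=O)[OH]", "C(=O)OC")
model_answer = "COC(=O)c1ccccc1OC(C)=O"
print(edited == Chem.MolToSmiles(Chem.MolFromSmiles(model_answer)))  # True
```

Because every intermediate step yields a canonical SMILES string, each link in a model's chain of operations can be checked on its own rather than only grading the final answer.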

Pipeline and Statistics

Experiment Results

Leaderboard

Our evaluation covers three categories of LLMs: (1) Reasoning LLMs with explicit step-by-step reasoning, including DeepSeek-R1, o1-mini, o3-mini, Gemini-2.5-pro, Claude-3.7-Sonnet-thinking, Qwen-3-thinking, and Llama-Nemotron-thinking; (2) general-purpose non-reasoning LLMs without specialized reasoning mechanisms, including GPT-4o, Qwen-2.5/3, Llama-3.3, Gemma-2, Phi-4, and OLMo2; (3) biomolecular LLMs, including BioMedGPT, BioMistral, and Text+Chem T5. This comparison assesses whether reasoning-specific capabilities provide advantages over domain-specific models on challenging chemical reasoning tasks.


Molecule Understanding and Editing Evaluation

Name Type Version | Func-Group: FG↓ Ring↓ | Scaffold: Murcko↑ Ring-sys↑ | SMILES: Eq.↑ | Molecule-Edit: Add Delete Sub
W/ Thinking
Gemini-2.5-pro think 2.5 0.11 0.60 0.51 87.5 82 100 85 81.7
Claude3.7-sonnet think 3.7 0.21 1.60 0.40 80.0 84 85 80 83.4
DeepSeek-R1 think R1 0.27 1.55 0.34 45.0 65 70 70 68.3
o3-mini think 20250103 0.13 0.60 0.39 75.0 78 65 55 80.0
o1-mini think 20240912 0.21 1.25 0.25 61.7 66 55 80 58.3
Qwen3-235B-A22B think 3-235B 0.42 1.00 0.38 82.5 72 40 75 71.7
Qwen3-32B think 3-32B 0.25 0.95 0.21 75.0 68 20 55 20.0
Llama-Nemo-49B think 49B 0.80 1.90 0.09 86.8 46 0 80 8.0
W/o Thinking
GPT-4o base 20241120 0.17 1.35 0.21 80.0 72 80 80 65.0
Deepseek-V3 base V3 0.15 1.50 0.24 76.7 77 70 75 76.7
Gemini-2.0-flash base 2.0 0.19 1.65 0.43 75.0 76 65 75 66.7
Qwen3-235B-A22B base 3-235B 0.42 1.00 0.34 82.5 75 40 75 66.7
Qwen3-32B base 3-32B 0.26 0.95 0.22 68.3 67 30 55 25.0
Qwen2.5-72B-Instruct base 2.5-72B 0.26 0.60 0.24 70.0 61 70 80 56.7
Qwen2.5-32B-Instruct base 2.5-32B 0.36 0.65 0.12 53.3 62 50 50 48.3
Llama-3.1-70B-Instruct base 3.1-70B 0.52 1.80 0.12 68.3 67 60 80 50.0
Llama-Nemo-49B base 49B 0.72 1.77 0.11 65.0 54 30 55 30.5
Gemma-2-27b-it base 2-27b 0.19 1.65 0.43 66.7 76 75 70 35.0
Phi-4-14B base 4-14B 0.28 1.65 0.15 70.0 65 60 80 38.3
OLMo2-32B-Instruct base 2-32B 0.19 1.05 0.07 63.3 50 15 30 11.7
BioMedGPT-7B base 7B 1.6 2.43 0.18 53.3 39 10 12 10
BioMistral-7B base 7B 1.0 1.85 0.04 32.5 50 0 10 0

Overall results of different models on molecule understanding and editing tasks. The best-performing model in each category is in bold. ↓ indicates lower is better; ↑ indicates higher is better.
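
For readers unfamiliar with the scaffold- and SMILES-level columns, the sketch below shows the kind of RDKit checks they imply: Bemis-Murcko scaffold agreement and canonical-SMILES equivalence. The exact metric definitions follow the paper; the helper functions and example molecules here are illustrative assumptions.

```python
# Rough sketch of scaffold and SMILES-equivalence checks with RDKit
# (assumed metric form, not the benchmark's official scoring script).
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def same_murcko_scaffold(smiles_a, smiles_b):
    """True if two molecules share the same Bemis-Murcko scaffold."""
    return (MurckoScaffold.MurckoScaffoldSmiles(smiles=smiles_a)
            == MurckoScaffold.MurckoScaffoldSmiles(smiles=smiles_b))

def smiles_equivalent(smiles_a, smiles_b):
    """True if two SMILES strings denote the same molecule (canonical-form match)."""
    mol_a, mol_b = Chem.MolFromSmiles(smiles_a), Chem.MolFromSmiles(smiles_b)
    if mol_a is None or mol_b is None:
        return False
    return Chem.MolToSmiles(mol_a) == Chem.MolToSmiles(mol_b)

# Toluene vs. ethylbenzene: same benzene scaffold, but different molecules.
print(same_murcko_scaffold("Cc1ccccc1", "CCc1ccccc1"))  # True
print(smiles_equivalent("Cc1ccccc1", "CCc1ccccc1"))     # False
```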


Molecule Optimization Evaluation

Models LogP (Δ) Solubility (Δ) QED (Δ) DRD2 (Δ) JNK3 (Δ) GSK3-β (Δ)
W/ Thinking
Gemini-2.5-pro-think -0.2881 1.9192 0.2184 0.3574 -0.0435 0.0468
Claude3.7-sonnet-think 0.4181 0.5977 0.0973 0.1866 -0.0149 0.0157
DeepSeek-R1 0.3674 1.4897 0.0572 0.1062 -0.0629 -0.0241
o3-mini@20250103 0.2968 1.1585 0.1786 0.1869 -0.0823 -0.0345
o1-mini@20240912 -0.4252 1.7895 0.0770 -0.0337 -0.1015 -0.0831
Qwen3-235B-A22B-think -0.0141 0.2742 0.0124 0.0331 -0.0123 0.0131
Qwen3-32B-think 0.02 0.1123 0.0214 0.06 -0.026 -0.025
Llama-Nemo-49B-think -0.6424 0.2024 -0.1641 -0.0530 -0.157 -0.1211
W/o Thinking
GPT-4o@20241120 -0.2042 0.8280 0.0570 0.0548 -0.0530 -0.0439
DeepSeek-V3 0.0834 0.4793 0.0846 0.0228 0.018 0.029
Gemini-2.0-flash 0.3575 0.1954 0.1079 0.1563 0.0334 0.038
Qwen3-235B-A22B 0.0241 0.5145 0.0126 0.0131 -0.0123 0.034
Qwen3-32B -0.032 0.1723 0.0214 -0.016 -0.026 -0.025
Qwen2.5-72B-Instruct -0.1242 0.2860 0.0357 0.0440 -0.0226 -0.0140
Qwen2.5-32B-Instruct 0.0347 0.4266 -0.0154 0.0432 -0.0419 -0.0231
Llama-3.3-70B-Instruct -0.1642 0.6180 0.0761 -0.0231 -0.0430 -0.0240
Llama-Nemo-Super-49B -0.1427 0.3141 0.0250 -0.0218 -0.0416 -0.0327
Gemma-2-27b-it -0.0334 0.3466 0.0556 -0.0315 0.016 -0.0117
Phi-4-14B -0.1045 0.2854 0.1174 -0.0418 -0.0514 -0.0422
OLMo2-32B-Instruct -2.0322 1.0346 -0.1340 -0.117 -0.128 -0.1112

Results of different models on molecule optimization tasks. The best-performing model in each category is in bold.
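
The columns above report the change (Δ) in each property between the source molecule and the model-optimized molecule. As a minimal sketch, and assuming RDKit's standard descriptors stand in for the LogP and QED oracles (DRD2, JNK3, and GSK3-β require trained bioactivity predictors not shown here), a delta could be computed as follows; the property_delta helper is a hypothetical name.

```python
# Illustrative property-delta computation for LogP / QED with RDKit
# (assumed metric form, not the paper's exact evaluation protocol).
from rdkit import Chem
from rdkit.Chem import Descriptors, QED

def property_delta(src_smiles, opt_smiles, prop="logp"):
    """Return optimized-minus-source property value, or None for invalid SMILES."""
    src, opt = Chem.MolFromSmiles(src_smiles), Chem.MolFromSmiles(opt_smiles)
    if src is None or opt is None:
        return None
    score = Descriptors.MolLogP if prop == "logp" else QED.qed
    return score(opt) - score(src)

# Example: appending a propyl chain to benzene raises the computed LogP.
print(property_delta("c1ccccc1", "CCCc1ccccc1", prop="logp"))
```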

Chemical Reactions Evaluation

Name Type Version | Fwd-major: Top-1 FTS↑ | Fwd-by: Top-1 FTS↑ | Retro: Top-1 FTS↑ | Condition: Top-1 FTS↑ | NEPP: Top-1 FTS↑ | MechSel: Acc.↑
W/ Thinking
Gemini-2.5-pro think 2.5 0.72 0.89 0.20 0.51 0.20 0.45 0.20 0.33 0.58 0.53 0.62
Claude3.7-sonnet think 3.7 0.73 0.87 0.25 0.31 0.12 0.27 0.14 0.22 0.24 0.79 0.49
DeepSeek-R1 think R1 0.48 0.71 0.21 0.45 0.07 0.41 0.23 0.30 0.15 0.55 0.46
o3-mini think 20250103 0.52 0.71 0.20 0.27 0.11 0.39 0.19 0.19 0.18 0.58 0.49
o1-mini think 20240912 0.26 0.31 0.11 0.17 0.02 0.15 0.08 0.22 0.09 0.33 0.44
Qwen3-235B-A22B think 3-235B 0.03 0.54 0.0 0.07 0.01 0.42 0.20 0.27 0.09 0.63 0.41
Qwen3-32B think 3-32B 0.11 0.33 0.09 0.18 0.02 0.24 0.14 0.20 0.08 0.67 0.46
Llama-Nemo-49B think 49B 0.09 0.18 0.04 0.18 0.0 0.05 0.18 0.19 0.04 0.21 0.47
W/o Thinking
GPT-4o base 20241120 0.28 0.58 0.04 0.20 0.03 0.43 0.0 0.08 0.12 0.71 0.43
DeepSeek-V3 base V3 0.36 0.62 0.04 0.30 0.03 0.44 0.08 0.16 0.20 0.70 0.45
Gemini-2.0-flash base 2.0 0.19 0.56 0.01 0.07 0.05 0.41 0.07 0.08 0.13 0.68 0.53
Qwen3-235B-A22B base 3-235B 0.04 0.57 0.0 0.06 0.0 0.30 0.07 0.14 0.07 0.59 0.40
Qwen3-32B base 3-32B 0.06 0.57 0.0 0.13 0.0 0.43 0.01 0.10 0.08 0.67 0.46
Qwen2.5-72B-Instruct base 2.5-72B 0.04 0.49 0.0 0.13 0.01 0.35 0.01 0.07 0.06 0.60 0.46
Qwen2.5-32B-Instruct base 2.5-32B 0.01 0.43 0.0 0.12 0.0 0.29 0.02 0.10 0.05 0.50 0.45
Llama-3.3-70B-Instruct base 3.3-70B 0.02 0.35 0.0 0.08 0.0 0.34 0.06 0.13 0.06 0.41 0.39
Llama-Nemo-49B base 49B 0.04 0.40 0.0 0.08 0.0 0.30 0.03 0.05 0.05 0.41 0.46
Gemma-2-27b-it base 2-27b 0.01 0.55 0.0 0.04 0.0 0.48 0.03 0.10 0.04 0.53 0.43
Phi-4-14B base 4-14B 0.01 0.27 0.03 0.10 0.0 0.39 0.0 0.03 0.05 0.57 0.39
OLMo2-32B-Instruct base 2-32B 0.0 0.10 0.0 0.07 0.0 0.10 0.0 0.03 0.01 0.13 0.32
Text+Chem T5 special T5 0.44 0.74 0.0 0.07 0.06 0.24 0.0 0.09 0.0 0.0 0.10

Performance on chemical reaction tasks. The best-performing model in each category is in bold. ↑ indicates higher is better.
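
FTS is commonly read as fingerprint Tanimoto similarity between the predicted and reference molecules; under that assumption, a minimal scoring sketch with RDKit Morgan fingerprints looks like the following (the tanimoto_similarity helper and example SMILES are illustrative, not the benchmark's code).

```python
# Sketch of a fingerprint Tanimoto similarity (FTS) score with RDKit,
# assuming Morgan fingerprints; not the official evaluation script.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto_similarity(pred_smiles, ref_smiles, radius=2, n_bits=2048):
    """Morgan-fingerprint Tanimoto similarity between predicted and reference molecules."""
    pred, ref = Chem.MolFromSmiles(pred_smiles), Chem.MolFromSmiles(ref_smiles)
    if pred is None or ref is None:
        return 0.0  # an unparsable prediction scores zero
    fp_pred = AllChem.GetMorganFingerprintAsBitVect(pred, radius, nBits=n_bits)
    fp_ref = AllChem.GetMorganFingerprintAsBitVect(ref, radius, nBits=n_bits)
    return DataStructs.TanimotoSimilarity(fp_pred, fp_ref)

# Predicted ethyl acetate vs. reference methyl acetate: similar but not identical.
print(tanimoto_similarity("CCOC(C)=O", "COC(C)=O"))
```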

BibTeX


          @article{li2025beyond,
            title={Beyond Chemical QA: Evaluating LLM's Chemical Reasoning with Modular Chemical Operations},
            author={Li, Hao and Cao, He and Feng, Bin and Shao, Yanjun and Tang, Xiangru and Yan, Zhiyuan and Yuan, Li and Tian, Yonghong and Li, Yu},
            journal={arXiv preprint arXiv:2505.21318},
            year={2025}
          }