🚀[2025-06-09]: We release ChemCoTBench and report scores for open-source LLMs, commercial LLMs, and chemical LLMs on the Leaderboard!🌟
Overview of ChemCoTBench and the ChemCoT-Dataset. ChemCoTBench offers four key strengths: 1) Novel Setup: beyond ChemQA, it is the first step-wise chemical reasoning benchmark; 2) Broad Coverage: it includes foundational tasks (molecule understanding and editing) and two critical applications, molecular optimization (for drug design) and reaction equations (for organic synthesis); 3) Large Scale: it provides 1,495 samples across 22 subtasks for benchmarking, plus 17K high-quality Chain-of-Thought samples for model training; 4) High Quality: it is curated by 13 chemistry Ph.D.s from Tsinghua and Peking University, with strict prompt constraints to ensure CoT correctness.
Despite recent advances in LLM reasoning capabilities, chemistry, a discipline fundamental to areas like drug discovery and materials science, still lacks a benchmark that assesses whether these improvements extend to its complex, domain-specific problem-solving needs. While several benchmarks have been proposed for LLMs in chemistry, they primarily focus on domain-specific question answering, which suffers from several key limitations:
1. Lack of Structured, Stepwise Reasoning and Real-World Relevance: Current evaluations often reduce chemistry assessment to factual recall (e.g., naming compounds or reactions), neglecting the need for operational reasoning akin to arithmetic or coding. Unlike mathematical problems, whose solutions demand explicit, verifiable steps, chemistry QA tasks fail to simulate how experts decompose challenges. For instance, they do not capture how a chemist iteratively refines a molecule's substructure to optimize properties while weighing real-world factors like synthesizability or toxicity, or deduces a reaction mechanism through intermediate transformations. As a result, current benchmarks do not probe the analytical depth that real-world chemistry requires, and evaluations must shift from textbook-like problems to challenges that better reflect practical applications.
2. Ambiguous Skill Attribution in Hybrid Evaluations: Existing benchmarks often conflate reasoning, knowledge recall, and numerical computation into single "exam-style" metrics—for instance, asking LLMs to calculate reaction yields while simultaneously recalling reagent properties. This obscures whether strong performance stems from structured reasoning (e.g., analyzing reaction pathways) or memorized facts (e.g., solvent boiling points). Such ambiguity hinders targeted model improvement and misaligns evaluations with downstream tasks like drug discovery, where success depends on modular reasoning (e.g., decoupling molecular design from synthesizability checks) rather than monolithic problem-solving.
To address these limitations, we introduce ChemCoTBench, a step-by-step, application-oriented, and high-quality benchmark for evaluating LLM reasoning in chemical applications. A core innovation of ChemCoTBench is its formulation of complex chemical tasks, specifically targeting molecular modeling and design, into explicit sequences of verifiable modular chemical operations on SMILES structures (e.g., substructure addition, deletion, or substitution). This approach allows for a granular assessment of an LLM's ability to execute and chain together fundamental chemical transformations. The benchmark features progressively challenging tasks, spanning from basic molecular understanding and editing to property-guided structure optimization and complex multi-molecule chemical reactions. High-quality evaluation is ensured through a dual validation process combining LLM judgment with expert review from 13 chemists.
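To make the idea of verifiable modular operations concrete, the sketch below uses RDKit to check a single "add a functional group" edit on a SMILES string. This is an illustration only, not the benchmark's official checker: the helper names, SMARTS pattern, and example molecules are our own.

```python
# A minimal sketch (assumption: not ChemCoTBench's official verifier) of how a
# modular SMILES operation such as "add a carboxylic acid group" can be checked
# programmatically with RDKit.
from rdkit import Chem

def count_substructure(smiles: str, smarts: str) -> int:
    """Count occurrences of a SMARTS-defined substructure in a molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:                      # invalid SMILES -> treated as failure
        return -1
    pattern = Chem.MolFromSmarts(smarts)
    return len(mol.GetSubstructMatches(pattern))

def verify_addition(src_smiles: str, edited_smiles: str, smarts: str) -> bool:
    """An 'add' edit passes if the edited molecule is valid and gains exactly one group."""
    before = count_substructure(src_smiles, smarts)
    after = count_substructure(edited_smiles, smarts)
    return before >= 0 and after == before + 1

# Example: add one carboxylic acid group to toluene.
CARBOXYLIC_ACID = "C(=O)[OX2H1]"
print(verify_addition("Cc1ccccc1", "Cc1ccccc1C(=O)O", CARBOXYLIC_ACID))  # True
```

Deletion and substitution edits can be verified the same way by checking that the substructure count decreases or that one pattern is removed while another appears.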
Distribution analysis for ChemCoTBench. (a) The two application tasks (molecule optimization, 38%, and reaction prediction, 37%) contribute slightly more samples than the foundation tasks (molecule editing and understanding, 25%) to make the evaluation more challenging. (b) In chemical-expert evaluations, samples from the molecular understanding and editing tasks achieve exceptionally high accuracy on chemical entities, including functional group names, molecule names, chemical operation names, and reaction information. (c) Samples from molecule optimization and reaction prediction also show high accuracy (>89%) in chemical-expert evaluations.
Data construction pipeline for ChemCoTBench and ChemCoTDataset.
Our evaluation covers three categories of LLMs: (1) reasoning LLMs with explicit step-by-step reasoning, including DeepSeek-R1, o1-mini, o3-mini, Gemini-2.5-pro, Claude-3.7-Sonnet-thinking, Qwen-3-thinking, and Llama-Nemotron-thinking; (2) general-purpose non-reasoning LLMs without specialized reasoning mechanisms, including GPT-4o, Qwen-2.5/3, Llama-3.3, Gemma-2, Phi-4, and OLMo2; and (3) biomolecular LLMs, including BioMedGPT, BioMistral, and Text+Chem T5. This comprehensive comparison evaluates whether reasoning-specific capabilities provide advantages over domain-specific models on challenging chemical reasoning tasks.
Molecule Understanding and Editing Evaluation
Name | Type | Version | Func-Group: FG↓ | Func-Group: Ring↓ | Scaffold: Murcko↑ | Scaffold: Ring-sys↑ | SMILES: Eq.↑ | Edit: Add↑ | Edit: Delete↑ | Edit: Sub↑
---|---|---|---|---|---|---|---|---|---|---
W/ Thinking | ||||||||||
Gemini-2.5-pro | think | 2.5 | 0.11 | 0.60 | 0.51 | 87.5 | 82 | 100 | 85 | 81.7 |
Claude3.7-sonnet | think | 3.7 | 0.21 | 1.60 | 0.40 | 80.0 | 84 | 85 | 80 | 83.4 |
DeepSeek-R1 | think | R1 | 0.27 | 1.55 | 0.34 | 45.0 | 65 | 70 | 70 | 68.3 |
o3-mini | think | 20250103 | 0.13 | 0.60 | 0.39 | 75.0 | 78 | 65 | 55 | 80.0 |
o1-mini | think | 20240912 | 0.21 | 1.25 | 0.25 | 61.7 | 66 | 55 | 80 | 58.3 |
Qwen3-235B-A22B | think | 3-235B | 0.42 | 1.00 | 0.38 | 82.5 | 72 | 40 | 75 | 71.7 |
Qwen3-32B | think | 3-32B | 0.25 | 0.95 | 0.21 | 75.0 | 68 | 20 | 55 | 20.0 |
Llama-Nemo-49B | think | 49B | 0.80 | 1.90 | 0.09 | 86.8 | 46 | 0 | 80 | 8.0 |
W/o Thinking | ||||||||||
GPT-4o | base | 20241120 | 0.17 | 1.35 | 0.21 | 80.0 | 72 | 80 | 80 | 65.0 |
Deepseek-V3 | base | V3 | 0.15 | 1.50 | 0.24 | 76.7 | 77 | 70 | 75 | 76.7 |
Gemini-2.0-flash | base | 2.0 | 0.19 | 1.65 | 0.43 | 75.0 | 76 | 65 | 75 | 66.7 |
Qwen3-235B-A22B | base | 3-235B | 0.42 | 1.00 | 0.34 | 82.5 | 75 | 40 | 75 | 66.7 |
Qwen3-32B | base | 3-32B | 0.26 | 0.95 | 0.22 | 68.3 | 67 | 30 | 55 | 25.0 |
Qwen2.5-72B-Instruct | base | 2.5-72B | 0.26 | 0.60 | 0.24 | 70.0 | 61 | 70 | 80 | 56.7 |
Qwen2.5-32B-Instruct | base | 2.5-32B | 0.36 | 0.65 | 0.12 | 53.3 | 62 | 50 | 50 | 48.3 |
Llama-3.1-70B-Instruct | base | 3.1-70B | 0.52 | 1.80 | 0.12 | 68.3 | 67 | 60 | 80 | 50.0 |
Llama-Nemo-49B | base | 49B | 0.72 | 1.77 | 0.11 | 65.0 | 54 | 30 | 55 | 30.5 |
Gemma-2-27b-it | base | 2-27b | 0.19 | 1.65 | 0.43 | 66.7 | 76 | 75 | 70 | 35.0 |
Phi-4-14B | base | 4-14B | 0.28 | 1.65 | 0.15 | 70.0 | 65 | 60 | 80 | 38.3 |
OLMo2-32B-Instruct | base | 2-32B | 0.19 | 1.05 | 0.07 | 63.3 | 50 | 15 | 30 | 11.7 |
BioMedGPT-7B | base | 7B | 1.6 | 2.43 | 0.18 | 53.3 | 39 | 10 | 12 | 10 |
BioMistral-7B | base | 7B | 1.0 | 1.85 | 0.04 | 32.5 | 50 | 0 | 10 | 0 |
Overall results of different models on molecular understanding and editing tasks. The best-performing model in each category is in bold. ↓ indicates lower is better; ↑ indicates higher is better.
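For readers unfamiliar with the Scaffold columns, the hedged sketch below shows the operation they rest on: extracting Bemis-Murcko scaffolds with RDKit and comparing them. ChemCoTBench's exact scoring protocol may differ; the function name and example SMILES are illustrative.

```python
# A minimal sketch, assuming scaffold agreement is judged by comparing canonical
# SMILES of Bemis-Murcko scaffolds; not necessarily the benchmark's exact metric.
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def murcko_match(pred_smiles: str, ref_smiles: str) -> bool:
    """True if both molecules share the same Bemis-Murcko scaffold."""
    pred = Chem.MolFromSmiles(pred_smiles)
    ref = Chem.MolFromSmiles(ref_smiles)
    if pred is None or ref is None:
        return False
    pred_scaf = MurckoScaffold.GetScaffoldForMol(pred)
    ref_scaf = MurckoScaffold.GetScaffoldForMol(ref)
    return Chem.MolToSmiles(pred_scaf) == Chem.MolToSmiles(ref_scaf)

# A methyl-substituted analogue keeps the scaffold; benzene alone does not.
print(murcko_match("c1ccc(-c2ccncc2)cc1", "Cc1ccc(-c2ccncc2)cc1"))  # True
print(murcko_match("c1ccc(-c2ccncc2)cc1", "c1ccccc1"))              # False
```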
Molecule Optimization Evaluation
Models | LogP Δ | LogP SR% | Solubility Δ | Solubility SR% | QED Δ | QED SR% | DRD2 Δ | DRD2 SR% | JNK3 Δ | JNK3 SR% | GSK3-β Δ | GSK3-β SR%
---|---|---|---|---|---|---|---|---|---|---|---|---
W/ Thinking | ||||||||||||
Gemini-2.5-pro-think | -0.28 | 81 | 1.91 | 92 | 0.21 | 84 | 0.35 | 74 | -0.04 | 35 | 0.04 | 68 |
Claude3.7-sonnet-think | 0.41 | 81 | 0.59 | 77 | 0.09 | 73 | 0.18 | 66 | -0.01 | 49 | 0.01 | 57 |
DeepSeek-R1 | 0.36 | 74 | 1.48 | 97 | 0.05 | 72 | 0.10 | 62 | -0.06 | 29 | -0.02 | 41 |
o3-mini@20250103 | 0.29 | 68 | 1.15 | 85 | 0.17 | 86 | 0.18 | 69 | -0.08 | 23 | -0.03 | 45 |
o1-mini@20240912 | -0.42 | 52 | 1.78 | 95 | 0.07 | 70 | -0.03 | 37 | -0.10 | 15 | -0.08 | 31 |
Qwen3-235B-A22B-think | -0.01 | 41 | 0.27 | 42 | 0.01 | 24 | 0.03 | 31 | -0.01 | 23 | 0.01 | 31 |
Qwen3-32B-think | 0.0 | 2 | 0.11 | 23 | 0.02 | 14 | 0.0 | 6 | -0.02 | 6 | -0.02 | 5 |
Llama-Nemo-49B-think | -0.64 | 24 | 0.20 | 24 | -0.16 | 41 | -0.05 | 30 | -0.15 | 7 | -0.12 | 11 |
W/o Thinking | ||||||||||||
GPT-4o@20241120 | -0.20 | 42 | 0.82 | 80 | 0.05 | 70 | 0.05 | 48 | -0.05 | 30 | -0.04 | 39 |
DeepSeek-V3 | 0.08 | 34 | 0.47 | 93 | 0.08 | 46 | 0.02 | 28 | 0.0 | 18 | 0.0 | 29 |
Gemini-2.0-flash | 0.35 | 75 | 0.19 | 54 | 0.10 | 79 | 0.15 | 63 | 0.03 | 34 | 0.0 | 38 |
Qwen3-235B-A22B | 0.02 | 41 | 0.51 | 45 | 0.01 | 26 | 0.01 | 31 | -0.01 | 23 | 0.0 | 34 |
Qwen3-32B | -0.03 | 2 | 0.17 | 23 | 0.02 | 14 | -0.01 | 6 | -0.02 | 6 | -0.02 | 5 |
Qwen2.5-72B-Instruct | -0.12 | 42 | 0.28 | 60 | 0.03 | 57 | 0.04 | 40 | -0.02 | 26 | -0.01 | 40 |
Qwen2.5-32B-Instruct | 0.03 | 47 | 0.42 | 66 | -0.01 | 54 | 0.04 | 32 | -0.04 | 19 | -0.02 | 31 |
Llama-3.3-70B-Instruct | -0.16 | 42 | 0.61 | 80 | 0.07 | 61 | -0.02 | 31 | -0.04 | 30 | -0.02 | 40 |
Llama-Nemo-Super-49B | -0.14 | 27 | 0.31 | 41 | 0.02 | 50 | -0.02 | 18 | -0.04 | 16 | -0.03 | 27 |
Gemma-2-27b-it | -0.03 | 34 | 0.34 | 66 | 0.05 | 56 | -0.03 | 15 | 0.0 | 16 | -0.01 | 17 |
Phi-4-14B | -0.10 | 45 | 0.28 | 54 | 0.11 | 74 | -0.04 | 18 | -0.05 | 14 | -0.04 | 22 |
OLMo2-32B-Instruct | -2.03 | 22 | 1.03 | 46 | -0.13 | 40 | -0.11 | 7 | -0.12 | 8 | -0.11 | 12 |
Results of different models on molecule optimization tasks. The best-performing model in each category is in bold.
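As a rough illustration of the Δ and SR% columns, the sketch below computes a LogP change for each (source, optimized) pair with RDKit and aggregates a mean improvement and a success rate. The threshold and aggregation details are assumptions for illustration, not the benchmark's official definition.

```python
# A hedged sketch: Δ as the mean property change between the optimized and the
# source molecule, SR% as the fraction of cases whose change exceeds a threshold.
from rdkit import Chem
from rdkit.Chem import Crippen

def logp_delta(src_smiles: str, opt_smiles: str):
    src = Chem.MolFromSmiles(src_smiles)
    opt = Chem.MolFromSmiles(opt_smiles)
    if src is None or opt is None:
        return None                       # invalid output counts as a failure
    return Crippen.MolLogP(opt) - Crippen.MolLogP(src)

def summarize(pairs, threshold=0.0):
    deltas = [logp_delta(s, o) for s, o in pairs]
    valid = [d for d in deltas if d is not None]
    mean_delta = sum(valid) / len(valid) if valid else 0.0
    success_rate = 100.0 * sum(d > threshold for d in valid) / len(deltas)
    return mean_delta, success_rate

pairs = [("CCO", "CCCCO"), ("c1ccccc1", "c1ccccc1O")]   # toy (source, optimized) pairs
print(summarize(pairs))
```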
Chemical Reactions Evaluation
Name | Type | Version | Fwd-Major Top-1↑ | Fwd-Major FTS↑ | Fwd-By Top-1↑ | Fwd-By FTS↑ | Retro Top-1↑ | Retro FTS↑ | Condition Top-1↑ | Condition FTS↑ | NEPP Top-1↑ | NEPP FTS↑ | MechSel Acc.↑
---|---|---|---|---|---|---|---|---|---|---|---|---|---
W/ Thinking | |||||||||||||
Gemini-2.5-pro | think | 2.5 | 0.72 | 0.89 | 0.20 | 0.51 | 0.20 | 0.45 | 0.20 | 0.33 | 0.58 | 0.53 | 0.62 |
Claude3.7-sonnet | think | 3.7 | 0.73 | 0.87 | 0.25 | 0.31 | 0.12 | 0.27 | 0.14 | 0.22 | 0.24 | 0.79 | 0.49 |
DeepSeek-R1 | think | R1 | 0.48 | 0.71 | 0.21 | 0.45 | 0.07 | 0.41 | 0.23 | 0.30 | 0.15 | 0.55 | 0.46 |
o3-mini | think | 20250103 | 0.52 | 0.71 | 0.20 | 0.27 | 0.11 | 0.39 | 0.19 | 0.19 | 0.18 | 0.58 | 0.49 |
o1-mini | think | 20240912 | 0.26 | 0.31 | 0.11 | 0.17 | 0.02 | 0.15 | 0.08 | 0.22 | 0.09 | 0.33 | 0.44 |
Qwen3-235B-A22B | think | 3-235B | 0.03 | 0.54 | 0.0 | 0.07 | 0.01 | 0.42 | 0.20 | 0.27 | 0.09 | 0.63 | 0.41 |
Qwen3-32B | think | 3-32B | 0.11 | 0.33 | 0.09 | 0.18 | 0.02 | 0.24 | 0.14 | 0.20 | 0.08 | 0.67 | 0.46 |
Llama-Nemo-49B | think | 49B | 0.09 | 0.18 | 0.04 | 0.18 | 0.0 | 0.05 | 0.18 | 0.19 | 0.04 | 0.21 | 0.47 |
W/o Thinking | |||||||||||||
GPT-4o | base | 20241120 | 0.28 | 0.58 | 0.04 | 0.20 | 0.03 | 0.43 | 0.0 | 0.08 | 0.12 | 0.71 | 0.43 |
DeepSeek-V3 | base | V3 | 0.36 | 0.62 | 0.04 | 0.30 | 0.03 | 0.44 | 0.08 | 0.16 | 0.20 | 0.70 | 0.45 |
Gemini-2.0-flash | base | 2.0 | 0.19 | 0.56 | 0.01 | 0.07 | 0.05 | 0.41 | 0.07 | 0.08 | 0.13 | 0.68 | 0.53 |
Qwen3-235B-A22B | base | 3-235B | 0.04 | 0.57 | 0.0 | 0.06 | 0.0 | 0.30 | 0.07 | 0.14 | 0.07 | 0.59 | 0.40 |
Qwen3-32B | base | 3-32B | 0.06 | 0.57 | 0.0 | 0.13 | 0.0 | 0.43 | 0.01 | 0.10 | 0.08 | 0.67 | 0.46 |
Qwen2.5-72B-Instruct | base | 2.5-72B | 0.04 | 0.49 | 0.0 | 0.13 | 0.01 | 0.35 | 0.01 | 0.07 | 0.06 | 0.60 | 0.46 |
Qwen2.5-32B-Instruct | base | 2.5-32B | 0.01 | 0.43 | 0.0 | 0.12 | 0.0 | 0.29 | 0.02 | 0.10 | 0.05 | 0.50 | 0.45 |
Llama-3.3-70B-Instruct | base | 3.3-70B | 0.02 | 0.35 | 0.0 | 0.08 | 0.0 | 0.34 | 0.06 | 0.13 | 0.06 | 0.41 | 0.39 |
Llama-Nemo-49B | base | 49B | 0.04 | 0.40 | 0.0 | 0.08 | 0.0 | 0.30 | 0.03 | 0.05 | 0.05 | 0.41 | 0.46 |
Gemma-2-27b-it | base | 2-27b | 0.01 | 0.55 | 0.0 | 0.04 | 0.0 | 0.48 | 0.03 | 0.10 | 0.04 | 0.53 | 0.43 |
Phi-4-14B | base | 4-14B | 0.01 | 0.27 | 0.03 | 0.10 | 0.0 | 0.39 | 0.0 | 0.03 | 0.05 | 0.57 | 0.39 |
OLMo2-32B-Instruct | base | 2-32B | 0.0 | 0.10 | 0.0 | 0.07 | 0.0 | 0.10 | 0.0 | 0.03 | 0.01 | 0.13 | 0.32 |
Text+Chem T5 | special | T5 | 0.44 | 0.74 | 0.0 | 0.07 | 0.06 | 0.24 | 0.0 | 0.09 | 0.0 | 0.0 | 0.10 |
Performance on chemical reaction tasks. The best-performing model in each category is in bold. ↑ indicates higher is better.
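The FTS columns report fingerprint Tanimoto similarity between predicted and reference molecules. A minimal sketch is given below, assuming Morgan fingerprints (radius 2, 2048 bits); the actual fingerprint settings used by the benchmark may differ.

```python
# A hedged sketch of a fingerprint Tanimoto similarity (FTS) computation with RDKit.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def tanimoto(pred_smiles: str, ref_smiles: str) -> float:
    pred = Chem.MolFromSmiles(pred_smiles)
    ref = Chem.MolFromSmiles(ref_smiles)
    if pred is None or ref is None:
        return 0.0                        # invalid prediction scores zero
    fp_pred = AllChem.GetMorganFingerprintAsBitVect(pred, 2, nBits=2048)
    fp_ref = AllChem.GetMorganFingerprintAsBitVect(ref, 2, nBits=2048)
    return DataStructs.TanimotoSimilarity(fp_pred, fp_ref)

# Predicted vs. reference product of a toy esterification.
print(tanimoto("CCOC(C)=O", "CC(=O)OCC"))   # 1.0 (same molecule, different SMILES)
print(tanimoto("CCO", "CC(=O)OCC"))         # partial similarity
```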
@article{li2025beyond,
title={Beyond Chemical QA: Evaluating LLM's Chemical Reasoning with Modular Chemical Operations},
author={Li, Hao and Cao, He and Feng, Bin and Shao, Yanjun and Tang, Xiangru and Yan, Zhiyuan and Yuan, Li and Tian, Yonghong and Li, Yu},
journal={arXiv preprint arXiv:2505.21318},
year={2025}
}