all¶
Overview¶
pdb2reaction all runs the entire workflow end-to-end:
Active site model extraction → (optional) staged scan → MEP search (recursive path-search by default) → (optional) TS optimization + IRC (tsopt) → (optional) vibrational analysis / thermochemistry (freq) → (optional) single-point DFT (dft). Use --refine-path False to fall back to single-pass path-opt (GSM/DMF). The default MLIP backend is UMA; select an alternative with -b/--backend.
Important
The all workflow without --tsopt produces TS candidates (Highest-Energy Images from MEP search). Adding --tsopt refines these into optimized TS structures validated by imaginary-frequency check, followed by IRC for endpoint validation. Always inspect the results (imaginary-frequency count + endpoint connectivity) before mechanistic interpretation.
At a glance¶
Use when: You want the entire pipeline (extraction → MEP → TS optimization → IRC → thermo → DFT) end-to-end from PDB(s).
Method: Three modes — multi-structure MEP, single-structure + staged scan, or TSOPT-only — selected by the inputs and flags you provide.
Outputs: summary.log, summary.json, and path_search/mep.pdb (or path_opt/ when --refine-path False); per-segment seg_XX/ and post-processing path_search/post_seg_XX/ when --tsopt/--thermo/--dft are enabled.
Defaults: Backend uma, --mep-mode gsm, --opt-mode grad, --refine-path True, --preopt True, --thresh gau, --thresh-post baker; --tsopt/--thermo/--dft are off.
Next step: Without --tsopt, results are TS candidates (HEIs); add --tsopt (imaginary-frequency check) and IRC for validation, then optionally --thermo and --dft.
Workflow at a glance¶
Most workflows follow this flow:
Full system(s) (PDB/XYZ/GJF)
│
├─ (optional) active site model extraction [`extract`](extract.md) ← requires PDB when you use --center/-c
│ ↓
│ Active site model/cluster model(s) (PDB)
│ │
│ ├─ (optional) staged scan [`scan`](scan.md) ← single-structure workflows
│ │ ↓
│ │ Ordered intermediates
│ │ ↓
│ └─ MEP search [`path-search`](path-search.md) or [`path-opt`](path-opt.md)
│ ↓
│ MEP trajectory (mep_trj.xyz) + energy diagrams
│ ↓
└─ (optional) TS optimization + IRC [`tsopt`](tsopt.md) → [`irc`](irc.md)
└─ (optional) thermo [`freq`](freq.md)
└─ (optional) single-point DFT [`dft`](dft.md)
Each stage is available as an individual subcommand. The pdb2reaction all command runs many stages end-to-end.
It supports three modes:
Multi-structure workflow — Provide ≥2 structures (PDB/GJF/XYZ) in reaction order plus a substrate definition. all extracts active site models, runs GSM/DMF MEP search, merges the optimized path back into the full-system template(s), and optionally runs TSOPT+IRC/freq/DFT per reactive segment.
Single-structure + staged scan — Provide one structure plus one or more --scan-lists/-s. The (staged) scan generates an ordered set of intermediates that become MEP endpoints.
One --scan-lists/-s literal runs a single scan stage.
Multiple stages are passed as multiple arguments to a single --scan-lists/-s (e.g. -s '[(…)]' '[(…)]').
TSOPT-only active site model TS optimization — Provide a single input structure, omit --scan-lists/-s, and set --tsopt. all extracts the active site model (if -c/--center is given) and runs TS optimization + IRC, with optional freq/DFT, on that single system.
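The three modes above can be read as a small decision rule over the number of inputs and two flags. The sketch below is illustrative only: `select_mode` and its argument names are hypothetical and not part of pdb2reaction, and the error case reflects the statement in Notes that single-structure runs require either --scan-lists/-s or --tsopt.

```python
def select_mode(n_inputs: int, has_scan_lists: bool, tsopt: bool) -> str:
    """Return which of the three `all` modes an input combination implies.

    Hypothetical helper for illustration; not pdb2reaction's own logic.
    """
    if n_inputs >= 2:
        # Two or more structures in reaction order: multi-structure MEP.
        # (How scan lists combined with several inputs are handled is not
        # covered by this sketch; the scan is documented as single-input only.)
        return "multi-structure"
    if n_inputs == 1 and has_scan_lists:
        # One structure plus --scan-lists/-s: staged scan feeds the MEP search.
        return "single-structure+scan"
    if n_inputs == 1 and tsopt:
        # One structure, no scan lists, --tsopt set: TSOPT-only.
        return "tsopt-only"
    raise ValueError("a single input requires --scan-lists/-s or --tsopt")
```

Note that when both scan lists and --tsopt are given for one input, the scan route is chosen and --tsopt acts as post-processing, which matches the single-structure examples in this page.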
Tip
For large active site models, the single-structure scan workflow (--scan-lists/-s) tends to produce more reliable reaction barriers than the multi-structure MEP workflow. When multiple full PDB structures are provided, structural differences in regions unrelated to the reaction coordinate can accumulate, leading to overestimated barriers. The scan workflow avoids this by starting from a single structure and driving only the relevant coordinates, minimizing irrelevant structural noise. This effect becomes more pronounced as the model size increases.
Working examples: The examples/ directory contains complete all workflow scripts for GPP C6-methyltransferase BezA (Tsutsumi et al., Angew. Chem. Int. Ed. 2022, 61, e202111217), covering both multi-structure MEP and scan-based pipelines.
Minimal example¶
pdb2reaction all -i 1.R.pdb 3.P.pdb -c "SAM,GPP,MG" -l "SAM:1,GPP:-3" \
--out-dir ./result_all
Output checklist¶
result_all/summary.log
result_all/summary.json
result_all/path_search/mep.pdb (or result_all/path_opt/ when --refine-path False is used)
Common examples¶
Run full post-processing in one command.
pdb2reaction all -i 1.R.pdb 3.P.pdb -c "SAM,GPP,MG" -l "SAM:1,GPP:-3" \
--tsopt --thermo --dft --out-dir ./result_mep
Single-structure staged scan route.
pdb2reaction all -i 1.R.pdb -c "SAM,GPP,MG" -l "SAM:1,GPP:-3" \
-s '[("CS1 SAM 320","GPP 321 C7",1.60)]' '[("GPP 321 H11","GLU 186 OE2",0.90)]' \
--tsopt --thermo --out-dir ./result_scan
PDB/GJF companion files are generated when templates are available, controlled by --convert-files (enabled by default).
Usage¶
pdb2reaction all -i INPUT1 [INPUT2 ...] -c SUBSTRATE [-b/--backend uma|orb|mace|aimnet2] [--solvent SOLVENT] [--solvent-model alpb|cpcmx] [options]
For help output, pdb2reaction all --help shows core options and pdb2reaction all --help-advanced shows the full option list.
Examples¶
# Multi-structure MEP with explicit ligand charges and post-processing
pdb2reaction all -i 1.R.pdb 3.P.pdb -c 'SAM,GPP,MG' \
-l 'SAM:1,GPP:-3' --multiplicity 1 --freeze-links \
--max-nodes 10 --max-cycles 100 --climb --opt-mode grad \
--out-dir ./result_mep --tsopt --thermo --dft
# Single-structure staged scan followed by GSM/DMF + TSOPT/freq/DFT
pdb2reaction all -i 1.R.pdb -c 'SAM,GPP,MG' -l 'SAM:1,GPP:-3' \
-s '[("CS1 SAM 320","GPP 321 C7",1.60)]' '[("GPP 321 H11","GLU 186 OE2",0.90)]' \
--opt-mode hess --tsopt --thermo --dft
# TSOPT-only workflow (no path search)
pdb2reaction all -i TS_candidate.pdb -c 'SAM,GPP,MG' \
-l 'SAM:1,GPP:-3' --tsopt --thermo --dft
Workflow¶
Preflight checks (automatic)
all automatically runs add-elem-info (fills missing element symbols in PDB columns 77–78) and fix-altloc (resolves alternate conformations) on every PDB input before any other processing. When using individual subcommands (e.g., extract, opt), you must run these manually if needed.
Active site model (binding pocket) extraction (if -c/--center is provided)
Substrates may be specified via PDB paths, residue IDs (123,124 or A:123,B:456), or residue names (GPP,SAM).
Optional toggles forward to the extractor: --radius, --radius-het2het, --include-h2o, --exclude-backbone, --add-linkh, --selected-resn, and --verbose.
Per-input active site model PDBs are saved under <out-dir>/models/. When multiple structures are supplied, their active site models are unioned per residue selection.
The first active site model’s net charge is propagated to scan/MEP/TSOPT.
Optional staged scan (single-input only)
Each --scan-lists/-s argument is a Python-like list of (i,j,target_Å) tuples describing an MLIP scan stage. Atom indices refer to the original input ordering (1-based) and are remapped to the active site model ordering. For PDB inputs, i/j can be integer indices or selector strings like 'TYR,285,CA'; selectors accept spaces/commas/slashes/backticks/backslashes (,/`\) as delimiters and allow unordered tokens (fallback assumes resname, resseq, atom).
A single literal runs a one-stage scan; multiple literals run sequentially so stage 2 begins from stage 1’s result, and so on. Supply multiple literals as arguments to a single --scan-lists/-s (e.g. -s '[(…)]' '[(…)]').
Stage endpoints (stage_XX/result.pdb) become the ordered intermediates that feed the subsequent MEP step.
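Because each scan stage is a Python-like list of three-element tuples, its shape can be checked before launching a run. The following is a minimal sketch using the standard library's `ast.literal_eval`; the helper name `parse_scan_stage` is hypothetical, and pdb2reaction's own parser (including selector resolution and index remapping) may behave differently.

```python
import ast


def parse_scan_stage(literal: str):
    """Parse one scan-stage literal into a list of (i, j, target) tuples.

    Hypothetical pre-flight check; it validates shape only and does not
    resolve selector strings or remap atom indices.
    """
    stage = ast.literal_eval(literal)  # safe evaluation of a Python literal
    if not isinstance(stage, (list, tuple)):
        raise ValueError("a scan stage must be a list of tuples")
    parsed = []
    for item in stage:
        if len(item) != 3:
            raise ValueError(f"expected (i, j, target), got {item!r}")
        i, j, target = item
        parsed.append((i, j, float(target)))  # target distance in Å
    return parsed


# The two stages used in the single-structure examples on this page.
stages = [
    parse_scan_stage('[("CS1 SAM 320","GPP 321 C7",1.60)]'),
    parse_scan_stage('[("GPP 321 H11","GLU 186 OE2",0.90)]'),
]
```

Each element of `stages` corresponds to one argument passed to a single --scan-lists/-s, in the order the stages run.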
MEP search on active site models (recursive path-search)
By default, runs recursive path-search, which automatically detects multistep reactions and builds a detailed MEP for each elementary step (outputs go to <out-dir>/path_search/). Complex multistep mechanisms may require manual trial-and-error to obtain a satisfactory pathway.
Use --refine-path False to fall back to single-pass path-opt GSM/DMF on each adjacent pair (outputs go to <out-dir>/path_opt/).
For multi-input PDB runs, the full-system templates are automatically passed to path-search for reference merging. Single-structure scan runs reuse the original full PDB template for every stage.
Merge active site models back to the full systems (default with --refine-path)
When --refine-path is True (default) and reference PDB templates exist, merged mep_w_ref*.pdb and per-segment mep_w_ref_seg_XX.pdb files are emitted under <out-dir>/path_search/. With --refine-path False (path-opt mode), full-system merge is not performed.
Optional per-segment post-processing (only for reactive segments — segments with bond changes; bridge segments are skipped)
--tsopt: run TS optimization on each HEI active site model, follow with EulerPC-based IRC, then re-optimize IRC endpoints with --thresh-post (default baker). The endpoint optimization working directory is automatically deleted after completion.
--thermo: call freq on (R, TS, P) to obtain vibrational/thermochemistry data and an MLIP Gibbs diagram.
--dft: launch single-point DFT on (R, TS, P) and build a DFT diagram. When combined with --thermo, a DFT//MLIP Gibbs diagram (DFT energies + MLIP thermal correction) is also produced.
Shared overrides include --opt-mode, --opt-mode-post (overrides TSOPT/post-IRC optimization mode), --flatten/--no-flatten, --hessian-calc-mode, --tsopt-max-cycles, --tsopt-out-dir, --freq-*, --dft-*, and --dft-engine (GPU-first by default).
For Hessian evaluation modes, see Hessian evaluation mode.
TSOPT-only mode (single input, --tsopt, no --scan-lists/-s)
Skips the MEP/merge stages. Runs tsopt on the active site model (or full input if extraction is skipped), performs EulerPC IRC, identifies the higher-energy endpoint as reactant (R), and generates the same set of energy diagrams plus optional freq/DFT outputs.
Charge and spin precedence¶
Charge is resolved via the standard priority chain (see CLI Conventions: Charge specification for details). In the all command, charge derivation from active site model extraction (when -c is specified) acts as an additional priority layer.
Spin resolution: --multiplicity (CLI) → .gjf template → default (1)
Tip: Always provide --ligand-charge/-l for non-standard substrates to ensure correct charge propagation.
Input expectations¶
Extraction enabled (-c/--center): inputs must be PDB files so residues can be located.
Extraction skipped: inputs may be PDB/XYZ/GJF.
Multi-structure runs require ≥2 structures.
CLI Options¶
Note: Default values shown are used when the option is not specified.
Input/Output Options¶
| Option | Description | Default |
|---|---|---|
| | Two or more full structures in reaction order (single input allowed only with | Required |
| | Reference PDB for topology when | None |
| | Top-level output directory. | |
| | Global toggle for XYZ/TRJ → PDB/GJF companions when templates are available. | |
| | Dump MEP (GSM/DMF) trajectories. Always forwarded to | |
| | Base YAML applied first. | None |
| | Print resolved configuration before execution. | |
| | Validate and print plan without running stages. | |
| | Resume a previous run from | |
Charge/Spin Options¶
| Option | Description | Default |
|---|---|---|
| | Net charge or per-resname mapping used when | None |
| | Force the net system charge (overrides | None |
| | Spin multiplicity forwarded to all downstream steps. | |
Extraction Options¶
| Option | Description | Default |
|---|---|---|
| | Substrate specification (PDB path, residue IDs, or residue names). | Required for extraction |
| | Active site model inclusion cutoff (Å). | |
| | Independent hetero–hetero cutoff (Å). Passing | |
| | Include waters (HOH/WAT/TIP3/SOL). | |
| | Remove backbone atoms on non-substrate amino acids. | |
| | Add link hydrogens for severed bonds. | |
| | Residues to force include. Despite the name, this flag accepts residue IDs (colon-separated integers with optional chains/insertion codes, e.g. | |
| | Comma-separated residue names (with optional charge) to treat as amino acids for backbone truncation and charge assignment (e.g., | |
| | Freeze link parents in active site model PDBs. | |
| | Enable INFO-level extractor logging. | |
MEP Search Options¶
| Option | Description | Default |
|---|---|---|
| | MEP search algorithm: GSM (Growing String Method) or DMF (Direct Max Flux). | |
| | MEP internal nodes per segment. GSM: total images = | |
| | MEP maximum optimization cycles. | |
| | Enable TS climbing for the first segment. | |
| | Workflow preset ( | hess |
| | Convergence preset ( | |
| | Pre-optimize active site model endpoints before MEP search. Note: | |
| | If True (default), run recursive | |
MLIP Calculator Options¶
| Option | Description | Default |
|---|---|---|
| | MLIP predictor parallelism (workers > 1 disables analytic Hessians; UMA backend only). See workers > 1 silent FD downgrade for diagnostic notes. | |
| | Shared MLIP Hessian engine. | |
| | MLIP backend. | |
| | Implicit solvent name for xTB correction (e.g. | |
| | xTB solvent model. | |
Post-Processing Options¶
| Option | Description | Default |
|---|---|---|
| | Run TS optimization + IRC per reactive segment. | |
| | Run vibrational analysis ( | |
| | Run single-point DFT on R/TS/P. | |
| | Optimizer preset override for TSOPT and post-IRC optimization ( | |
| | Convergence preset for post-IRC endpoint optimizations ( | |
| | Enable surplus-imaginary-mode flattening in | |
Warning
The --dft single-point calculations (powered by PySCF/GPU4PySCF) are very expensive for models exceeding ~300 atoms. For such systems, HPC clusters with high-end GPUs (e.g. A100, H200) are typically required.
TSOPT optimizer selection order: --opt-mode-post (if set) → --opt-mode (only when explicitly provided) → TSOPT default (hess → rsirfo).
Example: --opt-mode grad --opt-mode-post hess uses LBFGS for path optimization and RS-I-RFO for TS refinement.
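The selection order above is a first-match fallback chain. A minimal sketch, assuming unset options are represented as `None`; the function name `resolve_tsopt_mode` is hypothetical and does not correspond to a documented pdb2reaction API.

```python
from typing import Optional


def resolve_tsopt_mode(opt_mode_post: Optional[str],
                       opt_mode: Optional[str]) -> str:
    """Pick the TSOPT optimizer preset following the documented order.

    Hypothetical helper: None means the flag was not given on the CLI.
    """
    if opt_mode_post is not None:
        return opt_mode_post  # --opt-mode-post wins when set
    if opt_mode is not None:
        return opt_mode       # --opt-mode counts only when explicitly provided
    return "hess"             # TSOPT default preset
```

With the example from the text, `resolve_tsopt_mode("hess", "grad")` yields `hess` for TS refinement while path optimization keeps using `grad`.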
TSOPT Overrides¶
| Option | Description | Default |
|---|---|---|
| | Override | |
| | Custom tsopt subdirectory. | None |
Freq Overrides¶
| Option | Description | Default |
|---|---|---|
| | Base directory override for freq outputs. | None |
| | Maximum modes to write. | |
| | Mode animation amplitude (Å). | |
| | Frames per mode animation. | |
| | Mode sorting behavior. | |
| | Thermochemistry temperature (K). | |
| | Thermochemistry pressure (atm). | |
DFT Overrides¶
| Option | Description | Default |
|---|---|---|
| | DFT backend: gpu (GPU4PySCF) or cpu (PySCF). In the | |
| | Base directory override for DFT outputs. | None |
| | Functional/basis pair. | |
| | Maximum SCF iterations. | |
| | SCF convergence tolerance. | |
| | PySCF grid level. | |
Scan Options (Single-Input Runs)¶
| Option | Description | Default |
|---|---|---|
| | Staged scans: | None |
| | Override the scan output directory. | None |
| | Force scan indexing to 1-based or 0-based. | None |
| | Maximum step size (Å). | |
| | Harmonic bias strength (eV·Å⁻²). | |
| | Relaxation max cycles per step. | |
| | Override the scan preoptimization toggle. | None |
| | Override the scan end-of-stage optimization toggle. | None |
Outputs¶
out_dir/ (default: ./result_all/)
├─ summary.log # Text summary
├─ summary.json # JSON results
├─ models/ # Extracted active site model PDBs when extraction runs
├─ scan/ # Staged scan results (present when --scan-lists is provided)
├─ seg_XX/ # Refined TS and optimized IRC endpoints of segment XX
│ ├─ reactant.{pdb,xyz,gjf} # Output format matches input format
│ ├─ ts.{pdb,xyz,gjf}
│ └─ product.{pdb,xyz,gjf}
├─ path_search/ # MEP results (recursive path-search, default); path_opt/ when --refine-path False
│ └─ post_seg_XX/ # Post-processing: TSOPT, IRC, freq, DFT per segment
│ ├─ structures/ # Optimized R/TS/P structures (IRC endpoints)
│ ├─ irc/ # IRC trajectories and plots
│ ├─ ts/ # TS optimization output and vibrational analysis
│ ├─ freq/ # Frequency and thermochemistry (R, TS, P)
│ └─ dft/ # DFT single-point results (when --dft is enabled)
└─ tsopt_single/ # TSOPT-only outputs with IRC endpoints
Note
seg_XX/ vs path_search/post_seg_XX/. The two per-segment trees serve different purposes and are not duplicates:
seg_XX/ (top level of out_dir) is the segment-level merged result aggregating the final reactant/TS/product structures for reactive segment XX. It is written after the full post-processing pipeline converges and holds the canonical reactant.{pdb,xyz,gjf}, ts.{pdb,xyz,gjf}, product.{pdb,xyz,gjf} you should cite when reporting mechanisms.
path_search/post_seg_XX/ is the per-segment post-processing workspace. It stores the intermediate products of each stage (ts/, irc/, structures/, freq/, dft/) and is the right place to inspect when debugging a single stage (e.g. checking ts/vib/imag_*_trj.xyz or irc/*_trj.xyz). Reactive segments get populated; bridge segments (no bond change) are skipped.
When --refine-path False is passed, the workspace moves under path_opt/post_seg_XX/ instead.
Console logs summarizing active site model charge resolution, YAML contents, scan stages, MEP progress (GSM/DMF), and per-stage timing.
Energy diagram naming convention¶
Energy diagram files are named by method and scope:
| File name | Generated when | Content |
|---|---|---|
| | path-opt/path-search completes | All-segment MEP barriers (raw GSM/DMF values) |
| | per-segment tsopt+IRC completes | R→TS→P (MLIP energy) |
| | per-segment thermo completes | R→TS→P (MLIP Gibbs free energy) |
| | per-segment DFT completes | R→TS→P (DFT energy) |
| | per-segment DFT+thermo completes | R→TS→P (DFT energy + MLIP thermal correction) |
| | all segments aggregated | All segments combined (MLIP) |
| | all segments + thermo | All segments combined (MLIP Gibbs) |
| | all segments + DFT | All segments combined (DFT) |
| | all segments + DFT + thermo | All segments combined (DFT//MLIP Gibbs) |
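The DFT//MLIP Gibbs series is described above as DFT energy plus an MLIP thermal correction, and the Notes state that diagram energies are relative to the first state in kcal/mol. The sketch below illustrates that arithmetic under one assumption: the thermal correction is taken as the MLIP Gibbs energy minus the MLIP electronic energy for the same structure. The function names are hypothetical, and this is not a statement of how pdb2reaction computes the values internally.

```python
HARTREE_TO_KCAL = 627.5094740631  # kcal/mol per Hartree


def dft_mlip_gibbs(e_dft, e_mlip, g_mlip):
    """Combine DFT energies with MLIP thermal corrections (all in Hartree).

    Assumption: correction = G_MLIP - E_MLIP for each state (R, TS, P).
    """
    return [ed + (gm - em) for ed, em, gm in zip(e_dft, e_mlip, g_mlip)]


def relative_kcal(energies_au):
    """Express energies relative to the first state, in kcal/mol."""
    ref = energies_au[0]
    return [(e - ref) * HARTREE_TO_KCAL for e in energies_au]
```

Passing the R, TS, and P values through `dft_mlip_gibbs` and then `relative_kcal` gives a series whose first entry is zero, matching the convention used in the diagrams.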
Reading summary.log¶
The log is organized into numbered sections:
[1] Global MEP overview – image/segment counts, MEP trajectory plot paths, and the aggregate MEP energy diagram.
[2] Segment-level MEP summary (MLIP path) – per-segment barriers (ΔE‡), reaction energies (ΔE), and bond-change summaries.
[3] Per-segment post-processing (TSOPT / Thermo / DFT) – per-segment TS imaginary frequency checks, IRC outputs, and MLIP/thermo/DFT energy tables.
[4] Energy diagrams (overview) – diagram tables for MEP/MLIP/Gibbs/DFT series plus an optional cross-method summary table.
[5] Output directory structure – a compact tree of generated files with inline annotations.
Reading summary.json¶
The JSON summary contains structured data. Common top-level keys include:
out_dir, n_images, n_segments – run metadata and total counts.
segments – list of per-segment entries with index, tag, kind, barrier_kcal, delta_kcal, and bond_changes.
energy_diagrams (optional) – diagram payloads with labels, energies_kcal, energies_au, ylabel, and image paths.
summary.json intentionally omits the formatted tables and filesystem tree that appear in summary.log.
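Since summary.json is structured, per-segment barriers can be pulled out with a few lines of standard-library Python. This is a sketch that relies only on the keys listed above (segments, index, kind, barrier_kcal, delta_kcal); the function name `segment_table` is hypothetical, and missing keys are returned as `None` rather than assumed present.

```python
import json
from pathlib import Path


def segment_table(summary_path):
    """Return (index, kind, barrier_kcal, delta_kcal) for each segment."""
    data = json.loads(Path(summary_path).read_text())
    rows = []
    for seg in data.get("segments", []):
        rows.append((
            seg.get("index"),
            seg.get("kind"),
            seg.get("barrier_kcal"),  # per-segment barrier, kcal/mol
            seg.get("delta_kcal"),    # per-segment reaction energy, kcal/mol
        ))
    return rows

# Example call (path follows the default output directory):
# rows = segment_table("result_all/summary.json")
```

The returned tuples can be sorted or filtered directly, for instance to find the segment with the highest barrier.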
Notes¶
For symptom-first diagnosis, start with Common Error Recipes, then use Troubleshooting for detailed fixes.
Always provide --ligand-charge/-l (numeric or per-residue mapping) when formal charges cannot be inferred so the correct net charge propagates to scan/MEP/TSOPT/DFT.
Reference PDB templates for merging are derived automatically from the original inputs; the explicit --ref-full-pdb option of path-search is hidden in this wrapper.
Convergence presets: --thresh defaults to gau; --thresh-post defaults to baker.
Extraction radii: passing 0 to --radius or --radius-het2het is internally clamped to 0.001 Å by the extractor.
Energies in diagrams are reported relative to the first state (reactant) in kcal/mol.
Omitting -c/--center skips extraction and feeds the entire input structures directly to the MEP/tsopt/freq/DFT stages; single-structure runs still require either --scan-lists/-s or --tsopt.
--resume: Re-run the same command with --resume to skip stages whose output files already exist. Each stage is guarded by sentinel-file checks (e.g. summary.json for MEP, final_geometry.* + finished_irc_trj.xyz for TSOPT/IRC, R/ + TS/ + P/ directories for freq/DFT). When extraction is skipped on resume, provide -q/--charge or --ligand-charge/-l explicitly so the charge can be resolved without re-running the extractor. Sentinel-corruption caveat: --resume only checks for the presence of the sentinel files, not their integrity. If a stage was killed mid-write (SIGKILL, OOM, cluster preemption) and the sentinel file was already on disk but is truncated or corrupted, --resume will still consider the stage complete. Delete the stage’s output directory (e.g. path_search/post_seg_XX/ts/) before resuming if you suspect a partially written result.
all supports layered YAML:
--config FILE: base settings.
defaults < config < CLI
The effective YAML is forwarded to every invoked subcommand. Each tool reads the sections described in its own documentation:
Subcommand |
YAML Sections |
|---|---|
|
|
|
|
|
|
|
|
|
|
|
Minimal example:
calc:
  model: uma-s-1p1                # uma-s-1p1 | uma-m-1p1
  hessian_calc_mode: Analytical   # recommended when VRAM permits
gs:
  max_nodes: 12
  climb: true
dft:
  grid_level: 6
For a complete reference of all YAML options, see YAML Configuration Reference.
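The precedence defaults < config < CLI means later layers override earlier ones. A minimal sketch of that layering follows, assuming a recursive (per-key) merge of nested sections; whether pdb2reaction merges nested sections this way is an assumption, the helper names are hypothetical, and the numbers in `defaults` and `cli` are placeholders rather than the tool's real defaults.

```python
from copy import deepcopy


def _merge_into(target: dict, source: dict) -> None:
    """Recursively copy `source` into `target`, overriding existing keys."""
    for key, value in source.items():
        if isinstance(value, dict) and isinstance(target.get(key), dict):
            _merge_into(target[key], value)  # merge nested sections key by key
        else:
            target[key] = deepcopy(value)    # later layer replaces the value


def merge_layers(*layers: dict) -> dict:
    """Merge layers left to right; the rightmost layer has highest priority."""
    merged: dict = {}
    for layer in layers:
        _merge_into(merged, layer)
    return merged


defaults = {"gs": {"max_nodes": 10, "climb": False}}  # placeholder values
config = {"gs": {"max_nodes": 12, "climb": True}}     # from --config FILE
cli = {"gs": {"max_nodes": 8}}                        # placeholder CLI override
effective = merge_layers(defaults, config, cli)
```

In this sketch the CLI value for `max_nodes` wins, while `climb` keeps the value from the config file because the CLI layer does not mention it.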
See Also¶
Installation — Setup and dependency installation
Getting Started — First run, workflow overview, and key concepts
extract — Standalone active site model extraction (called internally by all)
scan — Standalone staged distance scan
path-opt — Single-pass MEP optimization (GSM/DMF)
path-search — Recursive MEP search (called internally by all)
tsopt — Standalone TS optimization
irc — Standalone IRC calculation
freq — Standalone vibrational analysis
dft — Standalone DFT calculations
Common Error Recipes — Symptom-first failure routing
Troubleshooting — Common errors and fixes
YAML Reference — Complete YAML configuration options
Glossary — Definitions of MEP, TS, IRC, GSM, DMF