Architecture: pdb2reaction


1. Overview

pdb2reaction is a Python CLI that performs pure-MLIP enzymatic reaction-path analysis on an active-site cluster model. Given a PDB file and a substrate name, it extracts the active-site cluster, caps severed bonds with hydrogens, and runs Hessian-based RS-I-RFO TS optimization on the MLIP potential to produce the reaction path. The full pipeline runs in the order extract → MEP → tsopt → IRC → freq → dft.

Two bundled forks (pysisyphus/, thermoanalysis/) live at the repo top as repo-internal modules. They are deliberately not the upstream PyPI distributions; reinstalling them from PyPI alongside this package silently breaks the local extensions. See §6.


2. Layered structure (6 physical directories)

2.1 Layer table

| layer | dir | responsibility | may depend on |
|---|---|---|---|
| L1 Interface | pdb2reaction/cli/ | Click root group, shared option-decorator factories (common_options.py), --help-advanced, bool flag normalization, subcommand resolver | workflows/, core/ |
| L2 Application | pdb2reaction/workflows/ | per-subcommand orchestration; one file per stage runner (all.py, path_search.py, tsopt.py, extract.py, irc.py, freq.py, dft.py, …) | domain/, backends/, io/, core/ |
| L3 Domain | pdb2reaction/domain/ | chemistry-aware helper logic (bond change detection, bond summary, element-info propagation) | core/ |
| L4a Infra (MLIP) | pdb2reaction/backends/ | MLIP backend dispatcher + per-backend adapter (UMA / Orb / MACE / AIMNet2) + xTB ALPB delta correction | core/ |
| L4b Infra (I/O) | pdb2reaction/io/ | output layout, summary, trajectory, PDB fix, energy diagram, Hessian cache | core/ |
| L5 Foundation | pdb2reaction/core/ | defaults (single source of truth), utils (PDB / XYZ / plot helpers), logging, future errors.py / types.py | (none) |
| (bundle, not a layer) | <repo>/pysisyphus/, <repo>/thermoanalysis/ | repo-internal forks (optimizer / thermochemistry) | (sibling, layer-external) |

Dependency direction (one-way): L1 → L2 → {L3, L4} → L5. The directional rule is enforced by CI marker coverage (.github/scripts/check_engineering_markers.py). Bundled forks sit outside the layer graph and may be imported from any layer via their absolute package path (from pysisyphus.X import Y).
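To make the one-way rule concrete, here is a minimal sketch of how a layer check could be written with the standard-library ast module. It is not the logic of check_engineering_markers.py, which this document does not show. The ALLOWED mapping is copied from the "may depend on" column of the layer table, and the function name layer_violations is hypothetical.

```python
import ast

# Allowed targets per layer directory, copied from the "may depend on" column.
ALLOWED = {
    "cli": {"workflows", "core"},
    "workflows": {"domain", "backends", "io", "core"},
    "domain": {"core"},
    "backends": {"core"},
    "io": {"core"},
    "core": set(),
}


def layer_violations(layer, source, package="pdb2reaction"):
    """Return absolute imports in `source` that `layer` may not depend on."""
    bad = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            parts = name.split(".")
            # Third-party imports and the bundled forks are outside the layer graph.
            if len(parts) < 2 or parts[0] != package or parts[1] not in ALLOWED:
                continue
            target = parts[1]
            if target != layer and target not in ALLOWED[layer]:
                bad.append(name)
    return bad
```

Called as layer_violations("domain", text_of_file), the sketch reports any import that points upward or sideways in the layer graph; imports of pysisyphus or thermoanalysis are ignored because they are not layers.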

2.2 ASCII map of the package tree

pdb2reaction/ [GH: t-0hmura/pdb2reaction]
├── pyproject.toml packages.find = ["pdb2reaction*", ...] (glob, frozen)
├── README.md / CONTRIBUTING.md / CHANGELOG.md
├── docs/
│ ├── architecture.md ← this file
│ └──... (Sphinx site, unchanged)
├── pdb2reaction/ ← package body, 6-layer physical dir
│ ├── __init__.py PEP 562 lazy: _LAZY_SYMBOLS / _LAZY_MODULES + __getattr__
│ ├── __main__.py `from pdb2reaction.cli.app import cli`
│ ├── _version.py / py.typed
│ │
│ ├── cli/ # === L1 Interface ===
│ │ ├── app.py Click group + _LAZY_SUBCOMMANDS registry (absolute paths)
│ │ ├── common_options.py @add_print_every_option / @add_irc_pos_def_option / @add_precision_option / @add_coord_type_option / @add_ml_charge_spin_options
│ │ ├── decorators.py resolve_yaml_sources / load_merged_yaml_cfg / _write_error_json
│ │ ├── help_pages.py --help-advanced pager
│ │ ├── bool_compat.py --flag / --no-flag normalization
│ │ └── default_group.py subcommand resolver, lazy module import
│ │
│ ├── workflows/ # === L2 Application ===
│ │ ├── all.py full pipeline orchestrator (extract → … → DFT)
│ │ ├── path_search.py / path_opt.py MEP search / COS wrapper
│ │ ├── tsopt.py / freq.py / irc.py / dft.py per-stage runners
│ │ ├── opt.py / sp.py / scan.py / scan2d.py /
│ │ │ scan3d.py / scan_common.py geometry opt / single point / scans
│ │ ├── extract.py active-site extraction CLI
│ │ ├── restraints.py restraint helpers
│ │ └── align_freeze.py Kabsch + frozen-subset rmsd
│ │
│ ├── domain/ # === L3 Domain ===
│ │ ├── bond_changes.py R↔P bond detection
│ │ ├── bond_summary.py post-IRC diagnostic
│ │ └── add_elem_info.py PDB element column normalizer
│ │
│ ├── backends/ # === L4a Infra (MLIP) ===
│ │ ├── __init__.py backend dispatch + registry
│ │ ├── base.py MLIPCalculator protocol
│ │ ├── uma.py / orb.py / mace.py / aimnet2.py per-backend adapters
│ │ ├── solvent.py xTB ALPB implicit-solvent helper
│ │ └── xtb_alpb_correction.py xTB ALPB delta correction
│ │
│ ├── io/ # === L4b Infra (I/O) ===
│ │ ├── summary.py summary.json / summary.log writer
│ │ ├── energy_diagram.py Plotly diagram
│ │ ├── trj2fig.py trajectory → PNG / SVG / PDF / HTML / CSV
│ │ ├── pdb_fix.py altloc resolution
│ │ └── hessian_cache.py in-memory Hessian cache
│ │
│ └── core/ # === L5 Foundation ===
│   ├── defaults.py C1 single source of truth for every default
│   └── utils.py PDB / XYZ / plot helpers
│
├── tests/ smoke / unit
├── .github/ workflows/ + scripts/ (docs-quality lint helpers; CI-only)
└── (repo-top sibling, layer-external bundled forks)
 pysisyphus/ ~90 files, repo-internal fork (slimmed; CLI driver + QM backends + wavefunction + dead optimisers / IRC / NEB variants removed)
 thermoanalysis/ 5 files, repo-internal fork

2.3 Per-layer responsibility detail

L1 cli/ (~6 files). Only this layer constructs Click commands and parses argv. app.py holds the root click.Group plus the _LAZY_SUBCOMMANDS registry; every entry uses an absolute module path (pdb2reaction.workflows.all, pdb2reaction.io.trj2fig, …) so the resolver is independent of where default_group.py itself lives. common_options.py collects the option-decorator factories shared across subcommands (@add_print_every_option, @add_irc_pos_def_option, @add_precision_option, @add_coord_type_option, @add_ml_charge_spin_options); subcommand bodies stack these decorators above @click.pass_context to keep --help text in lock-step.

L2 workflows/ (18 files). One file per subcommand. Each file owns a single @click.command() named cli and its private helpers. Large stage runners (all.py = 5,131 LOC, path_search.py = 2,771 LOC, tsopt.py = 2,121 LOC, extract.py = 2,113 LOC) remain as single files in the current layout.

L3 domain/. Chemistry-aware helper logic that may import torch / numpy / pysisyphus.constants (numeric back-ends), but may not import MLIP runtimes (fairchem, orb_models, mace, aimnet). .github/scripts/check_engineering_markers.py enforces this deny list via an external-library import-scope check across non-backends/ files. (The # DOMAIN_PURE docstring marker itself lives on selected workflow modules — workflows/dft.py, tsopt.py, sp.py — not on domain/.) Domain helpers are reusable by any L2 stage runner.

L4a backends/ (~8 files). MLIP backend dispatcher (__init__.py + base.py) plus one adapter per supported MLIP (uma.py, orb.py, mace.py, aimnet2.py). solvent.py and xtb_alpb_correction.py carry the xTB ALPB implicit-solvent delta correction (an opt-in MLIP wrapper). pdb2reaction is a pure-MLIP cluster-model package.

L4b io/. Output-side I/O concerns: per-stage summary writer, energy diagram, trajectory rendering, PDB altloc fix, in-memory Hessian cache. io/ never depends on workflows/; output format is owned here and consumed by stage runners.

L5 core/. The lowest layer. defaults.py is the single source of truth for every CLI default — grep here before adding a number anywhere else. utils.py is a ~3,200-LOC grab-bag of PDB / XYZ / plotting helpers.

2.4 Lazy-import mechanism (conceptual diagram)

External consumer                        Package root             Layer dir
---------------------------------------  ----------------------   ---------

from pdb2reaction.core.utils import x ──► (direct dotted import) ──► pdb2reaction/core/utils.py
import pdb2reaction.io.trj2fig        ──► (direct dotted import) ──► pdb2reaction/io/trj2fig.py

from pdb2reaction import <Symbol>     ──► pdb2reaction/__init__.py
                                          __getattr__("<Symbol>")
                                          └─► _LAZY_SYMBOLS["<Symbol>"]
                                              = "pdb2reaction.<layer>.<module>"
                                              └─► importlib.import_module(...)

from pdb2reaction import <module>     ──► pdb2reaction/__init__.py
        (= module attr)                   __getattr__("<module>")
                                          └─► _LAZY_MODULES["<module>"]
                                              = "pdb2reaction.<layer>.<module>"
                                              └─► importlib.import_module(...) returns module

pdb2reaction myaction                 ──► pdb2reaction/cli/app.py
                                          _LAZY_SUBCOMMANDS["myaction"]
                                          = ("pdb2reaction.workflows.myaction", "cli", "...")
                                          └─► importlib.import_module(absolute path)
                                              └─► getattr(module, "cli") → Click command

Two layers of lazy-import compatibility plus CLI dispatch:

  1. Root symbol attribute (from pdb2reaction import <Symbol>) — handled by pdb2reaction/__init__.py:_LAZY_SYMBOLS + PEP 562 __getattr__. Symbols are loaded on first access from the layer-dir path; import cost stays zero at pdb2reaction import time.

  2. Root module attribute (from pdb2reaction import <module>) — handled by _LAZY_MODULES. __getattr__ returns the module object itself via importlib.import_module. pdb2reaction currently has 0 consumed module-attr paths (the registry is empty — root attribute access is reserved for future expansion).

The CLI subcommand resolver (cli/app.py:_LAZY_SUBCOMMANDS) uses absolute module paths (e.g. "pdb2reaction.workflows.all") so that moving default_group.py into cli/ does not silently break subcommand discovery (the registry no longer depends on __package__).
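The mechanism above can be sketched in a few lines. This is an illustration under stated assumptions, not the contents of pdb2reaction/__init__.py or cli/app.py: the registry entries are hypothetical and point at standard-library modules so the sketch runs without pdb2reaction installed, and the third tuple element of a subcommand entry is treated as opaque.

```python
import importlib
import sys
import types

# Hypothetical registry contents; stdlib targets stand in for layer modules.
_LAZY_SYMBOLS = {"OrderedDict": "collections"}   # symbol name -> defining module
_LAZY_MODULES = {"decoder": "json.decoder"}      # attribute name -> module path
_LAZY_SUBCOMMANDS = {"dump": ("json", "dumps", "...")}  # name -> (module, attr, extra)


def _lazy_getattr(name):
    """PEP 562 hook: runs only when normal attribute lookup on the module fails."""
    if name in _LAZY_SYMBOLS:
        return getattr(importlib.import_module(_LAZY_SYMBOLS[name]), name)
    if name in _LAZY_MODULES:
        return importlib.import_module(_LAZY_MODULES[name])
    raise AttributeError(f"module 'demo_pkg' has no attribute {name!r}")


def resolve_subcommand(name):
    """Import the module named by an absolute path and fetch its command object."""
    module_path, attr, _extra = _LAZY_SUBCOMMANDS[name]
    return getattr(importlib.import_module(module_path), attr)


# A throwaway module plays the role of the package root.
demo_pkg = types.ModuleType("demo_pkg")
demo_pkg.__getattr__ = _lazy_getattr
sys.modules["demo_pkg"] = demo_pkg
```

With this in place, from demo_pkg import OrderedDict triggers the import of collections only at that moment, and resolve_subcommand("dump") returns the target callable without consulting __package__ at all.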


3. Fresh-eyes 5-step navigation (≈ 40 min total)

For a contributor opening the repo for the first time, follow this path top-to-bottom; each step closes one concern.

| step | minutes | open | what you learn |
|---|---|---|---|
| 1 | 3 | README.md | one-paragraph elevator pitch + single-command usage |
| 2 | 5 | this file (docs/architecture.md) §2 + §4 | 6-layer dir tree, dependency direction, where each concern lives |
| 3 | 5 | pdb2reaction/cli/app.py | Click root group, _LAZY_SUBCOMMANDS registry (≈ 18 entries), absolute-path resolution |
| 4 | 20 | pdb2reaction/workflows/all.py (5,131 LOC, skim) | one full subcommand top-to-bottom; trace extract → MEP → tsopt → IRC → freq → dft |
| 5 | 7 | CONTRIBUTING.md §3 + §4 | 5 add-a-X recipes + the “do not touch” hidden constraints |

After step 5 you can read any other file by following the file index in §4. The package is intentionally flat-within-each-layer — there is no nested package below pdb2reaction/<layer>/, so you never need to navigate more than two directories deep.


4. File index — “where does this concern live?”

4.1 CLI / entry (L1 cli/)

| concern | file |
|---|---|
| Click root group + subcommand dispatch | pdb2reaction/cli/app.py |
| Subcommand resolver (lazy import) | pdb2reaction/cli/default_group.py |
| python -m pdb2reaction entry | pdb2reaction/__main__.py |
| YAML source resolution + standardized exception handling | pdb2reaction/cli/decorators.py |
| --help-advanced pager | pdb2reaction/cli/help_pages.py |
| Bool flag compat (--flag / --no-flag + value style) | pdb2reaction/cli/bool_compat.py |
| Shared option-decorator factories (--print-every, --irc-pos-def, --precision, --coord-type, --charge / --ligand-charge / --multiplicity) | pdb2reaction/cli/common_options.py |

4.2 Workflow stage runners (L2 workflows/)

| concern | file |
|---|---|
| Full pipeline orchestrator | pdb2reaction/workflows/all.py |
| Geometry optimization (LBFGS / RFO) | pdb2reaction/workflows/opt.py |
| 1D / 2D / 3D scans + shared | pdb2reaction/workflows/scan{,2d,3d,_common}.py |
| MEP search (GSM) | pdb2reaction/workflows/path_search.py |
| MEP optimizer core (pysisyphus COS) | pdb2reaction/workflows/path_opt.py |
| TS optimization (RSIRFO + Bofill + macro/micro) | pdb2reaction/workflows/tsopt.py |
| Vibrational analysis (PHVA + UMA active block) | pdb2reaction/workflows/freq.py |
| IRC integration (macro / micro) | pdb2reaction/workflows/irc.py |
| Single-point DFT (gpu4pyscf subprocess) | pdb2reaction/workflows/dft.py |
| Active-site extraction (cluster cap) | pdb2reaction/workflows/extract.py |
| Restraint helpers | pdb2reaction/workflows/restraints.py |
| Kabsch / frozen-subset alignment | pdb2reaction/workflows/align_freeze.py |

4.3 Chemistry helpers (L3 domain/)

| concern | file |
|---|---|
| R↔P bond change detection | pdb2reaction/domain/bond_changes.py |
| Post-IRC bond summary | pdb2reaction/domain/bond_summary.py |
| PDB element column normalizer | pdb2reaction/domain/add_elem_info.py |

4.4 MLIP backends (L4a backends/)

| concern | file |
|---|---|
| Backend dispatch + registry | pdb2reaction/backends/__init__.py |
| MLIPCalculator protocol + base | pdb2reaction/backends/base.py |
| Per-backend adapters | pdb2reaction/backends/{uma, orb, mace, aimnet2}.py |
| xTB ALPB implicit-solvent helper | pdb2reaction/backends/solvent.py |
| xTB ALPB delta correction | pdb2reaction/backends/xtb_alpb_correction.py |

See Backends for the add-a-backend recipe.

4.5 I/O (L4b io/)

| concern | file |
|---|---|
| summary.json / summary.log writer | pdb2reaction/io/summary.py |
| Plotly energy diagram | pdb2reaction/io/energy_diagram.py |
| Trajectory → PNG / SVG / PDF / HTML / CSV | pdb2reaction/io/trj2fig.py |
| PDB altloc resolution | pdb2reaction/io/pdb_fix.py |
| In-memory Hessian cache (per-run TTL) | pdb2reaction/io/hessian_cache.py |
| Harmonic restraint setup | pdb2reaction/workflows/restraints.py (L2 stage helper) |

4.6 Foundation (L5 core/)

| concern | file |
|---|---|
| Every CLI default (single source of truth) | pdb2reaction/core/defaults.py |
| PDB / XYZ / plot helpers | pdb2reaction/core/utils.py |
| -v / -vv logging wiring | pdb2reaction/core/logging.py |

4.7 Repo-internal bundled forks

| dir | role | divergent files (do NOT replace with upstream) |
|---|---|---|
| pysisyphus/ | optimizer / TS / IRC engine | irc/IRC.py (opt-in require_pos_def_hessian PSD convergence guard), optimizers/hessian_updates.py (Bofill scatter on advanced indices, CPU-only bofill_update path for GPU OOM avoidance), tsoptimizers/{RSIRFOptimizer,RSPRFOptimizer,TRIM,TSHessianOptimizer}.py, calculators/{Calculator,Dimer}.py, _array.py (torch/numpy backend shim) |
| thermoanalysis/ | thermochemistry (ΔG, ZPE, partition functions) | QCData.py (branding diff vs upstream) |

See each dir’s README.md for the touch-restriction boundary.


5. Hidden constraints (read this before any patch)

5.1 Chemistry rules (grep recipe)

Three correctness-critical rules are spread across backends/, workflows/, and core/defaults.py. They are not detected by smoke tests — silent drift here breaks reaction-path accuracy. Inline # CHEMISTRY-RULE:N markers and # DOMAIN_PURE module-docstring markers identify the rules; .github/scripts/check_engineering_markers.py enforces marker completeness in CI.

To find every chemistry rule before editing:

# List all rule sites in the repo (host file + line)
grep -rnE '# CHEMISTRY-RULE:[0-9]+' pdb2reaction/

# List every # DOMAIN_PURE marker (= chemistry-rule host modules)
grep -rn '# DOMAIN_PURE' pdb2reaction/

The three rules (marker IDs are non-contiguous) are:

| marker | rule | host file |
|---|---|---|
| 4 | gpu4pyscf rks_lowmem triple-guard | pdb2reaction/workflows/dft.py |
| 5 | def2 family auto-ECP injection | pdb2reaction/workflows/dft.py |
| 7 | bofill_update advanced-indexing scatter | pdb2reaction/workflows/tsopt.py |

Editing any of these requires a [CHEMISTRY-RULE:N] commit prefix and a HEAVY-tier numerical-golden gate pass (see CONTRIBUTING.md §1.1). Learn the DFT pair (#4, #5) first, then the TS scatter rule (#7).
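As an illustration of what a marker-completeness check can look like, the sketch below scans source text for # CHEMISTRY-RULE:N markers and reports which expected IDs are missing. It is an assumption-laden stand-in, not the actual check_engineering_markers.py: the function names are hypothetical, the expected IDs (4, 5, 7) come from the rule list in this section, and files are passed as in-memory strings to keep the example self-contained.

```python
import re

# Same pattern as the grep recipe in this section.
MARKER = re.compile(r"# CHEMISTRY-RULE:(\d+)")


def find_rule_markers(files):
    """Map rule ID -> list of (path, line number) for every marker found.

    `files` maps a path to its text; a real check would read them from disk.
    """
    sites = {}
    for path, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in MARKER.finditer(line):
                sites.setdefault(int(match.group(1)), []).append((path, lineno))
    return sites


def missing_rules(files, expected=(4, 5, 7)):
    """Return the expected rule IDs that have no marker in any file."""
    return sorted(set(expected) - set(find_rule_markers(files)))
```

A non-empty result from missing_rules means a marker was deleted or renumbered, which is the kind of silent drift the CI check exists to catch.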

5.2 VRAM-management invariant (do not refactor del chains)

The IRC / TSopt / Freq stages explicitly del GPU-resident objects (calc, geom, hess) between stages to free CUDA memory; the all workflow additionally runs gc.collect() at stage boundaries. Do not refactor these del / gc.collect() statements out — long-running all jobs with large active-site models OOM without them.
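The pattern being protected can be shown with a small stand-in. The sketch below uses plain Python objects instead of CUDA tensors, so GpuObject and run_stages are hypothetical names, but the mechanics are the same: del drops the local names at the stage boundary, and gc.collect() reclaims objects that reference each other, which reference counting alone would not free.

```python
import gc
import weakref


class GpuObject:
    """Hypothetical stand-in for a GPU-resident calc / geom / hess object."""

    def __init__(self, label):
        self.label = label
        self.owner = None  # back-reference; forms a cycle below


def run_stages(stage_names):
    """Run stages in order and return weak references to each stage's Hessian."""
    probes = []
    for name in stage_names:
        calc = GpuObject(name + "-calc")
        hess = GpuObject(name + "-hess")
        calc.owner, hess.owner = hess, calc  # cycle: refcounting cannot free it
        probes.append(weakref.ref(hess))
        # ... stage work would happen here ...
        del calc, hess  # drop the local names at the stage boundary
        gc.collect()    # reclaim the cycle now, not at some later GC pass
    return probes
```

After run_stages returns, every weak reference is dead, meaning each stage's objects were released before the next stage allocated its own. Removing either statement lets a stage's memory linger into the next one.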

5.3 Bundled forks: do NOT install upstream alongside

The bundled pysisyphus/ and thermoanalysis/ packages are forks. Reinstalling pip install pysisyphus or pip install thermoanalysis next to this package silently breaks:

  • pysisyphus/irc/IRC.py — initial-displacement memory hygiene + opt-in require_pos_def_hessian kwarg

  • pysisyphus/optimizers/hessian_updates.py — Bofill scatter on advanced indices, CPU-only bofill_update path for GPU OOM avoidance

  • pysisyphus/tsoptimizers/TSHessianOptimizer.py — RSIRFO kwargs (host package import path divergent between forks)

  • pysisyphus/calculators/{Calculator,Dimer}.py — GPU-aware backend hooks (the 30+ QM backends have been removed; only the abstract base + Dimer TS calculator remain)

  • pysisyphus/_array.py: get_xp / _outer / _dot / _eigh shim used by optimizers/hessian_updates.py (and progressively by other hot-path files)

  • thermoanalysis/QCData.py — branding / I/O diff vs upstream
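Because the breakage is silent, it can help to check where Python would actually load a fork from. The following is a minimal sketch using importlib.util.find_spec; the helper name resolves_under is hypothetical and nothing like it is claimed to exist in the repo.

```python
import importlib.util
from pathlib import Path


def resolves_under(module_name, root):
    """Return True if `module_name` would be imported from a file below `root`."""
    spec = importlib.util.find_spec(module_name)
    if spec is None or spec.origin is None:
        return False  # not importable, or a namespace package with no file origin
    try:
        origin = Path(spec.origin).resolve()
    except OSError:
        return False
    return Path(root).resolve() in origin.parents
```

For example, resolves_under("pysisyphus", repo_root) returning False would suggest that a copy installed elsewhere (such as site-packages) is shadowing the bundled fork.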

5.4 pyproject.toml arrays are 0-diff

[tool.setuptools.packages.find].include and dependencies are treated as 0-diff arrays during this release. The include glob (pdb2reaction*) already auto-discovers any new layer subpackage; adding a vendor/ or internal/ container directory, or pinning a new runtime dependency, breaks the install contract and is forbidden by the release scope. Reflow / comment edits are fine; array contents are frozen.

5.5 _LAZY_SUBCOMMANDS registry must use absolute paths

pdb2reaction/cli/app.py:_LAZY_SUBCOMMANDS resolves every subcommand through an absolute module path. Switching any entry back to a relative dotted import (".all" etc.) silently breaks subcommand discovery whenever default_group.py moves, because the resolver’s __package__ then drifts away from the package root. See internal design notes.
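The drift is easy to reproduce with importlib.util.resolve_name, which computes the absolute name a dotted entry resolves to without importing anything. The wrapper name where_entry_lands and the two anchor packages below are illustrative assumptions.

```python
import importlib.util


def where_entry_lands(entry, resolver_package):
    """Return the absolute module path that a registry entry resolves to.

    Relative entries (leading dot) are anchored at `resolver_package`;
    absolute entries are returned unchanged.
    """
    return importlib.util.resolve_name(entry, resolver_package)


# A relative entry follows the resolver's package, so it moves with the file.
at_root = where_entry_lands(".all", "pdb2reaction")
in_cli = where_entry_lands(".all", "pdb2reaction.cli")

# An absolute entry resolves to the same module from either location.
absolute = where_entry_lands("pdb2reaction.workflows.all", "pdb2reaction.cli")
```

Here at_root is "pdb2reaction.all" while in_cli is "pdb2reaction.cli.all", so the same relative entry points at two different modules depending on where the resolver lives. The absolute entry is unaffected.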


6. Bundled forks (repo-internal)

pdb2reaction ships two repo-internal modules at the repo top:

| dir | upstream PyPI? | purpose | scope of edits allowed |
|---|---|---|---|
| pysisyphus/ | NO (fork; do not pip install pysisyphus alongside) | optimizer, TS, IRC, COS, calculators | annotation-only in this release line (docstring + type hints); logic edits forbidden |
| thermoanalysis/ | NO (fork, branding diff) | ΔG, ZPE, partition functions, QCData | same as pysisyphus/ |

Each dir carries its own README.md listing the divergent files and the touch-restriction boundary. In the layer model, these forks live outside the L1..L5 graph: any layer may import them via the absolute package path (from pysisyphus.X import Y) without breaking the L1 → L2 → {L3, L4} → L5 direction.