Architecture: mlmm-toolkit
1. Overview
mlmm-toolkit is a Python CLI that performs ML/MM (ONIOM) enzymatic reaction-path analysis on a complete protein environment. ML/MM here means a hybrid model in which a small reaction core is treated by a machine-learning interatomic potential (ML) and the surrounding protein by a molecular-mechanics (MM) force field, combined through the subtractive ONIOM (Our own N-layered Integrated molecular Orbital and molecular Mechanics) energy scheme.
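As a minimal sketch of the subtractive scheme named above, assuming the standard two-level ONIOM expression E = E_ML(core) + E_MM(full) - E_MM(core). The function and argument names are illustrative and are not taken from mlmm_calc.py; link-atom handling is left out.

```python
def oniom_energy(e_ml_core: float, e_mm_full: float, e_mm_core: float) -> float:
    """Two-level subtractive ONIOM energy (hypothetical helper).

    e_ml_core: ML energy of the reaction core
    e_mm_full: MM energy of the full system
    e_mm_core: MM energy of the reaction core, subtracted so the core is not counted twice
    """
    return e_ml_core + e_mm_full - e_mm_core
```

The MM description of the core cancels out, so the core is effectively described by the ML potential while the MM force field contributes the environment and the core-environment coupling.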
The input is a PDB plus a substrate name. From these the tool automatically generates the parm7 topology and encodes the ONIOM region split (ML / Movable MM / Frozen MM) into B-factor channels. It then runs a full-system Hessian-based transition-state (TS) search via a macro/micro alternation scheme.
The result is a full reaction path produced by the stage pipeline extract → mm-parm → ONIOM model → MEP → tsopt → IRC → freq → dft, where MEP is the minimum-energy path and IRC the intrinsic reaction coordinate.
The package is laid out as 6 physical layer directories (cli/, workflows/, domain/, backends/, io/, core/). The role and dependency direction of each are summarized in the §2.1 layer table below.
External code imports directly from the layer directory (from mlmm.backends.mlmm_calc import MLMMCore, from mlmm.core.utils import …, import mlmm.io.trj2fig, etc.); the previous flat-top shim layer has been retired in this release. §2.4 details the two import surfaces this leaves.
Three bundled forks (pysisyphus/, thermoanalysis/, hessian_ff/) live at the repo top as repo-internal modules. They are deliberately not the upstream PyPI distributions (and hessian_ff/ has no upstream at all — bundling is mandatory). See §6.
2. Layered structure (6 physical directories)
2.1 Layer table
| layer | dir | responsibility | may depend on |
|---|---|---|---|
| L1 Interface | cli/ | Click root group, decorator factories, | |
| L2 Application | workflows/ | per-subcommand orchestration; one file per stage runner | |
| L3 Domain | domain/ | chemistry-aware helper logic (bond change detection, bond summary, element-info propagation) | |
| L4a Infra (MLIP + ONIOM) | backends/ | MLIP backend dispatcher + per-backend adapter + ML/MM ONIOM calculator core | |
| L4b Infra (I/O) | io/ | output layout, summary, trajectory, PDB fix, energy diagram, Hessian cache, analytical-Hessian glue | |
| L5 Foundation | core/ | defaults (single source of truth), utils (PDB / XYZ / plot helpers), future errors.py, types.py / _stage.py | (none) |
| (bundle, not a layer) | pysisyphus/, thermoanalysis/, hessian_ff/ | repo-internal forks (optimizer / thermochemistry / analytical MM Hessian) | (sibling, layer-external) |
Dependency direction (one-way): L1 → L2 → {L3, L4} → L5 (per the §2.1 layer table). The directional rule is enforced by CI marker coverage (.github/scripts/check_engineering_markers.py). Bundled forks sit outside the layer graph and may be imported from any layer through their absolute package path (from pysisyphus.X import Y, from hessian_ff.analytical_hessian import …).
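The one-way rule can be illustrated with a small static check. This is a sketch only and is not the logic of check_engineering_markers.py: the rank table and function names are hypothetical, relative imports are ignored, and the sibling layers L3 / L4a / L4b share a rank because the rule above does not order them against each other.

```python
import ast

# Hypothetical ranks: a lower number is closer to the CLI surface.
LAYER_RANK = {"cli": 1, "workflows": 2, "domain": 3, "backends": 3, "io": 3, "core": 4}


def imported_layers(source: str) -> set[str]:
    """Return the mlmm layer directories that a module's source imports."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            names = [node.module]
        else:
            continue
        for name in names:
            parts = name.split(".")
            if len(parts) >= 2 and parts[0] == "mlmm" and parts[1] in LAYER_RANK:
                found.add(parts[1])
    return found


def upward_imports(layer: str, source: str) -> set[str]:
    """Layers imported by `source` that sit above `layer` in the graph."""
    own_rank = LAYER_RANK[layer]
    return {name for name in imported_layers(source) if LAYER_RANK[name] < own_rank}
```

A module in core/ that imports mlmm.workflows.all would be reported, while imports of the bundled forks (pysisyphus, hessian_ff) are never reported because they are not in the rank table.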
2.2 ASCII map of the package tree
mlmm_toolkit/ [GH: t-0hmura/mlmm_toolkit]
├── pyproject.toml packages.find = ["mlmm*",...] (glob, frozen)
├── README.md / CONTRIBUTING.md / CHANGELOG.md
├── docs/
│ ├── architecture.md ← this file
│ └──... (Sphinx site, unchanged)
├── mlmm/ ← package body, 6-layer physical dir
│ ├── __init__.py PEP 562 lazy: _LAZY_IMPORTS + __getattr__
│ ├── __main__.py `from mlmm.cli.app import cli`
│ ├── _version.py / py.typed
│ │
│ ├── cli/ # === L1 Interface ===
│ │ ├── app.py Click group + _LAZY_SUBCOMMANDS registry (absolute paths)
│ │ ├── common_options.py @add_precision_option / @add_backend_model_option / @add_ml_charge_spin_options et al.
│ │ ├── decorators.py make_is_param_explicit, bool/YAML helpers, render_cli_exception
│ │ ├── help_pages.py --help-advanced pager
│ │ ├── bool_compat.py --flag / --no-flag normalization
│ │ ├── default_group.py subcommand resolver, lazy module import
│ │ └── preflight.py AmberTools / conda env / GPU preflight
│ │
│ ├── workflows/ # === L2 Application ===
│ │ ├── all.py full pipeline orchestrator (extract → … → DFT)
│ │ ├── path_search.py / path_opt.py MEP search / COS wrapper
│ │ ├── tsopt.py / freq.py / irc.py / dft.py per-stage runners
│ │ ├── opt.py / scan.py / scan2d.py /
│ │ │ scan3d.py / scan_common.py ONIOM geometry opt / scans
│ │ ├── extract.py active-site extraction CLI
│ │ ├── define_layer.py ML / Movable MM / Frozen MM B-factor assignment
│ │ ├── mm_parm.py AmberTools-driven parm7 / rst7 generation
│ │ ├── oniom_export.py ONIOM input writer (Gaussian / ORCA)
│ │ ├── oniom_import.py ONIOM input reader (sanity / atom-name diff)
│ │ └── align_freeze.py Kabsch + frozen-subset rmsd
│ │
│ ├── domain/ # === L3 Domain ===
│ │ ├── bond_changes.py R↔P bond detection
│ │ ├── bond_summary.py post-IRC diagnostic
│ │ └── add_elem_info.py PDB element column normalizer
│ │
│ ├── backends/ # === L4a Infra (MLIP + ONIOM) ===
│ │ ├── __init__.py --precision routing (apply_precision_to_calc_cfg)
│ │ ├── mlmm_calc.py ML/MM ONIOM calculator core (4 MLIP backends UMA / Orb / MACE / AIMNet2
│ │ inline; CHEMISTRY-RULE:1 / 2 / 8 / 9 host)
│ │ │ Future: split into base.py + per-backend uma.py / orb.py
│ │ │ / mace.py / aimnet2.py + ONIOM subdir
│ │ └── xtb_embedcharge_correction.py xTB point-charge embedding correction (--embedcharge)
│ │
│ ├── io/ # === L4b Infra (I/O) ===
│ │ ├── summary.py summary.json / summary.md writer
│ │ ├── energy_diagram.py Plotly diagram
│ │ ├── trj2fig.py trajectory → PNG / HTML / SVG / PDF
│ │ ├── pdb_fix.py altloc resolution
│ │ ├── hessian_cache.py in-memory Hessian cache
│ │ └── hessian_calc.py numerical-Hessian build + frequency / vibrational I/O helpers
│ │
│ ├── core/ # === L5 Foundation ===
│ │ ├── defaults.py C1 single source of truth for every default
│ │ ├── utils.py PDB / XYZ / plot helpers
│ │ ├── logging.py -v / -vv logging wiring
│ │ ├── calc_eval.py per-stage calc evaluation
│ │ └── residue_data.py residue tables
│ │
│ └── mcp/ # non-layer subpackage: MCP server exposing every CLI subcommand
│ ├── server.py / _runner.py
│ └── _tools.py
│
├── tests/ smoke / unit
├── .github/ workflows/ + scripts/ (docs-quality lint helpers; CI-only)
└── (repo-top sibling, layer-external bundled forks)
pysisyphus/ ~90 files, repo-internal fork (slimmed; CLI driver + QM backends + wavefunction + dead optimizers / IRC / NEB variants removed)
thermoanalysis/ 5 files, repo-internal fork
hessian_ff/ 19 files / 4.2k LOC, NO upstream PyPI, mandatory bundling
2.3 Per-layer responsibility detail
L1 cli/. Only this layer constructs Click commands and parses argv. app.py holds the root click.Group plus the _LAZY_SUBCOMMANDS registry. Every registry entry uses an absolute module path (mlmm.workflows.all, mlmm.io.trj2fig, …), so the resolver is independent of where default_group.py itself lives. The mlmm-specific preflight.py (AmberTools / conda env / GPU preflight) lives here because it runs during CLI startup, before any L2 workflow is invoked.
L2 workflows/ (~21 files). One file per subcommand. Each file owns a single @click.command() named cli and its private helpers. Large stage runners (all.py = 4,414 LOC, path_search.py = 2,352 LOC, tsopt.py = 3,181 LOC, extract.py = 2,274 LOC, oniom_export.py = 2,027 LOC) remain as single files in the current layout; future work may split them into per-stage subdirectories, but this is opt-in and out of scope for this release line.
L3 domain/. Chemistry-aware helper logic that may import torch / numpy / pysisyphus.constants (numeric back-ends), but may not import machine-learning interatomic potential (MLIP) runtimes (fairchem, orb_models, mace, aimnet). Two distinct CI gates cover this, both in .github/scripts/check_engineering_markers.py:
- The MLIP-runtime deny list (fairchem / orb_models / mace / aimnet) is enforced repo-wide by _check_external_library_scope, which forbids those imports in any module outside backends/.
- The separate # DOMAIN_PURE module-docstring marker is a distinct CI gate (_check_domain_pure) that flags the specific backend-agnostic modules required to stay MLIP-free: backends/mlmm_calc.py, workflows/tsopt.py, workflows/freq.py (and present on workflows/sp.py). It is not itself the deny-list mechanism, and no domain/ file carries it.
Domain helpers are reusable by any L2 stage runner.
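The deny-list gate described above can be approximated as follows. This is a hedged sketch rather than the body of _check_external_library_scope: the function name, the path handling and the return shape are assumptions made for illustration.

```python
import ast
from pathlib import PurePosixPath

# Top-level package names of the MLIP runtimes named in the deny list.
MLIP_RUNTIMES = {"fairchem", "orb_models", "mace", "aimnet"}


def forbidden_mlip_imports(rel_path: str, source: str) -> list[str]:
    """Return the MLIP runtimes imported by a module that lives outside backends/."""
    if "backends" in PurePosixPath(rel_path).parts:
        return []  # backends/ is the only place allowed to import them
    hits = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            roots = [alias.name.split(".")[0] for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            roots = [node.module.split(".")[0]]
        else:
            continue
        hits.update(root for root in roots if root in MLIP_RUNTIMES)
    return sorted(hits)
```

Because the source is only parsed, never imported, such a check can run in CI without any MLIP runtime installed.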
L4a backends/. The ML/MM ONIOM calculator core (mlmm_calc.py = 2,550 LOC) lives here together with the backend dispatch (__init__.py) and the standalone xTB point-charge embedding correction (xtb_embedcharge_correction.py, driven by --embedcharge). Today the 4 MLIP backends (UMA / Orb / MACE / AIMNet2) that evaluate the ML region and the OpenMM / hessian_ff coupling all sit inline inside mlmm_calc.py; future work may split this into backends/{base, uma, orb, mace, aimnet2}.py for the MLIP layer plus a backends/mlmm_calc/ subdir for the ONIOM core (core.py, ase_calc.py, embed_charge.py, hessianff_calc.py, openmm_calc.py, facade.py). The current single-file mlmm_calc.py carries chemistry rules #1 (subtractive ONIOM), #2 (link-atom Hessian B-matrix), #8 (3-layer 5-pass partial Hessian), and #9 (parm7 atom indexing) — see §5.1.
L4b io/ (7 files). Output-side I/O concerns: per-stage summary writer, energy diagram, trajectory rendering, PDB altloc fix, Hessian cache, numerical Hessian construction + frequency / vibrational I/O (hessian_calc.py). io/ never depends on workflows/; output format is owned here and consumed by stage runners.
L5 core/. The lowest layer. defaults.py is the single source of truth for every CLI default — grep here before adding a number anywhere else. utils.py is a ~3,200-LOC grab-bag of PDB / XYZ / plotting helpers; future work may split it into utils/{pdb,plot,coord,yaml,freeze,input_prep}.py. logging.py (-v / -vv wiring), calc_eval.py (per-stage calc evaluation) and residue_data.py (residue tables) also live here. The internal-only modules errors.py, types.py / _stage.py will be introduced here as they land.
2.4 Lazy-import mechanism (conceptual diagram)
External consumer Package root Layer dir
------------------ ---------------- -----------
from mlmm.core.utils import x ────────────────────────────────────► mlmm/core/utils.py
import mlmm.io.trj2fig ──────────────────────────────────────────► mlmm/io/trj2fig.py
from mlmm.backends.mlmm_calc import ─────────────────────────────► mlmm/backends/mlmm_calc.py
MLMMCore
from mlmm import MLMMCore ─────► mlmm/__init__.py
__getattr__("MLMMCore")
└─► _LAZY_IMPORTS["MLMMCore"]
= "mlmm.backends.mlmm_calc"
└─► importlib.import_module(...)
└─► getattr(module, "MLMMCore")
mlmm myaction ─────────────────► mlmm/cli/app.py
_LAZY_SUBCOMMANDS["myaction"]
= ("mlmm.workflows.myaction", "cli", "...")
└─► importlib.import_module(absolute path)
└─► getattr(module, "cli") → Click command
Two import surfaces (the flat-top shim layer was retired in this
release; downstream code that used from mlmm.<oldmod> must migrate
to the layered path):
- Layered import path: external code imports directly from the layer directory (see the §2.1 layer table; e.g. from mlmm.backends.mlmm_calc import MLMMCore).
- Root symbol attribute (from mlmm import MLMMCore): handled by mlmm/__init__.py:_LAZY_IMPORTS + PEP 562 __getattr__. The five re-exported symbols (MLMMCore, MLMMASECalculator, mlmm, mlmm_ase, mlmm_mm_only) all resolve to mlmm.backends.mlmm_calc and are loaded on first access, so import mlmm stays cheap (only __version__ is eager). There is no root module-attribute surface: submodules are reached by their full path (import mlmm.io.trj2fig), not as attributes of the top-level package.
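The root-symbol surface relies on PEP 562 module-level __getattr__. A self-contained sketch of that pattern is shown below; it builds a throwaway module object and uses standard-library targets as stand-ins for MLMMCore and friends. The helper name and the caching step are assumptions of this sketch, not a description of mlmm/__init__.py.

```python
import importlib
import types

# Stand-in table: exported symbol -> module that defines it.
_LAZY_IMPORTS = {"sqrt": "math", "dumps": "json"}


def make_lazy_package(name: str, lazy_map: dict) -> types.ModuleType:
    """Build a module whose listed attributes are imported on first access."""
    pkg = types.ModuleType(name)

    def __getattr__(attr: str):
        # Called only when normal attribute lookup on the module fails (PEP 562).
        try:
            target = lazy_map[attr]
        except KeyError:
            raise AttributeError(f"module {name!r} has no attribute {attr!r}") from None
        value = getattr(importlib.import_module(target), attr)
        setattr(pkg, attr, value)  # cache, so later lookups bypass __getattr__
        return value

    pkg.__getattr__ = __getattr__
    return pkg


demo = make_lazy_package("demo_pkg", _LAZY_IMPORTS)
```

Nothing is imported when demo is created; the first access to demo.sqrt triggers the import and stores the result on the module.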
The CLI subcommand resolver (cli/app.py:_LAZY_SUBCOMMANDS) uses absolute module paths (e.g. "mlmm.workflows.all") so that moving default_group.py into cli/ does not silently break subcommand discovery (the registry no longer depends on __package__).
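The registry lookup can be sketched without Click. The example below assumes each entry is a (module path, attribute, short help) tuple, as in the diagram above, and uses standard-library modules as stand-ins for the workflow modules; the function names are hypothetical.

```python
import importlib

# Stand-in registry: subcommand name -> (absolute module path, attribute, short help).
_LAZY_SUBCOMMANDS = {
    "dumps": ("json", "dumps", "serialize an object"),
    "decoder": ("json.decoder", "JSONDecoder", "decoder class"),
}


def list_subcommands() -> list[str]:
    """Names can be listed without importing any target module."""
    return sorted(_LAZY_SUBCOMMANDS)


def resolve_subcommand(name: str):
    """Import the target module on demand and return the registered attribute."""
    module_path, attr, _help = _LAZY_SUBCOMMANDS[name]
    if module_path.startswith("."):
        # A relative path would make the lookup depend on the caller's package.
        raise ValueError(f"{name}: registry entries must use absolute module paths")
    return getattr(importlib.import_module(module_path), attr)
```

Because every path is absolute, the resolver gives the same result wherever it is defined, which is the property the paragraph above describes.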
4. File index: “where does this concern live?”
4.1 CLI / entry (L1 cli/)
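| concern | file |
|---|---|
| Click root group + subcommand dispatch | mlmm/cli/app.py |
| Subcommand resolver (lazy import) | mlmm/cli/default_group.py |
| Shared option decorator factories | mlmm/cli/common_options.py |
| Bool/YAML/exception CLI helpers | mlmm/cli/decorators.py |
| Bool flag compat (--flag / --no-flag) | mlmm/cli/bool_compat.py |
| AmberTools / conda env / GPU preflight | mlmm/cli/preflight.py |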
4.2 Workflow stage runners (L2 workflows/)
Acronyms used below: MEP = minimum-energy path; GSM = growing-string method; COS = chain-of-states; RSIRFO = restricted-step image-function rational-function optimization (also written RS-I-RFO); Bofill = the Bofill Hessian-update formula; PHVA = partial Hessian vibrational analysis; IRC = intrinsic reaction coordinate; Kabsch = the Kabsch rigid-body alignment algorithm.
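| concern | file |
|---|---|
| Full pipeline orchestrator | mlmm/workflows/all.py |
| Geometry optimization (ONIOM macro/micro pre-opt) | mlmm/workflows/opt.py |
| 1D / 2D / 3D scans + shared scan_common | mlmm/workflows/scan.py, scan2d.py, scan3d.py, scan_common.py |
| MEP search (GSM) | mlmm/workflows/path_search.py |
| MEP optimizer core (pysisyphus COS) | mlmm/workflows/path_opt.py |
| TS optimization (RSIRFO + Bofill + macro/micro) | mlmm/workflows/tsopt.py |
| Vibrational analysis (PHVA + UMA active block) | mlmm/workflows/freq.py |
| IRC integration (macro / micro) | mlmm/workflows/irc.py |
| Single-point DFT (gpu4pyscf subprocess, ONIOM-embedded) | mlmm/workflows/dft.py |
| Active-site extraction (cluster cap) | mlmm/workflows/extract.py |
| ML / Movable MM / Frozen MM region assignment | mlmm/workflows/define_layer.py |
| AmberTools-driven MM parameter generation | mlmm/workflows/mm_parm.py |
| ONIOM input writer (Gaussian / ORCA) | mlmm/workflows/oniom_export.py |
| ONIOM input reader (sanity, atom-name diff) | mlmm/workflows/oniom_import.py |
| Kabsch / frozen-subset alignment | mlmm/workflows/align_freeze.py |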
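For readers unfamiliar with the Kabsch step listed above, here is a minimal NumPy sketch of a rigid-body superposition fitted on a subset of atoms (for example the frozen ones) and applied to the whole structure. It is the textbook algorithm under assumed names and is not the code in align_freeze.py.

```python
import numpy as np


def kabsch_align(mobile, reference, subset):
    """Superimpose `mobile` onto `reference`, fitting only the atoms in `subset`.

    Returns the transformed copy of `mobile` and the RMSD over `subset`.
    """
    mobile = np.asarray(mobile, dtype=float)
    reference = np.asarray(reference, dtype=float)
    p, q = mobile[subset], reference[subset]
    p_center, q_center = p.mean(axis=0), q.mean(axis=0)
    # SVD of the 3x3 covariance matrix of the centered subsets.
    u, _, vt = np.linalg.svd((p - p_center).T @ (q - q_center))
    # Flip the last axis if needed so the result is a proper rotation, not a reflection.
    sign = 1.0 if np.linalg.det(vt.T @ u.T) >= 0.0 else -1.0
    rotation = vt.T @ np.diag([1.0, 1.0, sign]) @ u.T
    # Apply the fitted transform to every atom, not only the subset.
    aligned = (mobile - p_center) @ rotation.T + q_center
    rmsd = float(np.sqrt(np.mean(np.sum((aligned[subset] - q) ** 2, axis=1))))
    return aligned, rmsd
```

Fitting on the subset while transforming all atoms keeps the fitted atoms superimposed and lets the remaining atoms show their true displacement.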
4.3 Chemistry helpers (L3 domain/)
| concern | file |
|---|---|
| R↔P bond change detection | mlmm/domain/bond_changes.py |
| Post-IRC bond summary | mlmm/domain/bond_summary.py |
| PDB element column normalizer | mlmm/domain/add_elem_info.py |
4.4 MLIP + ONIOM (L4a backends/)
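| concern | file |
|---|---|
| ML/MM ONIOM calculator core + 4 inline MLIP backends + ONIOM coupling | mlmm/backends/mlmm_calc.py |
| Backend dispatch / factory | mlmm/backends/__init__.py |
| xTB point-charge embedding correction (--embedcharge) | mlmm/backends/xtb_embedcharge_correction.py |
| per-backend adapter split (planned, not yet present) | mlmm/backends/{base, uma, orb, mace, aimnet2}.py |
| ONIOM core subdir (planned, not yet present) | mlmm/backends/mlmm_calc/ |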
See MLIP Backends for the add-a-backend recipe (currently scoped to the planned per-backend split; until that lands, backend additions touch mlmm_calc.py inline).
4.5 I/O (L4b io/)
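| concern | file |
|---|---|
| summary.json / summary.md | mlmm/io/summary.py |
| Plotly energy diagram | mlmm/io/energy_diagram.py |
| Trajectory → PNG / HTML / SVG / PDF | mlmm/io/trj2fig.py |
| PDB altloc resolution | mlmm/io/pdb_fix.py |
| In-memory Hessian cache (per-run TTL) | mlmm/io/hessian_cache.py |
| Numerical Hessian build + frequency / vibrational I/O | mlmm/io/hessian_calc.py |
| Harmonic restraint setup | |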
4.6 Foundation (L5 core/)
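| concern | file |
|---|---|
| Every CLI default (single source of truth) | mlmm/core/defaults.py |
| PDB / XYZ / plot helpers | mlmm/core/utils.py |
| -v / -vv logging | mlmm/core/logging.py |
| Per-stage calc evaluation | mlmm/core/calc_eval.py |
| Residue tables | mlmm/core/residue_data.py |
| (future) internal Protocol / TypedDict | |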
4.7 Repo-internal bundled forks
| dir | role | divergent files (do NOT replace with upstream) |
|---|---|---|
| pysisyphus/ | optimizer / TS / IRC engine | |
| thermoanalysis/ | thermochemistry (ΔG, ZPE, partition functions) | |
| hessian_ff/ | analytical Hessian on MM force field; PyPI 404, bundling is mandatory | |
See each dir’s README.md for the touch-restriction boundary.
6. Bundled forks (repo-internal)
mlmm_toolkit ships three repo-internal modules at the repo top:
| dir | upstream PyPI? | purpose | scope of edits allowed |
|---|---|---|---|
| pysisyphus/ | NO: fork, do not | optimizer, TS, IRC, COS, calculators | annotation-only in this release (docstring + type hints); logic edits forbidden |
| thermoanalysis/ | NO: fork (branding diff) | ΔG, ZPE, partition functions, | same as pysisyphus/ |
| hessian_ff/ | NO: PyPI 404, bundling mandatory | analytical Hessian on MM force field | same as pysisyphus/ |
Each dir carries its own README.md listing the divergent files and the touch-restriction boundary. From the layer model these forks live outside the L1..L5 graph: any layer may import them via the absolute package path (from pysisyphus.X import Y, from hessian_ff.analytical_hessian import …) without breaking the L1 → L2 → {L3, L4} → L5 direction.
7. Recommended deeper reading order (5–10 files)
After the Fresh-eyes tour (§3), follow this depth-first reading order:
1. mlmm/core/defaults.py: internalise the default-value table; everything downstream reads from here.
2. mlmm/cli/app.py: Click root + _LAZY_SUBCOMMANDS registry.
3. mlmm/workflows/all.py: one full pipeline top-to-bottom.
4. mlmm/workflows/extract.py + define_layer.py: cluster cap + ONIOM layer assignment.
5. mlmm/workflows/mm_parm.py: AmberTools parm7 generation.
6. mlmm/backends/mlmm_calc.py: the heart of ML/MM (chemistry-rules #1, #2, #8, #9 all live here).
7. mlmm/workflows/tsopt.py: RSIRFO + Bofill (CHEMISTRY-RULE:7) + macro / micro alternation (CHEMISTRY-RULE:3).
8. mlmm/workflows/freq.py: PHVA + UMA active-block (CHEMISTRY-RULE:6).
9. mlmm/workflows/irc.py: VRAM hygiene + macro / micro IRC.
10. mlmm/core/utils.py: shared PDB / XYZ / plot helpers.
8. ML/MM (ONIOM) scope
mlmm-toolkit operates on the full protein environment via ONIOM:
- ML region: substrate + reaction-center residues, evaluated by one of 4 machine-learning interatomic potential (MLIP) backends (UMA / Orb / MACE / AIMNet2); an optional xTB point-charge embedding correction (--embedcharge) adds MM→ML environmental effects
- Movable MM region: a shell around the ML region, free to move under the AMBER force field
- Frozen MM region: the rest of the protein, held rigid
The split is encoded in B-factor channels of the input PDB and propagated through extract → mm-parm → ONIOM model → MEP → tsopt → IRC → freq → dft.
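To make the B-factor encoding concrete, the sketch below reads the B-factor field of each ATOM / HETATM record (PDB columns 61-66) and maps it to a layer name. The numeric codes 0 / 10 / 20 are placeholders invented for this example; the values actually written by define_layer.py are not stated in this document.

```python
from collections import Counter

# Hypothetical B-factor codes: placeholders, not the values used by define_layer.py.
BFACTOR_TO_LAYER = {0: "ML", 10: "Movable MM", 20: "Frozen MM"}


def layer_of(pdb_line: str) -> str:
    """Read the B-factor field (PDB columns 61-66) and map it to a layer name."""
    return BFACTOR_TO_LAYER[int(round(float(pdb_line[60:66])))]


def count_layers(pdb_text: str) -> Counter:
    """Count atoms per layer over the ATOM / HETATM records of a PDB string."""
    return Counter(
        layer_of(line)
        for line in pdb_text.splitlines()
        if line.startswith(("ATOM", "HETATM"))
    )


# Fixed-column ATOM record; the last two fields are occupancy and B-factor.
_ATOM_FMT = "ATOM  {:5d} {:<4s} {:3s} A{:4d}    {:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}"
demo_pdb = "\n".join([
    _ATOM_FMT.format(1, "C1", "LIG", 1, 0.0, 0.0, 0.0, 1.0, 0.0),
    _ATOM_FMT.format(2, "CA", "SER", 2, 1.5, 0.0, 0.0, 1.0, 10.0),
    _ATOM_FMT.format(3, "CA", "GLY", 3, 9.0, 0.0, 0.0, 1.0, 20.0),
    _ATOM_FMT.format(4, "CB", "SER", 2, 2.5, 0.0, 0.0, 1.0, 10.0),
])
```

Storing the layer in an existing PDB column means the split travels with the structure file through every stage without a sidecar file.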