Getting Started¶
Overview¶
mlmm-toolkit is a Python CLI toolkit for computing enzymatic reaction pathways using an ML/MM (Machine Learning / Molecular Mechanics) approach. It couples an MLIP (Machine Learning Interatomic Potential) backend for the reactive (ML) region with a bundled MM force field engine (hessian_ff) for the surrounding protein environment, using an ONIOM-like energy decomposition. The default MLIP backend is UMA (Meta’s FAIR-Chem); alternative backends (orb, mace, aimnet2) can be selected via --backend.
In many workflows, a single command is enough to generate a useful first-pass reaction path:
```bash
mlmm all -i R.pdb P.pdb -c 'SAM,GPP' -l 'SAM:1,GPP:-3'
```
You can also run ML/MM model setup, MEP search, TS optimization, IRC, thermochemistry, and single-point DFT in a single run by adding --tsopt --thermo --dft:
```bash
mlmm all -i R.pdb P.pdb -c 'SAM,GPP' -l 'SAM:1,GPP:-3' --tsopt --thermo --dft
```
Given (i) two or more full protein-ligand PDB files (R,…, P), or (ii) one PDB with --scan-lists, or (iii) one TS candidate with --tsopt, mlmm-toolkit automatically:
- defines the ML region around user-defined substrates,
- generates Amber parm7/rst7 topology files for the MM region (`mm-parm`),
- assigns a 3-layer ML/MM partitioning (`define-layer`),
- explores minimum-energy paths (MEPs) with path optimization methods such as the Growing String Method (GSM) and Direct Max Flux (DMF),
- optionally optimizes transition states, runs vibrational analysis, IRC calculations, and single-point DFT calculations.
At the ML stage, the reactive region uses a machine learning interatomic potential (MLIP). The default backend is UMA (Meta’s FAIR-Chem); alternative backends include ORB, MACE, and AIMNet2 (selected via -b/--backend). The MM region uses hessian_ff, a bundled C++ native extension that computes Amber force field energies, forces, and Hessians. The total energy follows an ONIOM-like decomposition:
E_total = E_REAL_low + E_MODEL_high - E_MODEL_low
where REAL is the full system, MODEL is the ML region, “high” is the selected MLIP backend (default: UMA), and “low” is hessian_ff.
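The decomposition is plain arithmetic over three single-point energies. As a toy illustration (the energy values below are made up, not output of any real calculation):

```bash
# Illustrative energies only (kcal/mol) -- not from a real calculation.
E_REAL_LOW=-1520.4    # full system, hessian_ff ("low" level)
E_MODEL_HIGH=-310.7   # ML region, MLIP backend ("high" level)
E_MODEL_LOW=-295.2    # ML region, hessian_ff ("low" level)

# E_total = E_REAL_low + E_MODEL_high - E_MODEL_low
awk -v a="$E_REAL_LOW" -v b="$E_MODEL_HIGH" -v c="$E_MODEL_LOW" \
    'BEGIN { printf "E_total = %.1f kcal/mol\n", a + b - c }'
# -> E_total = -1535.9 kcal/mol
```

The MM contribution of the ML region cancels out, so the ML region is effectively described at the MLIP level while keeping the full-system MM environment.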
The CLI is designed to generate multi-step enzymatic reaction mechanisms with minimal manual intervention. The same workflow also works for small-molecule systems. When you skip pocket extraction (omit --center/-c and --ligand-charge), you can also use .xyz inputs.
Important
Input PDB files must already contain hydrogen atoms.
When you provide multiple PDBs, they must contain the same atoms in the same order (only coordinates may differ); otherwise an error is raised.
Most subcommands require `--parm` (Amber topology) and `--model-pdb` (ML region definition). The `all` command generates these automatically.
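The same-atoms/same-order requirement can be pre-checked with a generic shell sketch (this is not an mlmm-toolkit command): compare ATOM/HETATM records with the coordinate columns (31-54 in the PDB format) blanked out.

```bash
# Create two tiny demo PDBs that differ only in coordinates.
cat > demo_R.pdb <<'EOF'
ATOM      1  N   ALA A   1      11.104  13.207   2.100  1.00  0.00           N
EOF
cat > demo_P.pdb <<'EOF'
ATOM      1  N   ALA A   1      12.000  14.000   3.000  1.00  0.00           N
EOF

# Keep columns 1-30 (record/atom name/residue) and 55+ (occupancy/element),
# dropping the x/y/z coordinate fields in columns 31-54.
fingerprint() { grep -E '^(ATOM|HETATM)' "$1" | cut -c1-30,55-; }

diff <(fingerprint demo_R.pdb) <(fingerprint demo_P.pdb) \
  && echo "atom identity and order match (coordinates ignored)"
```

If `diff` reports differences, the inputs would be rejected by mlmm-toolkit.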
Tip
If you are new to the project, read Concepts & Workflow first. For symptom-first diagnosis, start with Common Error Recipes. If you encounter an error during setup or runtime, refer to Troubleshooting.
CLI conventions¶
| Convention | Example |
|---|---|
| Residue selectors | `'SAM,GPP'` |
| Charge mapping | `'SAM:1,GPP:-3'` |
| Atom selectors | `'TYR 285 CA'` |
For full details, see CLI Conventions.
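For instance, the charge mapping is a plain `NAME:CHARGE` list, so it can be inspected with standard tools (this snippet is purely illustrative, not an mlmm subcommand):

```bash
mapping='SAM:1,GPP:-3'
echo "$mapping" | tr ',' '\n' | awk -F: '{ printf "%s -> charge %d\n", $1, $2 }'
# -> SAM -> charge 1
#    GPP -> charge -3
```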
`path-search` naming note: the CLI subcommand is `path-search`, and its documentation page is `path-search.md`.
Recommended tools for hydrogen addition¶
If your PDB lacks hydrogen atoms, use one of the following tools before running mlmm-toolkit:
| Tool | Example Command | Notes |
|---|---|---|
| reduce (Richardson Lab) | `reduce input.pdb > input_h.pdb` | Fast, widely used for crystallographic structures |
| pdb2pqr | `pdb2pqr --ff=AMBER input.pdb output.pqr` | Adds hydrogens and assigns partial charges |
| Open Babel | `obabel input.pdb -O output_h.pdb -h` | General-purpose cheminformatics toolkit |
| `mm-parm --add-h` | `mlmm mm-parm --add-h ...` | Hydrogen addition via PDBFixer (through AmberTools) |

The example commands are representative; consult each tool's documentation for the exact options.
To ensure identical atom ordering across multiple PDB inputs, apply the same hydrogen-addition tool with consistent settings to all structures.
Warning
This software is still under development. Please use it at your own risk.
Installation¶
mlmm-toolkit is intended for Linux environments (local workstations or HPC clusters) with a CUDA-capable GPU. Several dependencies – notably PyTorch, fairchem-core (UMA), gpu4pyscf-cuda12x, and hessian_ff – expect a working CUDA installation. Alternative MLIP backends (ORB, MACE, AIMNet2) have their own optional dependencies; see the install extras below.
Prerequisites¶
mlmm-toolkit uses the following components:
- MLIP backends: energy, force, and Hessian calculations for the ML region. The default is UMA (`fairchem-core`). ORB (`pip install "mlmm-toolkit[orb]"`) and AIMNet2 (`pip install "mlmm-toolkit[aimnet]"`) are also available. MACE is also available but requires uninstalling `fairchem-core` first due to an `e3nn` version conflict (`pip uninstall fairchem-core && pip install mace-torch`).
- `hessian_ff`: Amber force field calculations for the MM region (requires building the C++ extension).
- AmberTools: automatic parm7/rst7 generation via the `mm-parm` subcommand (tleap, antechamber, parmchk2).
Refer to the upstream projects for additional details:
fairchem / UMA: https://github.com/facebookresearch/fairchem, https://huggingface.co/facebook/UMA
Hugging Face token & security: https://huggingface.co/docs/hub/security-tokens
Quick start¶
Below is a minimal setup example that works on many CUDA 12.9 clusters. Adjust module names and versions to match your system. This example assumes the default GSM MEP mode (no DMF). For DMF, install cyipopt via conda first.
```bash
# 1) Install a CUDA-enabled PyTorch build
pip install torch --index-url https://download.pytorch.org/whl/cu129

# 2) Install mlmm-toolkit
pip install mlmm-toolkit
# Previous stable release (v0.1.1):
# pip install git+https://github.com/t-0hmura/mlmm_toolkit.git@v0.1.1

# Optional: install alternative MLIP backends
pip install "mlmm-toolkit[orb]"     # ORB backend
pip install "mlmm-toolkit[aimnet]"  # AIMNet2 backend
# MACE backend (conflicts with UMA — uninstall fairchem-core first)
# pip uninstall fairchem-core && pip install mace-torch

# 3) Build the hessian_ff C++ native extension
cd $(python -c "import hessian_ff; print(hessian_ff.__path__[0])")/native && make

# 4) Install a headless Chrome for Plotly figure export
plotly_get_chrome -y
```
Note: If you switch to a different runtime environment (node/container/Python/PyTorch), rebuild `hessian_ff` in that environment. On many clusters, install Ninja first:

```bash
conda install -c conda-forge ninja -y
cd hessian_ff/native && make clean && make
```
Next, log in to Hugging Face Hub so that UMA models can be downloaded (only required when using the default uma backend). Either:
```bash
# Hugging Face CLI
hf auth login --token '<YOUR_ACCESS_TOKEN>' --add-to-git-credential
```
or
```bash
# Classic CLI
huggingface-cli login
```
You only need to do this once per machine / environment.
If you want to use the Direct Max Flux (DMF) method for MEP search, create a conda environment and install cyipopt before installing mlmm-toolkit.
```bash
# Create and activate a dedicated conda environment
conda create -n mlmm python=3.11 -y
conda activate mlmm

# Install cyipopt (required for the DMF method in MEP search)
conda install -c conda-forge cyipopt -y
```
If you are on an HPC cluster that uses environment modules, load CUDA before installing PyTorch, like this:
```bash
module load cuda/12.9
```
AmberTools is required for the `mlmm mm-parm` subcommand (Amber topology generation). Install it separately:

```bash
conda install -c conda-forge ambertools -y
```

Even without AmberTools, other subcommands work if you provide `--parm` manually.
Step-by-step installation¶
If you prefer to build the environment piece by piece:
Load CUDA (if you use environment modules on an HPC cluster):

```bash
module load cuda/12.9
```

Create and activate a conda environment:

```bash
conda create -n mlmm python=3.11 -y
conda activate mlmm
```

Install cyipopt (required if you want to use the DMF method in MEP search):

```bash
conda install -c conda-forge cyipopt -y
```

Install AmberTools (required for the `mlmm mm-parm` subcommand; Amber topology generation with tleap/antechamber):

```bash
conda install -c conda-forge ambertools -y
```

Install PyTorch with the right CUDA build. For CUDA 12.9:

```bash
pip install torch --index-url https://download.pytorch.org/whl/cu129
```

(You may use another compatible version if your cluster recommends it.)

Install `mlmm-toolkit`:

```bash
pip install mlmm-toolkit
```

To install with optional MLIP backends:

```bash
pip install "mlmm-toolkit[orb]"     # ORB backend
pip install "mlmm-toolkit[aimnet]"  # AIMNet2 backend
# MACE backend (conflicts with UMA — uninstall fairchem-core first)
# pip uninstall fairchem-core && pip install mace-torch
```
To enable xTB point-charge embedding (`--embedcharge`), install [xTB](https://github.com/grimme-lab/xtb) and ensure the `xtb` command is available on your `PATH`.
#### Installing xTB
```bash
conda install -c conda-forge xtb
```
Or build from source (requires GCC >= 10):
```bash
git clone --depth 1 https://github.com/grimme-lab/xtb.git
cd xtb
cmake -B build -S . -DCMAKE_BUILD_TYPE=Release
make -C build -j8
```
To use a custom xTB binary, set the `xtb_cmd` key in your YAML config.
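A minimal sketch of such an override (the `xtb_cmd` key comes from the sentence above; the path and surrounding YAML structure are illustrative, so check the YAML Reference for the actual schema):

```yaml
# Point mlmm-toolkit at a custom xTB binary (illustrative path)
xtb_cmd: /opt/xtb/bin/xtb
```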
Build the `hessian_ff` C++ native extension. In most environments the native extension is JIT compiled on first use. If you see a warning about the native extension not being available, build it manually:

```bash
cd $(python -c "import hessian_ff; print(hessian_ff.__path__[0])")/native && make
```

This compiles the C++ code that provides fast Amber force field energy, force, and Hessian calculations.

Note: If you move to a different runtime environment, install Ninja and rebuild in that environment:

```bash
conda install -c conda-forge ninja -y
cd $(python -c "import hessian_ff; print(hessian_ff.__path__[0])")/native && make clean && make
```

Install Chrome for visualization:

```bash
plotly_get_chrome -y
```

Log in to Hugging Face Hub (required for the UMA backend only):

```bash
huggingface-cli login
```

Verify the installation:

```bash
mlmm --version
```

This should display the installed version (e.g., `0.x.y`; the exact output depends on the git tag).
Multi-backend examples¶
The default MLIP backend is UMA. Use -b/--backend to switch to an alternative backend, and --embedcharge to enable xTB point-charge embedding:
```bash
# Use ORB backend
mlmm opt -i ml_region.pdb --parm real.parm7 --model-pdb ml.pdb -q 0 -b orb

# Use MACE backend
mlmm opt -i ml_region.pdb --parm real.parm7 --model-pdb ml.pdb -q 0 -b mace

# Enable xTB point-charge embedding
mlmm opt -i ml_region.pdb --parm real.parm7 --model-pdb ml.pdb -q 0 --embedcharge
```
Quickstart routes (recommended)¶
Typical workflow¶
The mlmm all command orchestrates a multi-step pipeline. When run individually, the typical workflow is:
1. `extract` - Define ML region from full protein-ligand PDB
2. `mm-parm` - Generate Amber parm7/rst7 topology (requires AmberTools)
3. `define-layer` - Assign 3-layer ML/MM partitioning (B-factor encoding)
4. `path-search` - MEP search (recursive `path-search` by default); add `--no-refine-path` for single-pass `path-opt`
5. `tsopt` - Transition state optimization
6. `freq` - Vibrational analysis and thermochemistry
7. `dft` - Single-point DFT energy refinement
The all command runs steps 1-7 automatically. You can also run each step individually for debugging or custom workflows.
Command line basics¶
The main entry point is the mlmm command, installed via pip. Internally it uses the Click library, and the default subcommand is all.
This is equivalent to:
```bash
mlmm [OPTIONS]...
# is equivalent to
mlmm all [OPTIONS]...
```
The all command runs the full pipeline – ML region extraction, MM parameterization, layer definition, MEP search, TS optimization, vibrational analysis, and optional DFT – in a single invocation.
All high-level workflows share two important options when you use ML region extraction:
- `-i/--input`: one or more full structures (reactant, intermediate(s), product).
- `-c/--center`: how to define the substrate / extraction center (e.g., residue names or residue IDs).
If you omit --center/-c, ML region extraction is skipped and the full input structure is used directly.
Main workflow modes¶
Multi-structure MEP workflow (reactant -> product)¶
Use this when you already have several full PDB structures along a putative reaction coordinate (e.g., R -> I1 -> I2 -> P).
Minimal example
```bash
mlmm -i R.pdb P.pdb -c 'SAM,GPP' -l 'SAM:1,GPP:-3'
```
Richer example
```bash
mlmm -i R.pdb I1.pdb I2.pdb P.pdb -c 'SAM,GPP' -l 'SAM:1,GPP:-3' --out-dir ./result_all --tsopt --thermo --dft
```
Behavior:
- takes two or more full systems in reaction order,
- defines the ML region for each structure,
- generates Amber parm7/rst7 topology and assigns 3-layer ML/MM partitioning,
- performs MEP search via recursive `path-search` by default (outputs under `path_search/`),
- optionally switches to a single-pass `path-opt` run with `--no-refine-path`,
- when PDB templates are available, merges the ML-region MEP back into the full system,
- optionally runs TS optimization, vibrational analysis, and single-point DFT calculations for each segment.
This is the recommended mode when you can generate reasonably spaced intermediates (e.g., from docking, MD, or manual modeling).
Important
mlmm-toolkit assumes that multiple input PDBs contain exactly the same atoms in the same order (only coordinates may differ). If any non-coordinate fields differ across inputs, an error is raised. Input PDB files must also contain hydrogen atoms.
Single-structure + staged scan (feeds MEP refinement)¶
Use this when you prefer to define reaction coordinates yourself, rather than providing multiple endpoint structures.
Provide a single -i together with --scan-lists:
Minimal example
```bash
mlmm -i R.pdb -c 'SAM,GPP' -l 'SAM:1,GPP:-3' --scan-lists '[("TYR 285 CA","MMT 309 C10",2.20),("TYR 285 CB","MMT 309 C11",1.80)]' '[("TYR 285 CB","MMT 309 C11",1.20)]'
```
Richer example
```bash
mlmm -i SINGLE.pdb -c 'SAM,GPP' -l 'SAM:1,GPP:-3' --scan-lists '[("TYR 285 CA","MMT 309 C10",2.20),("TYR 285 CB","MMT 309 C11",1.80)]' '[("TYR 285 CB","MMT 309 C11",1.20)]' --multiplicity 1 --out-dir ./result_scan_all --tsopt --thermo --dft
```
Key points:
- `--scan-lists` describes staged distance scans on the extracted ML region.
- Each tuple `(i, j, target_A)` is:
    - a PDB atom selector string like `'TYR,285,CA'` (delimiters can be space, comma, slash, backtick, or backslash) or a 1-based atom index,
    - automatically remapped to the ML region indices.
- Supplying one `--scan-lists` literal runs a single scan stage; multiple literals run sequential stages. Pass multiple literals after a single flag (repeated flags are not accepted).
- Each stage writes a `stage_XX/result.pdb`, which is treated as a candidate intermediate or product.
- The default `all` workflow runs recursive `path-search` with automatic refinement on the concatenated stages.
- With `--no-refine-path`, it instead runs single-pass `path-opt` GSM per adjacent pair.
This mode is useful for building reaction paths starting from a single structure.
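Because each `--scan-lists` literal is a Python-style list of tuples, you can sanity-check it before launching a long run (a generic check using the standard library, not an mlmm subcommand):

```bash
python3 - <<'EOF'
import ast

literal = '[("TYR 285 CA","MMT 309 C10",2.20),("TYR 285 CB","MMT 309 C11",1.80)]'
stage = ast.literal_eval(literal)   # raises SyntaxError/ValueError if malformed
for sel_i, sel_j, target in stage:
    print(f"{sel_i} -> {sel_j}: target {target:.2f} A")
EOF
```

A quoting mistake in the literal then fails in seconds instead of at job submission time.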
Single-structure TSOPT-only mode¶
Use this when you already have a transition state candidate and only want to optimize it and proceed to IRC calculations.
Provide exactly one PDB and enable --tsopt:
Minimal example
```bash
mlmm -i TS_CANDIDATE.pdb -c 'SAM,GPP' -l 'SAM:1,GPP:-3' --tsopt
```
Richer example
```bash
mlmm -i TS_CANDIDATE.pdb -c 'SAM,GPP' -l 'SAM:1,GPP:-3' --tsopt --thermo --dft --out-dir ./result_tsopt_only
```
Behavior:
- skips the MEP/path search entirely,
- optimizes the TS at the ML/MM level,
- runs an IRC in both directions and optimizes both ends to relax down to R and P minima,
- can then perform `freq` and `dft` on the R/TS/P,
- produces MLIP, Gibbs, and DFT//MLIP energy diagrams.
Important
Single-input runs require either --scan-lists (staged scan -> GSM) or --tsopt (TSOPT-only). Supplying only a single -i without one of these will not trigger a full workflow.
Important CLI options and behaviors¶
Below are the most commonly used options across workflows.
| Option | Description |
|---|---|
| `-i/--input` | Input structures. >= 2 PDBs -> MEP search; 1 PDB + `--scan-lists` -> staged scan; 1 PDB + `--tsopt` -> TSOPT-only. |
| `-c/--center` | Defines the substrate / extraction center. Supports residue names (e.g., `'SAM,GPP'`) or residue IDs. |
| `-l/--ligand-charge` | Charge info: mapping (e.g., `'SAM:1,GPP:-3'`). |
| `-q/--charge` | Hard override of total system charge. |
| `-m/--multiplicity` | Spin multiplicity (e.g., `1`). |
| `--scan-lists` | Staged distance scans for single-input runs (YAML/JSON file or inline literals). |
| `--out-dir` | Top-level output directory. |
| `--tsopt` | Enable TS optimization and IRC. |
| `--thermo` | Run vibrational analysis and thermochemistry. |
| `--dft` | Perform single-point DFT calculations. |
| `--no-refine-path` | Recursive `path-search` refinement is the default; this flag switches to single-pass `path-opt`. |
| (see `mlmm all --help-advanced`) | MEP method: Growing String Method or Direct Max Flux. |
| (see `mlmm all --help-advanced`) | Workflow preset. |
| `-b/--backend` | MLIP backend for the ML region (default: `uma`). |
| `--embedcharge` | Enable xTB point-charge embedding correction (default: off). |
| `--hessian-calc-mode` | ML Hessian calculation mode. |
For a full matrix of options and YAML schemas, see YAML Reference.
Run summaries¶
Every mlmm all run writes:
- `summary.log` – formatted summary for quick inspection, and
- `summary.json` – JSON results.
They typically contain:
- the exact CLI command invoked,
- global MEP statistics (e.g., maximum barrier, path length),
- per-segment barrier heights and key bond changes,
- energies from the MLIP backend, thermochemistry, and DFT post-processing (where enabled).
Each segment directory under path_search/ (or path_opt/ when --no-refine-path is used) also gets its own summary.log and summary.json, so you can inspect local refinements independently.
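Because `summary.json` is plain JSON, downstream scripts can pull values out of it. The field names below are placeholders for illustration; inspect your own `summary.json` for the actual schema:

```bash
# Hypothetical summary.json content, created here only for demonstration.
cat > demo_summary.json <<'EOF'
{"command": "mlmm all -i R.pdb P.pdb", "max_barrier_kcal": 18.3}
EOF

python3 - <<'EOF'
import json
with open("demo_summary.json") as fh:
    summary = json.load(fh)
print(f"max barrier: {summary['max_barrier_kcal']} kcal/mol")
EOF
```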
CLI commands¶
Most users will primarily call mlmm all. The CLI also exposes individual subcommands – each supports -h/--help.
mlmm all --help shows core options and mlmm all --help-advanced shows the complete list.
scan, scan2d, scan3d, the calculation commands (opt, path-opt, path-search, tsopt, freq, irc, dft), and selected utility commands (mm-parm, define-layer, add-elem-info, trj2fig, energy-diagram, oniom-export) follow the same progressive-help pattern (`--help` shows core options; `--help-advanced` shows the full parser options). extract and fix-altloc also support progressive help.
| Subcommand | Role |
|---|---|
| `all` | End-to-end workflow |
| `extract` | Define ML region (QM region) |
| `mm-parm` | Generate Amber parm7/rst7 topology |
| `define-layer` | Assign 3-layer ML/MM partitioning |
| `opt` | Geometry optimization |
| `tsopt` | Transition state optimization |
| `path-opt` | MEP optimization (GSM/DMF) |
| `path-search` | Recursive MEP search |
| `scan` | 1D bond-length scan |
| `scan2d` | 2D distance scan |
| `scan3d` | 3D distance scan |
| `irc` | IRC calculation |
| `freq` | Vibrational analysis |
| `dft` | Single-point DFT |
| `oniom-export` | Export to Gaussian ONIOM / ORCA QM/MM (`--mode g16` or `orca`) |
| (see CLI reference) | Import Gaussian/ORCA ONIOM input and reconstruct XYZ + layered PDB |
| `trj2fig` | Plot energy profiles |
| `energy-diagram` | Draw state energy diagram from numeric values |
| `add-elem-info` | Repair PDB element columns |
| `fix-altloc` | Remove alternate location (altLoc) indicators from PDB files |
| (see CLI reference) | Run pysisyphus YAML workflow (v0.1.x compat) |

Each subcommand has a dedicated documentation page.
Tip
In `all`, `tsopt`, `freq`, and `irc`, setting `--hessian-calc-mode Analytical` (for the ML region) is strongly recommended when you have enough VRAM. Note: Analytical mode is only available with the UMA backend; other backends automatically use FiniteDifference.
Quick reference¶
Common command patterns:
```bash
# Basic MEP search (2+ structures)
mlmm -i R.pdb P.pdb -c 'SUBSTRATE' -l 'SUB:-1'

# Full workflow with post-processing
mlmm -i R.pdb P.pdb -c 'SAM,GPP' -l 'SAM:1,GPP:-3' \
    --tsopt --thermo --dft

# Single structure with staged scan
mlmm -i SINGLE.pdb -c 'LIG' -l 'LIG:-1' --scan-lists '[("RES1,100,CA","LIG,200,C1",2.0)]'

# TS-only optimization
mlmm -i TS.pdb -c 'LIG' -l 'LIG:-1' --tsopt --thermo

# Individual subcommands (after running extract + mm-parm + define-layer)
mlmm path-search -i R.pdb P.pdb --parm real.parm7 --model-pdb model.pdb -q 0 -m 1
mlmm tsopt -i ts_guess.pdb --parm real.parm7 --model-pdb model.pdb -q 0 -m 1
```
Essential options:
| Option | Purpose |
|---|---|
| `-i/--input` | Input structure(s) |
| `-c/--center` | Substrate definition for ML region extraction |
| `-l/--ligand-charge` | Substrate charges (e.g., `'SAM:1,GPP:-3'`) |
| `--parm` | Amber parm7 topology file (required for subcommands) |
| `--model-pdb` | ML region PDB file (required for subcommands) |
| `--tsopt` | Enable TS optimization + IRC |
| `--thermo` | Run vibrational analysis |
| `--dft` | Run single-point DFT |
| `-b/--backend` | MLIP backend (`uma`, `orb`, `mace`, `aimnet2`) |
| `--embedcharge` | Enable xTB point-charge embedding correction |
| `--out-dir` | Output directory |
Getting help¶
For any subcommand:
```bash
mlmm <subcommand> --help
mlmm <subcommand> --help-advanced
mlmm all --help-advanced
```
For all, --help is intentionally short. Use --help-advanced to see every option.
Driving mlmm-toolkit from an AI coding agent¶
mlmm-toolkit ships a `.claude/skills/` directory with agent-readable instructions covering CLI usage, structure I/O, backend installation, typical workflows, and HPC operation. To use them with Claude Code, Cursor, OpenCode, or other agent platforms, copy `.claude/skills/` into your project or home directory and the agent will pick them up automatically. See the project README "Agent Skills" section for the full list.