Getting Started¶
Overview¶
mlmm-toolkit is a Python CLI toolkit for computing enzymatic reaction pathways using an ML/MM (Machine Learning / Molecular Mechanics) approach. It couples an MLIP (Machine-Learned Interatomic Potential) backend for the reactive (ML) region with a classical force field (hessian_ff) for the surrounding protein environment, using an ONIOM-like energy decomposition. The default backend is UMA (Meta’s FAIR-Chem); alternative backends (orb, mace, aimnet2) can be selected via --backend.
In many workflows, a single command is enough to generate a useful first-pass reaction path:
mlmm all -i R.pdb P.pdb -c 'SAM,GPP' -l 'SAM:1,GPP:-3'
You can also run MEP search, TS optimization, IRC, thermochemistry, and single-point DFT in a single run by adding --tsopt --thermo --dft:
mlmm all -i R.pdb P.pdb -c 'SAM,GPP' -l 'SAM:1,GPP:-3' --tsopt --thermo --dft
Given (i) two or more full protein-ligand PDB files (R, …, P), or (ii) one PDB with --scan-lists, or (iii) one TS candidate with --tsopt, mlmm-toolkit automatically:

- extracts an active-site pocket around user-defined substrates to build a cluster model,
- generates Amber parm7/rst7 topology files for the MM region (`mm-parm`),
- assigns a 3-layer ML/MM partitioning (`define-layer`),
- explores minimum-energy paths (MEPs) with path-optimization methods such as the Growing String Method (GSM) and Direct Max Flux (DMF),
- optionally optimizes transition states, runs vibrational analysis, IRC calculations, and single-point DFT calculations.
At the ML stage, the reactive region uses a machine-learned interatomic potential (MLIP). The default backend is UMA (Meta’s FAIR-Chem); alternative backends include ORB, MACE, and AIMNet2 (selected via -b/--backend). The MM region uses hessian_ff, a C++ native extension that computes Amber force field energies, forces, and Hessians. The total energy follows an ONIOM-like decomposition:
E_total = E_REAL_low + E_MODEL_high - E_MODEL_low
where REAL is the full system, MODEL is the ML region, “high” is the selected MLIP backend (default: UMA), and “low” is hessian_ff.
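The subtractive combination can be sketched in a few lines of Python. This is purely an illustration of the arithmetic; the function name and the energies below are made up and are not part of the toolkit's API:

```python
def oniom_total(e_real_low: float, e_model_high: float, e_model_low: float) -> float:
    """Subtractive ONIOM-like combination: the low-level (MM) description
    of the ML region, already contained in e_real_low, is swapped out for
    the high-level (MLIP) one."""
    return e_real_low + e_model_high - e_model_low

# Made-up energies, just to show the bookkeeping: the MM energy of the
# ML region cancels, leaving the MLIP description of the reactive core.
e_total = oniom_total(e_real_low=-1250.0, e_model_high=-310.5, e_model_low=-308.2)
```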
The CLI is designed to generate multi-step enzymatic reaction mechanisms with minimal manual intervention. The same workflow also works for small-molecule systems. When you skip pocket extraction (omit --center/-c and --ligand-charge), you can also use .xyz or .gjf inputs.
Important
- Input PDB files must already contain hydrogen atoms.
- When you provide multiple PDBs, they must contain the same atoms in the same order (only coordinates may differ); otherwise an error is raised.
- Most subcommands require `--parm` (Amber topology) and `--model-pdb` (ML region definition). The `all` command generates these automatically.
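The same-atoms requirement can be verified before a run with a short script. This is a sketch based on the fixed PDB column layout, not part of the toolkit:

```python
def atom_signature(pdb_text: str):
    """(atom name, residue name, residue number) for every ATOM/HETATM
    record, read from the fixed PDB columns; coordinates are ignored."""
    return [
        (line[12:16].strip(), line[17:20].strip(), line[22:26].strip())
        for line in pdb_text.splitlines()
        if line.startswith(("ATOM", "HETATM"))
    ]

def same_atoms(*pdb_texts: str) -> bool:
    """True when every structure lists identical atoms in identical order."""
    sigs = [atom_signature(t) for t in pdb_texts]
    return all(s == sigs[0] for s in sigs[1:])
```

Running `same_atoms()` on the text of each input PDB returns `True` only when the structures differ in coordinates alone.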
Tip
If you are new to the project, read Concepts & Workflow first. For symptom-first diagnosis, start with Common Error Recipes. If you encounter an error during setup or runtime, refer to Troubleshooting.
CLI conventions¶
| Convention | Example |
|---|---|
| Residue selectors | `-c 'SAM,GPP'` |
| Charge mapping | `-l 'SAM:1,GPP:-3'` |
| Atom selectors | `'TYR 285 CA'` (also `'TYR,285,CA'`) |
For full details, see CLI Conventions.
path-search naming note: the CLI subcommand is path-search, while the documentation filename is path_search.md.
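As an illustration of the charge-mapping convention, a minimal parser might look like the following (a sketch only, not the toolkit's actual implementation):

```python
def parse_ligand_charges(spec: str) -> dict:
    """Parse a 'RES:charge,RES:charge' string such as 'SAM:1,GPP:-3'."""
    charges = {}
    for item in spec.split(","):
        name, charge = item.rsplit(":", 1)   # split on the last ':' only
        charges[name.strip()] = int(charge)
    return charges

print(parse_ligand_charges("SAM:1,GPP:-3"))  # {'SAM': 1, 'GPP': -3}
```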
Recommended tools for hydrogen addition¶
If your PDB lacks hydrogen atoms, use one of the following tools before running mlmm-toolkit:
| Tool | Example Command | Notes |
|---|---|---|
| reduce (Richardson Lab) | `reduce -BUILD input.pdb > output.pdb` | Fast, widely used for crystallographic structures |
| pdb2pqr | `pdb2pqr --ff=AMBER input.pdb output.pqr` | Adds hydrogens and assigns partial charges |
| Open Babel | `obabel input.pdb -O output.pdb -h` | General-purpose cheminformatics toolkit |
| `mlmm mm-parm --add-h` | `mlmm mm-parm --add-h …` | Hydrogen addition via PDBFixer (through AmberTools) |
To ensure identical atom ordering across multiple PDB inputs, apply the same hydrogen-addition tool with consistent settings to all structures.
Warning
This software is still under development. Please use it at your own risk.
Installation¶
mlmm-toolkit is intended for Linux environments (local workstations or HPC clusters) with a CUDA-capable GPU. Several dependencies – notably PyTorch, fairchem-core (UMA), gpu4pyscf-cuda12x, and hessian_ff – expect a working CUDA installation. Alternative MLIP backends (ORB, MACE, AIMNet2) have their own optional dependencies; see the install extras below.
Prerequisites¶
mlmm-toolkit uses the following components:
- MLIP backends: energy, force, and Hessian calculations for the ML region. The default is UMA (fairchem-core). ORB (`pip install "mlmm-toolkit[orb]"`) and AIMNet2 (`pip install "mlmm-toolkit[aimnet2]"`) are also available. MACE requires a separate environment due to e3nn conflicts.
- hessian_ff: Amber force field calculations for the MM region (requires building the C++ extension).
- AmberTools: automatic parm7/rst7 generation via the `mm-parm` subcommand (tleap, antechamber, parmchk2).
Refer to the upstream projects for additional details:
fairchem / UMA: https://github.com/facebookresearch/fairchem, https://huggingface.co/facebook/UMA
Hugging Face token & security: https://huggingface.co/docs/hub/security-tokens
Quick start¶
Below is a minimal setup example that works on many CUDA 12.9 clusters. Adjust module names and versions to match your system. This example assumes the default GSM MEP mode (no DMF). For DMF, install cyipopt via conda first.
# 1) Install a CUDA-enabled PyTorch build
pip install torch --index-url https://download.pytorch.org/whl/cu129

# 2) Install mlmm-toolkit
pip install mlmm-toolkit
# Previous stable release (v0.1.1):
# pip install git+https://github.com/t-0hmura/mlmm_toolkit.git@v0.1.1

# Optional: install alternative MLIP backends
pip install "mlmm-toolkit[orb]"      # ORB backend
pip install "mlmm-toolkit[aimnet2]"  # AIMNet2 backend
# MACE: pip uninstall fairchem-core && pip install mace-torch (separate env required)

# 3) Build the hessian_ff C++ native extension
cd $(python -c "import hessian_ff; print(hessian_ff.__path__[0])")/native && make

# 4) Install a headless Chrome for Plotly figure export
plotly_get_chrome -y
Note: If you switch to a different runtime environment (node/container/Python/PyTorch), rebuild hessian_ff in that environment. On many clusters, install Ninja first:

conda install -c conda-forge ninja -y
cd hessian_ff/native && make clean && make
Next, log in to Hugging Face Hub so that UMA models can be downloaded (only required when using the default uma backend). Either:
# Hugging Face CLI
hf auth login --token '<YOUR_ACCESS_TOKEN>' --add-to-git-credential
or
# Classic CLI
huggingface-cli login
You only need to do this once per machine / environment.
If you want to use the Direct Max Flux (DMF) method for MEP search, create a conda environment and install cyipopt before installing mlmm-toolkit.
# Create and activate a dedicated conda environment
conda create -n mlmm python=3.11 -y
conda activate mlmm
# Install cyipopt (required for the DMF method in MEP search)
conda install -c conda-forge cyipopt -y
If you are on an HPC cluster that uses environment modules, load CUDA before installing PyTorch, like this:
module load cuda/12.9
AmberTools is required for the `mlmm mm-parm` subcommand (Amber topology generation); see the AmberTools installation section below.
AmberTools installation¶
The mm-parm subcommand (automatic parm7/rst7 generation) requires AmberTools. The easiest way to install it is via conda:
conda install -c conda-forge ambertools -y
Even without AmberTools, other subcommands work if you provide --parm manually.
Step-by-step installation¶
If you prefer to build the environment piece by piece:
Load CUDA (if you use environment modules on an HPC cluster)
module load cuda/12.9
Create and activate a conda environment
conda create -n mlmm python=3.11 -y
conda activate mlmm
Install cyipopt (required for the DMF method in MEP search)
conda install -c conda-forge cyipopt -y
Install AmberTools (required for the `mlmm mm-parm` subcommand: Amber topology generation with tleap/antechamber)

conda install -c conda-forge ambertools -y
Install PyTorch with the right CUDA build
For CUDA 12.9:
pip install torch --index-url https://download.pytorch.org/whl/cu129
(You may use another compatible version if your cluster recommends it.)
Install mlmm-toolkit
pip install mlmm-toolkit
To install with optional MLIP backends:
pip install "mlmm-toolkit[orb]" # ORB backend
pip install "mlmm-toolkit[aimnet2]" # AIMNet2 backend
# MACE: pip uninstall fairchem-core && pip install mace-torch (separate env required)
Build the `hessian_ff` C++ native extension
cd $(python -c "import hessian_ff; print(hessian_ff.__path__[0])")/native && make
This compiles the C++ code that provides fast Amber force field energy, force, and Hessian calculations.
Note: If you move to a different runtime environment, install Ninja and rebuild in that environment:
conda install -c conda-forge ninja -y
cd $(python -c "import hessian_ff; print(hessian_ff.__path__[0])")/native && make clean && make
Install Chrome for visualization
plotly_get_chrome -y
Log in to Hugging Face Hub (required for UMA backend only)
huggingface-cli login
Verify installation
mlmm --version
This should display the installed version (e.g., 0.x.y; the exact number depends on the release you installed).
Multi-backend examples¶
The default MLIP backend is UMA. Use -b/--backend to switch to an alternative backend, and --embedcharge to enable xTB point-charge embedding:
# Use ORB backend
mlmm opt -i pocket.pdb --parm real.parm7 --model-pdb ml.pdb -q 0 -b orb
# Use MACE backend
mlmm opt -i pocket.pdb --parm real.parm7 --model-pdb ml.pdb -q 0 -b mace
# Enable xTB point-charge embedding
mlmm opt -i pocket.pdb --parm real.parm7 --model-pdb ml.pdb -q 0 --embedcharge
Quickstart routes (recommended)¶
Typical workflow¶
The mlmm all command orchestrates a multi-step pipeline. When run individually, the typical workflow is:
1. extract - Extract active-site pocket from full protein-ligand PDB
2. mm-parm - Generate Amber parm7/rst7 topology (requires AmberTools)
3. define-layer - Assign 3-layer ML/MM partitioning (B-factor encoding)
4. path-search - Recursive MEP search (Growing String Method)
5. tsopt - Transition state optimization
6. freq - Vibrational analysis and thermochemistry
7. dft - Single-point DFT energy refinement
The all command runs steps 1-7 automatically. You can also run each step individually for debugging or custom workflows.
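The gating of the optional post-processing steps can be sketched as follows (illustrative only; `plan_steps` is not a toolkit function):

```python
def plan_steps(tsopt: bool = False, thermo: bool = False, dft: bool = False):
    """Order in which `mlmm all` runs its stages; the last three are
    enabled by --tsopt, --thermo, and --dft respectively."""
    steps = ["extract", "mm-parm", "define-layer", "path-search"]
    if tsopt:
        steps.append("tsopt")
    if thermo:
        steps.append("freq")
    if dft:
        steps.append("dft")
    return steps
```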
Command line basics¶
The main entry point is the mlmm command, installed via pip. Internally it uses the Click library, and the default subcommand is all.
Running mlmm with no explicit subcommand dispatches to all:
mlmm [OPTIONS]...
# is equivalent to
mlmm all [OPTIONS]...
The all command runs the full pipeline – cluster extraction, MM parameterization, layer definition, MEP search, TS optimization, vibrational analysis, and optional DFT – in a single invocation.
All high-level workflows share two important options when you use cluster extraction:
- `-i/--input`: one or more full structures (reactant, intermediate(s), product).
- `-c/--center`: how to define the substrate / extraction center (e.g., residue names or residue IDs).
If you omit --center/-c, cluster extraction is skipped and the full input structure is used directly.
Main workflow modes¶
Multi-structure MEP workflow (reactant -> product)¶
Use this when you already have several full PDB structures along a putative reaction coordinate (e.g., R -> I1 -> I2 -> P).
Minimal example
mlmm -i R.pdb P.pdb -c 'SAM,GPP' -l 'SAM:1,GPP:-3'
Richer example
mlmm -i R.pdb I1.pdb I2.pdb P.pdb -c 'SAM,GPP' -l 'SAM:1,GPP:-3' --out-dir ./result_all --tsopt --thermo --dft
Behavior:
- takes two or more full systems in reaction order,
- extracts catalytic cluster models for each structure,
- generates Amber parm7/rst7 topology and assigns 3-layer ML/MM partitioning,
- performs a recursive MEP search via `path-search` by default (outputs under `path_search/`),
- optionally switches to a single-pass `path-opt` run with `--no-refine-path`,
- when PDB templates are available, merges the cluster-model MEP back into the full system,
- optionally runs TS optimization, vibrational analysis, and single-point DFT calculations for each segment.
This is the recommended mode when you can generate reasonably spaced intermediates (e.g., from docking, MD, or manual modeling).
Important
mlmm-toolkit assumes that multiple input PDBs contain exactly the same atoms in the same order (only coordinates may differ). If any non-coordinate fields differ across inputs, an error is raised. Input PDB files must also contain hydrogen atoms.
Single-structure + staged scan (feeds MEP refinement)¶
Use this when you only have one PDB structure, but you know which inter-atomic distances should change along the reaction.
Provide a single -i together with --scan-lists:
Minimal example
mlmm -i R.pdb -c 'SAM,GPP' -l 'SAM:1,GPP:-3' --scan-lists '[("TYR 285 CA","MMT 309 C10",2.20),("TYR 285 CB","MMT 309 C11",1.80)]' '[("TYR 285 CB","MMT 309 C11",1.20)]'
Richer example
mlmm -i SINGLE.pdb -c 'SAM,GPP' --scan-lists '[("TYR 285 CA","MMT 309 C10",2.20),("TYR 285 CB","MMT 309 C11",1.80)]' '[("TYR 285 CB","MMT 309 C11",1.20)]' --multiplicity 1 --out-dir ./result_scan_all --tsopt --thermo --dft
Key points:
- `--scan-lists` describes staged distance scans on the extracted cluster model.
- In each tuple `(i, j, target_A)`, `i` and `j` are either PDB atom selector strings like `'TYR,285,CA'` (accepted delimiters: space, comma, slash, backtick, or backslash) or 1-based atom indices; both are automatically remapped to the cluster-model indices, and `target_A` is the target distance in Å.
- Supplying one `--scan-lists` literal runs a single scan stage; multiple literals run sequential stages. Pass all literals after a single flag (repeated flags are not accepted).
- Each stage writes a `stage_XX/result.pdb`, which is treated as a candidate intermediate or product.
- The default `all` workflow refines the concatenated stages with recursive `path-search`. With `--no-refine-path`, it instead performs a single-pass `path-opt` chain and skips the recursive refiner.
This mode is useful for building reaction paths starting from a single structure.
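The selector grammar used by `--scan-lists` can be illustrated with a tiny parser. This is a sketch of the documented behavior (flexible delimiters, 1-based indices), not the toolkit's actual code:

```python
import re

def parse_selector(sel: str):
    """Parse 'TYR 285 CA' / 'TYR,285,CA' into (resname, resid, atom name);
    a bare integer is taken as a 1-based atom index.
    Accepted delimiters, per the docs: space , / ` and backslash."""
    sel = sel.strip()
    if sel.isdigit():
        return int(sel)                         # 1-based atom index
    resname, resid, atom = re.split(r"[ ,/`\\]+", sel)
    return resname, int(resid), atom
```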
Single-structure TSOPT-only mode¶
Use this when you already have a transition state candidate and only want to optimize it and proceed to IRC calculations.
Provide exactly one PDB and enable --tsopt:
Minimal example
mlmm -i TS_CANDIDATE.pdb -c 'SAM,GPP' -l 'SAM:1,GPP:-3' --tsopt
Richer example
mlmm -i TS_CANDIDATE.pdb -c 'SAM,GPP' -l 'SAM:1,GPP:-3' --tsopt --thermo --dft --out-dir ./result_tsopt_only
Behavior:
- skips the MEP/path search entirely,
- optimizes the cluster-model TS,
- runs an IRC in both directions and relaxes both ends down to the R and P minima,
- can then perform `freq` and `dft` on R/TS/P,
- produces MLIP, Gibbs, and DFT//MLIP energy diagrams.
Important
Single-input runs require either --scan-lists (staged scan -> GSM) or --tsopt (TSOPT-only). Supplying only a single -i without one of these will not trigger a full workflow.
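The dispatch rule across the three workflow modes can be summarized in Python (illustrative only; `choose_mode` is not part of the toolkit):

```python
def choose_mode(n_inputs: int, scan_lists: bool = False, tsopt: bool = False) -> str:
    """Which workflow a run falls into, per the rules above."""
    if n_inputs >= 2:
        return "mep-search"      # R ... P: multi-structure MEP workflow
    if scan_lists:
        return "staged-scan"     # one PDB + --scan-lists
    if tsopt:
        return "tsopt-only"      # one TS candidate + --tsopt
    raise ValueError("a single -i requires --scan-lists or --tsopt")
```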
Important CLI options and behaviors¶
Below are the most commonly used options across workflows.
| Option | Description |
|---|---|
| `-i/--input` | Input structures. >= 2 PDBs -> MEP search; 1 PDB + `--scan-lists` -> staged scan; 1 PDB + `--tsopt` -> TSOPT-only. |
| `-c/--center` | Defines the substrate / extraction center. Supports residue names (e.g., `SAM,GPP`) or residue IDs. |
| `-l/--ligand-charge` | Charge info: mapping (e.g., `SAM:1,GPP:-3`). |
| `-q` | Hard override of total system charge. |
| `-m/--multiplicity` | Spin multiplicity (e.g., `1`). |
| `--scan-lists` | Staged distance scans for single-input runs (YAML/JSON file or inline literals). |
| `--out-dir` | Top-level output directory. |
| `--tsopt` | Enable TS optimization and IRC. |
| `--thermo` | Run vibrational analysis and thermochemistry. |
| `--dft` | Perform single-point DFT calculations. |
| `--no-refine-path` | Recursive MEP refinement (default) vs single-pass. |
| — | Workflow preset. |
| `-b/--backend` | MLIP backend for the ML region (default: `uma`). |
| `--embedcharge` | Enable xTB point-charge embedding correction (default: off). |
| `--hessian-calc-mode` | ML Hessian calculation mode. |
For a full matrix of options and YAML schemas, see YAML Reference.
Run summaries¶
Every mlmm all run writes:
- `summary.log`: a formatted summary for quick inspection, and
- `summary.yaml`: the same summary in YAML form.
They typically contain:
- the exact CLI command invoked,
- global MEP statistics (e.g., maximum barrier, path length),
- per-segment barrier heights and key bond changes,
- energies from the MLIP backend, thermochemistry, and DFT post-processing (where enabled).
Each segment directory under path_search/ also gets its own summary.log and summary.yaml, so you can inspect local refinements independently.
CLI commands¶
Most users will primarily call mlmm all. The CLI also exposes individual subcommands – each supports -h/--help.
mlmm all --help shows core options and mlmm all --help-advanced shows the complete list.
scan, scan2d, scan3d, the calculation commands (opt, path-opt, path-search, tsopt, freq, irc, dft), and selected utility commands (mm-parm, define-layer, add-elem-info, trj2fig, energy-diagram, oniom-export) now follow the same progressive-help pattern (--help core, --help-advanced full). extract also supports progressive help (--help core, --help-advanced full parser options).
| Subcommand | Role |
|---|---|
| `all` | End-to-end workflow |
| `extract` | Extract active-site pocket (cluster model) |
| `mm-parm` | Generate Amber parm7/rst7 topology |
| `define-layer` | Assign 3-layer ML/MM partitioning |
| `opt` | Geometry optimization |
| `tsopt` | Transition state optimization |
| `path-opt` | MEP optimization (GSM/DMF) |
| `path-search` | Recursive MEP search |
| `scan` | 1D bond-length scan |
| `scan2d` | 2D distance scan |
| `scan3d` | 3D distance scan |
| `irc` | IRC calculation |
| `freq` | Vibrational analysis |
| `dft` | Single-point DFT |
| `oniom-export` | Export to Gaussian ONIOM / ORCA QM/MM (`--mode g16` / `--mode orca`) |
| — | Import Gaussian/ORCA ONIOM input and reconstruct XYZ + layered PDB |
| `trj2fig` | Plot energy profiles |
| `energy-diagram` | Draw state energy diagram from numeric values |
| `add-elem-info` | Repair PDB element columns |
| — | Remove alternate location (altLoc) indicators from PDB files |
| — | Run pysisyphus YAML workflow (v0.1.x compat) |
Tip
In `all`, `tsopt`, `freq`, and `irc`, setting `--hessian-calc-mode Analytical` (for the ML region) is strongly recommended when you have enough VRAM. Note: Analytical mode is only available with the UMA backend; other backends automatically use FiniteDifference.
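The backend restriction in this tip amounts to the following rule (a sketch, not the toolkit's code):

```python
def effective_hessian_mode(backend: str, requested: str = "Analytical") -> str:
    """Analytical ML Hessians are only available with UMA; any other
    backend silently falls back to finite differences."""
    if requested == "Analytical" and backend != "uma":
        return "FiniteDifference"
    return requested
```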
Quick reference¶
Common command patterns:
# Basic MEP search (2+ structures)
mlmm -i R.pdb P.pdb -c 'SUB' -l 'SUB:-1'
# Full workflow with post-processing
mlmm -i R.pdb P.pdb -c 'SAM,GPP' -l 'SAM:1,GPP:-3' \
--tsopt --thermo --dft
# Single structure with staged scan
mlmm -i SINGLE.pdb -c 'LIG' --scan-lists '[("RES1,100,CA","LIG,200,C1",2.0)]'
# TS-only optimization
mlmm -i TS.pdb -c 'LIG' --tsopt --thermo
# Individual subcommands (after running extract + mm-parm + define-layer)
mlmm path-search -i R.pdb P.pdb --parm real.parm7 --model-pdb model.pdb -q 0 -m 1
mlmm tsopt -i ts_guess.pdb --parm real.parm7 --model-pdb model.pdb -q 0 -m 1
Essential options:
| Option | Purpose |
|---|---|
| `-i/--input` | Input structure(s) |
| `-c/--center` | Substrate definition for pocket extraction |
| `-l/--ligand-charge` | Substrate charges (e.g., `SAM:1,GPP:-3`) |
| `--parm` | Amber parm7 topology file (required for subcommands) |
| `--model-pdb` | ML region PDB file (required for subcommands) |
| `--tsopt` | Enable TS optimization + IRC |
| `--thermo` | Run vibrational analysis |
| `--dft` | Run single-point DFT |
| `-b/--backend` | MLIP backend (`uma`, `orb`, `mace`, `aimnet2`) |
| `--embedcharge` | Enable xTB point-charge embedding correction |
| `--out-dir` | Output directory |
Getting help¶
For any subcommand:
mlmm <subcommand> --help
mlmm <subcommand> --help-advanced
mlmm all --help-advanced
For all, --help is intentionally short. Use --help-advanced to see every option.