Device Configuration & HPC Setup¶

How to configure GPU/CPU devices for the ML/MM calculator and submit jobs on HPC clusters.

Device Parameters¶

The ML/MM calculator (mlmm_calc.mlmm) uses separate device settings for the ML and MM backends:

Parameter	Default	Description
`ml_device`	`auto`	Device for MLIP inference. `auto` selects CUDA if available, otherwise CPU.
`ml_cuda_idx`	`0`	CUDA device index when `ml_device=cuda`.
`mm_backend`	`hessian_ff`	MM force field engine. `hessian_ff` (analytical, CPU-only) or `openmm` (supports CUDA).
`mm_device`	`cpu`	Device for the MM backend. `hessian_ff` requires `cpu` (or `auto`); `cuda` is available only with `openmm`.
`mm_cuda_idx`	`0`	CUDA device index when `mm_device=cuda` (openmm only).
`mm_threads`	`16`	Number of CPU threads for MM backend.

YAML configuration example¶

calc:
  ml_device: cuda
  ml_cuda_idx: 0
  mm_backend: hessian_ff
  mm_device: cpu
  mm_threads: 16

Using the OpenMM backend with CUDA¶

calc:
  ml_device: cuda
  ml_cuda_idx: 0
  mm_backend: openmm
  mm_device: cuda
  mm_cuda_idx: 0

Note: When both ML and MM run on the same CUDA device, they share its memory. For large systems, consider using mm_device: cpu to reduce VRAM consumption.

VRAM Management¶

Hessian device (`--hess-device`)¶

The freq command supports --hess-device to control where Hessian assembly and diagonalization run:

# Default: same device as ml_device (typically CUDA)
mlmm freq -i input.pdb --parm real.parm7 -q -1

# Force CPU for Hessian assembly (saves VRAM for large systems)
mlmm freq -i input.pdb --parm real.parm7 -q -1 --hess-device cpu

Use --hess-device cpu when:

The active region is large (> ~500 unfrozen atoms)
You encounter CUDA out-of-memory errors during frequency calculations
VRAM is limited (< 16 GB)
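The criteria above can be folded into a small pre-flight helper in a job script. The sketch below is only an illustration: `hess_device_flag` is a hypothetical function, not part of mlmm, and the thresholds (500 unfrozen atoms, 16 GB) are the approximate values from the list, with both numbers supplied by you. It prints `--hess-device cpu` when either threshold is crossed and nothing otherwise, so the default device is kept.

```shell
# Hypothetical helper (not part of mlmm): decide whether to add --hess-device cpu.
#   $1 = number of unfrozen atoms in the active region
#   $2 = GPU memory in whole GB
hess_device_flag() {
  if [ "$1" -gt 500 ] || [ "$2" -lt 16 ]; then
    echo "--hess-device cpu"
  fi
}

# Example: 800 unfrozen atoms on an 80 GB card crosses the atom threshold.
hess_device_flag 800 80

# Intended use (left unquoted so the flag splits into two words):
#   mlmm freq -i input.pdb --parm real.parm7 -q -1 $(hess_device_flag 800 80)
```

A CUDA out-of-memory error is not something this helper can predict; in that case rerun with --hess-device cpu regardless of what it prints.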

General VRAM tips¶

Reduce the ML region size: Use mlmm extract with a smaller --radius or mlmm define-layer with a tighter --radius-freeze.
Use hessian_ff (default): The hessian_ff backend is CPU-only, leaving all VRAM for MLIP inference.
Avoid OpenMM CUDA for large systems: If both ML and MM run on the same CUDA device, they compete for the same VRAM.
Monitor VRAM: print_vram defaults to True (VRAM usage is printed during Hessian computation); set print_vram: False in YAML to suppress it.
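Outside of mlmm's own print_vram output, VRAM headroom can also be checked from the shell with nvidia-smi before launching a job. This is a minimal sketch, assuming nvidia-smi is on the PATH; `vram_free_mib` is a hypothetical helper that subtracts used from total memory for one CSV row.

```shell
# Hypothetical helper: free VRAM in MiB from one "used, total" row, as printed by
#   nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits
vram_free_mib() {
  # ${1##*, } keeps the text after ", " (total); ${1%%,*} keeps the text before "," (used)
  echo $(( ${1##*, } - ${1%%,*} ))
}

# Report free VRAM per GPU; skipped silently on machines without nvidia-smi.
if command -v nvidia-smi >/dev/null 2>&1; then
  nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits 2>/dev/null |
    while read -r row; do
      echo "free VRAM: $(vram_free_mib "$row") MiB"
    done
fi
```

The reading is a snapshot taken before the run, so it shows what other processes already occupy rather than what the calculation itself will need.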

Precision by GPU class¶

--precision selects the MLIP backend floating-point precision (fp32 or fp64, case-insensitive). The effective default is fp32. The right choice depends on the GPU class you are running on:

Hardware	Recommended	Reasoning
HPC datacenter GPU (H100 / H200 / A100)	`--precision fp64`	Near-deterministic, low numerical noise; native fp64 throughput makes the cost acceptable. Stabilises TS optimization and the Hessian.
Consumer GPU (RTX 50xx / 40xx)	`--precision fp32` (default)	fp64 is markedly slower on consumer cards. fp32 is the baseline for speed and screening.

# Datacenter H200: fp64 inference for TS optimization
mlmm tsopt -i ts.pdb --parm enzyme.parm7 -l 'LIG:Q' -b uma --precision fp64 -o result_ts

# Consumer RTX — fast screening with the default
mlmm scan -i r.pdb --parm enzyme.parm7 -l 'LIG:Q' -b uma --scan-lists '[(1,5,1.4)]' -o result_scan

--precision is accepted on every compute subcommand (sp, opt, tsopt, freq, irc, scan / scan2d / scan3d, path-opt, path-search, all) and is routed per backend (UMA precision, ORB precision, MACE default_dtype).

Note

For -b aimnet2, fp32 is a no-op and fp64 is rejected, because its model inputs are cast to float32 upstream. Use uma, orb, or mace when you need fp64. Note also that --precision fp64 reduces GPU reduction-order drift but does not make a run bit-identical; only --deterministic gives bit-exactness (see Reproducibility).
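On clusters with mixed hardware, the table above can be applied automatically in a job script. The following is a sketch under stated assumptions: `pick_precision` is a hypothetical helper, and it matches the GPU name reported by nvidia-smi against the datacenter models named in the table, falling back to fp32 for everything else.

```shell
# Hypothetical helper: map a GPU product name to a --precision value,
# following the recommendation table (H100 / H200 / A100 -> fp64, otherwise fp32).
pick_precision() {
  case "$1" in
    *H100*|*H200*|*A100*) echo fp64 ;;
    *) echo fp32 ;;
  esac
}

# Query the first visible GPU; skipped on machines without nvidia-smi.
if command -v nvidia-smi >/dev/null 2>&1; then
  gpu_name=$(nvidia-smi --query-gpu=name --format=csv,noheader 2>/dev/null | head -n 1)
  echo "GPU: ${gpu_name:-unknown} -> --precision $(pick_precision "$gpu_name")"
fi

# Intended use:
#   mlmm tsopt -i ts.pdb --parm enzyme.parm7 -l 'LIG:Q' -b uma \
#     --precision "$(pick_precision "$gpu_name")" -o result_ts
```

With -b aimnet2 the helper should not be used, since fp64 is rejected for that backend.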

HPC Job Submission¶

PBS example¶

#!/bin/bash
#PBS -N mlmm_opt
#PBS -q default
#PBS -l nodes=1:ppn=32:gpus=1,mem=120GB,walltime=72:00:00
# stdout/stderr default to <jobname>.o<jobid> and <jobname>.e<jobid>;
# shell variables are not expanded inside #PBS directives.

set -euo pipefail
hostname
cd "${PBS_O_WORKDIR}"

# Load environment modules
source /etc/profile.d/modules.sh  # cluster-dependent
module load cuda/<version>

# Activate conda environment
source ~/miniconda3/etc/profile.d/conda.sh
conda activate <your-env>

# Run optimization
mlmm opt \
  -i r_complex_layered.pdb \
  --parm p_complex.parm7 \
  -q -1 -m 1 \
  --opt-mode grad \
  --out-dir opt_result

Slurm example¶

#!/bin/bash
#SBATCH --job-name=mlmm_opt
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=32
#SBATCH --mem=120G
#SBATCH --time=72:00:00
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err

set -euo pipefail
hostname

module load cuda/<version>

source ~/miniconda3/etc/profile.d/conda.sh
conda activate <your-env>

mlmm opt \
  -i r_complex_layered.pdb \
  --parm p_complex.parm7 \
  -q -1 -m 1 \
  --opt-mode grad \
  --out-dir opt_result

Key points¶

Single GPU for ML: ML inference runs on one GPU. Request gpus=1 (PBS) or --gres=gpu:1 (Slurm); request a second GPU only if you place the OpenMM MM backend on a separate CUDA device (mm_device: cuda, mm_cuda_idx: 1).
CPU threads: Request enough CPUs for the MM backend (mm_threads, default 16). Set ppn=32 (PBS) or --cpus-per-task=32 (Slurm) for a safety margin.
Memory: 120 GB is typically sufficient for enzyme active-site models. Increase for very large systems.
CUDA module: Load CUDA before activating conda to ensure PyTorch finds the correct CUDA runtime.
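To catch a mismatch between mm_threads and the CPUs actually granted, a job script can compare the two before starting the calculation. This is a minimal sketch: `allocated_cpus` and `threads_fit` are hypothetical helpers, and the scheduler variables are assumptions about your site (SLURM_CPUS_PER_TASK is set by Slurm when --cpus-per-task is given; PBS_NUM_PPN is the Torque-style name and may differ on other PBS variants).

```shell
# Hypothetical helper: CPUs granted by the scheduler, else the node's online core count.
allocated_cpus() {
  if [ -n "${SLURM_CPUS_PER_TASK:-}" ]; then
    echo "$SLURM_CPUS_PER_TASK"
  elif [ -n "${PBS_NUM_PPN:-}" ]; then
    echo "$PBS_NUM_PPN"
  else
    getconf _NPROCESSORS_ONLN
  fi
}

# Hypothetical helper: succeed only if requested threads ($1) fit the allocation ($2).
threads_fit() {
  [ "$1" -le "$2" ]
}

MM_THREADS=16   # keep in sync with mm_threads in the YAML config
if ! threads_fit "$MM_THREADS" "$(allocated_cpus)"; then
  echo "warning: mm_threads=$MM_THREADS exceeds $(allocated_cpus) allocated CPUs" >&2
fi
```

The check only warns; whether to abort or lower mm_threads in that case is left to the script author.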

Specifying a GPU index¶

If you are allocated multiple GPUs or want to target a specific GPU on a multi-GPU node:

# Option A: Environment variable (affects all CUDA programs)
export CUDA_VISIBLE_DEVICES=0

# Option B: YAML configuration (mlmm-specific)
# In config.yaml:
# calc:
#   ml_cuda_idx: 0
mlmm opt -i input.pdb --parm real.parm7 -q -1 --config config.yaml
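One detail matters when the two options are combined: CUDA renumbers the devices listed in CUDA_VISIBLE_DEVICES starting from 0, so a device index then counts within the visible set rather than across the whole node. Assuming ml_cuda_idx is passed through as an ordinary CUDA device index (an assumption; check your installation), the hypothetical helper below shows which physical GPU a given index refers to.

```shell
# Hypothetical helper: translate a process-local CUDA index into the physical GPU ID.
#   $1 = index as seen by the process (e.g. the value of ml_cuda_idx)
#   $2 = value of CUDA_VISIBLE_DEVICES (comma-separated list)
physical_gpu_id() {
  echo "$2" | cut -d, -f"$(( $1 + 1 ))"
}

# With GPUs 2 and 3 visible, index 0 is physical GPU 2 and index 1 is physical GPU 3.
echo "index 0 -> physical GPU $(physical_gpu_id 0 "2,3")"
echo "index 1 -> physical GPU $(physical_gpu_id 1 "2,3")"
```

In practice this means that after `export CUDA_VISIBLE_DEVICES=2`, the configuration should keep ml_cuda_idx: 0, not 2.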

Limitations¶

No ML multi-GPU parallelism: ML inference runs on a single GPU. Only the OpenMM MM backend can be placed on a separate CUDA device (mm_device: cuda, mm_cuda_idx).
No distributed computing: All calculations run within a single process on a single node.
hessian_ff is CPU-only: the default MM backend runs on CPU, so mm_device must be cpu or auto; mm_device: cuda raises a ValueError rather than silently falling back.
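Because an invalid backend/device pair only fails once the job starts, it can be worth rejecting it before the job enters the queue. The sketch below mirrors the hessian_ff rule in a hypothetical shell helper, `check_mm_combo`; it is not part of mlmm and takes the two values as arguments rather than parsing the YAML file.

```shell
# Hypothetical pre-submit check: reject mm_backend=hessian_ff with mm_device=cuda.
#   $1 = mm_backend, $2 = mm_device
check_mm_combo() {
  if [ "$1" = "hessian_ff" ] && [ "$2" = "cuda" ]; then
    echo "error: mm_backend=hessian_ff requires mm_device=cpu (or auto)" >&2
    return 1
  fi
  return 0
}

# Both combinations used in the configuration examples pass the check.
check_mm_combo hessian_ff cpu && echo "hessian_ff + cpu: ok"
check_mm_combo openmm cuda && echo "openmm + cuda: ok"
```

In a submission wrapper, a call such as `check_mm_combo "$backend" "$device" || exit 1` placed before qsub or sbatch would stop the bad configuration early.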