Device Configuration & HPC Setup
How to configure GPU/CPU devices for the ML/MM calculator and submit jobs on HPC clusters.
Device Parameters
The ML/MM calculator (mlmm_calc.mlmm) uses separate device settings for the ML and MM backends:
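| Parameter | Default | Description |
|---|---|---|
| ml_device | | Device for MLIP inference. |
| ml_cuda_idx | | CUDA device index when ml_device is cuda. |
| mm_backend | hessian_ff | MM force field engine. |
| mm_device | | Device for MM backend. |
| mm_cuda_idx | | CUDA device index when mm_device is cuda. |
| mm_threads | 16 | Number of CPU threads for MM backend. |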
YAML configuration example
calc:
  ml_device: cuda
  ml_cuda_idx: 0
  mm_backend: hessian_ff
  mm_device: cpu
  mm_threads: 16
Using the OpenMM backend with CUDA
calc:
  ml_device: cuda
  ml_cuda_idx: 0
  mm_backend: openmm
  mm_device: cuda
  mm_cuda_idx: 0
Note: When both ML and MM run on the same CUDA device, they share its memory. For large systems, consider setting mm_device: cpu to reduce VRAM consumption.
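In both examples every device key is nested under the top-level calc key. As a minimal sketch, the helper below renders such a block from a plain dictionary using only the standard library. The function name render_calc_yaml is hypothetical and not part of mlmm; the keys are the ones used in the YAML examples above.

```python
def render_calc_yaml(settings: dict) -> str:
    """Render device settings as a YAML `calc:` block (hypothetical helper)."""
    lines = ["calc:"]
    for key, value in settings.items():
        # Each setting is nested two spaces under the top-level `calc` key.
        lines.append(f"  {key}: {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # OpenMM for MM, but kept on the CPU to leave VRAM for the MLIP.
    print(render_calc_yaml({
        "ml_device": "cuda",
        "ml_cuda_idx": 0,
        "mm_backend": "openmm",
        "mm_device": "cpu",
    }), end="")
```

The helper only handles flat scalar values, which is all the device settings need; a full YAML library would be the better choice for anything nested.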
VRAM Management
Hessian device (--hess-device)
The freq command supports --hess-device to control where Hessian assembly and diagonalization run:
# Default: same device as ml_device (typically CUDA)
mlmm freq -i input.pdb --parm real.parm7 -q -1
# Force CPU for Hessian assembly (saves VRAM for large systems)
mlmm freq -i input.pdb --parm real.parm7 -q -1 --hess-device cpu
Use --hess-device cpu when:
The active region is large (> ~500 unfrozen atoms)
You encounter CUDA out-of-memory errors during frequency calculations
VRAM is limited (< 16 GB)
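The three criteria above can be sketched as a small decision helper. This is an illustration only: choose_hess_device and freq_command are hypothetical names, the thresholds are the rule-of-thumb values quoted in the list, and the command uses only flags that appear in this section.

```python
from typing import List, Optional

# Rule-of-thumb thresholds quoted in the list above.
MAX_UNFROZEN_ATOMS_ON_GPU = 500
MIN_VRAM_GB_FOR_GPU = 16.0


def choose_hess_device(n_unfrozen_atoms: int, vram_gb: float,
                       had_cuda_oom: bool = False) -> Optional[str]:
    """Return "cpu" when any criterion applies, else None (keep the default)."""
    if (n_unfrozen_atoms > MAX_UNFROZEN_ATOMS_ON_GPU
            or vram_gb < MIN_VRAM_GB_FOR_GPU
            or had_cuda_oom):
        return "cpu"
    return None


def freq_command(pdb: str, parm: str, charge: int,
                 hess_device: Optional[str]) -> List[str]:
    """Build the freq argument list using only flags shown in this section."""
    cmd = ["mlmm", "freq", "-i", pdb, "--parm", parm, "-q", str(charge)]
    if hess_device is not None:
        # Omit the flag entirely to keep the default (same device as ml_device).
        cmd += ["--hess-device", hess_device]
    return cmd
```

For example, freq_command("input.pdb", "real.parm7", -1, choose_hess_device(800, 24.0)) yields the second command shown above, which can then be passed to subprocess.run.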
General VRAM tips
Reduce the ML region size: Use mlmm extract with a smaller --radius, or mlmm define-layer with a tighter --radius-freeze.
Use hessian_ff (default): The hessian_ff backend is CPU-only, leaving all VRAM for MLIP inference.
Avoid OpenMM CUDA for large systems: If both ML and MM run on the same CUDA device, they compete for the same VRAM.
Monitor VRAM: print_vram defaults to True (VRAM usage is printed during Hessian computation); set print_vram: False in YAML to suppress it.
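Outside of mlmm's own print_vram output, per-GPU memory can also be checked with nvidia-smi. The sketch below is independent of mlmm and assumes nvidia-smi is on the PATH and reports plain integer MiB values; the function names are hypothetical.

```python
import shutil
import subprocess
from typing import List, Tuple

# Standard nvidia-smi query: one CSV line per GPU, values in MiB, no header or units.
QUERY = ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_memory_csv(text: str) -> List[Tuple[int, int]]:
    """Parse lines like "1234, 81559" into (used_mib, total_mib) per GPU."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def gpu_memory() -> List[Tuple[int, int]]:
    """Query nvidia-smi; return an empty list when the tool is unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_memory_csv(out.stdout)
```

Calling gpu_memory() before and after a run gives a rough picture of how much headroom is left on each device.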
Precision by GPU class
--precision selects the MLIP backend floating-point precision (fp32 or fp64, case-insensitive). The effective default is fp32. The right choice depends on the GPU class you are running on:
| Hardware | Recommended | Reasoning |
|---|---|---|
| HPC datacenter GPU (H100 / H200 / A100) | fp64 | Near-deterministic, low numerical noise; native fp64 throughput makes the cost acceptable. Stabilises TS optimization and the Hessian. |
| Consumer GPU (RTX 50xx / 40xx) | fp32 | fp64 is markedly slower on consumer cards. fp32 is the baseline for speed and screening. |
# Datacenter H200 — full-precision base inference
mlmm tsopt -i ts.pdb --parm enzyme.parm7 -l 'LIG:Q' -b uma --precision fp64 -o result_ts
# Consumer RTX — fast screening with the default
mlmm scan -i r.pdb --parm enzyme.parm7 -l 'LIG:Q' -b uma --scan-lists '[(1,5,1.4)]' -o result_scan
--precision is accepted on every compute subcommand (sp, opt, tsopt, freq, irc, scan / scan2d / scan3d, path-opt, path-search, all) and is routed per backend (UMA precision, ORB precision, MACE default_dtype).
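The recommendation above can be condensed into a small lookup. This is a sketch, not mlmm code: pick_precision is a hypothetical helper, the GPU tags are the datacenter models named in this section, and any backend outside uma, orb, and mace (for example aimnet2, which rejects fp64) is kept on fp32.

```python
DATACENTER_GPUS = ("H100", "H200", "A100")  # datacenter models named in this section
FP64_BACKENDS = ("uma", "orb", "mace")      # backends this section lists for fp64


def pick_precision(backend: str, gpu_name: str) -> str:
    """Suggest a --precision value from the backend and the GPU model name."""
    if backend.lower() not in FP64_BACKENDS:
        # Covers aimnet2, where fp64 is rejected, and any unlisted backend.
        return "fp32"
    if any(tag in gpu_name.upper() for tag in DATACENTER_GPUS):
        return "fp64"
    return "fp32"
```

The GPU name can come from any source, for instance the model string reported by nvidia-smi; the suggestion is only a starting point and can always be overridden on the command line.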
Note
For -b aimnet2, fp32 is a no-op and fp64 is rejected, because its model inputs are cast to float32 upstream. Use uma, orb, or mace when you need fp64. --precision fp64 reduces GPU reduction-order drift but does not make a run bit-identical; only --deterministic gives bit-exactness (see Reproducibility).
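Reduction-order drift arises because floating-point addition is not associative: summing the same numbers in a different order can round differently, and parallel GPU reductions do not guarantee a fixed order. The following stdlib-only illustration (not mlmm code) emulates fp32 by rounding after every addition and compares two summation orders.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_f32(values):
    """Accumulate left to right, rounding to binary32 after every addition."""
    acc = 0.0
    for v in values:
        acc = to_f32(acc + to_f32(v))
    return acc


def sum_f64(values):
    """Accumulate left to right in native binary64."""
    acc = 0.0
    for v in values:
        acc += v
    return acc


values = [1e8, 1.0, -1e8, 1.0]
forward32, backward32 = sum_f32(values), sum_f32(values[::-1])
forward64, backward64 = sum_f64(values), sum_f64(values[::-1])
print("fp32:", forward32, backward32)
print("fp64:", forward64, backward64)
```

Here the two fp32 sums differ (1.0 versus 0.0), while both fp64 sums equal 2.0. With larger or less convenient inputs fp64 also shows order dependence, only at a much smaller magnitude, which is why higher precision reduces drift without guaranteeing bit-identical results.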
HPC Job Submission
PBS example
#!/bin/bash
#PBS -N mlmm_opt
#PBS -q default
#PBS -l nodes=1:ppn=32:gpus=1,mem=120GB,walltime=72:00:00
#PBS -o ${PBS_JOBNAME}.o${PBS_JOBID}
#PBS -e ${PBS_JOBNAME}.e${PBS_JOBID}
set -euo pipefail
hostname
cd "${PBS_O_WORKDIR}"
# Load environment modules
source /etc/profile.d/modules.sh # cluster-dependent
module load cuda/<version>
# Activate conda environment
source ~/miniconda3/etc/profile.d/conda.sh
conda activate <your-env>
# Run optimization
mlmm opt \
    -i r_complex_layered.pdb \
    --parm p_complex.parm7 \
    -q -1 -m 1 \
    --opt-mode grad \
    --out-dir opt_result
Slurm example
#!/bin/bash
#SBATCH --job-name=mlmm_opt
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=32
#SBATCH --mem=120G
#SBATCH --time=72:00:00
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
set -euo pipefail
hostname
module load cuda/<version>
source ~/miniconda3/etc/profile.d/conda.sh
conda activate <your-env>
mlmm opt \
    -i r_complex_layered.pdb \
    --parm p_complex.parm7 \
    -q -1 -m 1 \
    --opt-mode grad \
    --out-dir opt_result
Key points
Single GPU for ML: ML inference runs on one GPU. Request gpus=1 (PBS) or --gres=gpu:1 (Slurm); request a second GPU only if you place the OpenMM MM backend on a separate CUDA device (mm_device: cuda, mm_cuda_idx: 1).
CPU threads: Request enough CPUs for the MM backend (mm_threads, default 16). Set ppn=32 (PBS) or --cpus-per-task=32 (Slurm) for a safety margin.
Memory: 120 GB is typically sufficient for enzyme active-site models. Increase for very large systems.
CUDA module: Load CUDA before activating conda to ensure PyTorch finds the correct CUDA runtime.
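When many jobs are submitted, the resource requests above can be generated rather than edited by hand. The sketch below renders the Slurm directives from the example in this section. The function slurm_header is hypothetical; the partition name gpu is cluster-specific; and requesting twice mm_threads CPUs is an assumption that mirrors the example's 32 CPUs for 16 threads.

```python
def slurm_header(job_name: str, mm_threads: int = 16, gpus: int = 1,
                 mem_gb: int = 120, hours: int = 72) -> str:
    """Render #SBATCH lines matching the Slurm example in this section."""
    # Assumption: twice the MM thread count as a safety margin (16 threads, 32 CPUs).
    cpus = 2 * mm_threads
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        "#SBATCH --partition=gpu",  # cluster-dependent partition name
        f"#SBATCH --gres=gpu:{gpus}",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={mem_gb}G",
        f"#SBATCH --time={hours}:00:00",
        "#SBATCH --output=%x_%j.out",
        "#SBATCH --error=%x_%j.err",
    ]
    return "\n".join(lines) + "\n"
```

The returned text is only the header; the module loading, conda activation, and mlmm command from the example still need to be appended before the script is passed to sbatch.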
Specifying a GPU index
If you are allocated multiple GPUs or want to target a specific GPU on a multi-GPU node:
# Option A: Environment variable (affects all CUDA programs)
export CUDA_VISIBLE_DEVICES=0
# Option B: YAML configuration (mlmm-specific)
# In config.yaml:
#   calc:
#     ml_cuda_idx: 0
mlmm opt -i input.pdb --parm real.parm7 -q -1 --config config.yaml
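The two options interact: CUDA_VISIBLE_DEVICES restricts and renumbers the devices a process can see, so an in-process index such as ml_cuda_idx counts within the visible list rather than across the whole node. The sketch below shows that mapping. It assumes mlmm passes the index to the CUDA runtime as an ordinary device index, as PyTorch does; physical_gpu is a hypothetical helper.

```python
import os
from typing import Mapping, Optional


def physical_gpu(cuda_idx: int, env: Optional[Mapping[str, str]] = None) -> str:
    """Map an in-process CUDA index to the device ID it refers to on the node."""
    env = os.environ if env is None else env
    visible = env.get("CUDA_VISIBLE_DEVICES")
    if visible is None:
        # Variable unset: all devices are visible and the index is used unchanged.
        return str(cuda_idx)
    devices = [d.strip() for d in visible.split(",") if d.strip()]
    if cuda_idx >= len(devices):
        raise IndexError(
            f"cuda index {cuda_idx} not visible; only {len(devices)} device(s) exposed"
        )
    return devices[cuda_idx]
```

With CUDA_VISIBLE_DEVICES=2,3, for instance, an index of 0 refers to device 2 on the node and an index of 1 to device 3. Many schedulers set this variable for you, in which case index 0 is usually the right choice.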
Limitations
No ML multi-GPU parallelism: ML inference runs on a single GPU. The OpenMM MM backend may use a separate CUDA device (mm_device: cuda, mm_cuda_idx); the default hessian_ff MM backend is CPU-only.
No distributed computing: All calculations run within a single process on a single node.
hessian_ff is CPU-only: The default MM backend runs on CPU; mm_device must be cpu or auto, and mm_device: cuda raises a ValueError rather than silently falling back.
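The hessian_ff restriction can be checked before a job is submitted, which avoids losing queue time to a configuration error. The following pre-flight check is a sketch of the rule as stated above; check_mm_device and its error message are hypothetical and do not reproduce mlmm's own validation code.

```python
def check_mm_device(mm_backend: str, mm_device: str) -> None:
    """Raise ValueError for the backend/device pair this section rules out."""
    if mm_backend == "hessian_ff" and mm_device not in ("cpu", "auto"):
        # Illustrative message only; mlmm's actual wording may differ.
        raise ValueError(
            f"mm_backend=hessian_ff is CPU-only; got mm_device={mm_device!r}"
        )


# Valid combinations pass silently.
check_mm_device("hessian_ff", "cpu")
check_mm_device("openmm", "cuda")
```

Running the check on the values read from a config file catches the invalid hessian_ff plus cuda combination on the login node instead of inside the job.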
See Also
Getting Started: Installation and CUDA setup
ML/MM Calculator: Calculator architecture and parameters
YAML Reference: Full configuration reference
freq: --hess-device option details
Troubleshooting: Common error fixes