Device Configuration & HPC Setup

How to configure GPU/CPU devices for the ML/MM calculator and submit jobs on HPC clusters.

Device Parameters

The ML/MM calculator (mlmm_calc.mlmm) uses separate device settings for the ML and MM backends:

| Parameter | Default | Description |
|---|---|---|
| ml_device | auto | Device for MLIP inference. auto selects CUDA if available, otherwise CPU. |
| ml_cuda_idx | 0 | CUDA device index when ml_device=cuda. |
| mm_backend | hessian_ff | MM force field engine. hessian_ff (analytical, CPU-only) or openmm (supports CUDA). |
| mm_device | cpu | Device for MM backend. cpu for hessian_ff (required). cuda available for openmm. |
| mm_cuda_idx | 0 | CUDA device index when mm_device=cuda (openmm only). |
| mm_threads | 16 | Number of CPU threads for MM backend. |
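The auto rule for ml_device can be pictured with a small sketch. The helper name resolve_device is hypothetical and is not part of mlmm; it only mirrors the behaviour described for auto and ml_cuda_idx. In a real session the cuda_available flag would typically come from torch.cuda.is_available().

```python
def resolve_device(requested: str, cuda_available: bool, cuda_idx: int = 0) -> str:
    """Hypothetical helper: turn a device setting into a concrete device string."""
    requested = requested.lower()
    if requested == "auto":
        # "auto" prefers CUDA and falls back to CPU, as described for ml_device.
        requested = "cuda" if cuda_available else "cpu"
    if requested == "cuda":
        if not cuda_available:
            raise RuntimeError("CUDA was requested but is not available")
        # The index plays the role of ml_cuda_idx / mm_cuda_idx.
        return f"cuda:{cuda_idx}"
    if requested == "cpu":
        return "cpu"
    raise ValueError(f"unknown device setting: {requested!r}")


print(resolve_device("auto", cuda_available=True))               # cuda:0
print(resolve_device("auto", cuda_available=False))              # cpu
print(resolve_device("cuda", cuda_available=True, cuda_idx=1))   # cuda:1
```

Raising on an explicit cuda request without a GPU, instead of falling back quietly, is a design choice of this sketch; it keeps a misconfigured job from running on CPU unnoticed.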

YAML configuration example

calc:
  ml_device: cuda
  ml_cuda_idx: 0
  mm_backend: hessian_ff
  mm_device: cpu
  mm_threads: 16

Using the OpenMM backend with CUDA

calc:
  ml_device: cuda
  ml_cuda_idx: 0
  mm_backend: openmm
  mm_device: cuda
  mm_cuda_idx: 0

Note: When both ML and MM use CUDA, they share GPU memory. For large systems, consider using mm_device: cpu to reduce VRAM consumption.


VRAM Management

Hessian device (--hess-device)

The freq command supports --hess-device to control where Hessian assembly and diagonalization run:

# Default: same device as ml_device (typically CUDA)
mlmm freq -i input.pdb --parm real.parm7 -q -1

# Force CPU for Hessian assembly (saves VRAM for large systems)
mlmm freq -i input.pdb --parm real.parm7 -q -1 --hess-device cpu

Use --hess-device cpu when:

  • The active region is large (> ~500 unfrozen atoms)

  • You encounter CUDA out-of-memory errors during frequency calculations

  • VRAM is limited (< 16 GB)
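The three criteria above can be folded into a small decision helper when generating freq commands for many systems. This is only a sketch: suggest_hess_device is a hypothetical name, and the thresholds (500 unfrozen atoms, 16 GB) are the approximate guidelines from the list, not hard limits enforced by mlmm.

```python
def suggest_hess_device(n_unfrozen_atoms: int, vram_gb: float,
                        saw_cuda_oom: bool = False) -> str:
    """Return "cpu" when any guideline from the list applies, else "default"."""
    if saw_cuda_oom:             # a previous freq run hit CUDA out-of-memory
        return "cpu"
    if n_unfrozen_atoms > 500:   # large active region
        return "cpu"
    if vram_gb < 16:             # limited VRAM
        return "cpu"
    return "default"             # keep the Hessian on the ml_device


def freq_command(base_args, hess_device):
    """Append --hess-device cpu only when the helper recommends it."""
    cmd = ["mlmm", "freq", *base_args]
    if hess_device == "cpu":
        cmd += ["--hess-device", "cpu"]
    return cmd


args = ["-i", "input.pdb", "--parm", "real.parm7", "-q", "-1"]
print(" ".join(freq_command(args, suggest_hess_device(800, 80.0))))
print(" ".join(freq_command(args, suggest_hess_device(200, 24.0))))
```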

General VRAM tips

  1. Reduce the ML region size: Use mlmm extract with a smaller --radius or mlmm define-layer with a tighter --radius-freeze.

  2. Use hessian_ff (default): The hessian_ff backend is CPU-only, leaving all VRAM for MLIP inference.

  3. Avoid OpenMM CUDA for large systems: If both ML and MM use CUDA on the same GPU, they compete for the same VRAM.

  4. Monitor VRAM: print_vram defaults to True (VRAM usage is printed during Hessian computation); set print_vram: False in YAML to suppress it.
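For monitoring outside of mlmm's own print_vram output, VRAM can also be read from nvidia-smi. The sketch below is independent of mlmm: parse_vram and query_vram are hypothetical helper names, and the demonstration parses a made-up sample string so it runs on machines without a GPU.

```python
import subprocess


def parse_vram(csv_text: str):
    """Parse 'used, total' lines (MiB) into a list of (used, total) integers."""
    rows = []
    for line in csv_text.strip().splitlines():
        used, total = (int(field.strip()) for field in line.split(","))
        rows.append((used, total))
    return rows


def query_vram():
    """Ask nvidia-smi for per-GPU memory usage; requires an NVIDIA driver."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_vram(result.stdout)


# Illustrative sample text (not real measurements), one line per GPU.
sample = "1024, 81559\n512, 81559\n"
for idx, (used, total) in enumerate(parse_vram(sample)):
    print(f"GPU {idx}: {used} / {total} MiB ({100 * used / total:.1f}% used)")
```

On a GPU node, calling query_vram() before and after a run gives a rough picture of how much headroom remains.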


Precision by GPU class

--precision selects the MLIP backend floating-point precision (fp32 or fp64, case-insensitive). The effective default is fp32. The right choice depends on the GPU class you are running on:

| Hardware | Recommended | Reasoning |
|---|---|---|
| HPC datacenter GPU (H100 / H200 / A100) | --precision fp64 | Near-deterministic, low numerical noise; native fp64 throughput makes the cost acceptable. Stabilises TS optimization and the Hessian. |
| Consumer GPU (RTX 50xx / 40xx) | --precision fp32 (default) | fp64 is markedly slower on consumer cards. fp32 is the baseline for speed and screening. |

# Datacenter H200 — full-precision base inference
mlmm tsopt -i ts.pdb --parm enzyme.parm7 -l 'LIG:Q' -b uma --precision fp64 -o result_ts

# Consumer RTX — fast screening with the default
mlmm scan -i r.pdb --parm enzyme.parm7 -l 'LIG:Q' -b uma --scan-lists '[(1,5,1.4)]' -o result_scan

--precision is accepted on every compute subcommand (sp, opt, tsopt, freq, irc, scan / scan2d / scan3d, path-opt, path-search, all) and is routed per backend (UMA precision, ORB precision, MACE default_dtype).

Note

For -b aimnet2, fp32 is a no-op and fp64 is rejected — its model inputs are cast to float32 upstream. Use uma, orb, or mace when you need fp64. --precision fp64 reduces GPU reduction-order drift but does not make a run bit-identical; only --deterministic gives bit-exactness — see Reproducibility.
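Reduction-order drift comes from floating-point addition not being associative: summing the same numbers in a different order can change the result. The sketch below is a pure-Python illustration that does not use mlmm or a GPU; fp32 is emulated by rounding through struct, and the input values are chosen only to make the effect visible.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_fp32(values):
    """Left-to-right sum, rounding to fp32 after every addition."""
    total = 0.0
    for v in values:
        total = to_fp32(total + to_fp32(v))
    return total


def sum_fp64(values):
    """Left-to-right sum in native fp64."""
    total = 0.0
    for v in values:
        total += v
    return total


order_a = [1.0e8, 1.0, -1.0e8, 1.0]
order_b = [1.0e8, -1.0e8, 1.0, 1.0]   # same numbers, different order

print(sum_fp32(order_a), sum_fp32(order_b))   # 1.0 2.0
print(sum_fp64(order_a), sum_fp64(order_b))   # 2.0 2.0

# fp64 shrinks the effect but does not remove it:
print((0.1 + 0.2) + 0.3)   # 0.6000000000000001
print(0.1 + (0.2 + 0.3))   # 0.6
```

GPU reductions do not fix the summation order between runs, which is why higher precision narrows the spread without making results bit-identical.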


HPC Job Submission

PBS example

#!/bin/bash
#PBS -N mlmm_opt
#PBS -q default
#PBS -l nodes=1:ppn=32:gpus=1,mem=120GB,walltime=72:00:00
#PBS -o ${PBS_JOBNAME}.o${PBS_JOBID}
#PBS -e ${PBS_JOBNAME}.e${PBS_JOBID}

set -euo pipefail
hostname
cd "${PBS_O_WORKDIR}"

# Load environment modules
source /etc/profile.d/modules.sh  # cluster-dependent
module load cuda/<version>

# Activate conda environment
source ~/miniconda3/etc/profile.d/conda.sh
conda activate <your-env>

# Run optimization
mlmm opt \
  -i r_complex_layered.pdb \
  --parm p_complex.parm7 \
  -q -1 -m 1 \
  --opt-mode grad \
  --out-dir opt_result

Slurm example

#!/bin/bash
#SBATCH --job-name=mlmm_opt
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=32
#SBATCH --mem=120G
#SBATCH --time=72:00:00
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err

set -euo pipefail
hostname

module load cuda/<version>

source ~/miniconda3/etc/profile.d/conda.sh
conda activate <your-env>

mlmm opt \
  -i r_complex_layered.pdb \
  --parm p_complex.parm7 \
  -q -1 -m 1 \
  --opt-mode grad \
  --out-dir opt_result

Key points

  • Single GPU for ML: ML inference runs on one GPU. Request gpus=1 (PBS) or --gres=gpu:1 (Slurm); request a second GPU only if you place the OpenMM MM backend on a separate CUDA device (mm_device: cuda, mm_cuda_idx: 1).

  • CPU threads: Request enough CPUs for the MM backend (mm_threads, default 16). Set ppn=32 (PBS) or --cpus-per-task=32 (Slurm) for a safety margin.

  • Memory: 120 GB is typically sufficient for enzyme active-site models. Increase for very large systems.

  • CUDA module: Load CUDA before activating conda to ensure PyTorch finds the correct CUDA runtime.
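To keep mm_threads within the CPUs the scheduler actually granted, the allocation can be read from the job environment. This is a sketch under assumptions: cap_mm_threads is a hypothetical helper, and the variable names checked (SLURM_CPUS_PER_TASK for Slurm, PBS_NP and NCPUS for PBS variants) differ between sites, so verify which ones your cluster sets.

```python
import os


def allocated_cpus(env):
    """Return the CPU count from the first scheduler variable that is set."""
    for name in ("SLURM_CPUS_PER_TASK", "PBS_NP", "NCPUS"):
        value = env.get(name, "")
        if value.isdigit() and int(value) > 0:
            return int(value)
    return None   # not inside a recognised batch job


def cap_mm_threads(requested, env):
    """Never use more MM threads than the scheduler allocated."""
    allocated = allocated_cpus(env)
    if allocated is None:
        return requested
    return min(requested, allocated)


print(cap_mm_threads(16, {"SLURM_CPUS_PER_TASK": "32"}))   # 16
print(cap_mm_threads(16, {"PBS_NP": "8"}))                 # 8
print(cap_mm_threads(16, dict(os.environ)))                # depends on the host
```

The returned value could then be written to mm_threads in the YAML configuration before launching the job.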

Specifying a GPU index

If you are allocated multiple GPUs or want to target a specific GPU on a multi-GPU node:

# Option A: Environment variable (affects all CUDA programs)
export CUDA_VISIBLE_DEVICES=0

# Option B: YAML configuration (mlmm-specific)
# In config.yaml:
# calc:
#   ml_cuda_idx: 0
mlmm opt -i input.pdb --parm real.parm7 -q -1 --config config.yaml
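The two options interact: once CUDA_VISIBLE_DEVICES is set, CUDA renumbers the visible devices from 0, so an index such as ml_cuda_idx refers to a position in that list rather than to the physical GPU number. The sketch below illustrates the mapping; visible_gpu is a hypothetical helper and not part of mlmm.

```python
def visible_gpu(logical_idx, cuda_visible_devices):
    """Map a logical CUDA index to the entry it selects in CUDA_VISIBLE_DEVICES.

    cuda_visible_devices is the raw variable value, or None when it is unset.
    """
    if cuda_visible_devices is None:
        # No masking: logical and physical numbering coincide.
        return str(logical_idx)
    visible = [d.strip() for d in cuda_visible_devices.split(",") if d.strip()]
    if not 0 <= logical_idx < len(visible):
        raise IndexError(
            f"index {logical_idx} is out of range for {len(visible)} visible GPU(s)"
        )
    return visible[logical_idx]


print(visible_gpu(0, None))     # 0
print(visible_gpu(0, "2,3"))    # 2
print(visible_gpu(1, "2,3"))    # 3
```

With CUDA_VISIBLE_DEVICES=2,3 exported, ml_cuda_idx: 0 therefore targets physical GPU 2, and any index of 2 or higher is invalid for that job.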

Limitations

  • No ML multi-GPU parallelism: ML inference runs on a single GPU. The OpenMM MM backend may use a separate CUDA device (mm_device: cuda, mm_cuda_idx); the default hessian_ff MM backend is CPU-only.

  • No distributed computing: All calculations run within a single process on a single node.

  • hessian_ff is CPU-only: the default MM backend runs on CPU, so mm_device must be cpu or auto; mm_device: cuda raises a ValueError rather than silently falling back.
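The backend/device constraint can be checked before a job is submitted, so that a misconfiguration fails at the login node rather than after queueing. The function below is a hypothetical pre-flight check, not mlmm's own validation code; it covers only the combinations named on this page.

```python
def check_mm_settings(mm_backend: str, mm_device: str) -> str:
    """Return the device the MM backend would use, or raise ValueError."""
    if mm_backend == "hessian_ff":
        # hessian_ff is CPU-only; a CUDA request is an error, not a fallback.
        if mm_device not in ("cpu", "auto"):
            raise ValueError(
                f"mm_backend=hessian_ff is CPU-only; got mm_device={mm_device!r}"
            )
        return "cpu"
    if mm_backend == "openmm":
        # Only cpu and cuda are described for openmm on this page, so other
        # values are rejected in this sketch.
        if mm_device not in ("cpu", "cuda"):
            raise ValueError(f"unsupported mm_device for openmm: {mm_device!r}")
        return mm_device
    raise ValueError(f"unknown mm_backend: {mm_backend!r}")


print(check_mm_settings("hessian_ff", "cpu"))   # cpu
print(check_mm_settings("openmm", "cuda"))      # cuda
try:
    check_mm_settings("hessian_ff", "cuda")
except ValueError as err:
    print(f"rejected: {err}")
```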


See Also