`extract`¶

Overview¶

Summary: Extract a cluster model (an active site model, i.e., a binding pocket) from a protein–ligand PDB. Specify substrates with -c by residue name, residue ID, or a PDB path. Link hydrogens are added to cap cut bonds. Use --ligand-charge/-l to set charges for non-standard residues.

At a glance¶

Use when: Carving an active site model (cluster model) from a protein–ligand PDB (single structure or ensemble) for downstream MEP/TSOPT/freq/DFT runs.
Method: Distance-based residue selection (--radius, optional --radius-het2het) with peptide/disulfide/PRO safeguards, backbone truncation, and optional link-hydrogen capping at severed bonds.
Outputs: Active site model PDB(s) with link hydrogens after a TER record (model.pdb, model_<input>.pdb, or a single multi-MODEL PDB depending on -o); optional result.json with --out-json.
Defaults: --radius 2.6 Å, --radius-het2het 0.0 (off), --include-h2o True, --exclude-backbone False, --add-linkh True, --freeze-links True, --verbose True.
Next step: Feed the model PDB(s) into path-search / scan / opt / tsopt, or use all with -c/--center to chain extraction with the rest of the pipeline.

pdb2reaction extract creates an active site model (cluster model) from a protein–ligand PDB. It selects residues near the substrate, truncates the model according to backbone/side-chain rules, optionally caps severed bonds with link hydrogens, and can process single structures or ensembles.

If residues are misclassified because of unusual residue/atom naming, see the appendix below on naming requirements and the internal reference lists.

Important

Link hydrogens and frozen atoms. Severed covalent bonds are capped with link hydrogens (residue LKH, atom HL). By default (--freeze-links, on in every downstream subcommand), the parent atoms of these link hydrogens are frozen during optimization, MEP search, IRC, and vibrational analysis to prevent unphysical relaxation at the model boundary. See the full semantics in the Link hydrogen and frozen atoms section below. Subcommands such as opt, tsopt, freq, irc, path-search, path-opt, and scan all cross-reference this section.

Usage¶

pdb2reaction extract -i COMPLEX.pdb [COMPLEX2.pdb...]
 -c SUBSTRATE_SPEC
 [-o POCKET.pdb [POCKET2.pdb...]]
 [--radius Å] [--radius-het2het Å]
 [--include-h2o/--no-include-h2o]
 [--exclude-backbone/--no-exclude-backbone]
 [--add-linkh/--no-add-linkh]
 [--selected-resn LIST]
 [--modified-residue LIST]
 [-l, --ligand-charge MAP_OR_NUMBER]
 [--verbose/--no-verbose]
 [--out-json/--no-out-json]

Examples¶

# Minimal (ID-based substrate) with explicit total ligand charge
pdb2reaction extract -i complex.pdb -c '123' -o model.pdb -l -3

# Substrate provided as a PDB; per-resname charge mapping (others remain 0)
pdb2reaction extract -i complex.pdb -c substrate.pdb -o model.pdb -l 'GPP:-3,SAM:1'

# Name-based substrate selection including all matches (WARNING is logged)
pdb2reaction extract -i complex.pdb -c 'GPP,SAM' -o model.pdb -l 'GPP:-3,SAM:1'

# Multi-structure to single multi-MODEL output with hetero-hetero proximity enabled
pdb2reaction extract -i complex1.pdb complex2.pdb -c 'GPP,SAM' \
 -o model_multi.pdb --radius-het2het 2.6 -l 'GPP:-3,SAM:1'

# Multi-structure to multiple outputs with hetero-hetero proximity enabled
pdb2reaction extract -i complex1.pdb complex2.pdb -c 'GPP,SAM' \
 -o model1.pdb model2.pdb --radius-het2het 2.6 -l 'GPP:-3,SAM:1'

Workflow¶

Residue inclusion¶

Always include the substrate residues from -c/--center.
Standard cutoff (--radius, default 2.6 Å):
With --no-exclude-backbone (the default), any atom within the cutoff qualifies a residue.
With --exclude-backbone, amino-acid residues must contact the substrate with a non-backbone atom (anything other than N/H*/CA/HA*/C/O). Non-amino-acid residues qualify through any atom.
Independent hetero–hetero cutoff (--radius-het2het): adds residues when a substrate hetero atom (non C/H) lies within the specified distance (Å) of a protein hetero atom. With backbone exclusion enabled, the protein atom must be non-backbone.
Water handling: HOH/WAT/H2O/DOD/TIP/TIP3/SOL are included by default (--include-h2o).
Forced inclusion: --selected-resn accepts residue IDs (e.g., A:123A), not residue names. See "--selected-resn takes residue IDs, not names" in CLI Conventions for details.
Neighbor safeguards:
When backbone exclusion is off and a residue contacts the substrate with a backbone atom, auto-include the peptide-adjacent N/C neighbors (C–N ≤ 1.9 Å). Termini keep caps (N/H* or C/O/OXT).
Disulfide bonds (SG–SG ≤ 2.5 Å) bring both cysteines.
Non-terminal PRO residues always pull in the N-side amino acid; CA is preserved even if backbone atoms are removed, and with --exclude-backbone the neighbor's C/O/OXT are kept to maintain the peptide bond.
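The core distance test behind these rules can be sketched in a few lines. This is a minimal illustration, not the tool's implementation: the function name `select_residues`, the input layout, and the inclusive `<=` comparison are assumptions, and the neighbor, disulfide, and PRO safeguards are left out.

```python
import math

# Backbone atom names as listed under BACKBONE_ATOMS in the appendix.
BACKBONE_ATOMS = {"N", "C", "O", "CA", "OXT", "H", "H1", "H2", "H3",
                  "HN", "HA", "HA2", "HA3"}


def select_residues(substrate_xyz, residues, radius=2.6, exclude_backbone=False):
    """Return IDs of residues with a qualifying atom within `radius` Å.

    Hypothetical input layout (not the tool's data model):
    residues = {res_id: {"is_amino_acid": bool,
                         "atoms": [(atom_name, (x, y, z)), ...]}}
    """
    selected = []
    for res_id, res in residues.items():
        for name, xyz in res["atoms"]:
            # With backbone exclusion, amino-acid backbone atoms cannot qualify;
            # non-amino-acid residues still qualify through any atom.
            if exclude_backbone and res["is_amino_acid"] and name in BACKBONE_ATOMS:
                continue
            # Assumption: the cutoff is inclusive.
            if any(math.dist(xyz, s) <= radius for s in substrate_xyz):
                selected.append(res_id)
                break
    return selected
```

With a residue whose only close contact is its backbone O, the default call selects it, while `exclude_backbone=True` drops it unless a side-chain atom is also within the cutoff.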

Truncation/capping¶

Isolated residues retain only side-chain atoms; amino-acid backbone atoms (N, CA, C, O, OXT plus N/CA hydrogens) are removed except for PRO/HYP safeguards.
Continuous peptide stretches keep internal backbone atoms; only terminal caps (N/H* or C/O/OXT) are removed. TER awareness prevents capping across chain breaks.
With --exclude-backbone, main-chain atoms on all non-substrate amino acids are stripped (subject to PRO/HYP safeguards and PRO neighbor retention).
Non-amino-acid residues never lose atoms named like backbone (N/CA/HA/H/H1/H2/H3).

Link hydrogens (`--add-linkh`)¶

Adds carbon-only link hydrogens at 1.09 Å along severed bond vectors (CB–CA, CA–N, CA–C; PRO/HYP use CA–C only).
Inserted after a TER as contiguous HETATM records named HL in residue LKH (chain L). Serial numbers continue from the main block.
In multi-structure mode the same bonds are capped across all models; coordinates remain model-specific.
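The geometric step is simple enough to show directly. The sketch below places a hydrogen 1.09 Å from the retained (parent) atom along the vector toward the removed atom; the function name and argument order are illustrative assumptions rather than the tool's API.

```python
import math


def place_link_hydrogen(parent_xyz, removed_xyz, bond_length=1.09):
    """Return link-H coordinates `bond_length` Å from the parent atom.

    parent_xyz:  coordinates of the atom that stays in the model (e.g., CB)
    removed_xyz: coordinates of the atom that was cut away (e.g., CA)
    """
    vec = [r - p for p, r in zip(parent_xyz, removed_xyz)]
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        raise ValueError("parent and removed atoms coincide")
    # Scale the unit bond vector to the C-H link distance.
    return tuple(p + bond_length * v / norm for p, v in zip(parent_xyz, vec))
```

Because only the direction of the severed bond is used, the same bond yields model-specific hydrogen coordinates when each model has its own atom positions.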

Charge summary (`--ligand-charge/-l`)¶

Amino acids and common ions draw charges from internal dictionaries; waters are zero.
Unknown residues default to 0 unless --ligand-charge/-l supplies either a total charge (distributed across the unknown substrate residues, or across all unknown residues when the substrate contains none) or a per-resname mapping such as GPP:-3,SAM:1.
Summaries (protein/ligand/ion/total) are logged for the first input when verbose mode is enabled.
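To make the two accepted forms of the -l value concrete, here is a small parsing sketch. It assumes integer charges and a hypothetical helper name `parse_ligand_charge`; how the real tool validates or distributes the values is not shown here.

```python
def parse_ligand_charge(spec):
    """Classify a -l value as a total charge or a per-resname mapping.

    Returns ("total", int) for input like "-3",
    or ("map", {resname: int}) for input like "GPP:-3,SAM:1".
    """
    spec = spec.strip()
    try:
        return ("total", int(spec))
    except ValueError:
        pass  # not a bare number, so treat it as a mapping
    mapping = {}
    for item in spec.split(","):
        name, _, charge = item.partition(":")
        if not name.strip() or not charge.strip():
            raise ValueError(f"expected RESNAME:CHARGE, got {item!r}")
        mapping[name.strip()] = int(charge)
    return ("map", mapping)
```

Residue names missing from the mapping would keep the default charge of 0, matching the "others remain 0" behavior described in the examples.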

Multi-structure ensembles¶

Accepts multiple input PDBs (identical atom ordering is validated at the head/tail of each file). Each structure is processed independently and the union of selected residues is applied to every model so that outputs remain consistent.
Output policy:
No -o, multiple inputs → per-file model_<original_basename>.pdb.
One -o path → single multi-MODEL PDB.
N outputs where N == number of inputs → N individual PDBs.
Diagnostics echo raw vs. kept atom counts per model along with residue IDs.
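The output policy can be summarized as a small decision function. This is a sketch under stated assumptions: `resolve_outputs` is a hypothetical name, `<original_basename>` is taken to mean the file name without its extension, and raising an error on a count mismatch is a guess at reasonable behavior.

```python
from pathlib import Path


def resolve_outputs(inputs, outputs=None):
    """Return (mode, paths) for the given input and -o lists."""
    outputs = list(outputs or [])
    if not outputs:
        if len(inputs) == 1:
            return ("single", ["model.pdb"])
        # Assumption: the basename is the input file name without extension.
        return ("per_input", [f"model_{Path(p).stem}.pdb" for p in inputs])
    if len(outputs) == 1:
        mode = "multi_model" if len(inputs) > 1 else "single"
        return (mode, outputs)
    if len(outputs) == len(inputs):
        return ("per_input", outputs)
    # Assumption: any other combination is rejected.
    raise ValueError("give one -o path or exactly one per input")
```

For example, two inputs with a single `-o model_multi.pdb` resolve to the multi-MODEL mode, while two inputs with two `-o` paths resolve to one PDB per input.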

Substrate specification (`-c/--center`)¶

PDB path: the coordinates must match the first input exactly (tolerance 1e-3 Å); residue IDs propagate to other structures.
Residue IDs: '123,124', 'A:123,B:456', '123A', 'A:123A' (insertion codes supported).
Residue names: comma-separated list (case insensitive). If multiple residues share a name, all matches are included and a warning is logged.
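The residue-ID forms listed above follow a regular pattern: optional chain, residue number, optional insertion code. The parser below is an illustrative sketch (the regular expression and the name `parse_residue_ids` are assumptions); it does not attempt to distinguish IDs from residue names or PDB paths the way the tool must.

```python
import re

# Optional "CHAIN:", an integer residue number, optional one-letter insertion code.
_RES_ID = re.compile(r"^(?:(?P<chain>[A-Za-z0-9]):)?(?P<num>-?\d+)(?P<icode>[A-Za-z])?$")


def parse_residue_ids(spec):
    """Parse '123,A:123A' into (chain, number, insertion_code) tuples.

    chain is None when omitted; insertion_code is '' when omitted.
    """
    parsed = []
    for token in spec.split(","):
        match = _RES_ID.match(token.strip())
        if match is None:
            raise ValueError(f"not a residue ID: {token!r}")
        parsed.append((match.group("chain"),
                       int(match.group("num")),
                       match.group("icode") or ""))
    return parsed
```

A token such as `GPP` fails this pattern, which is one plausible way to decide that a specification should be read as residue names instead.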

CLI options¶

Note: Default values shown are used when the option is not specified.

Option	Description	Default
`-i, --input PATH...`	One or more protein–ligand PDB files (identical atom ordering required).	Required
`-c, --center SPEC`	Substrate specification (PDB path, residue IDs, or residue names).	Required
`-o, --output PATH...`	Active site model PDB output(s). With one path and multiple inputs, a single multi-MODEL PDB is written; with N paths matching N inputs, N separate PDBs are written.	Auto (`model.pdb` or `model_<input>.pdb`)
`-r, --radius FLOAT`	Atom–atom distance cutoff (Å) for inclusion.	`2.6`
`--radius-het2het FLOAT`	Independent hetero–hetero cutoff (Å, non C/H).	`0.0` (internally 0.001 Å when zero)
`--include-h2o/--no-include-h2o`	Include HOH/WAT/H2O/DOD/TIP/TIP3/SOL waters.	`True`
`--exclude-backbone/--no-exclude-backbone`	Remove backbone atoms on non-substrate amino acids (PRO/HYP safeguards).	`False`
`--add-linkh/--no-add-linkh`	Add carbon-only link hydrogens at 1.09 Å along severed bonds.	`True`
`--selected-resn TEXT`	Force-include residues by residue ID (with optional chains/insertion codes, e.g., `A:123A`). See "--selected-resn takes residue IDs, not names" in CLI Conventions for details.	`""`
`--modified-residue TEXT`	Comma-separated residue names (with optional per-residue charge) to treat as amino acids for backbone truncation and charge assignment (e.g., `HD1,HD2,HD3` or `HD1:0,SEP:-2`). When a residue is given without a trailing `:charge`, that residue’s charge defaults to `0` (e.g. in `HD1,HD2:-1` → `HD1` gets charge 0 and `HD2` gets charge −1). The flag as a whole is off by default (empty string).	`""`
`-l, --ligand-charge TEXT`	Total charge or per-resname mapping (e.g., `GPP:-3,SAM:1`).	None
`-v, --verbose/--no-verbose`	Emit INFO-level logging (`True`) or keep warnings only (`False`).	`True`
`--out-json/--no-out-json`	Write a machine-readable `result.json` alongside the extracted PDB(s). See JSON Output Schema for the schema.	`False`

Outputs¶

<output>.pdb # Active site model PDB(s) with optional link hydrogens after a TER record
 # Single input → model.pdb by default
 # Multiple inputs without -o → model_<original_basename>.pdb per structure
 # One -o path with multiple inputs → single multi-MODEL PDB
 # Output directories are not created automatically; ensure they exist

Charge summary (protein/ligand/ion/total) is logged for model #1 when verbose mode is enabled.
Programmatic use (extract_api) returns {"outputs": [...], "counts": [...], "charge_summary": {...}}.

Notes¶

For symptom-first diagnosis, start with Common Error Recipes, then use Troubleshooting for detailed fixes.
--radius defaults to 2.6 Å; a value of 0 is nudged to 0.001 Å to avoid empty selections. --radius-het2het is off by default (0.0, which is likewise treated internally as 0.001 Å).
If the extracted active site model is too small, calculated energies and barriers may be unreliable. In such cases, increasing the extraction radius (e.g., -r 4.0 or higher) can improve accuracy by including more of the protein environment.
Waters can be excluded with --no-include-h2o.
Backbone trimming plus capping respect chain breaks and PRO/HYP safeguards as outlined above; non-amino residues never lose backbone-like atom names.
Link hydrogens are inserted only on carbon cuts and reuse identical bonding patterns across models in ensemble mode.
INFO logs summarize residue selection, truncation counts, and charge breakdowns.

Systems with non-standard residues (MCPB, etc.)¶

When metal-coordinating amino acid parameters are generated by tools such as Amber’s MCPB.py (Metal Center Parameter Builder), the coordinating residues are assigned non-standard names (e.g., HD1, HE1, CM1, AP1). These are not included in extract’s internal AMINO_ACIDS dictionary, so backbone truncation and link-hydrogen capping will not be applied correctly.

When such residues are detected, extract will display a warning:

[extract] WARNING: Residue HD1 83 may be an amino acid (has N, CA, C, O)
but is not recognized as a standard residue name.
Backbone truncation was not applied.
Consider preparing the active site model manually.
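The condition named in the warning (a residue that has N, CA, C, and O atoms but an unrecognized name) can be expressed as a short check. This is a sketch of that heuristic only; the function name and the way the known-name set is passed in are assumptions.

```python
def looks_like_amino_acid(resname, atom_names, known_amino_acids):
    """True when an unrecognized residue carries a full N/CA/C/O backbone."""
    has_backbone = {"N", "CA", "C", "O"} <= set(atom_names)
    return resname not in known_amino_acids and has_backbone
```

Running such a check over an input PDB before extraction is one way to find residue names that may need to be registered with --modified-residue.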

`--modified-residue` option¶

Use --modified-residue to register non-standard residue names as amino acids so that backbone truncation and charge assignment are applied automatically. This is useful for modified amino acid residues that have non-standard three-letter codes (e.g., phosphoserine, methylated residues, D-amino acids with unusual names, or MCPB-renamed metal-coordinating residues).

# Treat HD1, HD2, HD3 as amino acids (charge defaults to 0)
pdb2reaction extract -i complex.pdb -c 'SUB' -o model.pdb \
  --modified-residue 'HD1,HD2,HD3'

# Specify explicit charges for each modified residue
pdb2reaction extract -i complex.pdb -c 'SUB' -o model.pdb \
  --modified-residue 'HD1:0,SEP:-2'
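The value format accepted by --modified-residue (names with an optional `:charge` suffix, defaulting to 0) can be illustrated with a short parser. The helper name `parse_modified_residues` and the integer-only charges are assumptions made for this sketch.

```python
def parse_modified_residues(spec):
    """Parse 'HD1,HD2:-1' into {'HD1': 0, 'HD2': -1}."""
    result = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue  # an empty string means the option is off
        name, sep, charge = item.partition(":")
        # A name without ':charge' gets the default charge of 0.
        result[name.strip()] = int(charge) if sep else 0
    return result
```

The resulting mapping doubles as the set of extra names to treat as amino acids and as their charge table.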

Important

If --modified-residue does not cover your use case, manual active site model construction is recommended. Steps:

Select residues around the active site and determine truncation points
Add a link hydrogen on the parent atom (the atom that remains) of each severed covalent bond
Use residue name LKH (chain L) and atom name HL for the link hydrogen
Place it at 1.09 Å along the original bond direction
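When building the model by hand, each link hydrogen needs a correctly aligned HETATM line. The sketch below formats one using the standard fixed-column PDB layout with atom name HL, residue name LKH, and chain L; the residue number, occupancy, and B-factor values are placeholders chosen for illustration, not values taken from the tool.

```python
def format_link_hydrogen(serial, resseq, xyz):
    """Format one link hydrogen as a fixed-column PDB HETATM record."""
    x, y, z = xyz
    return (
        f"HETATM{serial:5d} "      # cols 1-6 record name, 7-11 serial, 12 blank
        f"{' HL ':4s} "            # cols 13-16 atom name, 17 altLoc (blank)
        f"{'LKH':3s} L"            # cols 18-20 residue name, 21 blank, 22 chain
        f"{resseq:4d}    "         # cols 23-26 residue number, 27 iCode, 28-30 blank
        f"{x:8.3f}{y:8.3f}{z:8.3f}"  # cols 31-54 coordinates
        f"{1.0:6.2f}{0.0:6.2f}"    # occupancy and B-factor (placeholder values)
        f"{'':10s}{'H':>2s}"       # cols 67-76 blank, 77-78 element symbol
    )
```

Write these records after a TER line, with serial numbers continuing from the main atom block, so the hand-built file matches the layout described for extract's own output.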

Appendix: PDB naming requirements and reference lists¶

This appendix is mainly for debugging cases where extract misclassifies residues due to non-standard residue/atom naming.

Important

For extract to work correctly, residue names and atom names in the input PDB must conform to standard PDB naming conventions. The tool relies on internal dictionaries to recognize amino acids, ions, water molecules, and backbone atoms. Non-standard naming will cause residues to be misclassified or charges to be incorrectly assigned.

The following internal constants define the recognized names:

`AMINO_ACIDS`¶

A dictionary mapping residue names to their nominal integer charges. Membership in this dictionary determines whether a residue is treated as an amino acid for backbone handling, truncation, and charge calculation.

Standard 20 amino acids (charges reflect physiological pH):

Neutral: ALA, ASN, CYS, GLN, GLY, HIS, ILE, LEU, MET, PHE, PRO, SER, THR, TRP, TYR, VAL
Positive (+1): ARG, LYS
Negative (−1): ASP, GLU

Canonical extras:

SEC (selenocysteine, 0), PYL (pyrrolysine, +1)

Protonation/tautomer variants (Amber/CHARMM style):

HIP (+1, fully protonated His), HID (0, Nδ-protonated His), HIE (0, Nε-protonated His)
ASH (0, neutral Asp), GLH (0, neutral Glu), LYN (0, neutral Lys), ARN (0, neutral Arg)
TYM (−1, deprotonated Tyr phenolate)

Phosphorylated residues:

Dianionic (−2): SEP, TPO, PTR
Monoanionic (−1): S1P, T1P, Y1P
Phospho-His (phosaa19SB): H1D (0), H2D (−1), H1E (0), H2E (−1)

Cysteine variants:

CYX (0, disulfide), CSO (0, sulfenic acid), CSD (−1, sulfinic acid), CSX (0, generic derivative)
OCS (−1, cysteic acid), CYM (−1, deprotonated Cys)

Lysine variants / carboxylation:

MLY (+1), LLP (+1), KCX (−1, Nz-carboxylic acid)

D-amino acids (19 residues):

DAL, DAR, DSG, DAS, DCY, DGN, DGL, DHI, DIL, DLE, DLY, MED, DPN, DPR, DSN, DTH, DTR, DTY, DVA

Other modified residues:

CGU (−2, gamma-carboxy-glutamate), CGA (−1), PCA (0, pyroglutamate), MSE (0, selenomethionine), OMT (0, methionine sulfone), HYP (0, hydroxyproline)
Various others: ASA, CIR, FOR, MVA, IIL, AIB, HTN, SAR, NMC, PFF, NFA, ALY, AZF, CNX, CYF

N-terminal variants (prefix N): NALA (+1), NARG (+2), NASP (0), NGLU (0), NLYS (+2), etc., plus ACE (0), NTER (+1, generic)

C-terminal variants (prefix C): CALA (−1), CARG (0), CASP (−2), CGLU (−2), CLYS (0), etc., plus NHE (0), NME (0), CTER (−1, generic)

`BACKBONE_ATOMS`¶

A set of atom names considered backbone atoms for amino acids. With --exclude-backbone, these determine which atoms are removed from non-substrate residues:

N, C, O, CA, OXT, H, H1, H2, H3, HN, HA, HA2, HA3

`ION`¶

A dictionary mapping ion residue names to their formal charges. Recognized ions are automatically assigned correct charges in the charge summary.

Charge	Residue Names
+1	`LI`, `NA`, `K`, `RB`, `CS`, `TL`, `AG`, `CU1`, `Ag`, `K+`, `NA+`, `NH4`, `H3O+`
+2	`MG`, `CA`, `SR`, `BA`, `MN`, `FE2`, `CO`, `NI`, `CU`, `ZN`, `CD`, `HG`, `PB`, `BE`, `PD`, `PT`, `SN`, `RA`, `YB2`, `V2+`
+3	`FE`, `AU3`, `AL`, `GA`, `IN`, `CE`, `CR`, `DY`, `EU`, `EU3`, `ER`, `GD3`, `LA`, `LU`, `ND`, `PR`, `SM`, `TB`, `TM`, `Y`, `PU`
+4	`U4+`, `TH`, `HF`, `ZR`
−1	`F`, `CL`, `BR`, `I`, `CL-`, `IOD`

`WATER_RES`¶

A set of residue names recognized as water molecules. Waters are included by default (--include-h2o) and assigned zero charge:

HOH, WAT, H2O, DOD, TIP, TIP3, SOL

Link hydrogen and frozen atoms¶

When pdb2reaction extracts an active site model from a larger structure, severed bonds are capped with link hydrogens. By default (--freeze-links), the parent atoms of these link hydrogens are frozen during optimization and path searches to prevent unphysical rearrangement at the boundary.

Forces: frozen atoms receive zeroed forces.
Hessian: frozen degrees of freedom are either removed (return_partial_hessian: true) or zeroed in the full matrix.
Vibrational analysis: when frozen atoms are present, freq automatically performs Partial Hessian Vibrational Analysis (PHVA), diagonalizing only the active block of the Hessian.

Frozen atoms can also be set manually via the geom.freeze_atoms YAML key (1-based indices). CLI-detected link atoms are merged with YAML-specified atoms.
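The effect of freezing on forces and on the Hessian can be shown with plain Python lists. This is a conceptual sketch, not the calculator's code: the helper names are hypothetical, and the Hessian is assumed to be a dense 3N×3N nested list ordered atom by atom.

```python
def merge_frozen(cli_link_parents, yaml_freeze_atoms):
    """Union of CLI-detected and YAML-specified frozen atoms (1-based)."""
    return sorted(set(cli_link_parents) | set(yaml_freeze_atoms))


def zero_frozen_forces(forces, freeze_atoms):
    """Replace the force on every frozen atom with a zero vector."""
    frozen = {i - 1 for i in freeze_atoms}  # convert 1-based to 0-based
    return [(0.0, 0.0, 0.0) if a in frozen else tuple(f)
            for a, f in enumerate(forces)]


def partial_hessian(hessian, n_atoms, freeze_atoms):
    """Keep only rows/columns of the active (non-frozen) degrees of freedom."""
    frozen = {i - 1 for i in freeze_atoms}
    active = [3 * a + k for a in range(n_atoms) if a not in frozen
              for k in range(3)]
    return [[hessian[i][j] for j in active] for i in active]
```

Diagonalizing only the block returned by `partial_hessian` corresponds to the PHVA idea described above: frozen atoms contribute no vibrational degrees of freedom.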
