Selecting the active site/generating the active site RIN
Defining the “seed” is one of the most important steps of the QM-cluster model building process. What should you select as the seed? Typically, the seed will be the substrate(s) (or ligand, in biochemical terms) participating in the chemical reaction. Any amino acid residues, cofactors, or fragments that participate in the catalytic breaking and forming of chemical bonds in the active site may also need to be included in the seed, but this will generate much larger models compared to using only the substrate.
The seed is specified as a comma-separated list of colon-separated Chain:ResidueID pairs. In the example of 3BWM, we select the seed as A:300 (Mg2+), A:301 (SAM) and A:302 (catechol).
If the PDB file you’re using does not have chain identifiers, you will need to specify “:XXX”, where XXX is the residue ID number, in this step and beyond. Our current defaults are unreliable in these cases and need to be improved. If the protein is multimeric, use the chain of your choice for seed fragments. Note that the active sites in a multimeric X-ray crystal structure are not necessarily equivalent!
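To make the seed format concrete, here is a minimal Python sketch that splits a seed string into (chain, residue ID) pairs, including the “:XXX” form used when there are no chain identifiers. The helper name parse_seed is hypothetical and is not part of RINRUS; it also assumes residue IDs are plain integers.

```python
def parse_seed(seed):
    """Split a seed string such as 'A:300,A:301' into (chain, residue_id) tuples.

    Hypothetical helper for illustration only; not a RINRUS function.
    An empty chain (e.g. ':300') is returned as an empty string.
    """
    fragments = []
    for item in seed.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing comma
        chain, sep, resid = item.partition(":")
        if not sep:
            raise ValueError(f"expected Chain:ResidueID, got {item!r}")
        # Assumption: residue IDs are integers (no insertion codes).
        fragments.append((chain, int(resid)))
    return fragments


if __name__ == "__main__":
    print(parse_seed("A:300,A:301,A:302"))  # seed used for 3BWM
    print(parse_seed(":300,:301,:302"))     # same seed without chain IDs
```

Checking the seed string this way before running the selection scripts can catch a missing colon or a mistyped residue number early.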
The active site RIN can be automatically determined from the seed fragments based on probe contacts, arpeggio contacts or distance. For all selection metrics the key output is a file, usually called res_atoms.dat or similar, containing a list of the identified active site residues ranked by the chosen metric and the atoms identified by the selection procedure. This file will be used as the input for the trimming procedure in part 3.
A. Using Probe contact count ranking
First run probe on the (modified) PDB file to generate a *.probe file of all contacts in the enzyme
# Example usage of probe:
$HOME/git/RINRUS/bin/probe -unformated -MC -self "all" 3bwm_h_modify.pdb > 3bwm_h_modify.probe
Then generate the active site RIN from the probe contacts with probe2rins.py
# Example usage of probe2rins:
python3 $HOME/git/RINRUS/bin/probe2rins.py -f 3bwm_h_modify.probe -s A:300,A:301,A:302
# All arguments for probe2rins:
-f FILE probe contacts file
-s SEED seed fragment(s) (e.g. A:300,A:301,A:302)
This produces freq_per_res.dat, rin_list.dat, res_atoms.dat, and *.sif.
Use res_atoms.dat as the input for the trimming procedure in part 3.
NOTE: If the metal atom was replaced with O during preprocessing, remember to restore it in the PDB file (or switch back to the unmodified PDB file) before continuing with the remaining steps.
B. Using Arpeggio contact count or contact type ranking
Make sure openbabel libraries are available to properly use RINRUS with arpeggio.
First, run arpeggio to generate the contact file (you need to make sure config.py is in the same directory as arpeggio.py)
# Example usage of arpeggio
python3 ~/git/RINRUS/bin/arpeggio/arpeggio.py 3bwm_h.pdb
Then generate the active site RIN from arpeggio contacts using arpeggio2rins.py
# Example usage of arpeggio2rins
python3 ~/git/RINRUS/bin/arpeggio2rins.py -f 3bwm_h.contacts -s A:300,A:301,A:302
# All arguments for arpeggio2rins:
-f FILE arpeggio contacts file
-s SEED seed fragment(s) (e.g. A:300,A:301,A:302)
This produces the files contact_counts.dat, contype_counts.dat, and node_info.dat. Both contact_counts.dat and contype_counts.dat have the same format as res_atoms.dat.
Use contact_counts.dat (residues ranked by number of contacts) or contype_counts.dat (residues ranked by number of interaction types) as the input for the trimming procedure in part 3.
C. Using distance ranking
There are two key options that determine how RINRUS calculates the distance between residues and the seed for distance-based selection and ranking.
- Distance type: distance can be calculated to the seed’s centre of mass, to the seed’s average Cartesian coordinates, or to the closest seed atom.
- Hydrogens: distances can be calculated using all atoms or only heavy atoms (all hydrogens are ignored). This applies to both the residue-seed distance calculation and calculation of the seed COM/average centre.
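The two options above can be illustrated with a short Python sketch. This is not the dist_rank.py implementation, only a sketch of the three distance definitions under stated assumptions: atoms are given as (element, (x, y, z)) tuples, the mass table is a small approximate one, the residue-seed distance is taken as the smallest distance from any residue atom to the reference point(s), and all function names are hypothetical.

```python
import math

# Approximate atomic masses (u) for a few elements; illustrative, not exhaustive.
MASSES = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06, "MG": 24.305}


def centre(atoms, weighted):
    """Return the mass-weighted (COM) or unweighted average position of atoms."""
    weights = [MASSES[element] if weighted else 1.0 for element, _ in atoms]
    total = sum(weights)
    return tuple(
        sum(w * xyz[i] for w, (_, xyz) in zip(weights, atoms)) / total
        for i in range(3)
    )


def residue_seed_distance(residue, seed, dist_type="closest", no_h=False):
    """Smallest distance from any residue atom to the seed reference point(s)."""
    if no_h:
        # Ignore hydrogens in both the residue and the seed, so the
        # COM/average centre is also computed from heavy atoms only.
        residue = [atom for atom in residue if atom[0] != "H"]
        seed = [atom for atom in seed if atom[0] != "H"]
    if dist_type == "closest":
        refs = [xyz for _, xyz in seed]
    elif dist_type == "avg":
        refs = [centre(seed, weighted=False)]
    elif dist_type == "mass":
        refs = [centre(seed, weighted=True)]
    else:
        raise ValueError(f"unknown distance type: {dist_type!r}")
    return min(math.dist(xyz, ref) for _, xyz in residue for ref in refs)


if __name__ == "__main__":
    # Toy coordinates in Å, chosen only to show how the options differ.
    seed = [("O", (0.0, 0.0, 0.0)), ("H", (1.0, 0.0, 0.0))]
    residue = [("C", (3.0, 0.0, 0.0)), ("H", (2.0, 0.0, 0.0))]
    for dist_type in ("closest", "avg", "mass"):
        for no_h in (False, True):
            d = residue_seed_distance(residue, seed, dist_type, no_h)
            print(f"{dist_type:8s} noH={no_h}: {d:.3f}")
```

Even in this toy case the choice of distance type and hydrogen handling changes the residue-seed distance, which is why the same cutoff can select different residues.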
Use dist_rank.py to select all fragments with any atoms within a cutoff radius of the seed. A limited set of seed atoms can be selected for the distance calculations by using the ‘-satom’ flag instead of ‘-s’.
# Example usage of dist_rank selecting residues within 5 Å of the full seed COM
python3 ~/git/RINRUS/bin/dist_rank.py -pdb 3bwm_h.pdb -s A:300,A:301,A:302 -max 5 -type mass
# All arguments for dist_rank
-pdb FILE PDB structure file (e.g. 3bwm_h.pdb)
-type TYPE how to calculate distance from seed ('closest' or 'avg' or 'mass')
-noH ignore hydrogen atoms (true if flag present)
-max CUTOFF cutoff distance in Å (default: 5)
-s SEED seed fragment(s) (e.g. A:300,A:301,A:302)
-satom ATOMS seed atoms (e.g. A:301:C8,A:301:N9,A:302:C1,A:302:N1)
This produces a file all_atoms_[type]_[max].dat listing all atoms within the cutoff distance, which are then grouped by residue IDs to give sorted_residues_[type]_[max].dat and res_atoms_by_residue.dat. The atom list is also grouped by functional groups (matching the SC/MC partitioning used for F-SAPT) to give sorted_FGs_[type]_[max].dat and res_atoms_by_FG.dat.
Use res_atoms_by_residue.dat or res_atoms_by_FG.dat as the input for the trimming procedure in part 3.
D. Manual selection and ranking
You can generate your own res_atoms.dat file using an existing res_atoms.dat file as a template or from scratch.
- The first two columns should list the chain and residue ID of a given residue.
- The third column is where the ranking value would go. This isn’t actually used by the trimming script so the value doesn’t matter as long as something is there.
- The rest of the line should be the selected atom(s).
The residues should be listed in the order you want them to be added to the model.
Example format:
A 300 554 O
A 40 491 CB CE CG HB2 HE1 HE3 HG3 O SD
A 141 478 CB CG HB3 O OD1 OD2
A 143 378 CD1 CD2 CE2 CE3 CG CH2 CZ2 CZ3 H NE1
A 91 335 CB CG2 H HB HG21 HG22 N
A 170 304 CG HD21 ND2 OD1
Use your new res_atoms.dat as the input for the trimming procedure in part 3.
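The column layout described above can be sketched in Python as follows. The helper names are hypothetical and not part of RINRUS, and the sketch assumes whitespace-separated columns with a chain identifier present on every line, as in the example format.

```python
def format_res_atoms_line(chain, resid, rank, atoms):
    """Build one line: chain, residue ID, ranking value, then atom names."""
    return " ".join([chain, str(resid), str(rank), *atoms])


def parse_res_atoms_line(line):
    """Split one line into (chain, residue_id, rank, atoms).

    The ranking value is kept as a string because, per the text above,
    the trimming script does not use it.
    """
    fields = line.split()
    if len(fields) < 4:
        raise ValueError(f"expected chain, residue ID, rank and atoms: {line!r}")
    chain, resid, rank = fields[0], int(fields[1]), fields[2]
    return chain, resid, rank, fields[3:]


if __name__ == "__main__":
    # Residues are listed in the order they should be added to the model.
    selection = [
        ("A", 300, 554, ["O"]),
        ("A", 170, 304, ["CG", "HD21", "ND2", "OD1"]),
    ]
    lines = [format_res_atoms_line(*entry) for entry in selection]
    print("\n".join(lines))
    print(parse_res_atoms_line(lines[1]))
```

Writing the file from an ordered list like this keeps the addition order explicit, and parsing each line back is a quick check that no line is missing a column.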