Introduction

MDT prepares a raw frequency table from information in MODELLER alignments and/or PDB files. It can also process the raw frequency table in several ways (e.g., normalization with Table.normalize(), smoothing with Table.smooth()), perform entropy calculations with Table.entropy_full(), and write out the data in various formats, including for plotting by ASGL (Table.write_asgl()) and for use as restraints by MODELLER.

More precisely, MDT uses a sample of sequences, structures, and/or alignments to construct a table N(a,b,c,…,d) for features a, b, c, …, d. How the sample for generating the frequencies N is obtained depends on the types of the features a, b, c, …, d. The sample can contain individual proteins, pairs of proteins, pairs of residues in proteins, pairs of aligned residues, pairs of aligned pairs of residues, chemical bonds, angles, dihedral angles, and pairs of tuples of atoms. Some features work with triple alignments, too. All the needed features a, b, c, …, d are calculated automatically from the sequences, alignments, and/or PDB files. The feature bins are defined by the user when each feature is created.

MDT features

A ‘feature’ in MDT is simply some binnable property of your input alignment. Example features include the residue type, chi1 and Phi dihedral angles, sequence identity between two sequences, X-ray resolution, atom-atom distances, atom type, and bond length.

MDT understands that different features act on different sets of proteins, or parts of proteins, and will automatically scan over the correct range to collect necessary statistics (e.g. when you call Table.add_alignment()). For example, to collect statistics for the residue type feature, it is necessary to scan all residues in all proteins in the alignment. The X-ray resolution feature, on the other hand, only requires each protein in the alignment to be scanned, not each residue. The atom-atom distance feature requires scanning over all pairs of atoms in all proteins in the alignment, while the sequence identity feature requires scanning all pairs of proteins in the alignment. If you construct a table of multiple features, the most fine-grained of the features determines the scan - for example, a table of X-ray resolution against Φ dihedral would require a scan of all residues. See the scan types table for all of the scan types.

When choosing which proteins to scan, MDT also considers the features. It will scan each protein individually, all pairs of proteins, or all triples of proteins. The latter two scans only happen if you have features in your table that require multiple proteins (e.g. protein pair or aligned residue features) or you have single-protein features such as protein or residue features but you have asked to evaluate them on the second or third protein (by setting the protein argument to 1 or 2 rather than the default 0).

MDT also knows that some residue pair or atom pair features are symmetric, and will perform a non-redundant scan in this case. If, however, any feature in the table is asymmetric, a full scan is performed. If in doubt, you can query Table.symmetric to see whether a symmetric scan will be performed for the current set of features. (Currently, any tuple pair feature in your table forces a full scan.)
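The difference between the two kinds of scan can be illustrated in plain Python, without MDT. This is only a minimal sketch, assuming that a non-redundant scan visits each unordered pair once and that a full scan visits both orderings of each pair; the helper names are hypothetical, and the sketch says nothing about how MDT treats an item paired with itself.

```python
from itertools import combinations, permutations


def nonredundant_pairs(items):
    """Visit each unordered pair once: (i, j) with i before j."""
    return list(combinations(items, 2))


def full_pairs(items):
    """Visit both orderings of each pair: (i, j) and (j, i)."""
    return list(permutations(items, 2))


# Hypothetical atom names, used only to have something to pair up.
atoms = ["N", "CA", "C", "O"]
sym = nonredundant_pairs(atoms)
full = full_pairs(atoms)
print(len(sym), len(full))
```

Under these assumptions the full scan does twice the work of the non-redundant one, which is why a symmetric scan is worth having when every feature in the table permits it.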

The feature bins determine how to convert a feature value into a frequency table. For most feature types, you can specify how many bins to use, and their value ranges - see Specification of bins for more information. The last bin is always reserved as an ‘undefined’ bin, for values that don’t fall into any other bin [1].

(Some features are predetermined by the setup of the system - for example, the residue type feature always has 22 bins - 20 for the standard amino acids, 1 for gaps in the alignment, and 1 for undefined.)

Type                      Example feature
Protein                   features.XRayResolution
Residue [2]               features.Chi1Dihedral
Residue pair [2] [3]      features.ResidueIndexDifference
Atom                      features.AtomType
Atom pair [3]             features.AtomDistance
Atom tuple                features.TupleType
Atom tuple pair           features.TupleDistance
Chemical bond             features.BondType
Chemical angle            features.Angle
Chemical dihedral angle   features.Dihedral

Dependent and independent features

An MDT Table object is simply a table of counts N(a,b,c,…,d) for features a, b, c, …, d. However, this is often used to generate a conditional PDF, p(x,y,…,z | a,b,…,c) for independent features a, b, …, c and dependent features x, y, …, z. By convention in MDT the dependent features are the last or rightmost features in the table, and so methods that are designed to deal with PDFs, such as Table.smooth(), Table.super_smooth(), Table.normalize(), Table.offset_min(), and Table.close(), expect the dependent features to be the last features. If necessary you can reorder the features using Table.reshape() or Table.integrate().
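To make the convention concrete, the sketch below turns a small table of counts N(a, x) into p(x | a) by normalizing over the last (dependent) feature. It is plain Python rather than MDT code: the function name and the sample counts are invented for illustration, and the treatment of empty rows (a uniform distribution) is an assumption of this sketch, not a description of what Table.normalize() does.

```python
def normalize_last_axis(counts):
    """Convert rows of counts N(a, x) into conditional PDFs p(x | a)."""
    pdf = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # Assumption of this sketch: an empty row becomes uniform.
            pdf.append([1.0 / len(row)] * len(row))
        else:
            pdf.append([n / total for n in row])
    return pdf


# Rows index the independent feature a; columns index the dependent feature x.
counts = [[2, 6, 2],
          [0, 0, 0],
          [5, 0, 5]]
pdf = normalize_last_axis(counts)
for row in pdf:
    print(row)
```

Each row of the result sums to 1, i.e. it is a distribution over x for one fixed value of a. If x were not the last feature, the same loop would normalize over the wrong index, which is why the feature order matters.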

Specification of bins

Most features take a bins argument when they are created, which specifies the bin ranges. This is simply a list of (start, end, symbol) triples, which specify the feature range for each bin, and the symbol to refer to it by. For example, the following creates an X-ray resolution feature with 4 bins, the first for 0.51-1.4 Å, the second for 1.4-1.6 Å, and so on. Any value below 0.51 Å, any value of 2.0 Å or above, and any undefined value will be placed into a fifth ‘undefined’ bin.

xray = mdt.features.XRayResolution(mlib, bins=[(0.51, 1.4, "<1.4"),
                                               (1.4,  1.6, "1.4-1.6"),
                                               (1.6,  1.8, "1.6-1.8"),
                                               (1.8,  2.0, "1.8-2.0")])

Note

Bin ranges in MDT are half-closed, i.e. a feature value must be greater than or equal to the lower value of the range, and less than the upper value, to be counted in the bin. For example, in the case above, 1.0 Å would be placed into the first bin, and 1.4 Å into the second. (If you define bins with overlapping ranges, values will be placed into the first bin that matches.)
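The lookup rule described in this note can be sketched in a few lines of plain Python. This is an illustration only, not MDT's implementation: find_bin is a hypothetical helper, and representing an undefined value as None is an assumption made for the sketch.

```python
def find_bin(value, bins):
    """Return the index of the first bin with start <= value < end.

    Values that match no bin (or are None) map to the extra
    'undefined' bin, whose index is len(bins).
    """
    if value is not None:
        for index, (start, end, _symbol) in enumerate(bins):
            if start <= value < end:
                return index
    return len(bins)


# The same ranges as the X-ray resolution example above.
xray_bins = [(0.51, 1.4, "<1.4"),
             (1.4, 1.6, "1.4-1.6"),
             (1.6, 1.8, "1.6-1.8"),
             (1.8, 2.0, "1.8-2.0")]

for resolution in (1.0, 1.4, 2.0, 0.3, None):
    print(resolution, find_bin(resolution, xray_bins))
```

Because the loop returns on the first match, overlapping ranges resolve to the earliest bin in the list, as the note describes.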

In most cases, a set of bins of equal width is desired, and it is tedious to specify these by hand. A utility function uniform_bins() is provided, which takes three arguments - the number of bins, the lower bound of the first bin, and the width of each bin - and creates a set of equal-width bins, each one starting where the previous one ends. For example, the following bins the atom-atom distance feature into 60 bins, each 0.5 Å wide, with the first bin starting at 0 Å. The first bin is thus 0-0.5 Å, the second 0.5-1.0 Å, and so on, up to bin 60 which is 29.5-30.0 Å. The additional ‘undefined’ bin thus counts anything below 0 Å, greater than or equal to 30.0 Å, or which could not be calculated for some reason.

atdist = mdt.features.AtomDistance(mlib, bins=mdt.uniform_bins(60, 0, 0.5))
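For readers who want to see what such a list of bins looks like, here is a stand-alone approximation of the idea in plain Python. It is a sketch, not the body of mdt.uniform_bins(): the name uniform_bins_sketch is hypothetical, and the symbol given to each bin (the lower bound formatted to one decimal place) is an assumption, since the text above does not say how symbols are chosen.

```python
def uniform_bins_sketch(num, start, width):
    """Build num contiguous (start, end, symbol) triples of equal width."""
    bins = []
    for i in range(num):
        # Compute each bound from the index to avoid accumulating
        # floating point error from repeated addition.
        lower = start + i * width
        upper = start + (i + 1) * width
        bins.append((lower, upper, "%.1f" % lower))  # symbol is an assumption
    return bins


bins = uniform_bins_sketch(60, 0, 0.5)
print(bins[0])
print(bins[-1])
```

The first and last triples cover 0-0.5 Å and 29.5-30.0 Å respectively, matching the ranges described above.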

Storage for bin data

By default, when a table is created in MDT it uses double precision floating point to store the counts. This allows large counts to be stored accurately, and also allows floating point data such as PDFs to be stored. However, for very large tables, this may use a prohibitive amount of memory. Therefore, it is possible to change the data type used to store bin data, by specifying the bin_type parameter when creating a Table object. The same parameter can be given to Table.copy(), to make a copy of the table using a different data type for its storage. Note that other data types use less storage, but can also store a smaller range of counts. For example, the UnsignedInt8 data type uses only a single byte for each bin, but can only store integer counts between 0 and 255 (floating point values, or values outside of this range, will be truncated). MDT uses double precision floating point for all internal operations, but any storage of bin values uses the user-selected bin type. Thus you should be careful not to use an inappropriate bin type - for example, don’t use an integer bin type if you are planning to store PDFs or perform normalization, smoothing, etc.
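A rough feel for the memory trade-off can be had with a back-of-the-envelope calculation. The sketch below uses Python's standard array module as a stand-in for the two storage types (typecode "d" for double precision, "B" for an unsigned byte); the table shape is hypothetical, and the estimate covers only the bin data, ignoring any overhead MDT itself may add.

```python
from array import array
from math import prod


def table_bytes(shape, typecode):
    """Estimate bin storage: number of bins times bytes per bin."""
    return prod(shape) * array(typecode).itemsize


# Hypothetical three-feature table with 22, 22 and 61 bins.
shape = (22, 22, 61)
double_bytes = table_bytes(shape, "d")  # double precision bins
uint8_bytes = table_bytes(shape, "B")   # one unsigned byte per bin
print(double_bytes, uint8_bytes)
```

The saving grows with the number of bins, so it matters most for tables with many features or finely divided bins, which are also the tables where a small count range is most likely to be acceptable.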

Footnotes

[1] You can, however, remove the ‘undefined’ bin using Table.reshape() or by using the ‘shape’ argument when you create the Table object.
[2] Residue and residue pair scans are also used for ‘one atom per residue’ features, such as features.ResidueDistance, which is the distance between the ‘special atom’ in two residues. This special atom is usually Cα, but can be overridden by specifying the distance_atoms parameter when creating the Library object.
[3] When looking at pairs of atoms or residues, it is useful to extract information about the ‘other’ atom or residue in the pair. This other atom or residue is termed ‘pos2’ in MDT, and can be asked for when creating the feature. For example, when building a table of atom-atom distances (features.AtomDistance) it may be useful to tabulate it against the atom types of both the first and second atom. This is done by adding two copies of the AtomType feature to the table, the second created with pos2=True.