Statistical models of evolution

Models of sequence evolution are used in molecular phylogenetics and comparative genomics to describe the substitution patterns that occur during evolution. Progressively more realistic models are regularly developed. These improve the accuracy of phylogenetic tree estimation, allow functional information to be inferred from sequence alignments, and offer fascinating insights into how genomes evolve.

The WAG model [publication]

WAG (Whelan And Goldman) is an empirical model of globular protein evolution. It was estimated from 182 protein families (provided by David Jones) using a maximum likelihood procedure that takes into account the evolutionary relationships within each family. WAG is implemented in many widely used programs for phylogeny.

The WAG and WAG* models (see the paper for more details) are both available for download in the format used by Ziheng Yang's PAML package.


The SDT model [publication]

The SDT (single-doublet-triplet) model is a mechanistic model that describes codon evolution by point mutations and by larger changes spanning multiple nucleotides. For 257 of 258 families from PANDIT we showed that SDT provided a significantly better fit than standard models of codon evolution that describe only point mutations.

The precise cause of the high estimated levels of doublet and triplet mutations is unclear, but the high levels may be the result of gene conversion, inversion or recombination, or a series of rapid compensatory (or complementary) changes. 


If you want to try out the software used in this study please contact me.

Note: this study also clarifies the relationship between models of codon and nucleotide substitution.

Technical details

Markov models of sequence evolution typically describe the rate at which characters in the data, such as nucleotides or amino acids, replace each other using one (or more) instantaneous rate matrices. The values in the instantaneous rate matrix, Q, are described by a series of fixed pre-specified values, a selection of free parameters that adapt to the observed data, or both. Models containing only fixed parameters are frequently called empirical models, while models with only free parameters are often called mechanistic models.
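As a rough illustration of how fixed values and free parameters can both feed into Q, the sketch below assembles a rate matrix from a symmetric table of exchangeabilities and a vector of equilibrium frequencies. The parameterisation q_ij = s_ij * pi_j, the scaling to one expected substitution per unit time, and the names build_rate_matrix and hky_exchangeabilities are assumptions made for this example, not details taken from the text above. In an empirical model the exchangeabilities and frequencies would be fixed numbers, whereas here a single free parameter, kappa, controls the transition/transversion bias.

```python
def build_rate_matrix(exchangeabilities, frequencies):
    """Build a reversible rate matrix with q_ij = s_ij * pi_j for i != j.

    Assumed conventions (illustrative only): each row sums to zero, and Q is
    scaled so that the expected rate, -sum_i pi_i * q_ii, equals 1.
    """
    n = len(frequencies)
    q = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                q[i][j] = exchangeabilities[i][j] * frequencies[j]
        # The diagonal is set so that the row sums to zero.
        q[i][i] = -sum(q[i])
    rate = -sum(frequencies[i] * q[i][i] for i in range(n))
    return [[x / rate for x in row] for row in q]


def hky_exchangeabilities(kappa):
    """Hypothetical helper: nucleotide order T, C, A, G.

    Transitions (T<->C and A<->G) get exchangeability kappa; transversions
    get 1. Diagonal entries are ignored by build_rate_matrix.
    """
    s = [[1.0] * 4 for _ in range(4)]
    s[0][1] = s[1][0] = kappa
    s[2][3] = s[3][2] = kappa
    return s


# Example values chosen arbitrarily for illustration.
freqs = [0.3, 0.2, 0.3, 0.2]
Q = build_rate_matrix(hky_exchangeabilities(2.0), freqs)
```

Treating kappa as a quantity to be optimised makes this a mechanistic model in the sense used above; replacing the exchangeability table with pre-estimated numbers would make it empirical.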

The instantaneous rate matrix is used to calculate P(t) = e^{Qt}, whose elements p_{i,j}(t) describe the probability that character i has been replaced by character j after time t. The P(t) matrix is required to perform likelihood calculations using Felsenstein's pruning algorithm. Likelihood can be used to optimise the parameters in a model, allowing biological inferences to be drawn from them. Alternatively, the likelihood function can be incorporated into a Bayesian inference framework.
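The calculation described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the code used in any of the studies on this page: the matrix exponential is approximated by a scaled Taylor series followed by repeated squaring, the tree is a nested list of (child, branch_length) pairs with integer leaf states, and the Jukes-Cantor matrix is used only because it is simple. All function names are hypothetical.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(q, t, squarings=10, terms=12):
    """Approximate P(t) = exp(Q t) by scaling and squaring a Taylor series."""
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[x * scale for x in row] for row in q]
    p = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in p]
    for k in range(1, terms + 1):
        # term holds a^k / k! after this step.
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        p = [[p[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        p = mat_mul(p, p)
    return p


def partial_likelihoods(node, q):
    """Pruning step: L[i] = P(data below node | state i at node), one site."""
    n = len(q)
    if isinstance(node, int):  # leaf: observed state index
        return [1.0 if i == node else 0.0 for i in range(n)]
    result = [1.0] * n
    for child, branch_length in node:  # internal node: list of (child, t)
        p = transition_matrix(q, branch_length)
        below = partial_likelihoods(child, q)
        for i in range(n):
            result[i] *= sum(p[i][j] * below[j] for j in range(n))
    return result


def site_likelihood(tree, q, frequencies):
    """Sum the root partial likelihoods weighted by equilibrium frequencies."""
    root = partial_likelihoods(tree, q)
    return sum(f * l for f, l in zip(frequencies, root))


# Jukes-Cantor rate matrix (assumed for illustration): equal rates, scaled
# to one expected substitution per unit time.
JC_Q = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]
JC_FREQS = [0.25] * 4

# Two leaves that both show state 0, joined at the root by branches of
# length 0.1 and 0.2.
example_tree = [(0, 0.1), (0, 0.2)]
example_likelihood = site_likelihood(example_tree, JC_Q, JC_FREQS)
```

For a full alignment the per-site likelihoods would be multiplied together (in practice their logarithms are summed), and the resulting function of the model parameters and branch lengths is what gets optimised or passed to a Bayesian sampler.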

Good reviews and books on this subject include:

1. Inferring Phylogenies, Joe Felsenstein

2. Molecular phylogenetics: state-of-the-art methods for looking into the past, Whelan et al.

3. Models of molecular evolution and phylogeny, Lió and Goldman.

Last modified: 22 May, 2007