University of Göttingen | Faculty of Biology | Inst. of Microbiology and Genetics | Dep. of Bioinformatics

DIALIGN-TX [algorithm]

DIALIGN-TX involves many improvements compared to DIALIGN-T and DIALIGN 2.2. At this point we present those two having the greatest impact on the alignment quality.

New Improvements in DIALIGN-TX:

Combined Greedy and Progressive Strategy

The new method we developed initially computes a guide tree T for the set of input sequences based on their pairwise similarity scores. We divide the set of fragments contained in the respective optimal pairwise alignments into two subsets F and G where F consists of all fragments with weight scores below the average fragment score in all pairwise alignments, and G consists of the fragments with a weight above or equal to the average weight. In a first step, the set G is used to calculate an initial multiple alignment A in a "progressive" manner. During this progressive step two sub-alignments are merged using a conflict-graph that contains edges for every conflict between any pair of fragments between the to sub-alignments to be merged. Of course there may be conflicts that involve triples or more generally $m$-tuples of fragments, however we only take care of pair-conflicts since they have the highest probability. We resolve the pair-conflict by removing an vertex cover detected by the 2-approximation for the "weighted vertex cover" method of Clarkson (1983). The remainder is then now aligned greedy as implemented in the previous version of DIALIGN-T. The low-scoring fragments from set G are added later to A in a greedy way after the progressive phase has been completed, provided if they are consistent with A. In addition, we construct an alternative multiple alignment B using the greedy approach implemented in previous versions of DIALIGN and DIALIGN-T, respectively. The program finally returns either A or B, depending on which one of these two alignment has the highest score.

Anchor Points

DIALIGN-TX now support anchor points the same way DIALIGN 2.2 does and especially can now be combined with programs like CHAOS that are specialized in finding such anchor points that speed up the pairwise alignment phase of DIALIGN-TX as well as should have a positive impact on the alignment quality.

Reduced Sensitivity Mode

DIALIGN-TX also now comes with a reduced sensitivity option that reduces the sensitivity of the method when finding fragments in the pairwise alignment phase, however it does not have an influence on the speed of the method. By default in round one fragments that have a weight score below -log(0.5) are discarded during the pairwise alignment phase, however in rounds ≥2 this threshold is set back to 0.0. When choosing the sensitivity level 2 this threshold remains -log(0.5) and when choosing level 1 it is set to -log(0.75) for all subsequent rounds.

Fast Mode

For the case of large inputs and speed is more important than the quality DIALIGN-TX still produces reasonable alignments when using the newly introduced fast mode. By default during the pairwise alignment phase the fragment under consideration is extended whilst its average score is at least 4 for amino acids (note that our BLOSUM62 matrix has 0 for the lowest score possible) and 0.25 for nucletoids. The choice of the fast mode increases this threshold by 0.25 which has the effect that the extension of fragments during the pairwise alignment phase is interrupted way more often than by default, however in turn the sensitivity is reduced. We observed speed-ups up to factor 10 on various benchmark data when using this option while the alignment quality was still reasoanably high. We recommend to use this option for large input data containing sequences that are not too distantly related.

Previous Improvements in DIALIGN-T:

Excluding low-scoring Subfragments:

The segment-based approach finds the fragments to be incorporated into the alignment by considering at each position (i,j) in the comparison matrix all fragments (= segment pairs) starting at (i,j) up to a certain maximum length M. For protein alignment, the previous program DIALIGN 2.2 uses a default value of M = 40; M can be reduced to speed-up the program, but this may result in decreased alignment quality since in this case the resulting fragments would contain undesired regions of low scores. However, we could increase M to M=100 by parallely excluding regions of low quality. A region of low quality of length L (default L=4) is defined as such if it is of negative Needleman-Wunsch score. In a nutshell, DIALIGN-TX prohibits such regions in fragments of length T > 40 . Of course the parameters L and T can be chosen by the user, however the defaults have been carefully deducted from various experiments.

Weight Score Factors:

The segment-based approach of DIALIGN uses a greedy optimisation procedure for multiple alignment. The order in which fragments are included into the multiple alignment is determined based on their weight scores. A general problem with this greedy approach is that if a wrong fragment is accepted for multiple alignment, it cannot be removed later on. In DIALIGN-TX, we adopt the following approach: we multiply the weight score of each fragment by the square of the total weight score of the respective sequence pair divided by the overall weight score of all pairwise alignments. This biases fragments coming from a sequence pairs of high similarity and thus improves the alignment quality.

Further Information and Citation

More detailed descriptions of the methods can be found here:

Back to main page