AlignBucket: a tool to speed up “all-against-all” protein sequence alignments optimizing length constraints
Giuseppe Profiti 1,2,3, Piero Fariselli 1,2∗and Rita Casadio 2,3 1Department of Computer Science and Engineering, via Mura Anteo Zamboni 7, Bologna, Italy. 2Bologna Biocomputing group, via S. Giacomo 9/2, Bologna, Italy. 3Health Sciences and Technologies ICIR, via Tolara di Sopra 41/E, Ozzano dell’Emilia, Italy.
Motivation: The Next Generation Sequencing era requires reliable, fast and efficient approaches for the accurate annotation of the ever-increasing number of biological sequences and their variations.
Transfer of annotation upon similarity search is a standard approach.
The procedure of all-against-all protein comparison is a preliminary step of different available methods that annotate sequences based on information already present in databases. Given the actual volume of sequences, methods are necessary to pre-process data in order to reduce the time of sequence comparison.
Results: We present an algorithm that optimizes the partition of a large volume of sequences (the whole database) into sets where sequence length values (in residues) are constrained depending on a bounded minimal and expected alignment coverage. The idea is to optimally group protein sequences according to their length, and then computing the all-against-all sequence alignments among sequences that fall in a selected length range. We describe a mathematically optimal solution and we show that our method leads to a 5 fold speed-up in real world cases.
Availability: The software is available for downloading at http://www.biocomp.unibo.it/˜giuseppe/partitioning.html
Contact: email@example.com 1 INTRODUCTION “All-against-all” sequence alignments in protein datasets are routinely performed with programs such as BLAST (Altschul et al., 1990) or dynamic programming approaches (Needleman and Wunsch, 1970). By this, sequences group into subsets of sequence identity values that are important in several applications, including transfer of structural and functional features between protein sequences (for details see www.uniprot.org/program/ and www.ebi.ac.uk/GOA/ElectronicAnnotationMethods). Proteins that share at least some 30% of sequence identity, may overlap in their three-dimensional structures (Chothia and Lesk, 1986), while proteins sharing at least some 60% of sequence identity are likely to share also similar functions (Rost, 2002; Tian and Skolnick, 2003).
It is also important to consider the relative length of the pairwise aligned sequences before transferring annotations (Devos and Valencia, 2001). Multidomain proteins are less functionally conserved ∗to whom correspondence should be addressed than single-domain ones (Hegyi and Gerstein, 2001) and methods based on transfer of function after homology search can group proteins with different functions in the same cluster. Therefore, both sequence identity values and the percentage of overlapping positions over the alignment length (coverage) were constrained in developing methods for annotation problems (Piovesan et al., 2011; Miele et al., 2011; Piovesan et al., 2013). Evolutionary studies include alignments in the early stage of their workflow (Vilella et al., 2009;
Waterhouse et al., 2013), as well as different computational methods to speed up computation (Roth et al., 2008; Sonnhammer et al., 2014; Wittwer et al., 2014).
Considering the computational cost, the all-against-all similarity search is an extremely demanding process. Recently, in problems related to evolutionary studies, Wittwer et al., 2014, showed that by considering subsequence-level homology, the speed of the allagainst-all alignment process could be increased by a factor of four, while maintaining the possibility of retrieving homologous sequences. On the same line of research, here we present an algorithm that can speed up the computational time of the similarity search by constraining the expected alignment coverage. 2 METHODS
According to Altschul et al., 1990, the BLAST computational cost (c) of comparing all pairs of sequences with lengths ranging from s to t is: c(s, t) = t∑ k=s
Sk · k + t∑ j=s jSj (1) where k and j are sequence lengths (number of residues) and Sk is the number of sequences of length k in the database. Given a bound (δ) on the required minimum coverage (fraction of overlapping positions over the sequence alignment length, ≤ 1) of the resulting alignment, all the sequences in the database can be partitioned in different ways. The two extreme possibilities are: 1) a single set containing all the sequences, and the procedure of the all-against-all pairwise sequence comparison performs many worthless computations; 2) N sets (one for each possible sequence length k), including all the sequences in the range [dk × δe, bk ÷ δc], and the procedure of the all-against-all pairwise sequence comparison performs again many useless alignments. In case 1) many alignments will not meet the required bound δ, while in case 2) too many duplicated comparisons will be performed.
In order to reduce the alignment time, we introduce AlignBucket, a procedure that optimises sequence-length bounds for each subset of the database to minimize the computational cost of Eq. 1 (for theoretical details see Suppl. materials). 1
Associate Editor: Prof. Burkhard Rost © The Author (2015). Published by Oxford University Press. All rights reserved. For Permissions, please email: firstname.lastname@example.org
Bioinformatics Advance Access published July 30, 2015 at U niversity of Pittsburgh on A ugust 3, 2015 http://bioinform atics.oxfordjournals.org/
D ow nloaded from
Table 1. Optimal partitioning into sets as a function of a minimal bound coverage of the sequence alignment. δ Sets (#) Cost for sets % of full cost Speedup 0.0 1 3.85× 1017 100.00% 1.00 0.1 2 3.85× 1017 99.99% 1.00 0.2 3 3.85× 1017 99.85% 1.00 0.3 4 3.81× 1017 98.97% 1.01 0.4 5 3.56× 1017 92.30% 1.08 0.5 8 3.12× 1017 80.95% 1.24 0.6 11 2.53× 1017 65.66% 1.52 0.7 18 1.85× 1017 47.95% 2.09 0.8 30 1.19× 1017 30.82% 3.25 0.9 65 5.62× 1016 14.58% 6.86