Clustal

CLUSTAL
Developer(s):
  • Des Higgins
  • Fabian Sievers
  • David Dineen
  • Andreas Wilm (all at the Conway Institute, UCD)
Stable release: 1.2.2 / 1 July 2016
Written in: C++
Operating system: UNIX, Linux, MacOS, MS-Windows, FreeBSD, Debian
Type: Bioinformatics tool
Licence: GNU General Public License, version 2[1]
Website: www.clustal.org/omega/

Clustal is a series of computer programs used in bioinformatics for multiple sequence alignment.[2] Several versions of Clustal have been released as the algorithm developed; they are listed below, and each tool and its algorithm is described in its own section. The operating systems listed in the sidebar are the combined set across the Clustal tools and may not all be supported by every current version. Clustal Omega supports the widest range of operating systems of all the Clustal tools.

Multiple sequence alignment of CDK4 protein generated with ClustalW. Arrows indicate point mutations.

History

There have been many variations of the Clustal software, all of which are listed below:

  • Clustal: The original software for multiple sequence alignments, created by Des Higgins in 1988, was based on deriving phylogenetic trees from pairwise sequences of amino acids or nucleotides.[3]
  • ClustalV: The second generation of the Clustal software was released in 1992 and was a rewrite of the original Clustal package. It introduced phylogenetic tree reconstruction on the final alignment, the ability to create alignments from existing alignments, and the option to create trees from alignments using a method called Neighbor joining.[4]
  • ClustalW: The third generation, released in 1994, greatly improved upon the previous versions. It improved upon the progressive alignment algorithm in various ways, including allowing individual sequences to be weighted down or up according to similarity or divergence respectively in a partial alignment. It also included the ability to run the program in batch mode from the command line.[5]
  • ClustalX: This version, released in 1997, was the first to have a graphical user interface.[6]
  • Clustal2: Released in 2007, this version updated both ClustalW and ClustalX with higher accuracy and efficiency.[7]
  • ClustalΩ (Omega): The current standard version, which was released in 2011.[8][9]

The papers describing the Clustal software have been very highly cited, with two of them amongst the most cited papers of all time.[10]

The most recent version of the software is available for Windows, Mac OS, and Unix/Linux. It is also commonly used via a web interface on its home page or hosted by the European Bioinformatics Institute.

Name origin

The guide tree in the initial programs was constructed via a UPGMA cluster analysis of the pairwise alignments, hence the name CLUSTAL.[11]cf.[12] The first four versions in 1988 had Arabic numerals (1 to 4), whereas with the fifth version Des Higgins switched to Roman numeral V in 1992.[11]cf.[13][4] In 1994 and in 1997, for the next two versions, the letters after the letter V were used and made to correspond to W for Weighted and X for X Window.[11]cf.[14][6] The name omega was chosen to mark a change from the previous ones.[11]

Function

All variations of the Clustal software align sequences using a heuristic that progressively builds a multiple sequence alignment from a series of pairwise alignments. The method first compares every pair of sequences and records the resulting scores in a distance matrix. A guide tree is then calculated from the matrix using the UPGMA or neighbor-joining method, and is subsequently used to build the multiple sequence alignment by progressively aligning the sequences in order of similarity.[15] Essentially, Clustal creates multiple sequence alignments through three main steps:

  1. Do a pairwise alignment of all the sequences
  2. Create a guide tree (or use a user-defined tree)
  3. Use the guide tree to carry out a multiple alignment

These steps are carried out automatically when you select "Do Complete Alignment". Other options are "Do Alignment from guide tree and phylogeny" and "Produce guide tree only".

Input/Output

This program accepts a wide range of input formats, including NBRF/PIR, FASTA, EMBL/Swiss-Prot, Clustal, GCG/MSF, GCG9 RSF, and GDE.

The output format can be one or many of the following: Clustal, NBRF/PIR, GCG/MSF, PHYLIP, GDE, or NEXUS.

Reading Multiple Sequence Alignment Output

Symbol | Definition | Meaning
* | asterisk | positions that have a single, fully conserved residue
: | colon | conserved: conservation between groups of strongly similar properties (score > 0.5 on the PAM 250 matrix)
. | period | semi-conserved: conservation between groups of weakly similar properties (score ≤ 0.5 on the PAM 250 matrix)
(blank) | blank | non-conserved

The same symbols are shown for both DNA/RNA alignments and protein alignments, so while * (asterisk) symbols are useful to both, the other consensus symbols should be ignored for DNA/RNA alignments.
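
The symbol rules in the table can be sketched in Python. This is a minimal illustration, not Clustal's own code: the function names are hypothetical, the residue groups are an assumption (the "strong" and "weak" sets commonly documented for ClustalW and ClustalX), and columns containing a gap are assumed to be left blank.

```python
# Assumption: residue groups as commonly documented for ClustalW/ClustalX.
# Check them against the documentation of the Clustal version you use.
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def column_symbol(column):
    """Return the consensus symbol for one alignment column of protein residues."""
    residues = {r.upper() for r in column}
    if "-" in residues:
        return " "  # assumption: any gap makes the column non-conserved
    if len(residues) == 1:
        return "*"  # single, fully conserved residue
    if any(residues <= set(group) for group in STRONG_GROUPS):
        return ":"  # all residues fall inside one strongly similar group
    if any(residues <= set(group) for group in WEAK_GROUPS):
        return "."  # all residues fall inside one weakly similar group
    return " "


def consensus_line(aligned):
    """Build the consensus line for a list of equal-length aligned sequences."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must all have the same length")
    return "".join(column_symbol(col) for col in zip(*aligned))


print(repr(consensus_line(["MKSA", "MRAA", "MKC-"])))
```

For DNA/RNA alignments only the `*` positions of such a line would be meaningful, as noted above.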

Settings

Many settings can be adjusted to adapt the alignment algorithm to different circumstances. The main parameters are the gap opening penalty and the gap extension penalty.
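
To show how the two penalties interact, the following sketch scores a gap with an affine cost. It is an illustration only: the function name, the formula convention (opening penalty plus extension penalty times gap length), and the numeric values are assumptions, not Clustal's exact scoring.

```python
def affine_gap_cost(gap_length, gap_open=10.0, gap_extend=0.2):
    """Cost of one contiguous gap under an affine penalty (illustrative values)."""
    if gap_length <= 0:
        return 0.0
    # Assumed convention: pay the opening penalty once, then the extension
    # penalty for every gapped position.
    return gap_open + gap_extend * gap_length


# One gap of length 4 is far cheaper than four separate gaps of length 1,
# so a high opening penalty favours fewer, longer gaps.
one_long_gap = affine_gap_cost(4)
four_short_gaps = 4 * affine_gap_cost(1)
print(one_long_gap, four_short_gaps)
```

Raising the opening penalty makes the aligner more reluctant to start a gap, while raising the extension penalty discourages long gaps.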

Clustal and ClustalV

Brief summary

The original program in the Clustal series of software was developed in 1988 as a way to generate multiple sequence alignments on personal computers. ClustalV was released four years later and greatly improved upon the original, adding and altering a few key features, and was written in C instead of Fortran.

Algorithm

Both versions use the same fast approximate algorithm to calculate the similarity scores between sequences, which in turn produces the pairwise alignments. The algorithm calculates each similarity score as the number of k-tuple matches between two sequences, less a set penalty for gaps. The more similar the sequences, the higher the score; the more divergent, the lower the score. Once the sequences are scored, a dendrogram is generated with the UPGMA method to represent the ordering of the multiple sequence alignment. The higher-ordered sets of sequences are aligned first, followed by the rest in descending order. The algorithm is fast and can handle very large data sets. However, the speed depends on the range chosen for the k-tuple matches for the particular sequence type.[16]
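
The k-tuple scoring idea can be sketched as follows. This is a simplified illustration under stated assumptions: it counts shared k-tuples regardless of position and omits the gap penalty, and the function names and the normalisation into a distance are hypothetical rather than taken from Clustal.

```python
from collections import Counter


def ktuples(seq, k):
    """Count every overlapping k-tuple (word of length k) in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def ktuple_similarity(a, b, k=2):
    """Number of k-tuples the two sequences share (multiset intersection)."""
    shared = ktuples(a, k) & ktuples(b, k)
    return sum(shared.values())


def ktuple_distance(a, b, k=2):
    """Turn the match count into a rough distance between 0 and 1."""
    max_shared = min(len(a), len(b)) - k + 1
    if max_shared <= 0:
        return 1.0  # sequences too short to contain any k-tuple
    return 1.0 - ktuple_similarity(a, b, k) / max_shared


print(ktuple_similarity("ACGTAC", "ACGTTC"), ktuple_distance("ACGTAC", "ACGTTC"))
```

A larger k makes matches rarer and the comparison faster but less sensitive, which is one reason the chosen k-tuple range affects speed.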

Notable ClustalV improvements

Some of the most notable additions in ClustalV are profile alignments and full command-line interface options. Profile alignment lets the user align two or more previous alignments or sequences to a new alignment and move misaligned (low-scoring) sequences further down the alignment order. This gives the user the option to build multiple sequence alignments gradually and methodically, with more control than the basic option.[15] The option to run from the command line greatly expedites the multiple sequence alignment process. Sequences can be run with a simple command,

 clustalv nameoffile.seq

or

 clustalv /infile=nameoffile.seq

and the program will determine what type of sequence it is analyzing. When the program is completed, the output of the multiple sequence alignment as well as the dendrogram go to files with .aln and .dnd extensions respectively. The command line interface uses the default parameters, and doesn't allow for other options.[16]

ClustalW

Brief summary

Depicts the steps the ClustalW software algorithm uses for global alignments

ClustalW, like the other Clustal tools, is used for aligning multiple nucleotide or protein sequences efficiently. It uses progressive alignment methods, which align the most similar sequences first and work down to the least similar sequences until a global alignment is created. ClustalW is a matrix-based algorithm, whereas tools like T-Coffee and Dialign are consistency-based. ClustalW has a fairly efficient algorithm that competes well against other software. The program requires three or more sequences in order to calculate a global alignment; for pairwise sequence alignment (only two sequences), other tools such as EMBOSS or LALIGN should be used.

Diagram showing neighbor-joining method in sequence alignment for bioinformatics

Algorithm

ClustalW uses progressive alignment methods, as stated above. In these, the sequences with the best alignment score are aligned first, then progressively more distant groups of sequences are aligned. This heuristic approach is necessary because of the time and memory required to find the globally optimal solution. The first step of the algorithm is computing a rough distance matrix from pairwise alignments of every pair of sequences. The next step is a neighbor-joining method that uses midpoint rooting to create an overall guide tree.[17] The process it uses to do this is shown in the detailed diagram for the method to the right. The guide tree is then used as a rough template to generate a global alignment.
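
As a rough illustration of the guide-tree step, the sketch below performs one selection step of textbook neighbor joining on a distance matrix. It is not ClustalW's implementation: the function name is hypothetical, the matrix is made-up example data, and branch lengths and midpoint rooting are left out.

```python
def nj_pick_pair(dist):
    """Return the indices (i, j) that neighbor joining would merge next.

    Uses the standard criterion Q(i, j) = (n - 2) * d(i, j) - r_i - r_j,
    where r_i is the sum of distances from i to all other taxa.
    """
    n = len(dist)
    if n < 3:
        raise ValueError("neighbor joining needs at least three taxa")
    totals = [sum(row) for row in dist]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * dist[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:
                best = (q, i, j)
    return best[1], best[2]


# Example data (assumed for illustration): five taxa a..e.
example = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(nj_pick_pair(example))
```

In this example the pair with the smallest raw distance is (3, 4), yet the criterion selects (0, 1), because neighbor joining also accounts for how far each taxon is from all the others.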

Time complexity

ClustalW has a time complexity of O(N^2) because of its use of the neighbor-joining method. In the updated version (ClustalW2), there is an option built into the software to use UPGMA, which is faster with large input sizes. The command line flag to use it instead of neighbor-joining is:

-clustering=UPGMA

For example, on a standard desktop, running UPGMA on 10,000 sequences would produce results in less than a minute, while neighbor-joining would take over an hour.[18] Running the ClustalW algorithm with this adjustment therefore saves a significant amount of time. ClustalW2 also has an option to use iterative alignment to increase alignment accuracy. While it is not necessarily faster or more efficient complexity-wise, the increase in accuracy is valuable and can be useful for smaller data sets. These are the command line flags to achieve this:

-Iteration=Alignment
-Iteration=Tree
-numiters

The first command line option refines the final alignment. The second option incorporates the scheme into the progressive alignment step of the algorithm. The third specifies the number of iteration cycles where the default value is set to 3.[18]
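
The flags above could be combined programmatically as in the following sketch, which only builds an argument list and runs nothing. The flag spellings are copied from the text; the executable name `clustalw2`, the `-infile=` option spelling, the `-numiters=<n>` value syntax, the helper name, and the file name are assumptions to be checked against the program's own help output.

```python
def clustalw2_args(infile, use_upgma=False, iteration=None, numiters=None):
    """Build an argument list for a ClustalW2 run (nothing is executed)."""
    # Assumption: executable name and -infile= spelling are illustrative.
    args = ["clustalw2", f"-infile={infile}"]
    if use_upgma:
        # Use UPGMA instead of neighbor-joining for the guide tree.
        args.append("-clustering=UPGMA")
    if iteration is not None:
        if iteration not in ("Alignment", "Tree"):
            raise ValueError("iteration must be 'Alignment' or 'Tree'")
        args.append(f"-Iteration={iteration}")
    if numiters is not None:
        # Assumption: the flag takes its value as -numiters=<n>.
        args.append(f"-numiters={numiters}")
    return args


# Hypothetical input file name.
print(" ".join(clustalw2_args("sequences.fasta", use_upgma=True,
                              iteration="Tree", numiters=3)))
```

A list built this way could be handed to `subprocess.run(args, check=True)` once the flag spellings have been confirmed for the installed version.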

Accuracy and Results

The algorithm ClustalW uses provides a nearly optimal result. It does exceptionally well when the data set contains sequences with varied degrees of divergence, because in such data sets the guide tree becomes less sensitive to noise. ClustalW was one of the first multiple sequence alignment algorithms to combine pairwise alignment and global alignment to increase speed, but this trade-off results in decreased accuracy.

ClustalW, when compared to other multiple sequence alignment algorithms in 2014, performed as one of the quickest while still maintaining an acceptable level of accuracy, but there was room for improvement compared to consistency-based competitors such as T-Coffee.[19] The accuracy of ClustalW when tested against MAFFT, T-Coffee, Clustal Omega, and other algorithms was the lowest for full-length sequences, but still considered acceptable. It had the most memory (RAM) efficient algorithm out of all those tested in the study.[19] Updates and improvements to the algorithm have been made in ClustalW2 to increase accuracy while maintaining its greatly valued speed.[18]

Clustal Omega

Brief summary

Flowchart depicting the step-by-step algorithm used in Clustal Omega.

ClustalΩ (alternatively written as Clustal O and Clustal Omega) is a fast and scalable program written in C and C++ used for multiple sequence alignment. It uses seeded guide trees and a new HMM engine that focuses on two profiles to generate these alignments.[20][21] The program requires three or more sequences in order to calculate the multiple sequence alignment; for two sequences, pairwise sequence alignment tools (EMBOSS, LALIGN) should be used. Clustal Omega is widely viewed as one of the fastest online implementations of all multiple sequence alignment tools and still ranks high in accuracy among both consistency-based and matrix-based algorithms.

Algorithm

The structure of a profile HMM used in the implementation of Clustal Omega is shown here.

Clustal Omega has five main steps in order to generate the multiple sequence alignment. The first is producing a pairwise alignment using the k-tuple method, also known as the word method. In summary, this is a heuristic method that is not guaranteed to find an optimal alignment, but is significantly more efficient than the dynamic programming method of alignment. After that, the sequences are clustered using the modified mBed method.[22] The mBed method calculates pairwise distances using sequence embedding. This step is followed by the k-means clustering method. Next, the guide tree is constructed using the UPGMA method. This is shown as multiple guide tree steps leading into one final guide tree construction because of the way the UPGMA algorithm works: at each step (each diamond in the flowchart), the nearest two clusters are combined, and this is repeated until the final tree can be assessed. In the final step, the multiple sequence alignment is produced using the HHAlign package from the HH-Suite, which uses two profile HMMs. A profile HMM is a linear state machine consisting of a series of nodes, each of which corresponds roughly to a position (column) in the alignment from which it was built.[23]
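
The repeated "combine the nearest two clusters" step of UPGMA can be sketched as below. This is a minimal textbook version for illustration, not Clustal Omega's code: the function name, the nested-tuple tree representation, and the example distances are assumptions, and branch lengths are omitted.

```python
def upgma(names, dist):
    """Build a guide tree as nested tuples by repeatedly merging the closest clusters."""
    # Each cluster id maps to (subtree, number of leaves in the subtree).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(d, key=d.get)  # nearest pair of clusters
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        new_d = {}
        for c in clusters:
            dac = d[(min(a, c), max(a, c))]
            dbc = d[(min(b, c), max(b, c))]
            # Size-weighted average keeps the distance equal to the mean
            # over all leaf pairs (the "arithmetic mean" in UPGMA).
            new_d[(c, next_id)] = (size_a * dac + size_b * dbc) / (size_a + size_b)
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        d.update(new_d)
        clusters[next_id] = ((tree_a, tree_b), size_a + size_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Example distances (assumed for illustration).
example_names = ["A", "B", "C", "D"]
example_dist = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
print(upgma(example_names, example_dist))
```

Here A and B are merged first, then C joins them, and D is attached last, which is the order in which a progressive aligner would then combine the sequences.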

Time complexity

Computing an exact optimal alignment of N sequences of length L has a computational complexity of O(L^N), making it prohibitive for even small numbers of sequences. Clustal Omega uses a modified version of mBed, which has a complexity of O(N log N),[22][24] and produces guide trees that are just as accurate as those from conventional methods. The speed and accuracy of the guide trees in Clustal Omega are attributed to this modified mBed algorithm, which also reduces the computational time and memory requirements to complete alignments on large datasets.
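
The saving from sequence embedding can be illustrated by counting distance computations. The sketch below is a simplification under stated assumptions: the function names, the toy distance measure, and the choice of roughly (log2 N)^2 seed sequences are illustrative and are not taken from the Clustal Omega source.

```python
import math


def full_matrix_comparisons(n):
    """Distance computations needed for a full pairwise matrix of n sequences."""
    return n * (n - 1) // 2


def embedding_comparisons(n, n_seeds):
    """Distance computations needed to embed n sequences against n_seeds seeds."""
    return n * n_seeds


def mismatch_fraction(a, b):
    """Toy distance: fraction of differing positions (equal-length input assumed)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def embed(seqs, seeds, distance):
    """Represent each sequence as its vector of distances to the seed sequences."""
    return [[distance(seq, seed) for seed in seeds] for seq in seqs]


n = 10_000
n_seeds = math.ceil(math.log2(n) ** 2)  # assumed seed count for illustration
print(full_matrix_comparisons(n), embedding_comparisons(n, n_seeds))
print(embed(["AAAA", "AATT"], ["AAAA"], mismatch_fraction))
```

The embedded vectors can then be clustered directly, so the number of sequence comparisons grows far more slowly than the quadratic cost of a full distance matrix.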

Accuracy and results

The accuracy of Clustal Omega on a small number of sequences is, on average, very similar to that of aligners considered high quality. The difference appears with large data sets of hundreds of thousands of sequences. In these cases, Clustal Omega outperforms other algorithms across the board: its completion time and overall quality are consistently better than those of other programs.[25] It is capable of aligning 100,000+ sequences on one processor in a few hours.

Clustal Omega uses the HHAlign package of the HH-Suite, which aligns two profile hidden Markov models instead of performing a profile-profile comparison. This significantly improves sensitivity and alignment quality.[25] This, combined with the mBed method, gives Clustal Omega its advantage over other sequence aligners: the results are both accurate and fast to compute.

On data sets with non-conserved terminal bases, Clustal Omega may be more accurate than Probcons and T-Coffee, even though both of these are consistency-based algorithms, in contrast to Clustal Omega. In an efficiency test of programs that produce high accuracy scores, MAFFT was the fastest, closely followed by Clustal Omega. Both were faster than T-Coffee; however, MAFFT and Clustal Omega required more memory to run.[19]

Clustal2 (ClustalW/ClustalX)

Clustal2 is the packaged release of both the command-line ClustalW and the graphical ClustalX. Neither is a new tool; both are updated and improved versions of the previous implementations described above. Both downloads come precompiled for many operating systems, such as Linux, Mac OS X and Windows (both XP and Vista). This release was designed to make the website more organized and user friendly, as well as to update the source code to its most recent version. Clustal2 is version 2 of both ClustalW and ClustalX, which is where it gets its name. Past versions can still be found on the website; however, every precompiled build is now up to date.

See also

References

  1. ^ See file COPYING, in source archive [1] Archived 2021-06-12 at the Wayback Machine. Accessed 2014-01-15.
  2. ^ Chenna R, Sugawara H, Koike T, Lopez R, Gibson TJ, Higgins DG, Thompson JD (July 2003). "Multiple sequence alignment with the Clustal series of programs". Nucleic Acids Research. 31 (13): 3497–500. doi:10.1093/nar/gkg500. PMC 168907. PMID 12824352.
  3. ^ Higgins DG, Sharp PM (December 1988). "CLUSTAL: a package for performing multiple sequence alignment on a microcomputer". Gene. 73 (1): 237–44. doi:10.1016/0378-1119(88)90330-7. PMID 3243435.
  4. ^ a b Higgins DG, Bleasby AJ, Fuchs R (April 1992). "CLUSTAL V: improved software for multiple sequence alignment". Computer Applications in the Biosciences. 8 (2): 189–91. doi:10.1093/bioinformatics/8.2.189. PMID 1591615.
  5. ^ Thompson, J. D.; Higgins, D. G.; Gibson, T. J. (1994-11-11). "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice". Nucleic Acids Research. 22 (22): 4673–4680. doi:10.1093/nar/22.22.4673. ISSN 0305-1048. PMC 308517. PMID 7984417.
  6. ^ a b Thompson JD, Gibson TJ, Plewniak F, Jeanmougin F, Higgins DG (December 1997). "The CLUSTAL_X windows interface: flexible strategies for multiple sequence alignment aided by quality analysis tools". Nucleic Acids Research. 25 (24): 4876–82. doi:10.1093/nar/25.24.4876. PMC 147148. PMID 9396791.
  7. ^ Dineen, David. "Clustal W and Clustal X Multiple Sequence Alignment". www.clustal.org. Archived from the original on 2018-04-16. Retrieved 2018-04-24.
  8. ^ Sievers F, Higgins DG (2014-01-01). "Clustal Omega, Accurate Alignment of Very Large Numbers of Sequences". In Russell DJ (ed.). Multiple Sequence Alignment Methods. Methods in Molecular Biology. Vol. 1079. Humana Press. pp. 105–116. doi:10.1007/978-1-62703-646-7_6. ISBN 9781627036450. PMID 24170397.
  9. ^ Sievers F, Higgins DG (2002-01-01). Clustal Omega. Vol. 48. John Wiley & Sons, Inc. pp. 3.13.1–16. doi:10.1002/0471250953.bi0313s48. ISBN 9780471250951. PMID 25501942. S2CID 1762688.
  10. ^ Van Noorden R, Maher B, Nuzzo R (October 2014). "The top 100 papers". Nature. 514 (7524): 550–3. Bibcode:2014Natur.514..550V. doi:10.1038/514550a. PMID 25355343.
  11. ^ a b c d Des Higgins, presentation at the SMBE 2012 conference in Dublin.
  12. ^ Higgins DG, Sharp PM (December 1988). "CLUSTAL: a package for performing multiple sequence alignment on a microcomputer". Gene. 73 (1): 237–44. doi:10.1016/0378-1119(88)90330-7. PMID 3243435.
  13. ^ Higgins DG, Sharp PM (April 1989). "Fast and sensitive multiple sequence alignments on a microcomputer". Computer Applications in the Biosciences. 5 (2): 151–3. doi:10.1093/bioinformatics/5.2.151. PMID 2720464.
  14. ^ Thompson JD, Higgins DG, Gibson TJ (November 1994). "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice". Nucleic Acids Research. 22 (22): 4673–80. doi:10.1093/nar/22.22.4673. PMC 308517. PMID 7984417.
  15. ^ a b "CLUSTAL W Algorithm". Archived from the original on 2016-12-01. Retrieved 2018-04-24.
  16. ^ a b Higgins, Des (June 1991). "Clustal V Multiple Sequence Alignments. Documentation (Installation and Usage)". www.aua.gr. Archived from the original on 2023-04-12. Retrieved 2022-08-27.
  17. ^ "About CLUSTALW". www.megasoftware.net. Archived from the original on 2018-04-24. Retrieved 2018-04-24.
  18. ^ a b c Larkin, M.A.; Blackshields, G.; Brown, N.P.; Chenna, R.; McGettigan, P.A.; McWilliam, H.; Valentin, F.; Wallace, I.M.; Wilm, A. (2007-09-10). "Clustal W and Clustal X version 2.0". Bioinformatics. 23 (21): 2947–2948. doi:10.1093/bioinformatics/btm404. ISSN 1367-4803. PMID 17846036.
  19. ^ a b c Pais FS, Ruy PC, Oliveira G, Coimbra RS (March 2014). "Assessing the efficiency of multiple sequence alignment programs". Algorithms for Molecular Biology. 9 (1): 4. doi:10.1186/1748-7188-9-4. PMC 4015676. PMID 24602402.
  20. ^ EMBL-EBI. "Clustal Omega < Multiple Sequence Alignment < EMBL-EBI". www.ebi.ac.uk. Archived from the original on 2018-04-29. Retrieved 2018-04-18.
  21. ^ Dineen, David. "Clustal Omega, ClustalW and ClustalX Multiple Sequence Alignment". www.clustal.org. Archived from the original on 2010-05-29. Retrieved 2018-04-18.
  22. ^ a b Blackshields G, Sievers F, Shi W, Wilm A, Higgins DG (May 2010). "Sequence embedding for fast construction of guide trees for multiple sequence alignment". Algorithms for Molecular Biology. 5: 21. doi:10.1186/1748-7188-5-21. PMC 2893182. PMID 20470396.
  23. ^ "Profile HMM Analysis". www.biology.wustl.edu. Archived from the original on 2019-07-24. Retrieved 2018-05-01.
  24. ^ Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J, Thompson JD, Higgins DG (October 2011). "Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega". Molecular Systems Biology. 7 (1): 539. doi:10.1038/msb.2011.75. PMC 3261699. PMID 21988835.
  25. ^ a b Daugelaite J, O' Driscoll A, Sleator RD (2013). "An Overview of Multiple Sequence Alignments and Cloud Computing in Bioinformatics". ISRN Biomathematics. 2013: 1–14. doi:10.1155/2013/615630. ISSN 2090-7702.

External links