Sequence clustering – Knowledge and References

Clustering Biological Data

Published in Charu C. Aggarwal, Chandan K. Reddy, Data Clustering, 2018

Chandan K. Reddy, Mohammad Al Hasan, Mohammed J. Zaki

A graph-based sequence clustering method represents the sequences in a similarity graph, in which a vertex represents a sequence and an edge represents the similarity relation between the corresponding pair of sequences. In such a representation, a partition of the similarity graph represents a clustering of the input sequences. The crucial requirement in a graph-based sequence clustering method is to obtain the similarity graph efficiently. A brute-force approach computes the similarity values between all (n choose 2) pairs of sequences (here, n is the number of sequences) and then uses a user-defined threshold to decide which pairs of sequences are joined by an edge in the similarity graph. Clearly this is inefficient, as it requires computing O(n^2) similarity scores explicitly, so many graph-based sequence clustering algorithms use a more efficient method for similarity graph construction. Once a similarity graph is obtained, one of the many available graph-clustering methods [90] can be used to obtain the desired clustering. In Algorithm 35, we present pseudocode for a graph-based sequence clustering algorithm; depending on the specific method chosen for the similarity routine (Line 4) and for the graph clustering (Line 9), various graph-based sequence clustering methods can be obtained.
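The brute-force variant described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions and not the chapter's Algorithm 35: the similarity routine is assumed to be a Jaccard score over k-mer sets, the graph clustering step is assumed to be plain connected components, and all function names (`similarity`, `build_similarity_graph`, `cluster_sequences`) are hypothetical.

```python
from itertools import combinations


def kmer_set(seq, k=3):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def similarity(a, b, k=3):
    """Assumed similarity routine: Jaccard score of the two k-mer sets."""
    ka, kb = kmer_set(a, k), kmer_set(b, k)
    if not ka and not kb:
        return 1.0
    return len(ka & kb) / len(ka | kb)


def build_similarity_graph(seqs, threshold, k=3):
    """Brute force: score every pair and add an edge when the score
    reaches the user-defined threshold."""
    graph = {i: set() for i in range(len(seqs))}
    for i, j in combinations(range(len(seqs)), 2):
        if similarity(seqs[i], seqs[j], k) >= threshold:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def connected_components(graph):
    """Assumed graph clustering step: each connected component is a cluster."""
    seen, clusters = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:
            v = stack.pop()
            component.append(v)
            for w in graph[v]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        clusters.append(sorted(component))
    return clusters


def cluster_sequences(seqs, threshold=0.5, k=3):
    """Cluster sequences; returns lists of indices into seqs."""
    return connected_components(build_similarity_graph(seqs, threshold, k))


if __name__ == "__main__":
    reads = ["ACGTACGT", "ACGTACGA", "TTTTGGGG", "TTTTGGGC"]
    print(cluster_sequences(reads, threshold=0.5))
```

Swapping `similarity` for an alignment-based score, or `connected_components` for another graph-clustering method, yields a different member of the same family; the pairwise loop is the quadratic step that practical methods try to avoid.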

Strategy for the formation of microalgae-bacteria aggregates in high-rate algal ponds

Published in Environmental Technology, 2023

Antonio G. dos Santos Neto, Martín Barragán-Trinidad, Lourdinha Florêncio, Germán Buitrón

The sequences were analyzed using the QIIME software (Quantitative Insights Into Microbial Ecology) [30]. Sequence prefix dereplication and sequence clustering at 4% divergence were performed using the USEARCH algorithm [31]. Although a 3% divergence is the usual default, a more flexible divergence threshold is advisable when describing complex communities that span a wide taxonomic range and consist of species of variable diversity [32]. OTU (Operational Taxonomic Unit) selection was performed using the UPARSE algorithm [33]. Chimera analysis was performed de novo using the UCHIME software [34]. Each OTU was classified taxonomically from its consensus sequence, which was analyzed in the RDP classifier by comparison with high-quality sequences derived from the NCBI database.

Microbial consortia adaptation to substrate changes in anaerobic digestion

Published in Preparative Biochemistry & Biotechnology, 2022

Priyanka S. Dargode, Pooja P. More, Suhas S. Gore, Bhupal R. Asodekar, Manju B. Sharma, Arvind M. Lali

The raw sequence reads were first filtered using the Usearch61 algorithm to remove artifacts introduced during the PCR process. The flashed/stitched sequences were then used for Operational Taxonomic Unit (OTU) picking. Similar sequences, i.e., sequences coming from the same genus, were clustered together into one representative taxonomic unit called an OTU. This sequence clustering required a minimum of 97% sequence similarity, as implemented in the UCLUST algorithm. In the next step, a representative sequence was picked for each OTU, and taxonomic names were assigned to these representative sequences at 90% sequence similarity using the UCLUST algorithm.
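The idea of clustering reads into OTUs at a 97% similarity threshold and keeping one representative per OTU can be illustrated with a greedy, centroid-based sketch. This is not the UCLUST implementation: the `identity` function below is a hypothetical stand-in that counts matching positions without any alignment, and each read is simply assigned to the first centroid that meets the threshold.

```python
def identity(a, b):
    """Hypothetical stand-in for alignment identity: fraction of matching
    positions, compared without gaps, over the longer sequence length."""
    if not a and not b:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_otu_clustering(seqs, threshold=0.97):
    """Assign each sequence to the first centroid it matches at or above
    the threshold; otherwise make it the centroid of a new OTU."""
    centroids = []  # one representative sequence per OTU
    otus = []       # members of each OTU, in the same order as centroids
    for seq in seqs:
        for idx, centroid in enumerate(centroids):
            if identity(seq, centroid) >= threshold:
                otus[idx].append(seq)
                break
        else:
            # No existing centroid was similar enough: start a new OTU.
            centroids.append(seq)
            otus.append([seq])
    return centroids, otus


if __name__ == "__main__":
    ref = "ACGT" * 25                 # 100 bases
    close = "TT" + ref[2:]            # 2 mismatches, 98% identity
    distant = "GGGGG" + ref[5:]       # 4 mismatches, 96% identity
    reps, clusters = greedy_otu_clustering([ref, close, distant])
    print(len(reps), [len(c) for c in clusters])
```

Because the procedure is greedy, the result depends on input order; the first sequence seen in each OTU becomes its representative in this sketch.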

Bacterial diversity of heavy crude oil based mud samples near Omani oil wells

Published in Petroleum Science and Technology, 2021

Abdullah Al-Sayegh, Yahya Al-Wahaibi, Sanket J. Joshi, Saif Al-Bahry, Abdulkadir Elshafie, Ali Al-Bemani

Sequences were first filtered based on Phred score (Q ≥ 20), and chimeric sequences were removed. The filtered sequences were then clustered into OTUs at >97% similarity using the “uclust” method (Edgar 2010), a similarity-based sequence clustering algorithm. The identified OTUs were aligned against the Greengenes 16S rRNA database of Bacteria and Archaea for taxonomic assignment, which was carried out using the QIIME workflow.
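A Phred-based filter of the kind mentioned above can be sketched as follows. A Phred score Q corresponds to an error probability of 10^(-Q/10), so Q = 20 means roughly one expected error in 100 base calls. The excerpt does not say whether the cutoff applies to the mean or to every base, so this sketch assumes the mean quality per read and the common Phred+33 ASCII encoding; the function names are hypothetical.

```python
def phred_scores(quality, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality]


def passes_quality(quality, min_q=20, offset=33):
    """Assumed criterion: keep a read when its mean Phred score >= min_q."""
    scores = phred_scores(quality, offset)
    return bool(scores) and sum(scores) / len(scores) >= min_q


def filter_reads(reads, min_q=20):
    """reads is an iterable of (name, sequence, quality_string) tuples."""
    return [read for read in reads if passes_quality(read[2], min_q)]


if __name__ == "__main__":
    reads = [
        ("r1", "ACGT", "IIII"),  # every base Q40
        ("r2", "ACGT", "####"),  # every base Q2
        ("r3", "ACGT", "5555"),  # every base Q20
    ]
    print([name for name, _, _ in filter_reads(reads)])
```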