Bioinformatics and Applications in Biotechnology
Published in Ram Chandra, R.C. Sobti, Microbes for Sustainable Development and Bioremediation, 2019
A major application of bioinformatics software, tools, and storage systems starts with data generated through genome sequencing. Because next-generation sequencing (NGS) produces short reads, which could be up to 30–40 bp, reassembling a genome from millions of such reads, as in humans, is a mammoth task. Whole-genome shotgun sequencing (WGS) samples the genomic DNA as many short reads, and assemblers such as SSAKE, SHARCGS, VCAKE, and Velvet are widely used to reassemble the fragments. De novo WGS assembly of NGS data poses a special problem because there is no draft reference to fall back on; a further challenge is placing the fragments that come from repetitive regions of the DNA. One way out is to oversample the target genome with reads from different random positions, find the regions where the reads overlap, and then reconstruct and resolve the genome. The notion of k-mers, substrings of k consecutive bases, is used to reduce assembly complexity and computational cost considerably. A graph of the overlaps is then built and analyzed to yield contigs and, finally, sequences: in an overlap graph the nodes represent the reads and the edges represent the overlaps between them, whereas in a de Bruijn graph the nodes are short fixed-length substrings (k-mers) rather than whole reads. SSAKE, one of the earliest assemblers, first chooses reads with end-to-end confirmation and then considers candidates with multiple extensions. VCAKE is also an iterative extension algorithm, and it can incorporate imperfect matches during contig extension. These and many other widely used genome assembly programs have revolutionized the compilation of NGS data and its availability (Miller et al., 2010).
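The k-mer and graph ideas above can be illustrated with a minimal Python sketch. It is not the method of SSAKE, VCAKE, or Velvet; it assumes error-free toy reads and uses one common de Bruijn convention in which each k-mer contributes an edge from its (k-1)-base prefix to its (k-1)-base suffix. The function names (`kmers`, `de_bruijn_graph`, `extend_contig`) and the reads are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def kmers(read, k):
    """Return all overlapping substrings of length k from a read."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer prefix to the set of (k-1)-mer suffixes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def extend_contig(graph, start):
    """Walk from start while each node has exactly one successor.

    Stops at a branch, a dead end, or a node already visited (a cycle).
    Only outgoing edges are checked, which is a simplification.
    """
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (nxt,) = graph[node]
        if nxt in seen:
            break
        contig += nxt[-1]  # each step adds one new base
        seen.add(nxt)
        node = nxt
    return contig


# Toy, error-free reads that overlap each other (hypothetical data).
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
graph = de_bruijn_graph(reads, k=4)
contig = extend_contig(graph, "ATG")
print(contig)  # ATGGCGTGCA
```

Real assemblers must additionally handle sequencing errors, reverse complements, and repeats, which create the branches at which this walk simply stops.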
A hybrid algorithm for identifying partially conserved regions in multiple sequence alignment
Published in International Journal of Computers and Applications, 2021
Gamage Kokila Kasuni Perera, Champi Thusangi Wannige
The K-mer distance measure is one of the metrics used in alignment-free sequence comparison methods based on fixed-length substrings. A K-mer is a string of K characters, where K is some fixed integer. K-mer distance methods first calculate the frequencies or counts of all possible K-mers in the sequences; distance measures are then defined on these values (the counts of each K-mer). Methods based on ‘spaced K-mers’ consider statistics of K-mers regardless of their position in the sequence. According to [19], these spaced K-mer methods are better at discovering sequence similarities than methods based on contiguous K-mers. The main reason for this superiority is that, because of insertions and deletions, even two closely related sequences may not show high similarity when compared on position-specific K-mers. We have therefore used a K-mer based sequence comparison to quantify the similarity between two sequences, using the count of each K-mer in a sequence as the attributes in the clustering method.
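As a rough illustration of the counting step described above, the following Python sketch builds a count vector over all possible K-mers and compares two sequences. It is a sketch under stated assumptions, not the authors' implementation: it uses contiguous K-mers only, a DNA alphabet, and Euclidean distance, none of which the excerpt specifies. The function names are hypothetical.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_counts(seq, k):
    """Count every overlapping contiguous K-mer in a sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def count_vector(seq, k, alphabet="ACGT"):
    """Counts of all len(alphabet)**k possible K-mers, in a fixed order."""
    counts = kmer_counts(seq, k)
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length count vectors (assumed measure)."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Two toy sequences that differ by a single substitution (hypothetical data).
a = count_vector("ACGTACGT", 2)
b = count_vector("ACGTTCGT", 2)
d = euclidean_distance(a, b)
print(len(a), d)  # 16 2.0
```

Vectors of this kind can serve directly as the per-sequence attributes passed to a clustering method; because the vector length grows as the alphabet size raised to the power K, small values of K keep it manageable.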
A Recurrent Neural Network approach for whole genome bacteria identification
Published in Applied Artificial Intelligence, 2021
Luis Lugo, Emiliano Barreto-Hernández
Regarding sequence representations, one-hot vector encoding (Giang Nguyen et al. 2016) converts genomic or protein sequences into a two-dimensional numerical matrix. Another digital encoding, the k-mers representation (Rizzo et al. 2015), generates a fixed-length representation using occurrences of overlapping subsequences of length k. K-mers are small DNA or protein sequences of length k. Based on the occurrence of those small sequences, the system computes a spectral representation of the input data. However, an important aspect of sequence labeling problems is the use of context in the input sequence. Based on results from Natural Language Processing (NLP), seq2vec (Kimothi et al. 2016) extends the use of a single length k in a k-mers representation to create a distributed representation of the sequences in a Euclidean space. The distributed representation, which considers different values of k, has the potential to capture contextual information in the original sequence. The dna2vec system (Ng 2017) uses a distributed representation as well, generating vectors in a 100-dimensional space. The cosine similarity of those vectors is correlated with the Needleman–Wunsch similarity score.
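To make two of the ideas above concrete, here is a minimal Python sketch of one-hot encoding and of cosine similarity between two vectors. It is an illustration only and does not reproduce seq2vec or dna2vec; the DNA alphabet, the all-zero row for unknown symbols, and the function names are assumptions made for the example.

```python
from math import sqrt


def one_hot(seq, alphabet="ACGT"):
    """Encode a sequence as a len(seq) x len(alphabet) matrix of 0s and 1s."""
    index = {base: i for i, base in enumerate(alphabet)}
    matrix = []
    for base in seq:
        row = [0] * len(alphabet)
        if base in index:  # symbols outside the alphabet (e.g. N) stay all-zero: an assumption
            row[index[base]] = 1
        matrix.append(row)
    return matrix


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


encoded = one_hot("ACGT")
for row in encoded:
    print(row)

# Hypothetical embedding vectors; real dna2vec vectors have 100 dimensions.
print(cosine_similarity([1.0, 2.0, 0.0], [2.0, 4.0, 0.0]))
```

The one-hot matrix keeps positional information but grows with sequence length, whereas count-based or distributed representations have a fixed size regardless of how long the input sequence is.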