Explore chapters and articles related to this topic
Genomic Informatics in the Healthcare System
Published in Salvatore Volpe, Health Informatics, 2022
After demultiplexing, analysis involves performing a quality control (QC) check on the sequenced data (typically FASTQ) to assess read-length distribution, quality scores, guanine-cytosine (GC) content, overrepresented sequences, and k-mer content. The purpose of these checks is to determine whether the generated sequences show indicators of poor sequence quality. Additional steps, such as adapter and poor-quality sequence trimming, may be required depending on the QC results and pipeline configuration. The QC check phase is followed by alignment of the overlapping reads in a FASTQ file against a reference human genome. At the alignment stage, good alignment algorithms are designed to overcome the ambiguities caused by repetitive sequence and sequencing errors. The aligned reads, which are tagged with metadata including alignment scores, are output in Sequence Alignment/Map (SAM) format or its binary form (BAM).
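The QC metrics named above can be illustrated with a minimal Python sketch. It is not the pipeline the chapter describes: the function names, the Phred+33 quality encoding, and the mean-quality cutoff of 20 are assumptions chosen for illustration, and the parser assumes simple four-line FASTQ records.

```python
from collections import Counter
from io import StringIO


def parse_fastq(handle):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    while True:
        header = handle.readline().rstrip()
        if not header:
            break  # end of file (or a blank line)
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line, ignored
        qual = handle.readline().rstrip()
        yield header[1:], seq, qual


def gc_fraction(seq):
    """Fraction of bases that are G or C."""
    if not seq:
        return 0.0
    upper = seq.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def mean_phred(qual, offset=33):
    """Mean per-base quality, assuming Phred+33 encoding (an assumption)."""
    if not qual:
        return 0.0
    return sum(ord(ch) - offset for ch in qual) / len(qual)


def qc_summary(handle, min_mean_quality=20.0):
    """Summarise read lengths, GC content, and reads below a quality cutoff.

    The cutoff of 20 is a hypothetical default, not a value from the text.
    """
    n_reads = 0
    gc_total = 0.0
    length_counts = Counter()
    low_quality = []
    for name, seq, qual in parse_fastq(handle):
        n_reads += 1
        gc_total += gc_fraction(seq)
        length_counts[len(seq)] += 1
        if mean_phred(qual) < min_mean_quality:
            low_quality.append(name)
    return {
        "n_reads": n_reads,
        "mean_gc": gc_total / n_reads if n_reads else 0.0,
        "length_counts": dict(length_counts),
        "low_quality_reads": low_quality,
    }


# Two toy reads: r1 is all G/C with high quality, r2 is all A/T with quality 0.
EXAMPLE_FASTQ = "@r1\nGGCC\n+\nIIII\n@r2\nATAT\n+\n!!!!\n"
summary = qc_summary(StringIO(EXAMPLE_FASTQ))
```

In practice a dedicated QC tool would also report overrepresented sequences and k-mer content, which this sketch leaves out.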
Cancer Informatics
Published in Trevor F. Cox, Medical Statistics for Cancer Studies, 2022
The aligned fragment sequences (short reads) are stored in SAM and BAM files. SAM files are in text format and BAM files are compressed binary versions of SAM files. Different applications of NGS technologies require different numbers of reads from individual samples. In the case of genomic DNA sequencing, the depth of coverage tells how many times a target nucleotide is sequenced in a sample.
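Depth of coverage can be made concrete with a small Python sketch. It assumes each aligned read is reduced to a 0-based, half-open (start, end) interval on the reference, which is a simplification: real alignments may contain gaps and clipped bases. The function names are hypothetical.

```python
def per_base_depth(reads, target_start, target_end):
    """Count how many reads cover each position of [target_start, target_end).

    reads: iterable of (start, end) intervals, 0-based and half-open.
    """
    depth = [0] * (target_end - target_start)
    for start, end in reads:
        lo = max(start, target_start)
        hi = min(end, target_end)
        for pos in range(lo, hi):
            depth[pos - target_start] += 1
    return depth


def mean_depth(depth):
    """Average depth over the target region."""
    return sum(depth) / len(depth) if depth else 0.0


# Three toy reads over a 10 bp target.
reads = [(0, 5), (3, 8), (4, 10)]
depth = per_base_depth(reads, 0, 10)
```

Here position 4 is covered by all three reads, so its depth is 3, while the mean depth over the target equals the total aligned bases (16) divided by the target length (10).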
Molecular Diagnosis of Autosomal Dominant Polycystic Kidney Disease
Published in Jinghua Hu, Yong Yu, Polycystic Kidney Disease, 2019
Matthew Lanktree, Amirreza Haghighi, Xueweng Song, York Pei
After sequencing, FASTQ files from the sequencer undergo quality control and are demultiplexed by assigning each read to the proper sample using the unique oligonucleotide barcode. The trimmed raw sequence is aligned to the human reference genome (hg19, NCBI build GRCh37) and PKD1 targeted region using the Burrows-Wheeler Aligner BWA-MEM alignment algorithm (BWA-0.7.12).26 BWA sequence alignments are converted into an analysis-ready binary alignment (BAM) file using SAMtools, and PCR duplicate reads are marked using Picard tools 1.123. Local realignment and base recalibration are performed using the Genome Analysis Toolkit (GATK 3.6).27 Using the BAM file as input, single nucleotide variations and small insertions or deletions (InDels) are detected simultaneously using GATK HaplotypeCaller 3.6, which produces a variant call format (VCF) file containing all the observed variation. For detecting mosaic or somatic variants, both HaplotypeCaller 3.6 and FreeBayes caller v0.9.20-8-gfef284a (https://github.com/ekg/freebayes/) are employed. FreeBayes has a tunable allele frequency setting, and we set the alternate allele fraction ≥5% for maximum sensitivity. To exclude false-positive calls, all variants are visually inspected on the Golden Helix Genome Browser (Golden Helix, Bozeman, Montana, USA), which allows for observation of the variants at the level of the individual read. Poly-T, -C, -A, -G stretches, GC-rich areas and InDel regions may influence the mapping qualities or variant calls, creating false-positive calls. For assessment of mosaic or somatic variants with low alternate allele fraction (≤5%), the recurrent variants observed in multiple unrelated samples are considered sequencing artifacts and are excluded.
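The recurrence rule for low-fraction calls can be sketched in Python as follows. This is one possible reading of the rule, not the authors' code: the record layout, the variant identifiers, and the choice of "two or more samples" as the meaning of "multiple" are assumptions, and all input samples are assumed to be unrelated.

```python
from collections import defaultdict


def alt_allele_fraction(ref_count, alt_count):
    """Alternate-allele reads divided by all reads at the site."""
    total = ref_count + alt_count
    return alt_count / total if total else 0.0


def drop_recurrent_low_fraction(calls, low_fraction=0.05, min_samples=2):
    """Remove low-fraction calls that recur across samples.

    calls: list of dicts with keys 'sample', 'variant', 'ref_count',
    'alt_count' (a hypothetical layout). A variant whose low-fraction calls
    appear in at least min_samples distinct samples is treated as a
    sequencing artifact, and those low-fraction calls are dropped.
    """
    def is_low(call):
        fraction = alt_allele_fraction(call["ref_count"], call["alt_count"])
        return fraction <= low_fraction

    samples_by_variant = defaultdict(set)
    for call in calls:
        if is_low(call):
            samples_by_variant[call["variant"]].add(call["sample"])

    artifacts = {
        variant
        for variant, samples in samples_by_variant.items()
        if len(samples) >= min_samples
    }
    return [
        call for call in calls
        if not (call["variant"] in artifacts and is_low(call))
    ]


# Toy calls: vA is low-fraction in two samples; vB and vC are not recurrent.
calls = [
    {"sample": "S1", "variant": "vA", "ref_count": 97, "alt_count": 3},
    {"sample": "S2", "variant": "vA", "ref_count": 96, "alt_count": 4},
    {"sample": "S1", "variant": "vB", "ref_count": 60, "alt_count": 40},
    {"sample": "S3", "variant": "vC", "ref_count": 98, "alt_count": 2},
]
kept = drop_recurrent_low_fraction(calls)
```

The filter only flags candidates; the visual inspection at the level of individual reads described above remains a separate step.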
Targeted long-read sequencing allows for rapid identification of pathogenic disease-causing variants in retinoblastoma
Published in Ophthalmic Genetics, 2022
Kenji Nakamichi, Andrew Stacey, Debarshi Mustafi
For the short-read data, base calls were generated in real-time on the Illumina NovaSeq6000 instrument. BAM files were aligned to a human reference (GRCh38) using Burrows-Wheeler Aligner v0.7.15 (27). A Genome Analysis Toolkit (GATK) (28) (v4.2.6.1) based pipeline following the best practices was used. The reads were then filtered for pairing and a minimum alignment mapping quality (MAPQ) score of 50, then the supplementary, secondary, and optical duplicate alignments were removed. Variants were called from the filtered BAM file using GATK HaplotypeCaller, and the output variant call format (VCF) file underwent base quality score recalibration (BQSR) using GATK BaseRecalibrator. The recalibration tables were then used with GATK ApplyBQSR to recalibrate the base quality scores, and the recalibrated BAM file then underwent a second round of variant calling using GATK HaplotypeCaller. The resulting variant call files underwent several variant quality score recalibration (VQSR) steps using GATK VariantRecalibrator with parameters tuned for WGS. The resulting recalibration table and tranches files were then applied using GATK ApplyVQSR sequentially in SNP and INDEL modes. The recalibrated VCF file was then split into SNPs and INDELs using GATK SelectVariants, and filtered using GATK VariantFiltration with tuned parameters. The VCF file was then filtered to passing variants with a minimum allele depth of 15.
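The read-filtering step can be sketched in Python using the FLAG bits defined in the SAM specification. This is an illustration, not the authors' pipeline: "filtered for pairing" is interpreted here as requiring the paired and properly-paired bits, and the single SAM duplicate bit (0x400) stands in for optical duplicates, since the flag itself does not distinguish PCR from optical duplicates. Function names are hypothetical.

```python
# FLAG bits from the SAM specification.
PAIRED = 0x1
PROPER_PAIR = 0x2
SECONDARY = 0x100
DUPLICATE = 0x400
SUPPLEMENTARY = 0x800


def keep_alignment(flag, mapq, min_mapq=50):
    """Return True if an alignment passes the pairing, MAPQ, and flag filters."""
    if not (flag & PAIRED and flag & PROPER_PAIR):
        return False  # assumption: "pairing" means properly paired
    if mapq < min_mapq:
        return False
    if flag & (SECONDARY | DUPLICATE | SUPPLEMENTARY):
        return False
    return True


def filter_sam_lines(lines, min_mapq=50):
    """Keep header lines and alignment lines that pass keep_alignment."""
    kept = []
    for line in lines:
        if line.startswith("@"):
            kept.append(line)  # header lines pass through unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        flag, mapq = int(fields[1]), int(fields[4])  # FLAG and MAPQ columns
        if keep_alignment(flag, mapq, min_mapq):
            kept.append(line)
    return kept


# Toy SAM records; only the first alignment should survive.
sam_lines = [
    "@HD\tVN:1.6\tSO:coordinate",
    "r1\t99\tchr13\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",
    "r2\t1123\tchr13\t100\t60\t4M\t=\t200\t104\tACGT\tIIII",
    "r3\t99\tchr13\t100\t30\t4M\t=\t200\t104\tACGT\tIIII",
]
filtered = filter_sam_lines(sam_lines)
```

Flag 99 marks a properly paired first-in-pair read, while 1123 is the same read with the duplicate bit set, so it is dropped; r3 fails the MAPQ threshold.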
Metagenomics reveals impact of geography and acute diarrheal disease on the Central Indian human gut microbiome
Published in Gut Microbes, 2020
Tanya M. Monaghan, Tim J. Sloan, Stephen R. Stockdale, Adam M. Blanchard, Richard D. Emes, Mark Wilcox, Rima Biswas, Rupam Nashine, Sonali Manke, Jinal Gandhi, Pratishtha Jain, Shrejal Bhotmange, Shrikant Ambalkar, Ashish Satav, Lorraine A. Draper, Colin Hill, Rajpal Singh Kashyap
Quality filtered reads, both paired and unpaired, were mapped onto the final Indian fecal virome using bowtie2 in ‘end-to-end’ mode (version 2.3.4.1).67 The read alignment outputs were converted to sorted BAM files through samtools (version 1.7).68 The abundance and breadth of coverage of reads mapping to each contig were determined using the bedtools coverage function (version 2.26.0).69 Subsequently, in order to determine if a viral sequence was indeed present in a fecal virome, a breadth-of-coverage filter was applied. This was designed to remove viruses where potentially hundreds of reads could map onto a single conserved region. Therefore, for viral sequences ≤5kb, 75% of the genome needed to be covered by aligned reads; for sequences >5kb and ≤50kb, 50% of the genome needed to be covered; and for sequences >50kb, 25% of the genome needed to be covered.
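The length-dependent breadth-of-coverage thresholds can be written as a short Python sketch. The thresholds are those stated above; the function names and the inputs (contig length and number of covered bases, as could be taken from a coverage report) are assumptions for illustration.

```python
def required_breadth(contig_length):
    """Minimum fraction of the contig that must be covered by aligned reads."""
    if contig_length <= 5_000:
        return 0.75
    if contig_length <= 50_000:
        return 0.50
    return 0.25


def is_present(contig_length, covered_bases):
    """Decide whether a viral contig passes the breadth-of-coverage filter."""
    breadth = covered_bases / contig_length
    return breadth >= required_breadth(contig_length)


# Toy contigs as (length in bp, bases covered by at least one read).
contigs = {
    "small_pass": (4_000, 3_000),     # 75% of a <=5 kb contig
    "small_fail": (4_000, 2_999),     # just under 75%
    "medium_pass": (20_000, 10_000),  # 50% of a 5-50 kb contig
    "large_fail": (100_000, 20_000),  # 20% of a >50 kb contig
}
calls = {name: is_present(length, covered)
         for name, (length, covered) in contigs.items()}
```

A contig with many reads piled onto one conserved region has high abundance but low breadth, so it fails this test regardless of read count.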
Identification of causative variants in patients with non-syndromic hearing loss in the Minnan region, China by targeted next-generation sequencing
Published in Acta Oto-Laryngologica, 2019
Xiaohui Wu, Xingqiang Gao, Peng Han, Yulin Zhou
Image analysis, error estimation, and base calling were performed using Illumina Pipeline software (version 1.3.4). Clean reads with a length of 90 bp were aligned to the human genome reference sequence GRCh37/hg19 using the Burrows-Wheeler Aligner (version 0.7.12) with default settings [11]. Picard (version 1.118, http://broadinstitute.github.io/picard/) was used to convert SAM files to BAM files and remove the alignments of duplicate reads. The BAM files were used for computing read coverage in the target region and sequencing depth, and for single-nucleotide polymorphism (SNP) and insertion and deletion (InDel) calling. Local realignment around InDels, base quality score recalibration, and variant calling were performed using GATK (version 3.3.0, http://software.broadinstitute.org/gatk/).