DADA2 (Divisive Amplicon Denoising Algorithm) pipeline overview
a) The main difference from OTU-clustering-based methods is that DADA2 infers exact amplicon sequence variants (ASVs). Unlike an OTU, which is a cluster of closely related (typically 97% identical) sequences, each ASV is a single unique sequence.
b) DADA2 uses sequence quality information to build an error model. It uses machine learning methods, alternating between estimating error rates and inferring sample composition until the two converge. DADA2 therefore performs error correction, assigning each relevant read to an error-corrected sequence.
c) Each ASV has an associated quality estimate, which informs inference/denoising. Samples can be pooled to improve inference, especially for low-abundance ASVs.
d) The main steps of DADA2:
Once you have demultiplexed FASTQ files that are free of non-biological nucleotides (i.e. primers have been stripped), the DADA2 pipeline proceeds as follows:
- Filter and trim: filterAndTrim() (filters the forward and reverse reads jointly, outputting only those pairs of reads that both pass the filter)
- Dereplicate: derepFastq()
- Learn error rates: learnErrors()
- Infer sample composition: dada()
- Merge paired reads: mergePairs()
- Make sequence table: makeSequenceTable()
- Remove chimeras: removeBimeraDenovo() (chimeric sequences are removed after ASV identification, de novo rather than against a reference database: a chimeric sequence is identified as one that can be exactly reconstructed by joining a left segment and a right segment from two more abundant parent sequences)
e) An important consideration: if using paired-end sequencing data, you must maintain a suitable overlap (>20 nt) between the forward and reverse reads after trimming!
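The steps under d) and the overlap consideration can be sketched in R roughly as follows. This is a minimal sketch, not a reference workflow: the helper names `expected_overlap()` and `run_dada2_sketch()` are hypothetical, the truncation lengths (240, 160) and the amplicon length (253 nt) are illustrative assumptions that must be replaced with values suited to your own primers and read quality, and the filtering thresholds are common starting points rather than recommendations from this page. The dada2 functions themselves are the ones listed above.

```r
# Hypothetical helper: expected overlap (nt) between the truncated forward
# and reverse reads, given the primer-free amplicon length.
expected_overlap <- function(trunc_f, trunc_r, amplicon_len) {
  trunc_f + trunc_r - amplicon_len
}

# Hypothetical wrapper showing the order of the dada2 steps.
# fnFs / fnRs: paths to demultiplexed, primer-free forward / reverse FASTQ files.
# sample_names: one name per sample, in the same order as fnFs / fnRs.
# trunc_len and amplicon_len are ASSUMED example values, not defaults of dada2.
run_dada2_sketch <- function(fnFs, fnRs, filt_dir, sample_names,
                             trunc_len = c(240, 160),
                             amplicon_len = 253,
                             min_overlap = 20) {
  # Stop early if trimming would leave too little overlap to merge pairs.
  ov <- expected_overlap(trunc_len[1], trunc_len[2], amplicon_len)
  if (ov <= min_overlap) {
    stop("Expected overlap of ", ov, " nt is too short; relax trunc_len.")
  }

  filtFs <- file.path(filt_dir, paste0(sample_names, "_F_filt.fastq.gz"))
  filtRs <- file.path(filt_dir, paste0(sample_names, "_R_filt.fastq.gz"))

  # Filter and trim: only read pairs where both reads pass are kept.
  dada2::filterAndTrim(fnFs, filtFs, fnRs, filtRs,
                       truncLen = trunc_len, maxN = 0, maxEE = c(2, 2),
                       truncQ = 2, rm.phix = TRUE, compress = TRUE)

  # Dereplicate: collapse identical reads into unique sequences with counts.
  derepFs <- dada2::derepFastq(filtFs)
  derepRs <- dada2::derepFastq(filtRs)

  # Learn error rates separately for forward and reverse reads.
  errF <- dada2::learnErrors(filtFs)
  errR <- dada2::learnErrors(filtRs)

  # Infer sample composition (pass pool = TRUE to pool samples).
  dadaFs <- dada2::dada(derepFs, err = errF)
  dadaRs <- dada2::dada(derepRs, err = errR)

  # Merge the denoised forward and reverse reads.
  mergers <- dada2::mergePairs(dadaFs, derepFs, dadaRs, derepRs)

  # Build the sample-by-ASV count table, then remove chimeras de novo.
  seqtab <- dada2::makeSequenceTable(mergers)
  dada2::removeBimeraDenovo(seqtab, method = "consensus")
}

# With the assumed example values, 240 + 160 - 253 leaves 147 nt of overlap.
expected_overlap(240, 160, 253)
```

The sketch assumes every sample retains reads after filtering; if a sample loses all its reads, no filtered file is written for it and the later steps need to skip that sample. Checking the overlap before any filtering is a deliberate choice, since an over-aggressive truncation otherwise only shows up later as a failure to merge most read pairs.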
Updated by Katie Lennard almost 7 years ago · 3 revisions