Katie Lennard, 04/26/2018 10:36 AM
1 | 1 | Katie Lennard | # Wiki |
---|---|---|---|
2 | # Background |
||
3 | Ulas Karaoz from Lawrence Berkeley National Laboratory (one of Heather Jaspan's collaborators) has been advising me on how to set up a whole-genome shotgun (WGS) pipeline for UCT, as this is his area of expertise. The pipeline is primarily aimed at profiling relatively complex metagenomic samples, in terms of both taxonomic composition and gene content. Accordingly, a study profiling infant stool samples was selected as the test dataset. |
||
4 | # Starting material: software installed & input data |
||
5 | 4 | Katie Lennard | *Input data: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA290380 (the relevant paper for this study is attached). This is a longitudinal study of stool samples from 11 infants. Ulas suggested selecting 1 or 2 infants with the maximum number of longitudinal samples (keeping the number below 8). The input data size is 3.25 × 10^10 base pairs (counted with https://github.com/billzt/readfq), which is about 30 GB (raw); Ulas estimates this will require 50-100 GB of memory to assemble. |
6 | 1 | Katie Lennard | *Andrew (HPC) suggests running this on the hex high-memory machine, which consists of two nodes, each with 1 TB of memory |
7 | 3 | Katie Lennard | *For QC: FastQC, based on the scripts from our 16S pipeline. This requires a base script (fastqc.single.sh), a batch script (fastqc.batch.sh), a config file, and a file listing all files to be checked (with full file paths) |
8 | 4 | Katie Lennard | *Read trimming: Trimmomatic (Cutadapt is perhaps a better option: it is more intuitive, whereas Trimmomatic can be unpredictable) |
9 | 2 | Katie Lennard | *Co-assembly: Megahit (https://github.com/voutcn/megahit) |
10 | 1 | Katie Lennard | *Index reference sequences (Megahit output): bowtie2 |
11 | *Map reads to assembled scaffolds: bowtie2, with bedtools used to compute per-contig coverage from the alignments |
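
Two of the preparatory steps above (sizing the input in base pairs, and building the list of full file paths that the FastQC batch script reads) can be sketched in Python. This is a minimal sketch, not the readfq tool or the 16S pipeline scripts: the function names, the list of accepted file suffixes, and the assumption that every FASTQ record occupies exactly four lines are all illustrative choices made here.

```python
import gzip
import os

# Assumed naming convention for read files; adjust to match the real data.
FASTQ_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")


def count_reads_and_bases(lines):
    """Count reads and bases from an iterable of FASTQ lines.

    Assumes strict 4-line records (header, sequence, '+', qualities).
    """
    reads = 0
    bases = 0
    for index, line in enumerate(lines):
        if index % 4 == 1:  # the second line of each record is the sequence
            reads += 1
            bases += len(line.strip())
    return reads, bases


def count_fastq_file(path):
    """Count reads and bases in a plain or gzip-compressed FASTQ file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_reads_and_bases(handle)


def list_fastq_files(directory):
    """Return sorted absolute paths of FASTQ files directly inside directory."""
    names = sorted(os.listdir(directory))
    return [
        os.path.abspath(os.path.join(directory, name))
        for name in names
        if name.endswith(FASTQ_SUFFIXES)
    ]


def write_file_list(directory, list_path):
    """Write one absolute FASTQ path per line to list_path; return the paths."""
    paths = list_fastq_files(directory)
    with open(list_path, "w") as out:
        for path in paths:
            out.write(path + "\n")
    return paths
```

Summing the base counts returned by `count_fastq_file` over every path from `list_fastq_files` gives a total that can be compared with the figure reported above before committing to a memory request on the high-memory nodes.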
||
12 | 4 | Katie Lennard | *Prepare input file for Concoct: custom script from Ulas (input files = coverage table + contigs file) |
13 | *Binning (clustering contigs on tetranucleotide frequency and coverage): Concoct (https://github.com/BinPro/CONCOCT) |
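
To make the tetranucleotide signal concrete, the sketch below computes a normalised 4-mer frequency vector for a single contig. It illustrates the general idea only and is not Concoct's own implementation, which applies its own handling of reverse complements and further transformations; the function name is hypothetical, and windows containing ambiguous bases such as N are simply ignored here.

```python
from collections import Counter
from itertools import product

# All 256 possible tetranucleotides over the unambiguous DNA alphabet.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(sequence):
    """Return a dict mapping each 4-mer to its relative frequency in sequence.

    Windows containing characters other than A, C, G, T are skipped.
    A sequence with no valid window yields all-zero frequencies.
    """
    seq = sequence.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    valid = {kmer: counts.get(kmer, 0) for kmer in KMERS}
    total = sum(valid.values())
    if total == 0:
        return {kmer: 0.0 for kmer in KMERS}
    return {kmer: count / total for kmer, count in valid.items()}
```

Contigs from the same genome tend to have similar vectors, which is why this composition signal is combined with per-sample coverage when clustering contigs into bins.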
||
14 | *Evaluate bins visually with the R script ClusterPlot.R (supplied with Concoct) |
||
15 | 2 | Katie Lennard | *Validate binning using single-copy core genes: CheckM (http://ecogenomics.github.io/CheckM/) |
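
The reasoning behind single-copy core gene validation can be sketched as follows: a gene expected exactly once per genome indicates incompleteness when it is missing from a bin, and possible contamination when it appears more than once. The code below is a deliberately simplified illustration of that idea, not CheckM's actual calculation, which works with lineage-specific, collocated marker sets. The function name and the input format (a dict of marker name to copy count) are assumptions made for this example.

```python
def estimate_bin_quality(marker_copies):
    """Estimate completeness and contamination (both in percent) for one bin.

    marker_copies maps each expected single-copy marker gene to the number
    of copies found in the bin. Simplified: every marker is weighted equally.
    """
    total = len(marker_copies)
    if total == 0:
        raise ValueError("at least one marker gene is required")
    # Markers seen at least once count towards completeness.
    present = sum(1 for copies in marker_copies.values() if copies >= 1)
    # Every copy beyond the first counts towards contamination.
    extra = sum(copies - 1 for copies in marker_copies.values() if copies > 1)
    return 100.0 * present / total, 100.0 * extra / total
```

For example, a bin in which three of four markers are found and one of them occurs twice would score 75% completeness and 25% contamination under this simplified scheme.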