Wiki » History » Version 12

Katie Lennard, 01/30/2019 10:33 AM

# Wiki
# Background
Ulas Karaoz from Lawrence Berkeley National Laboratory (one of Heather Jaspan's collaborators) has been advising me on how to set up a WGS pipeline for UCT, as this is his area of expertise. The pipeline would be primarily aimed at profiling relatively complex metagenomics samples, both in terms of taxonomic composition and gene content. Accordingly, a study profiling infant stool samples was selected as the test dataset.
# Test pipeline (from Ulas Karaoz) starting material: software installed & input data
*Input data: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA290380 (the relevant paper for this study is attached). This is a longitudinal study of stool samples from 11 infants. Ulas suggested selecting 1 or 2 infants with the maximum number of longitudinal samples (keeping it under 8). The input data size is 3.25 × 10^10 base pairs (measured with https://github.com/billzt/readfq), which is about 30 GB raw, and which Ulas estimates will require 50-100 GB of memory to assemble.
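As a quick sanity check on those numbers, the size and memory figures above can be reproduced with plain arithmetic (no pipeline tools needed). Reading Ulas's 50-100 GB estimate as roughly 2-3x the raw input size is my assumption, not from his notes:

```shell
# ~3.25e10 bases at ~1 byte per base of raw sequence
TOTAL_BASES=32500000000
GIB=$(( TOTAL_BASES / 1024 / 1024 / 1024 ))   # bytes -> GiB
echo "approx raw size: ${GIB} GiB"
# Assumption: assembly RAM ~2-3x raw input size, which lands in the
# 50-100 GB range Ulas quoted
echo "estimated assembly memory: $(( GIB * 2 ))-$(( GIB * 3 )) GiB"
```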
*Andrew (HPC) suggests running this on the hex high-memory machine, which consists of two nodes, each with 1 TB of memory
*All development scripts with README are attached under the 'Files' tab as scripts_for_redmine.tar.gz
*For QC: FastQC (based on the scripts from our 16S pipeline): requires base scripts (fastqc.single.sh and fastqc_combine.single.sh), batch scripts (fastqc.batch.sh and fastqc_combine.batch.sh), a config file, and a file listing all files to be checked (with full file paths)
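A minimal sketch of what the batch step does, assuming the file-list convention described above (one full path per line); the `fastqc` command is echoed rather than executed here, and the paths are placeholders:

```shell
# Build an example file list (one full path per line, as the config describes)
cat > files_to_check.txt <<'EOF'
/data/infant1/T1_R1.fastq.gz
/data/infant1/T1_R2.fastq.gz
EOF

# Dry run: print the fastqc command that would be run for each file.
# Drop the 'echo' to actually execute on hex.
while read -r fq; do
    echo fastqc "$fq" --outdir fastqc_out
done < files_to_check.txt
```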
*Read trimming: Trimmomatic (Cutadapt is perhaps a better, more intuitive option)
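A sketch of a paired-end Trimmomatic call for one sample. The adapter file and trimming thresholds are illustrative defaults, not project decisions, and the command is echoed rather than executed:

```shell
# Dry run: assemble and print a paired-end Trimmomatic command.
# Thresholds (SLIDINGWINDOW:4:20, MINLEN:36) and the TruSeq3 adapter file
# are common defaults, used here as placeholders.
SAMPLE=infant1_T1
CMD="trimmomatic PE -threads 8 ${SAMPLE}_R1.fastq.gz ${SAMPLE}_R2.fastq.gz"
CMD="$CMD ${SAMPLE}_R1.trim.fq.gz ${SAMPLE}_R1.unpaired.fq.gz"
CMD="$CMD ${SAMPLE}_R2.trim.fq.gz ${SAMPLE}_R2.unpaired.fq.gz"
CMD="$CMD ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 SLIDINGWINDOW:4:20 MINLEN:36"
echo "$CMD"
```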
*Co-assembly: Megahit (https://github.com/voutcn/megahit)
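A sketch of the MEGAHIT co-assembly step, passing all longitudinal samples from one infant as comma-separated read lists. File names, thread count and minimum contig length are placeholders, and the command is echoed rather than run:

```shell
# Dry run: build and print a MEGAHIT co-assembly command for two timepoints
# of one infant (comma-separated lists = co-assembly of all samples).
R1=infant1_T1_R1.trim.fq.gz,infant1_T2_R1.trim.fq.gz
R2=infant1_T1_R2.trim.fq.gz,infant1_T2_R2.trim.fq.gz
CMD="megahit -1 $R1 -2 $R2 --min-contig-len 1000 -t 16 -o megahit_coassembly"
echo "$CMD"
```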
*Index reference sequences (Megahit output): bowtie2
*Map reads to assembled scaffolds (bowtie2) and compute per-contig coverage: bedtools
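The indexing, mapping and coverage steps above can be sketched as follows. The `samtools sort` step is my assumption (`bedtools genomecov -ibam` expects a sorted BAM); sample names are placeholders and commands are echoed rather than executed:

```shell
# Dry run: index the MEGAHIT contigs, map each sample back to the
# co-assembly, then turn each sorted BAM into a coverage table.
CONTIGS=megahit_coassembly/final.contigs.fa
CMDS=$(
  echo "bowtie2-build $CONTIGS contigs_index"
  for s in infant1_T1 infant1_T2; do
    echo "bowtie2 -x contigs_index -1 ${s}_R1.trim.fq.gz -2 ${s}_R2.trim.fq.gz -p 16 -S ${s}.sam"
    echo "samtools sort -o ${s}.sorted.bam ${s}.sam"
    echo "bedtools genomecov -ibam ${s}.sorted.bam -bga > ${s}.cov.bedgraph"
  done
)
echo "$CMDS"
```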
*Prepare input file for Concoct: custom script from Ulas (input files = coverage table + contigs file)
*Binning (based on tetranucleotide frequency and coverage-based clustering): Concoct (https://github.com/BinPro/CONCOCT)
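A sketch of the CONCOCT input preparation and binning. Recent CONCOCT releases ship helper scripts (`cut_up_fasta.py`, `concoct_coverage_table.py`) that play the same role as Ulas's custom script; using them here is my substitution, and commands are echoed rather than run:

```shell
# Dry run: cut contigs into 10 kb chunks, build the coverage table from the
# sorted BAMs, then run CONCOCT on composition + coverage.
CMDS=$(
  echo "cut_up_fasta.py final.contigs.fa -c 10000 -o 0 --merge_last -b contigs_10K.bed > contigs_10K.fa"
  echo "concoct_coverage_table.py contigs_10K.bed infant1_T1.sorted.bam infant1_T2.sorted.bam > coverage_table.tsv"
  echo "concoct --composition_file contigs_10K.fa --coverage_file coverage_table.tsv -b concoct_out/"
)
echo "$CMDS"
```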
*Evaluate bins visually with the R script ClusterPlot.R (supplied with Concoct)
*Validate binning using single copy core genes: CheckM (http://ecogenomics.github.io/CheckM ; CheckM paper https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4484387/)
> *CheckM additionally requires the following software: Prodigal for gene prediction (using the contigs file from megahit as input); hmmer to align the results from Prodigal against CheckM's collection of single-copy core gene HMMs; pplacer to place bins in a reference taxonomy tree
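A sketch of validating the bins with CheckM's lineage workflow, which internally drives Prodigal, hmmer and pplacer (hence the dependencies above). Directory names are placeholders and the command is echoed rather than executed:

```shell
# Dry run: print a CheckM lineage_wf call on the CONCOCT bins
# (-x fa = bin files have a .fa extension).
CMD="checkm lineage_wf -x fa concoct_bins/ checkm_out/ -t 8"
echo "$CMD"
```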
# Methods research - for potential future implementation
*Metagenomics research and software development are progressing rapidly, with several tools available for each step of the pipeline and no clear gold standards
*"Critically, methodological improvements are difficult to gauge due to the lack of a general standard for comparison." This issue is currently being addressed by a community-driven initiative, the Critical Assessment of Metagenome Interpretation (CAMI), which aims at an independent, comprehensive and bias-free evaluation of methods: https://www.nature.com/articles/nmeth.4458.pdf
* CAMI requires software containerization and standardization of user interfaces (using Docker and biobox). See https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4607242/pdf/13742_2015_Article_87.pdf for details on biobox
* Current CAMI recommendations on methods:
> * Assembly (tested MEGAHIT, Minia, Meraga (Meraculous + MEGAHIT), A* (using the OperaMS Scaffolder), Ray Meta and Velour) - recommend MEGAHIT, Minia or Meraga
> * Binning (tested MyCC, MaxBin 2.0, MetaBAT, MetaWatt 3.5, CONCOCT, PhyloPythiaS+, taxator-tk, MEGAN6, and Kraken2) - "MetaWatt 3.5, followed by MaxBin 2.0, recovered the most genomes with high purity and completeness from all data sets"
> * Taxonomic profiling (tested CLARK; Common Kmers (an early version of MetaPalette); DUDes; FOCUS; MetaPhlAn 2.0; MetaPhyler; mOTU; a combination of Quikr, ARK and SEK (abbreviated Quikr); Taxy-Pro and TIPP) - "On the basis of the average of precision and recall, over all samples and taxonomic ranks, Taxy-Pro version 0 (mean = 0.616), MetaPhlAn 2.0 (mean = 0.603) and DUDes version 0 (mean = 0.596) performed best."
# Plan B: Setting up YAMP on hex 
*Due to difficulties in communicating with Ulas and getting his suggested metagenomics assembly pipeline up and running on hex, we decided to move to a more user-friendly, reference-based pipeline (YAMP: https://github.com/alesssia/YAMP). This pipeline needed improvement, however, as it was poorly written and not set up for processing samples in parallel. The current work-in-progress repo is here: https://github.com/kviljoen/YAMP/tree/master, set up according to the template we are using for all our Nextflow pipelines (i.e. based on Phil Ewels' nf-core guidelines)
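A sketch of launching the reworked YAMP with Nextflow. The profile name and the `--reads`/`--outdir` parameter names follow common nf-core conventions and are assumptions; check the repo's README and nextflow.config for the real ones. The command is echoed rather than run:

```shell
# Dry run: print an example Nextflow launch of the kviljoen/YAMP fork.
# '-profile uct_hex' and the --reads/--outdir names are placeholders.
CMD="nextflow run kviljoen/YAMP -profile uct_hex --reads 'data/*_R{1,2}.fastq.gz' --outdir results"
echo "$CMD"
```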
*Still to do (as of 30/1/2019): migrate the pipeline to the Ilifu cluster; add a strainphlan.py process for strain detection; look into adding PAVIAN for pathogen-tracking visualisation: https://ccb.jhu.edu/software/pavian/