
Bug #46

Read binning with CONCOCT

Added by Katie Lennard almost 7 years ago. Updated almost 7 years ago.

Status: New
Priority: Normal
Assignee: -
Start date: 12/04/2017
Due date:
% Done: 100%
Estimated time:
Description

CONCOCT "bins" metagenomic contigs. Metagenomic binning is the process of clustering sequences into groups that correspond to operational taxonomic units at some level.
CONCOCT is a complete pipeline in its own right, but Ulas uses only the binning part of CONCOCT.

Before using CONCOCT, the input files (QCed read files from Trimmomatic) and the contigs file (named final.contigs.fa, from MEGAHIT) need to be prepared as follows:

  1. Index the contigs file with the 'bowtie2-build' command (this produces a number of files with the same name as the input file but ending in the extensions .1, .2, .3, etc.).
  2. Align the individual trimmed read files to the indexed contigs file with the 'bowtie2' command (scripts 'prepare_for_concoct.single.sh' and 'prepare_for_concoct.batch.sh'); can be run with the batch script on hex.
  3. Index the original contigs file with the 'samtools faidx' command (this is needed to get the file in the right format for the next step).
  4. Use the 'samtools view -bt' command to convert the output from step 2 from SAM to BAM format.
  5. Sort the BAM file from step 4 with 'samtools sort' so that reads occur in genome order.
  6. Index the output from step 5 ('samtools index').
  7. Locate, tag and remove duplicate reads from step 6 (MarkDuplicates.jar is needed for this; download from https://repo.jbei.org/users/mwornow/repos/seqvalidation/browse/tools/Picard-NERSC_version/MarkDuplicates.jar?at=master).
  8. After removing duplicates, sort the output from step 7 (again using 'samtools sort').
  9. Index the output from step 8 ('samtools index').
  10. Compute the coverage profile of step 9 using 'genomeCoverageBed' from bedtools2.
  11. Take the coverage profiles (.smds.coverage files) from step 10 and create a coverage table for input to CONCOCT (python /opt/exp_soft/CONCOCT-0.4.0/scripts/gen_input_table.py).
  12. Run CONCOCT, with the coverage table and contigs file as input:
     - module load python/anaconda-python-2.7
     - source activate concoct_env
     - /opt/exp_soft/CONCOCT-0.4.0/bin/concoct
     - For CONCOCT you need to specify (a) the maximum number of clusters (default 400) and (b) the number of cores to use. According to https://bitbucket.org/berkeleylab/metabat/wiki/, CONCOCT uses 10 threads regardless of the number specified, so this is currently set to 10 (ppn=10).
     - The following warnings/errors were produced but did not seem to affect the output:
       /opt/exp_soft/anaconda/python2.7/envs/concoct_env/lib/python2.7/site-packages/Bio/Seq.py:341: BiopythonDeprecationWarning: This method is obsolete; please use str(my_seq) instead of my_seq.tostring().
       python: symbol lookup error: /opt/exp_soft/anaconda/python2.7/lib/python2.7/site-packages/numexpr/../../../libmkl_vml_def.so: undefined symbol: mkl_serv_getenv (repeated four times)
  13. The final step in the binning process is to visually evaluate the output using the R script ClusterPlot_KL.R, which produces a colour-coded PCA plot of the clusters.

NB: the CONCOCT documentation recommends splitting larger contigs before running CONCOCT, so as to give more weight to larger contigs (I have not tested this yet).
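The preparation-and-binning steps above can be sketched as a single shell script. This is a sketch, not the actual 'prepare_for_concoct' scripts: the filenames (final.contigs.fa, a 'sample' prefix for the trimmed reads), the MarkDuplicates invocation, and the exact tool flags are assumptions to be adjusted for your environment. By default the script only prints each command as a plan; it executes them only when invoked with --exec.

```shell
#!/usr/bin/env bash
# Sketch of the CONCOCT preparation and binning pipeline described above.
# Assumed inputs: final.contigs.fa (MEGAHIT) and ${SAMPLE}_R1/_R2 trimmed
# reads (Trimmomatic). Tool flags are illustrative, not taken verbatim
# from the scripts mentioned in the issue.
set -euo pipefail

CONTIGS=final.contigs.fa           # assembled contigs from MEGAHIT
SAMPLE=sample                      # per-sample prefix for trimmed reads
PICARD_JAR=MarkDuplicates.jar      # Picard MarkDuplicates (URL in step 7)
CONCOCT_DIR=/opt/exp_soft/CONCOCT-0.4.0

STEPS=(
  # 1. index the contigs for bowtie2
  "bowtie2-build ${CONTIGS} ${CONTIGS}"
  # 2. align trimmed reads to the indexed contigs (SAM output)
  "bowtie2 -x ${CONTIGS} -1 ${SAMPLE}_R1.trimmed.fastq -2 ${SAMPLE}_R2.trimmed.fastq -S ${SAMPLE}.sam"
  # 3. index the contigs for samtools (needed by 'view -bt')
  "samtools faidx ${CONTIGS}"
  # 4. convert SAM to BAM using the faidx index
  "samtools view -bt ${CONTIGS}.fai -o ${SAMPLE}.bam ${SAMPLE}.sam"
  # 5. sort the BAM into genome order
  "samtools sort ${SAMPLE}.bam -o ${SAMPLE}.sorted.bam"
  # 6. index the sorted BAM
  "samtools index ${SAMPLE}.sorted.bam"
  # 7. locate, tag and remove duplicate reads
  "java -jar ${PICARD_JAR} INPUT=${SAMPLE}.sorted.bam OUTPUT=${SAMPLE}.smd.bam METRICS_FILE=${SAMPLE}.smd.metrics REMOVE_DUPLICATES=TRUE"
  # 8. sort the de-duplicated BAM
  "samtools sort ${SAMPLE}.smd.bam -o ${SAMPLE}.smds.bam"
  # 9. index it
  "samtools index ${SAMPLE}.smds.bam"
  # 10. per-contig coverage profile (bedtools2)
  "genomeCoverageBed -ibam ${SAMPLE}.smds.bam > ${SAMPLE}.smds.coverage"
  # 11. build the coverage table CONCOCT expects (--isbedfiles as in the
  #     CONCOCT 0.4 example docs; an assumption here)
  "python ${CONCOCT_DIR}/scripts/gen_input_table.py --isbedfiles ${CONTIGS} ${SAMPLE}.smds.coverage > concoct_inputtable.tsv"
  # 12. run CONCOCT on the coverage table + contigs (-c = max clusters)
  "${CONCOCT_DIR}/bin/concoct --coverage_file concoct_inputtable.tsv --composition_file ${CONTIGS} -c 400"
)

for cmd in "${STEPS[@]}"; do
  if [[ "${1:-}" == "--exec" ]]; then
    eval "$cmd"
  else
    echo "PLAN: $cmd"
  fi
done
```

Running the script with no arguments prints the twelve planned commands, which is a cheap way to review the pipeline before committing it to a batch job on hex.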

History

#1

Updated by Katie Lennard almost 7 years ago

  • Description updated (diff)
  • Start date changed from 04/30/2018 to 12/04/2017
  • % Done changed from 0 to 100
