Bug #46
Updated by Katie Lennard almost 7 years ago
CONCOCT “bins” metagenomic contigs. Metagenomic binning is the process of grouping sequences into clusters that correspond to operational taxonomic units at some taxonomic level.
CONCOCT is a whole pipeline in itself, but Ulas uses only the binning part of CONCOCT.
* Before running concoct, the input files (QCed read files from trimmomatic) and the contigs file (final.contigs.fa from megahit) need to be prepared as follows (a command sketch covering these steps follows the list):
1. The contigs file needs to be indexed using the 'bowtie2-build' command (produces several files with the same basename as the input file but ending in .1.bt2, .2.bt2, .3.bt2 etc.)
2. The individual trimmed read files need to be aligned to the indexed contigs file using the 'bowtie2' command (scripts 'prepare_for_concoct.single.sh' and 'prepare_for_concoct.batch.sh'); this can be run with the batch script on hex
3. Index the original contigs file with the 'samtools faidx' command (this is needed to get the file in the right format for the next step)
4. Then we can use the 'samtools view -bt' command to convert the output from 2. from SAM to BAM format
5. Sort the BAM file from 4. with 'samtools sort' so that reads occur in genome order
6. Index the output from 5. ('samtools index')
7. Locate, tag and remove duplicate reads from 6. (MarkDuplicates.jar is needed for this: download from https://repo.jbei.org/users/mwornow/repos/seqvalidation/browse/tools/Picard-NERSC_version/MarkDuplicates.jar?at=master)
8. After removing duplicates, sort the output from 7. (again using 'samtools sort')
9. Index the output from 8. ('samtools index')
10. Compute the coverage profile of 9. using 'genomeCoverageBed' from bedtools2
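A minimal sketch of steps 1.-10. for a single sample. The sample name, read file names, thread count and the bare java call are placeholders/assumptions, and the 'samtools sort' syntax shown is the old 0.1.x style, so adjust to the installed versions:

# 1. index the contigs (produces final.contigs.fa.*.bt2 files)
bowtie2-build final.contigs.fa final.contigs.fa
# 2. align one sample's trimmed reads to the indexed contigs
bowtie2 -p 8 -x final.contigs.fa -1 Sample1_R1.trimmed.fastq -2 Sample1_R2.trimmed.fastq -S Sample1.sam
# 3. index the contigs fasta itself (the .fai file is used by -t in step 4)
samtools faidx final.contigs.fa
# 4. convert SAM to BAM
samtools view -bt final.contigs.fa.fai Sample1.sam > Sample1.bam
# 5.-6. sort into genome order and index
samtools sort Sample1.bam Sample1.sorted
samtools index Sample1.sorted.bam
# 7. locate, tag and remove duplicate reads with Picard MarkDuplicates
java -jar MarkDuplicates.jar INPUT=Sample1.sorted.bam OUTPUT=Sample1.smd.bam METRICS_FILE=Sample1.smd.metrics REMOVE_DUPLICATES=TRUE VALIDATION_STRINGENCY=LENIENT
# 8.-9. re-sort and index the de-duplicated BAM
samtools sort Sample1.smd.bam Sample1.smds
samtools index Sample1.smds.bam
# 10. per-contig coverage profile with bedtools2
genomeCoverageBed -ibam Sample1.smds.bam > Sample1.smds.coverage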
* The next step is to take the coverage profiles (.smds.coverage files) from 10. and create a coverage table for input to concoct (python /opt/exp_soft/CONCOCT-0.4.0/scripts/gen_input_table.py); see the sketch below
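A sketch of this step, assuming all the .smds.coverage files sit in the working directory; the --isbedfiles and --samplenames flags are as I recall them from the CONCOCT 0.4 example and should be checked against the script's --help:

# list the sample names, one per line (hypothetical helper file)
for f in *.smds.coverage; do echo ${f%.smds.coverage}; done > sample_names.txt
# build the coverage table used as concoct input
python /opt/exp_soft/CONCOCT-0.4.0/scripts/gen_input_table.py --isbedfiles --samplenames sample_names.txt final.contigs.fa *.smds.coverage > concoct_inputtable.tsv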
* Now we can run concoct (with the coverage table and contigs file as input) as follows:
> module load python/anaconda-python-2.7
> source activate concoct_env
> /opt/exp_soft/CONCOCT-0.4.0/bin/concoct
> For concoct you need to specify a) the maximum number of clusters (default = 400) and b) the number of cores to use (according to https://bitbucket.org/berkeleylab/metabat/wiki/ concoct uses 10 threads regardless of the number specified, so this is currently set to 10: ppn=10); an example call is sketched below
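A sketch of the concoct call itself; the output names are placeholders and the flag names (--coverage_file, --composition_file, -c for the maximum number of clusters, -b for the output basename) should be checked against concoct --help:

# -c: maximum number of clusters (default 400); -b: basename/directory for the output files
/opt/exp_soft/CONCOCT-0.4.0/bin/concoct --coverage_file concoct_inputtable.tsv --composition_file final.contigs.fa -c 400 -b concoct_output/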
> The following warnings/errors were produced but didn't seem to affect the output:
/opt/exp_soft/anaconda/python2.7/envs/concoct_env/lib/python2.7/site-packages/Bio/Seq.py:341: BiopythonDeprecationWarning: This method is obsolete; please use str(my_seq) instead of my_seq.tostring().
BiopythonDeprecationWarning)
python: symbol lookup error: /opt/exp_soft/anaconda/python2.7/lib/python2.7/site-packages/numexpr/../../../libmkl_vml_def.so: undefined symbol: mkl_serv_getenv
python: symbol lookup error: /opt/exp_soft/anaconda/python2.7/lib/python2.7/site-packages/numexpr/../../../libmkl_vml_def.so: undefined symbol: mkl_serv_getenv
python: symbol lookup error: /opt/exp_soft/anaconda/python2.7/lib/python2.7/site-packages/numexpr/../../../libmkl_vml_def.so: undefined symbol: mkl_serv_getenv
python: symbol lookup error: /opt/exp_soft/anaconda/python2.7/lib/python2.7/site-packages/numexpr/../../../libmkl_vml_def.so: undefined symbol: mkl_serv_getenv
> The final step in the binning process is to visually evaluate the output using the R script ClusterPlot_KL.R, which produces a colour-coded PCA plot of the clusters
NB: the concoct documentation recommends splitting larger contigs before running concoct so as to give more weight to larger contigs (I have not tested this yet; a sketch is below)
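Untested sketch of that splitting step, assuming the cut_up_fasta.py script shipped with CONCOCT 0.4.0 and the -c/-o/-m flags I recall (chunk size, overlap size, merge the last short chunk into the previous one):

# cut contigs into ~10 kb chunks before binning (not tested here)
python /opt/exp_soft/CONCOCT-0.4.0/scripts/cut_up_fasta.py -c 10000 -o 0 -m final.contigs.fa > final.contigs.c10K.fa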