Support #50
Updated by Katie Lennard almost 7 years ago
The quality of the binning results from CONCOCT can be examined by looking at single copy core genes (i.e. genes that are expected to be present in all taxa in exactly one copy; more than one copy in a bin indicates contamination)
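To make the idea concrete, here is a minimal sketch of how single copy core gene counts translate into a per-bin quality summary. It is a simplification for illustration only: the function name `summarise_bin`, the marker names and the two ratios are my own assumptions, and this is not the formula that CheckM or the CONCOCT validation scripts actually use.

```python
from collections import Counter


def summarise_bin(marker_hits, expected_markers):
    """Summarise single copy core gene hits for one bin.

    marker_hits: list of marker names, one entry per hit found in the bin.
    expected_markers: set of markers expected exactly once per genome.
    """
    if not expected_markers:
        raise ValueError("expected_markers must not be empty")
    # Only count hits that belong to the expected marker set.
    counts = Counter(h for h in marker_hits if h in expected_markers)
    total = len(expected_markers)
    # Every copy beyond the first suggests contigs from another genome.
    extra_copies = sum(n - 1 for n in counts.values())
    return {
        "completeness": len(counts) / total,
        "contamination": extra_copies / total,
        "multi_copy": sorted(m for m, n in counts.items() if n > 1),
    }


# Illustrative marker set and hits (hypothetical data, not from our samples).
expected = {"rpsB", "rplC", "recA", "gyrB"}
hits = ["rpsB", "rplC", "rplC", "recA"]
print(summarise_bin(hits, expected))
```

In this toy case three of the four markers are found and one of them appears twice, so the bin looks incomplete and slightly contaminated.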
Note that Ulas has his own scripts for validating single copy core genes (described below), but CONCOCT also has options for doing this (described here https://concoct.readthedocs.io/en/latest/complete_example.html#validation-using-single-copy-core-genes), and CheckM is another option which might be worth comparing
**A) Ulas's pipeline** (validation_by_single_copy_core_genes.sh on hex). NB: after he sent me all these scripts he said the only reason he used these custom scripts was that CheckM wasn't available at the time, and he then suggested I use CheckM (not sure why he sent me the scripts in the first place). So ignore this pipeline.
~~1. Predict genes on the contigs (output from megahit) using Prodigal (output is a .faa file of protein sequences)
2. Use hmmer to search a HMM profile file (.hmm extension) against a sequence database (our .faa file from prodigal). The .hmm file can be built from an alignment file using hmmbuild
3. Next we use a series of .R scripts (from Ulas) to format the data and create a table of single copy core genes for each sample
>index-fasta_KL.R (input: contigs.fa file + prodigal .faa file -> count number of ORFs/scaffold (he says scaffold but we're using the contigs file); output: orf2faa.RData, scaffolds2norfs.RData, scaffold2fa.RData, scaffold2stats.RData)
>readBins_KL.R (input: binning result from concoct + scaffold2stats.RData -> Make scaffold/binning summary including IDs of scaffolds that weren't binned; output: scaffold2bin.RData)
>writeScaffoldOrfIds_KL.R (input: scaffold2fa.RData, orf2faa.RData -> Just get the scaffold IDs and orf IDs; output: scaffoldids.RData, orfids.RData)
>extractEssenSingleCopy_KL.R (input: scaffoldids.RData, orfids.RData, EssenSingleCopy_domain2gene.txt (provided by Ulas), hits table from hmmer)
~~
**B) CheckM** (currently preferred option) - https://github.com/Ecogenomics/CheckM/wiki
1. For checkM I had to format the input so that we had one fasta file for each bin (the output from concoct only provided a mapping between contig IDs and bin IDs, not fasta files). The script I made for this is 'split_bins_fasta.sh' on hex
2. checkM requires prodigal (installed on hex), hmmer (installed on hex) and pplacer (requested install on hex)
3. Recommended checkM workflow: 'lineage_wf'
4. Several other useful commands are available as part of checkM, e.g. 'unique' to check that no contig is assigned to more than one bin, and plotting commands for QC
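Step 1 above (one fasta file per bin) can be sketched roughly as follows. This is not the actual 'split_bins_fasta.sh' script, just a minimal Python illustration of the same idea. It assumes the CONCOCT clustering output is a two-column CSV of contig ID and bin ID (with or without a `contig_id,cluster_id` header line), and the function names and the `bin_<id>.fa` naming are my own choices.

```python
import csv
from collections import defaultdict
from pathlib import Path


def read_fasta(path):
    """Yield (contig_id, sequence) pairs from a FASTA file."""
    name, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if name is not None:
                    yield name, "".join(chunks)
                # Use the first whitespace-delimited token as the contig ID.
                name, chunks = (line[1:].split() or [""])[0], []
            elif line:
                chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def split_bins(contigs_fasta, clustering_csv, out_dir):
    """Write one FASTA per bin; return {bin_id: number of contigs written}."""
    contig_to_bin = {}
    with open(clustering_csv, newline="") as handle:
        for row in csv.reader(handle):
            # Skip blank lines and the optional header line.
            if len(row) < 2 or row[0] == "contig_id":
                continue
            contig_to_bin[row[0]] = row[1]

    bins = defaultdict(list)
    for contig_id, seq in read_fasta(contigs_fasta):
        bin_id = contig_to_bin.get(contig_id)
        if bin_id is not None:  # contigs that were not binned are skipped
            bins[bin_id].append((contig_id, seq))

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for bin_id, records in bins.items():
        with open(out_dir / f"bin_{bin_id}.fa", "w") as out:
            for contig_id, seq in records:
                out.write(f">{contig_id}\n{seq}\n")
    return {bin_id: len(records) for bin_id, records in bins.items()}
```

The sketch keeps all binned sequences in memory, which is fine for a small test but may need rethinking for a large assembly. The resulting directory could then be passed to checkM, e.g. `checkm lineage_wf -x fa bins/ checkm_out/`, where the two directory names are placeholders.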