Mandy currently analyses her CRIPRi-seq experiments with MAGecK-VISPR, but could not get the visualization component (VISPR) to work. MAGeCK analyses the sequencing counts and summarizes at the sgRNA and gene level with typically 4-5 sgRNAs selected as 'good' for each gene. There are two methods to choose from, the one is a rank-based method (MAGeCK-RRA) which is suitable for two-class comparisons, the other is a GLM-based method (MAGeCK-RLE) which is suitable for multiclass comparisons and more complex experiments. To date Mandy has mainly used MAGeCK-RRA but has also recently tried MAGeCK-RLE. The output from MAGeCK is imported into R along with an annotation file for functional annotation that Mandy generated herself for MTB. The subset of ~2000 genes that they are targeting in the M. smegmatis system are all closely related to M. tuberculosis as this is the main interest. Genes from the M. smeg experiment can therefore be mapped to the Tuberculist DB genes and COG functions. Mandy has not found a suitable tool to automate functional enrichment analyses with the MTB DB being the main limitation. For instance the R package MAGecKFlute does have capability for functional enrichment but when M. smeg is selected as organism with organism="msmeg"
it does not find it. So according to Mandy:
For the functional analysis I had been originally using Cog categories and Tuburculist categories from a previous paper. However, for the publication I need updated lists.
For the updated tuburculist categories (now stored in the mycobrowser database) I just downloaded an up to date GTF file from which I have extracted the annotations and appended to my data
For the COG categories I could not find a gene-COG category list for Mtb and so I took the liberty of generating my own through eggNOG (using the M. smegmatis protein coding regions).
*For the output I sometimes have multiple COG categories for one gene which I am not sure how to handle in subsequent enrichment analysis. *
I wrote Mandy a little R script to sum these categories
For the enrichment analysis I am stuck on what method to use. I have previously used GO functions (many years back) with software such as FattiGO etc. However for mycobacteria I have the two functional lists that I am using (COG and Mycobrowser/Tuberculist functional categories) rather than GO functions.
Mandy decided that since the COGs were more complete (a large proportion of genes had no GO annotation) she would focus on those.
To test for enrichment I need to use a Fishers exact test. I have my data in R and would ideally like to do this as part of my data processing and visualization pipeline however I am not sure how to do this and how to put my data in the correct format. I can do this outside of R if needed.
Mandy has since found suitable scripts to do this with MTC in R
From the Ps_assembly_annotated_contigs folder imported into R (one .tsv file per sample). Hypothetical proteins and proteins with no annotation removed. Sample from during Ps outbreak compared to samples from before/after outbreak in terms of gene content: 213 significant hits after multiple testing correction
I met with Pieter and Mike they are happy with the results, just needed some guidance on interpretation. With regards to the the uniformity seen across urban samples they were apparently all collected from the same sight and the negative control run by CPGR was negative so shouldn't be a problem. I will write the methods section for bioinformatic work done for their manuscript.
There were several ~80 taxa (merged at lowest available taxonomic annotation) that were differentially abundant or differentially absent/present (|FC| >= 1.5 %; presence >= 60% in at least one group; FDR . <= 0.05) when comparing fermented (N=5) vs. unfermented (N=4) milk. Lactococcus and Leuconostoc species were present at higher levels in the the fermented products as expected. The majority of hits were however actually ones that were (strangely?) uniformly present in the unfermented milk (3 of urban and 1 of rural origin). It could be that the urban samples were all collected from the same site which may explain the otherwise odd level of similarity (the elution buffer negative control did not however have any detectable amplicons based on the CPGR's report). One urban sample was clearly infected with Salmonella enterica which was the dominant organism in this sample.
VF results used:
from Ilifu /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB_cov_80_diverge_10/srst2/Ps_VF_cov_80_ID_90_report_compiledResults.txt
AMR results used:
from Ilifu /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/AMR_cov_80_diverge_10/srst2/Ps_AMR_cov_80_ID_90_report_compiledResults.txt
MLST results used:
Manually curated MLST results as previously described
SNP-based phylogenetic tree used:
The 'core SNPs" phylogenetic tree generated in the Tychus alignment module was used to generate publication quality plots in R. This core SNPs tree was generated using kSNP3 and is built based on SNPs that are present in all isolates considered (from Ilifu /cbio/users/katie/Nicol/SNPsAndPhylogenies_Ps_ref_genome/Trees/tree.core.tre)
R plots (phylogenetic SNP tree with heatmap of VFs and AMRs) were generated using the R packages ggtree, ggimage, GeneMates (heatMapPAM() function) and multiple Fisher's exact tests and MTC was conducted with metagenomeSeq's fitPA() function
The attached plot was up to date on 3/9/19
MLST results generated with srst2 that were classified as 'uncertain' (designated '?') were manually checked. The majority of uncertain hits were classified as such based on the fact that they had 1 or 2 low coverage bases at the first or last 2bp of the read. By doing multiple sequence alignments for all alleles for each of the 7 markers (acsA, aroE, guaA, mutL, nuoD, ppsA, trpE) I could establish whether the first 2bp and last 2bp were in fact necessary to distinguish from all the other alleles. In most cases these bases were not discriminatory and the 'uncertain' assignment could be passed. Alignment was done with MAFFT and viewed in Jalview. Example attached from acsA. In cases of SNPs (designated '*') srst2 short read alignment results were compared to the P. aeruginosa assembled contigs (from the Tychus assembly module).
nuoD: no difference in first or last two bases across all alleles in MLST file used
Results of manual curation:
acsA: only type 130 (acsA_130) differs in the last base from all other alleles. Type 16 and 11 vs. type 130 à several other changes so that 16 can be confidently distinguished from 130 without the last base
ppsA: several allele changes in first two and last two bases of seq, but manual check with mafft/jalview showed that ppsA_4, ppsA_33 and ppsA_6 can be distinguished from all other seqs independent of the first 2 and last2 bases.
aroE: handful of types with one bp change in 2nd bp of sequence but manual check with jalview shows all can still be distinguished without use of first two bases.
guaA: handful of types with one bp change in 2nd bp of sequence but manual check with jalview shows all can still be distinguished without use of first two bases.
mutL: Types 11 and 29 cannot be distinguished from type 216 if ignoring the first two bases
trpE: manual check with jalview shows all can still be distinguished without use of first two bases.
Note: srst2 by default flags a call as uncertain if –min_depth (the average depth across the entire allele) is less than 5. We will lower this to 4.
SNP analysis is done with freebayes (on quality trimmed and filtered reads) with default settings ( -C 2 -min_coverage 0) after which consensus fastas are built against the reference genome for P.aerug. These consensus fastas are fed to kSNP3 along with the reference genome for SNP calling and phylogenetic tree building. I ran a test of a small subset of samples (/ceph/cbio/users/katie/Nicol/Ps_small_test on Ilifu) to test the effect of different coverage cutoffs, including -C 10 and there was no change in the resulting kSNP3 phylogenetic tree structure. Freebayes uses information across all samples (bayesian approach) to call samples, making it robust to low coverage bases
Both read trim strategies and taxonomic DB used will be considered to try and improve species-level resolution compared to the data supplied by CPGR. CPGR used GreenGenes 13.5 and Illumina's commercial pipeline (for which the process is unknown).
Primary inspection of default parameters with our dada2 pipeline showed only 64/595 ASVs identified down to species level (The 595 is the filtered table after removing very low abundance ASVs from the original ~2900). See Wiki tab for more.
Mainly they are looking for improved species-level resolution with a few specific species they are hoping to see. I will run the data through our dada2 pipeline and take it from there.
The data was copied to ilifu /ceph/cbio/users/katie/Levin/raw/