Meeting with Mandy
Discussion of data analysis challenges
Mandy currently analyses her CRISPRi-seq experiments with MAGeCK-VISPR, but could not get the visualization component (VISPR) to work. MAGeCK analyses the sequencing counts and summarizes them at the sgRNA and gene level, with typically 4-5 sgRNAs selected as 'good' for each gene. There are two methods to choose from: a rank-based method (MAGeCK-RRA), which is suitable for two-class comparisons, and a GLM-based method (MAGeCK-MLE), which is suitable for multiclass comparisons and more complex experiments. To date Mandy has mainly used MAGeCK-RRA but has also recently tried MAGeCK-MLE.
The output from MAGeCK is imported into R along with an annotation file for functional annotation that Mandy generated herself for MTB. The subset of ~2000 genes that they are targeting in the M. smegmatis system is closely related to M. tuberculosis, as this is the main interest. Genes from the M. smegmatis experiment can therefore be mapped to the Tuberculist DB genes and COG functions.
Mandy has not found a suitable tool to automate functional enrichment analyses, the MTB DB being the main limitation. For instance, the R package MAGeCKFlute does have a capability for functional enrichment, but when M. smegmatis is selected as the organism with organism="msmeg" it does not find it. So according to Mandy:
For the functional analysis I had originally been using COG categories and Tuberculist categories from a previous paper. However, for the publication I need updated lists.
For the updated Tuberculist categories (now stored in the Mycobrowser database) I just downloaded an up-to-date GTF file, from which I have extracted the annotations and appended them to my data.
For the COG categories I could not find a gene-COG category list for Mtb and so I took the liberty of generating my own through eggNOG (using the M. smegmatis protein coding regions).
*For the output I sometimes have multiple COG categories for one gene, which I am not sure how to handle in subsequent enrichment analysis.*
I wrote Mandy a little R script to sum these categories.
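The R script itself is not reproduced in these notes. As a rough illustration of one way to handle genes with several COG categories, the Python sketch below splits a multi-letter assignment (eggNOG typically reports these as a string such as "KL") into single letters and counts the gene once under each letter. The gene IDs and assignments are made-up placeholders, and counting a gene under every one of its categories is an assumption, not necessarily what the script did.

```python
from collections import Counter

def split_cog(categories):
    """Split a multi-letter COG string such as 'KL' into unique single letters."""
    # '-' or an empty string is treated as "no COG assigned" (assumption).
    return sorted({c for c in categories.strip() if c.isalpha()})

def count_cog_categories(gene_to_cog):
    """Count how many genes fall in each COG category.

    A gene with several categories contributes one count to each of them.
    """
    counts = Counter()
    for categories in gene_to_cog.values():
        counts.update(split_cog(categories))
    return counts

# Hypothetical example input: gene ID -> COG category string.
example = {"geneA": "K", "geneB": "KL", "geneC": "L", "geneD": "-"}
cog_counts = count_cog_categories(example)
print(dict(cog_counts))
```

With this convention the category totals can add up to more than the number of genes, which is worth keeping in mind when the counts feed into an enrichment test.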
For the enrichment analysis I am stuck on what method to use. I have previously used GO functions (many years back) with software such as FatiGO etc. However, for mycobacteria I have the two functional lists that I am using (COG and Mycobrowser/Tuberculist functional categories) rather than GO functions.
Mandy decided that since the COGs were more complete (a large proportion of genes had no GO annotation) she would focus on those.
To test for enrichment I need to use a Fisher's exact test. I have my data in R and would ideally like to do this as part of my data processing and visualization pipeline; however, I am not sure how to do this or how to put my data in the correct format. I can do this outside of R if needed.
Mandy has since found suitable scripts to do this with MTC in R.
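The scripts Mandy found are not included here. For reference, the calculation can be sketched as follows, assuming that "MTC" stands for multiple-testing correction and that an over-representation (one-sided) test is wanted. The sketch is in Python using only the standard library. The function names and the example counts are hypothetical, and Benjamini-Hochberg is just one common choice of correction.

```python
from math import comb

def fisher_enrichment_p(hits_in_cat, hits_total, cat_total, background_total):
    """One-sided Fisher's exact test for over-representation.

    Returns P(X >= hits_in_cat), where X is hypergeometric: hits_total genes
    are drawn from background_total genes, of which cat_total are in the category.
    """
    denom = comb(background_total, hits_total)
    upper = min(hits_total, cat_total)
    tail = sum(
        comb(cat_total, k) * comb(background_total - cat_total, hits_total - k)
        for k in range(hits_in_cat, upper + 1)
    )
    return tail / denom

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical counts: category -> (hit genes in category, all genes in category).
background_total = 2000   # e.g. all targeted genes (placeholder)
hits_total = 100          # e.g. genes called significant (placeholder)
categories = {"K": (15, 150), "L": (5, 100), "C": (2, 60)}

names = list(categories)
raw_p = [
    fisher_enrichment_p(categories[n][0], hits_total, categories[n][1], background_total)
    for n in names
]
adj_p = benjamini_hochberg(raw_p)
for name, p, q in zip(names, raw_p, adj_p):
    print(f"{name}\tp={p:.4g}\tadjusted={q:.4g}")
```

The background here is the set of targeted genes rather than the whole genome, since only those genes could appear as hits. That choice is also an assumption and should match how the screen was designed.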
Comments