Wiki
#Blackburn lipid biomarkers of active vs. latent vs no TB analysis overview
Brief description of the project/data from Zandi
The objective of the random forest analysis is to identify small organic molecules that are differentially abundant between active TB, LTBI (latent TB infection) and nonTB (not TB-infected) subjects.
• Zandi supplied four data files (raw, no normalization): one for ESI+ mode, one for ESI- mode, and two more, again split by ESI mode, containing only the samples for which LAM status was available. These last two files are unnecessary and should NOT be used: the same samples are also present in the first two files, with slightly different values because the data for these LAM-status samples were processed twice (once in isolation and once together with all other samples).
• Samples were TB-negative, TB-positive or TB-latent.
• The distribution of the data seemed strange to me: proteomics/lipidomics data are usually zero-inflated, but these datasets contain almost no zeros. Zandi's explanation for this:
You are spot on, the reason why the data is not zero inflated like Sue's data is because the data sent to you only contains "chromatographic peaks" data, that is, only peaks with an isotopic series were considered.
Intensities of these peaks (m/z) belonging to the same isotopic series were summed and represented by the monoisotopic m/z. So, basically collapsing the m/z's of the same isotopic series to 1 and removing all other m/z without a series, because these might be systemic noise.
• From the heatmap it looks like normalization should be performed. Zandi normally uses quantile normalization. For edgeR testing, however, raw data should be used, because the normalization factors used in modeling are calculated from the raw data.
So, I've used quantile-normalized, log2-transformed values for all heatmaps and for the RF analysis, and raw values for the edgeR analyses.
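For readers unfamiliar with the preprocessing used for the heatmaps and RF input, here is a minimal sketch of quantile normalization followed by a log2 transform. It is an illustration in plain Python, not the code used for this project. The function names, the toy matrix, the pseudocount of 1 and the simple handling of ties (tied values are ranked in input order rather than averaged) are all assumptions made for the example.

```python
import math


def quantile_normalize(matrix):
    """Quantile-normalize a features x samples matrix (list of rows).

    Every sample (column) is forced onto the same reference distribution:
    the mean, across samples, of the values at each rank.
    Assumption: tied values are ranked in input order, not averaged.
    """
    n_features = len(matrix)
    n_samples = len(matrix[0])

    # One list of intensities per sample (column).
    columns = [[row[j] for row in matrix] for j in range(n_samples)]

    # Reference distribution: mean of the r-th smallest value over all samples.
    sorted_cols = [sorted(col) for col in columns]
    reference = [
        sum(sc[r] for sc in sorted_cols) / n_samples for r in range(n_features)
    ]

    # Replace each value by the reference value at its within-sample rank.
    normalized = [[0.0] * n_samples for _ in range(n_features)]
    for j, col in enumerate(columns):
        order = sorted(range(n_features), key=lambda i: col[i])
        for rank, i in enumerate(order):
            normalized[i][j] = reference[rank]
    return normalized


def log2_transform(matrix, pseudocount=1.0):
    """log2(x + pseudocount); the pseudocount is an assumption to guard against zeros."""
    return [[math.log2(v + pseudocount) for v in row] for row in matrix]


# Toy intensities: 4 features (rows) x 3 samples (columns), made up for illustration.
raw = [
    [5.0, 4.0, 3.0],
    [2.0, 1.0, 4.0],
    [3.0, 4.0, 6.0],
    [4.0, 2.0, 8.0],
]

qn = quantile_normalize(raw)   # used for heatmaps / RF after the log2 step
logged = log2_transform(qn)
```

After normalization every sample column contains the same set of values, so differences in overall intensity between samples are removed. The edgeR analyses would still start from the untouched `raw` matrix, as described above.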
Results summary
The results indicate clear differences between all three groups (TB, LTBI, nonTB). Two analyses were performed, each repeated for the ESI+ and ESI- input data: a) random forest with all three categories as outcome and b) edgeR differential abundance testing, subsetting the data to compare two categories at a time, e.g. LTBI vs. TB. Regarding differences by LAM status, there were N=5 TB+ samples that were LAM+ and N=5 TB+ samples that were LAM-; edgeR was used to compare these two groups but found no significant features. See the Files section for results.
Updated by Katie Lennard over 6 years ago · 2 revisions