Wiki
#Blackburn lipid biomarkers of active TB analysis overview
Brief description of the data from Suereta
We analyzed lipids extracted from intact aluminum foil disks using chloroform/methanol. Samples were collected using different filter sizes, i.e. 01, 2.5 and 10 micrometers. You will notice that in the Excel sheet labeled 'separated_filters' some samples have more than one entry (_10, _2.5, _01); these suffixes refer to the different filter sizes, and for some filters there is more than one file. There is also another sheet, labeled 'combined_filters', which combines all the separate filters for a single sample.
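As a rough illustration of how the separated entries relate to the combined sheet, here is a minimal Python sketch. The entry-name format (sample ID followed by the filter suffix, e.g. `TDRS_006_2.5`), the function names, and the choice to sum intensities across filters are all assumptions for illustration; the notes do not say how the combined sheet was actually produced.

```python
from collections import defaultdict

# Filter-size suffixes as they appear in the 'separated_filters' sheet.
FILTER_SUFFIXES = ("_10", "_2.5", "_01")


def split_entry(entry):
    """Split an entry name such as 'TDRS_006_2.5' into (sample_id, filter_label).

    The naming scheme is an assumption, not taken from the actual sheet.
    """
    for suffix in FILTER_SUFFIXES:
        if entry.endswith(suffix):
            return entry[: -len(suffix)], suffix.lstrip("_")
    raise ValueError(f"no recognised filter suffix in {entry!r}")


def combine_filters(separated):
    """Merge all filter entries of one sample by summing per-lipid intensities.

    `separated` maps entry name -> {lipid ID: intensity}.
    Summing is an assumed combination rule.
    """
    combined = defaultdict(lambda: defaultdict(float))
    for entry, intensities in separated.items():
        sample_id, _ = split_entry(entry)
        for lipid, value in intensities.items():
            combined[sample_id][lipid] += value
    return {sample: dict(lipids) for sample, lipids in combined.items()}
```

If the real entry names differ (for example when a filter has more than one file), only `split_entry` would need to change.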
The extracts were mixed with 2,5-dihydroxybenzoic acid and analyzed using the MALDI mass spectrometer.
The raw data were analyzed using an in-house developed pipeline, in combination with a mycobacterial database, to identify M. tuberculosis lipids. In a separate search we used Metabosearch to identify human lipids.
Further, Zandi used the R package Mass Spectrometry Wavelet on both the separate and combined datasets. I will be attaching the dataset from this analysis to the email.
In the sample list you will notice batch 1 (controls: E_127 to E_168; cases: TDRS_006 to TDRS_024) and batch 2 (TDRS_027 to TDRS_083). These were analyzed on two separate occasions on the MALDI-MS. We can see a batch effect when we look at the data after our analysis.
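To check for the batch effect it helps to attach a batch label to every sample ID. The following Python sketch encodes the ranges given above; the function name `annotate` and the returned dictionary layout are hypothetical, and the case/control status of batch 2 is left as `None` because it is not stated here.

```python
import re

# (prefix, first number, last number, batch, group), copied from the ranges above.
RANGES = [
    ("E", 127, 168, 1, "control"),
    ("TDRS", 6, 24, 1, "case"),
    ("TDRS", 27, 83, 2, None),  # group for batch 2 is not stated in these notes
]


def annotate(sample_id):
    """Return {'batch': ..., 'group': ...} for an ID such as 'TDRS_006'."""
    match = re.fullmatch(r"([A-Z]+)_(\d+)", sample_id)
    if match is None:
        raise ValueError(f"unexpected sample ID format: {sample_id!r}")
    prefix, number = match.group(1), int(match.group(2))
    for range_prefix, first, last, batch, group in RANGES:
        if prefix == range_prefix and first <= number <= last:
            return {"batch": batch, "group": group}
    raise ValueError(f"{sample_id!r} falls outside the listed ranges")
```

IDs that fall between the listed ranges (for example TDRS_025) raise an error instead of being silently assigned to a batch.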
Important observations (Katie)
The dataset contains strong technical artifacts, which are apparently also present within batch 1 between cases and controls. This is based on the heavy autocorrelation detected in the dataset, which currently precludes detection of robust biomarkers.
See plots under the files section for more.
Summary of results for Jonathan, Suereta and Zandi:
*Herewith a summary of results for RF analysis performed on the raw data (dataset 1) from Suereta.
RF analysis on raw data including all 2279 lipids (N=16 active and N=10 latent): as with Zandi's data, there is no stability in the IDs of the 'top' biomarkers selected. In my opinion this is because the results are driven by distributional differences between samples rather than biological ones.
If we filter the data down to those features that are present in at least 15% of samples from batch 1 (this leaves 323/2279 features), in an effort to isolate biologically relevant features, we are still unable to identify reliable biomarkers: a new set of 'top features' is identified with each new randomly selected start seed.
To me, the sparsity of positive counts in dataset 1 (for the majority of samples) indicates that any standard attempt at normalization will introduce artifacts into the dataset. It is not just an issue of overall signal intensity; it looks like a detection/sensitivity issue in most samples.
I did, however, perform differential abundance testing with edgeR (which calculates a normalization factor for each sample that is included in the model). Similar to the RF results, all significant results appear to be due to technical artifacts and not biologically relevant differences. This conclusion is based on the fact that all differentially abundant features are increased in the control group (latent) compared to the TB-active group, which mirrors the pattern we see on the initial exploratory heatmap of more features detected in the control vs. the TB-active group.
I've attached the results, but I do not recommend using them. (The duplicate set of RF results is exactly the same, performed on features present in at least 15% of samples, except for the start seed.)*
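The prevalence filter and the seed-to-seed instability described in the summary can be sketched in a few lines of Python. This is not the code used for the analysis: the function names and data layout are hypothetical, and the mean pairwise Jaccard overlap is just one possible way to quantify how much the 'top feature' lists change between start seeds.

```python
from itertools import combinations


def prevalence_filter(counts, min_fraction=0.15):
    """Keep features with a positive count in at least `min_fraction` of samples.

    `counts` maps feature ID -> list of per-sample counts (same sample order).
    """
    kept = {}
    for feature, values in counts.items():
        present = sum(1 for value in values if value > 0)
        if values and present / len(values) >= min_fraction:
            kept[feature] = values
    return kept


def mean_jaccard(top_sets):
    """Mean pairwise Jaccard overlap between top-feature sets from different seeds.

    1.0 means every seed selected the same features; values near 0 mean the
    selections barely overlap.
    """
    pairs = list(combinations(top_sets, 2))
    if not pairs:
        raise ValueError("need at least two feature sets to compare")
    scores = []
    for first, second in pairs:
        first, second = set(first), set(second)
        union = first | second
        scores.append(len(first & second) / len(union) if union else 1.0)
    return sum(scores) / len(scores)
```

In use, `counts` would hold the batch 1 count matrix and `top_sets` the top-ranked feature IDs from each RF run; an overlap close to zero would correspond to the instability reported above.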
Updated by Katie Lennard over 6 years ago · 2 revisions