Wiki¶
Library prep summary¶
Sample concentration and quality were assessed with the Eukaryote Total RNA Pico assay on an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA). Samples were treated with DNase prior to library preparation. Library preparation was performed with SMARTer Stranded Total RNA (Clontech Inc, Mountain View, CA) following the manufacturer’s instructions. The average final library size was between 300 and 400 bp. Illumina 8-nt dual indices were used for multiplexing. Samples were pooled and sequenced on an Illumina HiSeq X sequencer in paired-end mode with a 150 bp read length, with an output of 80 million reads per sample.
Library prep QC¶
Sample QC reports are attached. Most samples had VERY low RIN scores.
Data location¶
The data are available as compressed fastq files (approximately 600 GB once unzipped). The fastq files and QC reports were downloaded from the Genohub server using the AWS CLI tools (see attached screenshot for instructions). Andrew installed the AWS CLI on the HPC worker nodes to facilitate the transfer. Note that there were two rounds of data, the first dated Jan 2020 and the second July 2020. The second round was needed because I noticed strange TTTT... repeats in the QCed data. When queried, the company realised that they had mistakenly performed small RNA library prep instead of total RNA library prep. They therefore had to redo the libraries and rerun the sequencing. The second round of data was downloaded to the HPC with
aws s3 cp s3://genohub8705245/20028-01-new/ . --recursive
to
/scratch/kviljoen/Sonwabile_RNAseq/raw/
du -sh for the second download reported 317 GB across 40 fastq.gz files.
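As a cross-check on the du -sh figure, the file count and combined size can also be tallied programmatically. The following is a minimal Python sketch, not part of the original workflow: the function name summarize_fastq_dir is hypothetical, and it sums file sizes rather than disk blocks, so the total may differ slightly from what du reports.

```python
from pathlib import Path


def summarize_fastq_dir(directory):
    """Return (number of *.fastq.gz files, their combined size in bytes).

    Searches the directory recursively, mirroring the --recursive download.
    """
    files = sorted(Path(directory).rglob("*.fastq.gz"))
    total_bytes = sum(f.stat().st_size for f in files)
    return len(files), total_bytes
```

Run on the HPC against /scratch/kviljoen/Sonwabile_RNAseq/raw/, this would be expected to report 40 files if the second download completed; a lower count would suggest an interrupted transfer.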
Bioinformatic analyses requested¶
Standard RNA sequencing analysis including quality assessment, data normalization, alignment, gene mapping, pairwise comparisons, functional enrichment and visualization.
Papers envisaged¶
Data from this analysis will either be incorporated in a manuscript phenotyping the changes in immune cells (T regulatory and Th17 cells) during infancy or be published as a stand-alone manuscript. The authors will include the team in the Clive Gray and Heather Jaspan group involved in this work, together with the bioinformatician from CBIO who is willing to collaborate on this analysis.
RNAseq QC¶
Preliminary QC indicates substantial rRNA content, high levels of duplication, a very high proportion of reads too short to map, and Illumina adapter contamination. The Illumina adapters are usually removed by this pipeline, but in this case they seem to have been missed (perhaps because they are not right at the end of the read and occur at variable positions across reads). I will therefore use bbduk (as implemented in the YAMP pipeline and now in https://github.com/kviljoen/fastq_QC).
The default Phred score for bbduk trimming in the fastq_QC pipeline is 10 (regions with average quality BELOW this will be trimmed). However, after trimming with the default score of 10 I noticed severe levels of TTTTTTTT repeats (of varying lengths, in some cases spanning the whole read). I therefore raised the threshold to 15, as most of these T repeats had quality scores of 12 (ASCII '-').
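To illustrate why raising the threshold from 10 to 15 matters for these reads: assuming the standard Phred+33 encoding (an assumption, worth confirming against the fastq files), the character '-' has ASCII code 45 and therefore decodes to a quality of 12, which lies above 10 but below 15. The sketch below decodes a quality string and measures the longest T run in a read. The function names are hypothetical, the example read is made up, and the whole-read mean is a simplification of the region-based trimming bbduk performs.

```python
import re

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def phred_scores(quality_string, offset=PHRED_OFFSET):
    """Decode a FASTQ quality string into a list of integer Phred scores."""
    return [ord(char) - offset for char in quality_string]


def longest_t_run(sequence):
    """Return the length of the longest uninterrupted run of T in a read."""
    runs = re.findall(r"T+", sequence.upper())
    return max((len(run) for run in runs), default=0)


def falls_below(quality_string, threshold):
    """True if the mean decoded quality is below the given threshold."""
    scores = phred_scores(quality_string)
    if not scores:
        return False
    return sum(scores) / len(scores) < threshold


# A made-up read consisting entirely of T, with quality '-' at every base.
read_seq = "T" * 20
read_qual = "-" * 20

print(phred_scores("-"))           # decoded score for the '-' character
print(longest_t_run(read_seq))     # the whole read is a single T run
print(falls_below(read_qual, 10))  # default threshold
print(falls_below(read_qual, 15))  # raised threshold
```

For this made-up read the mean quality is 12, so it survives a threshold of 10 but is flagged at 15, which matches the behaviour described above.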
Stranded library¶
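The libraries were prepared with the SMARTer Stranded RNA kit. For how the pipeline handles strandedness, see https://github.com/kviljoen/RNAseq/blob/master/docs/usage.md#library-strandedness and, for a summary of library types by kit, https://chipster.csc.fi/manual/library-type-summary.html
For this library prep we should therefore use the flag --forwardStranded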
Downstream R analysis summary¶
Downstream analysis was performed in R. The major limitation is sample size per group: with age, gut-homing (GH) status and HIV exposure status all included in the study, only 2-3 samples remain per group. Each infant has a GHneg and a GHpos sample, but there are not enough samples (degrees of freedom) to code infant ID into the model. The only way to do the differential analysis was therefore to separate the GHneg and GHpos samples first and pool the birth and week 15 samples, with HIV exposure status as the main variable of interest. The poor study design and the low RNA quality together greatly limit any conclusions that can be drawn from these results.
Differential expression testing was conducted with limma's voom and voomWithQualityWeights functions, the latter of which includes sample quality weights in the model. Limma's lmFit() function was used on the voomWithQualityWeights-normalized data. Downstream heatmaps were produced with the normalized data as input. There were no significant results after multiple testing correction (MTC), but results with unadjusted p-values < 0.05 were used for SPIA pathway analysis (performed separately for GHneg and GHpos). SPIA pathway analysis did not produce any significant results after MTC either, but certain pathways of potential interest with p < 0.05 were presented as heatmaps (of the relevant genes) on request from Sonwabile for publication. Results are attached under the Files tab.
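The gap between unadjusted and corrected p-values can be illustrated with a small example. The sketch below implements the Benjamini-Hochberg adjustment in plain Python. This is an assumption, since the correction method used is not stated here (BH is the default in limma's topTable), and the p-values are invented for illustration, not taken from this dataset.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Invented p-values: three of the four are below 0.05 before adjustment.
example_pvalues = [0.02, 0.04, 0.03, 0.5]
example_adjusted = benjamini_hochberg(example_pvalues)

print(sum(p < 0.05 for p in example_pvalues))   # count before adjustment
print(sum(p < 0.05 for p in example_adjusted))  # count after adjustment
```

In this invented case none of the three nominally significant values remains below 0.05 after adjustment, which is the same pattern seen in the results described above.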
Updated by Katie Lennard about 4 years ago · 7 revisions