Katie Lennard, 07/06/2020 03:44 PM
Wiki
Library prep summary
Sample concentration and quality were assessed by Eukaryote Total RNA Pico on an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA). Samples were treated with DNase prior to library preparation. Library preparation was performed with SMARTer Stranded Total RNA (Clontech Inc, Mountain View, CA) following the manufacturer’s instructions. Average final library size is between 300 and 400 bp. Illumina 8-nt dual indices were used for multiplexing. Samples were pooled and sequenced on an Illumina HiSeq X sequencer at 150 bp read length in paired-end mode, with an output of 80 million reads per sample.
Library prep QC
Sample QC reports are attached. RIN scores were mostly VERY low.
Data location
Data is available as compressed fastq files, approximately 600 GB after unzipping. The .fastq files and QC reports were downloaded from the Genohub server using the AWS CLI tools (see attached screenshot for instructions). Andrew installed the AWS CLI on the HPC worker nodes to facilitate the transfer. Note that there were two rounds of data, the first dated Jan 2020 and the second July 2020. The second round was needed because I noticed strange TTTT... repeats in the QCed data. When queried, the company realised that they had mistakenly performed small RNA library prep instead of total RNA, so they had to redo the libraries and rerun the sequencing. The second round of data was downloaded to the HPC with
aws s3 cp s3://genohub8705245/20028-01-new/ . --recursive
to
/scratch/kviljoen/Sonwabile_RNAseq/raw/
du -sh for the second download reported 317 GB across 40 fastq.gz files.
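A quick sanity check of the download directory can be sketched as follows. This is a minimal sketch and not part of the original workflow: the helper name check_fastq_dir is hypothetical, and it assumes find, gzip and wc are available on the node. It counts the .fastq.gz files under a directory (recursively) and reports how many fail gzip's integrity test.

```shell
# Hypothetical helper: summarise a download directory.
# Prints the number of .fastq.gz files found and how many fail `gzip -t`.
check_fastq_dir() {
    dir=$1
    # Count all .fastq.gz files below the directory.
    n_files=$(find "$dir" -type f -name '*.fastq.gz' | wc -l)
    # Run the gzip integrity test on each file; echo the path only if it fails.
    n_bad=$(find "$dir" -type f -name '*.fastq.gz' \
                -exec sh -c 'gzip -t "$1" 2>/dev/null || echo "$1"' _ {} \; | wc -l)
    # Arithmetic expansion strips any padding that wc adds on some systems.
    printf 'files=%d corrupt=%d\n' "$((n_files))" "$((n_bad))"
}
```

Running check_fastq_dir /scratch/kviljoen/Sonwabile_RNAseq/raw/ should report 40 files for the second download if the transfer completed. Note that gzip -t reads every file in full, so on a few hundred GB of data it will take a while.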
Bioinformatic analyses requested
Standard RNA sequencing analysis including quality assessment, data normalization, alignment, gene mapping, pairwise comparisons, functional enrichment and visualization.
Papers envisaged
Data from this analysis will be incorporated in a manuscript phenotyping the changes in immune cells (T regulatory and Th17 cells) during infancy, or published as a stand-alone manuscript. The authors will include the team members in the Clive Gray and Heather Jaspan group who are involved in this work, together with the bioinformatician from CBIO who is willing to collaborate on this analysis.
RNAseq QC
Preliminary QC indicates substantial rRNA content, high levels of duplication, a very high proportion of reads too short to map, and Illumina adapter contamination. The Illumina adapters are usually removed by this pipeline, but in this case they seem to have been missed (perhaps because they are not right at the end of the read and occur at relatively variable positions across reads). I will therefore use bbduk (as implemented in the YAMP pipeline and now in https://github.com/kviljoen/fastq_QC).
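For orientation, k-mer based adapter trimming with bbduk typically looks like the sketch below. This is an illustration under stated assumptions, not the exact invocation used by YAMP or fastq_QC: the helper name bbduk_adapter_cmd, the output file names and the k-mer settings (k=23, mink=11, hdist=1, taken from common BBDuk usage) are all assumptions here. The helper only prints the command so it can be inspected before being run.

```shell
# Hypothetical helper: print (do not run) a BBDuk command for paired-end adapter trimming.
# Arguments: R1 R2 ADAPTERS_FASTA (all placeholders; output names are illustrative).
bbduk_adapter_cmd() {
    r1=$1; r2=$2; adapters=$3
    printf 'bbduk.sh in=%s in2=%s out=trimmed_R1.fastq.gz out2=trimmed_R2.fastq.gz ' "$r1" "$r2"
    # ktrim=r : when a reference k-mer matches, trim it and everything to its right
    # k/mink  : k-mer length, and the shorter k-mers allowed at the read end
    # hdist=1 : allow one mismatch per k-mer
    # tpe/tbo : trim both mates to equal length / also trim adapters found by pair overlap
    printf 'ref=%s ktrim=r k=23 mink=11 hdist=1 tpe tbo\n' "$adapters"
}
```

Because ktrim=r trims from wherever the adapter k-mer matches, the adapter does not need to sit right at the end of the read, which fits the variable adapter positions seen in this data.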
The default phred score for bbduk trimming in the fastq_QC pipeline is 10 (regions with average quality BELOW this will be trimmed). However, after trimming with the default phred score of 10 I still noticed severe levels of TTTTTTTT repeats (of varying lengths, in some cases spanning the whole read). I therefore raised the threshold to 15, as most of these T repeats had quality scores of 12 (ASCII '-').
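The quality character quoted above can be checked by hand. In the Phred+33 encoding used by current Illumina FASTQ files, the score is the ASCII code minus 33, so '-' (ASCII 45) decodes to 12, which passes a cutoff of 10 but not one of 15. The sketch below shows that decoding and a simple way to count reads carrying long T runs. Both helper names (phred33, count_polyT) and the default run length of 20 are assumptions for illustration, and count_polyT assumes standard four-line FASTQ records on standard input.

```shell
# Hypothetical helper: decode one Phred+33 quality character to its numeric score.
phred33() {
    code=$(printf '%d' "'$1")   # a leading quote makes printf emit the character's ASCII code
    echo $((code - 33))
}

# Hypothetical helper: read FASTQ on stdin and print "<reads with a T run>\t<total reads>".
# Argument: minimum run length of consecutive T's (default 20, an arbitrary choice).
count_polyT() {
    awk -v n="${1:-20}" '
        BEGIN { pat = sprintf("%" n "s", ""); gsub(/ /, "T", pat) }   # build a string of n T characters
        NR % 4 == 2 { total++; if (index($0, pat) > 0) hits++ }       # sequence line of each record
        END { printf "%d\t%d\n", hits + 0, total + 0 }'
}
```

For example, phred33 '-' prints 12. The poly-T count can be run before and after trimming on a decompressed stream, e.g. zcat on one of the fastq.gz files piped into count_polyT 20, to see whether raising the threshold actually reduced the repeats.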
#Stranded library
SMARTer Stranded RNA kit: see https://github.com/kviljoen/RNAseq/blob/master/docs/usage.md#library-strandedness and https://chipster.csc.fi/manual/library-type-summary.html for the library type. For this library prep we should therefore use the flag --forwardStranded.
Updated by Katie Lennard almost 5 years ago · 6 revisions