Wiki
RNAseq pipeline execution
Because I did not have enough disk space available under my own user account on the HPC to execute Severin's run, I had to log in through his account (kmddon001) and run the pipeline from there.
The run report is attached. I had a lot of trouble with the Java-based processes: Andrew kept killing jobs here because they would grab more cores than allocated, which slowed down other users' jobs. In the end I had to use an entire node for each Java-dependent job (cpu=40). The pipeline runs for several days.
QC summary
This library was prepared with the rRNA depletion method (as opposed to polyA enrichment), which probably explains the high proportion of intergenic and pseudogene reads that we're seeing. Specifically, this paper (attached) shows that ribo-minus depletion results in very different profiles from the polyA method, particularly for blood samples, where they found they needed ~220% more reads to get the same exonic coverage as with polyA. They also mention: 'Our evaluation revealed that a small number of lncRNAs and small RNAs made up a large fraction of the reads in the rRNA depletion RNA sequencing data.'; and that 'A very high portion of reads (more than half in blood and one third in colon) mapped to intronic regions in the rRNA depleted libraries. The pattern in Fig. 2A indicates many immature and/or nascent RNA transcripts were captured in the rRNA depletion RNA-seq.'
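To get a feel for what the '~220% more reads' figure means for sequencing depth, here is a quick back-of-the-envelope sketch. It assumes '220% more' is meant literally (i.e. 3.2x the polyA depth); the function name and the 20 million read starting depth are made up for illustration and do not come from the paper or from this run.

```python
def reads_needed(polya_reads: int, extra_fraction: float = 2.2) -> int:
    """Reads an rRNA-depleted library would need to match polyA exonic coverage.

    extra_fraction=2.2 encodes "~220% more reads", read literally as
    (1 + 2.2) = 3.2 times the polyA depth. This reading is an assumption.
    """
    return round(polya_reads * (1.0 + extra_fraction))


# Hypothetical polyA depth of 20 million reads (not a number from this run)
print(reads_needed(20_000_000))  # 64000000
```

So under that reading, a depth that would be comfortable for a polyA library may be well short of equivalent exonic coverage for a blood-derived rRNA-depleted library.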
The STAR alignment results list a large proportion of reads as 'unmapped: too short'. This is not because the reads themselves were too short but because no sufficiently long alignment could be found for them. This could be due to rRNA contamination, as suggested at https://sites.google.com/site/dvanichkina/ngs#TOC-If-the-number-of-unmapped-reads-too-short-is-very-high:
'If your samples are "total RNA", depleted with Ribo-Zero or Ribo-Minus kits, it is possible that the depletion did not work well. rRNA are typically multi-mappers (and you get plenty of those), however, not all rRNA repeats make it into the main chromosomal assembly, and in this case they will not be mapped and will be reported as "alignment too short". We have recently had many cases like that in our lab for human tissues.'
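For comparing samples, the relevant percentages can be pulled out of STAR's Log.final.out, which lists each statistic as a 'name | value' line. The following is a minimal sketch only: the helper names are my own, and the numbers in the sample text are invented for illustration, not taken from this run.

```python
def parse_star_log(text: str) -> dict[str, str]:
    """Collect 'name | value' pairs from the text of a STAR Log.final.out."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headings and blank lines carry no value
        key, _, value = line.partition("|")
        stats[key.strip()] = value.strip()
    return stats


def percent(stats: dict[str, str], key: str) -> float:
    """Convert a value such as '41.50%' to the float 41.5."""
    return float(stats[key].rstrip("%"))


# Illustrative excerpt with made-up numbers (NOT the values from this run)
SAMPLE_LOG = (
    "                          Number of input reads |\t1000000\n"
    "                        Uniquely mapped reads % |\t38.20%\n"
    "             % of reads mapped to multiple loci |\t19.30%\n"
    "                 % of reads unmapped: too short |\t41.50%\n"
)

stats = parse_star_log(SAMPLE_LOG)
print(percent(stats, "% of reads unmapped: too short"))  # 41.5
```

A high 'too short' value alongside a high multi-mapper value would be consistent with the rRNA explanation quoted above, although it does not prove it.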
Finally, there is a high level of read duplication: e.g. ~30% of reads have duplication rates of >1000. This requires further investigation.
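As a starting point for that investigation, the share of reads that belong to highly duplicated sequences can be recomputed directly from the read sequences. This is only a rough sketch under my own assumptions: it counts exact sequence duplicates, which is not necessarily how the QC tool that produced the ~30% figure defines duplication, and the function name and toy reads are made up.

```python
from collections import Counter
from typing import Iterable


def fraction_in_high_dup(reads: Iterable[str], threshold: int = 1000) -> float:
    """Fraction of reads whose exact sequence occurs more than `threshold` times."""
    counts = Counter(reads)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Sum the reads contributed by every sequence above the threshold
    high = sum(n for n in counts.values() if n > threshold)
    return high / total


# Toy example with a low threshold so the effect is visible
toy_reads = ["ACGT"] * 3 + ["TTTT", "GGGG"]
print(fraction_in_high_dup(toy_reads, threshold=2))  # 0.6
```

Listing the most common sequences (for example with `Counter.most_common`) and checking them against rRNA sequences would show whether the duplication comes from the same source as the other QC issues.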
Looking at the downstream gene counts from the featureCounts process, ~40-50% of mapped reads are assigned to the pseudogene ENSG00000226958. The annotation for this gene seems to be ambiguous, with some records classifying it as CTD-2328D6.1 and others as RNA28S5 (RNA, 28S ribosomal 5).
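The per-gene share can be checked from the featureCounts count table, which is tab-separated with a leading '#' comment line and a header starting with Geneid. The sketch below is illustrative only: the function name, the sample column name (sample1.bam), the other gene IDs and all counts are invented. Note that the denominator here is reads assigned to genes in the table, so it is not exactly the same as 'mapped reads'.

```python
import csv


def gene_share(table_text: str, gene_id: str, sample_col: str) -> float:
    """Share of assigned reads in `sample_col` that fall on `gene_id`."""
    # Drop the '#' comment line that precedes the header
    lines = [ln for ln in table_text.splitlines() if not ln.startswith("#")]
    total = 0
    gene_count = 0
    for row in csv.DictReader(lines, delimiter="\t"):
        n = int(row[sample_col])
        total += n
        if row["Geneid"] == gene_id:
            gene_count += n
    return gene_count / total if total else 0.0


# Made-up table in featureCounts layout (counts are NOT from this run)
SAMPLE_TABLE = (
    "# Program:featureCounts (illustrative header line)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tsample1.bam\n"
    "ENSG00000226958\tchrX\t1\t100\t+\t100\t450\n"
    "GENE_A\tchr1\t1\t200\t+\t200\t300\n"
    "GENE_B\tchr2\t1\t300\t-\t300\t250\n"
)

print(gene_share(SAMPLE_TABLE, "ENSG00000226958", "sample1.bam"))  # 0.45
```

Running this per sample would show whether the 40-50% figure is consistent across the whole run or driven by a few libraries.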
Updated by Katie Lennard over 5 years ago · 1 revisions