Revision 3 (Ephie Geza, 12/05/2022 09:13 AM) → Revision 4/6 (Ephie Geza, 01/24/2023 05:21 PM)
# Wiki
## AIM: To develop a predictive algorithm to determine whether an infectious or other non-infectious cause is likely or not.
The aim will be achieved based on
1. Human RNA-Seq and downstream analysis, specifically related to immune system genes
1. Assess the human immune system genes DNA (in particular, but not limited to, interferon, cytokines and chemokines)
## Sample data for all the participants is on ilifu in
/cbio/projects/017/definitive/
## Detailed information regarding participants is provided in a txt file
/cbio/projects/017/patients_clinical_details.txt
Of the planned 47 participants, COVC04, COVC07, COVC23 and COVC30 were excluded based on the clinical notes shared by Ruan Marais on 18 July 2022 on Slack: https://cbio.slack.com/files/U02LWC4GQTE/F03PZ1H8J0J/table_1_-_clinical_details.xlsx.
As at **10 August 2022**, one participant, COVC26, is outstanding in **/cbio/projects/017/definitive/**, so the metadata file excludes this participant.
> /cbio/projects/017/metadata.txt
`metadata.txt` consists of three columns taken from
> /cbio/projects/017/patients_clinical_details.txt
It was created by reading the .xlsx file in R and writing out the "samplename", "COVID-19 status" and "Neurological symptoms due to COVID-19" columns.
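A quick sanity check on the resulting file can be sketched in shell. This is only an illustration, not part of the original workflow: it assumes `metadata.txt` is tab-delimited, that the first column holds the participant ID exactly as written above (e.g. `COVC04`), and the function name `check_metadata` is hypothetical.

``` shell
#!/usr/bin/env bash
# Hypothetical helper: verify that every line of a metadata file has exactly
# three tab-separated columns and that none of the excluded participants appear.
# Assumption: tab-delimited file, participant ID in column 1.
check_metadata() {
    local file="$1"
    shift
    # Every line (header included) must have exactly three fields.
    if ! awk -F'\t' 'NF != 3 { bad = 1 } END { exit bad + 0 }' "$file"; then
        echo "column count mismatch"
        return 1
    fi
    # None of the remaining arguments (excluded IDs) may occur in column 1.
    local id
    for id in "$@"; do
        if awk -F'\t' -v id="$id" '$1 == id { found = 1 } END { exit !found }' "$file"; then
            echo "excluded participant present: $id"
            return 1
        fi
    done
    echo "ok"
}
```

For example, `check_metadata /cbio/projects/017/metadata.txt COVC04 COVC07 COVC23 COVC30 COVC26` would be expected to print `ok` if the exclusions described above were applied.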
## Important things to note:
We perform the RNA-seq gene counting using the nf-core/rnaseq pipeline.
`nf-core/rnaseq` performs read quality checks using **FastQC**, read trimming with **Trim Galore!**, read mapping with **STAR** and quantification with **Salmon**.
To run the pipeline, we create a **samplesheet.csv** for the analysis using **fastq_dir_to_samplesheet.py**, obtained from **nf-core** with **wget -L https://raw.githubusercontent.com/nf-core/rnaseq/master/bin/fastq_dir_to_samplesheet.py**. We then made the file executable:
``` shell
chmod 755 fastq_dir_to_samplesheet.py
```
Then run the script:
``` shell
./fastq_dir_to_samplesheet.py /cbio/projects/017/definitive/ /cbio/projects/017/analysis/samplesheet.csv --strandedness reverse
```
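Before launching the pipeline it can help to inspect the generated sheet. The sketch below assumes the script wrote the usual `sample,fastq_1,fastq_2,strandedness` header with unquoted, comma-separated fields; the function name `check_samplesheet` is hypothetical and the check is not part of the original workflow.

``` shell
#!/usr/bin/env bash
# Hypothetical helper: confirm the samplesheet header and count the rows
# per strandedness value.
# Assumption: four unquoted comma-separated columns, strandedness in column 4.
check_samplesheet() {
    local sheet="$1"
    local expected="sample,fastq_1,fastq_2,strandedness"
    local header
    header=$(head -n 1 "$sheet")
    if [ "$header" != "$expected" ]; then
        echo "unexpected header: $header"
        return 1
    fi
    # Skip the header and blank lines, then tally column 4.
    awk -F',' 'NR > 1 && NF { n[$4]++ } END { for (s in n) print s, n[s] }' "$sheet" | sort
}
```

Running `check_samplesheet /cbio/projects/017/analysis/samplesheet.csv` should then list a single `reverse` line if `--strandedness reverse` was applied to every row.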
## Run the `nf-core/rnaseq` pipeline
``` shell
sbatch /cbio/projects/017/rnaseq/rnaseq-pipeline.sh
```
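The contents of `rnaseq-pipeline.sh` are not reproduced on this page. As a rough idea of what such a wrapper might run, the sketch below only prints a Nextflow command rather than executing it. The `--genome GRCh38` and `-profile singularity` choices, the output directory and the function name `build_rnaseq_cmd` are assumptions for illustration, not the settings actually used.

``` shell
#!/usr/bin/env bash
# Hypothetical sketch: print (do not run) an nf-core/rnaseq command line.
# Assumptions: GRCh38 reference, Singularity profile, star_salmon aligner.
build_rnaseq_cmd() {
    local samplesheet="$1"
    local outdir="$2"
    printf 'nextflow run nf-core/rnaseq --input %s --outdir %s --aligner star_salmon --genome GRCh38 -profile singularity\n' \
        "$samplesheet" "$outdir"
}

# Example with a placeholder output directory:
build_rnaseq_cmd /cbio/projects/017/analysis/samplesheet.csv /path/to/results
```

Printing the command first makes it easy to review the parameters before placing them in a script submitted with `sbatch`.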
Once the quantification results **(star_salmon)** are available, downstream analysis is done in **R** on a local machine. The **working directory** and its results subdirectory are
> /home/ephie/UCT-DATA_ANALYST/BioinformaticsSupportTeam/ruan/definitive/
> /home/ephie/UCT-DATA_ANALYST/BioinformaticsSupportTeam/ruan/definitive/results/

The analysis exists in several versions:
## v0 **R script**
Details of this analysis and the results are given under <https://bst.cbio.uct.ac.za/redmine/attachments/198>.
``` shell
/home/ephie/UCT-DATA_ANALYST/BioinformaticsSupportTeam/ruan/definitive/dge_downstream.R
```
We grouped the samples based on encephalitic (yes or no), COVID-19 status (possible or unlikely) and immunosuppression (yes or no).
## v1
Details of the analysis and the design are provided in <https://bst.cbio.uct.ac.za/redmine/attachments/196>.
## v2
Details of the analysis and the design are provided in <https://bst.cbio.uct.ac.za/redmine/attachments/197>.
Generally, the downstream analysis used **DESeq2** for differential gene expression analysis, together with other **R packages** including **ggplot2**. In short, the **R script** does the following:
1. Count normalization, i.e. creation of the DESeqDataSet object
1. Exploratory data analysis (PCA & hierarchical clustering) to identify outliers and sources of variation in the data
1. Running DESeq2 using the "DESeq" function
1. Check the fit of the dispersion estimates using "plotDispEsts"
1. Create contrasts to perform Wald testing on the shrunken log2 fold changes between specific conditions
1. Output significant results
1. Visualize results: volcano plots, heat-maps, normalized counts plots of top genes, etc.
1. Take note of the versions of all tools used in the DE analysis
We grouped the samples based on encephalitic (yes or no), COVID-19 status (possible or unlikely) and immunosuppression (yes or no).