Revision 8 (Katie Lennard, 09/21/2022 09:11 AM) → Revision 9/26 (Katie Lennard, 09/21/2022 01:43 PM)
# Wiki
# Data location:
The data was transferred from Athena (medmicro):
```
/MedMicro/Clinton/CRE Pfizer Feb 2022/CRE study_1A_results_17022022
/MedMicro/Clinton/CRE Pfizer Feb 2022/CRE study_1B_results_21022022
```
to Ilifu:
```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/
```
# Reference data:
Klebsiella pneumoniae – strain HS11286 (GenBank accession no. CP003200.1) (n=18);
Serratia marcescens – strain KS10 (GenBank accession no. CP027798.1) (n=3);
Escherichia coli – strain ATCC 25922 (GenBank accession no. CP009072.1) (n=1); and
Enterobacter cloacae – strain ATCC 13047 (GenBank accession no. NC_014121.1) (n=1).
```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_genomes
```
# Objectives workflow:
![workflow.png]()
# QC:
11 samples failed QC on Phred scores before trimming and filtering; none failed afterwards. Filtering and trimming were executed as follows:
```
nextflow run kviljoen/fastq_QC --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_1A_B_combined/*_R{1,2}_001.fastq.gz' -profile ilifu
```
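Since the `--reads` glob expects an R1/R2 pair per sample, it can help to confirm beforehand that no mate is missing. A minimal sketch, assuming the `_R1_001.fastq.gz` / `_R2_001.fastq.gz` naming from the glob above; `missing_mates` is a hypothetical helper, not part of kviljoen/fastq_QC:

```shell
# Hypothetical helper: print every R1 file in a directory whose R2 mate is absent.
# The body runs in a subshell, so its variables do not leak into the caller.
missing_mates() (
  dir="$1"
  for r1 in "$dir"/*_R1_001.fastq.gz; do
    [ -e "$r1" ] || continue                      # glob matched nothing
    r2="${r1%_R1_001.fastq.gz}_R2_001.fastq.gz"   # derive the expected mate name
    [ -e "$r2" ] || printf '%s\n' "$r1"
  done
)
```

Calling `missing_mates /scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_1A_B_combined` should print nothing if every sample is paired.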
QC reports can be found in the 'files' tab.
# AMR profiling
Clinton's preference is to do AMR profiling with the ResFinder DB. That run currently fails with errors that I think relate to the header formatting, so in the interim I have run with the ARGannot DB that we used for previous projects, as follows:
## ARGannot
```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-09-19-fastq_QC/bbduk/*_{1,2}.fq' -profile ilifu --gene_db /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/ARGannot_r3.fasta --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_ARGannot/coverage_80_run --min_gene_cov 80
```
Individual results files compiled as:
```
srst2 --prev_output *results.txt --output ARGannot_AMRs
```
## CARD DB:
This database is the one recommended by srst2 and has already been formatted by the srst2 authors. The DB was downloaded with:
```
wget 'https://github.com/katholt/srst2/blob/master/data/CARD_v3.0.8_SRST2.fasta?raw=true' -O CARD_v3.0.8_SRST2.fasta
```
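Because the download goes through a GitHub `blob` URL, a quick sanity check that the saved file is FASTA rather than an HTML page may be worthwhile. A small sketch; `looks_like_fasta` is a hypothetical helper and only inspects the first byte:

```shell
# Hypothetical helper: succeed if the file starts with '>', as a FASTA file should.
looks_like_fasta() {
  [ "$(head -c 1 "$1")" = ">" ]
}

# Example use after the download:
# looks_like_fasta CARD_v3.0.8_SRST2.fasta || echo "not FASTA, check the URL"
```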
Pipeline execution as:
```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-09-19-fastq_QC/bbduk/*_{1,2}.fq' -profile ilifu --gene_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/CARD_v3.0.8_SRST2.fasta --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_CARD/coverage_80_run --min_gene_cov 80
```
Individual results files compiled as:
```
srst2 --prev_output *results.txt --output CARD_AMRs
```
# Virulence factors
Building the relevant VFDB for Klebsiella requires a Python script that needs the Biopython module. Biopython was installed in a virtual environment on Ilifu as follows (from an interactive node):
```
module add python/3.9.0
virtualenv .srst2_biopython_venv
. .srst2_biopython_venv/bin/activate
pip install biopython==1.68
```
*Note: Biopython is pinned to 1.68 because later versions fail with "ImportError: Bio.Alphabet has been removed from Biopython".
The environment can now be activated at any time with:
```
. .srst2_biopython_venv/bin/activate
```
Build genus-specific DB:
```
python /cbio/users/katiel/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/database_clustering/VFDBgenus.py --infile /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/VFDB_setB_nt.fas --genus Klebsiella
```
This creates the VF DB Klebsiella.fsa.
The same procedure (as last year ;) was executed for Escherichia, Serratia and Enterobacter.
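The per-genus repetition can be scripted. The sketch below only prints the commands so they can be reviewed before running; `vfdb_genus_cmds` is a hypothetical helper, and the short file names in the example stand in for the full paths used above:

```shell
# Hypothetical helper: print (not run) one VFDBgenus.py command per genus.
# Arguments: script path, VFDB nucleotide FASTA, then one or more genus names.
vfdb_genus_cmds() (
  script="$1"; infile="$2"; shift 2
  for genus in "$@"; do
    printf 'python %s --infile %s --genus %s\n' "$script" "$infile" "$genus"
  done
)

vfdb_genus_cmds VFDBgenus.py VFDB_setB_nt.fas Klebsiella Escherichia Serratia Enterobacter
```

Piping the output to `sh` (inside the activated Biopython environment) would execute the four commands.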
A cd-hit Docker image (cd-hit is needed to build the VFDB, as outlined at https://github.com/katholt/srst2#using-the-vfdb-virulence-factor-database-with-srst2) was pulled from https://quay.io/repository/biocontainers/cd-hit?tab=tags and converted to a Singularity image on the BST server. Open a shell in the container with:
```
singularity exec /cbio/users/katie/singularity_containers/cd-hit.simg /bin/bash
```
Then run CD-HIT to cluster the sequences for this genus at 90% nucleotide identity:
```
cd-hit -i Klebsiella.fsa -o Klebsiella_cdhit90 -c 0.9 > Klebsiella_cdhit90.stdout
```
Repeat for the other .fsa DBs.
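For the remaining genera, the output names can be derived from each .fsa file name. A dry-run sketch under that assumption; `cdhit_cmds` is a hypothetical helper that prints the commands rather than running cd-hit:

```shell
# Hypothetical helper: print one cd-hit command per .fsa file, mirroring the
# Klebsiella call above (90% identity, output prefix <genus>_cdhit90).
cdhit_cmds() (
  for fsa in "$@"; do
    base="${fsa%.fsa}"   # strip the .fsa extension
    printf 'cd-hit -i %s -o %s_cdhit90 -c 0.9 > %s_cdhit90.stdout\n' "$fsa" "$base" "$base"
  done
)

cdhit_cmds Escherichia.fsa Serratia.fsa Enterobacter.fsa
```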
Next, parse the cluster output and tabulate the results using the script that is compatible with the specific virulence gene DB (use the Biopython environment again):
Note: here I had issues with the Python version being used. We don't have the srst2-recommended Python 2.7.5 on Ilifu (it's too old), so the env was built with Python 3.
The syntax has changed a bit: when trying to execute VFDB_cdhit_to_csv.py I got "NameError: name 'file' is not defined". This can be fixed by editing the script to replace file() with open() (the edited copy is VFDB_cdhit_to_csv_KLedit.py):
```
. /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/.srst2_biopython_venv/bin/activate
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/database_clustering/VFDB_cdhit_to_csv_KLedit.py --cluster_file Klebsiella_cdhit90.clstr --infile Klebsiella.fsa --outfile Klebsiella_cdhit90.csv
```
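The `file()` to `open()` edit mentioned above was made by hand; one way it could be reproduced is with `sed`. This is a sketch, not the exact edit that was applied: `file_to_open` is a hypothetical helper, and it only rewrites bare `file(` calls, so any other Python 2 idioms in a script would still need attention.

```shell
# Hypothetical helper: rewrite Python 2 file(...) calls as open(...).
# Reads stdin, writes stdout. The leading character class keeps names such as
# infile( or obj.file( untouched.
file_to_open() {
  sed -E 's/(^|[^A-Za-z0-9_.])file\(/\1open(/g'
}

printf 'out = file(path, "w")\n' | file_to_open
# prints: out = open(path, "w")
```

An edited copy could then be written with `file_to_open < VFDB_cdhit_to_csv.py > VFDB_cdhit_to_csv_KLedit.py`.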