Wiki » History » Version 20

Katie Lennard, 11/09/2022 07:49 PM

# Wiki

# Data location:

The data was transferred from Athena (medmicro):

```
/MedMicro/Clinton/CRE Pfizer Feb 2022/CRE study_1A_results_17022022
/MedMicro/Clinton/CRE Pfizer Feb 2022/CRE study_1B_results_21022022
```

to Ilifu:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/
```

# Reference data:

Klebsiella pneumoniae – strain HS11286 (GenBank accession no. CP003200.1) (n=18); 
Serratia marcescens – strain KS10 (GenBank accession no. CP027798.1) (n=3); 
Escherichia coli – strain ATCC 25922 (GenBank accession no. CP009072.1) (n=1); and 
Enterobacter cloacae – strain ATCC 13047 (GenBank accession no. NC_014121.1) (n=1).

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_genomes
```

# Objectives workflow:
![workflow.png]()

# QC:

Eleven samples failed QC and had to be rerun. Note that these 11 samples (study 1A) were accidentally rerun twice, once on 28 Feb and once on 22 September. The two reruns were merged by concatenating the reads of each sample, e.g.:

```
cat KLEB-CRE-GSH-0016_S11_L001_R2_001.fastq.gz >> merged_reads/G-16_S11_L001_R2_001.fastq.gz
```
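
The same concatenation can be scripted for all 11 samples. The following is a minimal sketch, not the exact commands that were run: the folder names `rerun_feb` and `rerun_sep` and the tiny demo reads are hypothetical, and it assumes both reruns use identical file names (the actual merge also shortened the sample names, e.g. KLEB-CRE-GSH-0016 to G-16). It relies on the fact that concatenated gzip files still form a valid gzip stream.

```shell
set -eu

# Demo setup in a temp dir: two hypothetical rerun folders, one tiny gzipped FASTQ each
workdir=$(mktemp -d)
mkdir -p "$workdir/rerun_feb" "$workdir/rerun_sep" "$workdir/merged_reads"
printf '@read1\nACGT\n+\nIIII\n' | gzip > "$workdir/rerun_feb/G-16_S11_L001_R2_001.fastq.gz"
printf '@read2\nTTGA\n+\nIIII\n' | gzip > "$workdir/rerun_sep/G-16_S11_L001_R2_001.fastq.gz"

# Append each rerun's reads to the merged file of the same name
for run in rerun_feb rerun_sep; do
    for fq in "$workdir/$run"/*.fastq.gz; do
        cat "$fq" >> "$workdir/merged_reads/$(basename "$fq")"
    done
done

# Sanity check: count the lines in the merged file (4 lines per read)
merged_lines=$(gzip -dc "$workdir/merged_reads/G-16_S11_L001_R2_001.fastq.gz" | wc -l | tr -d ' ')
echo "merged lines: $merged_lines"
```

Because `>>` appends, running such a loop twice would duplicate reads, so the `merged_reads` folder should be empty before starting.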

File location:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/11_double_rerun_merged/merged_reads
```

Next, these 11 merged-run samples were combined in one folder, via symlinks, with the run B samples (which passed QC):

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_1A_B_combined
```
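
One way to build such a combined folder is sketched below. This is an illustration under assumptions rather than the commands actually used: the `run_B` folder name and the empty demo files are hypothetical. Absolute target paths are used so that the links stay valid regardless of the directory they are read from.

```shell
set -eu

# Demo setup in a temp dir (mktemp -d returns an absolute path)
workdir=$(mktemp -d)
mkdir -p "$workdir/merged_reads" "$workdir/run_B" "$workdir/study_1A_B_combined"
touch "$workdir/merged_reads/G-16_S11_L001_R1_001.fastq.gz"
touch "$workdir/run_B/G-20_S1_L001_R1_001.fastq.gz"

# Link every FASTQ from both source folders into the combined folder
for fq in "$workdir/merged_reads"/*.fastq.gz "$workdir/run_B"/*.fastq.gz; do
    ln -s "$fq" "$workdir/study_1A_B_combined/$(basename "$fq")"
done

# Count the symlinks that were created
n_links=$(find "$workdir/study_1A_B_combined" -type l | wc -l | tr -d ' ')
echo "symlinks created: $n_links"
```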

Filtering and trimming were executed as follows:

```
nextflow run kviljoen/fastq_QC --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_1A_B_combined/*_R{1,2}_001.fastq.gz' -profile ilifu
```

QC reports can be found in the 'files' tab.

Note: to comply with the srst2 file naming specifications, I renamed the trimmed files from e.g. `*_R1.fq` to `*_1.fq` (removing the R) using e.g.:

```
for f in *.fq; do mv -v "$f" "${f/_R/_}"; done
```
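
Note that `${f/_R/_}` replaces the first `_R` found anywhere in the file name, which is fine for the sample names used here but would rename the wrong part of a name such as `G_R1_S1_R1.fq`. A more defensive variant, sketched below, only touches the `_R1.fq` / `_R2.fq` suffix. The function name `rename_read_suffix` and the demo file names are made up for illustration.

```shell
# Print the srst2-style name for a trimmed read file:
# *_R1.fq -> *_1.fq and *_R2.fq -> *_2.fq; any other name is printed unchanged
rename_read_suffix() {
    f=$1
    case "$f" in
        *_R1.fq) printf '%s\n' "${f%_R1.fq}_1.fq" ;;
        *_R2.fq) printf '%s\n' "${f%_R2.fq}_2.fq" ;;
        *)       printf '%s\n' "$f" ;;
    esac
}

# Demo on empty files in a temp dir
demo_dir=$(mktemp -d)
touch "$demo_dir/G-16_S11_R1.fq" "$demo_dir/G-16_S11_R2.fq"
(
    cd "$demo_dir"
    for f in *.fq; do
        new=$(rename_read_suffix "$f")
        # Skip files whose name would not change
        [ "$f" = "$new" ] || mv -v "$f" "$new"
    done
)
```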

# AMR profiling

Clinton's preference is to do AMR profiling with the ResFinder DB. However, I'm getting errors there that I think relate to the header formatting, so in the interim I have run srst2 with the ARG-ANNOT DB that we used for previous projects, as follows:

## ARGannot
```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-09-19-fastq_QC/bbduk/*_{1,2}.fq' -profile ilifu --gene_db /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/ARGannot_r3.fasta --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_ARGannot/coverage_80_run --min_gene_cov 80
```
The individual results files were compiled as:

```
srst2 --prev_output *results.txt --output ARGannot_AMRs
```

## CARD DB:

This database is the one recommended by the srst2 authors, who have already formatted it for use with srst2. The DB was downloaded with:

```
wget 'https://github.com/katholt/srst2/blob/master/data/CARD_v3.0.8_SRST2.fasta?raw=true' -O CARD_v3.0.8_SRST2.fasta
```

The pipeline was executed as:

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-09-19-fastq_QC/bbduk/*_{1,2}.fq' -profile ilifu --gene_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/CARD_v3.0.8_SRST2.fasta --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_CARD/coverage_80_run --min_gene_cov 80
```

The individual results files were compiled as:

```
srst2 --prev_output *results.txt --output CARD_AMRs
```

# Virulence factors

Building the relevant VFDB for Klebsiella requires a Python script that needs the Biopython module (use the /cbio/users/katie/singularity_containers/srst2_v2.simg Singularity container for this).
NB: in order to use the correct Python version (2.7.5) for srst2, I first had to comment out the lines at the end of my .bashrc file relating to conda initialize.

Build the genus-specific DB:
```
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/database_clustering/VFDBgenus.py --infile /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/VFDB_setB_nt.fas --genus Klebsiella
```
This created the VF DB Klebsiella.fsa.

The same procedure (as last year ;) was executed for Escherichia, Serratia and Enterobacter.

The cd-hit Docker image (cd-hit is needed to build the VFDB, as outlined here https://github.com/katholt/srst2#using-the-vfdb-virulence-factor-database-with-srst2) was pulled from here https://quay.io/repository/biocontainers/cd-hit?tab=tags and converted to a Singularity image on the BST server:
```
singularity exec /cbio/users/katie/singularity_containers/cd-hit.simg /bin/bash
```

Then run CD-HIT to cluster the sequences for this genus at 90% nucleotide identity:

```
cd-hit -i Klebsiella.fsa -o Klebsiella_cdhit90 -c 0.9 > Klebsiella_cdhit90.stdout
```

Repeat for the other .fsa DBs.
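
To avoid retyping the command per genus, the clustering commands can be generated in a loop. The sketch below only prints the commands (a dry run) so that they can be checked first. It assumes the other genus-specific DBs are named after their genus like Klebsiella.fsa (i.e. Escherichia.fsa, Serratia.fsa, Enterobacter.fsa); the helper name `cdhit_cmds` is made up for illustration.

```shell
# Print one cd-hit command per genus, using the same settings as the Klebsiella run
cdhit_cmds() {
    for genus in "$@"; do
        printf 'cd-hit -i %s.fsa -o %s_cdhit90 -c 0.9 > %s_cdhit90.stdout\n' \
            "$genus" "$genus" "$genus"
    done
}

# Dry run: show the four commands without executing them
cdhit_cmds Klebsiella Escherichia Serratia Enterobacter
```

Once the printed commands look right, they can be executed inside the cd-hit container, for example by piping the output to `sh`.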

Next, parse the cluster output and tabulate the results using the VFDB-specific script (use srst2_v2.simg again):

```
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/database_clustering/VFDB_cdhit_to_csv_KLedit.py --cluster_file Klebsiella_cdhit90.clstr --infile Klebsiella.fsa --outfile Klebsiella_cdhit90.csv
```

Next, convert the resulting CSV table to an srst2-compatible sequence database using:

```
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/database_clustering/csv_to_gene_db.py -t Klebsiella_cdhit90.csv -o Klebsiella_VF_clustered.fasta -s 5
```

The actual VF typing can now be done using this clustered DB:

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-09-19-fastq_QC/bbduk/*_{1,2}.fq' -profile ilifu --gene_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Klebsiella_VF_clustered.fasta --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_VFs/coverage_80_run --min_gene_cov 80
```

The same was done for the other genera using:
```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Escherichia_VF_clustered.fasta
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Serratia_VF_clustered.fasta
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Enterobacter_VF_clustered.fasta
```

Again, combine the individual sample results files with e.g.:
```
srst2 --prev_output *genes* --output Klebsiella_VFs
```

# MLST
MLST profiles were downloaded for E. coli and K. pneumoniae as follows:

```
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/scripts/getmlst.py --species 'Escherichia coli#1'
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/scripts/getmlst.py --species 'Escherichia coli#2'
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/scripts/getmlst.py --species 'Klebsiella pneumoniae'
```

Note: MLST profiles were not available for Serratia marcescens or Enterobacter cloacae.

MLST profiling execution:

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-09-19-fastq_QC/bbduk/*_{1,2}.fq' -profile ilifu --mlst_definitions /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Klebsiella_definitions --mlst_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Klebsiella_pneumoniae.fasta --mlst_delimiter _ --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_MLSTs/Klebsiella_MLSTs
```

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-09-19-fastq_QC/bbduk/*_{1,2}.fq' -profile ilifu --mlst_definitions /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/E_coli_1_definitions --mlst_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Escherichia_coli#1.fasta --mlst_delimiter _ --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_MLSTs/E_coli1_MLSTs
```

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-09-19-fastq_QC/bbduk/*_{1,2}.fq' -profile ilifu --mlst_definitions /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/E_coli_2_definitions --mlst_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Escherichia_coli#2.fasta --mlst_delimiter _ --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_MLSTs/E_coli2_MLSTs
```

# Tychus alignment module

Note that the ilifu branch of the Tychus repo should be used. Because the repo has no main.nf file, the standard `nextflow pull` syntax can't be used, so I did a git clone instead:
`git clone --branch ilifu https://github.com/kviljoen/Tychus/`

A list of FASTA files for the reference genomes was created (see full_fasta_list below).
NB: an error occurs when trying to use E_coli_ATCC_25922_CP009072_1.fasta as --genome (but not with e.g. Serratia).

NB: error in the makephylogenies process:

```
.command.sh: 7: [: missing ]
  mv: cannot stat 'kSNP3_results/*.tre': No such file or directory
```
If you go into the working directory you'll find a NameErrors.txt file, which states that some of the genome file names are illegal. The rule that applies here is: "A file name may contain only one dot ('.') character, that which separates the file ID from the extension. EcoSME175.fasta is legal, EcoSME17.5.fasta is not."

So all files were renamed from e.g. E_coli_ATCC_25922_CP009072.1.fasta to E_coli_ATCC_25922_CP009072_1.fasta.
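
A rename of this kind can be scripted so that every dot except the one before the extension becomes an underscore. The sketch below is one possible way to do it, not necessarily how the files were actually renamed; the function name `ksnp_safe_name` and the empty demo files are made up for illustration.

```shell
# Print a file name in which every dot except the last one (the extension
# separator) is replaced by an underscore; names without a dot are unchanged
ksnp_safe_name() {
    name=$1
    case "$name" in
        *.*)
            stem=${name%.*}    # everything before the final dot
            ext=${name##*.}    # extension after the final dot
            printf '%s.%s\n' "$(printf '%s' "$stem" | tr '.' '_')" "$ext"
            ;;
        *)
            printf '%s\n' "$name"
            ;;
    esac
}

# Demo on empty files in a temp dir
demo_dir=$(mktemp -d)
touch "$demo_dir/E_coli_ATCC_25922_CP009072.1.fasta" "$demo_dir/EcoSME175.fasta"
(
    cd "$demo_dir"
    for f in *.fasta; do
        new=$(ksnp_safe_name "$f")
        # Only rename files whose name actually changes
        [ "$f" = "$new" ] || mv -v "$f" "$new"
    done
)
```

Any list of genome paths that was written before the rename needs to be regenerated afterwards so that it points at the new file names.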

Location of the list of reference genome FASTA files:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/full_fasta_list
```

Alignment run example against Serratia:

```
nextflow alignment.nf --alignment_out_dir /scratch3/users/katiel/Clinton/CRE_study_August_2022/Tychus_alignment_Serratia --genome /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Serratia_marescens_KS10_CP027798_1.fasta --read_pairs '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-10-10-fastq_QC/bbduk/*_{1,2}.fq' --user_genome_paths /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/full_fasta_list -profile alignment
```