Wiki » History » Version 25

Katie Lennard, 03/27/2023 11:42 AM

# Wiki

# Data location:

The data was transferred from Athena (medmicro) to Ilifu with, e.g.:

```
rsync -avvP -e "ssh -i /home/katie/.ssh/id_rsa" /mnt/athena/medmicro/Clinton/CRE\ Pfizer\ Feb\ 2022/CRE\ study_4_results_11112022 katiel@transfer.ilifu.ac.za:/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/
```

Destination on Ilifu:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/
```

# Reference data:

Klebsiella pneumoniae – strain HS11286 (GenBank accession no. CP003200.1) (n=18);
Serratia marcescens – strain KS10 (GenBank accession no. CP027798.1) (n=3);
Escherichia coli – strain ATCC 25922 (GenBank accession no. CP009072.1) (n=1); and
Enterobacter cloacae – strain ATCC 13047 (GenBank accession no. NC_014121.1) (n=1).

Reference genome location:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_genomes
```

# Objectives workflow:

![workflow.png]()

# QC:

11 samples from run 1 failed QC and had to be rerun. Note that these 11 samples (study1A) were accidentally rerun twice, once on 28 Feb and once on 22 September. The two reruns were merged by concatenating the reads per sample, e.g.:

```
cat KLEB-CRE-GSH-0016_S11_L001_R2_001.fastq.gz >> merged_reads/G-16_S11_L001_R2_001.fastq.gz
```
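
The merge above can be generalised to a whole folder. The following is only a sketch, assuming that both rerun directories hold one file per sample/read with identical file names (the real files were also renamed, e.g. KLEB-CRE-GSH-0016 to G-16, which this sketch does not do). The function name and directory arguments are hypothetical.

```shell
# Sketch: merge two reruns of the same samples by concatenating gzipped FASTQ files.
# Assumption: run_a and run_b contain files with identical names per sample/read.
merge_reruns() {
    local run_a=$1 run_b=$2 out_dir=$3
    local f name
    mkdir -p "$out_dir"
    for f in "$run_a"/*.fastq.gz; do
        [ -e "$f" ] || continue              # nothing matched the glob
        name=$(basename "$f")
        if [ ! -e "$run_b/$name" ]; then
            echo "WARNING: $name missing from $run_b, keeping single run only" >&2
            cat "$f" > "$out_dir/$name"
            continue
        fi
        # Concatenated gzip members form a valid gzip stream, so no decompression is needed.
        cat "$f" "$run_b/$name" > "$out_dir/$name"
    done
}
```

Writing with `>` rather than `>>` means that re-running the merge overwrites the output instead of appending the same reads a second time.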

File location:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/11_double_rerun_merged/merged_reads
```

Next, these 11 merged samples were combined in one folder, via symlinks, with run B (which passed QC):

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_1A_B_combined
```
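
A minimal sketch of how such a combined folder could be built with symlinks. The function name and its arguments are hypothetical, and it assumes file names do not clash between the source folders.

```shell
# Sketch: gather reads from several run directories into one folder via symlinks.
# Assumption: file names are unique across the source directories.
link_runs() {
    local dest=$1
    shift
    local src f
    mkdir -p "$dest"
    for src in "$@"; do
        src=$(cd "$src" && pwd)              # absolute path, so the links resolve from anywhere
        for f in "$src"/*.fastq.gz; do
            [ -e "$f" ] || continue
            ln -s "$f" "$dest/"              # fails loudly if a link of that name already exists
        done
    done
}
```

Usage would look like `link_runs study_1A_B_combined merged_reads run_B`, with the last two arguments standing in for the actual source folders.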

Filtering and trimming were executed as follows:

```
nextflow run kviljoen/fastq_QC --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_1A_B_combined/*_R{1,2}_001.fastq.gz' -profile ilifu
```

QC reports can be found in the 'files' tab.
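
Since the `--reads` glob expects R1/R2 pairs, it may be worth checking a symlinked folder for unpaired files before launching the pipeline. This is a sketch only; the function name is hypothetical and it assumes the `_R1_001.fastq.gz` / `_R2_001.fastq.gz` naming used in the glob.

```shell
# Sketch: print every R1 file in a directory that has no R2 partner.
# Assumption: Illumina-style names ending in _R1_001.fastq.gz / _R2_001.fastq.gz.
missing_mates() {
    local dir=$1 r1 r2
    for r1 in "$dir"/*_R1_001.fastq.gz; do
        [ -e "$r1" ] || continue
        r2="${r1%_R1_001.fastq.gz}_R2_001.fastq.gz"
        [ -e "$r2" ] || basename "$r1"
    done
}
```

An empty output means every R1 file has its mate.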

Runs 2 and 3 were combined with symlinks under:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_2_3_combined
```

FastQC was run (all samples passed):

```
nextflow run kviljoen/fastq_QC --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_2_3_combined/*_R{1,2}_001.fastq.gz' -profile ilifu
```

The reports can be found here:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-11-10-fastq_QC
```

Run 4 was added next and QCed:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-11-12-fastq_QC
```

Run 5 was added and QCed:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/2023-03-24-fastq_QC
```

Note: to agree with the srst2 file naming specifications, I renamed the trimmed files from e.g. `*_R1.fq` to `*_1.fq` (removing the R) using e.g.

```
for f in *.fq; do mv -v "$f" "${f/_R/_}";done
```
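
Note that `${f/_R/_}` replaces the first `_R` in the file name, which would hit the wrong place if a sample ID itself contained `_R`. A suffix-anchored variant is sketched below; the function names are hypothetical and it assumes the trimmed files end in exactly `_R1.fq` or `_R2.fq`.

```shell
# Sketch: rename *_R1.fq / *_R2.fq to *_1.fq / *_2.fq, touching only the suffix.
# Assumption: trimmed files end in exactly _R1.fq or _R2.fq.
srst2_name() {
    case $1 in
        *_R1.fq) printf '%s\n' "${1%_R1.fq}_1.fq" ;;
        *_R2.fq) printf '%s\n' "${1%_R2.fq}_2.fq" ;;
        *)       printf '%s\n' "$1" ;;       # leave anything else unchanged
    esac
}

rename_for_srst2() {
    local f
    for f in "$1"/*_R[12].fq; do
        [ -e "$f" ] || continue
        mv -v "$f" "$(srst2_name "$f")"
    done
}
```

For example, `srst2_name KLEB_RUN2_R2.fq` gives `KLEB_RUN2_2.fq`, leaving the `_R` inside the sample ID alone.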

# AMR profiling

Clinton's preference is to do AMR profiling with the ResFinder DB. I'm getting errors there that I think relate to the header formatting, though, so in the interim I have run with the ARG-ANNOT DB that we used for previous projects:

## ARGannot

Runs 1-5 combined:

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/run_1to5_cleaned/*_{1,2}.fq' -profile ilifu --gene_db /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/ARGannot_r3.fasta --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_ARGannot_run_1to5/coverage_80_run --min_gene_cov 80
```

Individual results files were compiled with:

```
srst2 --prev_output *results.txt --output ARGannot_AMRs
```

## CARD DB:

This is the database recommended by srst2, and it has already been formatted by the srst2 authors. The DB was downloaded with:

```
wget 'https://github.com/katholt/srst2/blob/master/data/CARD_v3.0.8_SRST2.fasta?raw=true' -O CARD_v3.0.8_SRST2.fasta
```
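
Because the URL points at a GitHub blob page with `?raw=true`, a quick check that the saved file is really FASTA (and not an HTML page) can save a confusing srst2 error later. This is a hedged sketch; the function name is hypothetical.

```shell
# Sketch: succeed only if the first non-empty line of a file starts with '>'.
looks_like_fasta() {
    local first
    first=$(grep -m 1 -v '^[[:space:]]*$' "$1") || return 1
    case $first in
        '>'*) return 0 ;;
        *)    return 1 ;;
    esac
}
```

For example: `looks_like_fasta CARD_v3.0.8_SRST2.fasta && grep -c '^>' CARD_v3.0.8_SRST2.fasta` prints the number of sequences only if the file passes the check.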

Pipeline execution (runs 1-5):

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/run_1to5_cleaned/*_{1,2}.fq' -profile ilifu --gene_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/CARD_v3.0.8_SRST2.fasta --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_CARD_run1to5/coverage_80_run --min_gene_cov 80
```

Individual results files were compiled with:

```
srst2 --prev_output *results.txt --output CARD_AMRs
```

# Plasmids

PlasmidFinder plasmids, runs 1-5 (note the min gene coverage of 50%, why?):

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/run_1to5_cleaned/*_{1,2}.fq' -profile ilifu --gene_db /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/data/PlasmidFinder.fasta --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_plasmidFinder_run1to5 --min_cov 50
```

Now combine the profiles for all samples:

```
srst2 --prev_output *results.txt --output plasmidFinder
```

# Virulence factors

Building the relevant VFDB for Klebsiella requires a python script that needs the biopython module (use the /cbio/users/katie/singularity_containers/srst2_v2.simg singularity container for this).

NB: in order to use the correct python version (2.7.5) for srst2, I first had to comment out the lines at the end of my .bashrc file relating to conda initialize.

Build genus-specific DB:

```
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/database_clustering/VFDBgenus.py --infile /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/VFDB_setB_nt.fas --genus Klebsiella
```

This was used to create the VF DB Klebsiella.fsa.

The same procedure (as last year ;) was followed for Escherichia, Serratia and Enterobacter.

cd-hit (needed to build the VFDB as outlined here https://github.com/katholt/srst2#using-the-vfdb-virulence-factor-database-with-srst2) was pulled as a docker image from https://quay.io/repository/biocontainers/cd-hit?tab=tags and converted to a singularity image on the BST server:

```
singularity exec /cbio/users/katie/singularity_containers/cd-hit.simg /bin/bash
```

Then run CD-HIT to cluster the sequences for this genus at 90% nucleotide identity:

```
cd-hit -i Klebsiella.fsa -o Klebsiella_cdhit90 -c 0.9 > Klebsiella_cdhit90.stdout
```

Repeat for the other .fsa DBs.
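
Before parsing the clusters, a quick sanity check is to compare the number of input sequences with the number of clusters produced. This is a sketch only, assuming the standard CD-HIT `.clstr` layout in which each cluster starts with a line beginning `>Cluster`; the function names are hypothetical.

```shell
# Sketch: report how many sequences went in and how many clusters came out.
# Assumption: standard CD-HIT .clstr layout, one ">Cluster N" header line per cluster.
count_seqs()     { grep -c '^>' "$1"; }
count_clusters() { grep -c '^>Cluster' "$1"; }

report_clustering() {
    local fsa=$1 clstr=$2
    printf '%s: %s sequences -> %s clusters\n' \
        "$(basename "$fsa")" "$(count_seqs "$fsa")" "$(count_clusters "$clstr")"
}
```

For example, `report_clustering Klebsiella.fsa Klebsiella_cdhit90.clstr`; the cluster count should never exceed the sequence count.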

Next, parse the cluster output and tabulate the results using the virulence gene DB-compatible script (use srst2_v2.simg again):

```
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/database_clustering/VFDB_cdhit_to_csv_KLedit.py --cluster_file Klebsiella_cdhit90.clstr --infile Klebsiella.fsa --outfile Klebsiella_cdhit90.csv
```

Next, convert the resulting csv table to an SRST2-compatible sequence database using:

```
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/database_clustering/csv_to_gene_db.py -t Klebsiella_cdhit90.csv -o Klebsiella_VF_clustered.fasta -s 5
```

The actual VF typing can now be done using this clustered DB (runs 1-5):

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/run_1to5_cleaned/*_{1,2}.fq' -profile ilifu --gene_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Klebsiella_VF_clustered.fasta --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_VFs_run1to5/coverage_80_run --min_gene_cov 80
```

The same was done for the other genera using:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Escherichia_VF_clustered.fasta
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Serratia_VF_clustered.fasta
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Enterobacter_VF_clustered.fasta
```

Again, combine the individual sample results files with e.g.

```
srst2 --prev_output *genes* --output Klebsiella_VFs
```

# MLST

MLST profiles were downloaded for E. coli and K. pneumoniae as:

```
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/scripts/getmlst.py --species 'Escherichia coli#1'
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/scripts/getmlst.py --species 'Escherichia coli#2'
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/scripts/getmlst.py --species 'Klebsiella pneumoniae'
```

Note: MLST profiles were not available for Serratia marcescens or Enterobacter cloacae.

MLST profiling execution:

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/run_1to5_cleaned/*_{1,2}.fq' -profile ilifu --mlst_definitions /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Klebsiella_definitions --mlst_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Klebsiella_pneumoniae.fasta --mlst_delimiter _ --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_MLSTs_run_1to5/Klebsiella_MLSTs
```

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/run_1to5_cleaned/*_{1,2}.fq' -profile ilifu --mlst_definitions /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/E_coli_1_definitions --mlst_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Escherichia_coli#1.fasta --mlst_delimiter _ --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_MLSTs_run_1to5/E_coli1_MLSTs
```

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/run_1to5_cleaned/*_{1,2}.fq' -profile ilifu --mlst_definitions /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/E_coli_2_definitions --mlst_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Escherichia_coli#2.fasta --mlst_delimiter _ --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_MLSTs_run_1to5/E_coli2_MLSTs
```

Again, combine the individual sample results files with e.g.

```
srst2 --prev_output *results* --output Klebsiella_MLSTs
```

# Combining runs

For preliminary analysis, run 1 was combined with the output from runs 2-4 with e.g. (from the directory /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_combined_output_run1-4/CARD):

```
ln -s ../../srst2_CARD_run2to4/coverage_80_run/srst2/*genes* ./
ln -s ../../srst2_CARD_v3/coverage_80_run/srst2/*genes* ./
```

Note that I subsequently reran everything after receiving the run 5 data (so runs 1-5 all together).
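
Relative symlinks such as the `../../` ones above resolve relative to the folder that holds the link, so a link created from the wrong directory is silently broken. A small sketch for listing dangling links follows; the function name is hypothetical.

```shell
# Sketch: list symlinks in a directory whose targets do not resolve.
dangling_links() {
    local dir=$1 l
    for l in "$dir"/*; do
        # -L: the entry is a symlink; ! -e: its target does not exist
        if [ -L "$l" ] && [ ! -e "$l" ]; then
            basename "$l"
        fi
    done
}
```

Running `dangling_links .` inside the combined output folder should print nothing if all links are valid.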

# Tychus alignment module

Note that the ilifu branch of the Tychus repo should be used. Because we don't have a main.nf file we can't use the standard `nextflow pull` syntax, so I did a git clone instead:

`git clone --branch ilifu https://github.com/kviljoen/Tychus/`

A list of fasta files for the reference genomes was created (full_fasta_list; its path is listed after the file naming notes below).

NB: Error when trying to use E_coli_ATCC_25922_CP009072_1.fasta as --genome (but not with e.g. Serratia).

NB: error in the makephylogenies process:

```
.command.sh: 7: [: missing ]
mv: cannot stat 'kSNP3_results/*.tre': No such file or directory
```

If you go into the working directory you'll find a NameErrors.txt file, which states that some of the genome file names are illegal. The rule that applies here is: "A file name may contain only one dot ('.') character, that which separates the file ID from the extension. EcoSME175.fasta is legal, EcoSME17.5.fasta is not".

So all files were renamed from e.g. E_coli_ATCC_25922_CP009072.1.fasta to E_coli_ATCC_25922_CP009072_1.fasta.
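
The renaming rule can be sketched as a small helper that keeps only the dot before the extension and turns every other dot into an underscore. The function name is hypothetical, and it assumes it is given a bare file name rather than a path with dots in the directory part.

```shell
# Sketch: make a genome file name kSNP3-safe (only the extension dot is kept).
# Assumption: the argument is a bare file name, not a path.
ksnp_safe_name() {
    local name=$1 ext stem
    case $name in
        *.*) ext=${name##*.}; stem=${name%.*} ;;
        *)   printf '%s\n' "$name"; return 0 ;;   # no dot at all: nothing to do
    esac
    # Replace every remaining dot in the stem with an underscore
    stem=$(printf '%s' "$stem" | tr '.' '_')
    printf '%s.%s\n' "$stem" "$ext"
}
```

A rename loop could then compare the old and new names and only call `mv` when they differ, since names that are already legal come back unchanged.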

Location of the reference fasta list:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/full_fasta_list
```

Alignment run example against Serratia:

```
nextflow alignment.nf --alignment_out_dir /scratch3/users/katiel/Clinton/CRE_study_August_2022/Tychus_alignment_Serratia --genome /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Serratia_marescens_KS10_CP027798_1.fasta --read_pairs '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-10-10-fastq_QC/bbduk/*_{1,2}.fq' --user_genome_paths /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/full_fasta_list -profile alignment
```

Runs 1-5 Tychus alignment run:

```
nextflow alignment.nf --alignment_out_dir /scratch3/users/katiel/Clinton/CRE_study_August_2022/Tychus_alignment_Serratia_run1to5 --genome /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Serratia_marescens_KS10_CP027798_1.fasta --read_pairs '/scratch3/users/katiel/Clinton/CRE_study_August_2022/run_1to5_cleaned/*_{1,2}.fq' --user_genome_paths /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/full_fasta_list -profile alignment
```