Wiki » History » Version 24

Katie Lennard, 03/06/2023 10:16 AM

# Wiki

# Data location:

The data was transferred from Athena (medmicro) with, e.g.:

```
rsync -avvP -e "ssh -i /home/katie/.ssh/id_rsa" /mnt/athena/medmicro/Clinton/CRE\ Pfizer\ Feb\ 2022/CRE\ study_4_results_11112022 katiel@transfer.ilifu.ac.za:/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/
```

to Ilifu:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/
```

# Reference data:

Klebsiella pneumoniae – strain HS11286 (GenBank accession no. CP003200.1) (n=18);
Serratia marcescens – strain KS10 (GenBank accession no. CP027798.1) (n=3);
Escherichia coli – strain ATCC 25922 (GenBank accession no. CP009072.1) (n=1); and
Enterobacter cloacae – strain ATCC 13047 (GenBank accession no. NC_014121.1) (n=1).

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_genomes
```

# Objectives workflow:
![workflow.png]()

# QC:
11 samples from Run 1 failed QC and had to be rerun. Note that these 11 (study1A) were accidentally rerun twice – once on 28 Feb and once on 22 September. The two reruns were merged by concatenating the reads for each sample, e.g.:

```
cat KLEB-CRE-GSH-0016_S11_L001_R2_001.fastq.gz >> merged_reads/G-16_S11_L001_R2_001.fastq.gz
```

The merged files are located at:
```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/11_double_rerun_merged/merged_reads
```
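
Appending one gzipped FASTQ to another with `cat` works because a gzip file may consist of several members, which decompressors read back to back. The merge above can be sketched as a small function that also checks the result; the name `append_run` is a hypothetical helper, not something from the original workflow.

```shell
# append_run SRC DEST: append one run's gzipped FASTQ to the merged file
# and confirm the merged file still decompresses cleanly.
# (Hypothetical helper; the wiki only records the plain `cat >>` command.)
append_run() {
    src=$1
    dest=$2
    mkdir -p "$(dirname "$dest")"
    cat "$src" >> "$dest"
    # gzip -t tests every member of the concatenated file
    gzip -t "$dest"
}
```

For example, `append_run KLEB-CRE-GSH-0016_S11_L001_R2_001.fastq.gz merged_reads/G-16_S11_L001_R2_001.fastq.gz` would be called once per rerun of the same sample. Because the function appends, running it twice on the same source duplicates those reads.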

Next, these 11 merged-run samples were combined in one folder, via symlinks, with run B (which passed QC):

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_1A_B_combined
```
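
The wiki does not record the exact symlink commands, so the following is only a sketch of how such a combined folder could be built. The function name `link_reads` and the duplicate check are assumptions added for illustration.

```shell
# link_reads DEST SRC_DIR...: symlink every *.fastq.gz from each SRC_DIR
# into DEST, skipping (and reporting) names that already exist there.
# (Hypothetical helper, not the command originally used.)
link_reads() {
    dest=$1
    shift
    mkdir -p "$dest"
    for dir in "$@"; do
        for f in "$dir"/*.fastq.gz; do
            [ -e "$f" ] || continue          # glob matched nothing
            target="$dest/$(basename "$f")"
            if [ -e "$target" ] || [ -L "$target" ]; then
                echo "skipping duplicate: $target" >&2
                continue
            fi
            # an absolute source path keeps the link valid from any directory
            ln -s "$(cd "$dir" && pwd)/$(basename "$f")" "$target"
        done
    done
}
```

A call might look like `link_reads study_1A_B_combined 11_double_rerun_merged/merged_reads "$RUN_B_DIR"`, where `RUN_B_DIR` is a placeholder for the run B directory, which is not named on this page.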

Filtering and trimming were executed as follows:

```
nextflow run kviljoen/fastq_QC --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_1A_B_combined/*_R{1,2}_001.fastq.gz' -profile ilifu
```
QC reports can be found in the 'files' tab
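
The `--reads` pattern `*_R{1,2}_001.fastq.gz` expects every sample to have both an R1 and an R2 file. After merging reruns and symlinking, it may be worth confirming that before launching the pipeline. A minimal sketch, with `missing_mates` as a hypothetical helper name:

```shell
# missing_mates DIR: print every *_R1_001.fastq.gz in DIR whose
# *_R2_001.fastq.gz partner is absent (or is a broken symlink).
# Prints nothing when all pairs are complete.
missing_mates() {
    for r1 in "$1"/*_R1_001.fastq.gz; do
        [ -e "$r1" ] || continue                 # glob matched nothing
        r2="${r1%_R1_001.fastq.gz}_R2_001.fastq.gz"
        [ -e "$r2" ] || echo "$r1"
    done
}
```

Running `missing_mates /scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_1A_B_combined` should print nothing if the folder is ready for QC.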


Runs 2 and 3 were combined with symlinks under:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_2_3_combined
```

FastQC was done (all samples passed):

```
nextflow run kviljoen/fastq_QC --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_2_3_combined/*_R{1,2}_001.fastq.gz' -profile ilifu
```

The output can be found here:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-11-10-fastq_QC
```

Run 4 was added next and QCed; the output is here:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-11-12-fastq_QC
```

Note: to comply with the srst2 file naming convention, I renamed the trimmed files from e.g. `*_R1.fq` to `*_1.fq` (removing the R) using e.g.:
```
for f in *.fq; do mv -v "$f" "${f/_R/_}"; done
```
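
One caveat with the loop above: `${f/_R/_}` replaces the first `_R` anywhere in the name, so a sample ID that itself contains `_R` would be changed in the wrong place (a made-up name such as `KLEB_RUN2_R1.fq` would become `KLEB_UN2_R1.fq`). A sketch of a suffix-anchored alternative with a dry-run preview follows; `srst2_name` and `preview_renames` are hypothetical helper names.

```shell
# srst2_name FILE: print FILE with a trailing _R1.fq/_R2.fq changed to
# _1.fq/_2.fq; any other name is printed unchanged.
srst2_name() {
    case $1 in
        *_R1.fq) printf '%s\n' "${1%_R1.fq}_1.fq" ;;
        *_R2.fq) printf '%s\n' "${1%_R2.fq}_2.fq" ;;
        *)       printf '%s\n' "$1" ;;
    esac
}

# preview_renames: list the planned renames for *.fq in the current
# directory without touching any file.
preview_renames() {
    for f in *.fq; do
        [ -e "$f" ] || continue
        new=$(srst2_name "$f")
        [ "$new" = "$f" ] || echo "$f -> $new"
    done
}
```

Once the preview looks right, the same function can drive the real rename, e.g. `for f in *.fq; do mv -v "$f" "$(srst2_name "$f")"; done`.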

# AMR profiling

Clinton's preference is to do AMR profiling with the ResFinder DB. I'm getting errors there, which I think relate to the header formatting, so in the interim I have run with the ARG_annot DB that we used for previous projects:

## ARGannot
```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-09-19-fastq_QC/bbduk/*_{1,2}.fq' -profile ilifu --gene_db /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/ARGannot_r3.fasta --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_ARGannot/coverage_80_run --min_gene_cov 80
```
Individual results files were compiled with:

```
srst2 --prev_output *results.txt --output ARGannot_AMRs
```

## CARD DB:

This database is the one recommended by srst2 and has already been formatted by them. The DB was downloaded with:

```
wget "https://github.com/katholt/srst2/blob/master/data/CARD_v3.0.8_SRST2.fasta?raw=true" -O CARD_v3.0.8_SRST2.fasta
```
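
The download URL points at a GitHub page and relies on `?raw=true` to be redirected to the raw file, so if anything goes wrong the saved file could be HTML rather than FASTA. A quick sanity check, sketched here with the hypothetical helpers `looks_like_fasta` and `count_records`:

```shell
# looks_like_fasta FILE: succeed only if FILE is non-empty and its first
# character is '>', i.e. it starts with a FASTA header rather than HTML.
looks_like_fasta() {
    [ -s "$1" ] && [ "$(head -n 1 "$1" | cut -c 1)" = ">" ]
}

# count_records FILE: print the number of FASTA header lines.
count_records() {
    grep -c '^>' "$1"
}
```

For example: `looks_like_fasta CARD_v3.0.8_SRST2.fasta && count_records CARD_v3.0.8_SRST2.fasta`.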

The pipeline was executed as:

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-09-19-fastq_QC/bbduk/*_{1,2}.fq' -profile ilifu --gene_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/CARD_v3.0.8_SRST2.fasta --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_CARD/coverage_80_run --min_gene_cov 80
```

Individual results files were compiled with:

```
srst2 --prev_output *results.txt --output CARD_AMRs
```

# Virulence factors

Building the relevant VFDB for Klebsiella requires a Python script that needs the Biopython module (use the /cbio/users/katie/singularity_containers/srst2_v2.simg Singularity container for this).
NB: in order to use the correct Python version (2.7.5) for srst2, I first had to comment out the lines at the end of my .bashrc file relating to conda initialize.

Build the genus-specific DB:
```
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/database_clustering/VFDBgenus.py --infile /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/VFDB_setB_nt.fas --genus Klebsiella
```
This creates the VF DB Klebsiella.fsa.

The same procedure (as last year ;) was executed for Escherichia, Serratia and Enterobacter.

The cd-hit Docker image (cd-hit is needed to build the VFDB, as outlined here https://github.com/katholt/srst2#using-the-vfdb-virulence-factor-database-with-srst2) was pulled from https://quay.io/repository/biocontainers/cd-hit?tab=tags and converted to a Singularity image on the BST server. Open a shell in the container with:
```
singularity exec /cbio/users/katie/singularity_containers/cd-hit.simg /bin/bash
```

Then run CD-HIT to cluster the sequences for this genus at 90% nucleotide identity:

```
cd-hit -i Klebsiella.fsa -o Klebsiella_cdhit90 -c 0.9 > Klebsiella_cdhit90.stdout
```

Repeat for other .fsa DBs
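
Repeating the clustering step for the remaining genera can be written as a loop. This is a sketch only: `cluster_genera` is a hypothetical helper, and the `CDHIT` variable is an assumption added so the commands can be previewed (e.g. with `CDHIT=echo`) before running them inside the cd-hit container.

```shell
# cluster_genera GENUS...: run the clustering step once per genus,
# reading GENUS.fsa and writing GENUS_cdhit90* in the current directory.
# CDHIT defaults to cd-hit; set CDHIT=echo to write the arguments to the
# .stdout file instead of running cd-hit.
cluster_genera() {
    for genus in "$@"; do
        if [ ! -s "$genus.fsa" ]; then
            echo "missing or empty: $genus.fsa" >&2
            continue
        fi
        ${CDHIT:-cd-hit} -i "$genus.fsa" -o "${genus}_cdhit90" -c 0.9 \
            > "${genus}_cdhit90.stdout"
    done
}
```

For the genera on this page the call would be `cluster_genera Escherichia Serratia Enterobacter`.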

Next, parse the cluster output and tabulate the results using the script compatible with this specific virulence gene DB (use srst2_v2.simg again):

```
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/database_clustering/VFDB_cdhit_to_csv_KLedit.py --cluster_file Klebsiella_cdhit90.clstr --infile Klebsiella.fsa --outfile Klebsiella_cdhit90.csv
```

Next, convert the resulting csv table to an SRST2-compatible sequence database using:

```
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/database_clustering/csv_to_gene_db.py -t Klebsiella_cdhit90.csv -o Klebsiella_VF_clustered.fasta -s 5
```

The actual VF typing can now be done using this clustered DB:

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-09-19-fastq_QC/bbduk/*_{1,2}.fq' -profile ilifu --gene_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Klebsiella_VF_clustered.fasta --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_VFs/coverage_80_run --min_gene_cov 80
```

The same was done for the other genera using:
```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Escherichia_VF_clustered.fasta
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Serratia_VF_clustered.fasta
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Enterobacter_VF_clustered.fasta
```

Again, combine the individual sample results files with e.g.:
```
srst2 --prev_output *genes* --output Klebsiella_VFs
```

# MLST
MLST profiles were downloaded for E. coli and K. pneumoniae as:

```
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/scripts/getmlst.py --species 'Escherichia coli#1'
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/scripts/getmlst.py --species 'Escherichia coli#2'
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/scripts/getmlst.py --species 'Klebsiella pneumoniae'
```

Note: MLST profiles were not available for Serratia marcescens or Enterobacter cloacae.

MLST profiling execution:

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-09-19-fastq_QC/bbduk/*_{1,2}.fq' -profile ilifu --mlst_definitions /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Klebsiella_definitions --mlst_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Klebsiella_pneumoniae.fasta --mlst_delimiter _ --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_MLSTs/Klebsiella_MLSTs
```

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-09-19-fastq_QC/bbduk/*_{1,2}.fq' -profile ilifu --mlst_definitions /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/E_coli_1_definitions --mlst_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Escherichia_coli#1.fasta --mlst_delimiter _ --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_MLSTs/E_coli1_MLSTs
```

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-09-19-fastq_QC/bbduk/*_{1,2}.fq' -profile ilifu --mlst_definitions /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/E_coli_2_definitions --mlst_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Escherichia_coli#2.fasta --mlst_delimiter _ --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_MLSTs/E_coli2_MLSTs
```

# Combining runs

For preliminary analysis, the run 1 output was combined with the output from runs 2-4 with e.g. (run from the directory /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_combined_output_run1-4/CARD):

```
ln -s ../../srst2_CARD_run2to4/coverage_80_run/srst2/*genes* ./
ln -s ../../srst2_CARD_v3/coverage_80_run/srst2/*genes* ./
```
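
Relative links like these resolve from the directory that holds the link, and `ln -s` does not check that the target exists. If the `../../` depth is wrong, or a glob matches nothing and is passed through literally, the result is a dangling link. A small check, sketched with the hypothetical helper name `broken_links`:

```shell
# broken_links DIR: print symlinks under DIR whose targets do not exist.
# With -L, find follows links, so anything still reported as type "l"
# is a link that could not be resolved.
broken_links() {
    find -L "$1" -type l
}
```

Running `broken_links .` in the combined output directory should print nothing before the results are compiled.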

# Tychus alignment module

Note that the ilifu branch of the Tychus repo should be used. Because we don't have a main.nf file, we can't use the standard `nextflow pull` syntax, so I did a git clone instead:

`git clone --branch ilifu https://github.com/kviljoen/Tychus/`

A list of FASTA files for the reference genomes was created (location given below).
NB: an error occurred when trying to use E_coli_ATCC_25922_CP009072_1.fasta as --genome (but not with e.g. Serratia).

NB: error in the makephylogenies process:

```
.command.sh: 7: [: missing ]
  mv: cannot stat 'kSNP3_results/*.tre': No such file or directory
```
If you go into the working directory you'll find a NameErrors.txt file, which states that some of the genome file names are illegal. The rule that applies here is: "A file name may contain only one dot ('.') character, that which separates the file ID from the extension. EcoSME175.fasta is legal, EcoSME17.5.fasta is not."

So all files were renamed from e.g. E_coli_ATCC_25922_CP009072.1.fasta to E_coli_ATCC_25922_CP009072_1.fasta
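
The rename can be expressed as a small function that keeps only the final dot (the one before the extension) and turns every other dot into an underscore. This is a sketch; `ksnp_safe_name` is a hypothetical helper name and it expects a bare file name without directories.

```shell
# ksnp_safe_name FILE: print FILE with every dot except the one before
# the extension replaced by an underscore. Names without a dot are
# printed unchanged.
ksnp_safe_name() {
    case $1 in
        *.*) ;;
        *)   printf '%s\n' "$1"; return 0 ;;
    esac
    ext=${1##*.}      # text after the last dot, e.g. "fasta"
    stem=${1%.*}      # everything before the last dot
    printf '%s.%s\n' "$(printf '%s' "$stem" | tr '.' '_')" "$ext"
}
```

It could then be applied to a folder with `for f in *.fasta; do new=$(ksnp_safe_name "$f"); [ "$new" = "$f" ] || mv -v "$f" "$new"; done`.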

The list of reference FASTA files is located here:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/full_fasta_list
```
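
This page does not record how the list was generated or its exact format. Assuming it is a plain text file with one absolute FASTA path per line, it could be produced along these lines; `write_fasta_list` is a hypothetical helper name.

```shell
# write_fasta_list DIR OUT: write the absolute path of every *.fasta file
# in DIR to OUT, one per line.
# (Assumed format; check it against what the ilifu branch of Tychus expects.)
write_fasta_list() {
    abs=$(cd "$1" && pwd) || return 1
    for f in "$abs"/*.fasta; do
        [ -e "$f" ] || continue      # glob matched nothing
        printf '%s\n' "$f"
    done > "$2"
}
```

Note that ref_files also holds the AMR, VF and MLST databases as .fasta files, so the function is best pointed at a directory that contains only the renamed reference genomes.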

Alignment run example against Serratia:

```
nextflow alignment.nf --alignment_out_dir /scratch3/users/katiel/Clinton/CRE_study_August_2022/Tychus_alignment_Serratia --genome /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Serratia_marescens_KS10_CP027798_1.fasta --read_pairs '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-10-10-fastq_QC/bbduk/*_{1,2}.fq' --user_genome_paths /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/full_fasta_list -profile alignment
```