Wiki » History » Version 23

Katie Lennard, 11/14/2022 09:45 AM

# Wiki

# Data location:

The data was transferred from Athena (medmicro) with e.g.:

```
rsync -avvP -e "ssh -i /home/katie/.ssh/id_rsa" /mnt/athena/medmicro/Clinton/CRE\ Pfizer\ Feb\ 2022/CRE\ study_4_results_11112022 katiel@transfer.ilifu.ac.za:/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/
```

to Ilifu:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/
```

# Reference data:

Klebsiella pneumoniae – strain HS11286 (GenBank accession no. CP003200.1) (n=18);
Serratia marcescens – strain KS10 (GenBank accession no. CP027798.1) (n=3);
Escherichia coli – strain ATCC 25922 (GenBank accession no. CP009072.1) (n=1); and
Enterobacter cloacae – strain ATCC 13047 (GenBank accession no. NC_014121.1) (n=1).

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_genomes
```

# Objectives workflow:
![workflow.png]()

# QC:

Eleven samples from run 1 failed QC and had to be rerun. Note that these 11 samples (study1A) were accidentally rerun twice, once on 28 Feb and once on 22 September. The two reruns were merged by concatenating the reads per sample, e.g.:

```
cat KLEB-CRE-GSH-0016_S11_L001_R2_001.fastq.gz >> merged_reads/G-16_S11_L001_R2_001.fastq.gz
```
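The command above merges a single file. A minimal sketch of the same merge across a whole folder might look like the following. It assumes (hypothetically) that the two reruns sit in separate directories and use identical file names, which is not the case for the renamed `G-16_...` example above, so the name mapping would need adjusting. `merge_reruns` and the directory names are illustrative and not part of the original workflow.

```shell
# Hypothetical helper: concatenate same-named gzipped FASTQ files from two
# rerun directories into an output directory (all names are assumptions).
merge_reruns() {
    run_a=$1; run_b=$2; out_dir=$3
    mkdir -p "$out_dir"
    for f in "$run_a"/*.fastq.gz; do
        [ -e "$f" ] || continue            # no matches: skip the literal pattern
        name=$(basename "$f")
        if [ ! -e "$run_b/$name" ]; then
            echo "no matching file for $name in $run_b, skipping" >&2
            continue
        fi
        # Concatenated gzip members form a valid gzip stream
        cat "$f" "$run_b/$name" > "$out_dir/$name"
    done
}
```

It could be called as e.g. `merge_reruns rerun_feb rerun_sep merged_reads` (directory names hypothetical). Writing with `>` rather than `>>` avoids appending a second copy of the reads if the loop is run again.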

File location:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/11_double_rerun_merged/merged_reads
```

Next, these 11 merged-run samples were joined with run B (which passed QC) in one folder via symlinks:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_1A_B_combined
```
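The symlinking commands themselves are not recorded here. One way this could be done is sketched below; `link_into` is a hypothetical helper name, and the use of absolute link targets is a design assumption so that the links resolve regardless of where they are read from.

```shell
# Hypothetical helper: symlink every *.fastq.gz from one or more source
# directories into a single destination directory.
link_into() {
    dest=$1; shift
    mkdir -p "$dest"
    for src_dir in "$@"; do
        # Resolve to an absolute path so the links stay valid from anywhere
        abs=$(cd "$src_dir" && pwd) || continue
        for f in "$abs"/*.fastq.gz; do
            [ -e "$f" ] || continue        # no matches: skip the literal pattern
            ln -s "$f" "$dest/"
        done
    done
}
```

Usage would be along the lines of `link_into study_1A_B_combined 11_double_rerun_merged/merged_reads <run_B_directory>`, where the run B directory is whatever folder holds the run B reads.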

Filtering and trimming were executed as follows:

```
nextflow run kviljoen/fastq_QC --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_1A_B_combined/*_R{1,2}_001.fastq.gz' -profile ilifu
```

QC reports can be found in the 'files' tab.

Runs 2 and 3 were combined with symlinks under:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_2_3_combined
```

FastQC was done (all samples passed):

```
nextflow run kviljoen/fastq_QC --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_2_3_combined/*_R{1,2}_001.fastq.gz' -profile ilifu
```

and the results can be found here:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-11-10-fastq_QC
```

Run 4 was added next and QCed.

Note: to agree with the srst2 file naming specifications I renamed the trimmed files from e.g. `*_R1.fq` to `*_1.fq` (removing the R) using e.g.

```
for f in *.fq; do mv -v "$f" "${f/_R/_}"; done
```
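Note that `${f/_R/_}` replaces the first `_R` anywhere in the file name, so a sample ID that itself contains `_R` would be renamed incorrectly. A more defensive sketch, assuming the trimmed files end in `_R1.fq` or `_R2.fq`, anchors the substitution at the end of the name. `rename_for_srst2` is a hypothetical helper name.

```shell
# Hypothetical helper: rename <sample>_R1.fq / <sample>_R2.fq in a directory
# to <sample>_1.fq / <sample>_2.fq, touching only the read-number suffix.
rename_for_srst2() {
    for f in "$1"/*_R[12].fq; do
        [ -e "$f" ] || continue            # no matches: skip the literal pattern
        # The $ anchor restricts the edit to the trailing _R1.fq / _R2.fq
        new=$(printf '%s\n' "$f" | sed 's/_R\([12]\)\.fq$/_\1.fq/')
        mv -v "$f" "$new"
    done
}
```

For example, `rename_for_srst2 .` would process the current directory.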

# AMR profiling

Clinton's preference is to do AMR profiling with the ResFinder DB. I'm getting errors with that DB, which I think relate to the header formatting, so in the interim I have run srst2 with the ARGannot DB that we used for previous projects:

## ARGannot

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-09-19-fastq_QC/bbduk/*_{1,2}.fq' -profile ilifu --gene_db /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/ARGannot_r3.fasta --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_ARGannot/coverage_80_run --min_gene_cov 80
```

Individual results files were compiled as:

```
srst2 --prev_output *results.txt --output ARGannot_AMRs
```

## CARD DB:

This is the database recommended by srst2, and it has already been formatted for srst2 by its authors. The DB was downloaded with:

```
wget 'https://github.com/katholt/srst2/blob/master/data/CARD_v3.0.8_SRST2.fasta?raw=true' -O CARD_v3.0.8_SRST2.fasta
```
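For reference, srst2 expects gene DB headers of the form `>clusterID__geneSymbol__alleleSymbol__alleleID`, with the four fields separated by double underscores. A quick sanity check along the following lines could be run on any candidate DB, for example the ResFinder file that produced errors. Whether header formatting is really the cause of those errors is an assumption, and `check_srst2_headers` is a hypothetical helper.

```shell
# Hypothetical helper: report FASTA headers whose ID (first word after '>')
# does not split into exactly four '__'-separated fields.
check_srst2_headers() {
    awk '/^>/ {
        id = substr($1, 2)
        if (split(id, parts, "__") != 4) { bad++; print "line " NR ": " $0 }
    }
    END { exit (bad > 0) }' "$1"
}
```

`check_srst2_headers CARD_v3.0.8_SRST2.fasta` prints any offending header lines and returns a non-zero exit status if it finds one.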

Pipeline execution as:

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-09-19-fastq_QC/bbduk/*_{1,2}.fq' -profile ilifu --gene_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/CARD_v3.0.8_SRST2.fasta --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_CARD/coverage_80_run --min_gene_cov 80
```

Individual results files were compiled as:

```
srst2 --prev_output *results.txt --output CARD_AMRs
```

# Virulence factors

Building the relevant VFDB for Klebsiella requires a Python script that needs the Biopython module (use the /cbio/users/katie/singularity_containers/srst2_v2.simg Singularity container for this).

NB: in order to use the correct Python version (2.7.5) for srst2 I first had to comment out the lines at the end of my .bashrc file relating to conda initialize.

Build the genus-specific DB:

```
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/database_clustering/VFDBgenus.py --infile /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/VFDB_setB_nt.fas --genus Klebsiella
```

This was used to create the VF DB Klebsiella.fsa.

The same procedure (as last year) was executed for Escherichia, Serratia and Enterobacter.

A cd-hit Docker image (cd-hit is needed to build the VFDB, as outlined here https://github.com/katholt/srst2#using-the-vfdb-virulence-factor-database-with-srst2) was pulled from https://quay.io/repository/biocontainers/cd-hit?tab=tags and converted to a Singularity image on the BST server:

```
singularity exec /cbio/users/katie/singularity_containers/cd-hit.simg /bin/bash
```

Then run CD-HIT to cluster the sequences for this genus at 90% nucleotide identity:

```
cd-hit -i Klebsiella.fsa -o Klebsiella_cdhit90 -c 0.9 > Klebsiella_cdhit90.stdout
```

Repeat for the other .fsa DBs.
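To repeat the clustering for the remaining genera, the commands can be generated in a loop. The sketch below only prints the commands (mirroring the Klebsiella call above) so they can be inspected before being run inside the cd-hit container. `cdhit_cmds` is a hypothetical helper, and the genus names are those listed earlier in this section.

```shell
# Hypothetical helper: print one cd-hit command per genus, using the same
# options as the Klebsiella example (90% identity threshold).
cdhit_cmds() {
    for genus in "$@"; do
        printf 'cd-hit -i %s.fsa -o %s_cdhit90 -c 0.9 > %s_cdhit90.stdout\n' \
            "$genus" "$genus" "$genus"
    done
}
```

Once the printed commands look right, they could be executed with e.g. `cdhit_cmds Escherichia Serratia Enterobacter | sh` from the directory holding the .fsa files.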

Next, parse the cluster output and tabulate the results using the script compatible with this specific virulence gene DB (use srst2_v2.simg again):

```
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/database_clustering/VFDB_cdhit_to_csv_KLedit.py --cluster_file Klebsiella_cdhit90.clstr --infile Klebsiella.fsa --outfile Klebsiella_cdhit90.csv
```

Next, convert the resulting csv table to a SRST2-compatible sequence database using:

```
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/database_clustering/csv_to_gene_db.py -t Klebsiella_cdhit90.csv -o Klebsiella_VF_clustered.fasta -s 5
```

The actual VF typing can now be done using this clustered DB:

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-09-19-fastq_QC/bbduk/*_{1,2}.fq' -profile ilifu --gene_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Klebsiella_VF_clustered.fasta --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_VFs/coverage_80_run --min_gene_cov 80
```

The same was done for the other genera using:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Escherichia_VF_clustered.fasta
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Serratia_VF_clustered.fasta
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Enterobacter_VF_clustered.fasta
```

Again, combine the individual sample results files with e.g.

```
srst2 --prev_output *genes* --output Klebsiella_VFs
```

# MLST

MLST profiles were downloaded for E. coli and K. pneumoniae as:

```
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/scripts/getmlst.py --species 'Escherichia coli#1'
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/scripts/getmlst.py --species 'Escherichia coli#2'
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/scripts/getmlst.py --species 'Klebsiella pneumoniae'
```

Note: MLST profiles are not available for Serratia marcescens or Enterobacter cloacae.

MLST profiling execution:

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-09-19-fastq_QC/bbduk/*_{1,2}.fq' -profile ilifu --mlst_definitions /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Klebsiella_definitions --mlst_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Klebsiella_pneumoniae.fasta --mlst_delimiter _ --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_MLSTs/Klebsiella_MLSTs
```

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-09-19-fastq_QC/bbduk/*_{1,2}.fq' -profile ilifu --mlst_definitions /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/E_coli_1_definitions --mlst_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Escherichia_coli#1.fasta --mlst_delimiter _ --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_MLSTs/E_coli1_MLSTs
```

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-09-19-fastq_QC/bbduk/*_{1,2}.fq' -profile ilifu --mlst_definitions /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/E_coli_2_definitions --mlst_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Escherichia_coli#2.fasta --mlst_delimiter _ --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_MLSTs/E_coli2_MLSTs
```

# Combining runs

For preliminary analysis, the run 1 output was combined with the output from runs 2-4 using e.g. (from the directory /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_combined_output_run1-4/CARD):

```
ln -s ../../srst2_CARD_run2to4/coverage_80_run/srst2/*genes* ./
ln -s ../../srst2_CARD_v3/coverage_80_run/srst2/*genes* ./
```
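Relative symlinks created with a glob can fail silently: if the pattern matches nothing, the shell passes it through literally and `ln -s` creates a link to a target that does not exist. A quick check for dangling links, sketched here with a hypothetical helper `find_broken_links`, might be worth running in the combined directory.

```shell
# Hypothetical helper: list symlinks under a directory whose target is missing.
find_broken_links() {
    # 'test -e' follows the link, so it fails when the target does not exist
    find "$1" -type l ! -exec test -e {} \; -print
}
```

Running `find_broken_links .` in the combined directory should print nothing if all links resolve.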

# Tychus alignment module

Note that the ilifu branch of the Tychus repo should be used. Because we don't have a main.nf file we can't use the standard `nextflow pull` syntax, so I did a git clone instead:

`git clone --branch ilifu https://github.com/kviljoen/Tychus/`

A list of FASTA files for the reference genomes was created here:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/full_fasta_list
```

NB: Error when trying to use E_coli_ATCC_25922_CP009072_1.fasta as --genome (but not with e.g. Serratia).

NB: error in the makephylogenies process:

```
.command.sh: 7: [: missing ]
  mv: cannot stat 'kSNP3_results/*.tre': No such file or directory
```

If you go into the working directory you'll find a NameErrors.txt file which states that some of the genome file names are illegal. The rule that applies here is: a file name may contain only one dot ('.') character, that which separates the file ID from the extension.

        EcoSME175.fasta is legal, EcoSME17.5.fasta is not

So all files were renamed from e.g. E_coli_ATCC_25922_CP009072.1.fasta to E_coli_ATCC_25922_CP009072_1.fasta
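The renaming to a single dot per file name, described above, can be scripted. A minimal sketch, assuming all reference genomes are `.fasta` files in one directory (`fix_dots` is a hypothetical helper name):

```shell
# Hypothetical helper: replace every dot in the file ID with an underscore,
# leaving only the dot that separates the ID from the .fasta extension.
fix_dots() {
    for f in "$1"/*.fasta; do
        [ -e "$f" ] || continue            # no matches: skip the literal pattern
        base=$(basename "$f" .fasta)
        clean=$(printf '%s' "$base" | tr '.' '_')
        # Only rename when the ID actually contained a dot
        [ "$base" = "$clean" ] || mv -v "$f" "$1/$clean.fasta"
    done
}
```

After running e.g. `fix_dots ref_files`, any paths listed in full_fasta_list would presumably need updating to match the new names.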

Alignment run example against Serratia:

```
nextflow alignment.nf --alignment_out_dir /scratch3/users/katiel/Clinton/CRE_study_August_2022/Tychus_alignment_Serratia --genome /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Serratia_marescens_KS10_CP027798_1.fasta --read_pairs '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-10-10-fastq_QC/bbduk/*_{1,2}.fq' --user_genome_paths /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/full_fasta_list -profile alignment
```