Wiki » History » Version 26
Katie Lennard, 03/28/2023 07:35 AM
# Wiki

# Data location:

The data was transferred from Athena (medmicro) with e.g.:

```
rsync -avvP -e "ssh -i /home/katie/.ssh/id_rsa" /mnt/athena/medmicro/Clinton/CRE\ Pfizer\ Feb\ 2022/CRE\ study_4_results_11112022 katiel@transfer.ilifu.ac.za:/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/
```

to Ilifu:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/
```

# Reference data:

Klebsiella pneumoniae – strain HS11286 (GenBank accession no. CP003200.1) (n=18);
Serratia marcescens – strain KS10 (GenBank accession no. CP027798.1) (n=3);
Escherichia coli – strain ATCC 25922 (GenBank accession no. CP009072.1) (n=1); and
Enterobacter cloacae – strain ATCC 13047 (GenBank accession no. NC_014121.1) (n=1).

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_genomes
```

# Objectives workflow:

![workflow.png]()

# QC:

11 samples from Run 1 failed QC and had to be rerun. Note that these 11 samples (study1A) were accidentally rerun twice, once on 28 Feb and once on 22 September. The two reruns were merged by concatenating the reads per sample, e.g.:

```
cat KLEB-CRE-GSH-0016_S11_L001_R2_001.fastq.gz >> merged_reads/G-16_S11_L001_R2_001.fastq.gz
```

File location:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/11_double_rerun_merged/merged_reads
```
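
The merge above relies on the fact that gzip files concatenated with `cat` form one valid multi-member gzip stream. A minimal sketch of how that could be sanity-checked, using small hypothetical demo files in a temporary directory rather than the real reads:

```bash
# Demo files are hypothetical stand-ins for the two reruns of one sample.
tmp=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n' | gzip > "$tmp/rerun_feb.fastq.gz"
printf '@r2\nTTGA\n+\nIIII\n' | gzip > "$tmp/rerun_sep.fastq.gz"

# Same idea as the merge above: append both reruns into one file.
cat "$tmp/rerun_feb.fastq.gz" >> "$tmp/merged.fastq.gz"
cat "$tmp/rerun_sep.fastq.gz" >> "$tmp/merged.fastq.gz"

# gzip -t tests integrity; a FASTQ record is 4 lines, so reads = lines / 4.
gzip -t "$tmp/merged.fastq.gz"
merged_reads=$(( $(gzip -dc "$tmp/merged.fastq.gz" | wc -l) / 4 ))
echo "reads in merged file: $merged_reads"
```

Because `>>` appends, running the merge command a second time would duplicate reads, so comparing the merged read count with the sum of the inputs is a cheap guard.
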

Next, these 11 merged-run samples were joined in one folder via symlinks with run B (passed QC):

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_1A_B_combined
```

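The exact linking commands were not recorded here. As a rough sketch of how two folders could be joined via symlinks, assuming hypothetical demo folders and file names in a temporary directory:

```bash
# Hypothetical layout standing in for the merged reruns and run B.
tmp=$(mktemp -d)
mkdir -p "$tmp/merged_reads" "$tmp/run_B" "$tmp/combined"
touch "$tmp/merged_reads/G-16_S11_L001_R1_001.fastq.gz" \
      "$tmp/run_B/G-20_S1_L001_R1_001.fastq.gz"

# Absolute link targets stay valid regardless of the current directory.
for src in "$tmp"/merged_reads/*.fastq.gz "$tmp"/run_B/*.fastq.gz; do
  ln -s "$src" "$tmp/combined/$(basename "$src")"
done

n_links=$(( $(find "$tmp/combined" -type l | wc -l) ))
echo "linked files: $n_links"
```
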
Filtering and trimming were executed as follows:

```
nextflow run kviljoen/fastq_QC --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_1A_B_combined/*_R{1,2}_001.fastq.gz' -profile ilifu
```

QC reports can be found in the 'files' tab.

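The `--reads` glob expects an R1 and an R2 file per sample. A small pre-flight check along the following lines could flag samples with a missing mate before the pipeline is launched. This is only a sketch; the demo file names are hypothetical and merely follow the same naming pattern:

```bash
# Hypothetical demo files; sample B deliberately has no R2 mate.
tmp=$(mktemp -d)
touch "$tmp/A_S1_L001_R1_001.fastq.gz" "$tmp/A_S1_L001_R2_001.fastq.gz" \
      "$tmp/B_S2_L001_R1_001.fastq.gz"

missing=0
for r1 in "$tmp"/*_R1_001.fastq.gz; do
  # Derive the expected R2 name by swapping the suffix.
  r2="${r1%_R1_001.fastq.gz}_R2_001.fastq.gz"
  if [ ! -e "$r2" ]; then
    echo "no R2 mate for: $(basename "$r1")"
    missing=$((missing + 1))
  fi
done
echo "samples missing a mate: $missing"
```
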
Runs 2 and 3 were combined with symlinks under:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_2_3_combined
```

FastQC was done (all samples passed):

```
nextflow run kviljoen/fastq_QC --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_2_3_combined/*_R{1,2}_001.fastq.gz' -profile ilifu
```

and the results can be found here:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-11-10-fastq_QC
```

Run 4 was added next and QCed:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-11-12-fastq_QC
```

Run 5 was added and QCed:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/2023-03-24-fastq_QC
```

Note: to agree with srst2 file naming specifications, I renamed the trimmed files from e.g. `*_R1.fq` to `*_1.fq` (removing the R) using e.g.

```
for f in *.fq; do mv -v "$f" "${f/_R/_}";done
```

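One caveat with `${f/_R/_}` is that it replaces the first `_R` in the name, which would hit the wrong spot if a sample ID itself contained `_R`. A suffix-anchored variant is sketched below; the file names are hypothetical and chosen to show that case:

```bash
# Hypothetical names that contain an earlier "_R" inside the sample ID.
tmp=$(mktemp -d)
touch "$tmp/KLEB_RERUN_R1.fq" "$tmp/KLEB_RERUN_R2.fq"

for f in "$tmp"/*_R[12].fq; do
  mate="${f##*_R}"               # text after the last "_R": "1.fq" or "2.fq"
  new="${f%_R[12].fq}_${mate}"   # strip the "_R1.fq"/"_R2.fq" suffix, re-add without the R
  mv -v "$f" "$new"
done
ls "$tmp"
```

For names without an extra `_R`, both approaches give the same result.
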
# AMR profiling

The preference from Clinton is to do AMR profiling with the ResFinder DB. I'm getting errors there that I think relate to the header formatting, so in the interim I have run with the ARG_annot DB that we used for previous projects:

## ARGannot

Runs 1-5 combined:

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/run_1to5_cleaned/*_{1,2}.fq' -profile ilifu --gene_db /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/ARGannot_r3.fasta --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_ARGannot_run_1to5/coverage_80_run --min_gene_cov 80
```

Individual results files were compiled with:

```
srst2 --prev_output *results.txt --output ARGannot_AMRs
```

## CARD DB:

This database is the one recommended by srst2 and has already been formatted by them. The DB was downloaded with:

```
wget 'https://github.com/katholt/srst2/blob/master/data/CARD_v3.0.8_SRST2.fasta?raw=true' -O CARD_v3.0.8_SRST2.fasta
```

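When downloading from a GitHub `blob` URL it is easy to end up with an HTML page saved under a `.fasta` name (for instance if `?raw=true` is dropped). A quick check that the file starts with a FASTA header could look like the sketch below; `looks_like_fasta` and the demo files are hypothetical helpers, not part of srst2:

```bash
# Succeeds if the first line of the file starts with '>' (a FASTA header).
looks_like_fasta() {
  head -n 1 "$1" | grep -q '^>'
}

tmp=$(mktemp -d)
printf '>seq1 demo\nACGT\n' > "$tmp/ok.fasta"           # stands in for a good download
printf '<!DOCTYPE html>\n<html>\n' > "$tmp/bad.fasta"    # stands in for an HTML page saved by mistake

for f in "$tmp/ok.fasta" "$tmp/bad.fasta"; do
  if looks_like_fasta "$f"; then
    echo "$(basename "$f"): FASTA header found"
  else
    echo "$(basename "$f"): does not start with '>', check the download URL"
  fi
done
```
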
Pipeline execution (runs 1-5):

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/run_1to5_cleaned/*_{1,2}.fq' -profile ilifu --gene_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/CARD_v3.0.8_SRST2.fasta --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_CARD_run1to5/coverage_80_run --min_gene_cov 80
```

Individual results files were compiled with:

```
srst2 --prev_output *results.txt --output CARD_AMRs
```

# Plasmids

PlasmidFinder plasmids, runs 1-5 (note: minimum gene coverage of 50% here rather than 80%, why?)

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/run_1to5_cleaned/*_{1,2}.fq' -profile ilifu --gene_db /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/data/PlasmidFinder.fasta --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_plasmidFinder_run1to5 --min_cov 50
```

Now combine profiles for all samples:

```
srst2 --prev_output *results.txt --output plasmidFinder
```

# Virulence factors

Building the relevant VFDB for Klebsiella requires a python script that needs the biopython module (use the /cbio/users/katie/singularity_containers/srst2_v2.simg singularity container for this).
NB: in order to use the correct python version (2.7.5) for srst2 I first had to comment out the lines at the end of my .bashrc file relating to conda initialize.

Build genus-specific DB:

```
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/database_clustering/VFDBgenus.py --infile /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/VFDB_setB_nt.fas --genus Klebsiella
```

This was used to create the VF DB Klebsiella.fsa.

The same procedure (as last year ;) was executed for Escherichia, Serratia and Enterobacter.

cd-hit is needed to build the VFDB (as outlined here https://github.com/katholt/srst2#using-the-vfdb-virulence-factor-database-with-srst2). A cd-hit Docker image was pulled from https://quay.io/repository/biocontainers/cd-hit?tab=tags and converted to a singularity image on the BST server:

```
singularity exec /cbio/users/katie/singularity_containers/cd-hit.simg /bin/bash
```

Then run CD-HIT to cluster the sequences for this genus at 90% nucleotide identity:

```
cd-hit -i Klebsiella.fsa -o Klebsiella_cdhit90 -c 0.9 > Klebsiella_cdhit90.stdout
```

Repeat for the other .fsa DBs.

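The repetition over genera could be scripted as a loop. The sketch below only prints the commands (a dry run) so that it can be inspected without cd-hit installed. It assumes the other genus DBs are named like Klebsiella.fsa (e.g. Escherichia.fsa), which is not stated explicitly above, and `print_cdhit_cmds` is a hypothetical helper name:

```bash
# Dry run: print one cd-hit command per genus, mirroring the Klebsiella call above.
# Assumption: each genus DB is named <Genus>.fsa.
print_cdhit_cmds() {
  for genus in Klebsiella Escherichia Serratia Enterobacter; do
    echo "cd-hit -i ${genus}.fsa -o ${genus}_cdhit90 -c 0.9 > ${genus}_cdhit90.stdout"
  done
}

print_cdhit_cmds
```

To execute rather than print, the output could be piped to `sh` from inside the cd-hit container.
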
Next, parse the cluster output and tabulate the results using the script that is compatible with the specific virulence gene DB (use srst2_v2.simg again):

```
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/database_clustering/VFDB_cdhit_to_csv_KLedit.py --cluster_file Klebsiella_cdhit90.clstr --infile Klebsiella.fsa --outfile Klebsiella_cdhit90.csv
```

Next, convert the resulting csv table to an SRST2-compatible sequence database using:

```
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/database_clustering/csv_to_gene_db.py -t Klebsiella_cdhit90.csv -o Klebsiella_VF_clustered.fasta -s 5
```

The actual VF typing can now be done using this clustered DB (runs 1-5):

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/run_1to5_cleaned/*_{1,2}.fq' -profile ilifu --gene_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Klebsiella_VF_clustered.fasta --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_VFs_run1to5/coverage_80_run --min_gene_cov 80
```

The same was done for the other genera using:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Escherichia_VF_clustered.fasta
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Serratia_VF_clustered.fasta
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Enterobacter_VF_clustered.fasta
```

Again, combine the individual sample results files with e.g.

```
srst2 --prev_output *genes* --output Klebsiella_VFs
```

# MLST

MLST profiles were downloaded for E. coli and K. pneumoniae as:

```
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/scripts/getmlst.py --species 'Escherichia coli#1'
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/scripts/getmlst.py --species 'Escherichia coli#2'
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/scripts/getmlst.py --species 'Klebsiella pneumoniae'
```

Note: MLST profiles are not available for Serratia marcescens or Enterobacter cloacae.

MLST profiling execution:

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/run_1to5_cleaned/*_{1,2}.fq' -profile ilifu --mlst_definitions /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Klebsiella_definitions --mlst_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Klebsiella_pneumoniae.fasta --mlst_delimiter _ --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_MLSTs_run_1to5/Klebsiella_MLSTs
```

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/run_1to5_cleaned/*_{1,2}.fq' -profile ilifu --mlst_definitions /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/E_coli_1_definitions --mlst_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Escherichia_coli#1.fasta --mlst_delimiter _ --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_MLSTs_run_1to5/E_coli1_MLSTs
```

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/run_1to5_cleaned/*_{1,2}.fq' -profile ilifu --mlst_definitions /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/E_coli_2_definitions --mlst_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Escherichia_coli#2.fasta --mlst_delimiter _ --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_MLSTs_run_1to5/E_coli2_MLSTs
```

Again, combine the individual sample results files with e.g.

```
srst2 --prev_output *results* --output Klebsiella_MLSTs
```

# Combining runs

For the preliminary analysis, run 1 was combined with the output from runs 2-4 with e.g. (from directory /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_combined_output_run1-4/CARD)

```
ln -s ../../srst2_CARD_run2to4/coverage_80_run/srst2/*genes* ./
ln -s ../../srst2_CARD_v3/coverage_80_run/srst2/*genes* ./
```

Note: I subsequently reran everything after receiving the run 5 data (so runs 1-5 all together).

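Relative symlinks like the ones above only resolve if they are created from the intended directory, so it can be worth checking for dangling links afterwards. A minimal sketch with a hypothetical directory layout and hypothetical results file names:

```bash
# Hypothetical layout: one existing results folder and a combined output folder.
tmp=$(mktemp -d)
mkdir -p "$tmp/srst2_run/srst2" "$tmp/combined/CARD"
touch "$tmp/srst2_run/srst2/sampleA__genes__results.txt"

cd "$tmp/combined/CARD"
ln -s ../../srst2_run/srst2/*genes* ./                          # target exists
ln -s ../../missing_run/srst2/sampleB__genes__results.txt ./    # deliberately dangling

# With -L, find follows links, so anything still reported as type l is dangling.
broken=$(find -L . -type l)
echo "broken links: $broken"
```
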
# Tychus alignment module

Note that the ilifu branch of the Tychus repo should be used. Because we don't have a main.nf file we can't use the standard `nextflow pull` syntax, so I did a git clone instead:

git clone --branch ilifu https://github.com/kviljoen/Tychus/

A list of fasta files for reference genomes was created here:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/full_fasta_list
```

NB: Error when trying to use E_coli_ATCC_25922_CP009072_1.fasta as --genome (but not e.g. Serratia)

NB: error in makephylogenies process:

```
.command.sh: 7: [: missing ]
mv: cannot stat 'kSNP3_results/*.tre': No such file or directory
```

If you go into the working directory you'll find a NameErrors.txt file which states that some of the genome file names are illegal. The rule that applies here is: a file name may contain only one dot ('.') character, the one that separates the file ID from the extension.
EcoSME175.fasta is legal, EcoSME17.5.fasta is not.

So all files were renamed from e.g. E_coli_ATCC_25922_CP009072.1.fasta to E_coli_ATCC_25922_CP009072_1.fasta

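The one-dot naming rule and the path list in this section could be handled with a short script. The following is a sketch only: it works on demo files in a temporary directory, and it assumes the list file holds one absolute path per line, which is not confirmed above.

```bash
# Demo files: one with an illegal extra dot, one already legal.
tmp=$(mktemp -d)
touch "$tmp/E_coli_ATCC_25922_CP009072.1.fasta" "$tmp/EcoSME175.fasta"

for f in "$tmp"/*.fasta; do
  name=$(basename "$f" .fasta)
  # Replace every dot in the file ID with an underscore; the extension dot is kept.
  clean=$(printf '%s' "$name" | tr '.' '_')
  if [ "$name" != "$clean" ]; then
    mv -v "$f" "$tmp/$clean.fasta"
  fi
done

# Assumed list format: one absolute path per line.
ls "$tmp"/*.fasta > "$tmp/full_fasta_list"
cat "$tmp/full_fasta_list"
```
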
Alignment run example against Serratia:

```
nextflow alignment.nf --alignment_out_dir /scratch3/users/katiel/Clinton/CRE_study_August_2022/Tychus_alignment_Serratia --genome /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Serratia_marescens_KS10_CP027798_1.fasta --read_pairs '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-10-10-fastq_QC/bbduk/*_{1,2}.fq' --user_genome_paths /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/full_fasta_list -profile alignment
```

Runs 1-5 Tychus alignment against Klebsiella:

```
nextflow alignment.nf --alignment_out_dir /scratch3/users/katiel/Clinton/CRE_study_August_2022/Tychus_alignment_run1_5_Klebsiella --genome /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Klebsiella_pneumoniae_HS11286_CP003200_1.fasta --read_pairs '/scratch3/users/katiel/Clinton/CRE_study_August_2022/run_1to5_cleaned/*_{1,2}.fq' --user_genome_paths /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/full_fasta_list -profile alignment
```

Runs 1-5 Tychus alignment against Serratia:

```
nextflow alignment.nf --alignment_out_dir /scratch3/users/katiel/Clinton/CRE_study_August_2022/Tychus_alignment_Serratia_run1to5 --genome /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Serratia_marescens_KS10_CP027798_1.fasta --read_pairs '/scratch3/users/katiel/Clinton/CRE_study_August_2022/run_1to5_cleaned/*_{1,2}.fq' --user_genome_paths /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/full_fasta_list -profile alignment
```