Wiki » History » Version 21

Katie Lennard, 11/11/2022 07:53 AM

# Wiki

# Data location:

The data was transferred from Athena (medmicro):

```
/MedMicro/Clinton/CRE Pfizer Feb 2022/CRE study_1A_results_17022022
/MedMicro/Clinton/CRE Pfizer Feb 2022/CRE study_1B_results_21022022
```

to Ilifu:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/
```

# Reference data:

Klebsiella pneumoniae – strain HS11286 (GenBank accession no. CP003200.1) (n=18);
Serratia marcescens – strain KS10 (GenBank accession no. CP027798.1) (n=3);
Escherichia coli – strain ATCC 25922 (GenBank accession no. CP009072.1) (n=1); and
Enterobacter cloacae – strain ATCC 13047 (GenBank accession no. NC_014121.1) (n=1).

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_genomes
```

# Objectives workflow:
![workflow.png]()

# QC:
11 samples failed QC and had to be rerun. Note that these 11 samples (study 1A) were accidentally rerun twice: once on 28 Feb and once on 22 September. The two reruns were merged by concatenating the reads per sample, e.g.:

```
cat KLEB-CRE-GSH-0016_S11_L001_R2_001.fastq.gz >> merged_reads/G-16_S11_L001_R2_001.fastq.gz
```
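
The merge above can be repeated for every sample and read direction with a small loop. The following is a minimal sketch, not the exact commands that were run: the function name `merge_rerun` and the directory names in the usage note are hypothetical, and it assumes that each rerun directory holds files with identical names (the sample renaming shown above, e.g. KLEB-CRE-GSH-0016 to G-16, is not reproduced). Concatenated gzip files form a valid multi-member gzip stream, so nothing needs to be decompressed.

```shell
# Hypothetical helper: append every FASTQ file in one rerun directory onto the
# same-named file in the merged directory (created on first use).
merge_rerun() (
  run_dir="$1"
  merged_dir="$2"
  mkdir -p "$merged_dir"
  for fq in "$run_dir"/*.fastq.gz; do
    [ -e "$fq" ] || continue   # skip the literal pattern if nothing matches
    cat "$fq" >> "$merged_dir/$(basename "$fq")"
  done
)
```

For example, `merge_rerun rerun_28Feb merged_reads` followed by `merge_rerun rerun_22Sep merged_reads`. Because files are appended, calling it twice on the same rerun would duplicate reads, so it is safest to start from an empty output directory.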

Location of the merged files:
```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/11_double_rerun_merged/merged_reads
```

Next, these 11 merged-run samples were joined in one folder, via symlinks, with run B (which passed QC):

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_1A_B_combined
```
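
Combining runs via symlinks avoids copying the FASTQ files. A minimal sketch of how such a folder could be populated is given here; the function name `link_reads` is hypothetical, and the exact commands used for this project are not recorded in this wiki.

```shell
# Hypothetical helper: symlink every FASTQ file from a source directory into a
# combined directory, using absolute targets so the links resolve from anywhere.
link_reads() (
  src_dir="$(cd "$1" && pwd)" || exit 1
  dest_dir="$2"
  mkdir -p "$dest_dir"
  for fq in "$src_dir"/*.fastq.gz; do
    [ -e "$fq" ] || continue   # skip the literal pattern if nothing matches
    ln -sf "$fq" "$dest_dir/$(basename "$fq")"
  done
)
```

It would be called once per source folder, e.g. `link_reads raw/11_double_rerun_merged/merged_reads raw/study_1A_B_combined`, and again for the run B folder.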

Filtering and trimming were executed as follows:

```
nextflow run kviljoen/fastq_QC --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_1A_B_combined/*_R{1,2}_001.fastq.gz' -profile ilifu
```
QC reports can be found in the 'files' tab.

Runs 2 and 3 were combined with symlinks under:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_2_3_combined
```

FastQC was done (all samples passed):

```
nextflow run kviljoen/fastq_QC --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_2_3_combined/*_R{1,2}_001.fastq.gz' -profile ilifu
```

and the results can be found here:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-11-10-fastq_QC
```

Note: to comply with the srst2 file naming specifications I renamed the trimmed files from e.g. `*_R1.fq` to `*_1.fq` (removing the R) using e.g.
```
for f in *.fq; do mv -v "$f" "${f/_R/_}"; done
```
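
One caveat with `${f/_R/_}` is that it replaces the first `_R` anywhere in the file name, so a sample ID that itself contains `_R` would be renamed incorrectly. A minimal sketch of a stricter variant, which only rewrites a trailing `_R1.fq` or `_R2.fq`, is shown here; the function name `srst2_name` is hypothetical.

```shell
# Hypothetical helper: map a trimmed-read file name to the srst2 convention by
# rewriting only the suffix; any other name is returned unchanged.
srst2_name() {
  case "$1" in
    *_R1.fq) printf '%s\n' "${1%_R1.fq}_1.fq" ;;
    *_R2.fq) printf '%s\n' "${1%_R2.fq}_2.fq" ;;
    *)       printf '%s\n' "$1" ;;
  esac
}
```

It can be previewed with `for f in *_R[12].fq; do echo "$f -> $(srst2_name "$f")"; done` and applied by swapping the `echo` for `mv -v "$f" "$(srst2_name "$f")"`.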

# AMR profiling

Clinton's preference is to do AMR profiling with the ResFinder DB. I'm getting errors there, which I think relate to the header formatting, so in the interim I have run with the ARG_annot DB that we used for previous projects:

## ARGannot
```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-09-19-fastq_QC/bbduk/*_{1,2}.fq' -profile ilifu --gene_db /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/ARGannot_r3.fasta --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_ARGannot/coverage_80_run --min_gene_cov 80
```
Individual results files were compiled as:

```
srst2 --prev_output *results.txt --output ARGannot_AMRs
```

## CARD DB:

This database is the one recommended by srst2 and has already been formatted by them. The DB was downloaded with:

```
wget 'https://github.com/katholt/srst2/blob/master/data/CARD_v3.0.8_SRST2.fasta?raw=true' -O CARD_v3.0.8_SRST2.fasta
```

Pipeline execution as:

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-09-19-fastq_QC/bbduk/*_{1,2}.fq' -profile ilifu --gene_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/CARD_v3.0.8_SRST2.fasta --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_CARD/coverage_80_run --min_gene_cov 80
```

Individual results files were compiled as:

```
srst2 --prev_output *results.txt --output CARD_AMRs
```

# Virulence factors

Building the relevant VFDB for Klebsiella requires a python script that needs the biopython module (use the /cbio/users/katie/singularity_containers/srst2_v2.simg singularity container for this).

NB: in order to use the correct python version (2.7.5) for srst2 I first had to comment out the lines at the end of my .bashrc file relating to conda initialize.

Build genus-specific DB:

```
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/database_clustering/VFDBgenus.py --infile /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/VFDB_setB_nt.fas --genus Klebsiella
```

This was used to create the VF DB Klebsiella.fsa.

The same procedure (as last year ;) was executed for Escherichia, Serratia and Enterobacter.

cd-hit is needed to build the VFDB, as outlined here: https://github.com/katholt/srst2#using-the-vfdb-virulence-factor-database-with-srst2. The cd-hit docker image was pulled from https://quay.io/repository/biocontainers/cd-hit?tab=tags and converted to a singularity image on the BST server. The container can be entered with:

```
singularity exec /cbio/users/katie/singularity_containers/cd-hit.simg /bin/bash
```

Then run CD-HIT to cluster the sequences for this genus at 90% nucleotide identity:

```
cd-hit -i Klebsiella.fsa -o Klebsiella_cdhit90 -c 0.9 > Klebsiella_cdhit90.stdout
```

Repeat for the other .fsa DBs.

Next, parse the cluster output and tabulate the results using the script that is compatible with this specific virulence gene DB (use srst2_v2.simg again):

```
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/database_clustering/VFDB_cdhit_to_csv_KLedit.py --cluster_file Klebsiella_cdhit90.clstr --infile Klebsiella.fsa --outfile Klebsiella_cdhit90.csv
```

Next, convert the resulting csv table to an SRST2-compatible sequence database using:

```
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/database_clustering/csv_to_gene_db.py -t Klebsiella_cdhit90.csv -o Klebsiella_VF_clustered.fasta -s 5
```
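
The three database-building steps (CD-HIT clustering, cluster parsing and conversion to an SRST2 database) have to be repeated for each genus. A minimal sketch that prints the commands for a given genus, so they can be reviewed before being run, is given here. The function name `vfdb_cmds` and the `SCRIPTS` variable are hypothetical; the script paths and flags are copied from the Klebsiella commands above, and it is an assumption that the same parsing script applies unchanged to the other genera.

```shell
# Directory holding the srst2 database_clustering scripts (path from this wiki).
SCRIPTS=/cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/database_clustering

# Hypothetical helper: print (not run) the three commands for one genus.
vfdb_cmds() {
  genus="$1"
  echo "cd-hit -i ${genus}.fsa -o ${genus}_cdhit90 -c 0.9 > ${genus}_cdhit90.stdout"
  echo "python ${SCRIPTS}/VFDB_cdhit_to_csv_KLedit.py --cluster_file ${genus}_cdhit90.clstr --infile ${genus}.fsa --outfile ${genus}_cdhit90.csv"
  echo "python ${SCRIPTS}/csv_to_gene_db.py -t ${genus}_cdhit90.csv -o ${genus}_VF_clustered.fasta -s 5"
}

# Print the commands for the remaining genera.
for g in Escherichia Serratia Enterobacter; do
  vfdb_cmds "$g"
done
```

The commands are only printed because the cd-hit step and the python steps run in different containers (cd-hit.simg and srst2_v2.simg respectively), so each line still has to be executed in the matching environment.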

The actual VF typing can now be done using this clustered DB:

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-09-19-fastq_QC/bbduk/*_{1,2}.fq' -profile ilifu --gene_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Klebsiella_VF_clustered.fasta --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_VFs/coverage_80_run --min_gene_cov 80
```

Same for the other genera using:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Escherichia_VF_clustered.fasta
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Serratia_VF_clustered.fasta
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Enterobacter_VF_clustered.fasta
```

Again, combine the individual sample results files with e.g.

```
srst2 --prev_output *genes* --output Klebsiella_VFs
```

# MLST
MLST profiles were downloaded for E. coli and K. pneumoniae as:

```
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/scripts/getmlst.py --species 'Escherichia coli#1'
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/scripts/getmlst.py --species 'Escherichia coli#2'
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/scripts/getmlst.py --species 'Klebsiella pneumoniae'
```

Note: MLST profiles are not available for Serratia marcescens or Enterobacter cloacae.

MLST profiling execution:

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-09-19-fastq_QC/bbduk/*_{1,2}.fq' -profile ilifu --mlst_definitions /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Klebsiella_definitions --mlst_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Klebsiella_pneumoniae.fasta --mlst_delimiter _ --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_MLSTs/Klebsiella_MLSTs
```

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-09-19-fastq_QC/bbduk/*_{1,2}.fq' -profile ilifu --mlst_definitions /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/E_coli_1_definitions --mlst_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Escherichia_coli#1.fasta --mlst_delimiter _ --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_MLSTs/E_coli1_MLSTs
```

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-09-19-fastq_QC/bbduk/*_{1,2}.fq' -profile ilifu --mlst_definitions /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/E_coli_2_definitions --mlst_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Escherichia_coli#2.fasta --mlst_delimiter _ --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_MLSTs/E_coli2_MLSTs
```

# Tychus alignment module

Note that the ilifu branch of the Tychus repo should be used. Because we don't have a main.nf file we can't use the standard `nextflow pull` syntax, so I did a git clone instead:

`git clone --branch ilifu https://github.com/kviljoen/Tychus/`

A list of fasta files for the reference genomes was created (location given below).

NB: there was an error when trying to use E_coli_ATCC_25922_CP009072_1.fasta as --genome (but not with e.g. Serratia).

NB: error in the makephylogenies process:

```
.command.sh: 7: [: missing ]
mv: cannot stat 'kSNP3_results/*.tre': No such file or directory
```
If you go into the working directory you'll find a NameErrors.txt file, which states that some of the genome file names are illegal. The rule that applies here is: "A file name may contain only one dot ('.') character, that which separates the file ID from the extension. EcoSME175.fasta is legal, EcoSME17.5.fasta is not".

So all files were renamed from e.g. E_coli_ATCC_25922_CP009072.1.fasta to E_coli_ATCC_25922_CP009072_1.fasta
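
The renaming can be scripted so that every reference file satisfies the one-dot rule quoted above. This is a minimal sketch under that rule only; the function name `single_dot_name` is hypothetical, and the commands actually used for the renaming are not recorded here.

```shell
# Hypothetical helper: keep only the final dot (before the extension) and turn
# every earlier dot into an underscore; names without a dot are left unchanged.
single_dot_name() {
  case "$1" in
    *.*) printf '%s.%s\n' "$(printf '%s' "${1%.*}" | tr '.' '_')" "${1##*.}" ;;
    *)   printf '%s\n' "$1" ;;
  esac
}
```

A preview such as `for f in *.fasta; do echo "$f -> $(single_dot_name "$f")"; done` shows the proposed names; files whose name would not change should be skipped when switching the `echo` to `mv`.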

Location of the list of reference genome fasta files:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/full_fasta_list
```

Alignment run example against Serratia:

```
nextflow alignment.nf --alignment_out_dir /scratch3/users/katiel/Clinton/CRE_study_August_2022/Tychus_alignment_Serratia --genome /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Serratia_marescens_KS10_CP027798_1.fasta --read_pairs '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-10-10-fastq_QC/bbduk/*_{1,2}.fq' --user_genome_paths /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/full_fasta_list -profile alignment
```