Wiki » History » Version 21
Katie Lennard, 11/11/2022 07:53 AM
# Wiki

# Data location:

The data was transferred from Athena (medmicro):

```
/MedMicro/Clinton/CRE Pfizer Feb 2022/CRE study_1A_results_17022022
/MedMicro/Clinton/CRE Pfizer Feb 2022/CRE study_1B_results_21022022
```

to Ilifu:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/
```

# Reference data:

Klebsiella pneumoniae – strain HS11286 (GenBank accession no. CP003200.1) (n=18);
Serratia marcescens – strain KS10 (GenBank accession no. CP027798.1) (n=3);
Escherichia coli – strain ATCC 25922 (GenBank accession no. CP009072.1) (n=1); and
Enterobacter cloacae – strain ATCC 13047 (GenBank accession no. NC_014121.1) (n=1).

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_genomes
```

# Objectives workflow:

![workflow.png]()

# QC:

11 samples failed QC and had to be rerun. Note that these 11 (study 1A) were accidentally rerun twice: once on 28 Feb and once on 22 September. The two reruns were merged by concatenating the reads per sample, e.g.:

```
cat KLEB-CRE-GSH-0016_S11_L001_R2_001.fastq.gz >> merged_reads/G-16_S11_L001_R2_001.fastq.gz
```

File location:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/11_double_rerun_merged/merged_reads
```

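The merge above could be repeated for every file with a small loop. This is only a sketch: the function name `merge_reruns` and its three directory arguments are placeholders, and it assumes each file has the same name in both rerun directories (the example above renames the sample, so adjust as needed).

```bash
#!/usr/bin/env bash
# Sketch only: merge_reruns and its arguments are hypothetical, and files are
# assumed to share the same name in both rerun directories.
merge_reruns() {
  local run1=$1 run2=$2 out=$3
  local f name
  mkdir -p "$out"
  for f in "$run1"/*.fastq.gz; do
    [ -e "$f" ] || continue              # the glob matched nothing
    name=$(basename "$f")
    if [ ! -e "$run2/$name" ]; then
      echo "no matching file in $run2 for $name" >&2
      continue
    fi
    # Concatenated gzip members form a valid gzip stream, so the reads do not
    # need to be decompressed first.
    cat "$f" "$run2/$name" > "$out/$name"
  done
}
```

Writing with `>` into a fresh output file (rather than appending with `>>`) means the loop can be rerun without duplicating reads.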
Next these 11 merged-run samples were joined in one folder via symlinks with run B (passed QC):

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_1A_B_combined
```

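Combining runs via symlinks can be sketched as follows. The function name `link_reads` and the directory arguments are illustrative, not the commands that were actually run here.

```bash
#!/usr/bin/env bash
# Sketch only: link_reads and its arguments are hypothetical.
link_reads() {
  local src=$1 dest=$2
  local f abs_dir
  mkdir -p "$dest"
  for f in "$src"/*.fastq.gz; do
    [ -e "$f" ] || continue              # the glob matched nothing
    # Use an absolute target so the link resolves from any working directory.
    abs_dir=$(cd "$(dirname "$f")" && pwd)
    ln -s "$abs_dir/$(basename "$f")" "$dest/"
  done
}
```

Calling `link_reads` once per source directory with the same destination gives one combined folder without copying the reads.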
Filtering and trimming were executed as follows:

```
nextflow run kviljoen/fastq_QC --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_1A_B_combined/*_R{1,2}_001.fastq.gz' -profile ilifu
```

QC reports can be found in the 'files' tab.

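Since the `--reads` pattern expects an R1/R2 pair per sample, it can help to confirm that every R1 file in a combined folder has its R2 mate before launching the pipeline. A minimal sketch, assuming the `_R1_001.fastq.gz` / `_R2_001.fastq.gz` naming used above; the function name `unpaired_reads` is made up for illustration.

```bash
#!/usr/bin/env bash
# Sketch only: prints the name of every R1 file that has no R2 mate.
# Broken symlinks count as missing because -e follows the link.
unpaired_reads() {
  local dir=$1
  local r1 r2
  for r1 in "$dir"/*_R1_001.fastq.gz; do
    [ -e "$r1" ] || continue
    r2="${r1%_R1_001.fastq.gz}_R2_001.fastq.gz"
    [ -e "$r2" ] || basename "$r1"
  done
}
```

An empty output means all R1 files are paired.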
Runs 2 and 3 were combined with symlinks under:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_2_3_combined
```

FastQC was run (all samples passed):

```
nextflow run kviljoen/fastq_QC --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_2_3_combined/*_R{1,2}_001.fastq.gz' -profile ilifu
```

The results can be found here:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-11-10-fastq_QC
```

Note: to agree with the srst2 file naming specifications, I renamed the trimmed files from e.g. `*_R1.fq` to `*_1.fq` (removing the R) using e.g.:

```
for f in *.fq; do mv -v "$f" "${f/_R/_}";done
```

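Note that `${f/_R/_}` replaces only the first `_R` in the name, so a sample ID that itself contains `_R` would be renamed in the wrong place. A more defensive variant, sketched here under the assumption that trimmed files always end in `_R1.fq` or `_R2.fq`, rewrites only the suffix. The function name `srst2_name` is hypothetical.

```bash
#!/usr/bin/env bash
# Sketch only: maps a trimmed-read file name to the srst2-style name by
# rewriting the _R1.fq / _R2.fq suffix and leaving the rest untouched.
srst2_name() {
  local f=$1
  case "$f" in
    *_R1.fq) printf '%s\n' "${f%_R1.fq}_1.fq" ;;
    *_R2.fq) printf '%s\n' "${f%_R2.fq}_2.fq" ;;
    *)       printf '%s\n' "$f" ;;       # anything else is left as-is
  esac
}
```

It could be used in the same kind of loop, e.g. `for f in *.fq; do mv -v "$f" "$(srst2_name "$f")"; done`.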
# AMR profiling

Clinton's preference is to do AMR profiling with the ResFinder DB. I'm getting errors there that I think relate to the header formatting, so in the interim I have run with the ARG_annot DB that we used for previous projects:

## ARGannot

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-09-19-fastq_QC/bbduk/*_{1,2}.fq' -profile ilifu --gene_db /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/ARGannot_r3.fasta --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_ARGannot/coverage_80_run --min_gene_cov 80
```

Individual results files were compiled as:

```
srst2 --prev_output *results.txt --output ARGannot_AMRs
```

## CARD DB:

This database is the one recommended by srst2 and has already been formatted by them. The DB was downloaded with:

```
wget 'https://github.com/katholt/srst2/blob/master/data/CARD_v3.0.8_SRST2.fasta?raw=true' -O CARD_v3.0.8_SRST2.fasta
```

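Because the download URL points at a GitHub page with `?raw=true` appended, it may be worth confirming that the saved file is really FASTA and not an HTML page. A minimal sketch of such a check; the function name `looks_like_fasta` is made up for illustration, and it only inspects the first character of the file.

```bash
#!/usr/bin/env bash
# Sketch only: succeeds if the file is non-empty and starts with '>',
# as a FASTA file should.
looks_like_fasta() {
  local file=$1
  local first
  [ -s "$file" ] || return 1
  first=$(head -c 1 "$file")
  [ "$first" = ">" ]
}
```

For example, `looks_like_fasta CARD_v3.0.8_SRST2.fasta && grep -c '^>' CARD_v3.0.8_SRST2.fasta` would also print the number of sequences.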
Pipeline execution as:

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-09-19-fastq_QC/bbduk/*_{1,2}.fq' -profile ilifu --gene_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/CARD_v3.0.8_SRST2.fasta --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_CARD/coverage_80_run --min_gene_cov 80
```

Individual results files compiled as:

```
srst2 --prev_output *results.txt --output CARD_AMRs
```

# Virulence factors

Building the relevant VFDB for Klebsiella requires a Python script that needs the biopython module (use the /cbio/users/katie/singularity_containers/srst2_v2.simg singularity container for this).
NB: in order to use the correct Python version (2.7.5) for srst2, I first had to comment out the lines at the end of my .bashrc file relating to conda initialize.

Build the genus-specific DB:

```
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/database_clustering/VFDBgenus.py --infile /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/VFDB_setB_nt.fas --genus Klebsiella
```

This creates the VF DB Klebsiella.fsa.

The same procedure (as last year ;) was executed for Escherichia, Serratia and Enterobacter.

The cd-hit Docker image (cd-hit is needed to build the VFDB as outlined here https://github.com/katholt/srst2#using-the-vfdb-virulence-factor-database-with-srst2) was pulled from https://quay.io/repository/biocontainers/cd-hit?tab=tags and converted to a singularity image on the BST server. Start a shell in the container with:

```
singularity exec /cbio/users/katie/singularity_containers/cd-hit.simg /bin/bash
```

Then run CD-HIT to cluster the sequences for this genus at 90% nucleotide identity:

```
cd-hit -i Klebsiella.fsa -o Klebsiella_cdhit90 -c 0.9 > Klebsiella_cdhit90.stdout
```

Repeat for the other .fsa DBs.

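Repeating the clustering for the remaining genera can be sketched as a loop that prints one command per genus, mirroring the Klebsiella command above. The function name `cdhit_commands` is hypothetical, and it assumes the per-genus files are named `<Genus>.fsa` in the current directory.

```bash
#!/usr/bin/env bash
# Sketch only: prints (does not run) one cd-hit command per genus, using the
# same options as the Klebsiella example.
cdhit_commands() {
  local genus
  for genus in "$@"; do
    printf 'cd-hit -i %s.fsa -o %s_cdhit90 -c 0.9 > %s_cdhit90.stdout\n' \
      "$genus" "$genus" "$genus"
  done
}

cdhit_commands Escherichia Serratia Enterobacter
```

Printing first makes it easy to review the commands; piping the output to `bash` inside the cd-hit container would execute them.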
Next, parse the cluster output and tabulate the results using the VFDB-specific script (use srst2_v2.simg again):

```
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/database_clustering/VFDB_cdhit_to_csv_KLedit.py --cluster_file Klebsiella_cdhit90.clstr --infile Klebsiella.fsa --outfile Klebsiella_cdhit90.csv
```

Next, convert the resulting csv table to an SRST2-compatible sequence database using:

```
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/database_clustering/csv_to_gene_db.py -t Klebsiella_cdhit90.csv -o Klebsiella_VF_clustered.fasta -s 5
```

The actual VF typing can now be done using this clustered DB:

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-09-19-fastq_QC/bbduk/*_{1,2}.fq' -profile ilifu --gene_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Klebsiella_VF_clustered.fasta --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_VFs/coverage_80_run --min_gene_cov 80
```

The same was done for the other genera using:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Escherichia_VF_clustered.fasta
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Serratia_VF_clustered.fasta
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Enterobacter_VF_clustered.fasta
```

Again, combine the individual sample results files with e.g.:

```
srst2 --prev_output *genes* --output Klebsiella_VFs
```

# MLST

MLST profiles were downloaded for E. coli and K. pneumoniae as:

```
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/scripts/getmlst.py --species 'Escherichia coli#1'
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/scripts/getmlst.py --species 'Escherichia coli#2'
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/scripts/getmlst.py --species 'Klebsiella pneumoniae'
```

Note: MLST profiles are not available for Serratia marcescens or Enterobacter cloacae.

MLST profiling execution:

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-09-19-fastq_QC/bbduk/*_{1,2}.fq' -profile ilifu --mlst_definitions /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Klebsiella_definitions --mlst_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Klebsiella_pneumoniae.fasta --mlst_delimiter _ --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_MLSTs/Klebsiella_MLSTs
```

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-09-19-fastq_QC/bbduk/*_{1,2}.fq' -profile ilifu --mlst_definitions /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/E_coli_1_definitions --mlst_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Escherichia_coli#1.fasta --mlst_delimiter _ --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_MLSTs/E_coli1_MLSTs
```

```
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-09-19-fastq_QC/bbduk/*_{1,2}.fq' -profile ilifu --mlst_definitions /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/E_coli_2_definitions --mlst_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Escherichia_coli#2.fasta --mlst_delimiter _ --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_MLSTs/E_coli2_MLSTs
```

# Tychus alignment module

Note that the ilifu branch of the Tychus repo should be used. Because we don't have a main.nf file we can't use the standard `nextflow pull` syntax, so I did a git clone instead:

`git clone --branch ilifu https://github.com/kviljoen/Tychus/`

A list of FASTA files for the reference genomes was created here:

```
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/full_fasta_list
```

NB: Error when trying to use E_coli_ATCC_25922_CP009072_1.fasta as --genome (but not e.g. Serratia).

NB: error in the makephylogenies process:

```
.command.sh: 7: [: missing ]
mv: cannot stat 'kSNP3_results/*.tre': No such file or directory
```

If you go into the working directory you'll find a NameErrors.txt file, which states that some of the genome file names are illegal. The rule that applies here is: A file name may contain only one dot ('.') character, that which separates the file ID from the extension.
EcoSME175.fasta is legal, EcoSME17.5.fasta is not.

So all files were renamed from e.g. E_coli_ATCC_25922_CP009072.1.fasta to E_coli_ATCC_25922_CP009072_1.fasta.

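The one-dot naming rule can be applied mechanically. The sketch below keeps the final dot (the extension separator) and turns any earlier dots into underscores, which reproduces the rename shown above; the function name `ksnp_safe_name` is hypothetical, and the actual renaming here may have been done differently.

```bash
#!/usr/bin/env bash
# Sketch only: converts every dot except the last one into an underscore,
# so that the name contains a single dot before the extension.
ksnp_safe_name() {
  local name=$1
  local stem ext
  case "$name" in
    *.*)
      ext="${name##*.}"                  # text after the last dot
      stem="${name%.*}"                  # text before the last dot
      stem=$(printf '%s' "$stem" | tr '.' '_')
      printf '%s.%s\n' "$stem" "$ext"
      ;;
    *)
      printf '%s\n' "$name"              # no dot at all: leave unchanged
      ;;
  esac
}
```

In a loop this could look like `for f in *.fasta; do mv -v "$f" "$(ksnp_safe_name "$f")"; done`; paths listed in full_fasta_list would need to be updated to match.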
Alignment run example against Serratia:

```
nextflow alignment.nf --alignment_out_dir /scratch3/users/katiel/Clinton/CRE_study_August_2022/Tychus_alignment_Serratia --genome /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Serratia_marescens_KS10_CP027798_1.fasta --read_pairs '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-10-10-fastq_QC/bbduk/*_{1,2}.fq' --user_genome_paths /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/full_fasta_list -profile alignment
```