Wiki¶
Data location:¶
The data was transferred from Athena medmicro) by e.g.:
rsync -avvP -e "ssh -i /home/katie/.ssh/id_rsa" /mnt/athena/medmicro/Clinton/CRE\ Pfizer\ Feb\ 2022/CRE\ study_4_results_11112022 katiel@transfer.ilifu.ac.za:/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/
to Ilifu:
/scratch3/users/katiel/Clinton/CRE_study_August_2022/
Reference data:¶
Klebsiella pneumoniae – strain HS11286 (GenBank accession no. CP003200.1) (n=18);
Serratia marcescens – strain KS10 (GenBank accession no. CP027798.1) (n=3);
Escherichia coli – strain ATCC 25922 (GenBank accession no. CP009072.1) (n=1); and
Enterobacter cloacae – strain ATCC 13047 (GenBank accession no. NC_014121.1) (n=1).
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_genomes
Objectives workflow:¶
QC:¶
11 sample from Run 1 failed QC and had to be rerun. Note that they accidentally reran these 11 (study1A) twice – once on 28 Feb and once on 22 September. These runs were merged by combining samples e.g.:
cat KLEB-CRE-GSH-0016_S11_L001_R2_001.fastq.gz >> merged_reads/G-16_S11_L001_R2_001.fastq.gz
file location:
/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/11_double_rerun_merged/merged_reads
Next these 11 merged-run samples were joined in one folder via symlinks with run B (passed QC):
/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_1A_B_combined
Filtering and trimming were executed as follows:
nextflow run kviljoen/fastq_QC --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_1A_B_combined/*_R{1,2}_001.fastq.gz' -profile ilifu
QC reports can be found in the 'files' tab
Runs 2 and 3 were combined with symlinks under:
/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_2_3_combined
FastQC was done (all samples passed):
nextflow run kviljoen/fastq_QC --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/raw/study_2_3_combined/*_R{1,2}_001.fastq.gz' -profile ilifu
and can be found here:
/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-11-10-fastq_QC
Run 4 was added next and QCed:
/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-11-12-fastq_QC
Run 5 was added and QCed:
/scratch3/users/katiel/Clinton/CRE_study_August_2022/2023-03-24-fastq_QC
Note: to agree with srst2 file naming specifications I renamed the trimmed files from e.g. *_R1.fq to *_1.fq (remove R) using e.g.
for f in *.fq; do mv -v "$f" "${f/_R/_}";done
AMR profiling¶
The preference from Clinton is to do AMR profiling with the ResFinder DB. I'm getting errors there that I think relate to the header formatting though so in the interim have run with the ARG_annot DB that we used for previous projects as:
ARGannot¶
Run 1-5 combined:
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/run_1to5_cleaned/*_{1,2}.fq' -profile ilifu --gene_db /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/ARGannot_r3.fasta --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_ARGannot_run_1to5/coverage_80_run --min_gene_cov 80
Individual results files compiled as:
srst2 --prev_output *results.txt --output ARGannot_AMRs
CARD DB:¶
This database is the recommended by srst2 and has been formatted by them already. The DB was downloaded with:
wget https://github.com/katholt/srst2/blob/master/data/CARD_v3.0.8_SRST2.fasta?raw=true -O CARD_v3.0.8_SRST2.fasta
Pipeline execution as (run1-5) :
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/run_1to5_cleaned/*_{1,2}.fq' -profile ilifu --gene_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/CARD_v3.0.8_SRST2.fasta --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_CARD_run1to5/coverage_80_run --min_gene_cov 80
Individual results files compiled as:
srst2 --prev_output *results.txt --output CARD_AMRs
Plasmids¶
PlasmidFinder plasmids run 1-5 (note min gene coverage of 50%, why?)
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/run_1to5_cleaned/*_{1,2}.fq' -profile ilifu --gene_db /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/data/PlasmidFinder.fasta --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_plasmidFinder_run1to5 --min_cov 50
Now combine profiles for all samples:
srst2 --prev_output *results.txt --output plasmidFinder
Virulence factors¶
Building the relevant VFDB for Klebsiella requires a python script that needs the biopython module (use the /cbio/users/katie/singularity_containers/srst2_v2.simg singularity container for this)
NB: in order to use the correct python version (2.7.5) for srst2 I first had to comment out the lines at the end of my .bashrc file relating to conda initialize
Build genus-specific DB:
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/database_clustering/VFDBgenus.py --infile /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/VFDB_setB_nt.fas --genus Klebsiella
was used to create the VF DB Klebsiella.fsa
The same procedure (as last year ;) was executed for Escherichia, Serratia and Enterobacter
cd-hit (needed to build vfdb as outlined here https://github.com/katholt/srst2#using-the-vfdb-virulence-factor-database-with-srst2) docker images was pulled from here https://quay.io/repository/biocontainers/cd-hit?tab=tags and converted to singularity image on BST server:
singularity exec /cbio/users/katie/singularity_containers/cd-hit.simg /bin/bash
then run CD-HIT to cluster the sequences for this genus, at 90% nucleotide identity:
cd-hit -i Klebsiella.fsa -o Klebsiella_cdhit90 -c 0.9 > Klebsiella_cdhit90.stdout
Repeat for other .fsa DBs
NExt parse the cluster output and tabulate the results using the specific Virulence gene DB compatible script (use srst2_v2.simg again)
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/database_clustering/VFDB_cdhit_to_csv_KLedit.py --cluster_file Klebsiella_cdhit90.clstr --infile Klebsiella.fsa --outfile Klebsiella_cdhit90.csv
Next convert the resulting csv table to a SRST2-compatible sequence database using:
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/database_clustering/csv_to_gene_db.py -t Klebsiella_cdhit90.csv -o Klebsiella_VF_clustered.fasta -s 5
The actual VF typing can now be done using this clustered DB (run1-5):
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/run_1to5_cleaned/*_{1,2}.fq' -profile ilifu --gene_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Klebsiella_VF_clustered.fasta --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_VFs_run1to5/coverage_80_run --min_gene_cov 80
Same for other genera using:
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Escherichia_VF_clustered.fasta /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Serratia_VF_clustered.fasta /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Enterobacter_VF_clustered.fasta
Again combine individual sample results files with e.g.
srst2 --prev_output *genes* --output Klebsiella_VFs
MLST¶
MLST profiles were downloaded for E. coli and K. pneumoniae as:
python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/scripts/getmlst.py --species 'Escherichia coli#1' python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/scripts/getmlst.py --species 'Escherichia coli#2' python /cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/VFDB/srst2/scripts/getmlst.py --species 'Klebsiella pneumoniae'
Note: MLST profiles not available for Serratia marecescens or Enterobacter cloacae
MLST profiling execution:
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/run_1to5_cleaned/*_{1,2}.fq' -profile ilifu --mlst_definitions /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Klebsiella_definitions --mlst_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Klebsiella_pneumoniae.fasta --mlst_delimiter _ --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_MLSTs_run_1to5/Klebsiella_MLSTs
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/run_1to5_cleaned/*_{1,2}.fq' -profile ilifu --mlst_definitions /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/E_coli_1_definitions --mlst_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Escherichia_coli#1.fasta --mlst_delimiter _ --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_MLSTs_run_1to5/E_coli1_MLSTs
nextflow run kviljoen/uct-srst2 --reads '/scratch3/users/katiel/Clinton/CRE_study_August_2022/run_1to5_cleaned/*_{1,2}.fq' -profile ilifu --mlst_definitions /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/E_coli_2_definitions --mlst_db /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Escherichia_coli#2.fasta --mlst_delimiter _ --outdir /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_MLSTs_run_1to5/E_coli2_MLSTs
Again combine individual sample results files with e.g.
srst2 --prev_output *results* --output Klebsiella_MLSTs
Combining runs¶
For prelim analysis run1 was combined with the output from runs 2-4 with e.g. (from directory /scratch3/users/katiel/Clinton/CRE_study_August_2022/srst2_combined_output_run1-4/CARD)
ln -s ../../srst2_CARD_run2to4/coverage_80_run/srst2/*genes* ./ ln -s ../../srst2_CARD_v3/coverage_80_run/srst2/*genes* ./
*Note that I subsequently reran everything after receiving run 5 data (so runs 1-5 all together)
Tychus alignment module¶
Note the ilifu branch of the Tychus repo should be used. Because we don't have a main.nf file we can't use the standard nextflow pull
syntax so did a git clone instead
git clone --branch ilifu https://github.com/kviljoen/Tychus/
A list of fasta files for reference genomes was created here
NB: Error when trying to use E_coli_ATCC_25922_CP009072_1.fasta as --genome (but not e.g. Serratia)
NB: error in makephylogenies process:
.command.sh: 7: [: missing ] mv: cannot stat 'kSNP3_results/*.tre': No such file or directory
If you go into the working directory you'll find a NameErrors.txt file which states that some of the genome file names are illegal. The one that applies here is: A file name may contian only one dot ('.') character, that which separates the file ID from the extension.
EcoSME175.fasta is legal, EcoSME17.5.fasta is not
So all files were renamed from e.g. E_coli_ATCC_25922_CP009072.1.fasta to E_coli_ATCC_25922_CP009072_1.fasta
/scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/full_fasta_list
Alignment run example against Serratia:
nextflow alignment.nf --alignment_out_dir /scratch3/users/katiel/Clinton/CRE_study_August_2022/Tychus_alignment_Serratia --genome /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Serratia_marescens_KS10_CP027798_1.fasta --read_pairs '/scratch3/users/katiel/Clinton/CRE_study_August_2022/2022-10-10-fastq_QC/bbduk/*_{1,2}.fq' --user_genome_paths /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/full_fasta_list -profile alignment
Run 1–5 tychus alignment run:
nextflow alignment.nf --alignment_out_dir /scratch3/users/katiel/Clinton/CRE_study_August_2022/Tychus_alignment_run1_5_Klebsiella --genome /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Klebsiella_pneumoniae_HS11286_CP003200_1.fasta --read_pairs '/scratch3/users/katiel/Clinton/CRE_study_August_2022/run_1to5_cleaned/*_{1,2}.fq' --user_genome_paths /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/full_fasta_list -profile alignment
Run 1-5 for Klebsiella:
nextflow alignment.nf --alignment_out_dir /scratch3/users/katiel/Clinton/CRE_study_August_2022/Tychus_alignment_Serratia_run1to5 --genome /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/Serratia_marescens_KS10_CP027798_1.fasta --read_pairs '/scratch3/users/katiel/Clinton/CRE_study_August_2022/run_1to5_cleaned/*_{1,2}.fq' --user_genome_paths /scratch3/users/katiel/Clinton/CRE_study_August_2022/ref_files/full_fasta_list -profile alignment
Updated by Katie Lennard about 2 years ago · 26 revisions