Revision 16 - History - Wiki - Pathogen outbreak study - Pseudomonas single isolate WGS - Redmine

Wiki » History » Revision 16

Revision 15 (Katie Lennard, 06/26/2019 11:15 AM) → Revision 16/17 (Katie Lennard, 10/22/2019 09:50 AM)

# Wiki

## Study background
This study was prompted by an unusual outbreak of wild type Pseudomonas that coincided with the Cape Town drought. Preliminary molecular analysis suggests clonality, the interest is therefore to try an establish how this outbreak came about and whether the drought is in some way responsible. Pseudomonas are waterborne opportunistic pathogens that can form biofilms in plumbing pipes. One hypothesis is therefore that the drought, with decreased water pressure allowed increased biofilm formation and subsequently increased concentrations in drinking water. The data will include WGS of blood culture isolates and water samples from before, during, and after the outbreak (96 samples).

## Pipeline tool options considered
* Tychus: A Nextflow-based pipeline for pathogen WGS assembly and annotation (Repo: https://github.com/Abdo-Lab/Tychus Paper: https://www.biorxiv.org/content/biorxiv/early/2018/03/16/283101.full.pdf)
* Advice from Arash: Use Velvet for assembly and Prokka for annotation. If the species genome is very diverse across different strains build a pan-genome and consider it as a reference genome. Use Roary (https://academic.oup.com/bioinformatics/article/31/22/3691/240757) or Pyseer or BPGA for pan-genome construction and then perform a gene presence/absence statistical analysis across different populations by Scoary tool. Roary is installed on CHPC and its output files are compatible with Scoary and R.
* Options from Nicky:
https://www.pathogensurveillance.net/software:
A) Microreact (Interactive visualisation of trees, geographic data, and temporal data) - not immediately useful for Pseudomonas (only 1 entry in their database, vs. e.g ~4000 for Staph)
B) Pathogenwatch (Processing and Visualisation of Microbial Genome Sequences in Pylogenetic and Geographical Contexts). Here you can upload your assemblies to do MLST and AMR profiling - can test this once we have assemblies, although seemingly also limited to a handful of pathogens currently (https://pathogen.watch/)

## Implementation of Tychus (nextflow pipeline)
* Singularity images (one for the Tychus alignment module and one for the Tychus assembly module) built on BST server based on Dockerfiles in Tychus. First tested on hex so added the relevant bind points to the Dockerfiles and then did docker build -t Tychus_alignment . from folder on BST with dockerfile (git clone repo first). Singularity image then built from docker image using docker2singularity

## Tychus pipeline parameters to consider

Trimmomatic configurable variables include trim length, quality (phred) scores, sliding window and specified adapters file (https://github.com/kviljoen/Tychus/blob/ilifu/nextflow.config). Parameters not specifed in nextflow.config would have to be changed in the main scripts (alignment.nf and assembly.nf)

## Round 2 analyses (post-Tychus)

Additional analyses were requested by Clinton after the first round of results (from the Tychus pipeline). Below requests (Clinton) + suggested analyses (me, boldface) + feedback on suggested analyses (Clinton)
Proposed analyses:

* We would like to extract the in silico MLST profiles from these genomes. - **Use srst2 (https://github.com/katholt/srst2#mlst-results)** This looks great. Could you include the resistance gene ID option in srst2 as well, with ArgANNOT3 database as recommended? Always good to query multiple database for resistance gene ID.

* Reconstruct the phylogenetic tree to include certain outgroups (Burkholderia cepacia, Pseudomonas fluorescens, Pseudomonas putida). This will allow us to root the tree and get a better context for evolution. - **OK, it looks like there is a way to do this with kSNP3 but I'll have to write a separate script for this to the main pipeline.** Is it not perhaps possible to simply include these genomes in the pipeline as a sample, just a thought? Ideally 3 separate trees, one for each f the outgroups, since we won’t know what it will look like with these included?

* Are you able to assist constructing a phylogenetic heatmap (see image below) or even 2-dimensional? This would include the phylogenetic data on one side, and some additional data, such as presence of certain genes, etc. on the other? - **I'm good at doing annotated heatmaps in R for other data types, I just need to think about how one would include the phylogenetic data on one side - so the data matrix would be presence/absence of certain genes? - i.e. the colour values** I’ve heard of an online resource (https://microreact.org/showcase) which is quite simple to use. I should be able to do this once we have the tree we need.

* For the plasmid resistome results, we have found hit which is present in all the outbreak isolates and only a few of the non-outbreak isolates. The gene fractions for these results only go up to approximately 60%. Does this mean that only 60% of the reference plasmid is covered? If so, is the rest of the plasmid unique, or perhaps absent? We would like to compare this plasmid from all the isolates to see how similar they are to the reference (CP002153.1) as well as to each other. Can you assist with plasmid assembly and constructing a plasmid map (see below)? - **Hard to say exactly what is happening here, I'm assuming this won't be part of the assembly results either since it's a plasmid. I have to think about this one. I haven't done plasmid maps before so if someone in your lab has that might be a shorter turnaround.** This is becoming a common analysis for bacteria since resistance and virulence are often carried on them. Can I suggest using plasmidSPADES, which uses an algorithm to assemble potential plasmids from WGS data. The problem is, how to compare them once we have the contigs. We could start with a tree and then try to draw the comparative map after?

* For the virulence factors we have identified 3 factors (NP_253217, NP_251844, NP_251850) present in all the outbreak isolates and only a few of the non-outbreak isolates. Could you extract these sequences from the relevant contigs and blast, and do a multiple alignment for comparison of each one? These factors confer different levels of virulence depending on the mutations present. - **Sure, I can use the Spn scripts for this** . Great stuff!!!

## srst2 MLST, AMR, VF, plasmid analysis
* srst2 was implemented as a nextflow pipeline () on Ilifu for MLST analysis. Clinton requested that we also do AMR analysis with the ARGannot database. Once I did AMR analysis and compared the results with those obtained with the Tychus pipeline I noticed some differences in the results between the two pipelines - the .bam file alignments and coverage results differed. It turns out this is due to differences in bowtie2 parameters between the two pipelines. Whereas Tychus just uses default parameters, srst2 has been optimized for local alignments with short reads using the --very-sensitive-local and -a flags. The srst2 output is alos a lot more informative and provides a measure of divergence (from the reference gene), number of SNPs, annotation, gene length etc. I therefore decided it would be a good idea to redo the AMR, VF and plasmid analyses using srst2.

For VF analysis, a VF database was created for Pseudomonas as described here https://github.com/katholt/srst2#using-the-vfdb-virulence-factor-database-with-srst2 , I also had to rebuild the singularity image for srst2 to include biopython (needed to run script VFDBgenus.py) for which the dockerfile is in the uct-srst2 repo and build a singularity image for cd-hit from the docker image https://quay.io/repository/biocontainers/cd-hit?tab=tags

For AMR analysis the default ARGannot_r3.fasta db from the srst2 repo had to be edited in order to view the resulting .bam files in IGV (there was a parsing error caused by single quotes and parentheses in the fasta headers which I removed as described here /ceph/cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/README)

## Data location

### Testing data raw reads
* E. coli test reads: Med Micro server(smb://athena.medmicro.uct.ac.za )in the File Station /MedMicro/Clinton/E. coli/original_data

### Raw data on Ilifu

* The raw data for Pseudomonas and E. coli have been copied over to Ilifu by Suresh with

~~~ text
rsync -rv --progress -e 'ssh -vvv' suresh@137.158.204.181:/home/suresh/katie/E.coli /ceph/cbio/tmp/katie/
~~~

* Moved to
~~~ text
/ceph/cbio/users/katie/Nicol/E.coli
~~~

~~~ text
/ceph/cbio/users/katie/Nicol/Ps_aerug
~~~

### Pseudomonas reference databases were sourced and downloaded to Ilifu (see DBs and README)

~~~ text
/ceph/cbio/users/katie/Nicol/Tychus_DBs
~~~

* Virulence DB: downloaded from VFDB (http://www.mgc.ac.cn/VFs/download.htm) on 12/3/2019 and converted to single line fasta (SL_VFDB_setA_nt.fas)
* AMR DB: Downloaded from resfinder Downloaded as: git clone https://git@bitbucket.org/genomicepidemiology/resfinder_db.git (Individual .fsa files were merged into a single fasta file with cat *.fsa >> KL_all_resfinder.fa and converted to single line fasta SL_KL_all_resfinder.fa)
* Adapters: The file 'adapters.fa' is from the bbmap installation (/opt/conda/opt/bbmap-37.10/resources/adapters.fa) as used in the YAMP pipeine and represnets a more comprehensive list of possible adapters than the TruSeq3-PE.fa default Tychus file
* Plasmid DB: PLSDB (https://ccb-microbe.cs.uni-saarland.de/plsdb/plasmids/download/) contained a BLAST formatted DB (.nin etc files) but no fasta so I had to convert this blast db back to a fasta file so I can index it with bowtie2 in the alignment.nf script
* Pseudomonas reference genome: ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Pseudomonas_aeruginosa/reference/GCA_000006765.1_ASM676v1/GCA_000006765.1_ASM676v1_genomic.fna.gz

This was done with:
/opt/exp_soft/ncbi-blast-2.2.28+/bin/blastdbcmd -entry all -db /home/kviljoen/Tychus_DBs/KL_plsdb2019/2019_03_05.fna -out /home/kviljoen/Tychus_DBs/KL_plsdb2019/2019_03_05_KL.fasta

### Troubleshooting - common errors during pipeline setup/customization
* Segmentation fault error when running csa (coverage sampler): convert input DBs to singleline fastas (they're probably multiline if you're seeing this error). Use awk '{if(NR==1) {print $0} else {if($0 ~ /^>/) {print "\n"$0} else {printf $0}}}' interleaved.fasta > singleline.fasta for conversion
* Missing output file(s) `Trees/*.tre` expected by process `BuildPhylogenies (ConfigurationFiles) ` The reference DB name is probably not being extracted correctly from the base (directory). Check if there are '.' in your reference filename and change to '_' e.g. you can't have GCA_000006765.1_ASM676v1_genomic.fna --> convert to GCA_000006765_1_ASM676v1_genomic.fna

### Processed data

**Raw reads FastQC/multiQC results on Ilifu:**

~~~ text
/ceph/cbio/users/katie/Nicol/E_coli_raw_fastqc
~~~
~~~ text
/ceph/cbio/users/katie/Nicol/Ps_aerug_raw_fastqc
~~~

**Raw reads FastQC/multiQC results on medmicro:**

~~~ text
http://athena.medmicro.uct.ac.za:5000/MedMicro/Clinton/E. coli/Katie_results/E_coli_raw_fastqc
~~~
~~~ text
http://athena.medmicro.uct.ac.za:5000/MedMicro/Clinton/Ps_aerug/Katie_results/Ps_aerug_raw_fastqc
~~~

**Trimmomatic-trimmed/filtered reads FastQC/multiQC results on Ilifu:**

~~~ text
/ceph/cbio/users/katie/Nicol/E_coli_trimmomatic_fastqc
~~~
~~~ text
/ceph/cbio/users/katie/Nicol/Ps_aerug_trimmomatic_fastqc
~~~

**Trimmomatic-trimmed/filtered reads FastQC/multiQC results on medmicro:**

~~~ text
http://athena.medmicro.uct.ac.za:5000/MedMicro/Clinton/E. coli/Katie_results/E_coli_trimmomatic_fastqc
~~~
~~~ text
http://athena.medmicro.uct.ac.za:5000/MedMicro/Clinton/Ps_aerug/Katie_results/Ps_aerug_trimmomatic_fastqc
~~~

**Tychus alignment module results on Ilifu:**

~~~ text
/ceph/cbio/users/katie/Nicol/E_coli_alignment
~~~
~~~ text
/ceph/cbio/users/katie/Nicol/Ps_aeruginosa_alignment
~~~

Additional phylogenetic trees, requested by Clinton (generate separate phylogenetic trees for each of the reference genomes below). Each tree was run twice to check for reproducibility
~~~ text
/ceph/cbio/users/katie/Nicol/SNPsAndPhylogenies_Ps_ref_genome (/ceph/cbio/users/katie/Nicol/SNPsAndPhylogenies_Ps_ref_genome/Trees/tree.core.tre md5 1a7041263caba40e835eaf026d685c8e used for downstream analyes in R)

/ceph/cbio/users/katie/Nicol/SNPsAndPhylogenies_ASM141149v1
/ceph/cbio/users/katie/Nicol/SNPsAndPhylogenies_ASM23706v1
/ceph/cbio/users/katie/Nicol/SNPsAndPhylogenies_ASM756v2
~~~

**Tychus assembly module results on Ilifu:**

~~~ text
/ceph/cbio/users/katie/Nicol/E_coli_assembly
~~~
~~~ text
/ceph/cbio/users/katie/Nicol/Ps_assembly
~~~

**MLST and ARGanot srst2 results on Ilifu:**
~~~ text
/ceph/cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/srst2
~~~

### Final results for publishing

### Still to do:

1. Streamline email report
2. Change pipeline setup so that trimmomatic (process name RunQC) is not run by default for both the alignment and assembly modules which just wastes time and space

Project

General

Profile

Pathogen outbreak study - Pseudomonas single isolate WGS

Wiki » History » Revision 16