Project

General

Profile

Wiki » History » Version 16

Katie Lennard, 10/22/2019 09:50 AM

1 1 Katie Lennard
# Wiki
2
3
## Study background
4
This study was prompted by an unusual outbreak of wild type Pseudomonas that coincided with the Cape Town drought. Preliminary molecular analysis suggests clonality, the interest is therefore to try an establish how this outbreak came about and whether the drought is in some way responsible. Pseudomonas are waterborne opportunistic pathogens that can form biofilms in plumbing pipes. One hypothesis is therefore that the drought, with decreased water pressure allowed increased biofilm formation and subsequently increased concentrations in drinking water. The data will include WGS of blood culture isolates and water samples from before, during, and after the outbreak (96 samples).
5
6
## Pipeline tool options considered
7 2 Katie Lennard
* Tychus: A Nextflow-based pipeline for pathogen WGS assembly and annotation (Repo: https://github.com/Abdo-Lab/Tychus Paper: https://www.biorxiv.org/content/biorxiv/early/2018/03/16/283101.full.pdf)
8
* Advice from Arash: Use Velvet for assembly and Prokka for annotation. If the species genome is very diverse across different strains build a pan-genome and consider it as a reference genome. Use Roary (https://academic.oup.com/bioinformatics/article/31/22/3691/240757) or Pyseer or BPGA for pan-genome construction and then perform a gene presence/absence statistical analysis across different populations by Scoary tool. Roary is installed on CHPC and its output files are compatible with Scoary and R.
9 3 Katie Lennard
*  Options from Nicky: 
10
https://www.pathogensurveillance.net/software:
11
A) Microreact (Interactive visualisation of trees, geographic data, and temporal data) - not immediately useful for Pseudomonas (only 1 entry in their database, vs. e.g ~4000 for Staph)
12 4 Katie Lennard
B) Pathogenwatch (Processing and Visualisation of Microbial Genome Sequences in Pylogenetic and Geographical Contexts). Here you can upload your assemblies to do MLST and AMR profiling - can test this once we have assemblies, although seemingly also limited to a handful of pathogens currently (https://pathogen.watch/)
13 1 Katie Lennard
14 5 Katie Lennard
## Implementation of Tychus (nextflow pipeline)
15
* Singularity images (one for the Tychus alignment module and one for the Tychus assembly module) built on BST server based on Dockerfiles in Tychus. First tested on hex so added the relevant bind points to the Dockerfiles and then did docker build -t Tychus_alignment . from folder on BST with dockerfile (git clone repo first). Singularity image then built from docker image using docker2singularity
16
17 10 Katie Lennard
## Tychus pipeline parameters to consider
18
19
Trimmomatic configurable variables include trim length, quality (phred) scores, sliding window and specified adapters file (https://github.com/kviljoen/Tychus/blob/ilifu/nextflow.config). Parameters not specifed in nextflow.config would have to be changed in the main scripts (alignment.nf and assembly.nf)
20
21 13 Katie Lennard
## Round 2 analyses (post-Tychus)
22
23
Additional analyses were requested by Clinton after the first round of results (from the Tychus pipeline). Below requests (Clinton) + suggested analyses (me, boldface) + feedback on suggested analyses (Clinton)
24
Proposed analyses:
25
26
* We would like to extract the in silico MLST profiles from these genomes. - **Use srst2 (https://github.com/katholt/srst2#mlst-results)** This looks great. Could you include the resistance gene ID option in srst2 as well, with ArgANNOT3 database as recommended? Always good to query multiple database for resistance gene ID.
27
 
28
* Reconstruct the phylogenetic tree to include certain outgroups (Burkholderia cepacia, Pseudomonas fluorescens, Pseudomonas putida). This will allow us to root the tree and get a better context for evolution. - **OK, it looks like there is a way to do this with kSNP3 but I'll have to write a separate script for this to the main pipeline.** Is it not perhaps possible to simply include these genomes in the pipeline as a sample, just a thought? Ideally 3 separate trees, one for each f the outgroups, since we won’t know what it will look like with these included?
29
 
30
* Are you able to assist constructing a phylogenetic heatmap (see image below) or even 2-dimensional? This would include the phylogenetic data on one side, and some additional data, such as presence of certain genes, etc. on the other? - **I'm good at doing annotated heatmaps in R for other data types, I just need to think about how one would include the phylogenetic data on one side - so the data matrix would be presence/absence of certain genes? - i.e. the colour values** I’ve heard of an online resource (https://microreact.org/showcase) which is quite simple to use. I should be able to do this once we have the tree we need.
31
 
32
* For the plasmid resistome results, we have found hit which is present in all the outbreak isolates and only a few of the non-outbreak isolates. The gene fractions for these results only go up to approximately 60%. Does this mean that only 60% of the reference plasmid is covered? If so, is the rest of the plasmid unique, or perhaps absent? We would like to compare this plasmid from all the isolates to see how similar they are to the reference (CP002153.1) as well as to each other. Can you assist with plasmid assembly and constructing a plasmid map (see below)? - **Hard to say exactly what is happening here, I'm assuming this won't be part of the assembly results either since it's a plasmid. I have to think about this one. I haven't done plasmid maps before so if someone in your lab has that might be a shorter turnaround.**  This is becoming a common analysis for bacteria since resistance and virulence are often carried on them. Can I suggest using plasmidSPADES, which uses an algorithm to assemble potential plasmids from WGS data. The problem is, how to compare them once we have the contigs. We could start with a tree and then try to draw the comparative map after?
33
 
34
* For the virulence factors we have identified 3 factors (NP_253217, NP_251844, NP_251850) present in all the outbreak isolates and only a few of the non-outbreak isolates. Could you extract these sequences from the relevant contigs and blast, and do a multiple alignment for comparison of each one? These factors confer different levels of virulence depending on the mutations present. - **Sure, I can use the Spn scripts for this** . Great stuff!!!
35
36 14 Katie Lennard
## srst2 MLST, AMR, VF, plasmid analysis
37
* srst2 was implemented as a nextflow pipeline () on Ilifu for MLST analysis. Clinton requested that we also do AMR analysis with the ARGannot database. Once I did AMR analysis and compared the results with those obtained with the Tychus pipeline I noticed some differences in the results between the two pipelines - the .bam file alignments and coverage results differed. It turns out this is due to differences in bowtie2 parameters between the two pipelines. Whereas Tychus just uses default parameters, srst2 has been optimized for local alignments with short reads using the --very-sensitive-local and -a flags. The srst2 output is alos a lot more informative and provides a measure of divergence (from the reference gene), number of SNPs, annotation, gene length etc. I therefore decided it would be a good idea to redo the AMR, VF and plasmid analyses using srst2. 
38 13 Katie Lennard
39 14 Katie Lennard
For VF analysis, a VF database was created for Pseudomonas as described here https://github.com/katholt/srst2#using-the-vfdb-virulence-factor-database-with-srst2 , I also had to rebuild the singularity image for srst2 to include biopython (needed to run script VFDBgenus.py) for which the dockerfile is in the uct-srst2 repo and build a singularity image for cd-hit from the docker image https://quay.io/repository/biocontainers/cd-hit?tab=tags
40
41
For AMR analysis the default ARGannot_r3.fasta db from the srst2 repo had to be edited in order to view the resulting .bam files in IGV (there was a parsing error caused by single quotes and parentheses in the fasta headers which I removed as described here /ceph/cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/README)
42 13 Katie Lennard
43 3 Katie Lennard
## Data location
44 1 Katie Lennard
45 5 Katie Lennard
### Testing data raw reads
46 3 Katie Lennard
* E. coli test reads: Med Micro server(smb://athena.medmicro.uct.ac.za )in the File Station /MedMicro/Clinton/E. coli/original_data
47 1 Katie Lennard
48 6 Katie Lennard
### Raw data on Ilifu
49
50
* The raw data for Pseudomonas and E. coli have been copied over to Ilifu by Suresh with 
51
52
~~~ text
53
rsync -rv --progress -e 'ssh -vvv' suresh@137.158.204.181:/home/suresh/katie/E.coli /ceph/cbio/tmp/katie/
54
~~~
55
56 7 Katie Lennard
* Moved to 
57
~~~ text
58
/ceph/cbio/users/katie/Nicol/E.coli
59
~~~
60
61
~~~ text
62
/ceph/cbio/users/katie/Nicol/Ps_aerug
63
~~~
64
65 6 Katie Lennard
 ### Pseudomonas reference databases were sourced and downloaded to Ilifu (see DBs and README)
66
67
~~~ text
68
/ceph/cbio/users/katie/Nicol/Tychus_DBs
69
~~~
70
71
* Virulence DB: downloaded from VFDB (http://www.mgc.ac.cn/VFs/download.htm)  on 12/3/2019 and converted to single line fasta (SL_VFDB_setA_nt.fas)
72
* AMR DB: Downloaded from resfinder Downloaded as: git clone https://git@bitbucket.org/genomicepidemiology/resfinder_db.git (Individual .fsa files were merged into a single fasta file with cat *.fsa >> KL_all_resfinder.fa and converted to single line fasta SL_KL_all_resfinder.fa)
73
* Adapters: The file 'adapters.fa' is from the bbmap installation (/opt/conda/opt/bbmap-37.10/resources/adapters.fa) as used in the YAMP pipeine and represnets a more comprehensive list of possible adapters than the TruSeq3-PE.fa default Tychus file
74
* Plasmid DB: PLSDB (https://ccb-microbe.cs.uni-saarland.de/plsdb/plasmids/download/) contained a BLAST formatted DB (.nin etc files) but no fasta so I had to convert this blast db back to a fasta file so I can index it with bowtie2 in the alignment.nf script
75
* Pseudomonas reference genome: ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Pseudomonas_aeruginosa/reference/GCA_000006765.1_ASM676v1/GCA_000006765.1_ASM676v1_genomic.fna.gz
76
77
This was done with:
78
/opt/exp_soft/ncbi-blast-2.2.28+/bin/blastdbcmd -entry all -db /home/kviljoen/Tychus_DBs/KL_plsdb2019/2019_03_05.fna -out /home/kviljoen/Tychus_DBs/KL_plsdb2019/2019_03_05_KL.fasta
79
80
81
### Troubleshooting - common errors during pipeline setup/customization
82
* Segmentation fault error when running csa (coverage sampler): convert input DBs to singleline fastas (they're probably multiline if you're seeing this error). Use awk '{if(NR==1) {print $0} else {if($0 ~ /^>/) {print "\n"$0} else {printf $0}}}' interleaved.fasta > singleline.fasta for conversion
83
* Missing output file(s) `Trees/*.tre` expected by process `BuildPhylogenies (ConfigurationFiles) ` The reference DB name is probably not being extracted correctly from the base (directory). Check if there are '.' in your reference filename and change to '_' e.g. you can't have GCA_000006765.1_ASM676v1_genomic.fna --> convert to GCA_000006765_1_ASM676v1_genomic.fna
84
85 1 Katie Lennard
### Processed data
86 7 Katie Lennard
87 9 Katie Lennard
**Raw reads FastQC/multiQC results on Ilifu:**
88
89 7 Katie Lennard
~~~ text
90
/ceph/cbio/users/katie/Nicol/E_coli_raw_fastqc
91
~~~
92
~~~ text
93
/ceph/cbio/users/katie/Nicol/Ps_aerug_raw_fastqc
94
~~~
95
96 9 Katie Lennard
**Raw reads FastQC/multiQC results on medmicro:**
97
98 7 Katie Lennard
~~~ text
99 1 Katie Lennard
http://athena.medmicro.uct.ac.za:5000/MedMicro/Clinton/E. coli/Katie_results/E_coli_raw_fastqc
100
~~~
101
~~~ text
102 9 Katie Lennard
http://athena.medmicro.uct.ac.za:5000/MedMicro/Clinton/Ps_aerug/Katie_results/Ps_aerug_raw_fastqc
103 1 Katie Lennard
~~~
104
105 9 Katie Lennard
**Trimmomatic-trimmed/filtered reads FastQC/multiQC results on Ilifu:**
106 1 Katie Lennard
107
~~~ text
108 9 Katie Lennard
/ceph/cbio/users/katie/Nicol/E_coli_trimmomatic_fastqc
109
~~~
110
~~~ text
111
/ceph/cbio/users/katie/Nicol/Ps_aerug_trimmomatic_fastqc
112
~~~
113
114
**Trimmomatic-trimmed/filtered reads FastQC/multiQC results on medmicro:**
115
116
~~~ text
117
http://athena.medmicro.uct.ac.za:5000/MedMicro/Clinton/E. coli/Katie_results/E_coli_trimmomatic_fastqc
118
~~~
119
~~~ text
120
http://athena.medmicro.uct.ac.za:5000/MedMicro/Clinton/Ps_aerug/Katie_results/Ps_aerug_trimmomatic_fastqc
121
~~~
122
123
**Tychus alignment module results on Ilifu:**
124
125
~~~ text
126 8 Katie Lennard
/ceph/cbio/users/katie/Nicol/E_coli_alignment
127
~~~
128
~~~ text
129
/ceph/cbio/users/katie/Nicol/Ps_aeruginosa_alignment
130
~~~
131
132 15 Katie Lennard
Additional phylogenetic trees, requested by Clinton (generate separate phylogenetic trees for each of the reference genomes below). Each tree was run twice to check for reproducibility
133
~~~ text
134 16 Katie Lennard
/ceph/cbio/users/katie/Nicol/SNPsAndPhylogenies_Ps_ref_genome (/ceph/cbio/users/katie/Nicol/SNPsAndPhylogenies_Ps_ref_genome/Trees/tree.core.tre md5 1a7041263caba40e835eaf026d685c8e used for downstream analyes in R) 
135 15 Katie Lennard
/ceph/cbio/users/katie/Nicol/SNPsAndPhylogenies_ASM141149v1
136
/ceph/cbio/users/katie/Nicol/SNPsAndPhylogenies_ASM23706v1
137
/ceph/cbio/users/katie/Nicol/SNPsAndPhylogenies_ASM756v2
138
~~~
139
140 9 Katie Lennard
**Tychus assembly module results on Ilifu:**
141 8 Katie Lennard
142
~~~ text
143
/ceph/cbio/users/katie/Nicol/E_coli_assembly
144
~~~
145
~~~ text
146
/ceph/cbio/users/katie/Nicol/Ps_assembly
147 7 Katie Lennard
~~~
148 1 Katie Lennard
149 12 Katie Lennard
**MLST and ARGanot srst2 results on Ilifu:**
150
~~~ text
151
/ceph/cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/srst2 
152
~~~
153
154 1 Katie Lennard
### Final results for publishing
155 11 Katie Lennard
156
### Still to do:
157
158
1. Streamline email report
159
2. Change pipeline setup so that trimmomatic (process name RunQC) is not run by default for both the alignment and assembly modules which just wastes time and space 
160