Wiki » History » Version 13

Katie Lennard, 05/09/2019 10:16 AM

# Wiki

## Study background

This study was prompted by an unusual outbreak of wild-type Pseudomonas that coincided with the Cape Town drought. Preliminary molecular analysis suggests clonality, so the aim is to establish how this outbreak came about and whether the drought is in some way responsible. Pseudomonas are waterborne opportunistic pathogens that can form biofilms in plumbing pipes. One hypothesis is therefore that the drought, through decreased water pressure, allowed increased biofilm formation and subsequently increased concentrations in drinking water. The data will include WGS of blood culture isolates and water samples from before, during, and after the outbreak (96 samples).

## Pipeline tool options considered
* Tychus: A Nextflow-based pipeline for pathogen WGS assembly and annotation (Repo: https://github.com/Abdo-Lab/Tychus Paper: https://www.biorxiv.org/content/biorxiv/early/2018/03/16/283101.full.pdf)
* Advice from Arash: Use Velvet for assembly and Prokka for annotation. If the species genome is very diverse across strains, build a pan-genome and use it as the reference genome. Use Roary (https://academic.oup.com/bioinformatics/article/31/22/3691/240757), Pyseer or BPGA for pan-genome construction, and then perform a statistical analysis of gene presence/absence across populations with Scoary. Roary is installed on CHPC and its output files are compatible with Scoary and R.
* Options from Nicky (https://www.pathogensurveillance.net/software):
A) Microreact (interactive visualisation of trees, geographic data, and temporal data): not immediately useful for Pseudomonas (only 1 entry in their database, vs. e.g. ~4000 for Staph)
B) Pathogenwatch (Processing and Visualisation of Microbial Genome Sequences in Phylogenetic and Geographical Contexts): here you can upload your assemblies to do MLST and AMR profiling. We can test this once we have assemblies, although it also seems to be limited to a handful of pathogens currently (https://pathogen.watch/)

## Implementation of Tychus (nextflow pipeline)
* Singularity images (one for the Tychus alignment module and one for the Tychus assembly module) were built on the BST server from the Dockerfiles in Tychus. The pipeline was first tested on hex, so the relevant bind points were added to the Dockerfiles. After cloning the repo, each Docker image was built from the folder on BST containing the Dockerfile with `docker build -t Tychus_alignment .` and the Singularity image was then built from the Docker image using docker2singularity.

## Tychus pipeline parameters to consider
Configurable Trimmomatic variables include trim length, quality (phred) scores, sliding window and the adapters file to use (https://github.com/kviljoen/Tychus/blob/ilifu/nextflow.config). Parameters not specified in nextflow.config would have to be changed in the main scripts (alignment.nf and assembly.nf).

## Round 2 analyses (post-Tychus)
Additional analyses were requested by Clinton after the first round of results (from the Tychus pipeline). Below are Clinton's requests, my suggested analyses (in boldface), and Clinton's feedback on those suggestions.

Proposed analyses:

* We would like to extract the in silico MLST profiles from these genomes. - **Use srst2 (https://github.com/katholt/srst2#mlst-results)** This looks great. Could you include the resistance gene ID option in srst2 as well, with the ArgANNOT3 database as recommended? It is always good to query multiple databases for resistance gene ID.

* Reconstruct the phylogenetic tree to include certain outgroups (Burkholderia cepacia, Pseudomonas fluorescens, Pseudomonas putida). This will allow us to root the tree and get a better context for evolution. - **OK, it looks like there is a way to do this with kSNP3, but I'll have to write a script for this that is separate from the main pipeline.** Is it not perhaps possible to simply include these genomes in the pipeline as samples, just a thought? Ideally 3 separate trees, one for each of the outgroups, since we won’t know what it will look like with these included?

* Are you able to assist with constructing a phylogenetic heatmap (see image below), or even a 2-dimensional one? This would include the phylogenetic data on one side, and some additional data, such as presence of certain genes, etc. on the other. - **I'm good at doing annotated heatmaps in R for other data types, I just need to think about how one would include the phylogenetic data on one side. So the data matrix (i.e. the colour values) would be presence/absence of certain genes?** I’ve heard of an online resource (https://microreact.org/showcase) which is quite simple to use. I should be able to do this once we have the tree we need.

* For the plasmid resistome results, we have found a hit which is present in all the outbreak isolates and only a few of the non-outbreak isolates. The gene fractions for these results only go up to approximately 60%. Does this mean that only 60% of the reference plasmid is covered? If so, is the rest of the plasmid unique, or perhaps absent? We would like to compare this plasmid from all the isolates to see how similar they are to the reference (CP002153.1) as well as to each other. Can you assist with plasmid assembly and constructing a plasmid map (see below)? - **Hard to say exactly what is happening here. I'm assuming this won't be part of the assembly results either, since it's a plasmid. I have to think about this one. I haven't done plasmid maps before, so if someone in your lab has, that might be a shorter turnaround.** This is becoming a common analysis for bacteria, since resistance and virulence are often carried on plasmids. Can I suggest using plasmidSPADES, which uses an algorithm to assemble potential plasmids from WGS data? The problem is how to compare them once we have the contigs. We could start with a tree and then try to draw the comparative map after?

* For the virulence factors we have identified 3 factors (NP_253217, NP_251844, NP_251850) present in all the outbreak isolates and only a few of the non-outbreak isolates. Could you extract these sequences from the relevant contigs and blast, and do a multiple alignment for comparison of each one? These factors confer different levels of virulence depending on the mutations present. - **Sure, I can use the Spn scripts for this.** Great stuff!!!

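For the outgroup request above (ideally one tree per outgroup), one possible starting point for the separate kSNP3 script is to prepare one input list per outgroup. The sketch below is only illustrative: the directory layout, the file names and the tiny placeholder fasta files are hypothetical, and it assumes kSNP3's input list format of one genome per line, given as the fasta path, a tab, and a genome label.

```shell
# Hypothetical layout: one assembly per isolate plus one fasta per outgroup.
workdir=$(mktemp -d)
mkdir -p "$workdir/isolates" "$workdir/outgroups"

# Placeholder files standing in for real assemblies and outgroup genomes.
for s in isolate_01 isolate_02; do
    printf '>%s\nACGT\n' "$s" > "$workdir/isolates/$s.fasta"
done
for o in B_cepacia P_fluorescens P_putida; do
    printf '>%s\nACGT\n' "$o" > "$workdir/outgroups/$o.fasta"
done

# Write one input list per outgroup: all isolates plus that single outgroup.
# Each line is "<path to fasta><TAB><genome label>".
for og in "$workdir"/outgroups/*.fasta; do
    og_name=$(basename "$og" .fasta)
    list="$workdir/ksnp3_in_${og_name}.txt"
    : > "$list"   # start from an empty list
    for f in "$workdir"/isolates/*.fasta "$og"; do
        printf '%s\t%s\n' "$f" "$(basename "$f" .fasta)" >> "$list"
    done
done

ls "$workdir"/ksnp3_in_*.txt
```

Each list could then be passed to a separate kSNP3 run (via its `-in`, `-outdir` and `-k` options) to get one tree per outgroup; the choice of k and any tree options would still need to be decided for this dataset.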
## Data location
### Testing data raw reads
* E. coli test reads: Med Micro server (smb://athena.medmicro.uct.ac.za), in the File Station under /MedMicro/Clinton/E. coli/original_data

### Raw data on Ilifu
* The raw data for Pseudomonas and E. coli have been copied over to Ilifu by Suresh with:

~~~ text
rsync -rv --progress -e 'ssh -vvv' suresh@137.158.204.181:/home/suresh/katie/E.coli /ceph/cbio/tmp/katie/
~~~

* Moved to:

~~~ text
/ceph/cbio/users/katie/Nicol/E.coli
~~~

~~~ text
/ceph/cbio/users/katie/Nicol/Ps_aerug
~~~

### Pseudomonas reference databases were sourced and downloaded to Ilifu (see DBs and README)

~~~ text
/ceph/cbio/users/katie/Nicol/Tychus_DBs
~~~

* Virulence DB: downloaded from VFDB (http://www.mgc.ac.cn/VFs/download.htm) on 12/3/2019 and converted to single-line fasta (SL_VFDB_setA_nt.fas)
* AMR DB: downloaded from ResFinder with `git clone https://git@bitbucket.org/genomicepidemiology/resfinder_db.git`. The individual .fsa files were merged into a single fasta file with `cat *.fsa >> KL_all_resfinder.fa` and converted to single-line fasta (SL_KL_all_resfinder.fa)
* Adapters: the file 'adapters.fa' is from the bbmap installation (/opt/conda/opt/bbmap-37.10/resources/adapters.fa) as used in the YAMP pipeline and represents a more comprehensive list of possible adapters than the default Tychus file, TruSeq3-PE.fa
* Plasmid DB: PLSDB (https://ccb-microbe.cs.uni-saarland.de/plsdb/plasmids/download/) contained a BLAST-formatted DB (.nin etc. files) but no fasta, so I had to convert this BLAST DB back to a fasta file in order to index it with bowtie2 in the alignment.nf script
* Pseudomonas reference genome: ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/Pseudomonas_aeruginosa/reference/GCA_000006765.1_ASM676v1/GCA_000006765.1_ASM676v1_genomic.fna.gz

The PLSDB BLAST DB was converted back to fasta with:

`/opt/exp_soft/ncbi-blast-2.2.28+/bin/blastdbcmd -entry all -db /home/kviljoen/Tychus_DBs/KL_plsdb2019/2019_03_05.fna -out /home/kviljoen/Tychus_DBs/KL_plsdb2019/2019_03_05_KL.fasta`

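The ResFinder merge step described for the AMR DB can be sketched as follows. This is a minimal illustration in a scratch directory, not the exact commands that were run: the two .fsa files are tiny placeholders rather than real ResFinder content. It uses `>` instead of `>>` so that rerunning the merge does not append a second copy of every sequence to an existing output file.

```shell
# Scratch directory with placeholder .fsa files (not real ResFinder data).
dbdir=$(mktemp -d)
printf '>geneA\nACGT\nTTGA\n' > "$dbdir/aminoglycoside.fsa"
printf '>geneB\nGGCC\n'       > "$dbdir/beta-lactam.fsa"

# '>' truncates the output first, so a rerun does not duplicate records.
# The output ends in .fa, so the *.fsa glob never picks it up as input.
cat "$dbdir"/*.fsa > "$dbdir/KL_all_resfinder.fa"

# Sanity check: the merged file should contain as many headers as the inputs.
n_in=$(cat "$dbdir"/*.fsa | grep -c '^>')
n_out=$(grep -c '^>' "$dbdir/KL_all_resfinder.fa")
echo "input headers: $n_in, merged headers: $n_out"
```

If any input file lacks a trailing newline, `cat` will join its last sequence line to the next header, so the header count is worth checking before the single-line conversion.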
### Troubleshooting - common errors during pipeline setup/customization
* Segmentation fault when running csa (coverage sampler): convert the input DBs to single-line fastas (they're probably multi-line if you're seeing this error). For the conversion use `awk '{if(NR==1) {print $0} else {if($0 ~ /^>/) {print "\n"$0} else {printf $0}}}' interleaved.fasta > singleline.fasta`
* Missing output file(s) `Trees/*.tre` expected by process `BuildPhylogenies (ConfigurationFiles)`: the reference DB name is probably not being extracted correctly from the file's base name. Check whether there are '.' characters in your reference filename and change them to '_', e.g. you can't have GCA_000006765.1_ASM676v1_genomic.fna --> convert to GCA_000006765_1_ASM676v1_genomic.fna

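Both fixes above can be tried on toy input first. The sketch below is a variant of the awk one-liner, not the command that was originally run: it uses `printf "%s"` so that a `%` in a sequence line is not interpreted as a format specifier, and it ends the output with a newline. The input fasta is a made-up placeholder; the reference filename is the one from the example above.

```shell
work=$(mktemp -d)

# 1) Multi-line fasta -> single-line fasta (placeholder input, not a real DB).
printf '>seq1\nACGT\nTTGA\n>seq2\nGGCC\n' > "$work/multiline.fasta"
awk '/^>/ { if (NR > 1) print ""; print; next }
     { printf "%s", $0 }
     END { print "" }' "$work/multiline.fasta" > "$work/singleline.fasta"
cat "$work/singleline.fasta"

# 2) Replace every '.' in the reference name except the one before the extension.
ref="GCA_000006765.1_ASM676v1_genomic.fna"
stem=${ref%.fna}                                   # strip the extension
fixed="$(printf '%s' "$stem" | tr '.' '_').fna"    # dots -> underscores
echo "$fixed"
```

After renaming the reference file, any index or config entry that points at the old name would need to be updated to match.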
### Processed data

**Raw reads FastQC/multiQC results on Ilifu:**

~~~ text
/ceph/cbio/users/katie/Nicol/E_coli_raw_fastqc
~~~

~~~ text
/ceph/cbio/users/katie/Nicol/Ps_aerug_raw_fastqc
~~~

**Raw reads FastQC/multiQC results on medmicro:**

~~~ text
http://athena.medmicro.uct.ac.za:5000/MedMicro/Clinton/E. coli/Katie_results/E_coli_raw_fastqc
~~~

~~~ text
http://athena.medmicro.uct.ac.za:5000/MedMicro/Clinton/Ps_aerug/Katie_results/Ps_aerug_raw_fastqc
~~~

**Trimmomatic-trimmed/filtered reads FastQC/multiQC results on Ilifu:**

~~~ text
/ceph/cbio/users/katie/Nicol/E_coli_trimmomatic_fastqc
~~~

~~~ text
/ceph/cbio/users/katie/Nicol/Ps_aerug_trimmomatic_fastqc
~~~

**Trimmomatic-trimmed/filtered reads FastQC/multiQC results on medmicro:**

~~~ text
http://athena.medmicro.uct.ac.za:5000/MedMicro/Clinton/E. coli/Katie_results/E_coli_trimmomatic_fastqc
~~~

~~~ text
http://athena.medmicro.uct.ac.za:5000/MedMicro/Clinton/Ps_aerug/Katie_results/Ps_aerug_trimmomatic_fastqc
~~~

**Tychus alignment module results on Ilifu:**

~~~ text
/ceph/cbio/users/katie/Nicol/E_coli_alignment
~~~

~~~ text
/ceph/cbio/users/katie/Nicol/Ps_aeruginosa_alignment
~~~

**Tychus assembly module results on Ilifu:**

~~~ text
/ceph/cbio/users/katie/Nicol/E_coli_assembly
~~~

~~~ text
/ceph/cbio/users/katie/Nicol/Ps_assembly
~~~

**MLST and ARG-ANNOT srst2 results on Ilifu:**

~~~ text
/ceph/cbio/users/katie/Nicol/Ps_aerug_srst2_MLST/srst2
~~~

### Final results for publishing
### Still to do:

1. Streamline the email report
2. Change the pipeline setup so that Trimmomatic (process name RunQC) is not run by default for both the alignment and assembly modules, which wastes time and space