Setting up environment

knitr::opts_chunk$set(eval = TRUE, 
                      echo = TRUE, 
                      cache = FALSE,
                      include = TRUE,
                      collapse = FALSE,
                      dependson = NULL,
                      engine = "R", # Chunks will always have R code, unless noted
                      message = FALSE,
                      #dev = 'pdf',
                      fig.path="Figures/",  # Set the figure options
                      fig.align = "center"
                      
                      #, 
                      # fig.width = 6,
                      # fig.height = 6
                      )

Load needed packages. Note, I set the working directory here and you can do the same to point it to the contents of git hub zipped folder. I still hardcoded the full path to files as I call them through this document because I could not get knitr to generate html by calling the set wd…one of those mysteries, maybe you have better luck.

library(knitr)
library(vegan)
library(ggplot2)
library(Hmisc)
library(RColorBrewer)
library(gplots)
library(plyr)
library(dplyr)
library(reshape2)
library(DT)
library(Matrix)

Figure 1. Genome relatedness heatmap.

How many genera of PSA phages exist? And who belongs to those genera?

I will use the working definition of genera set forth by Lavigne et al. 2008, whereby phage genus membership is defined at >40% genes shared at bitscore > 75.

  1. To answer this, I collected the protein sequences of all phages of Pseudoalteromonas in individual protein fasta files. This list included:
  • 8 Helgoland PSA phage isolates that we sequenced here
  • 7 PSA phage isolates sequenced by others (available from GenBank)
  • 7 prophages predicted from Helgoland PSA bacterial genomes
  • 28 prophages predicted from published PSA bacterial genomes (available from GenBank and WGS; only those considered “high-quality” prophage predictions by VirSorter)

This resulted in 2377 proteins in 50 multifastas protein files, available in the directory \faa

  1. Next, I ran pair-wise blastp protein homology searches for all pairs of PSA phages. The bash files called are also contained in the directory \faa

    bash makeblastdbs.sh
    bash blastp.sh

Here is an example line from the makeblastdbs.sh script:
makeblastdb -in ANT505.faa -dbtype prot

Here is an example line from the blastp.sh script:
blastp -db ANT505.faa -query 1315s1g383.faa -outfmt 6 -evalue 0.1 > ANT505.faa_1315s1g383.faa.blastp

  1. Next, the blastp results are parsed to determine the reciprocal best blastp (RBB) hits between all phages using the script reciprocal_blast_hits.py, also stored in the /faa directory. In other words, all homologues between all phages were determined based on which pairs had the highest bit scores. Only hits with bitscore >75 were considered, per Lavigne et al. 2008.
bash rbb_bit.sh  
cat *.blastp.rbb.out > all.rbb_bit.out

Here is an example line from the reciprocal_blast_hits.py script, available here:

python reciprocal_blast_hits.py HM1_H101.blastp H101_HM1.blastp 1 2 12 high HM1_H101.rbb.out

These results (all_v_all.blastp) are called into a dataframe.

allPSA<-read.table("/Users/duhaimem/busibee/PostDoc/TMPL/Helgoland_Phages/ecogenomics_project/genome_analysis/JGI/phage/seq_data/genes.aa/faa/all.rbb_bit.out", header=FALSE)

colnames(allPSA) <- c("query","sbjt", "bitA","bitB")

allPSA_bs75 <- filter(allPSA, bitA > 75 & bitB > 75)

datatable(allPSA_bs75, 
          options = list(pageLength = 5),
          caption = "All-v-All Reciprocal Best Blastp results"
          )