In this section of the workflow we look at the diversity of the water samples collected from Cayo Roldan (hypoxic) and Cayo Coral (normoxic).
::opts_chunk$set(echo = TRUE)
knitrset.seed(0199)
library(phyloseq); packageVersion("phyloseq")
library(Biostrings); packageVersion("Biostrings")
require(gdata)
::p_load(DT, ggplot2, dplyr, microbiome, tidyverse,
pacman
data.table, plyr, stringr, labdsv, reshape, install = FALSE, update = FALSE)
options(scipen=999)
::opts_current$get(c(
knitr"cache",
"cache.path",
"cache.rebuild",
"dependson",
"autodep"
))
In order to run this workflow, you either need to run the previous workflows or grab the files below from figshare.
All files needed to run this workflow can be downloaded from figshare. These are the output files from the Data Prep Workflow.
File names and descriptions:
seq_table.txt, tax_table.txt, asv.fasta: Sequence tables, taxonomy tables, and ASV fasta files from the full phyloseq object (before removing contaminants) and the trimmed dataset (after removing contaminants).
16s-data_prep.rdata: contains all variables and phyloseq objects from the data prep pipeline. To see the Objects, in R run load(“16s-data_prep.rdata,” verbose=TRUE)
This script will process the curated water sample data from the Data Prep Workflow.
Now that we have a phyloseq object containing the water samples only, we can summarize the data in the water phyloseq object. Again, we use the summarize_phyloseq
from the microbiome R package (Lahti, Sudarshan, and others 2017) as we did before.
First, load the R objects that contain the sample data, sequence table, and taxonomy table (saved at the end of the previous section) and merge them into a single phyloseq object. Doing it this way keeps the memory footprint low.
remove(list = ls())
<- readRDS("rdata/16s-data-prep/ps_water-sample.rds")
sample_d <- readRDS("rdata/16s-data-prep/ps_water-seqtab.rds")
seqtab <- readRDS("rdata/16s-data-prep/ps_water-taxtab.rds")
taxtab <- merge_phyloseq(sample_d, seqtab, taxtab)
ps_water <- ps_water
ps_water_o ps_water
phyloseq-class experiment-level object
otu_table() OTU Table: [ 1272 taxa and 8 samples ]
sample_data() Sample Data: [ 8 samples by 5 sample variables ]
tax_table() Taxonomy Table: [ 1272 taxa by 8 taxonomic ranks ]
Notice we made a copy of the ps_water
phyloseq object called ps_water_o
. This is because we are going to change the names of NA
taxa below and we want a copy of the original object before making changes. Next, we can save copies of the sequence and taxonomy tables just in case we want a quick look at these data.
write.table(tax_table(ps_water),
"tables/16s-water/water_tax_table.txt", sep="\t",
quote = FALSE, col.names=NA)
write.table(t(otu_table(ps_water)),
"tables/16s-water/water_seq_table.txt", sep="\t",
quote = FALSE, col.names=NA)
And generate a fasta file for all ASVs in the water samples.
# Create fasta file from tax_table
<- tax_table(ps_water)
table2format #retain only the column with the sequences
<- table2format[, 7]
table2format_trim <- data.frame(row.names(table2format_trim),
table2format_trim_df
table2format_trim)colnames(table2format_trim_df) <- c("ASV_ID", "ASV_SEQ")
#format fasta
$ASV_ID <- sub("ASV", ">ASV", table2format_trim_df$ASV_ID)
table2format_trim_df
write.table(table2format_trim_df, "tables/16s-water/water_asv.fasta",
sep = "\r", col.names = FALSE, row.names = FALSE,
quote = FALSE, fileEncoding = "UTF-8")
Let’s go ahead and a sample variable called oxstate
to the sample data frame of the water phyloseq object so we can compare the taxonomy of normoxic (Coral Caye) vs. hypoxic (Cayo Roldan) samples. Note one of the Cayo Roldan samples (WCR0) was collected after the hypoxic event when oxygen levels had returned to normal. Therefore WCR0 is considered a normoxic sample.
One last thing to do is to compare diversity across the two sites before and after the hypoxic event. For this we add a sample variable called period
denoting whether the sample was collected during or after the event.
sample_data(ps_water)$oxstate <-
c("normoxic", "normoxic", "normoxic", "normoxic",
"normoxic", "hypoxic", "hypoxic", "hypoxic")
sample_data(ps_water)$period <-
c("cc_after", "cc_during", "cc_during", "cc_during",
"cr_after", "cr_during", "cr_during", "cr_during")
The
ps_water
dataset contains 1272 ASVs, 180557 total reads, 8 samples, and 5 sample variables.
And here is a summary of just water data samples.
Metric | Results |
---|---|
Min. number of reads | 12804 |
Max. number of reads | 33311 |
Total number of reads | 180557 |
Average number of reads | 22570 |
Median number of reads | 24120 |
Min. number of ASVs | 237 |
Max. number of ASVs | 552 |
Total number of ASVs | 1272 |
Average number of ASVs | 317 |
Median number of ASVs | 295.5 |
Sparsity | 0.751 |
Any ASVs sum to 1 or less? | TRUE |
Number of singleton ASVs | 68 |
Percent of ASVs that are singletons | 5.346 |
Number of sample variables are: | 5 (SamName, TYPE, SITE, oxstate, period) |
We can also generate a summary table of total reads & ASVs for each sample. You can sort the table or download a copy. Here is the code to generate the data for the table. First, we create data frames that hold total reads and ASVs for each sample. We can also do a quick calculation of alpha diversity using the Shannon and InvSimpson indices.
<- sample_sums(ps_water)
total_reads <- as.data.frame(total_reads, make.names = TRUE)
total_reads <- total_reads %>% rownames_to_column("Sample_ID")
total_reads
<- estimate_richness(ps_water,
total_asvs measures = c(
"Observed", "Shannon", "InvSimpson"))
<- total_asvs %>% rownames_to_column("Sample_ID")
total_asvs $Sample_ID <- gsub('\\.', '-', total_asvs$Sample_ID) total_asvs
And then we merge these two data frames with the sample data frame. We will use the meta
command from the microbiome package to convert the sample_data
to a data frame.
<- meta(sample_data(ps_water))
sam_details rownames(sam_details) <- NULL
colnames(sam_details) <- c("Sample_ID", "Type", "Site", "Oxstate", "Period")
<- merge(sam_details, total_reads, by = "Sample_ID")
merge_tab <- merge(merge_tab, total_asvs, by = "Sample_ID")
merge_tab2 colnames(merge_tab2) <- c("Sample<br/>ID", "Type", "Site", "Oxstate", "Period",
"total<br/>reads", "total<br/>ASVs", "Shannon", "InvSimpson")
Let’s first take a look at the taxonomic diversity of the dataset. The code to generate this table is a little gross.
Phyloseq has an odd way of dealing with taxonomic ranks that have no value—in other words, NA in the tax table. The first thing we are going to do before moving forward is to change all of the NA to have a value of the next highest classified rank. For example, ASV5
is not classified at the Genus level but is at Family level (Saprospiraceae). So, we change the Genus name to Family_Saprospiraceae. The code for this is hidden here but comes from these two posts on the phyloseq GitHub, both by MSMortensen: issue #850 and issue #990. If you want to access the code chunk, check out the source code linked at the bottom of this page.
[1] "Kingdom" "Phylum" "Class" "Order" "Family" "Genus"
[7] "ASV_SEQ" "ASV_ID"
To make it easier to change the code later on, we can assign the taxonomic rank we are interested into a variable, in the case TRANK
.
<- "Class" TRANK
<- table(tax_table(ps_water)[, TRANK],
tax_asv exclude = NULL, dnn = "Taxa")
<- as.data.frame(tax_asv, make.names = TRUE, stringsAsFactors=FALSE)
tax_asv
<- factor(tax_table(ps_water)[, TRANK])
tax_reads <- apply(otu_table(ps_water), MARGIN = 1, function(x)
tax_reads
{tapply(x, INDEX = tax_reads, FUN = sum, na.rm = TRUE, simplify = TRUE)
})<- as.data.frame(tax_reads, make.names = TRUE)
tax_reads <- cbind(tax_reads, reads = rowSums(tax_reads))
tax_reads <- tax_reads[9]
tax_reads <- setDT(tax_reads, keep.rownames = TRUE)[] tax_reads
<- merge(tax_reads, tax_asv, by.x = "rn", by.y = "Taxa")
taxa_read_asv_tab <- top_n(taxa_read_asv_tab, n = 8, wt = reads)
top_reads <- top_n(taxa_read_asv_tab, n = 8, wt = Freq)
top_asvs
names(taxa_read_asv_tab) <- c("Taxa", "total reads", "total ASVs")
Here we can see that:
How does the taxonomic composition of normoxic compare to hypoxic samples? To look at this, we will combine samples by the oxstate
variable and then calculate the relative abundance for each taxon for both oxygen states. Note one of the Cayo Roldan samples (WCR0) was collected after the hypoxic event when oxygen levels had returned to normal. Therefore, WCR0 is considered a normoxic sample.
There are a few steps we need to run for this analysis.
Again, we can select a rank.
<- "Class" TRANK
<- transform_sample_counts(ps_water, function(x) x/sum(x))
ps_water_AVG <- merge_samples(ps_water, "oxstate")
ps_water_ox <- merge_samples(sample_data(ps_water_AVG), "oxstate") SD_BAR_w
We will set a variable for the taxonomic rank of choice. If you want to choose a different rank, be sure to change the variable assignment.
<- tax_glom(ps_water_ox, taxrank = TRANK, NArm = FALSE)
mdata_phy_w <- transform_sample_counts(mdata_phy_w, function(x) x/sum(x))
mdata_phyrel_w <- psmelt(mdata_phyrel_w)
meltd_w <- as.character(meltd_w[[TRANK]]) meltd_w[[TRANK]]
<- ddply(meltd_w, ~get(TRANK), function(x) c(mean = mean(x$Abundance)))
means_w colnames(means_w) <- c(TRANK, "mean")
$mean <- round(means_w$mean, digits = 5)
means_w<- means_w[order(-means_w$mean), ] # this orders in decending fashion
taxa_means_w <- format(taxa_means_w, scientific = FALSE) # ditch the sci notation taxa_means_w
Time to group low abundance taxa into an Other category for the bar graph. Since we cannot possibly display all 59 different taxa in one figure, we need to collapse the low abundance groups.
<- 8
TAXAN <- top_n(taxa_means_w, n = TAXAN, wt = mean)
top_perc_w $mean <- round(as.numeric(top_perc_w$mean), digits = 5)
top_perc_w<- round(as.numeric(min(top_perc_w$mean)), digits = 5)
min_top_perc_w <- top_perc_w[,1] top_perc_w_list
Here we decided to look at the abundance of the top 8, which sets a relative abundance cutoff at 0.01398. Anything lower than 0.01398 will be grouped into Other.
<- means_w[means_w$mean < min_top_perc_w, ][[TRANK]]
Other_w %in% Other_w, ][[TRANK]] <- "Other"
meltd_w[meltd_w[[TRANK]] <- aggregate(meltd_w$Abundance, by = list(meltd_w$Sample), FUN = sum)[, 1]
samp_names_w <- environment()
.e_w <- factor(meltd_w[, TRANK], sort(unique(meltd_w[, TRANK])))
meltd_w[, TRANK] <- meltd_w[order(meltd_w[, TRANK]), ]
meltd_w1
<- c("Cyanobacteriia", "Alphaproteobacteria",
target "Bacteroidia", "Gammaproteobacteria",
"Thermoplasmata", "Phylum_Marinimicrobia_(SAR406_clade)",
"Acidimicrobiia", "Campylobacteria", "Other")
<- reorder.factor(meltd_w1[[TRANK]], new.order = target) meltd_w1[[TRANK]]
Finally.
We can also look at this in tabular format. To accomplish the task, we use the aggregate
command to group abundance values by oxygen state and taxonomic group.
<- aggregate(meltd_w1$Abundance,
hyp_v_norm by = list(Rank = meltd_w1[[TRANK]],
Sample = meltd_w1$Sample),
FUN = sum)
$x <- round(hyp_v_norm$x, digits = 4)
hyp_v_norm<- spread(hyp_v_norm, Sample, x) hyp_v_norm
What did taxonomic composition look like during and after the event at each site? To look at this, we will combine samples by the period
variable and then calculate the relative abundance for each taxon. The code for this analysis is basically the same as the code above where we compared hypoxic to normoxic samples. Therefore, all of the code is hidden. See the link at the bottom of the page for the page source code if you are interested in the code we used here. We go through the same steps:
Again, Time to group low abundance taxa into an Other category for the bar graph. Since we cannot possibly display all 59 different taxa in one figure, we need to collapse the low abundance groups.
And that’s it for this part of the 16S rRNA part of the workflow. Next, we drill a bit deeper into the data set and look at differentially abundant ASVs between normoxic vs. hypoxic samples.
The source code for this page can be accessed on GitHub by clicking this link.
Output files from this workflow available on figshare at doi:10.25573/data.12794258. The 16s-water.rdata file contains all variables and phyloseq objects from this pipeline and is needed for the next workflow (as are the .rds files).
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/hypocolypse/web/, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".