In this section of the workflow we look at the diversity of the water samples collected from Cayo Roldan (hypoxic) and Cayo Coral (normoxic).
::opts_chunk$set(echo = TRUE)
knitrset.seed(0199)
library(phyloseq); packageVersion("phyloseq")
library(Biostrings); packageVersion("Biostrings")
require(gdata)
::p_load(DT, ggplot2, dplyr, microbiome, tidyverse,
pacman
data.table, plyr, stringr, labdsv, reshape,
metacoder, naniar, install = FALSE, update = FALSE)
options(scipen=999)
::opts_current$get(c(
knitr"cache",
"cache.path",
"cache.rebuild",
"dependson",
"autodep"
))
In order to run this workflow, you either need to run the previous workflows or grab the files below from figshare.
All files needed to run this workflow can be downloaded from figshare. These are the output files from the Diversity workflow page.
File names and descriptions:
load("16s-water.rdata", verbose=TRUE)
.And here are the output files from this workflow.
File names and descriptions:
This script will process the curated water sample data from the Diversity Workflow.
remove(list = ls())
<- readRDS("rdata/16s-water/ps_water.rds")
ps_water <- readRDS("rdata/16s-water/ps_water_o.rds")
ps_water_o <- readRDS("rdata/16s-water/ps_water_ox.rds")
ps_water_ox load("rdata/16s-water/16s-water.rdata")
Time to drill down a little deeper and look at differentially abundant (DA) ASVs between the two oxygen conditions. ASVs that were enriched and more frequent in hypoxic and normoxic samples were identified using the Indicator Species Analysis (ISA) (Dufrêne and Legendre 1997), computed with the R package labdsv (Roberts 2007).
set.seed(119)
#rm(class, order, phylum, kingdom)
#ls(all.names = TRUE)
<- data.frame(otu_table(ps_water))
water_seq_table # Delete columns when sum == 0 Should not be any
<- water_seq_table[, which(colSums(water_seq_table) != 0)]
water_seq_table
<- cbind(oxstate = c("normoxic","normoxic",
water_seq_table "normoxic","normoxic",
"normoxic","hypoxic",
"hypoxic","hypoxic"),
water_seq_table)
<- indval(water_seq_table[,-1], water_seq_table[,1])
iva <- iva$maxcls[iva$pval <= 0.05]
gr <- iva$indcls[iva$pval <= 0.05]
iv <- iva$pval[iva$pval <= 0.05]
pv <- apply(water_seq_table[,-1] > 0, 2, sum)[iva$pval <= 0.05]
fr <- data.frame(group = gr, indval = iv,
indvalsummary pvalue = pv, freq = fr)
<- indvalsummary[order(indvalsummary$group,
indvalsummary -indvalsummary$indval),]
<- data.frame(tax_table(ps_water))
water_tax_table <- merge(indvalsummary, water_tax_table,
indvalsummary_tax by = "row.names",
all = TRUE)
<- indvalsummary_tax[!(is.na(indvalsummary_tax$group)),]
indvalsummary_tax
lapply(indvalsummary_tax, class)
class(indvalsummary_tax$group) = "character"
$group <- ifelse(indvalsummary_tax$group == "1",
indvalsummary_taxas.character("hypoxic"),
as.character("normoxic"))
Now we can save a few files and display the data.
We also save fasta files of DA ASVs. For the visualization below we will only consider ASVs that were represented by > 100 reads so we need a fasta of just those ASVs.
Now. What do we get from all of this? Well, we started with 1272 and ran Indicator Species Analysis (ISA) to see if we could identify Differentially Abundant (DA) ASVs between the normoxic and hypoxic conditions. ISA found 159 ASVs that were DA between the two, specifically 76 ASVs enriched in the normoxic samples and 83 in the hypoxic. If we consider only ASVs with more than 100 total reads we get 76 normoxic ASVs and 83 hypoxic ASVs.
Next, we will combine the results of the ISA analysis with the distribution of ASVs across each sample. We are going to do the analysis in anvi’o (Eren et al. 2015) using the anvi-interactive
command. Anvi’o likes database but it also understands that sometimes you don’t have a database. So it offers a manual mode for all of us slackers out there. If you type this command you can have a look at the relevant pieces we need for the visualization, specifically those under the headings MANUAL INPUTS
and ADDITIONAL STUFF
. I will go through these out of order so try to keep up ;)
anvi-interactive -h
MANUAL INPUTS:
Mandatory input parameters to start the interactive interface without
anvi'o databases.
--manual-mode We need this flag to run anvi'o in an ad hoc
manner, i.e., no database.
-f FASTA, --fasta-file FASTA
A FASTA-formatted input file. This is sort of
optional
-d VIEW_DATA, --view-data VIEW_DATA
A TAB-delimited file for view data. This is the ASV
by sample matrix. We need this
-t NEWICK, --tree NEWICK
NEWICK formatted tree structure. How the ASVs are
ordered in our case.
ADDITIONAL STUFF:
Parameters to provide additional layers, views, or layer data.
-V ADDITIONAL_VIEW, --additional-view ADDITIONAL_VIEW
A TAB-delimited file for an additional view to be used
in the interface. This file should contain all split
names, and values for each of them in all samples.
Each column in this file must correspond to a sample
name. Content of this file will be called 'user_view',
which will be available as a new item in the 'views'
combo box in the interface
-A ADDITIONAL_LAYERS, --additional-layers ADDITIONAL_LAYERS
A TAB-delimited file for additional layer info. In
our case this is info about each ASV. The first column
of the file must be the ASV names, and
the remaining columns should be unique attributes.
There are also a few files we generate that cannot be loaded directly. So, in addition to the files that can be loaded when running the interactive, we also have files that must be added to the database created by anvi’o.
Here is a nice tutorial on Working with anvi’o additional data tables. A lot of what we need is covered in this tutorial. To get the most out the visualization we need to create a few files to give anvi’o when we fire up the interactive interface.
Let’s start with the -d
or --view-data
file. This file needs to be an ASV by sample matrix of read counts. To simplify the visualization, we will use all ASVs represented by 100 or more total reads, including those identified as differentially abundant by the ISA.
<- prune_taxa(taxa_sums(ps_water) > 100, ps_water)
ps_water_100 ps_water_100
phyloseq-class experiment-level object
otu_table() OTU Table: [ 189 taxa and 8 samples ]
sample_data() Sample Data: [ 8 samples by 5 sample variables ]
tax_table() Taxonomy Table: [ 189 taxa by 8 taxonomic ranks ]
Looks like this are 189 total ASVs. We know from the ISA analysis above that 159 ASVs were differentially abundant between the two groups so there will be 30 ASVs in our graph that showed no difference.
<- as.data.frame(t(otu_table(ps_water_100)))
data_tab <- data_tab %>% rownames_to_column("Group")
data_tab write.table(data_tab, "16s-anvio/data.txt", sep = "\t",
quote = FALSE, row.names = FALSE)
Next, we need some additional data about the ASVs to overlay on the visual. This can be anything however what I specifically want are the details of the ISA analysis, total reads, and lineage info. I warn you; this code will get ugly and I urge you to find a better way.
Start with an ASV + lineage table for the ASVs in the ps_water_100
phyloseq object.
<- as.data.frame(t(otu_table(ps_water_100)))
data_tab2 <- cbind(data_tab2, total_reads = rowSums(data_tab2))
total_reads2 1:8] <- NULL
total_reads2[<- as.data.frame(tax_table(ps_water_100))
tax_tab2 7] <- NULL
tax_tab2[<- tibble::rownames_to_column(total_reads2, "Group")
total_reads2 <- tibble::rownames_to_column(tax_tab2, "Group")
tax_tab2
<- merge(total_reads2, tax_tab2,
merge_tabX by = "Group",
all = TRUE)
<- c("reads", "norm", "hypo", "Kingdom",
drop_c "Phylum", "Class", "Order", "Family", "Genus")
<- full_indval_sum_noseq[ , !(names(full_indval_sum_noseq) %in% drop_c)]
idval2 colnames(idval2)[1:2] <- c("Group", "enriched")
<- merge(idval2, merge_tabX,
additional_layers by = "Group",
all.y = TRUE)
write.table(additional_layers, "16s-anvio/additional_layers.txt",
sep = "\t", quote = FALSE, row.names = FALSE, na = "")
Now we want some general data about the samples to overlay on the visual. Again, this can be anything. How about a table of alpha diversity metrics? We actually have such a table that was generated way back up the road. Just need to fix the column names.
<- merge_tab2
additional_view 2] <- NULL
additional_view[colnames(additional_view) <- c("id", "site", "oxstate", "period", "no_reads",
"no_asvs", "Shannon", "InvSimpson")
write.table(additional_view, "16s-anvio/additional_views.txt",
sep = "\t", quote = FALSE, row.names = FALSE, na = "")
Turned out this was a little tricky to figure out, but thanks to this little nifty block of code written by guoyanzhao on the phyloseq Issues forum, it was a piece of cake. The code can be altered to take any rank. See the post for an explanation.
Anyway, the goal is to sum each taxon at sum rank and present that as a bar chart for each sample in the visualization. Anvi’o has a specific format it needs where each row is a sample and each column is a taxon. Taxa names need the prefix t_<RANK>!
. For example, t_class!
should be added for Class rank.
# Make the table
<- ps_water
anvio_taxa <- tax_glom(anvio_taxa, taxrank = 'Class')
glom_rank <- psmelt(glom_rank)
melt_rank $Class <- as.character(melt_rank$Class)
melt_rank<- aggregate(Abundance~Sample+Class, melt_rank, FUN = sum)
rank_abundance <- as.data.frame(cast(rank_abundance, Sample ~ Class))
rank_abundance #rank_abundance <- as.data.frame(rank_abundance)
<- tibble::remove_rownames(rank_abundance)
rank_abundance <- tibble::column_to_rownames(rank_abundance, "Sample")
rank_abundance
# Reorder table column by summ
<- rank_abundance[,names(sort(colSums(rank_abundance), decreasing = TRUE))]
layers
# Add the prefix
<- layers %>% dplyr::rename_all(function(x) paste0("t_class!", x))
layers <- tibble::rownames_to_column (layers, "taxon")
layers
# save the table
write.table(layers, "16s-anvio/tax_layers.txt",
sep = "\t", quote = FALSE, row.names = FALSE, na = "")
The last piece we need is to generate dendrograms that order the ASVs by their distribution in the samples and the samples by their ASV composition. For this task we will use anvi’o.
anvi-matrix-to-newick data.txt --distance euclidean \
--linkage ward \
-o asv.tre
anvi-matrix-to-newick data.txt --distance euclidean \
--linkage ward \
-o sample.tre \
--transpose
The first command reads the view data we generated above and uses Euclidean distance and Ward linkage for hierarchical clustering of the ASVs. The second command transposes the view data table and then does the same for the samples. There are many distance metrics and linkage methods available. Boom.
The ASV tree is fine as is, but the sample tree needs a special format. Specifically, the tree needs to be in a three column, tab delimited, table. This way you can add multiple orderings to the same file and view them all in the interactive. The table needs to be in this format:
item_name | data_type | data_value |
---|---|---|
tree_1 | newick | ((WCR0:0.0793043,(WCC0:0.0431881,(WCC1:0.0301642,.. |
tree_2 | newick | (WCR1:0.142825,(WCR2:0.0466953,WCR3:0.0466953),… |
(…) | (…) | (…) |
This is easy to do by hand, but I really need the practice so I will do it in R. Anvi’o is very particular about formatting. For example, if this file ends with a blank line, which it will because when anvi’o made the initial dendrogram it add a new line. We need to get rid of that or we get an error when trying to import the table.
<- read_file("16s-anvio/sample.tre")
sam_tree <- gsub("[\r\n]", "", sam_tree)
sam_tree <- c("euc_ward")
item_name <- c("newick")
data_type <- c(sam_tree)
data_value <- data.frame(item_name, data_type, data_value)
sam_tre_tab library(janitor)
%>% remove_empty("rows")
sam_tre_tab write.table(sam_tre_tab, "16s-anvio/sample_tree_tab.txt",
sep = "\t", quote = FALSE, row.names = FALSE, na = "")
We don’t need to add a fasta file, but it is a nice way to keep everything in one place. Plus, you can do BLAST searches directly in the interface by right clicking on the ASV of interest, so it is nice to have the sequences.
# Create fasta file from tax_table
<- tax_table(ps_water_100)
anvio_tab #retain only the column with the sequences
<- anvio_tab[, 7]
anvio_tab_trim <- data.frame(row.names(anvio_tab_trim),
anvio_tab_trim_df
anvio_tab_trim)colnames(anvio_tab_trim_df) <- c("ASV_ID", "ASV_SEQ")
#format fasta
$ASV_ID <- sub("ASV", ">ASV", anvio_tab_trim_df$ASV_ID)
anvio_tab_trim_df
write.table(anvio_tab_trim_df, "16s-anvio/anvio.fasta",
sep = "\r", col.names = FALSE, row.names = FALSE,
quote = FALSE, fileEncoding = "UTF-8")
Time to put all of these pieces together. This gets a little tricky since we do not have a database to start with because some of these files can be loaded directly in the interface but some need to be added to a database. When we fire up the interactive in --manual
mode, we must give anvi’o the name of a database and it will create that database for us. Then we can shut down the interactive, add the necessary data files, and start back up.
anvi-interactive --view-data data.txt \
--tree asv.tre \
--additional-layers additional_layers.txt \
--profile-db profile.db \
--manual
Now we have a new profile database that we can add the sample metadata (additional_layers.txt
) and the sample dendrogram (sample.tre) using the command anvi-import-misc-data
. These commands add the table to the new profile.db
. First, kill the interactive.
anvi-import-misc-data additional_views.txt \
--pan-or-profile-db profile.db \
--target-data-table layers
anvi-import-misc-data sample_tree_tab.txt \
--pan-or-profile-db profile.db \
--target-data-table layer_orders
One last this is to get the table with the taxonomy total by sample (tax_layers.txt
) into the profile database. We will run the same command we just used.
anvi-import-misc-data tax_layers.txt \
--pan-or-profile-db profile.db \
--target-data-table layers
In fact, we could just as easily append the taxonomy total data onto the additional_layers.txt
and import in one command. But we didn’t.
With a populated database in hand, we can now begin modifying the visual by running the interactive command again.
anvi-interactive --view-data data.txt \
--tree asv.tre \
--additional-layers additional_layers.txt \
--profile-db profile.db
--fasta-file anvio.fasta \
--manual
Remember, we added a lot of data, so the tree looks crazy when you first open the interactive. This is the interactive part of the process where we select the data we want to display and how we want to display it.
Now have a look after some changing the way everything looks. It is not possible to go through all the steps here, but it didn’t take very long.
Hopefully you agree that this is a near publication quality figure. Obviously some tweaking still left to do but most of it can be done in anvi’o.
All files generated for this analysis can be found in the Data Availabilty appendix below. You should only need to run the command above and the final figure should pop up in your browser.
Alternatively, an interactive version of this figure can be found on the anvi-server using either this link or that link. Works best on Chrome.
One final thing I like to do is save a state file. After you set everything the way you want it you can hit the Save State button, which will store your settings in the database. That way you have a backup text file should something bad happen to the profile database.
anvi-export-state --pan-or-profile-db profile.db
--state default
--output-file states.json
Time for something a little different. We have been using some pretty standard graphical constructs to display diversity data from these samples. Especially when looking at taxonomic diversity, we are pigeonholing the data by selecting and isolating specific ranks to look at—Class, Family, ASV, etc. But we are not really considering these pieces together. For example, when we asked what the differences between normoxic and hypoxic samples were, we looked at which ASVs were differentially abundant. Nice. But what about higher taxonomic ranks? Two treatments may differ at the ASV level but not say the Family level. This is important becuase it can tell us where, on a lineage map, sample type start to diverge from one another. And this of course has important ecological consequences. In fact, we can perform this analysis and actually visualize the lineage map and significant results as a heat tree using the R package metacoder (Foster, Sharpton, and Grünwald 2017). Let’s take a look at a simple example using the Phylum Cyanobacteria.
The map is like a network, composed of nodes (circles) connected by edges (lines). Each node is a taxonomic level, and along with the edges, forms part of a taxonomic lineage. Start at the big P—this is the phylum Cyanobacteria. Then the single Class in this example (big C), then 4 orders, 5 familes, 9 genera, and finally 21 terminal nodes, which are the individual ASVs.
Color indicates a node (or entire lineage) that is enriched in normoxic or hypoxic samples, while the intensity of the color is dervived from the strength of that difference. Grey shades indicate no difference. Testing is happening at each taxonomic level. For example, the nodes for phylum and class are both blue, indicating at thos levels, normoxic samples are enriched. However, moving down ranks, we can see that 2 orders are actually enriched in the hypoxic samples, 1 in the normoxic, and 1 where there is no difference, and so on.
Ok, right off the bat I will say that I found some of the metacoder code a little esoteric. I chalk this up to my own ignorance but mention it because my explanations may be confusing. So, grain of salt as they say because I am still learning. I highly recommend going through the tutorials on the metacoder site before using this tool. We can start by copying the original phyloseq object.
<- ps_water ps_water_mc
Now we need to decide if we want to remove rare ASVs and/or taxa before proceeding. Honestly, I am not sure. Certainly with a large dataset it will take a long time to run the analysis and ASVs that only have a few reads are uninformative. How about we use the same ASVs we used in the anvi’o visualization of the ISA, in other words all ASVs represented by more than 100 reads.
<- prune_taxa(taxa_sums(ps_water_mc) > 100, ps_water_mc)
ps_water_mc_100 <- prune_taxa(taxa_sums(ps_water_mc) > 10, ps_water_mc)
ps_water_mc_10 <- sum(readcount(ps_water_mc))
total_reads_ps_water <- sum(readcount(ps_water_mc_100))
total_reads_ps_water_100 <- sum(readcount(ps_water_mc_10))
total_reads_ps_water_10 <- ps_water_mc_100 ps_obj
Next, we need a properly formatted data frame for metacoder, specifically one where each row is an ASV and the coulmns are sample abundances and taxonomic lineage. I am sure there is a way to directly export this from a phyloseq object but I could not figure out how. So I made the data frame myself from the tax_table and sample_data table.
#summarize_phyloseq(ps_water_o)
#summarize_phyloseq(ps_water_mc)
<- as.data.frame(tax_table(ps_obj))
tax_tab1 <- tibble::rownames_to_column(tax_tab1, "otu_id")
tax_tab1 <- as.data.frame(t(otu_table(ps_obj)))
asv_tab1 <- tibble::rownames_to_column(asv_tab1, "otu_id")
asv_tab1 <- data.frame(sample_data(ps_obj))
sam_tab1 1] <- NULL
sam_tab1[<- tibble::rownames_to_column(sam_tab1, "sample_id")
da_samp <- merge(asv_tab1, tax_tab1, by="otu_id")
water_mc $ASV_SEQ <- NULL water_mc
Metacoder uses tibbles. Tibbles basically replace the traditional data frame in R. R for Data Science has a nice section on tibbles if you want to learn more.
<- parse_tax_data(water_mc, class_cols = c("Kingdom", "Phylum",
obj "Class", "Order",
"Family", "Genus", "ASV_ID" ))
$data$tax_abund <- calc_taxon_abund(obj, "tax_data",
objcols = da_samp$sample_id)
$data$tax_occ <- calc_n_samples(obj, "tax_abund",
objgroups = da_samp$oxstate,
cols = da_samp$sample_id)
Now we calculate the differences between the two conditions and add the results as a new table.
$data$diff_table <- compare_groups(obj,
objdata = "tax_abund",
cols = da_samp$sample_id,
groups = da_samp$oxstate)
Now we can build the heat map(s). There are a lot of parameters to set and you need to play around with your data set to see what is most appropriate. Our data set has a few Archaea but we just want to focus on Bacteria. So we select only the Bacteria and also remove any taxa with fewer than 100 reads.
#set.seed(1999)
%>%
obj filter_taxa(taxon_names %in% c("Bacteria"), subtaxa = TRUE) %>%
# filter_taxa(taxon_ranks == "o", supertaxa = TRUE) %>% # subset to the order rank
# filter_taxa(taxon_names %in% c("Proteobacteria", "Bacteroidia", "Archaea"), subtaxa = TRUE, invert = TRUE) # to remove taxa
heat_tree(
node_label = taxon_names,
node_size = n_obs,
# node_size_range = c(0.01, 0.05),
node_label_size_range = c(0.008, 0.04),
node_color = log2_median_ratio,
node_color_interval = c(-5.5, 4),
edge_color_interval = c(-5.5, 4),
node_color_trans = "area",
node_color_range = c("#D55E00", "gray", "#0072B2"),
node_size_axis_label = "OTU count",
node_color_axis_label = "Log 2 median ratio",
layout = "da",
initial_layout = "re",
overlap_avoidance = 2,
output_file = "figures/16s-da-asvs/differential_heat_tree.png")
## This is for the example above
#set.seed(10)
%>%
obj filter_taxa(taxon_names %in% c("Cyanobacteria"),
subtaxa = TRUE) %>%
heat_tree(node_label = taxon_names,
node_size = n_obs,
node_size_range = c(0.02, 0.03),
node_label_size_range = c(0.02, 0.04),
node_color = log2_median_ratio,
node_color_interval = c(-5, 5),
edge_color_interval = c(-5, 5),
node_color_range = c("#D55E00", "gray", "#0072B2"),
tree_label = taxon_names,
initial_layout = "re", layout = "da",
node_color_axis_label = "Sum of root reads",
node_size_axis_label = "Number of OTUs",
output_file = "figures/16s-da-asvs/cyano_example.svg")
We can also look only the two major classes—Proteobacteria and Bacteroidia.
#set.seed(10)
%>%
obj filter_taxa(taxon_names %in% c("Proteobacteria", "Bacteroidia"),
subtaxa = TRUE) %>%
heat_tree(node_label = taxon_names,
node_size = n_obs,
node_color = log2_median_ratio,
node_color_range = c("#D55E00", "gray", "#0072B2"),
node_color_interval = c(-3, 3),
tree_label = taxon_names,
initial_layout = "re", layout = "da",
node_color_axis_label = "Sum of root reads",
node_size_axis_label = "Number of OTUs",
output_file = "figures/16s-da-asvs/by_major_taxa.png")
Or compare by site and time.
set.seed(1)
$data$diff_table2 <- compare_groups(obj,
objdata = "tax_abund",
cols = da_samp$sample_id, # What columns of sample data to use
groups = da_samp$period) # What category each sample is assigned to
heat_tree_matrix(obj,
data = "diff_table2",
node_size = n_obs, # n_obs is a function that calculates, in this case, the number of OTUs per taxon
node_label = taxon_names,
node_color = log2_median_ratio, # A column from `obj$data$diff_table`
node_color_range = diverging_palette(), # The built-in palette for diverging data
node_color_trans = "linear", # The default is scaled by circle area
node_color_interval = c(-3, 3), # The range of `log2_median_ratio` to display
edge_color_interval = c(-3, 3), # The range of `log2_median_ratio` to display
node_size_axis_label = "Number of OTUs",
node_color_axis_label = "Log2 ratio median proportions",
layout = "davidson-harel", # The primary layout algorithm
initial_layout = "reingold-tilford", # The layout algorithm that initializes node locations
output_file = "figures/16s-da-asvs/differential_heat_tree2.png") # Saves the plot as a png file
The source code for this page can be accessed on GitHub by clicking this link.
Output files from this workflow available on figshare at doi:10.25573/data.12808427. The da_asvs_pipeline.rdata
file contains all variables and objects from this pipeline. The directory within da_asv_anvio.zip
contains all the files needed to visualize the DA ASVs in anvi’o.
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/hypocolypse/web/, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".