Show setup information.

knitr::opts_chunk$set(echo = TRUE)
set.seed(0199)
library(phyloseq); packageVersion("phyloseq")
library(Biostrings); packageVersion("Biostrings")
pacman::p_load(DT, ggplot2, dplyr, microbiome, tidyverse,   
               install = FALSE, update = FALSE)
options(scipen=999)
knitr::opts_current$get(c(
  "cache",
  "cache.path",
  "cache.rebuild",
  "dependson",
  "autodep"
))

Data Availability

You will either need to run the DADA2 workflow or the grab output file water_pipeline.rdata from the workflow. This file contains the sequence and taxonomy tables. See the Data Availability page for complete details.

All files needed to run this workflow can be downloaded from figshare. These are the output files from the DADA2 workflow page.

File names and descriptions:

RUN01_read_changes.txt: Tracking changes in read counts (per sample) from the beginning to end of the DADA2 workflow.
RUN02_read_changes.txt: Tracking changes in read counts (per sample) from the beginning to end of the DADA2 workflow.
water_pipeline.rdata: contains water sample sequence and taxonomy tables from the DADA2 pipeline needed for subsequent analyses. To see the Objects, in R run load(“water_pipeline.rdata,” verbose=TRUE)
combo_pipeline.rdata: contains sequence and taxonomy tables from the DADA2 pipeline for all samples. To see the Objects, in R run load(“combo_pipeline.rdata,” verbose=TRUE)

water_pipeline.rdata contents.

seqtab.1: Sequence table from Run01 before merging with Run02.
seqtab.1: Sequence table from Run02 before merging with Run01.
st.sum: merged sequence table before removing chimeras
st.all: duplicate of st.sum
seqtab: merged sequence table after removing chimeras
tax_silva: Silva (v132) taxonomy table of seqtab
tax_gg: GreenGenes taxonomy table of seqtab

Workflow Overview

Unless otherwise noted, we primarily use phyloseq (McMurdie and Holmes 2013) in this section of the workflow to analyze the 16S rRNA data set. Before we conduct any analyses we first need to prepare our data set by curating samples, removing contaminants, and creating phyloseq objects.

Read Counts Assessment

Before we begin, let’s create a summary table containing some basic sample metadata and the read count data from the DADA2 pipeline. We need to inspect how total reads changed through the pipeline. Just to get an idea, we combined the results of the Track Changes analysis for these 8 samples. Table headers are as follows:

Header	Description
`Type`	the type of source material for the sample (aka the habitat)
`Site`	the site where the sample was collected
`input`	number of raw reads after `cutadapt`
`filter`	reads remaining after QC & filtering
`denoiseF` & `denoiseR`	reads after error correction
`merged`	reads after merging forward and reverse read
`nochim`	final read count after removing chimeras
`Change`	percent change from `input` to `nonchim`

Defining Groups

The first thing we want to do is load the data packet produced by the final step of the DADA2 workflow. This packet (water_pipeline.rdata) contains the ASV-by-sample table and the ASV taxonomy table.
After we load the data packet we next need to format sample names and define groups. We will use the actual sample names to define the different groups.

load("rdata/16s-dada2/water_pipeline.rdata")
samples.out <- rownames(seqtab)
subject <- sapply(strsplit(samples.out, "[[:digit:]]"), `[`, 1)
# this splits the string at first instance of a digit
sample_name <- substr(samples.out, 1, 999)  # use the whole string for individuals
type <- substr(samples.out, 0, 1)  # use the first two letters for sample typle
site <- substr(samples.out, 2, 3)  # use the next three letters for site
num_samp <- length(unique(sample_name))
num_type <- length(unique(type))
num_sites <- length(unique(site))

So we have a total of 8 samples from 2 sites.

Sample abbreviations

The first letter of the sample ID indicates the environment:

W = Water

The next two letter indicates the site name:

CC = Coral Caye, CR = Cayo Roldan

An the number indicates the replicate number.

For example, WCR3 is a water sample from Cayo Roldan replicate 3.

Sample abbreviations:

WCC = Water sample from Coral Caye
WCR = Water sample from Cayo Roldan

And finally we define a sample data frame that holds the different groups we extracted from the sample names.

#define a sample data frame
samdf <- data.frame(SamName = sample_name, TYPE = type, SITE = site)
rownames(samdf) <- samples.out

Phyloseq Objects

A. The first step is to rename the amplicon sequence variants (ASVs) so the designations are a bit more user friendly. By default, DADA2 names each ASV by its unique sequence so that data can be directly compared across studies (which is great). But this convention can get cumbersome downstream, so we rename the ASVs using a simpler convention—ASV1, ASV2, ASV3, and so on.

# this create the phyloseq object
ps <- phyloseq(otu_table(seqtab, taxa_are_rows = FALSE),
                   sample_data(samdf), tax_table(tax_silva))
tax_table(ps) <- cbind(tax_table(ps),
                           rownames(tax_table(ps)))

# adding unique ASV names
taxa_names(ps) <- paste0("ASV", seq(ntaxa(ps)))
tax_table(ps) <- cbind(tax_table(ps),
                           rownames(tax_table(ps)))
head(taxa_names(ps))

[1] "ASV1" "ASV2" "ASV3" "ASV4" "ASV5" "ASV6"

So the complete data set contains 1424 ASVs. We can also use the microbiome R package (Lahti, Sudarshan, and others 2017) to get some additional summary data from the phyloseq object.

Metric	Results
Min. number of reads	12909
Max. number of reads	33802
Total number of reads	189227
Average number of reads	23653
Median number of reads	25252.5
Sparsity	0.754
Any ASVs sum to 1 or less?	TRUE
Number of singleton ASVs	74
Percent of ASVs that are singletons	5.197
Number of sample variables are:	3 (SamName, TYPE, SITE)

B. Add two final columns containing the ASV sequences and ASV IDs. This will be useful later when trying to export a fasta file. We can also take a look at the phyloseq object.

colnames(tax_table(ps)) <- c("Kingdom", "Phylum", "Class", "Order",
    "Family", "Genus", "ASV_SEQ", "ASV_ID")
ps

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 1424 taxa and 8 samples ]
sample_data() Sample Data:       [ 8 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 1424 taxa by 8 taxonomic ranks ]

C. Export sequence and taxonomy tables for the unadulterated phyloseq object for later use. We will use the prefix full to indicate that these are the raw outputs.

write.table(tax_table(ps),
            "tables/16s-data-prep/full_tax_table.txt",
            sep="\t", quote = FALSE, col.names=NA)
write.table(t(otu_table(ps)),
            "tables/16s-data-prep/full_seq_table.txt",
            sep="\t", quote = FALSE, col.names=NA)

D. Export fasta file

# Create fasta file from tax_table
table2format <- tax_table(ps)
#retain only the column with the sequences
table2format_trim <- table2format[, 7]
table2format_trim_df <- data.frame(row.names(table2format_trim),
                                   table2format_trim)
colnames(table2format_trim_df) <- c("ASV_ID", "ASV_SEQ")
#format fasta
table2format_trim_df$ASV_ID <- sub("ASV", ">ASV", table2format_trim_df$ASV_ID)

write.table(table2format_trim_df, "tables/16s-data-prep/full_asv.fasta",
            sep = "\r", col.names = FALSE, row.names = FALSE,
            quote = FALSE, fileEncoding = "UTF-8")

Curation

Let’s see if we have any potential contaminants. We can use some inline R code to see the taxonomy table for any taxa of interest.

The code (hidden by default) is written as `r "Chloroplast" %in% tax_table(ps)`.

Are Mitochondria present? TRUE
Are Chloroplast present? TRUE
Are Eukaryota present? FALSE

Let’s remove these taxa—Eukaryota because we used bacterial/archaeal primers, Mitochondria because those are likely from eukaryotes, and Chloroplast because those are likely from algae. We must do each of these in turn using phyloseq and it gets a little messy.

Why messy? The subset_taxa command removes anything that is NA for the specified taxonomic level or above. For example, let’s say you run the subset_taxa command using Family != "Mitochondria". Seems like you should get a phyloseq object with everything except Mitochondria. But actually, the command not only gets rid of Mitochondria but everything else that has NA for Family and above. In my experience this is not well documented, and I had to dig through the files to figure out what was happening.

Anyway, to remove the taxa we do the following:

Subset the taxa and generate a ps object of just the taxa of interest,
Select the ASV column only, turn it into a factor, and use this to remove from the ps object.

Remove Mitochondria ASVs

Remember the original data set contained 1424 ASVs.

# generate a file with mitochondria ASVs
MT1 <- subset_taxa(ps, Family == "Mitochondria")
MT1 <-  as(tax_table(MT1), "matrix")
MT1 <- MT1[, 8]
MT1df <- as.factor(MT1)
goodTaxa <- setdiff(taxa_names(ps), MT1df)
ps_no_mito <- prune_taxa(goodTaxa, ps)
ps_no_mito

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 1412 taxa and 8 samples ]
sample_data() Sample Data:       [ 8 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 1412 taxa by 8 taxonomic ranks ]

Looks like this removed 12 Mitochondria ASVs. We will duplicate the code block to remove other groups.

Remove Chloroplast ASVs

# generate a file with mitochondria ASVs
CH1 <- subset_taxa(ps_no_mito, Order == "Chloroplast")
CH1 <-  as(tax_table(CH1), "matrix")
CH1 <- CH1[, 8]
CH1df <- as.factor(CH1)
goodTaxa <- setdiff(taxa_names(ps_no_mito), CH1df)
ps_no_chloro <- prune_taxa(goodTaxa, ps_no_mito)
ps_no_chloro

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 1275 taxa and 8 samples ]
sample_data() Sample Data:       [ 8 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 1275 taxa by 8 taxonomic ranks ]

The code removed an additional 137 Chloroplast ASVs.

Remove Eukaryotic ASVs

Turns out we had no eukaryotic ASVs but here is the code just in case.

# generate a file with mitochondria ASVs
EU1 <- subset_taxa(ps_no_chloro, Kingdom == "Eukaryota")
EU1 <-  as(tax_table(EU1), "matrix")
EU1 <- EU1[, 8]
EU1df <- as.factor(EU1)
goodTaxa <- setdiff(taxa_names(ps_no_chloro), EU1df)
ps_no_euk <- prune_taxa(goodTaxa, ps_no_chloro)
ps_no_euk

Remove any Kingdom NAs

Here we can just use the straight up subset_taxa command since we do not need to worry about any ranks above Kingdom also being removed.

ps_filt <- subset_taxa(ps_no_chloro, !is.na(Kingdom))
ps_filt

phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 1272 taxa and 8 samples ]
sample_data() Sample Data:       [ 8 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 1272 taxa by 8 taxonomic ranks ]

The code eliminated an additional 3 Kingdom level NA ASVs from the phyloseq object.

Now that we have a trimmed data set, we can save copies of the sequence and taxonomy tables just in case we want a quick look at these data.

write.table(tax_table(ps_filt),
            "tables/16s-data-prep/trim_tax_table.txt", sep="\t",
            quote = FALSE, col.names=NA)
write.table(t(otu_table(ps_filt)),
            "tables/16s-data-prep/trim_seq_table.txt", sep="\t",
            quote = FALSE, col.names=NA)
write.table(sample_data(ps_filt),
            "tables/16s-data-prep/sample_table.txt",
            sep="\t", quote = FALSE, col.names=NA)

And generate a fasta file for all ASVs in the trimmed data set.

# Create fasta file from tax_table
table2format <- tax_table(ps_filt)
#retain only the column with the sequences
table2format_trim <- table2format[, 7]
table2format_trim_df <- data.frame(row.names(table2format_trim),
                                   table2format_trim)
colnames(table2format_trim_df) <- c("ASV_ID", "ASV_SEQ")
#format fasta
table2format_trim_df$ASV_ID <- sub("ASV", ">ASV", table2format_trim_df$ASV_ID)

write.table(table2format_trim_df, "tables/16s-data-prep/trim_asv.fasta",
            sep = "\r", col.names = FALSE, row.names = FALSE,
            quote = FALSE, fileEncoding = "UTF-8")

Save Objects

One last thing is to rename the phyloseq object and save some tables and images for the next part.

ps_water <- ps_filt
save.image("rdata/16s-data-prep/16s-data_prep.rdata")
saveRDS(otu_table(ps_water), "rdata/16s-data-prep/ps_water-seqtab.rds")
saveRDS(sample_data(ps_water), "rdata/16s-data-prep/ps_water-sample.rds")
saveRDS(tax_table(ps_water), "rdata/16s-data-prep/ps_water-taxtab.rds")

That’s all for this part! In the next section we work through analyzing the diversity of the water samples.

Source Code

You can find the source code for this page by clicking this link.

Data Availability

Output files from this workflow available on figshare at doi:10.25573/data.12403925. Here you will find the sample metadata table, sequence tables, taxonomy tables, and ASV fasta files from the full phyloseq object (before removing contaminants) and the trimmed data set (after removing contaminants). The 16s-data_prep.rdata file contains all variables and phyloseq objects from this pipeline and is needed for the next workflow (as are the .rds files).

Lahti, Leo, Shetty Sudarshan, and others. 2017. “Tools for Microbiome Analysis in r. Version 1.9.97.” https://github.com/microbiome/microbiome.

McMurdie, Paul J, and Susan Holmes. 2013. “Phyloseq: An r Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data.” PLoS One 8 (4): e61217. https://doi.org/10.1371/journal.pone.0061217.

No 2. Data Preparation

Data Availability

Workflow Overview

Read Counts Assessment

Defining Groups

Phyloseq Objects

Curation

Remove Mitochondria ASVs

Remove Chloroplast ASVs

Remove Eukaryotic ASVs

Remove any Kingdom NAs

Save Objects

Source Code

Data Availability

References

Corrections

Reuse