Complete, reproducible workflow for preparation of the 16S rRNA data set. These steps are needed before analyzing the data. For aesthetic reasons, the R code for tables & figures are hidden (see the GitHub link at the bottom of the page for the raw .Rmd
file). R code for everything else is generally shown.
::opts_chunk$set(echo = TRUE)
knitrset.seed(0199)
library(phyloseq); packageVersion("phyloseq")
library(Biostrings); packageVersion("Biostrings")
::p_load(DT, ggplot2, dplyr, microbiome, tidyverse,
pacmaninstall = FALSE, update = FALSE)
options(scipen=999)
::opts_current$get(c(
knitr"cache",
"cache.path",
"cache.rebuild",
"dependson",
"autodep"
))
You will either need to run the DADA2 workflow or the grab output file
water_pipeline.rdata
from the workflow. This file contains the sequence and taxonomy tables. See the Data Availability page for complete details.
All files needed to run this workflow can be downloaded from figshare. These are the output files from the DADA2 workflow page.
File names and descriptions:
water_pipeline.rdata
contents.
Unless otherwise noted, we primarily use phyloseq (McMurdie and Holmes 2013) in this section of the workflow to analyze the 16S rRNA data set. Before we conduct any analyses we first need to prepare our data set by curating samples, removing contaminants, and creating phyloseq objects.
Before we begin, let’s create a summary table containing some basic sample metadata and the read count data from the DADA2 pipeline. We need to inspect how total reads changed through the pipeline. Just to get an idea, we combined the results of the Track Changes analysis for these 8 samples. Table headers are as follows:
Header | Description |
---|---|
Type |
the type of source material for the sample (aka the habitat) |
Site |
the site where the sample was collected |
input |
number of raw reads after cutadapt |
filter |
reads remaining after QC & filtering |
denoiseF & denoiseR |
reads after error correction |
merged |
reads after merging forward and reverse read |
nochim |
final read count after removing chimeras |
Change |
percent change from input to nonchim |
The first thing we want to do is load the data packet produced by the final step of the DADA2 workflow. This packet (water_pipeline.rdata
) contains the ASV-by-sample table and the ASV taxonomy table.
After we load the data packet we next need to format sample names and define groups. We will use the actual sample names to define the different groups.
load("rdata/16s-dada2/water_pipeline.rdata")
<- rownames(seqtab)
samples.out <- sapply(strsplit(samples.out, "[[:digit:]]"), `[`, 1)
subject # this splits the string at first instance of a digit
<- substr(samples.out, 1, 999) # use the whole string for individuals
sample_name <- substr(samples.out, 0, 1) # use the first two letters for sample typle
type <- substr(samples.out, 2, 3) # use the next three letters for site
site <- length(unique(sample_name))
num_samp <- length(unique(type))
num_type <- length(unique(site)) num_sites
So we have a total of 8 samples from 2 sites.
Sample abbreviations
The first letter of the sample ID indicates the environment:
W = Water
The next two letter indicates the site name:
CC = Coral Caye, CR = Cayo Roldan
An the number indicates the replicate number.
For example, WCR3
is a water
sample from Cayo Roldan
replicate 3
.
Sample abbreviations:
#define a sample data frame
<- data.frame(SamName = sample_name, TYPE = type, SITE = site)
samdf rownames(samdf) <- samples.out
A. The first step is to rename the amplicon sequence variants (ASVs) so the designations are a bit more user friendly. By default, DADA2 names each ASV by its unique sequence so that data can be directly compared across studies (which is great). But this convention can get cumbersome downstream, so we rename the ASVs using a simpler convention—ASV1, ASV2, ASV3, and so on.
# this create the phyloseq object
<- phyloseq(otu_table(seqtab, taxa_are_rows = FALSE),
ps sample_data(samdf), tax_table(tax_silva))
tax_table(ps) <- cbind(tax_table(ps),
rownames(tax_table(ps)))
# adding unique ASV names
taxa_names(ps) <- paste0("ASV", seq(ntaxa(ps)))
tax_table(ps) <- cbind(tax_table(ps),
rownames(tax_table(ps)))
head(taxa_names(ps))
[1] "ASV1" "ASV2" "ASV3" "ASV4" "ASV5" "ASV6"
So the complete data set contains 1424 ASVs. We can also use the microbiome R package (Lahti, Sudarshan, and others 2017) to get some additional summary data from the phyloseq object.
Metric | Results |
---|---|
Min. number of reads | 12909 |
Max. number of reads | 33802 |
Total number of reads | 189227 |
Average number of reads | 23653 |
Median number of reads | 25252.5 |
Sparsity | 0.754 |
Any ASVs sum to 1 or less? | TRUE |
Number of singleton ASVs | 74 |
Percent of ASVs that are singletons | 5.197 |
Number of sample variables are: | 3 (SamName, TYPE, SITE) |
B. Add two final columns containing the ASV sequences and ASV IDs. This will be useful later when trying to export a fasta file. We can also take a look at the phyloseq object.
colnames(tax_table(ps)) <- c("Kingdom", "Phylum", "Class", "Order",
"Family", "Genus", "ASV_SEQ", "ASV_ID")
ps
phyloseq-class experiment-level object
otu_table() OTU Table: [ 1424 taxa and 8 samples ]
sample_data() Sample Data: [ 8 samples by 3 sample variables ]
tax_table() Taxonomy Table: [ 1424 taxa by 8 taxonomic ranks ]
C. Export sequence and taxonomy tables for the unadulterated phyloseq object for later use. We will use the prefix full
to indicate that these are the raw outputs.
write.table(tax_table(ps),
"tables/16s-data-prep/full_tax_table.txt",
sep="\t", quote = FALSE, col.names=NA)
write.table(t(otu_table(ps)),
"tables/16s-data-prep/full_seq_table.txt",
sep="\t", quote = FALSE, col.names=NA)
D. Export fasta file
# Create fasta file from tax_table
<- tax_table(ps)
table2format #retain only the column with the sequences
<- table2format[, 7]
table2format_trim <- data.frame(row.names(table2format_trim),
table2format_trim_df
table2format_trim)colnames(table2format_trim_df) <- c("ASV_ID", "ASV_SEQ")
#format fasta
$ASV_ID <- sub("ASV", ">ASV", table2format_trim_df$ASV_ID)
table2format_trim_df
write.table(table2format_trim_df, "tables/16s-data-prep/full_asv.fasta",
sep = "\r", col.names = FALSE, row.names = FALSE,
quote = FALSE, fileEncoding = "UTF-8")
Let’s see if we have any potential contaminants. We can use some inline R code to see the taxonomy table for any taxa of interest.
Let’s remove these taxa—Eukaryota because we used bacterial/archaeal primers, Mitochondria because those are likely from eukaryotes, and Chloroplast because those are likely from algae. We must do each of these in turn using phyloseq and it gets a little messy.
Why messy? The subset_taxa
command removes anything that is NA
for the specified taxonomic level or above. For example, let’s say you run the subset_taxa
command using Family != "Mitochondria
". Seems like you should get a phyloseq object with everything except Mitochondria. But actually, the command not only gets rid of Mitochondria but everything else that has NA
for Family and above. In my experience this is not well documented, and I had to dig through the files to figure out what was happening.
Anyway, to remove the taxa we do the following:
ps
object of just the taxa of interest,ps
object.Remember the original data set contained 1424 ASVs.
# generate a file with mitochondria ASVs
<- subset_taxa(ps, Family == "Mitochondria")
MT1 <- as(tax_table(MT1), "matrix")
MT1 <- MT1[, 8]
MT1 <- as.factor(MT1)
MT1df <- setdiff(taxa_names(ps), MT1df)
goodTaxa <- prune_taxa(goodTaxa, ps)
ps_no_mito ps_no_mito
phyloseq-class experiment-level object
otu_table() OTU Table: [ 1412 taxa and 8 samples ]
sample_data() Sample Data: [ 8 samples by 3 sample variables ]
tax_table() Taxonomy Table: [ 1412 taxa by 8 taxonomic ranks ]
Looks like this removed 12 Mitochondria ASVs. We will duplicate the code block to remove other groups.
# generate a file with mitochondria ASVs
<- subset_taxa(ps_no_mito, Order == "Chloroplast")
CH1 <- as(tax_table(CH1), "matrix")
CH1 <- CH1[, 8]
CH1 <- as.factor(CH1)
CH1df <- setdiff(taxa_names(ps_no_mito), CH1df)
goodTaxa <- prune_taxa(goodTaxa, ps_no_mito)
ps_no_chloro ps_no_chloro
phyloseq-class experiment-level object
otu_table() OTU Table: [ 1275 taxa and 8 samples ]
sample_data() Sample Data: [ 8 samples by 3 sample variables ]
tax_table() Taxonomy Table: [ 1275 taxa by 8 taxonomic ranks ]
The code removed an additional 137 Chloroplast ASVs.
Turns out we had no eukaryotic ASVs but here is the code just in case.
# generate a file with mitochondria ASVs
<- subset_taxa(ps_no_chloro, Kingdom == "Eukaryota")
EU1 <- as(tax_table(EU1), "matrix")
EU1 <- EU1[, 8]
EU1 <- as.factor(EU1)
EU1df <- setdiff(taxa_names(ps_no_chloro), EU1df)
goodTaxa <- prune_taxa(goodTaxa, ps_no_chloro)
ps_no_euk ps_no_euk
Here we can just use the straight up subset_taxa
command since we do not need to worry about any ranks above Kingdom also being removed.
<- subset_taxa(ps_no_chloro, !is.na(Kingdom))
ps_filt ps_filt
phyloseq-class experiment-level object
otu_table() OTU Table: [ 1272 taxa and 8 samples ]
sample_data() Sample Data: [ 8 samples by 3 sample variables ]
tax_table() Taxonomy Table: [ 1272 taxa by 8 taxonomic ranks ]
The code eliminated an additional 3 Kingdom level NA ASVs from the phyloseq object.
Now that we have a trimmed data set, we can save copies of the sequence and taxonomy tables just in case we want a quick look at these data.
write.table(tax_table(ps_filt),
"tables/16s-data-prep/trim_tax_table.txt", sep="\t",
quote = FALSE, col.names=NA)
write.table(t(otu_table(ps_filt)),
"tables/16s-data-prep/trim_seq_table.txt", sep="\t",
quote = FALSE, col.names=NA)
write.table(sample_data(ps_filt),
"tables/16s-data-prep/sample_table.txt",
sep="\t", quote = FALSE, col.names=NA)
And generate a fasta file for all ASVs in the trimmed data set.
# Create fasta file from tax_table
<- tax_table(ps_filt)
table2format #retain only the column with the sequences
<- table2format[, 7]
table2format_trim <- data.frame(row.names(table2format_trim),
table2format_trim_df
table2format_trim)colnames(table2format_trim_df) <- c("ASV_ID", "ASV_SEQ")
#format fasta
$ASV_ID <- sub("ASV", ">ASV", table2format_trim_df$ASV_ID)
table2format_trim_df
write.table(table2format_trim_df, "tables/16s-data-prep/trim_asv.fasta",
sep = "\r", col.names = FALSE, row.names = FALSE,
quote = FALSE, fileEncoding = "UTF-8")
One last thing is to rename the phyloseq object and save some tables and images for the next part.
<- ps_filt
ps_water save.image("rdata/16s-data-prep/16s-data_prep.rdata")
saveRDS(otu_table(ps_water), "rdata/16s-data-prep/ps_water-seqtab.rds")
saveRDS(sample_data(ps_water), "rdata/16s-data-prep/ps_water-sample.rds")
saveRDS(tax_table(ps_water), "rdata/16s-data-prep/ps_water-taxtab.rds")
That’s all for this part! In the next section we work through analyzing the diversity of the water samples.
You can find the source code for this page by clicking this link.
Output files from this workflow available on figshare at doi:10.25573/data.12403925. Here you will find the sample metadata table, sequence tables, taxonomy tables, and ASV fasta files from the full phyloseq object (before removing contaminants) and the trimmed data set (after removing contaminants). The 16s-data_prep.rdata file contains all variables and phyloseq objects from this pipeline and is needed for the next workflow (as are the .rds files).
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/hypocolypse/web/, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".