No 2. Data Preparation

Complete, reproducible workflow for preparation of the 16S rRNA data set. These steps are needed before analyzing the data. For aesthetic reasons, the R code for tables & figures are hidden (see the GitHub link at the bottom of the page for the raw .Rmd file). R code for everything else is generally shown.

Show setup information.
knitr::opts_chunk$set(echo = TRUE)
set.seed(0199)
library(phyloseq); packageVersion("phyloseq")
library(Biostrings); packageVersion("Biostrings")
pacman::p_load(DT, ggplot2, dplyr, microbiome, tidyverse,   
               install = FALSE, update = FALSE)
options(scipen=999)
knitr::opts_current$get(c(
  "cache",
  "cache.path",
  "cache.rebuild",
  "dependson",
  "autodep"
))

Data Availability

You will either need to run the DADA2 workflow or the grab output file water_pipeline.rdata from the workflow. This file contains the sequence and taxonomy tables. See the Data Availability page for complete details.

All files needed to run this workflow can be downloaded from figshare. These are the output files from the DADA2 workflow page.

File names and descriptions:

water_pipeline.rdata contents.

Workflow Overview

Unless otherwise noted, we primarily use phyloseq (McMurdie and Holmes 2013) in this section of the workflow to analyze the 16S rRNA data set. Before we conduct any analyses we first need to prepare our data set by curating samples, removing contaminants, and creating phyloseq objects.

Read Counts Assessment

Before we begin, let’s create a summary table containing some basic sample metadata and the read count data from the DADA2 pipeline. We need to inspect how total reads changed through the pipeline. Just to get an idea, we combined the results of the Track Changes analysis for these 8 samples. Table headers are as follows:

Header Description
Type the type of source material for the sample (aka the habitat)
Site the site where the sample was collected
input number of raw reads after cutadapt
filter reads remaining after QC & filtering
denoiseF & denoiseR reads after error correction
merged reads after merging forward and reverse read
nochim final read count after removing chimeras
Change percent change from input to nonchim


Defining Groups

  1. The first thing we want to do is load the data packet produced by the final step of the DADA2 workflow. This packet (water_pipeline.rdata) contains the ASV-by-sample table and the ASV taxonomy table.

  2. After we load the data packet we next need to format sample names and define groups. We will use the actual sample names to define the different groups.

load("rdata/16s-dada2/water_pipeline.rdata")
samples.out <- rownames(seqtab)
subject <- sapply(strsplit(samples.out, "[[:digit:]]"), `[`, 1)
# this splits the string at first instance of a digit
sample_name <- substr(samples.out, 1, 999)  # use the whole string for individuals
type <- substr(samples.out, 0, 1)  # use the first two letters for sample typle
site <- substr(samples.out, 2, 3)  # use the next three letters for site
num_samp <- length(unique(sample_name))
num_type <- length(unique(type))
num_sites <- length(unique(site))

So we have a total of 8 samples from 2 sites.

Sample abbreviations

The first letter of the sample ID indicates the environment:

W = Water

The next two letter indicates the site name:

CC = Coral Caye, CR = Cayo Roldan

An the number indicates the replicate number.

For example, WCR3 is a water sample from Cayo Roldan replicate 3.

Sample abbreviations:

  1. And finally we define a sample data frame that holds the different groups we extracted from the sample names.
#define a sample data frame
samdf <- data.frame(SamName = sample_name, TYPE = type, SITE = site)
rownames(samdf) <- samples.out

Phyloseq Objects

A. The first step is to rename the amplicon sequence variants (ASVs) so the designations are a bit more user friendly. By default, DADA2 names each ASV by its unique sequence so that data can be directly compared across studies (which is great). But this convention can get cumbersome downstream, so we rename the ASVs using a simpler convention—ASV1, ASV2, ASV3, and so on.

# this create the phyloseq object
ps <- phyloseq(otu_table(seqtab, taxa_are_rows = FALSE),
                   sample_data(samdf), tax_table(tax_silva))
tax_table(ps) <- cbind(tax_table(ps),
                           rownames(tax_table(ps)))

# adding unique ASV names
taxa_names(ps) <- paste0("ASV", seq(ntaxa(ps)))
tax_table(ps) <- cbind(tax_table(ps),
                           rownames(tax_table(ps)))
head(taxa_names(ps))
[1] "ASV1" "ASV2" "ASV3" "ASV4" "ASV5" "ASV6"

So the complete data set contains 1424 ASVs. We can also use the microbiome R package (Lahti, Sudarshan, and others 2017) to get some additional summary data from the phyloseq object.

Metric Results
Min. number of reads 12909
Max. number of reads 33802
Total number of reads 189227
Average number of reads 23653
Median number of reads 25252.5
Sparsity 0.754
Any ASVs sum to 1 or less? TRUE
Number of singleton ASVs 74
Percent of ASVs that are singletons 5.197
Number of sample variables are: 3 (SamName, TYPE, SITE)

B. Add two final columns containing the ASV sequences and ASV IDs. This will be useful later when trying to export a fasta file. We can also take a look at the phyloseq object.

colnames(tax_table(ps)) <- c("Kingdom", "Phylum", "Class", "Order",
    "Family", "Genus", "ASV_SEQ", "ASV_ID")
ps
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 1424 taxa and 8 samples ]
sample_data() Sample Data:       [ 8 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 1424 taxa by 8 taxonomic ranks ]

C. Export sequence and taxonomy tables for the unadulterated phyloseq object for later use. We will use the prefix full to indicate that these are the raw outputs.

write.table(tax_table(ps),
            "tables/16s-data-prep/full_tax_table.txt",
            sep="\t", quote = FALSE, col.names=NA)
write.table(t(otu_table(ps)),
            "tables/16s-data-prep/full_seq_table.txt",
            sep="\t", quote = FALSE, col.names=NA)

D. Export fasta file

# Create fasta file from tax_table
table2format <- tax_table(ps)
#retain only the column with the sequences
table2format_trim <- table2format[, 7]
table2format_trim_df <- data.frame(row.names(table2format_trim),
                                   table2format_trim)
colnames(table2format_trim_df) <- c("ASV_ID", "ASV_SEQ")
#format fasta
table2format_trim_df$ASV_ID <- sub("ASV", ">ASV", table2format_trim_df$ASV_ID)

write.table(table2format_trim_df, "tables/16s-data-prep/full_asv.fasta",
            sep = "\r", col.names = FALSE, row.names = FALSE,
            quote = FALSE, fileEncoding = "UTF-8")

Curation

Let’s see if we have any potential contaminants. We can use some inline R code to see the taxonomy table for any taxa of interest.

Let’s remove these taxa—Eukaryota because we used bacterial/archaeal primers, Mitochondria because those are likely from eukaryotes, and Chloroplast because those are likely from algae. We must do each of these in turn using phyloseq and it gets a little messy.

Why messy? The subset_taxa command removes anything that is NA for the specified taxonomic level or above. For example, let’s say you run the subset_taxa command using Family != "Mitochondria". Seems like you should get a phyloseq object with everything except Mitochondria. But actually, the command not only gets rid of Mitochondria but everything else that has NA for Family and above. In my experience this is not well documented, and I had to dig through the files to figure out what was happening.

Anyway, to remove the taxa we do the following:

Remove Mitochondria ASVs

Remember the original data set contained 1424 ASVs.

# generate a file with mitochondria ASVs
MT1 <- subset_taxa(ps, Family == "Mitochondria")
MT1 <-  as(tax_table(MT1), "matrix")
MT1 <- MT1[, 8]
MT1df <- as.factor(MT1)
goodTaxa <- setdiff(taxa_names(ps), MT1df)
ps_no_mito <- prune_taxa(goodTaxa, ps)
ps_no_mito
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 1412 taxa and 8 samples ]
sample_data() Sample Data:       [ 8 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 1412 taxa by 8 taxonomic ranks ]

Looks like this removed 12 Mitochondria ASVs. We will duplicate the code block to remove other groups.

Remove Chloroplast ASVs

# generate a file with mitochondria ASVs
CH1 <- subset_taxa(ps_no_mito, Order == "Chloroplast")
CH1 <-  as(tax_table(CH1), "matrix")
CH1 <- CH1[, 8]
CH1df <- as.factor(CH1)
goodTaxa <- setdiff(taxa_names(ps_no_mito), CH1df)
ps_no_chloro <- prune_taxa(goodTaxa, ps_no_mito)
ps_no_chloro
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 1275 taxa and 8 samples ]
sample_data() Sample Data:       [ 8 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 1275 taxa by 8 taxonomic ranks ]

The code removed an additional 137 Chloroplast ASVs.

Remove Eukaryotic ASVs

Turns out we had no eukaryotic ASVs but here is the code just in case.

# generate a file with mitochondria ASVs
EU1 <- subset_taxa(ps_no_chloro, Kingdom == "Eukaryota")
EU1 <-  as(tax_table(EU1), "matrix")
EU1 <- EU1[, 8]
EU1df <- as.factor(EU1)
goodTaxa <- setdiff(taxa_names(ps_no_chloro), EU1df)
ps_no_euk <- prune_taxa(goodTaxa, ps_no_chloro)
ps_no_euk

Remove any Kingdom NAs

Here we can just use the straight up subset_taxa command since we do not need to worry about any ranks above Kingdom also being removed.

ps_filt <- subset_taxa(ps_no_chloro, !is.na(Kingdom))
ps_filt
phyloseq-class experiment-level object
otu_table()   OTU Table:         [ 1272 taxa and 8 samples ]
sample_data() Sample Data:       [ 8 samples by 3 sample variables ]
tax_table()   Taxonomy Table:    [ 1272 taxa by 8 taxonomic ranks ]

The code eliminated an additional 3 Kingdom level NA ASVs from the phyloseq object.

Now that we have a trimmed data set, we can save copies of the sequence and taxonomy tables just in case we want a quick look at these data.

write.table(tax_table(ps_filt),
            "tables/16s-data-prep/trim_tax_table.txt", sep="\t",
            quote = FALSE, col.names=NA)
write.table(t(otu_table(ps_filt)),
            "tables/16s-data-prep/trim_seq_table.txt", sep="\t",
            quote = FALSE, col.names=NA)
write.table(sample_data(ps_filt),
            "tables/16s-data-prep/sample_table.txt",
            sep="\t", quote = FALSE, col.names=NA)

And generate a fasta file for all ASVs in the trimmed data set.

# Create fasta file from tax_table
table2format <- tax_table(ps_filt)
#retain only the column with the sequences
table2format_trim <- table2format[, 7]
table2format_trim_df <- data.frame(row.names(table2format_trim),
                                   table2format_trim)
colnames(table2format_trim_df) <- c("ASV_ID", "ASV_SEQ")
#format fasta
table2format_trim_df$ASV_ID <- sub("ASV", ">ASV", table2format_trim_df$ASV_ID)

write.table(table2format_trim_df, "tables/16s-data-prep/trim_asv.fasta",
            sep = "\r", col.names = FALSE, row.names = FALSE,
            quote = FALSE, fileEncoding = "UTF-8")

Save Objects

One last thing is to rename the phyloseq object and save some tables and images for the next part.

ps_water <- ps_filt
save.image("rdata/16s-data-prep/16s-data_prep.rdata")
saveRDS(otu_table(ps_water), "rdata/16s-data-prep/ps_water-seqtab.rds")
saveRDS(sample_data(ps_water), "rdata/16s-data-prep/ps_water-sample.rds")
saveRDS(tax_table(ps_water), "rdata/16s-data-prep/ps_water-taxtab.rds")

That’s all for this part! In the next section we work through analyzing the diversity of the water samples.


Previous

Next

Source Code

You can find the source code for this page by clicking this link.

Data Availability

Output files from this workflow available on figshare at doi:10.25573/data.12403925. Here you will find the sample metadata table, sequence tables, taxonomy tables, and ASV fasta files from the full phyloseq object (before removing contaminants) and the trimmed data set (after removing contaminants). The 16s-data_prep.rdata file contains all variables and phyloseq objects from this pipeline and is needed for the next workflow (as are the .rds files).

Lahti, Leo, Shetty Sudarshan, and others. 2017. “Tools for Microbiome Analysis in r. Version 1.9.97.” https://github.com/microbiome/microbiome.
McMurdie, Paul J, and Susan Holmes. 2013. “Phyloseq: An r Package for Reproducible Interactive Analysis and Graphics of Microbiome Census Data.” PLoS One 8 (4): e61217. https://doi.org/10.1371/journal.pone.0061217.

References

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/hypocolypse/web/, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".