Data Availability

In this section you can find information on obtaining raw sequencing data, data products, and processing scripts. Of course, all the code is also embedded in the workflows on the website as well.

Archived Sequence Data

All sequence data is deposited at the European Nucleotide Archive (ENA). See below for instructions on submitting data to the ENA and the files we used to deposit the data.

The trimmed 16S rRNA data (primers removed) are deposited under Project Accession number PRJEB36632 (ERP119845), sample accession numbers ERS4291994-ERS4292031.

The metagenomic data from the four water column samples are also deposited under Project Accession number PRJEB36632 (ERP119845). The individual sample accession numbers for these data are ERS4578390-ERS4578393.

Pipeline Data

Data for each individual pipeline are available through the Smithsonian figshare under a single collection at doi:10.25573/data.c.5025362. In addition, data from each pipeline are available for download from figshare using the links at the bottom of each page (where applicable).

Submitting Sequence Data

We submitted out data to the European Nucleotide Archive (ENA). The ENA does not like RAW data and prefers to have primers removed. So we submitted the trimmed Fastq files to the ENA. You can find these data under the study accession number PRJEB36632 (ERP119845). The RAW files on our figshare site (see above).

To submit to the ENA you need two data tables (plus your sequence data). One file describes the samples and the other file describes the sequencing data.

You can download these data tables here:

16S rRNA


Instructions for Submitting to the ENA

  1. go to and select Submit to ENA.
  2. Login or Register.
  3. Go to New Submission tab and select Register study (project).
  4. Hit Next
  5. Enter details and hit Submit.
  6. Next, Select Checklist. This will be specific to the type of samples you have and basically will create a template so you can add your sample metadata. For this study I chose GSC MIxS water, checklist accession number ERC000024
  7. Next
  8. Now go through and select/deselect fields as needed. Note, some fields are mandatory.
  9. Once finished, hit Next to fill in any details that apply to All samples.
  10. Fill in the sheet
  11. Hit the Next button, change the number of samples, and download the sheet. (This is a little messy and you just need to wade through it)
  12. Once everything looks good and uploaded, click Next to get to the Run page.
  13. Select Two Fastq files (Paired) and Download the template.
  14. Before filling out the form, gzip .gz all the trimmed fastq files (these are what you submit)
  15. Then run md5sum on all the tar.gz files.
  16. Upload all the fastq files. You must do this before submitting the experiment spreadsheet. There are different options for this step. I opened a terminal and typed ftp I entered my username and password. Then typed mput *.gz. The problem was that I had to say yes for each file. Should be a way around this. Documentation can be found here. Probably need to type prompt first.
  17. Fill in the sheet including md5 checksum values.
  18. Upload and submit the sheet.

Source Code

The source code for this page can be accessed on GitHub by clicking this link.



If you see mistakes or want to suggest changes, please create an issue on the source repository.


Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".