No 1. Working Environment

In this section we go through our steps to build a working environment for data processing and analysis.

For the metagenomic analysis we largely leave the R environment. The first step of our metagenomic workflow is to set up our working environment, which we will do in two steps.

In the first part we install the backbone of our computational operations—Miniconda and anvi’o. Then we build the annotation databases and install the needed software.

Metagenomic datasets are often large, so you may need to run these workflows on a compute cluster; a personal computer will be insufficient for many of the steps. We ran our analyses on the Smithsonian cluster (Hydra). For each step in the workflow we include our Hydra job scripts, embedded in a handy dropdown.

By the time you read this, some of these instructions may be out of date. Consult the source material for programs/databases to ensure up-to-date installations.

For the metagenomic analysis we mainly use anvi’o—an innovative (dare I say gnarly) platform for metagenomic analysis and visualization. If you put in the time to learn this platform, you will be rewarded—beautifully rendered interactive graphics, a deeper understanding of metagenomics, and the ability to explore your data in novel ways. As the developers like to say, anvi’o helps you get your hands dirty. More on that in a minute.

We will also use other tools to generate data products that we can port into the anvi’o ecosystem. First though, let’s set up a Miniconda environment. We will use Miniconda to install a stable version of anvi’o as well as several other tools as the need arises.

Miniconda

I chose to use conda as my package management system and Miniconda as the conda installer. I prefer Miniconda because it is a (free) minimal installer with a small footprint that I can customize to suit my needs. We will use conda to install almost every program in this workflow.

Conda allows you to create environments that are isolated from the rest of your system. This is useful for many reasons. For example, as a user on a server you may not have sudo privileges, and installing software can be a huge pain for everyone. With conda you basically create a computer within a computer so that what you do does not affect anything outside the Miniconda directory. In addition, every environment within Miniconda is isolated from the others. This is important because it lets you install software packages that require, for example, different versions of Python.

The first step is to install Miniconda (available in multiple flavors).

#get the latest version of the installer.
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
sh Miniconda3-latest-Linux-x86_64.sh
#update conda
conda update -n base -c defaults conda
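If you want to be careful, you can check that the installer downloaded intact before you run it. The sketch below assumes `sha256sum` is available (it is on most Linux systems) and that you copy the expected hash from the Miniconda download page yourself. `verify_sha256` is a hypothetical helper name, not part of conda.

```shell
# Compare a file's SHA-256 digest against an expected value.
# $1 = file to check, $2 = expected hex digest (copied from the download page)
verify_sha256() {
    actual=$(sha256sum "$1" | awk '{ print $1 }')
    if [ "$actual" = "$2" ]; then
        echo "OK: $1"
    else
        echo "MISMATCH: $1" >&2
        return 1
    fi
}

# Example (the hash is a placeholder you must fill in yourself):
# verify_sha256 Miniconda3-latest-Linux-x86_64.sh "<hash from the download page>"
```

If the function reports a mismatch, download the installer again rather than running it.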

And then add these channels. Different packages are stored on different channels. To get a specific package you often must specify a specific remote channel.

conda config --env --add channels conda-forge
conda config --env --add channels bioconda

If you are not familiar with conda and plan to use it as a package management system, I strongly encourage you to take some time and learn the ins and outs. I promise it will be worth your time. conda is fairly straightforward to use and you will use it a lot. There are a ton of resources out there accessible using your favorite search engine.

If you ever want to know if there is a conda package for some program, just search for the program name + conda in your favorite search engine. If there is, the first link should be to the Anaconda Cloud. There you will find details about the package like installation instructions, last update, total downloads, etc.
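You can also ask conda directly from the command line. This is a minimal sketch, assuming conda is already installed and that the package you want lives on the bioconda channel (pass a second argument to look somewhere else). `search_pkg` is just an illustrative wrapper around `conda search`.

```shell
# Look up a package by name on a given channel (defaults to bioconda).
# $1 = package name, $2 = optional channel name
search_pkg() {
    conda search -c "${2:-bioconda}" "$1"
}

# Examples (these need conda and a network connection):
# search_pkg prodigal
# search_pkg r-optparse conda-forge
```

If the package exists, conda lists the available versions and builds; otherwise it reports that no match was found.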

Anvi’o

Now we are ready to install anvi’o (Eren et al. 2015). If you plan to work with anvi’o I have a few recommendations.

  1. Take some time to look around the anvi’o website. There is a TON of information, not just about the software platform but about metagenomics in general. For example, have a look at the metagenomic vocabulary page.
  2. The anvi’o developers offer many options for getting help. I suggest joining their Slack channel or Google Group. For technical issues check out the anvi’o GitHub site.
  3. There are many tutorials built around core anvi’o functionality. Be sure to work through some of those.
  4. The data page has data and reproducible bioinformatic workflows for all their publications. This is an invaluable resource that I use all the time.
  5. You can find up-to-date installation instructions on the anvi’o website.
  6. Another excellent resource if you’re just getting started is Happy Belly Bioinformatics. That’s right. Great name, great site.

For posterity, I will describe here exactly how I installed anvi’o at the time of this analysis.

Development anvi’o

I like to run the development version of anvi’o so I can keep up with the master branch. This allows me to access new features and bug fixes as they are released. The way this is done is that any time we fire up anvi’o, a set of instructions will compare our version of anvi’o with the master repository. If any differences are detected, the script will replace the out-of-date local files with the updated files from the repository.

The downside of this is that in the extremely unlikely event that the remarkably talented anvi’o developers make a mistake, you could break your anvi’o install. So just in case we will also install a stable version of the codebase (see below).

Moving on. At the time of this writing, the default Python version of the Miniconda install was 3.7.3, but many of the anvi’o dependencies required Python 3.6.x. The first step is to create a conda environment running Python 3.6. This is an example of the power of package management: running different versions of Python side by side.

First create the anvi’o environment.

conda deactivate
conda create -y --name anvio-master python=3.6

Next, we activate the environment and double check we have the correct version of Python.

conda activate anvio-master
python --version

Hopefully you get something like this…

Python 3.6.9 :: Anaconda, Inc.
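If you would rather not eyeball the version, the check can be scripted. This is only a sketch: it assumes the banner has the form shown above (the word `Python` followed by the version number), and `minor_version` is a made-up helper name.

```shell
# Reduce a version banner such as "Python 3.6.9 :: Anaconda, Inc." to "3.6".
minor_version() {
    echo "$1" | awk '{ split($2, v, "."); print v[1] "." v[2] }'
}

# Compare the active interpreter against the version we asked conda for.
if command -v python >/dev/null 2>&1; then
    found=$(minor_version "$(python --version 2>&1)")
    if [ "$found" = "3.6" ]; then
        echo "Python $found, as expected"
    else
        echo "Python $found found, expected 3.6"
    fi
fi
```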

Now it is time to install the third-party software that anvi’o needs to do its job. Most of these are pretty much mandatory and part of the general anvi’o installation instructions.

pip install virtualenv
# this will install virtualenv with the Python version of
# the environment, in this case 3.6.9

conda install -y -c bioconda prodigal
conda install -y -c bioconda mcl
conda install -y -c bioconda muscle
conda install -y -c bioconda fasttree
conda install -y -c bioconda hmmer
conda install -y -c bioconda blast
conda install -y -c bioconda megahit
conda install -y -c bioconda bowtie2
conda install -y -c bioconda bwa
conda install -y -c bioconda samtools
conda install -y -c bioconda centrifuge
conda install -y -c bioconda diamond=0.9.14
conda install -y -c bioconda bioconductor-qvalue
conda install -y r-base=3.6.1
conda install -y -c r r-tidyverse
conda install -y -c conda-forge r-optparse
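After a long run of installs like this, a quick check that everything landed on your PATH can save some head scratching later. The sketch below is not part of the anvi’o instructions. The executable names are my assumptions about what each conda package provides, so adjust the list if one is reported missing but you know it installed.

```shell
# Print the name of every command in the argument list that is not on PATH.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Executable names below are assumptions, one or more per package installed above.
missing=$(missing_tools prodigal mcl muscle FastTree hmmscan blastp megahit \
    bowtie2 bwa samtools centrifuge diamond Rscript)

if [ -n "$missing" ]; then
    echo "Not found on PATH:"
    echo "$missing"
else
    echo "All tools found."
fi
```

Run this with the anvio-master conda environment active, since that is where the tools were installed.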

Swell. On to the anvi’o install. I used ~/github/ as the base directory for the anvi’o codebase.

mkdir -p ~/github && cd ~/github/
git clone --recursive https://github.com/meren/anvio.git

Now we set up a Python virtual environment for anvi’o. For reasons beyond my efforts to understand, this is the best way to install the dev version. Remember we are still in the conda environment (anvio-master) so this is a “virtual environment within a virtual environment.”

# just in case you've done this before
# remove old and create new
rm -rf ~/virtual-envs/anvio-master
mkdir -p ~/virtual-envs/
# start virtualenv
virtualenv ~/virtual-envs/anvio-master
source ~/virtual-envs/anvio-master/bin/activate

Now the virtual environment is active, inside the conda environment, which is itself active. Cool. Before installing anvi’o we need to install the dependencies for the codebase. These are mostly Python packages like numpy, snakemake, pysam, etc.

cd ~/github/anvio/
pip install -r requirements.txt

You can now deactivate the anvi’o virtualenv.

deactivate

If we want to update the codebase from the master repository, we need to make a few additions to the activation file in the virtualenv. This will look for changes every time we activate anvi’o.

# updating the activation script for the Python virtual environment
# so (1) Python knows where to find anvi'o libraries, (2) BASH knows
# where to find its programs, and (3) every time the environment is activated
# it downloads the latest code from the `master` repository
echo -e "\n# >>> ANVI'O STUFF >>>" >> ~/virtual-envs/anvio-master/bin/activate
echo 'export PYTHONPATH=$PYTHONPATH:~/github/anvio/' >> ~/virtual-envs/anvio-master/bin/activate
echo 'export PATH=$PATH:~/github/anvio/bin:~/github/anvio/sandbox' >> ~/virtual-envs/anvio-master/bin/activate
echo 'cd ~/github/anvio && git pull && cd -' >> ~/virtual-envs/anvio-master/bin/activate
echo "# <<< ANVI'O STUFF <<<" >> ~/virtual-envs/anvio-master/bin/activate

And finally…add an alias to the .bash_profile file to do all of this with a single command.

echo -e "\n# >>> ANVI'O STUFF >>>" >> ~/.bash_profile
echo 'alias anvi-activate-master="source ~/virtual-envs/anvio-master/bin/activate"' >> ~/.bash_profile
echo "# <<< ANVI'O STUFF <<<" >> ~/.bash_profile
source ~/.bash_profile
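One caveat: because these commands append with `>>`, running them a second time adds the alias to .bash_profile again. If you expect to repeat the setup, a guard like the one sketched below avoids duplicates. `append_once` is a hypothetical helper, and the demonstration writes to a scratch file rather than your real profile.

```shell
# Append a block of lines to a file only if a marker string is not already there.
# $1 = target file, $2 = marker text; the block itself is read from stdin.
append_once() {
    if grep -qF -- "$2" "$1" 2>/dev/null; then
        cat > /dev/null            # marker found: discard the block
    else
        cat >> "$1"
    fi
}

# Demonstration on a scratch file; the second attempt changes nothing.
profile=$(mktemp)
for attempt in 1 2; do
    append_once "$profile" 'alias anvi-activate-master=' <<'EOF'

# >>> ANVI'O STUFF >>>
alias anvi-activate-master="source ~/virtual-envs/anvio-master/bin/activate"
# <<< ANVI'O STUFF <<<
EOF
done
grep -c 'alias anvi-activate-master=' "$profile"   # prints 1
rm -f "$profile"
```

To use it for real, point the first argument at ~/.bash_profile instead of the scratch file.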

That’s it. Dev version installed. Bam.

Run the self-test and make sure it completes without issue.

anvi-self-test --suite mini

I also like to run the full version of the test. It takes longer but it makes me feel better because sometimes the mini test passes while the full test fails.

anvi-self-test --suite full

Stable anvi’o

Next we install a stable version just in case we break something in the master, want to test the behavior of a code change, etc. It’s probably a good idea to check the anvi’o release page from time to time for new releases and update this install accordingly. The easiest thing to do is remove the environment and reinstall or start a completely new environment. I find trying to update an install to be a pain. More on that below…

First we create a conda environment and activate it. We will name the environment anvio-6 but you can call it whatever you want.

conda create -n anvio-6 python=3.6
conda activate anvio-6

Then we install anvi’o.

conda install -y anvio=6

Boom. The last step is to check the version of diamond. This install expects diamond version 0.9.14; with any other version many things will break.

diamond --version

If it says anything other than diamond version 0.9.14 you need to run:

conda install -y diamond=0.9.14
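The same check can be scripted so that it only nags you when something is off. This sketch assumes `diamond --version` prints a single line of the form `diamond version 0.9.14`, as described above. `diamond_is` is an illustrative name.

```shell
# True when the banner printed by `diamond --version` names the required version.
# $1 = banner text, $2 = required version
diamond_is() {
    [ "$1" = "diamond version $2" ]
}

# Only attempt the check if diamond is on PATH in the current environment.
if command -v diamond >/dev/null 2>&1; then
    if diamond_is "$(diamond --version 2>&1)" "0.9.14"; then
        echo "diamond 0.9.14 is installed"
    else
        echo "wrong diamond version; run: conda install -y diamond=0.9.14"
    fi
fi
```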

That’s it. Stable version installed. Bam.

Run the self-test and make sure it completes without issue.

anvi-self-test --suite mini

Again, I also like to run the full version of the test. It takes longer but it makes me feel better…

anvi-self-test --suite full

Any time you want to activate the stable version, just run

conda activate anvio-6

Or add an alias to your .bash_profile file like so…

echo 'alias anvi-activate-stable="conda activate anvio-6"' >> ~/.bash_profile

Accessing anvi’o like a pro

Meren wrote this little ditty to access the two different anvi’o installs with simple alias commands, and he tucked it in a cool dropdown. I present it here exactly as he had it on the anvi’o website :)

Meren’s BASH profile setup

This is all personal taste and they may need to change from computer to computer, but I added the following lines at the end of my ~/.bash_profile to easily switch between different versions of anvi’o on my Mac system:

If you are using Anaconda rather than miniconda, or you are using Linux and not Mac, you will have to find corresponding paths for lines that start with /Users down below :)

init_anvio_6 () {
    {
        deactivate && conda deactivate
    } &> /dev/null

    export PATH="/Users/$USER/miniconda3/bin:$PATH"
    . /Users/$USER/miniconda3/etc/profile.d/conda.sh
    conda activate anvio-6
    export PS1="\[\e[0m\e[40m\e[1;30m\] :: anvi'o v6 :: \[\e[0m\e[0m \[\e[1;32m\]\]\w\[\e[m\] \[\e[1;31m\]>>>\[\e[m\] \[\e[0m\]"
}


init_anvio_master () {
    {
        deactivate && conda deactivate
    } &> /dev/null

    export PATH="/Users/$USER/miniconda3/bin:$PATH"
    . /Users/$USER/miniconda3/etc/profile.d/conda.sh
    conda activate anvio-master
    anvi-activate-master
    export PS1="\[\e[0m\e[40m\e[1;30m\] :: anvi'o v6 master :: \[\e[0m\e[0m \[\e[1;34m\]\]\w\[\e[m\] \[\e[1;31m\]>>>\[\e[m\] \[\e[0m\]"
}

alias a6=init_anvio_6
alias am=init_anvio_master

With this setup, in a new terminal window you can type a6 or am to run the stable or master version of anvi’o, or to switch from one to the other. That’s rad.

(anvio-6) Js-MacBook-Pro:~ rad$ a6
 :: anvi'o v6 ::  ~ >>> anvi-self-test -v
Anvi'o version ...............................: esther (v6)
Profile DB version ...........................: 31
Contigs DB version ...........................: 14
Pan DB version ...............................: 13
Genome data storage version ..................: 6
Auxiliary data storage version ...............: 2
Structure DB version .........................: 1
 :: anvi'o v6 ::  ~ >>>

  :: anvi'o v6 ::  ~ >>> am
Already up to date.
/Users/rad
 :: anvi'o v6 master ::  ~ >>> anvi-self-test -v
Anvi'o version ...............................: esther (v6-master)
Profile DB version ...........................: 31
Contigs DB version ...........................: 14
Pan DB version ...............................: 13
Genome data storage version ..................: 6
Auxiliary data storage version ...............: 2
Structure DB version .........................: 1
 :: anvi'o v6 master ::  ~ >>>

Other Software

As mentioned above, we will use many other tools. Some are installed with anvi’o, some we installed while installing anvi’o, while others need to be installed separately. We will do this as the need arises and document it as we move through the workflow. In the next section for example we build some databases and install the tools we need at the same time. :)

References

Eren, A Murat, Özcan C Esen, Christopher Quince, Joseph H Vineis, Hilary G Morrison, Mitchell L Sogin, and Tom O Delmont. 2015. “Anvi’o: An Advanced Analysis and Visualization Platform for ‘Omics Data.” PeerJ 3: e1319. https://doi.org/10.7717/peerj.1319.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/hypocolypse/web/, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".