: No 7. KEGG/KOFam Analysis

Show setup information.

pacman::p_load(tidyverse, seqinr, Biostrings, stringr, 
               plyr, dplyr, pathview, heatmaply,  
               install = FALSE, update = FALSE)

The objective of this section is to visualize all of the genes that have KEGG-KOfam annotations and assess their distribution across metagenomes. Our goal is to create a contig.db that contains only genes annotated by KEGG-KOfams and a corresponding profile.db that contains coverage and detection of genes across each metagenome. We also will generate files that have other important information about each gene such as taxonomy, KOfam functions(s), and KEGG modules (if any).

Step 1. Create contigs database

In order to create a basic contigs.db of genes, we need a) a fasta file and b) an external gene calls file, both of which only contain KOfam annotated genes. The final command will look something like this:

anvi-gen-contigs-database --contigs-fasta kofam_genes_rn.fa \
                          --output-db-path KOFAMS-contigs.db  \
                          --external-gene-calls kofam_external_gene_calls.txt \

KOfam gene annotation

First we run the annotation on the full contig.db

anvi-run-kegg-kofams -c 03_CONTIGS/WATER-contigs.db \
                     --kegg-data-dir /PATH/to/KEGG/db \
                     -T $NSLOTS

And here is the final portion of the output from the command.

Number of raw hits ...........................: 631,174
Number of weak hits removed ..................: 619,356
Number of hits in annotation dict  ...........: 11,818
Gene functions ...............................: 11818 function calls from 1 sources for 11588 unique gene calls has been added to the contigs database.
Gene functions ...............................: 3240 function calls from 1 sources for 3198 unique gene calls has been added to the contigs database.
Gene functions ...............................: 3240 function calls from 1 sources for 3198 unique gene calls has been added to the contigs database.

As you can see, 11,818 KOfam function calls were added from 11,588 unique gene calls, meaning a small percentage of genes had multiple KOfam hits. This is unavoidable and we will need to deal with that at some point, but not now. If you run something like anvi-export-gene-calls -c contigs.db, we find out the assembly has 65,102 gene calls, which tells us that ~18% of the genes have KOfam annotations. What the output also tells us is that 3240 KEGG Module calls were also added to the data base.

Next, we export all the KOfam function calls. To keep things organized, we can keep this entire analysis in a separate directory (METABOLISM) and store the files we need to build and annotate in the subdirectory 00_HELPER_FILES.

mkdir -p METABOLISM/00_HELPER_FILES/
anvi-export-functions -c 03_CONTIGS/WATER-contigs.db \
                      --annotation-sources KOfam \
                      -o METABOLISM/00_HELPER_FILES/KOfam.txt

Again, this file has 11,818 hits from 11,588 unique gene calls. The file is tab-delimited and contains the KOfam accession number, function, and e_value for each gene_callers_id. We need a list of unique gene IDs to build a contig database so for now we select the hit with the lowest e_value. It really doesn’t matter at this point but just in case we decide latter to only present a single function call per gene, this approach seems to make the most sense.

kofam <- read.table("metabolism/00_HELPER_FILES/KOfam.txt", 
                    sep = "\t", header = TRUE,  quote = "")
kofam <- kofam %>% dplyr::rename(ko_function = function.)
kofam$ko_function <- stringr::str_replace(kofam$ko_function, "'", "")
kofam$gene_callers_id <- as.factor(kofam$gene_callers_id)
kofam <- kofam[order(kofam$gene_callers_id, kofam$e_value), ]
kofam_unique <- kofam[!duplicated(kofam$gene_callers_id), ]
write.table(kofam_unique, "metabolism/00_HELPER_FILES/KOfam_unique.txt", 
            sep = "\t", quote = FALSE, row.names = FALSE)
head(kofam_unique)

Same file but with duplicate gene IDs removed.

Get gene sequences for KOfam hits

Next, we create a file containing a list of the just the unique gene_callers_ids so that we can parse out the KOfam annotated gene sequence.

kofam_genes <- kofam_unique
kofam_genes[, c('accession', 'ko_function', 'e_value', 'source')] <- list(NULL)
write.table(kofam_genes, "metabolism/00_HELPER_FILES/kofam_genes_list.txt", 
            sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)

We can now use this file to get the sequences for each gene caller ID that has a KOfam hit from the original contigs.db.

anvi-get-sequences-for-gene-calls \
                        -c 03_CONTIGS/WATER-contigs.db \
                        --gene-caller-ids kofam_genes_list.txt \
                        --wrap 0 \
                        -o METABOLISM/00_HELPER_FILES/kofam_genes.fa

Generate external gene calls file

We need to tweak the deflines of the fasta file before building the new database but we will do that in a minute. For now assume the fasta file is fine as is. First, we create the --external-gene-calls file. This file allows us to skip many of the steps anvi’o normally performs when building a database, like contig splitting and gene calling. Remember, we already did this with the assembly, which is where these genes came from in the first place. Our goal is to create a database of gene calls containing information as close to the original database as possible so that the two databases are comparable.

anvi-export-gene-calls -c 03_CONTIGS/WATER-contigs.db \
                       --gene-caller prodigal \
                       -o METABOLISM/00_HELPER_FILES/kofam_gene_calls.txt

This command gives us a file containing all 65,102 gene calls. We can use the list of unique gene calls created above (kofam_genes_list.txt) to pull out the relevant IDs.

all_genes <- read.table("metabolism/00_HELPER_FILES/kofam_gene_calls.txt", 
                        sep = "\t", header = TRUE)
all_genes <- all_genes %>% dplyr::mutate(gene_callers_id = as.character(gene_callers_id))
kofam_genes_tab <- dplyr::inner_join(kofam_genes, all_genes, 
                                     by = "gene_callers_id")
head(kofam_genes_tab)

We need to modify this file a bit to make it compatible with what anvi’o needs to build the database. You see, anvi’o is expecting contigs not genes. When we give anvi’o an external gene calls file for anvi-gen-contigs-database we are basically saying that we performed our own gene calling on the contigs in the fasta file. Anvi’o will look at the contig ID in the external gene calls file and compare those IDs to the fasta deflines. Problem is that the deflines are named by genes, not contigs. We could just make up new names but then we lose the original contig information and gene IDs for each call. So instead, we add the original gene caller id to the original contig name. This not only preserves the contig information but also avoids any duplication of contig IDs, since many genes are found on the same contig.

kofam_genes_tab <- kofam_genes_tab %>% unite("contig", contig, gene_callers_id, 
                                              remove = FALSE, sep = "_gene_") 
kofam_genes_tab <- kofam_genes_tab[, c(2,1,7,8,4,5,3,6,9,10)]
head(kofam_genes_tab)

Better. We still have the original contig IDs, now appended with the gene ID. If we just give anvi’o this file, it will look at gene caller id 9, and see that this gene came from position 443–896 on contig WATER_000000000004_gene_9. The problem is that when anvi’o looks in the fasta file at the corresponding sequence, it will not find a sequence greater than or equal to 896 bps. Instead, it will find a sequence that is 453 bps long, since this is the length of the gene and not the contig. Make sense? So we need to trick anvi’o a bit by making it think each “contig” just happens to be the exact length of a given gene. To do this, we need to adjust the start and stop values. If you don’t do this you will get an error saying something like IndexError: list assignment index out of range. This is because of the length discrepency just described. To circumvent this problem, we can set all start values to 0 and adjust the stop values to the full gene length.

kofam_genes_tab <- kofam_genes_tab %>% dplyr::mutate(stop = (stop - start), .after = 4) 
kofam_genes_tab$start <- 0
head(kofam_genes_tab)
write.table(kofam_genes_tab, "metabolism/00_HELPER_FILES/kofam_external_gene_calls.txt", 
            sep = "\t", quote = FALSE, row.names = FALSE)

Back to gene 9, which is still 453 bp long, but now starts at the beginning of the sequence.

This is the new external-gene-calls file.

Fix fasta deflines

If you take a peak at the kofam_genes.fa gene fasta file we generated above, you will see the deflines are named after the genes, not contigs. Again, avni’o is expecting contig IDs. Time to play another trick. We will take the new contig IDs, appended with the gene IDs, and use those in our fasta file instead of the gene IDs. Things get a bit weird here since we are trying to substitute the new contig name for the original gene caller id in the fasta file.

We start by converting the fasta file to a data frame and then ordering the data frame by the sequence name so they are in the same order as the contig names in the external-gene-calls file.

kofam_fna2 <- readDNAStringSet("metabolism/00_HELPER_FILES/kofam_genes.fa")
kofam_names <-  names(kofam_fna2)
kofam_seqs <-  paste(kofam_fna2)
fasta_df <- data.frame(kofam_names, kofam_seqs)
fasta_df$kofam_names <- as.numeric(as.character(fasta_df$kofam_names))
fasta_df <- dplyr::arrange(fasta_df, kofam_names)

Next, add the contigs column from the external-gene-calls file to the data frame and make sure the names match.

fasta_df <- fasta_df %>% dplyr::mutate(new_id = kofam_genes_tab$contig)
fasta_df <- fasta_df[, c(1,3,2)]
fasta_df$kofam_names <- NULL
fasta_df$new_id <- sub("", ">", fasta_df$new_id)
write.table(fasta_df, "metabolism/00_HELPER_FILES/kofam_genes_rn.fa",
            sep = "\r", col.names = FALSE, row.names = FALSE,
            quote = FALSE, fileEncoding = "UTF-8")
head(fasta_df)

This is the new contigs-fasta file.

And that’s it. It is a really good idea to double check the two files to make certain everything looks OK.

Make the contig.db of KOfam genes

And finally we can create the database. We want anvi’o to do as little as possible so we use some additional flags to prevent any unnecessary processing.

anvi-gen-contigs-database --contigs-fasta 00_HELPER_FILES/kofam_genes_rn.fa \
                          --project-name KOFAMS \
                          --output-db-path 01_CONTIGS/KOFAMS-contigs.db  \
                          --external-gene-calls 00_HELPER_FILES/kofam_external_gene_calls.txt \
                          --ignore-internal-stop-codons \
                          --skip-predict-frame

And here is the final output of the command…

Input FASTA file .............................: kofam_genes_rn.fa
Name .........................................: KOFAMS
Description ..................................: No description is given
External gene calls ..........................: 11588 gene calls recovered and will be processed.

WARNING
===============================================
Anvi'o found amino acid sequences in your external gene calls file that match to
11588 of 11588 gene in it and will use these amino acid sequences for
everything.


EXTERNAL GENE CALLS PARSER REPORT
===============================================
Num gene calls in file .......................: 11,588
Non-coding gene calls ........................: 0
Partial gene calls ...........................: 5,748
Num amino acid sequences provided ............: 11,588
  - For complete gene calls ..................: 5,840
  - For partial gene calls ...................: 5,748
Frames predicted .............................: 0
  - For complete gene calls ..................: 0
  - For partial gene calls ...................: 0
Gene calls marked as NONCODING ...............: 0
  - For complete gene calls ..................: 0
  - For partial gene calls ...................: 0
Gene calls with internal stops ...............: 0
  - For complete gene calls ..................: 0
  - For partial gene calls ...................: 0


CONTIGS DB CREATE REPORT
===============================================
Split Length .................................: 20,000
K-mer size ...................................: 4
Skip gene calling? ...........................: False
External gene calls provided? ................: True
External gene calls file have AA sequences? ..: True
Proper frames will be predicted? .............: False
Ignoring internal stop codons? ...............: True
Splitting pays attention to gene calls? ......: True
Contigs with at least one gene call ..........: 11588 of 11588 (100.0%)
Contigs database .............................: A new database, KOFAMS-contigs.db, has been created.
Number of contigs ............................: 11,588
Number of splits .............................: 11,588
Total number of nucleotides ..................: 9,249,465
Gene calling step skipped ....................: False
Splits broke genes (non-mindful mode) ........: False
Desired split length (what the user wanted) ..: 20,000
Average split length (what anvi'o gave back) .: (Anvi'o did not create any splits)

Add annotations to new database

We could rerun the KOfam annotation, but since we already did that to get these genes in the first place, there really is no need. If you do not know what annotations are in your original contigs database, run the command like this.

anvi-export-functions -c 03_CONTIGS/WATER-contigs.db \
                     --list-annotation-sources

FUNCTIONAL ANNOTATION SOURCES FOUND
===============================================
* COG_CATEGORY
* COG_FUNCTION
* Pfam
* KOfam
* Transfer_RNAs
* KEGG_Module

OK, so we have KOfam and KEGG_Module annotations.

Remember, to get annotations you need to run this:

anvi-export-functions --contigs-db 03_CONTIGS/WATER-contigs.db \
                      --annotation-sources KOfam \
                      --output-file METABOLISM/00_HELPER_FILES/KOfam.txt

And then import the annotation into the new database.

anvi-import-functions --contigs-db 01_CONTIGS/KOFAMS-contigs.db \
                      --input-files 00_HELPER_FILES/KOfam.txt

Gene functions ...............................: 11818 function calls from 1 sources for 11588 unique gene calls has been added to the contigs database.

We can also import the KEGG_Module annotation. First, export the information from the original contigs.db.

anvi-export-functions --contigs-db 03_CONTIGS/WATER-contigs.db \
                      --annotation-sources KEGG_Module \
                      -output-file METABOLISM/00_HELPER_FILES/KEGG_Module.txt

There are two things we need to change in the KEGG_Module.txt file before importing the functions into the new contigs.db. The first is related to anvi’o itself, specifically the output file from the command above contains the value of NONE for every entry in the e_value column. When we try to import the function calls into our new database, anvi’o will throw an error because it needs an actual e value. We can just set the value to 0 to make anvi’o happy. The other problem is that when we read the file into R, the column called function is changed to function. with a period since function has a special meaning in R. So we need to change it back to the original column name. In order to import gene functions, anvi’o needs very specific column headers.

kegg_mod <- read.table("metabolism/00_HELPER_FILES/KEGG_Module.txt", 
                       sep = "\t", header = TRUE)
kegg_mod$e_value <- 0
colnames(kegg_mod) <- c("gene_callers_id", "source", 
                        "accession", "function", "e_value")
write.table(kegg_mod, "metabolism/00_HELPER_FILES/KEGG_Module_mod.txt", 
            sep = "\t", quote = FALSE, row.names = FALSE)

And now we import the modified annotation file. To run this you need to be in the METABOLISM directory.

anvi-import-functions --contigs-db 01_CONTIGS/KOFAMS-contigs.db \
                      --input-files 00_HELPER_FILES/KEGG_Module_mod.txt

Gene functions ...............................: 3240 function calls from 1 sources for 3198 unique gene calls has been added to the contigs database.

Profile metagenomes

Next, we profile each metagenome against the new gene database. Profiling quantifies things like coverage, detection, as well as single-nucleotide, single-codon, and single-amino acid for each metagenome. The default input for profiling is a BAM file, so let’s map short reads to the database. variants, etc.

mkdir 02_MAPPING_TO_KOFAMS
bowtie2-build kofam_genes_rn.fa 02_MAPPING_TO_KOFAMS/KOFAMS

for sample in `cat samples_WATER.txt`
do
    bowtie2  -x 02_MAPPING_TO_KOFAMS/KOFAMS \
             -1 00_QC/$sample-QUALITY_PASSED_R1.fastq.gz \
             -2 00_QC/$sample-QUALITY_PASSED_R2.fastq.gz \
             -S 02_MAPPING_TO_KOFAMS/$sample-in-KOFAMS.sam
             --no-unal --threads $NSLOTS 
             
    samtools view -F 4 -bS 02_MAPPING_TO_KOFAMS/$sample-in-KOFAMS.sam > \
                           02_MAPPING_TO_KOFAMS/$sample-in-KOFAMS-RAW.bam

    # sort and index the BAM file:
    samtools sort 02_MAPPING_TO_KOFAMS/$sample-in-KOFAMS-RAW.bam \
               -o 02_MAPPING_TO_KOFAMS/$sample-in-KOFAMS.bam
    samtools index 02_MAPPING_TO_KOFAMS/$sample-in-KOFAMS.bam

    # remove temporary files:
    rm 02_MAPPING_TO_KOFAMS/$sample-in-KOFAMS.sam \
       02_MAPPING_TO_KOFAMS/$sample-in-KOFAMS-RAW.bam
done

Then we profile each metagenome.

mkdir 03_KOFAMS_PROFILES
for sample in `cat samples_WATER.txt`
do
    anvi-profile --contigs-db KOFAMS-contigs.db \
                 --input-file 02_MAPPING_TO_KOFAMS/$sample-in-KOFAMS.bam \
                 --profile-SCVs \
                 --num-threads 1 \
                 --min-contig-length 10 \
                 --output-dir 03_KOFAMS_PROFILES/$sample-in-KOFAMS
done

And finally merge all of the profiles so that we can make comparisons across metagenomes.

anvi-merge 03_KOFAMS_PROFILES/*-in-KOFAMS/PROFILE.db \
           --contigs-db KOFAMS-contigs.db \
           --output-dir 04_KOFAM_PROFILES-MERGED/

All set. We have a contigs.db of all KOfam annotated genes and a profile.db. We could just run a visualiztion on the data we have so far, but what we want to do instead in to create some additional files to enhance the visualiztion.

Adding additional items info

Most of the time you use the interactive interface you are looking at two types of data frames—items, which in this case are the genes and layers, which in this case are the samples. Click here for more details. We can add basically any type of data we wish for either of these data frames. We will do this piecemeal and most of what we add is already somewhere in our various databases, just not yet in a format that is accessible by the interactive.

So let’s start by generating additional info for the items, that is the genes.

Adding gene classifications

To add taxonomic info for each gene call we need to two tables that can be generated from the original contigs database.

anvi-export-table 03_CONTIGS/WATER-contigs.db \
                  --table genes_taxonomy \
                  --output-file METABOLISM/00_HELPER_FILES/genes_taxonomy.txt
anvi-export-table 03_CONTIGS/WATER-contigs.db \
                  --table taxon_names \
                  --output-file METABOLISM/00_HELPER_FILES/taxon_names.txt

The first file (genes_taxonomy.txt) provides a taxon ids for all genes in the database. Again, we could rerun the classification on just the KOfam annotated genes but why bother. The second file (taxon_names.txt) is a lookup table for each taxon id. What we need to do is join these two table so that each gene id is coupled with a taxon id and a lineage. Then merge these data with of list of mock contig names in the in the KOfam database.

tax_names <- read.table("metabolism/00_HELPER_FILES/taxon_names.txt", 
                        sep = "\t", header = TRUE, quote = "")
gene_tax <- read.table("metabolism/00_HELPER_FILES/genes_taxonomy.txt", 
                       sep = "\t", header = TRUE, quote = "")
gene_tax_names <- dplyr::inner_join(gene_tax, tax_names, by = "taxon_id")
kofam_genes_n <- kofam_genes
kofam_genes_n$gene_callers_id <- as.numeric(as.character(kofam_genes_n$gene_callers_id))
kofam_with_tax <- dplyr::left_join(kofam_genes_n, gene_tax_names, 
                                   by = "gene_callers_id")
head(kofam_with_tax)
write.table(kofam_with_tax, "metabolism/00_HELPER_FILES/kofam_with_tax.txt", sep = "\t", 
            quote = FALSE, row.names = FALSE)

As you can see, we now have each gene associated with a taxonmic lineage (when one was identified). We could remove genes with NA from the taxonomy file but anvi’o does not seem to have a problem with it so let’s leave it be. Now we can import the taxonomy files into the new contigs database.

anvi-import-taxonomy-for-genes --contigs-db 01_CONTIGS/KOFAMS-contigs.db \
                               --input-files 00_HELPER_FILES/kofam_with_tax.txt \
                               --parser default_matrix

Here is the output.

Total num hits found .........................: 11,588
Num gene caller ids in the db ................: 11,588
Num gene caller ids in the incoming data .....: 11,588
Taxon names table ............................: Updated with 1467 unique taxon names
Genes taxonomy table .........................: Taxonomy stored for 11588 gene calls
Splits taxonomy ..............................: Input data from "default_matrix" annotated 11588 of 11588 splits (100.0%) with taxonomy.

Adding functional annotations

Even though we have this information, adding to the interactive is tricky because some genes have multiple annotations. Our main purpose for the visualiztion is exploratory analysis so we do not need to capture everything, but this is still a concern, especially when we are trying to generate the files we need. To narrow down our efforts, let’s focus on retrieving three pieces of information.

Which KOfam hits are part of complete Pathway Modules?
What module(s) is/are they a part of?
What hits are found in which MAGs?
Get module information for each gene call.

We use the command called anvi-estimate-metabolism with the --kegg-output-modes modules flag. This will generate a file (all_kofam_hits.txt) that has the gene caller ID, the KO accession number, an e-value, and what module(s) the gene belongs to.

anvi-estimate-metabolism --contigs-db 03_CONTIGS/WATER-contigs.db \
                         --kegg-data-dir ~/Desktop/kegg-kofams \
                         --metagenome-mode \
                         --output-file-prefix METABOLISM/00_HELPER_FILES/all  \
                         --kegg-output-modes kofam_hits

Select only unique hits.

As before, in cases of multiple KO hits for a single gene call, we select the hit with the lowest e-value. We could do what we did above or we can use the table with unique gene hits that is stored in the variable kofam_unique. What we do here is merge the two tables by gene caller id and KO accession number. This way we get only the top hits from the all_modules.txt file.

genes_all_ko <- read.table("metabolism/00_HELPER_FILES/all_kofam_hits.txt", 
                        sep = "\t", header = TRUE, quote = "")
genes_all_ko[, c('unique_id', 'metagenome_name', 
            'contig', 'source')] <- list(NULL)
genes_all_ko <- genes_all_ko[, c(2, 1, 3)]
genes_all_ko_alt <- genes_all_ko
kofam_unique1 <- kofam_unique
kofam_unique1[, c('source', 'ko_function', 'e_value' )] <- NULL
kofam_unique1$gene_callers_id <- as.numeric(as.character(kofam_unique1$gene_callers_id))
genes_all_ko <-  dplyr::left_join(kofam_unique1, genes_all_ko, 
                                  by = c("gene_callers_id" = "gene_caller_id" , 
                                         "accession" = "ko"))

### ALT TO USE ALL GENE CALLS ###
kofam1 <- kofam
kofam1[, c('source', 'ko_function', 'e_value' )] <- NULL
kofam1$gene_callers_id <- as.numeric(as.character(kofam1$gene_callers_id))
genes_all_ko_alt <-  dplyr::left_join(kofam1, genes_all_ko_alt, 
                                  by = c("gene_callers_id" = "gene_caller_id" , 
                                         "accession" = "ko"))

Remove genes with no module hit.

OK. now we have a data frame that contains gene IDs, KO hits, and module IDs (if any). Some KOs are found in multiple modules while some are not found in any modules. We will deal with the multiple module cases in a minute. For now, let’s remove any hits not found in a module. Anvi’o doesn’t need information for all genes and removing genes with no hits will make parsing the data a tiny bit easier.

#genes_ko_only <- genes_all_ko[!(genes_all_ko$modules_with_ko=="None"),]
#genes_ko_only <- genes_ko_only %>% dplyr::rename(kegg_module = modules_with_ko)
#genes_ko_only_long <- genes_ko_only %>% tidyr::separate_rows(kegg_module)

### ALT TO USE ALL GENE CALLS ###
genes_ko_only_alt <- genes_all_ko_alt[!(genes_all_ko_alt$modules_with_ko=="None"),]
genes_ko_only_alt <- genes_ko_only_alt %>% dplyr::rename(kegg_module = modules_with_ko)
genes_ko_only_long_alt <- genes_ko_only_alt %>% tidyr::separate_rows(kegg_module)

And we get a list of 3,210 genes that have module affiliations. All we know right now is the module numbers, not the names.

anvi-estimate-metabolism --contigs-db 03_CONTIGS/WATER-contigs.db 
                         --kegg-data-dir ~/Desktop/kegg-kofams/ 
                         --metagenome-mode 
                         --output-file-prefix METABOLISM/00_HELPER_FILES/all  
                         --kegg-output-modes modules

Get module information

all_mod <- read.table("metabolism/00_HELPER_FILES/all_modules.txt", 
                        sep = "\t", header = TRUE, quote = "")
all_mod[, c('module_definition', 'metagenome_name', 
            'module_class', 'module_is_complete', 
            'module_completeness', 'gene_caller_ids_in_module', 
            'kofam_hits_in_module')] <- list(NULL)
all_mod <- all_mod[, c(1,2,4,5,3)]
mod_tab <- all_mod[,c(1,2)]
cat_tab <- all_mod[,c(1,2,3)]
sub_tab <- all_mod[,c(1,2,4)]
name_tab <-  all_mod[,c(1,2,5)]

Combine genes & module data

genes_mod <- dplyr::left_join(genes_ko_only_long_alt, mod_tab, 
                              by = "kegg_module") %>% 
  tidyr::drop_na("unique_id")
genes_mod <- genes_mod %>% tibble::rowid_to_column("unique_row_id")
genes_mod_ko <- genes_mod
genes_mod <- genes_mod[, c(2,1,4)]
genes_mod <- genes_mod %>% dplyr::mutate_if(is.numeric, 
                                            as.character, 
                                            is.factor, 
                                            as.character) %>%
  tidyr::unite("path", unique_row_id:kegg_module, 
               remove = TRUE, sep =  "XXX") %>%  
  tidyr::pivot_wider(names_from = c("path"), values_from = c("path")) %>%  
  tidyr::unite("kegg_module", -"gene_callers_id", 
               na.rm = TRUE, remove = TRUE, sep = "!!!")
n_reps <- max(str_count(genes_mod$kegg_module, "!!!")) + 2

genes_mod <- genes_mod %>% separate('kegg_module', 
                                    paste("module", 1:n_reps, sep = "_"), 
                                    sep = "!!!", extra = "drop")

#######################################
genes_cat <- dplyr::left_join(genes_ko_only_long_alt, cat_tab, 
                              by = "kegg_module") %>% 
  tidyr::drop_na("unique_id")
genes_cat[, c("accession", "kegg_module")] <- list(NULL)
genes_cat <- genes_cat %>% tibble::rowid_to_column("unique_row_id")
genes_cat <- genes_cat[, c(2,1,4)]

genes_cat <- genes_cat %>% dplyr::mutate_if(is.numeric, 
                                            as.character, 
                                            is.factor, 
                                            as.character) %>%
  tidyr::unite("path", unique_row_id:module_category, 
               remove = TRUE, sep =  "XXX") %>%  
  tidyr::pivot_wider(names_from = c("path"), values_from = c("path")) %>%  
  tidyr::unite("module_category", -"gene_callers_id", 
               na.rm = TRUE, remove = TRUE, sep = "!!!")
n_reps <- max(str_count(genes_cat$module_category, "!!!")) + 2

genes_cat <- genes_cat %>% separate('module_category', 
                                    paste("category", 1:n_reps, sep = "_"), 
                                    sep = "!!!", extra = "drop")
#######################################
genes_sub <- dplyr::left_join(genes_ko_only_long_alt, sub_tab, 
                              by = "kegg_module") %>% 
  tidyr::drop_na("unique_id")
genes_sub[, c("accession", "kegg_module")] <- list(NULL)
genes_sub <- genes_sub %>% tibble::rowid_to_column("unique_row_id")
genes_sub <- genes_sub[, c(2,1,4)]

genes_sub <- genes_sub %>% dplyr::mutate_if(is.numeric, 
                                            as.character, 
                                            is.factor, 
                                            as.character) %>%
  tidyr::unite("path", unique_row_id:module_subcategory, 
               remove = TRUE, sep =  "XXX") %>%  
  tidyr::pivot_wider(names_from = c("path"), values_from = c("path")) %>%  
  tidyr::unite("module_subcategory", -"gene_callers_id", 
               na.rm = TRUE, remove = TRUE, sep = "!!!")
n_reps <- max(str_count(genes_sub$module_subcategory, "!!!")) + 2

genes_sub <- genes_sub %>% separate('module_subcategory', 
                                    paste("subcategory", 1:n_reps, sep = "_"), 
                                    sep = "!!!", extra = "drop")

#######################################
genes_name <- dplyr::left_join(genes_ko_only_long_alt, name_tab, 
                              by = "kegg_module") %>% 
  tidyr::drop_na("unique_id")
genes_name[, c("accession", "kegg_module")] <- list(NULL)
genes_name <- genes_name %>% tibble::rowid_to_column("unique_row_id")
genes_name <- genes_name[, c(2,1,4)]
genes_name <- genes_name %>% dplyr::mutate_if(is.numeric, 
                                            as.character, 
                                            is.factor, 
                                            as.character) %>%
  tidyr::unite("path", unique_row_id:module_name, 
               remove = TRUE, sep =  "XXX") %>%  
  tidyr::pivot_wider(names_from = c("path"), values_from = c("path")) %>%  
  tidyr::unite("module_name", -"gene_callers_id", 
               na.rm = TRUE, remove = TRUE, sep = "!!!")
n_reps <- max(str_count(genes_name$module_name, "!!!")) + 2

genes_name <- genes_name %>% separate('module_name', 
                                    paste("name", 1:n_reps, sep = "_"), 
                                    sep = "!!!", extra = "drop")
#######################################
genes_ko <- genes_mod_ko
genes_ko <- genes_ko[, c(2,1,3)]
genes_ko <- genes_ko %>% dplyr::mutate_if(is.numeric, 
                                            as.character, 
                                            is.factor, 
                                            as.character) %>%
  tidyr::unite("path", unique_row_id:accession, 
               remove = TRUE, sep =  "XXX") %>%  
  tidyr::pivot_wider(names_from = c("path"), values_from = c("path")) %>%  
  tidyr::unite("ko", -"gene_callers_id", 
               na.rm = TRUE, remove = TRUE, sep = "!!!")
n_reps <- max(str_count(genes_ko$ko, "!!!")) + 2

genes_ko <- genes_ko %>% separate('ko', 
                                    paste("module", 1:n_reps, sep = "_"), 
                                    sep = "!!!", extra = "drop")

genes_mod[, 2:n_reps] <- apply(genes_mod[, 2:n_reps], 2, function(x) gsub("\\d+XXX", "", x))
genes_mod <- genes_mod %>% tidyr::unite("kegg_module", -"gene_callers_id", 
               na.rm = TRUE, remove = TRUE, sep = "!!!")

genes_cat[, 2:n_reps] <- apply(genes_cat[, 2:n_reps], 2, function(x) gsub("\\d+XXX", "", x))
genes_cat <- genes_cat %>% tidyr::unite("module_category", -"gene_callers_id", 
               na.rm = TRUE, remove = TRUE, sep = "!!!")

genes_sub[, 2:n_reps] <- apply(genes_sub[, 2:n_reps], 2, function(x) gsub("\\d+XXX", "", x))
genes_sub <- genes_sub %>% tidyr::unite("module_subcategory", -"gene_callers_id", 
               na.rm = TRUE, remove = TRUE, sep = "!!!")

genes_name[, 2:n_reps] <- apply(genes_name[, 2:n_reps], 2, function(x) gsub("\\d+XXX", "", x))
genes_name <- genes_name %>% tidyr::unite("module_name", -"gene_callers_id", 
               na.rm = TRUE, remove = TRUE, sep = "!!!")

genes_ko[, 2:n_reps] <- apply(genes_ko[, 2:n_reps], 2, function(x) gsub("\\d+XXX", "", x))
genes_ko <- genes_ko %>% tidyr::unite("ko", -"gene_callers_id", 
               na.rm = TRUE, remove = TRUE, sep = "!!!")

genes_modules <- dplyr::left_join(genes_ko, genes_mod, by = "gene_callers_id") %>%
  dplyr::left_join(., genes_cat, by = "gene_callers_id") %>%
  dplyr::left_join(., genes_sub, by = "gene_callers_id") %>%
  dplyr::left_join(., genes_name, by = "gene_callers_id") 
write.table(genes_modules, "metabolism/00_HELPER_FILES/genes_modules.txt", 
            quote = FALSE, sep = "\t")

Now we have a table of 2842 unique genes and their KEGG module annotations. The last thing we need to do is change the gene caller IDs to “contig IDs” to match the name of the slits in the contig.db. Way back uo the road we created a variable called kofam_genes_tab, a table derived from a gene calls file where we added the gene call onto the end of the contig name. We need to do the same with the new KEGG modules annotation file so we can add this as an additional layer to the visualization. And, we also need to tack on a split name like we did for the fasta file.

genes_modules1 <- genes_modules
kofam_genes_tab1 <- kofam_genes_tab
add_view_part_1 <- dplyr::right_join(kofam_genes_tab1, genes_modules1, by = "gene_callers_id")
add_view_part_1[, 3:10] <- NULL
add_view_part_1[, 1] <- NULL
add_view_part_1 <- add_view_part_1 %>% dplyr::mutate(split_id = "split_00001", .after = 1) %>%
  tidyr::unite(items, contig, split_id, remove = TRUE, sep = "_")
write.table(add_view_part_1, "metabolism/05_ANNOTATION_FILES/additional-items-kegg-functions.txt", 
            quote = FALSE, sep = "\t", row.names = FALSE)

Now we can run the following command to import the annotation data into the profile database.

anvi-import-misc-data 05_ANNOTATION_FILES/additional-items-kegg-functions.txt \
                      --pan-or-profile-db 04_KOFAM_PROFILES-MERGED/PROFILE.db \
                      --target-data-table items

KO genes from MAGs

for mag in `cat mags.list` 
do anvi-estimate-metabolism --contigs-db 03_CONTIGS/WATER-contigs.db \
                            --kegg-data-dir ~/PATH/to/kegg-kofams/ \
                            --profile-db 06_MERGED/WATER/PROFILE.db \
                            --collection-name MICROBIAL_FINAL \
                            --bin-id $mag -O $mag  \
                            --kegg-output-modes kofam_hits
done

for mag in `cat mags.list`
do anvi-estimate-metabolism --contigs-db 03_CONTIGS/WATER-contigs.db \
                            --kegg-data-dir ~/PATH/to/kegg-kofams/ \
                            --profile-db 06_MERGED/WATER/PROFILE.db \
                            --collection-name MICROBIAL_FINAL \
                            --bin-id $mag --output-file-prefix $mag  
                            --kegg-output-modes modules
done

###################### mag_01 ############################
mag_01<- read.table("metabolism/06_MAG_KOFAMS/WATER_MAG_00001_kofam_hits.txt", 
                    sep = "\t", header = TRUE)
mag_01[, c("unique_id", "ko", "modules_with_ko", "bin_name")] <- list(NULL)
mag_01$mag_id <- "MAG01" 
mag_01 <- mag_01 %>% dplyr::arrange(gene_caller_id)
mag_01 <- mag_01[!duplicated(mag_01$gene_caller_id), ]
mag_01 <- mag_01[, c(2,1,3)]
mag_01 <- mag_01 %>% tidyr::unite(int_name, "contig", "gene_caller_id", sep = "_gene_") %>% 
  dplyr::mutate(split_id = "split_00001", .after = 2) %>%
  tidyr::unite(items, "int_name", "split_id", sep = "_")
###################### mag_02 ############################
mag_02<- read.table("metabolism/06_MAG_KOFAMS/WATER_MAG_00002_kofam_hits.txt", 
                    sep = "\t", header = TRUE)
mag_02[, c("unique_id", "ko", "modules_with_ko", "bin_name")] <- list(NULL)
mag_02$mag_id <- "MAG02" 
mag_02 <- mag_02 %>% dplyr::arrange(gene_caller_id)
mag_02 <- mag_02[!duplicated(mag_02$gene_caller_id), ]
mag_02 <- mag_02[, c(2,1,3)]
mag_02 <- mag_02 %>% tidyr::unite(int_name, "contig", "gene_caller_id", sep = "_gene_") %>% 
  dplyr::mutate(split_id = "split_00001", .after = 1) %>%
  tidyr::unite(items, "int_name", "split_id", sep = "_")
###################### mag_03 ############################
mag_03<- read.table("metabolism/06_MAG_KOFAMS/WATER_MAG_00003_kofam_hits.txt", 
                    sep = "\t", header = TRUE)
mag_03[, c("unique_id", "ko", "modules_with_ko", "bin_name")] <- list(NULL)
mag_03$mag_id <- "MAG03" 
mag_03 <- mag_03 %>% dplyr::arrange(gene_caller_id)
mag_03 <- mag_03[!duplicated(mag_03$gene_caller_id), ]
mag_03 <- mag_03[, c(2,1,3)]
mag_03 <- mag_03 %>% tidyr::unite(int_name, "contig", "gene_caller_id", sep = "_gene_") %>% 
  dplyr::mutate(split_id = "split_00001", .after = 1) %>%
  tidyr::unite(items, "int_name", "split_id", sep = "_")
###################### mag_04 ############################
mag_04<- read.table("metabolism/06_MAG_KOFAMS/WATER_MAG_00004_kofam_hits.txt", 
                    sep = "\t", header = TRUE)
mag_04[, c("unique_id", "ko", "modules_with_ko", "bin_name")] <- list(NULL)
mag_04$mag_id <- "MAG04" 
mag_04 <- mag_04 %>% dplyr::arrange(gene_caller_id)
mag_04 <- mag_04[!duplicated(mag_04$gene_caller_id), ]
mag_04 <- mag_04[, c(2,1,3)]
mag_04 <- mag_04 %>% tidyr::unite(int_name, "contig", "gene_caller_id", sep = "_gene_") %>% 
  dplyr::mutate(split_id = "split_00001", .after = 1) %>%
  tidyr::unite(items, "int_name", "split_id", sep = "_")
###################### mag_05 ############################
mag_05<- read.table("metabolism/06_MAG_KOFAMS/WATER_MAG_00005_kofam_hits.txt", 
                    sep = "\t", header = TRUE)
mag_05[, c("unique_id", "ko", "modules_with_ko", "bin_name")] <- list(NULL)
mag_05$mag_id <- "MAG05" 
mag_05 <- mag_05 %>% dplyr::arrange(gene_caller_id)
mag_05 <- mag_05[!duplicated(mag_05$gene_caller_id), ]
mag_05 <- mag_05[, c(2,1,3)]
mag_05 <- mag_05 %>% tidyr::unite(int_name, "contig", "gene_caller_id", sep = "_gene_") %>% 
  dplyr::mutate(split_id = "split_00001", .after = 1) %>%
  tidyr::unite(items, "int_name", "split_id", sep = "_")
###################### bin_13 ############################
bin_13<- read.table("metabolism/06_MAG_KOFAMS/WATER_Bin_00013_kofam_hits.txt", 
                    sep = "\t", header = TRUE)
bin_13[, c("unique_id", "ko", "modules_with_ko", "bin_name")] <- list(NULL)
bin_13$mag_id <- "BIN13" 
bin_13 <- bin_13 %>% dplyr::arrange(gene_caller_id)
bin_13 <- bin_13[!duplicated(bin_13$gene_caller_id), ]
bin_13 <- bin_13[, c(2,1,3)]
bin_13 <- bin_13 %>% tidyr::unite(int_name, "contig", "gene_caller_id", sep = "_gene_") %>% 
  dplyr::mutate(split_id = "split_00001", .after = 1) %>%
  tidyr::unite(items, "int_name", "split_id", sep = "_")

add_mag <- rbind(mag_01, mag_02, mag_03, mag_04, mag_05, bin_13)
write.table(add_mag, "metabolism/05_ANNOTATION_FILES/additional-items-kos-in-mags.txt", 
            sep = "\t", quote = FALSE, row.names = FALSE)

anvi-import-misc-data 05_ANNOTATION_FILES/additional-items-kos-in-mags.txt \
                      --pan-or-profile-db 04_KOFAM_PROFILES-MERGED/PROFILE.db \
                      --target-data-table items

Genes in modules only

kofam_fna_mod <- readDNAStringSet("metabolism/00_HELPER_FILES/kofam_genes.fa")
kofam_names_mod <-  names(kofam_fna_mod)
kofam_seqs_mod <-  paste(kofam_fna_mod)
fasta_df_mod <- data.frame(kofam_names_mod, kofam_seqs_mod)
fasta_df_mod$kofam_names_mod <- as.numeric(as.character(fasta_df_mod$kofam_names_mod))
fasta_df_mod <- dplyr::arrange(fasta_df_mod, kofam_names_mod)

genes_modules_list <- genes_modules1
fasta_df_mod <- fasta_df_mod %>% dplyr::mutate_if(is.numeric, 
                                            as.character, 
                                            is.factor, 
                                            as.character) 
genes_modules_list[, c("ko", "kegg_module", "module_category", 
                       "module_subcategory", "module_name")] <- list(NULL)
fasta_df_mod1 <- dplyr::left_join(genes_modules_list, fasta_df_mod, 
                                                     by = c("gene_callers_id" = "kofam_names_mod"))

fasta_df_mod1 <- dplyr::left_join(fasta_df_mod1, kofam_genes_tab1, 
                                                     by = "gene_callers_id")
fasta_df_mod1 <- fasta_df_mod1[, c(3,2)]
fasta_df_mod1_rn <- fasta_df_mod1
fasta_df_mod1_rn$contig <- sub("", ">", fasta_df_mod1_rn$contig)
write.table(fasta_df_mod1_rn, "metabolism/00_HELPER_FILES_MOD_ONLY/kofam_genes_rn.fa",
            sep = "\r", col.names = FALSE, row.names = FALSE,
            quote = FALSE, fileEncoding = "UTF-8")
head(fasta_df_mod1_rn)

mod_only_list <- fasta_df_mod1
mod_only_list <- mod_only_list %>% dplyr::mutate(split_id = "split_00001", .after = 1) %>%
  tidyr::unite(items, contig, split_id, remove = TRUE, sep = "_")

mod_only_list$kofam_seqs_mod <- NULL
mod_only_mag <- dplyr::left_join(mod_only_list, add_mag,  by = "items")

gene_mod_only_list <- mod_only_list %>% 
  tidyr::separate("items", 
                  into = c("V1", "V2", "V3", "gene_callers_id", "V4", "V5"),  
                  sep ="_")
gene_mod_only_list[, c("V1", "V2", "V3", "V4", "V5")] <- list(NULL)

gene_mod_only_list$gene_callers_id <- as.numeric(as.character(gene_mod_only_list$gene_callers_id))
mods_tax <- dplyr::left_join(gene_mod_only_list, 
                                      kofam_with_tax,  
                                      by = "gene_callers_id")
kofam_genes_tab1$gene_callers_id <- as.numeric(as.character(kofam_genes_tab$gene_callers_id))
mod_external_gene <- dplyr::left_join(gene_mod_only_list, kofam_genes_tab1,  by = "gene_callers_id")

write.table(mod_external_gene, "metabolism/00_HELPER_FILES_MOD_ONLY/mod_external_gene.txt", 
            sep = "\t", quote = FALSE, row.names = FALSE)
write.table(mod_only_mag, "metabolism/00_HELPER_FILES_MOD_ONLY/mag-mods.txt", 
            sep = "\t", quote = FALSE, row.names = FALSE)
write.table(mods_tax, "metabolism/00_HELPER_FILES_MOD_ONLY/mods_tax.txt", 
            sep = "\t", quote = FALSE, row.names = FALSE)
write.table(add_view_part_1, "metabolism/00_HELPER_FILES_MOD_ONLY/mods.txt", 
            sep = "\t", quote = FALSE, row.names = FALSE)

anvi-gen-contigs-database --contigs-fasta 00_HELPER_FILES_MOD_ONLY/kofam_genes_rn.fa --project-name MODS --output-db-path 00_HELPER_FILES_MOD_ONLY/KOFAMS_MODS-contigs.db  --external-gene-calls 00_HELPER_FILES_MOD_ONLY/mod_external_gene.txt --ignore-internal-stop-codons --skip-predict-frame

anvi-import-taxonomy-for-genes --contigs-db 00_HELPER_FILES_MOD_ONLY/KOFAMS_MODS-contigs.db --input-files 00_HELPER_FILES_MOD_ONLY/mods_tax.txt --parser default_matrix

anvi-import-taxonomy-for-genes –contigs-db KOFAMS_MODS-contigs.db –input-files mods_tax.txt –parser default_matrix

anvi-import-misc-data mods.txt –pan-or-profile-db 03_KOFAM_PROFILES-MERGED/PROFILE.db –target-data-table items anvi-import-misc-data mag-mods.txt –pan-or-profile-db 03_KOFAM_PROFILES-MERGED/PROFILE.db –target-data-table items

distinct(genes_cat, gene_callers_id, .keep_all = T)

mag01_mod <- read.table("metabolism/06_MAG_KOFAMS/WATER_MAG_00001_modules.txt", 
                    sep = "\t", header = TRUE)
mag01_mod$unique_id <- NULL
mag01_mod$bin_name <- "MAG01"

mag02_mod <- read.table("metabolism/06_MAG_KOFAMS/WATER_MAG_00002_modules.txt", 
                    sep = "\t", header = TRUE)
mag02_mod$unique_id <- NULL
mag02_mod$bin_name <- "MAG02"

mag03_mod <- read.table("metabolism/06_MAG_KOFAMS/WATER_MAG_00003_modules.txt", 
                    sep = "\t", header = TRUE)
mag03_mod$unique_id <- NULL
mag03_mod$bin_name <- "MAG03"

mag04_mod <- read.table("metabolism/06_MAG_KOFAMS/WATER_MAG_00004_modules.txt", 
                    sep = "\t", header = TRUE)
mag04_mod$unique_id <- NULL
mag04_mod$bin_name <- "MAG04"

mag05_mod <- read.table("metabolism/06_MAG_KOFAMS/WATER_MAG_00005_modules.txt", 
                    sep = "\t", header = TRUE)
mag05_mod$unique_id <- NULL
mag05_mod$bin_name <- "MAG05"

bin13_mod <- read.table("metabolism/06_MAG_KOFAMS/WATER_Bin_00013_modules.txt", 
                    sep = "\t", header = TRUE)
bin13_mod$unique_id <- NULL
bin13_mod$bin_name <- "Bin13"

all_mods <- read.table("metabolism/00_HELPER_FILES/all_modules.txt", 
                        sep = "\t", header = TRUE, quote = "")
all_mods <- all_mods %>% dplyr::rename(bin_name = metagenome_name)
all_mods$unique_id <- NULL
all_mods$bin_name <- "ALL"

modules <- dplyr::left_join(all_mods, mag01_mod, 
                            by = c("kegg_module", "module_name", 
                                   "module_class", "module_category", 
                                   "module_subcategory"), 
                            suffix = c("_all", "_MAG_01")) %>%
  dplyr::left_join(., mag02_mod, by = c("kegg_module", "module_name", 
                                   "module_class", "module_category", 
                                   "module_subcategory"), suffix = c("XXXX", "_MAG_02")) %>%
  dplyr::left_join(., mag03_mod, by = c("kegg_module", "module_name", 
                                   "module_class", "module_category", 
                                   "module_subcategory"), suffix = c("_MAG_02", "_MAG_03")) %>%
  dplyr::left_join(., mag04_mod, by = c("kegg_module", "module_name", 
                                   "module_class", "module_category", 
                                   "module_subcategory"), suffix = c("_MAG_03", "_MAG_04")) %>%
  dplyr::left_join(., mag05_mod, by = c("kegg_module", "module_name", 
                                   "module_class", "module_category", 
                                   "module_subcategory"), suffix = c("_MAG_04", "_MAG_05")) %>%
  dplyr::left_join(., bin13_mod, by = c("kegg_module", "module_name", 
                                   "module_class", "module_category", 
                                   "module_subcategory"), suffix = c("_MAG_05", "_Bin_13"))

modules <- rbind(all_mods, mag01_mod, mag02_mod, mag03_mod, mag04_mod, mag05_mod, bin13_mod)
modules[, c("module_definition", "module_class")] <- list(NULL)
colnames(modules)
modules_w <- modules %>% tidyr::pivot_wider(
                                          names_from = c("bin_name"), 
                                          values_from = c("module_completeness", 
                                                          "module_is_complete",
                                                          "kofam_hits_in_module", 
                                                          "gene_caller_ids_in_module"),
                                          names_sep = "_")

items_add_data <- modules_w
items_add_data[, c(5:max(ncol(items_add_data)))] <- NULL
items_add_data <- items_add_data %>% dplyr::rename("items" = "kegg_module")
items_add_data <- items_add_data[order(items_add_data$module_category, items_add_data$module_subcategory), ]
write.table(items_add_data, "metabolism/00_HELPER_FILES/module_item_add_data.txt", 
            sep = "\t", quote = FALSE, row.names = FALSE)

data_tab <- modules_w
data_tab <- data_tab[order(data_tab$module_category, data_tab$module_subcategory), ]
data_tab[, c(2:4)] <- NULL
data_tab[, c(9:max(ncol(data_tab)))] <- NULL
data_tab <- data_tab %>% replace(is.na(.), 0)

names(data_tab) <- gsub(x = names(data_tab), pattern = "module_completeness_", replacement = "")  
write.table(data_tab, "metabolism/00_HELPER_FILES/module_data.txt", 
            sep = "\t", quote = FALSE, row.names = FALSE)

write.table(data_tab$kegg_module, "metabolism/00_HELPER_FILES/module_order.txt", 
            sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)

[1] “unique_row_id” “bin_name” “kegg_module” “module_name”
[5] “module_class” “module_category” “module_subcategory” “module_definition”
[9] “module_completeness” “module_is_complete” “kofam_hits_in_module” “gene_caller_ids_in_module”

coverage <-  read.table("metabolism/00_HELPER_FILES/WATER-GENE-COVERAGES.txt", header = TRUE)
coverage$key <- as.factor(coverage$key)
kofam_unique_coveage <- dplyr::left_join(kofam_unique, coverage, by = c("gene_callers_id" = "key"))
kofam_unique_coveage[, c("gene_callers_id", "source", 
                         "ko_function", "e_value", 
                         "WCCR_1913", "WCCR_1916")] <- list(NULL)
kofam_unique_coveage1 <- kofam_unique_coveage[which(rowSums(kofam_unique_coveage[,2:3]) != 0),]
write.table(kofam_unique_coveage1, "metabolism/00_HELPER_FILES/kofam_unique_coveage1.txt", 
            quote = FALSE, sep = "\t", row.names = FALSE)
kofam_unique_coveage1_u <- kofam_unique_coveage1[!duplicated(kofam_unique_coveage1$accession),]
kofam_unique_coveage1_u <- kofam_unique_coveage1_u %>% tibble::remove_rownames() %>%
  tibble::column_to_rownames("accession")

kofam_unique_coveage1_u <- data.matrix(kofam_unique_coveage1_u[1:2], rownames.force = NULL)
i <- 1
pv.out <- pathview(gene.data = kofam_unique_coveage1_u[, 1:2], 
                   pathway.id = "00920", 
                   species = "ko", out.suffix = "all", 
                   kegg.native = TRUE)
str(pv.out)
head(pv.out$plot.data.gene)
#result PNG file in current directory

mag01 <- read.table('metabolism/06_MAG_KOFAMS/16s-asv_data-2019.txt', header = T)
all_2019 <- dplyr::bind_rows(rrna, amf, its)

mag02 <- read.table("mag02.txt", header = FALSE)
mag02 <- c(mag02$V1)
#KEGG view: gene data only


#load data
data(gse16873.d)
data(demo.paths)

test <- data.frame(gse16873.d)
test <- test %>% tibble::rownames_to_column("id")

test[duplicated(test$id),]
dim(test[duplicated(test$id),])[1]

test$id

#KEGG view: gene data only
i <- 1
pv.out <- pathview(gene.data = gse16873.d[, 1], 
                   pathway.id = demo.paths$sel.paths[i], 
                   species = "hsa", out.suffix = "gse16873", 
                   kegg.native = TRUE)
str(pv.out)
head(pv.out$plot.data.gene)
#result PNG file in current directory

#Graphviz view: gene data only
pv.out <- pathview(gene.data = gse16873.d[, 1], 
                   pathway.id = demo.paths$sel.paths[i], 
                   species = "hsa", out.suffix = "gse16873", 
                   kegg.native = FALSE, sign.pos = demo.paths$spos[i])
#result PDF file in current directory

#KEGG view: both gene and compound data
sim.cpd.data=sim.mol.data(mol.type="cpd", nmol=3000)
i <- 3
print(demo.paths$sel.paths[i])
pv.out <- pathview(gene.data = gse16873.d[, 1], cpd.data = sim.cpd.data, 
                   pathway.id = demo.paths$sel.paths[i], 
                   species = "hsa", out.suffix = "gse16873.cpd", 
                   keys.align = "y", kegg.native = TRUE, 
                   key.pos = demo.paths$kpos1[i])
str(pv.out)
head(pv.out$plot.data.cpd)

#multiple states in one graph
set.seed(10)
sim.cpd.data2 = matrix(sample(sim.cpd.data, 18000, replace = TRUE), ncol = 6)
pv.out <- pathview(gene.data = gse16873.d[, 1:3], 
                   cpd.data = sim.cpd.data2[, 1:2], 
                   pathway.id = demo.paths$sel.paths[i], 
                   species = "hsa", out.suffix = "gse16873.cpd.3-2s", 
                   keys.align = "y", kegg.native = TRUE, 
                   match.data = FALSE, multi.state = TRUE, same.layer = TRUE)
str(pv.out)
head(pv.out$plot.data.cpd)

#result PNG file in current directory

##more examples of pathview usages are shown in the vignette.

arco <- read.table('arco-phylo-kegg-modules_modules.txt', sep = "\t", 
                    header = TRUE,  quote = "")
arco_t <- arco
arco_t[, c("unique_id", "db_name", "module_name", "module_class", "module_category", "module_subcategory", "module_definition", "kofam_hits_in_module", "gene_caller_ids_in_module", "module_completeness")] <- list(NULL)

arco_t$module_is_complete <- str_replace(arco_t$module_is_complete, "True", "1")
arco_t$module_is_complete <- str_replace(arco_t$module_is_complete, "False", "0")
arco_t$module_is_complete <- as.integer(arco_t$module_is_complete)
arco_t_w <- arco_t %>% tidyr::pivot_wider(names_from = c("kegg_module"), values_from = c("module_is_complete"))
arco_t_w <- arco_t_w[, colSums(arco_t_w != 0) > 0]
arco_t_w <- column_to_rownames(arco_t_w, var = "genome_name")
arco_t_w <- arco_t_w[, colSums(arco_t_w > 0) <= 70]
arco_t_w <- rownames_to_column(arco_t_w, var = "genome_name")
g_order <- read.table('genome_order.txt', 
                    header = TRUE,  quote = "")

arco_t_w_order <- dplyr::left_join(g_order, arco_t_w, 
                                   by = "genome_name")
arco_t_l <- arco_t_w_order %>% pivot_longer(cols = 2:34, names_to = "kegg_module", values_to = "module_is_complete")

arco_t_w_order_no_row_id <- arco_t_w_order %>% tibble::column_to_rownames("genome_name")


heatmaply(arco_t_w_order_no_row_id, 
          Rowv = NULL, Colv = NULL, 
          k_row = 1, k_col = 1, 
          file = "heatmaply_plot.html", 
          colors = c("white", "red"), 
          hide_colorbar = TRUE, 
          fontsize_row = 6)

coverage <-  read.table("WATER-GENE-COVERAGES.txt", header = TRUE)
kofam <- read.table("00_HELPER_FILES/KOfam.txt", sep = "\t", 
                    header = TRUE,  quote = "")
kofam_cov <- dplyr::left_join(kofam, coverage, by = c("gene_callers_id" = "key"))

anvi-search-functions --contigs-db 03_CONTIGS/WATER-contigs.db  --full-report ko_search.txt --annotation-sources KOfam --search-terms K17222,K17223,K17227,K17226,K17224,K17225,K22622,K00170,K00169,K00171,K00172,K00174,K00175,K00240,K01676,K00244,K00246,K00245,K01007,K01682,K00407,K00406,K00404,K00405

kofam_search <- read.table("00_HELPER_FILES/ko_search.txt", sep = "\t", 
                    header = TRUE,  quote = "")
kofam_seach_cov <- dplyr::left_join(kofam_search, kofam_cov, by = "gene_callers_id")
write.table(kofam_seach_cov, "kofam_seach_cov.txt", sep = "\t", quote = FALSE, row.names = FALSE)

Source Code

The source code for this page can be accessed on GitHub by clicking this link.

No 7. KEGG/KOFam Analysis