Diving deeper into funtional annotation. This is a work in progress and is not complete.
::p_load(tidyverse, seqinr, Biostrings, stringr,
pacman
plyr, dplyr, pathview, heatmaply, install = FALSE, update = FALSE)
The objective of this section is to visualize all of the genes that have KEGG-KOfam annotations and assess their distribution across metagenomes. Our goal is to create a contig.db that contains only genes annotated by KEGG-KOfams and a corresponding profile.db that contains coverage and detection of genes across each metagenome. We also will generate files that have other important information about each gene such as taxonomy, KOfam functions(s), and KEGG modules (if any).
In order to create a basic contigs.db of genes, we need a) a fasta file and b) an external gene calls file, both of which only contain KOfam annotated genes. The final command will look something like this:
anvi-gen-contigs-database --contigs-fasta kofam_genes_rn.fa \
--output-db-path KOFAMS-contigs.db \
--external-gene-calls kofam_external_gene_calls.txt \
First we run the annotation on the full contig.db
anvi-run-kegg-kofams -c 03_CONTIGS/WATER-contigs.db \
--kegg-data-dir /PATH/to/KEGG/db \
-T $NSLOTS
And here is the final portion of the output from the command.
Number of raw hits ...........................: 631,174
Number of weak hits removed ..................: 619,356
Number of hits in annotation dict ...........: 11,818
Gene functions ...............................: 11818 function calls from 1 sources for 11588 unique gene calls has been added to the contigs database.
Gene functions ...............................: 3240 function calls from 1 sources for 3198 unique gene calls has been added to the contigs database.
Gene functions ...............................: 3240 function calls from 1 sources for 3198 unique gene calls has been added to the contigs database.
As you can see, 11,818 KOfam function calls were added from 11,588 unique gene calls, meaning a small percentage of genes had multiple KOfam hits. This is unavoidable and we will need to deal with that at some point, but not now. If you run something like anvi-export-gene-calls -c contigs.db
, we find out the assembly has 65,102 gene calls, which tells us that ~18% of the genes have KOfam annotations. What the output also tells us is that 3240 KEGG Module calls were also added to the data base.
Next, we export all the KOfam function calls. To keep things organized, we can keep this entire analysis in a separate directory (METABOLISM
) and store the files we need to build and annotate in the subdirectory 00_HELPER_FILES
.
mkdir -p METABOLISM/00_HELPER_FILES/
anvi-export-functions -c 03_CONTIGS/WATER-contigs.db \
--annotation-sources KOfam \
-o METABOLISM/00_HELPER_FILES/KOfam.txt
Again, this file has 11,818 hits from 11,588 unique gene calls. The file is tab-delimited and contains the KOfam accession
number, function
, and e_value
for each gene_callers_id
. We need a list of unique gene IDs to build a contig database so for now we select the hit with the lowest e_value
. It really doesn’t matter at this point but just in case we decide latter to only present a single function call per gene, this approach seems to make the most sense.
<- read.table("metabolism/00_HELPER_FILES/KOfam.txt",
kofam sep = "\t", header = TRUE, quote = "")
<- kofam %>% dplyr::rename(ko_function = function.)
kofam $ko_function <- stringr::str_replace(kofam$ko_function, "'", "")
kofam$gene_callers_id <- as.factor(kofam$gene_callers_id)
kofam<- kofam[order(kofam$gene_callers_id, kofam$e_value), ]
kofam <- kofam[!duplicated(kofam$gene_callers_id), ]
kofam_unique write.table(kofam_unique, "metabolism/00_HELPER_FILES/KOfam_unique.txt",
sep = "\t", quote = FALSE, row.names = FALSE)
head(kofam_unique)
Same file but with duplicate gene IDs removed.
Next, we create a file containing a list of the just the unique gene_callers_id
s so that we can parse out the KOfam annotated gene sequence.
<- kofam_unique
kofam_genes c('accession', 'ko_function', 'e_value', 'source')] <- list(NULL)
kofam_genes[, write.table(kofam_genes, "metabolism/00_HELPER_FILES/kofam_genes_list.txt",
sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
We can now use this file to get the sequences for each gene caller ID that has a KOfam hit from the original contigs.db.
anvi-get-sequences-for-gene-calls \
-c 03_CONTIGS/WATER-contigs.db \
--gene-caller-ids kofam_genes_list.txt \
--wrap 0 \
-o METABOLISM/00_HELPER_FILES/kofam_genes.fa
We need to tweak the deflines of the fasta file before building the new database but we will do that in a minute. For now assume the fasta file is fine as is. First, we create the --external-gene-calls
file. This file allows us to skip many of the steps anvi’o normally performs when building a database, like contig splitting and gene calling. Remember, we already did this with the assembly, which is where these genes came from in the first place. Our goal is to create a database of gene calls containing information as close to the original database as possible so that the two databases are comparable.
anvi-export-gene-calls -c 03_CONTIGS/WATER-contigs.db \
--gene-caller prodigal \
-o METABOLISM/00_HELPER_FILES/kofam_gene_calls.txt
This command gives us a file containing all 65,102 gene calls. We can use the list of unique gene calls created above (kofam_genes_list.txt
) to pull out the relevant IDs.
<- read.table("metabolism/00_HELPER_FILES/kofam_gene_calls.txt",
all_genes sep = "\t", header = TRUE)
<- all_genes %>% dplyr::mutate(gene_callers_id = as.character(gene_callers_id))
all_genes <- dplyr::inner_join(kofam_genes, all_genes,
kofam_genes_tab by = "gene_callers_id")
head(kofam_genes_tab)
We need to modify this file a bit to make it compatible with what anvi’o needs to build the database. You see, anvi’o is expecting contigs not genes. When we give anvi’o an external gene calls file for anvi-gen-contigs-database
we are basically saying that we performed our own gene calling on the contigs in the fasta file. Anvi’o will look at the contig
ID in the external gene calls file and compare those IDs to the fasta deflines. Problem is that the deflines are named by genes, not contigs. We could just make up new names but then we lose the original contig information and gene IDs for each call. So instead, we add the original gene caller id to the original contig name. This not only preserves the contig information but also avoids any duplication of contig IDs, since many genes are found on the same contig.
<- kofam_genes_tab %>% unite("contig", contig, gene_callers_id,
kofam_genes_tab remove = FALSE, sep = "_gene_")
<- kofam_genes_tab[, c(2,1,7,8,4,5,3,6,9,10)]
kofam_genes_tab head(kofam_genes_tab)
Better. We still have the original contig IDs, now appended with the gene ID. If we just give anvi’o this file, it will look at gene caller id 9
, and see that this gene came from position 443–896 on contig WATER_000000000004_gene_9
. The problem is that when anvi’o looks in the fasta file at the corresponding sequence, it will not find a sequence greater than or equal to 896 bps. Instead, it will find a sequence that is 453 bps long, since this is the length of the gene and not the contig. Make sense? So we need to trick anvi’o a bit by making it think each “contig” just happens to be the exact length of a given gene. To do this, we need to adjust the start and stop values. If you don’t do this you will get an error saying something like IndexError: list assignment index out of range
. This is because of the length discrepency just described. To circumvent this problem, we can set all start values to 0
and adjust the stop values to the full gene length.
<- kofam_genes_tab %>% dplyr::mutate(stop = (stop - start), .after = 4)
kofam_genes_tab $start <- 0
kofam_genes_tabhead(kofam_genes_tab)
write.table(kofam_genes_tab, "metabolism/00_HELPER_FILES/kofam_external_gene_calls.txt",
sep = "\t", quote = FALSE, row.names = FALSE)
Back to gene 9
, which is still 453 bp long, but now starts at the beginning of the sequence.
This is the new
external-gene-calls
file.
If you take a peak at the kofam_genes.fa
gene fasta file we generated above, you will see the deflines are named after the genes, not contigs. Again, avni’o is expecting contig IDs. Time to play another trick. We will take the new contig IDs, appended with the gene IDs, and use those in our fasta file instead of the gene IDs. Things get a bit weird here since we are trying to substitute the new contig name for the original gene caller id in the fasta file.
We start by converting the fasta file to a data frame and then ordering the data frame by the sequence name so they are in the same order as the contig names in the external-gene-calls file
.
<- readDNAStringSet("metabolism/00_HELPER_FILES/kofam_genes.fa")
kofam_fna2 <- names(kofam_fna2)
kofam_names <- paste(kofam_fna2)
kofam_seqs <- data.frame(kofam_names, kofam_seqs)
fasta_df $kofam_names <- as.numeric(as.character(fasta_df$kofam_names))
fasta_df<- dplyr::arrange(fasta_df, kofam_names) fasta_df
Next, add the contigs column from the external-gene-calls file
to the data frame and make sure the names match.
<- fasta_df %>% dplyr::mutate(new_id = kofam_genes_tab$contig)
fasta_df <- fasta_df[, c(1,3,2)]
fasta_df $kofam_names <- NULL
fasta_df$new_id <- sub("", ">", fasta_df$new_id)
fasta_dfwrite.table(fasta_df, "metabolism/00_HELPER_FILES/kofam_genes_rn.fa",
sep = "\r", col.names = FALSE, row.names = FALSE,
quote = FALSE, fileEncoding = "UTF-8")
head(fasta_df)
This is the new
contigs-fasta
file.
And that’s it. It is a really good idea to double check the two files to make certain everything looks OK.
And finally we can create the database. We want anvi’o to do as little as possible so we use some additional flags to prevent any unnecessary processing.
anvi-gen-contigs-database --contigs-fasta 00_HELPER_FILES/kofam_genes_rn.fa \
--project-name KOFAMS \
--output-db-path 01_CONTIGS/KOFAMS-contigs.db \
--external-gene-calls 00_HELPER_FILES/kofam_external_gene_calls.txt \
--ignore-internal-stop-codons \
--skip-predict-frame
And here is the final output of the command…
Input FASTA file .............................: kofam_genes_rn.fa
Name .........................................: KOFAMS
Description ..................................: No description is given
External gene calls ..........................: 11588 gene calls recovered and will be processed.
WARNING
===============================================
Anvi'o found amino acid sequences in your external gene calls file that match to
11588 of 11588 gene in it and will use these amino acid sequences for
everything.
EXTERNAL GENE CALLS PARSER REPORT
===============================================
Num gene calls in file .......................: 11,588
Non-coding gene calls ........................: 0
Partial gene calls ...........................: 5,748
Num amino acid sequences provided ............: 11,588
- For complete gene calls ..................: 5,840
- For partial gene calls ...................: 5,748
Frames predicted .............................: 0
- For complete gene calls ..................: 0
- For partial gene calls ...................: 0
Gene calls marked as NONCODING ...............: 0
- For complete gene calls ..................: 0
- For partial gene calls ...................: 0
Gene calls with internal stops ...............: 0
- For complete gene calls ..................: 0
- For partial gene calls ...................: 0
CONTIGS DB CREATE REPORT
===============================================
Split Length .................................: 20,000
K-mer size ...................................: 4
Skip gene calling? ...........................: False
External gene calls provided? ................: True
External gene calls file have AA sequences? ..: True
Proper frames will be predicted? .............: False
Ignoring internal stop codons? ...............: True
Splitting pays attention to gene calls? ......: True
Contigs with at least one gene call ..........: 11588 of 11588 (100.0%)
Contigs database .............................: A new database, KOFAMS-contigs.db, has been created.
Number of contigs ............................: 11,588
Number of splits .............................: 11,588
Total number of nucleotides ..................: 9,249,465
Gene calling step skipped ....................: False
Splits broke genes (non-mindful mode) ........: False
Desired split length (what the user wanted) ..: 20,000
Average split length (what anvi'o gave back) .: (Anvi'o did not create any splits)
We could rerun the KOfam annotation, but since we already did that to get these genes in the first place, there really is no need. If you do not know what annotations are in your original contigs database, run the command like this.
anvi-export-functions -c 03_CONTIGS/WATER-contigs.db \
--list-annotation-sources
FUNCTIONAL ANNOTATION SOURCES FOUND
===============================================
* COG_CATEGORY
* COG_FUNCTION
* Pfam
* KOfam
* Transfer_RNAs
* KEGG_Module
OK, so we have KOfam
and KEGG_Module
annotations.
Remember, to get annotations you need to run this:
anvi-export-functions --contigs-db 03_CONTIGS/WATER-contigs.db \
--annotation-sources KOfam \
--output-file METABOLISM/00_HELPER_FILES/KOfam.txt
And then import the annotation into the new database.
anvi-import-functions --contigs-db 01_CONTIGS/KOFAMS-contigs.db \
--input-files 00_HELPER_FILES/KOfam.txt
Gene functions ...............................: 11818 function calls from 1 sources for 11588 unique gene calls has been added to the contigs database.
We can also import the KEGG_Module
annotation. First, export the information from the original contigs.db.
anvi-export-functions --contigs-db 03_CONTIGS/WATER-contigs.db \
--annotation-sources KEGG_Module \
-output-file METABOLISM/00_HELPER_FILES/KEGG_Module.txt
There are two things we need to change in the KEGG_Module.txt
file before importing the functions into the new contigs.db. The first is related to anvi’o itself, specifically the output file from the command above contains the value of NONE
for every entry in the e_value
column. When we try to import the function calls into our new database, anvi’o will throw an error because it needs an actual e value. We can just set the value to 0
to make anvi’o happy. The other problem is that when we read the file into R, the column called function
is changed to function.
with a period since function
has a special meaning in R. So we need to change it back to the original column name. In order to import gene functions, anvi’o needs very specific column headers.
<- read.table("metabolism/00_HELPER_FILES/KEGG_Module.txt",
kegg_mod sep = "\t", header = TRUE)
$e_value <- 0
kegg_modcolnames(kegg_mod) <- c("gene_callers_id", "source",
"accession", "function", "e_value")
write.table(kegg_mod, "metabolism/00_HELPER_FILES/KEGG_Module_mod.txt",
sep = "\t", quote = FALSE, row.names = FALSE)
And now we import the modified annotation file. To run this you need to be in the METABOLISM
directory.
anvi-import-functions --contigs-db 01_CONTIGS/KOFAMS-contigs.db \
--input-files 00_HELPER_FILES/KEGG_Module_mod.txt
Gene functions ...............................: 3240 function calls from 1 sources for 3198 unique gene calls has been added to the contigs database.
Next, we profile each metagenome against the new gene database. Profiling quantifies things like coverage, detection, as well as single-nucleotide, single-codon, and single-amino acid for each metagenome. The default input for profiling is a BAM file, so let’s map short reads to the database. variants, etc.
mkdir 02_MAPPING_TO_KOFAMS
bowtie2-build kofam_genes_rn.fa 02_MAPPING_TO_KOFAMS/KOFAMS
for sample in `cat samples_WATER.txt`
do
bowtie2 -x 02_MAPPING_TO_KOFAMS/KOFAMS \
-1 00_QC/$sample-QUALITY_PASSED_R1.fastq.gz \
-2 00_QC/$sample-QUALITY_PASSED_R2.fastq.gz \
-S 02_MAPPING_TO_KOFAMS/$sample-in-KOFAMS.sam
--no-unal --threads $NSLOTS
samtools view -F 4 -bS 02_MAPPING_TO_KOFAMS/$sample-in-KOFAMS.sam > \
$sample-in-KOFAMS-RAW.bam
02_MAPPING_TO_KOFAMS/
# sort and index the BAM file:
samtools sort 02_MAPPING_TO_KOFAMS/$sample-in-KOFAMS-RAW.bam \
-o 02_MAPPING_TO_KOFAMS/$sample-in-KOFAMS.bam
samtools index 02_MAPPING_TO_KOFAMS/$sample-in-KOFAMS.bam
# remove temporary files:
rm 02_MAPPING_TO_KOFAMS/$sample-in-KOFAMS.sam \
$sample-in-KOFAMS-RAW.bam
02_MAPPING_TO_KOFAMS/done
Then we profile each metagenome.
mkdir 03_KOFAMS_PROFILES
for sample in `cat samples_WATER.txt`
do
anvi-profile --contigs-db KOFAMS-contigs.db \
--input-file 02_MAPPING_TO_KOFAMS/$sample-in-KOFAMS.bam \
--profile-SCVs \
--num-threads 1 \
--min-contig-length 10 \
--output-dir 03_KOFAMS_PROFILES/$sample-in-KOFAMS
done
And finally merge all of the profiles so that we can make comparisons across metagenomes.
anvi-merge 03_KOFAMS_PROFILES/*-in-KOFAMS/PROFILE.db \
--contigs-db KOFAMS-contigs.db \
--output-dir 04_KOFAM_PROFILES-MERGED/
All set. We have a contigs.db of all KOfam annotated genes and a profile.db. We could just run a visualiztion on the data we have so far, but what we want to do instead in to create some additional files to enhance the visualiztion.
Most of the time you use the interactive interface you are looking at two types of data frames—items, which in this case are the genes and layers, which in this case are the samples. Click here for more details. We can add basically any type of data we wish for either of these data frames. We will do this piecemeal and most of what we add is already somewhere in our various databases, just not yet in a format that is accessible by the interactive.
So let’s start by generating additional info for the items, that is the genes.
To add taxonomic info for each gene call we need to two tables that can be generated from the original contigs database.
anvi-export-table 03_CONTIGS/WATER-contigs.db \
--table genes_taxonomy \
--output-file METABOLISM/00_HELPER_FILES/genes_taxonomy.txt
anvi-export-table 03_CONTIGS/WATER-contigs.db \
--table taxon_names \
--output-file METABOLISM/00_HELPER_FILES/taxon_names.txt
The first file (genes_taxonomy.txt
) provides a taxon ids for all genes in the database. Again, we could rerun the classification on just the KOfam annotated genes but why bother. The second file (taxon_names.txt
) is a lookup table for each taxon id. What we need to do is join these two table so that each gene id is coupled with a taxon id and a lineage. Then merge these data with of list of mock contig names in the in the KOfam database.
<- read.table("metabolism/00_HELPER_FILES/taxon_names.txt",
tax_names sep = "\t", header = TRUE, quote = "")
<- read.table("metabolism/00_HELPER_FILES/genes_taxonomy.txt",
gene_tax sep = "\t", header = TRUE, quote = "")
<- dplyr::inner_join(gene_tax, tax_names, by = "taxon_id")
gene_tax_names <- kofam_genes
kofam_genes_n $gene_callers_id <- as.numeric(as.character(kofam_genes_n$gene_callers_id))
kofam_genes_n<- dplyr::left_join(kofam_genes_n, gene_tax_names,
kofam_with_tax by = "gene_callers_id")
head(kofam_with_tax)
write.table(kofam_with_tax, "metabolism/00_HELPER_FILES/kofam_with_tax.txt", sep = "\t",
quote = FALSE, row.names = FALSE)
As you can see, we now have each gene associated with a taxonmic lineage (when one was identified). We could remove genes with NA
from the taxonomy file but anvi’o does not seem to have a problem with it so let’s leave it be. Now we can import the taxonomy files into the new contigs database.
anvi-import-taxonomy-for-genes --contigs-db 01_CONTIGS/KOFAMS-contigs.db \
--input-files 00_HELPER_FILES/kofam_with_tax.txt \
--parser default_matrix
Here is the output.
Total num hits found .........................: 11,588
Num gene caller ids in the db ................: 11,588
Num gene caller ids in the incoming data .....: 11,588
Taxon names table ............................: Updated with 1467 unique taxon names
Genes taxonomy table .........................: Taxonomy stored for 11588 gene calls
Splits taxonomy ..............................: Input data from "default_matrix" annotated 11588 of 11588 splits (100.0%) with taxonomy.
Even though we have this information, adding to the interactive is tricky because some genes have multiple annotations. Our main purpose for the visualiztion is exploratory analysis so we do not need to capture everything, but this is still a concern, especially when we are trying to generate the files we need. To narrow down our efforts, let’s focus on retrieving three pieces of information.
Which KOfam hits are part of complete Pathway Modules?
What module(s) is/are they a part of?
What hits are found in which MAGs?
Get module information for each gene call.
We use the command called anvi-estimate-metabolism
with the --kegg-output-modes modules
flag. This will generate a file (all_kofam_hits.txt
) that has the gene caller ID, the KO accession number, an e-value, and what module(s) the gene belongs to.
anvi-estimate-metabolism --contigs-db 03_CONTIGS/WATER-contigs.db \
--kegg-data-dir ~/Desktop/kegg-kofams \
--metagenome-mode \
--output-file-prefix METABOLISM/00_HELPER_FILES/all \
--kegg-output-modes kofam_hits
As before, in cases of multiple KO hits for a single gene call, we select the hit with the lowest e-value. We could do what we did above or we can use the table with unique gene hits that is stored in the variable kofam_unique
. What we do here is merge the two tables by gene caller id and KO accession number. This way we get only the top hits from the all_modules.txt
file.
<- read.table("metabolism/00_HELPER_FILES/all_kofam_hits.txt",
genes_all_ko sep = "\t", header = TRUE, quote = "")
c('unique_id', 'metagenome_name',
genes_all_ko[, 'contig', 'source')] <- list(NULL)
<- genes_all_ko[, c(2, 1, 3)]
genes_all_ko <- genes_all_ko
genes_all_ko_alt <- kofam_unique
kofam_unique1 c('source', 'ko_function', 'e_value' )] <- NULL
kofam_unique1[, $gene_callers_id <- as.numeric(as.character(kofam_unique1$gene_callers_id))
kofam_unique1<- dplyr::left_join(kofam_unique1, genes_all_ko,
genes_all_ko by = c("gene_callers_id" = "gene_caller_id" ,
"accession" = "ko"))
### ALT TO USE ALL GENE CALLS ###
<- kofam
kofam1 c('source', 'ko_function', 'e_value' )] <- NULL
kofam1[, $gene_callers_id <- as.numeric(as.character(kofam1$gene_callers_id))
kofam1<- dplyr::left_join(kofam1, genes_all_ko_alt,
genes_all_ko_alt by = c("gene_callers_id" = "gene_caller_id" ,
"accession" = "ko"))
OK. now we have a data frame that contains gene IDs, KO hits, and module IDs (if any). Some KOs are found in multiple modules while some are not found in any modules. We will deal with the multiple module cases in a minute. For now, let’s remove any hits not found in a module. Anvi’o doesn’t need information for all genes and removing genes with no hits will make parsing the data a tiny bit easier.
#genes_ko_only <- genes_all_ko[!(genes_all_ko$modules_with_ko=="None"),]
#genes_ko_only <- genes_ko_only %>% dplyr::rename(kegg_module = modules_with_ko)
#genes_ko_only_long <- genes_ko_only %>% tidyr::separate_rows(kegg_module)
### ALT TO USE ALL GENE CALLS ###
<- genes_all_ko_alt[!(genes_all_ko_alt$modules_with_ko=="None"),]
genes_ko_only_alt <- genes_ko_only_alt %>% dplyr::rename(kegg_module = modules_with_ko)
genes_ko_only_alt <- genes_ko_only_alt %>% tidyr::separate_rows(kegg_module) genes_ko_only_long_alt
And we get a list of 3,210 genes that have module affiliations. All we know right now is the module numbers, not the names.
anvi-estimate-metabolism --contigs-db 03_CONTIGS/WATER-contigs.db
--kegg-data-dir ~/Desktop/kegg-kofams/
--metagenome-mode
--output-file-prefix METABOLISM/00_HELPER_FILES/all
--kegg-output-modes modules
<- read.table("metabolism/00_HELPER_FILES/all_modules.txt",
all_mod sep = "\t", header = TRUE, quote = "")
c('module_definition', 'metagenome_name',
all_mod[, 'module_class', 'module_is_complete',
'module_completeness', 'gene_caller_ids_in_module',
'kofam_hits_in_module')] <- list(NULL)
<- all_mod[, c(1,2,4,5,3)]
all_mod <- all_mod[,c(1,2)]
mod_tab <- all_mod[,c(1,2,3)]
cat_tab <- all_mod[,c(1,2,4)]
sub_tab <- all_mod[,c(1,2,5)] name_tab
<- dplyr::left_join(genes_ko_only_long_alt, mod_tab,
genes_mod by = "kegg_module") %>%
::drop_na("unique_id")
tidyr<- genes_mod %>% tibble::rowid_to_column("unique_row_id")
genes_mod <- genes_mod
genes_mod_ko <- genes_mod[, c(2,1,4)]
genes_mod <- genes_mod %>% dplyr::mutate_if(is.numeric,
genes_mod
as.character,
is.factor, %>%
as.character) ::unite("path", unique_row_id:kegg_module,
tidyrremove = TRUE, sep = "XXX") %>%
::pivot_wider(names_from = c("path"), values_from = c("path")) %>%
tidyr::unite("kegg_module", -"gene_callers_id",
tidyrna.rm = TRUE, remove = TRUE, sep = "!!!")
<- max(str_count(genes_mod$kegg_module, "!!!")) + 2
n_reps
<- genes_mod %>% separate('kegg_module',
genes_mod paste("module", 1:n_reps, sep = "_"),
sep = "!!!", extra = "drop")
#######################################
<- dplyr::left_join(genes_ko_only_long_alt, cat_tab,
genes_cat by = "kegg_module") %>%
::drop_na("unique_id")
tidyrc("accession", "kegg_module")] <- list(NULL)
genes_cat[, <- genes_cat %>% tibble::rowid_to_column("unique_row_id")
genes_cat <- genes_cat[, c(2,1,4)]
genes_cat
<- genes_cat %>% dplyr::mutate_if(is.numeric,
genes_cat
as.character,
is.factor, %>%
as.character) ::unite("path", unique_row_id:module_category,
tidyrremove = TRUE, sep = "XXX") %>%
::pivot_wider(names_from = c("path"), values_from = c("path")) %>%
tidyr::unite("module_category", -"gene_callers_id",
tidyrna.rm = TRUE, remove = TRUE, sep = "!!!")
<- max(str_count(genes_cat$module_category, "!!!")) + 2
n_reps
<- genes_cat %>% separate('module_category',
genes_cat paste("category", 1:n_reps, sep = "_"),
sep = "!!!", extra = "drop")
#######################################
<- dplyr::left_join(genes_ko_only_long_alt, sub_tab,
genes_sub by = "kegg_module") %>%
::drop_na("unique_id")
tidyrc("accession", "kegg_module")] <- list(NULL)
genes_sub[, <- genes_sub %>% tibble::rowid_to_column("unique_row_id")
genes_sub <- genes_sub[, c(2,1,4)]
genes_sub
<- genes_sub %>% dplyr::mutate_if(is.numeric,
genes_sub
as.character,
is.factor, %>%
as.character) ::unite("path", unique_row_id:module_subcategory,
tidyrremove = TRUE, sep = "XXX") %>%
::pivot_wider(names_from = c("path"), values_from = c("path")) %>%
tidyr::unite("module_subcategory", -"gene_callers_id",
tidyrna.rm = TRUE, remove = TRUE, sep = "!!!")
<- max(str_count(genes_sub$module_subcategory, "!!!")) + 2
n_reps
<- genes_sub %>% separate('module_subcategory',
genes_sub paste("subcategory", 1:n_reps, sep = "_"),
sep = "!!!", extra = "drop")
#######################################
<- dplyr::left_join(genes_ko_only_long_alt, name_tab,
genes_name by = "kegg_module") %>%
::drop_na("unique_id")
tidyrc("accession", "kegg_module")] <- list(NULL)
genes_name[, <- genes_name %>% tibble::rowid_to_column("unique_row_id")
genes_name <- genes_name[, c(2,1,4)]
genes_name <- genes_name %>% dplyr::mutate_if(is.numeric,
genes_name
as.character,
is.factor, %>%
as.character) ::unite("path", unique_row_id:module_name,
tidyrremove = TRUE, sep = "XXX") %>%
::pivot_wider(names_from = c("path"), values_from = c("path")) %>%
tidyr::unite("module_name", -"gene_callers_id",
tidyrna.rm = TRUE, remove = TRUE, sep = "!!!")
<- max(str_count(genes_name$module_name, "!!!")) + 2
n_reps
<- genes_name %>% separate('module_name',
genes_name paste("name", 1:n_reps, sep = "_"),
sep = "!!!", extra = "drop")
#######################################
<- genes_mod_ko
genes_ko <- genes_ko[, c(2,1,3)]
genes_ko <- genes_ko %>% dplyr::mutate_if(is.numeric,
genes_ko
as.character,
is.factor, %>%
as.character) ::unite("path", unique_row_id:accession,
tidyrremove = TRUE, sep = "XXX") %>%
::pivot_wider(names_from = c("path"), values_from = c("path")) %>%
tidyr::unite("ko", -"gene_callers_id",
tidyrna.rm = TRUE, remove = TRUE, sep = "!!!")
<- max(str_count(genes_ko$ko, "!!!")) + 2
n_reps
<- genes_ko %>% separate('ko',
genes_ko paste("module", 1:n_reps, sep = "_"),
sep = "!!!", extra = "drop")
2:n_reps] <- apply(genes_mod[, 2:n_reps], 2, function(x) gsub("\\d+XXX", "", x))
genes_mod[, <- genes_mod %>% tidyr::unite("kegg_module", -"gene_callers_id",
genes_mod na.rm = TRUE, remove = TRUE, sep = "!!!")
2:n_reps] <- apply(genes_cat[, 2:n_reps], 2, function(x) gsub("\\d+XXX", "", x))
genes_cat[, <- genes_cat %>% tidyr::unite("module_category", -"gene_callers_id",
genes_cat na.rm = TRUE, remove = TRUE, sep = "!!!")
2:n_reps] <- apply(genes_sub[, 2:n_reps], 2, function(x) gsub("\\d+XXX", "", x))
genes_sub[, <- genes_sub %>% tidyr::unite("module_subcategory", -"gene_callers_id",
genes_sub na.rm = TRUE, remove = TRUE, sep = "!!!")
2:n_reps] <- apply(genes_name[, 2:n_reps], 2, function(x) gsub("\\d+XXX", "", x))
genes_name[, <- genes_name %>% tidyr::unite("module_name", -"gene_callers_id",
genes_name na.rm = TRUE, remove = TRUE, sep = "!!!")
2:n_reps] <- apply(genes_ko[, 2:n_reps], 2, function(x) gsub("\\d+XXX", "", x))
genes_ko[, <- genes_ko %>% tidyr::unite("ko", -"gene_callers_id",
genes_ko na.rm = TRUE, remove = TRUE, sep = "!!!")
<- dplyr::left_join(genes_ko, genes_mod, by = "gene_callers_id") %>%
genes_modules ::left_join(., genes_cat, by = "gene_callers_id") %>%
dplyr::left_join(., genes_sub, by = "gene_callers_id") %>%
dplyr::left_join(., genes_name, by = "gene_callers_id")
dplyrwrite.table(genes_modules, "metabolism/00_HELPER_FILES/genes_modules.txt",
quote = FALSE, sep = "\t")
Now we have a table of 2842 unique genes and their KEGG module annotations. The last thing we need to do is change the gene caller IDs to “contig IDs” to match the name of the slits in the contig.db. Way back uo the road we created a variable called kofam_genes_tab
, a table derived from a gene calls file where we added the gene call onto the end of the contig name. We need to do the same with the new KEGG modules annotation file so we can add this as an additional layer to the visualization. And, we also need to tack on a split name like we did for the fasta file.
<- genes_modules
genes_modules1 <- kofam_genes_tab
kofam_genes_tab1 <- dplyr::right_join(kofam_genes_tab1, genes_modules1, by = "gene_callers_id")
add_view_part_1 3:10] <- NULL
add_view_part_1[, 1] <- NULL
add_view_part_1[, <- add_view_part_1 %>% dplyr::mutate(split_id = "split_00001", .after = 1) %>%
add_view_part_1 ::unite(items, contig, split_id, remove = TRUE, sep = "_")
tidyrwrite.table(add_view_part_1, "metabolism/05_ANNOTATION_FILES/additional-items-kegg-functions.txt",
quote = FALSE, sep = "\t", row.names = FALSE)
Now we can run the following command to import the annotation data into the profile database.
anvi-import-misc-data 05_ANNOTATION_FILES/additional-items-kegg-functions.txt \
--pan-or-profile-db 04_KOFAM_PROFILES-MERGED/PROFILE.db \
--target-data-table items
for mag in `cat mags.list`
do anvi-estimate-metabolism --contigs-db 03_CONTIGS/WATER-contigs.db \
--kegg-data-dir ~/PATH/to/kegg-kofams/ \
--profile-db 06_MERGED/WATER/PROFILE.db \
--collection-name MICROBIAL_FINAL \
--bin-id $mag -O $mag \
--kegg-output-modes kofam_hits
done
for mag in `cat mags.list`
do anvi-estimate-metabolism --contigs-db 03_CONTIGS/WATER-contigs.db \
--kegg-data-dir ~/PATH/to/kegg-kofams/ \
--profile-db 06_MERGED/WATER/PROFILE.db \
--collection-name MICROBIAL_FINAL \
--bin-id $mag --output-file-prefix $mag
--kegg-output-modes modules
done
###################### mag_01 ############################
<- read.table("metabolism/06_MAG_KOFAMS/WATER_MAG_00001_kofam_hits.txt",
mag_01sep = "\t", header = TRUE)
c("unique_id", "ko", "modules_with_ko", "bin_name")] <- list(NULL)
mag_01[, $mag_id <- "MAG01"
mag_01<- mag_01 %>% dplyr::arrange(gene_caller_id)
mag_01 <- mag_01[!duplicated(mag_01$gene_caller_id), ]
mag_01 <- mag_01[, c(2,1,3)]
mag_01 <- mag_01 %>% tidyr::unite(int_name, "contig", "gene_caller_id", sep = "_gene_") %>%
mag_01 ::mutate(split_id = "split_00001", .after = 2) %>%
dplyr::unite(items, "int_name", "split_id", sep = "_")
tidyr###################### mag_02 ############################
<- read.table("metabolism/06_MAG_KOFAMS/WATER_MAG_00002_kofam_hits.txt",
mag_02sep = "\t", header = TRUE)
c("unique_id", "ko", "modules_with_ko", "bin_name")] <- list(NULL)
mag_02[, $mag_id <- "MAG02"
mag_02<- mag_02 %>% dplyr::arrange(gene_caller_id)
mag_02 <- mag_02[!duplicated(mag_02$gene_caller_id), ]
mag_02 <- mag_02[, c(2,1,3)]
mag_02 <- mag_02 %>% tidyr::unite(int_name, "contig", "gene_caller_id", sep = "_gene_") %>%
mag_02 ::mutate(split_id = "split_00001", .after = 1) %>%
dplyr::unite(items, "int_name", "split_id", sep = "_")
tidyr###################### mag_03 ############################
<- read.table("metabolism/06_MAG_KOFAMS/WATER_MAG_00003_kofam_hits.txt",
mag_03sep = "\t", header = TRUE)
c("unique_id", "ko", "modules_with_ko", "bin_name")] <- list(NULL)
mag_03[, $mag_id <- "MAG03"
mag_03<- mag_03 %>% dplyr::arrange(gene_caller_id)
mag_03 <- mag_03[!duplicated(mag_03$gene_caller_id), ]
mag_03 <- mag_03[, c(2,1,3)]
mag_03 <- mag_03 %>% tidyr::unite(int_name, "contig", "gene_caller_id", sep = "_gene_") %>%
mag_03 ::mutate(split_id = "split_00001", .after = 1) %>%
dplyr::unite(items, "int_name", "split_id", sep = "_")
tidyr###################### mag_04 ############################
<- read.table("metabolism/06_MAG_KOFAMS/WATER_MAG_00004_kofam_hits.txt",
mag_04sep = "\t", header = TRUE)
c("unique_id", "ko", "modules_with_ko", "bin_name")] <- list(NULL)
mag_04[, $mag_id <- "MAG04"
mag_04<- mag_04 %>% dplyr::arrange(gene_caller_id)
mag_04 <- mag_04[!duplicated(mag_04$gene_caller_id), ]
mag_04 <- mag_04[, c(2,1,3)]
mag_04 <- mag_04 %>% tidyr::unite(int_name, "contig", "gene_caller_id", sep = "_gene_") %>%
mag_04 ::mutate(split_id = "split_00001", .after = 1) %>%
dplyr::unite(items, "int_name", "split_id", sep = "_")
tidyr###################### mag_05 ############################
<- read.table("metabolism/06_MAG_KOFAMS/WATER_MAG_00005_kofam_hits.txt",
mag_05sep = "\t", header = TRUE)
c("unique_id", "ko", "modules_with_ko", "bin_name")] <- list(NULL)
mag_05[, $mag_id <- "MAG05"
mag_05<- mag_05 %>% dplyr::arrange(gene_caller_id)
mag_05 <- mag_05[!duplicated(mag_05$gene_caller_id), ]
mag_05 <- mag_05[, c(2,1,3)]
mag_05 <- mag_05 %>% tidyr::unite(int_name, "contig", "gene_caller_id", sep = "_gene_") %>%
mag_05 ::mutate(split_id = "split_00001", .after = 1) %>%
dplyr::unite(items, "int_name", "split_id", sep = "_")
tidyr###################### bin_13 ############################
<- read.table("metabolism/06_MAG_KOFAMS/WATER_Bin_00013_kofam_hits.txt",
bin_13sep = "\t", header = TRUE)
c("unique_id", "ko", "modules_with_ko", "bin_name")] <- list(NULL)
bin_13[, $mag_id <- "BIN13"
bin_13<- bin_13 %>% dplyr::arrange(gene_caller_id)
bin_13 <- bin_13[!duplicated(bin_13$gene_caller_id), ]
bin_13 <- bin_13[, c(2,1,3)]
bin_13 <- bin_13 %>% tidyr::unite(int_name, "contig", "gene_caller_id", sep = "_gene_") %>%
bin_13 ::mutate(split_id = "split_00001", .after = 1) %>%
dplyr::unite(items, "int_name", "split_id", sep = "_") tidyr
<- rbind(mag_01, mag_02, mag_03, mag_04, mag_05, bin_13)
add_mag write.table(add_mag, "metabolism/05_ANNOTATION_FILES/additional-items-kos-in-mags.txt",
sep = "\t", quote = FALSE, row.names = FALSE)
anvi-import-misc-data 05_ANNOTATION_FILES/additional-items-kos-in-mags.txt \
--pan-or-profile-db 04_KOFAM_PROFILES-MERGED/PROFILE.db \
--target-data-table items
<- readDNAStringSet("metabolism/00_HELPER_FILES/kofam_genes.fa")
kofam_fna_mod <- names(kofam_fna_mod)
kofam_names_mod <- paste(kofam_fna_mod)
kofam_seqs_mod <- data.frame(kofam_names_mod, kofam_seqs_mod)
fasta_df_mod $kofam_names_mod <- as.numeric(as.character(fasta_df_mod$kofam_names_mod))
fasta_df_mod<- dplyr::arrange(fasta_df_mod, kofam_names_mod)
fasta_df_mod
<- genes_modules1
genes_modules_list <- fasta_df_mod %>% dplyr::mutate_if(is.numeric,
fasta_df_mod
as.character,
is.factor,
as.character) c("ko", "kegg_module", "module_category",
genes_modules_list[, "module_subcategory", "module_name")] <- list(NULL)
<- dplyr::left_join(genes_modules_list, fasta_df_mod,
fasta_df_mod1 by = c("gene_callers_id" = "kofam_names_mod"))
<- dplyr::left_join(fasta_df_mod1, kofam_genes_tab1,
fasta_df_mod1 by = "gene_callers_id")
<- fasta_df_mod1[, c(3,2)]
fasta_df_mod1 <- fasta_df_mod1
fasta_df_mod1_rn $contig <- sub("", ">", fasta_df_mod1_rn$contig)
fasta_df_mod1_rnwrite.table(fasta_df_mod1_rn, "metabolism/00_HELPER_FILES_MOD_ONLY/kofam_genes_rn.fa",
sep = "\r", col.names = FALSE, row.names = FALSE,
quote = FALSE, fileEncoding = "UTF-8")
head(fasta_df_mod1_rn)
<- fasta_df_mod1
mod_only_list <- mod_only_list %>% dplyr::mutate(split_id = "split_00001", .after = 1) %>%
mod_only_list ::unite(items, contig, split_id, remove = TRUE, sep = "_")
tidyr
$kofam_seqs_mod <- NULL
mod_only_list<- dplyr::left_join(mod_only_list, add_mag, by = "items")
mod_only_mag
<- mod_only_list %>%
gene_mod_only_list ::separate("items",
tidyrinto = c("V1", "V2", "V3", "gene_callers_id", "V4", "V5"),
sep ="_")
c("V1", "V2", "V3", "V4", "V5")] <- list(NULL)
gene_mod_only_list[,
$gene_callers_id <- as.numeric(as.character(gene_mod_only_list$gene_callers_id))
gene_mod_only_list<- dplyr::left_join(gene_mod_only_list,
mods_tax
kofam_with_tax, by = "gene_callers_id")
$gene_callers_id <- as.numeric(as.character(kofam_genes_tab$gene_callers_id))
kofam_genes_tab1<- dplyr::left_join(gene_mod_only_list, kofam_genes_tab1, by = "gene_callers_id")
mod_external_gene
write.table(mod_external_gene, "metabolism/00_HELPER_FILES_MOD_ONLY/mod_external_gene.txt",
sep = "\t", quote = FALSE, row.names = FALSE)
write.table(mod_only_mag, "metabolism/00_HELPER_FILES_MOD_ONLY/mag-mods.txt",
sep = "\t", quote = FALSE, row.names = FALSE)
write.table(mods_tax, "metabolism/00_HELPER_FILES_MOD_ONLY/mods_tax.txt",
sep = "\t", quote = FALSE, row.names = FALSE)
write.table(add_view_part_1, "metabolism/00_HELPER_FILES_MOD_ONLY/mods.txt",
sep = "\t", quote = FALSE, row.names = FALSE)
anvi-gen-contigs-database --contigs-fasta 00_HELPER_FILES_MOD_ONLY/kofam_genes_rn.fa --project-name MODS --output-db-path 00_HELPER_FILES_MOD_ONLY/KOFAMS_MODS-contigs.db --external-gene-calls 00_HELPER_FILES_MOD_ONLY/mod_external_gene.txt --ignore-internal-stop-codons --skip-predict-frame
anvi-import-taxonomy-for-genes --contigs-db 00_HELPER_FILES_MOD_ONLY/KOFAMS_MODS-contigs.db --input-files 00_HELPER_FILES_MOD_ONLY/mods_tax.txt --parser default_matrix
anvi-import-taxonomy-for-genes –contigs-db KOFAMS_MODS-contigs.db –input-files mods_tax.txt –parser default_matrix
anvi-import-misc-data mods.txt –pan-or-profile-db 03_KOFAM_PROFILES-MERGED/PROFILE.db –target-data-table items anvi-import-misc-data mag-mods.txt –pan-or-profile-db 03_KOFAM_PROFILES-MERGED/PROFILE.db –target-data-table items
distinct(genes_cat, gene_callers_id, .keep_all = T)
<- read.table("metabolism/06_MAG_KOFAMS/WATER_MAG_00001_modules.txt",
mag01_mod sep = "\t", header = TRUE)
$unique_id <- NULL
mag01_mod$bin_name <- "MAG01"
mag01_mod
<- read.table("metabolism/06_MAG_KOFAMS/WATER_MAG_00002_modules.txt",
mag02_mod sep = "\t", header = TRUE)
$unique_id <- NULL
mag02_mod$bin_name <- "MAG02"
mag02_mod
<- read.table("metabolism/06_MAG_KOFAMS/WATER_MAG_00003_modules.txt",
mag03_mod sep = "\t", header = TRUE)
$unique_id <- NULL
mag03_mod$bin_name <- "MAG03"
mag03_mod
<- read.table("metabolism/06_MAG_KOFAMS/WATER_MAG_00004_modules.txt",
mag04_mod sep = "\t", header = TRUE)
$unique_id <- NULL
mag04_mod$bin_name <- "MAG04"
mag04_mod
<- read.table("metabolism/06_MAG_KOFAMS/WATER_MAG_00005_modules.txt",
mag05_mod sep = "\t", header = TRUE)
$unique_id <- NULL
mag05_mod$bin_name <- "MAG05"
mag05_mod
<- read.table("metabolism/06_MAG_KOFAMS/WATER_Bin_00013_modules.txt",
bin13_mod sep = "\t", header = TRUE)
$unique_id <- NULL
bin13_mod$bin_name <- "Bin13"
bin13_mod
<- read.table("metabolism/00_HELPER_FILES/all_modules.txt",
all_mods sep = "\t", header = TRUE, quote = "")
<- all_mods %>% dplyr::rename(bin_name = metagenome_name)
all_mods $unique_id <- NULL
all_mods$bin_name <- "ALL" all_mods
<- dplyr::left_join(all_mods, mag01_mod,
modules by = c("kegg_module", "module_name",
"module_class", "module_category",
"module_subcategory"),
suffix = c("_all", "_MAG_01")) %>%
::left_join(., mag02_mod, by = c("kegg_module", "module_name",
dplyr"module_class", "module_category",
"module_subcategory"), suffix = c("XXXX", "_MAG_02")) %>%
::left_join(., mag03_mod, by = c("kegg_module", "module_name",
dplyr"module_class", "module_category",
"module_subcategory"), suffix = c("_MAG_02", "_MAG_03")) %>%
::left_join(., mag04_mod, by = c("kegg_module", "module_name",
dplyr"module_class", "module_category",
"module_subcategory"), suffix = c("_MAG_03", "_MAG_04")) %>%
::left_join(., mag05_mod, by = c("kegg_module", "module_name",
dplyr"module_class", "module_category",
"module_subcategory"), suffix = c("_MAG_04", "_MAG_05")) %>%
::left_join(., bin13_mod, by = c("kegg_module", "module_name",
dplyr"module_class", "module_category",
"module_subcategory"), suffix = c("_MAG_05", "_Bin_13"))
<- rbind(all_mods, mag01_mod, mag02_mod, mag03_mod, mag04_mod, mag05_mod, bin13_mod)
modules c("module_definition", "module_class")] <- list(NULL)
modules[, colnames(modules)
<- modules %>% tidyr::pivot_wider(
modules_w names_from = c("bin_name"),
values_from = c("module_completeness",
"module_is_complete",
"kofam_hits_in_module",
"gene_caller_ids_in_module"),
names_sep = "_")
<- modules_w
items_add_data c(5:max(ncol(items_add_data)))] <- NULL
items_add_data[, <- items_add_data %>% dplyr::rename("items" = "kegg_module")
items_add_data <- items_add_data[order(items_add_data$module_category, items_add_data$module_subcategory), ]
items_add_data write.table(items_add_data, "metabolism/00_HELPER_FILES/module_item_add_data.txt",
sep = "\t", quote = FALSE, row.names = FALSE)
<- modules_w
data_tab <- data_tab[order(data_tab$module_category, data_tab$module_subcategory), ]
data_tab c(2:4)] <- NULL
data_tab[, c(9:max(ncol(data_tab)))] <- NULL
data_tab[, <- data_tab %>% replace(is.na(.), 0)
data_tab
names(data_tab) <- gsub(x = names(data_tab), pattern = "module_completeness_", replacement = "")
write.table(data_tab, "metabolism/00_HELPER_FILES/module_data.txt",
sep = "\t", quote = FALSE, row.names = FALSE)
write.table(data_tab$kegg_module, "metabolism/00_HELPER_FILES/module_order.txt",
sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)
[1] “unique_row_id” “bin_name” “kegg_module” “module_name”
[5] “module_class” “module_category” “module_subcategory” “module_definition”
[9] “module_completeness” “module_is_complete” “kofam_hits_in_module” “gene_caller_ids_in_module”
<- read.table("metabolism/00_HELPER_FILES/WATER-GENE-COVERAGES.txt", header = TRUE)
coverage $key <- as.factor(coverage$key)
coverage<- dplyr::left_join(kofam_unique, coverage, by = c("gene_callers_id" = "key"))
kofam_unique_coveage c("gene_callers_id", "source",
kofam_unique_coveage[, "ko_function", "e_value",
"WCCR_1913", "WCCR_1916")] <- list(NULL)
<- kofam_unique_coveage[which(rowSums(kofam_unique_coveage[,2:3]) != 0),]
kofam_unique_coveage1 write.table(kofam_unique_coveage1, "metabolism/00_HELPER_FILES/kofam_unique_coveage1.txt",
quote = FALSE, sep = "\t", row.names = FALSE)
<- kofam_unique_coveage1[!duplicated(kofam_unique_coveage1$accession),]
kofam_unique_coveage1_u <- kofam_unique_coveage1_u %>% tibble::remove_rownames() %>%
kofam_unique_coveage1_u ::column_to_rownames("accession")
tibble
<- data.matrix(kofam_unique_coveage1_u[1:2], rownames.force = NULL)
kofam_unique_coveage1_u <- 1
i <- pathview(gene.data = kofam_unique_coveage1_u[, 1:2],
pv.out pathway.id = "00920",
species = "ko", out.suffix = "all",
kegg.native = TRUE)
str(pv.out)
head(pv.out$plot.data.gene)
#result PNG file in current directory
<- read.table('metabolism/06_MAG_KOFAMS/16s-asv_data-2019.txt', header = T)
mag01 <- dplyr::bind_rows(rrna, amf, its)
all_2019
<- read.table("mag02.txt", header = FALSE)
mag02 <- c(mag02$V1)
mag02 #KEGG view: gene data only
#load data
data(gse16873.d)
data(demo.paths)
<- data.frame(gse16873.d)
test <- test %>% tibble::rownames_to_column("id")
test
duplicated(test$id),]
test[dim(test[duplicated(test$id),])[1]
$id
test
#KEGG view: gene data only
<- 1
i <- pathview(gene.data = gse16873.d[, 1],
pv.out pathway.id = demo.paths$sel.paths[i],
species = "hsa", out.suffix = "gse16873",
kegg.native = TRUE)
str(pv.out)
head(pv.out$plot.data.gene)
#result PNG file in current directory
#Graphviz view: gene data only
<- pathview(gene.data = gse16873.d[, 1],
pv.out pathway.id = demo.paths$sel.paths[i],
species = "hsa", out.suffix = "gse16873",
kegg.native = FALSE, sign.pos = demo.paths$spos[i])
#result PDF file in current directory
#KEGG view: both gene and compound data
=sim.mol.data(mol.type="cpd", nmol=3000)
sim.cpd.data<- 3
i print(demo.paths$sel.paths[i])
<- pathview(gene.data = gse16873.d[, 1], cpd.data = sim.cpd.data,
pv.out pathway.id = demo.paths$sel.paths[i],
species = "hsa", out.suffix = "gse16873.cpd",
keys.align = "y", kegg.native = TRUE,
key.pos = demo.paths$kpos1[i])
str(pv.out)
head(pv.out$plot.data.cpd)
#multiple states in one graph
set.seed(10)
= matrix(sample(sim.cpd.data, 18000, replace = TRUE), ncol = 6)
sim.cpd.data2 <- pathview(gene.data = gse16873.d[, 1:3],
pv.out cpd.data = sim.cpd.data2[, 1:2],
pathway.id = demo.paths$sel.paths[i],
species = "hsa", out.suffix = "gse16873.cpd.3-2s",
keys.align = "y", kegg.native = TRUE,
match.data = FALSE, multi.state = TRUE, same.layer = TRUE)
str(pv.out)
head(pv.out$plot.data.cpd)
#result PNG file in current directory
##more examples of pathview usages are shown in the vignette.
<- read.table('arco-phylo-kegg-modules_modules.txt', sep = "\t",
arco header = TRUE, quote = "")
<- arco
arco_t c("unique_id", "db_name", "module_name", "module_class", "module_category", "module_subcategory", "module_definition", "kofam_hits_in_module", "gene_caller_ids_in_module", "module_completeness")] <- list(NULL)
arco_t[,
$module_is_complete <- str_replace(arco_t$module_is_complete, "True", "1")
arco_t$module_is_complete <- str_replace(arco_t$module_is_complete, "False", "0")
arco_t$module_is_complete <- as.integer(arco_t$module_is_complete)
arco_t<- arco_t %>% tidyr::pivot_wider(names_from = c("kegg_module"), values_from = c("module_is_complete"))
arco_t_w <- arco_t_w[, colSums(arco_t_w != 0) > 0]
arco_t_w <- column_to_rownames(arco_t_w, var = "genome_name")
arco_t_w <- arco_t_w[, colSums(arco_t_w > 0) <= 70]
arco_t_w <- rownames_to_column(arco_t_w, var = "genome_name")
arco_t_w <- read.table('genome_order.txt',
g_order header = TRUE, quote = "")
<- dplyr::left_join(g_order, arco_t_w,
arco_t_w_order by = "genome_name")
<- arco_t_w_order %>% pivot_longer(cols = 2:34, names_to = "kegg_module", values_to = "module_is_complete")
arco_t_l
<- arco_t_w_order %>% tibble::column_to_rownames("genome_name")
arco_t_w_order_no_row_id
heatmaply(arco_t_w_order_no_row_id,
Rowv = NULL, Colv = NULL,
k_row = 1, k_col = 1,
file = "heatmaply_plot.html",
colors = c("white", "red"),
hide_colorbar = TRUE,
fontsize_row = 6)
<- read.table("WATER-GENE-COVERAGES.txt", header = TRUE)
coverage <- read.table("00_HELPER_FILES/KOfam.txt", sep = "\t",
kofam header = TRUE, quote = "")
<- dplyr::left_join(kofam, coverage, by = c("gene_callers_id" = "key")) kofam_cov
anvi-search-functions --contigs-db 03_CONTIGS/WATER-contigs.db --full-report ko_search.txt --annotation-sources KOfam --search-terms K17222,K17223,K17227,K17226,K17224,K17225,K22622,K00170,K00169,K00171,K00172,K00174,K00175,K00240,K01676,K00244,K00246,K00245,K01007,K01682,K00407,K00406,K00404,K00405
<- read.table("00_HELPER_FILES/ko_search.txt", sep = "\t",
kofam_search header = TRUE, quote = "")
<- dplyr::left_join(kofam_search, kofam_cov, by = "gene_callers_id")
kofam_seach_cov write.table(kofam_seach_cov, "kofam_seach_cov.txt", sep = "\t", quote = FALSE, row.names = FALSE)
The source code for this page can be accessed on GitHub by clicking this link.
If you see mistakes or want to suggest changes, please create an issue on the source repository.
Text and figures are licensed under Creative Commons Attribution CC BY 4.0. Source code is available at https://github.com/hypocolypse/web/, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".