Accessing KEGG database from R/Bioconductor

The KEGG database is a great resource for biological pathway information, which is an essential part of genome and transcriptome analysis, where biological interpretations are formed.

For high-throughput studies, it is preferable to access the KEGG database programmatically. KEGG recently released a REST service to accommodate such needs. More information about the REST service in KEGG can be found here.

There is also a KEGGREST Bioconductor package available. Since KEGGREST did not work well in my computing environment, I wrote my own code. Below are some examples.

First, a simple example of getting an organism’s basic information.

getOrganismInfo <- function(organism){
  KEGG_INFO_BASE <- "http://rest.kegg.jp/info/"
  info_REST_url <- paste(KEGG_INFO_BASE, organism, sep="")
  info <- readLines(info_REST_url)
  info
}

Calling the above function for human (“hsa” as the organism code), you will get the following:

> getOrganismInfo("hsa")
[1] "T01001 Homo sapiens (human) KEGG Genes Database"
[2] "hsa Release 65.0+/02-21, Feb 13"
[3] " Kanehisa Laboratories"
[4] " 26,241 entries"

The second example demonstrates how to list KEGG pathway IDs and names for a specified organism:

mapPathwayToName <- function(organism) {
  KEGG_PATHWAY_LIST_BASE <- "http://rest.kegg.jp/list/pathway/"
  pathway_list_REST_url <- paste(KEGG_PATHWAY_LIST_BASE, organism, sep="")

  pathway_id_name <- data.frame()

  for (line in readLines(pathway_list_REST_url)) {
    tmp <- strsplit(line, "\t")[[1]]
    pathway_id <- strsplit(tmp[1], organism)[[1]][2]
    pathway_name <- tmp[2]
    pathway_name <- strsplit(pathway_name, "\\s+-\\s+")[[1]][1]
    pathway_id_name[pathway_id, 1] = pathway_name

  }

  names(pathway_id_name) <- "pathway_name"
  pathway_id_name
}

Running this code with “hsa” lists all the human pathways stored in KEGG.

> human_pathways <- mapPathwayToName("hsa")
> head(human_pathways)
pathway_name
00010 Glycolysis / Gluconeogenesis
00020 Citrate cycle (TCA cycle)
00030 Pentose phosphate pathway
00040 Pentose and glucuronate interconversions
00051 Fructose and mannose metabolism
00052 Galactose metabolism
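The loop in the second example grows the data frame one row at a time, which can get slow for long lists. Below is a minimal vectorized sketch of the same parsing step. The helper name parsePathwayLines is hypothetical (it is not part of the original code), and the sketch assumes each line is tab-separated as "<id><TAB><name> - <organism>", with an optional "path:" prefix on the ID. Parsing is kept separate from the download so it can be tried on sample lines.

```r
# Hypothetical helper: parse lines returned by the KEGG "list/pathway" call.
# Assumes each line looks like "path:hsa00010<TAB>Name - Homo sapiens (human)";
# the "path:" prefix is treated as optional.
parsePathwayLines <- function(lines, organism) {
  fields <- strsplit(lines, "\t", fixed = TRUE)
  ids    <- vapply(fields, `[`, character(1), 1)
  names_ <- vapply(fields, `[`, character(1), 2)

  # Drop the optional "path:" prefix and the organism code from the ID.
  ids <- sub(paste0("^(path:)?", organism), "", ids)
  # Keep only the text before the first " - ", as the original loop does.
  names_ <- sub("\\s+-\\s+.*$", "", names_)

  data.frame(pathway_name = names_, row.names = ids, stringsAsFactors = FALSE)
}

# Sample input written by hand in the assumed format.
sample_lines <- c(
  "path:hsa00010\tGlycolysis / Gluconeogenesis - Homo sapiens (human)",
  "path:hsa00020\tCitrate cycle (TCA cycle) - Homo sapiens (human)"
)
parsed_pathways <- parsePathwayLines(sample_lines, "hsa")
```

To use it on live data, pass readLines(pathway_list_REST_url) as the first argument.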

The third example shows how to map genes to pathways. Note that since not all genes have functions assigned, we’ll only get pathways for some of the genes.

mapGeneToPathway <- function(organism) {
  KEGG_PATHWAY_LINK_BASE <- "http://rest.kegg.jp/link/pathway/"
  pathway_link_REST_url <- paste(KEGG_PATHWAY_LINK_BASE, organism, sep="")
  
  gene_pathway <- data.frame()

  for (line in readLines(pathway_link_REST_url)) {
    tmp <- strsplit(line, "\t")[[1]]
    gene <- tmp[1]
    gene <- strsplit(gene, ":")[[1]][2]  
    pathway_id <- strsplit(tmp[2], organism)[[1]][2]

    if (is.null(gene_pathway[gene, 1])) {
      gene_pathway[gene,1] = pathway_id
    } else {
      if (is.na(gene_pathway[gene,1])) {
        gene_pathway[gene,1] = pathway_id
      } else {
        gene_pathway[gene,1] = paste(gene_pathway[gene, 1], pathway_id, sep=";")
      }
    }
  }
  names(gene_pathway) <- "pathway_id"
  gene_pathway
}

Try it with “hsa”:

> gene_to_pathway <- mapGeneToPathway("hsa")
> head(gene_to_pathway, n=3)
pathway_id
10 00232;00983;01100;05204
100 00230;01100;05340
1000 04514;05412
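As an alternative to filling the table row by row, the grouping can be done in one pass. This is only a sketch: parseGenePathwayLines is a hypothetical helper name, and it assumes each line of the link output looks like "hsa:10<TAB>path:hsa00232".

```r
# Hypothetical helper: collapse KEGG "link/pathway" lines into one row per gene.
# Assumes each line looks like "hsa:10<TAB>path:hsa00232".
parseGenePathwayLines <- function(lines, organism) {
  fields  <- strsplit(lines, "\t", fixed = TRUE)
  # Strip everything up to and including the first ":" from the gene field.
  gene    <- sub("^[^:]*:", "", vapply(fields, `[`, character(1), 1))
  # Strip the optional "path:" prefix and the organism code from the pathway.
  pathway <- sub(paste0("^(path:)?", organism), "",
                 vapply(fields, `[`, character(1), 2))

  # Group pathway IDs by gene, keeping genes in order of first appearance.
  genes_in_order <- unique(gene)
  grouped   <- split(pathway, factor(gene, levels = genes_in_order))
  collapsed <- vapply(grouped, paste, character(1), collapse = ";")

  data.frame(pathway_id = unname(collapsed), row.names = genes_in_order,
             stringsAsFactors = FALSE)
}

# Sample input written by hand in the assumed format.
sample_links <- c(
  "hsa:10\tpath:hsa00232",
  "hsa:10\tpath:hsa00983",
  "hsa:100\tpath:hsa00230",
  "hsa:10\tpath:hsa01100"
)
gene_links <- parseGenePathwayLines(sample_links, "hsa")
```

On live data, the first argument would be readLines(pathway_link_REST_url).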

Note that some genes (for example, gene 10) are mapped to four pathway IDs. By combining this function with the second example, you will be able to add pathway names to the table.
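One possible way to do that combination is sketched here. The helper addPathwayNames is hypothetical, and the toy tables only imitate the shape of the outputs of examples 2 and 3 (gene "200" is made up for illustration).

```r
# Hypothetical helper: append a pathway_name column to the gene table.
# gene_pathway:  data frame with a "pathway_id" column of ";"-separated IDs.
# pathway_names: data frame with a "pathway_name" column and IDs as row names.
addPathwayNames <- function(gene_pathway, pathway_names) {
  id_lists <- strsplit(gene_pathway$pathway_id, ";", fixed = TRUE)
  gene_pathway$pathway_name <- vapply(id_lists, function(ids) {
    # match() does exact lookups; unknown IDs show up as the string "NA".
    hits <- match(ids, rownames(pathway_names))
    paste(pathway_names$pathway_name[hits], collapse = ";")
  }, character(1))
  gene_pathway
}

# Toy tables shaped like the outputs of examples 2 and 3.
toy_names <- data.frame(
  pathway_name = c("Purine metabolism", "Metabolic pathways"),
  row.names = c("00230", "01100"), stringsAsFactors = FALSE
)
toy_genes <- data.frame(
  pathway_id = c("00230;01100", "01100"),
  row.names = c("100", "200"), stringsAsFactors = FALSE
)
annotated <- addPathwayNames(toy_genes, toy_names)

# Keep only the genes of interest, looked up by KEGG gene ID.
my_genes <- c("100")
annotated[rownames(annotated) %in% my_genes, ]
```

With real data, the two arguments would be the results of mapGeneToPathway("hsa") and mapPathwayToName("hsa").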

20 thoughts on “Accessing KEGG database from R/Bioconductor”

Pingback: Find organism code used in KEGG reference pathways | biobeat

hakepon March 6, 2013 at 7:28 am

Dear Biobeat author,

Thanks a lot for your nice post, I found it extremely useful. I am new to R and I was wondering how would you feed this script with a given list of genes? Could we use gene symbols or do we need to use the b number as in the example? And could you give an example on how to merge pathway name and pathway ID, I think it would be useful for newbies like me.
Regards!

biobeat Post author March 6, 2013 at 10:23 am

I am glad that you found this post useful. I will try to post new blogs addressing your questions in the next few days. In general, gene symbols are not the best way to search databases in scripts. Usually people use database entry IDs for that and convert between gene IDs from different databases.

hakepon March 7, 2013 at 6:12 am

Thank you for your kind answer. I have been using the EMBL database until now, but the results for E. coli pathways are not so good, so I look forward to being able to use KEGG!

biobeat Post author March 10, 2013 at 2:58 pm

I just posted some example code to address your question: https://biobeat.wordpress.com/2013/03/10/merge-pathway-name-and-pathway-id-from-kegg-database/

1. hakepon March 11, 2013 at 4:25 am
  
  Dear biobeat,
  
  Thank you for your nice post. Sorry for my poor R level but I still don’t understand very well how to feed the example 3 you show here with my own list of genes. All the rest works great!
  
  Regards,
  
  Hakepon
2. biobeat Post author March 14, 2013 at 4:24 pm
  
  Hakepon, to map your own list of genes to pathways, you need to use the gene IDs used in KEGG. If I remember correctly, KEGG uses NCBI GeneID numbers. For example, running example #3 on human (“hsa” for organism), you will get gene “10” mapped to the pathways “00232;00983;01100;05204”. You can verify this by searching KEGG for “hsa:10”. So, I would recommend that you get a gene-pathway table for your organism following example #3, then look up the rows for your own genes based on NCBI GeneID.

Pingback: Merge pathway name and pathway ID from KEGG database | biobeat

Duy May 13, 2013 at 7:02 am

Really good tips. Thanks

biobeat Post author May 14, 2013 at 10:00 pm

I am glad to know that you found my post useful and left me a comment. Thanks.

Rosary May 28, 2013 at 9:48 pm

Hi there! I really love your posts! Really useful!
However, if I were only interested in getting pathway IDs for a specific pathway name (not organism), e.g. getting pathway ID for Starch and Sucrose metabolism pathway; would I be able to ‘modify’ the second example somehow and get away with it? Or does it not work that way?
Thanks!

biobeat Post author May 30, 2013 at 10:56 pm

Hello Rosary, sorry that I got back to you late. Yes, you can write your own function by modifying the code in the second example. I just posted a solution here: https://biobeat.wordpress.com/2013/05/30/list-kegg-pathway-maps/

I am glad that you found my posts useful. Thanks.

Pingback: List KEGG pathway maps | biobeat

Alicia May 6, 2015 at 6:55 am

Hi biobeat!
I know it is two years after this post…
My question is: How can I download the list of microorganisms that carry a specific gene for an enzyme?
In my case I need to download the list of microorganisms that carry the beta-galactosidase gene, through the enzyme database, ENZYME: 3.2.1.23. Is that possible?
Thank you!!

amolkolte May 13, 2015 at 6:02 am

Hi, Thanks for the nice post!!
Do you know if this way is free for academic use without a license?

biobeat Post author May 13, 2015 at 9:22 pm

Please contact KEGG directly for license questions. But the REST service seems to be free to use, at least for the example code that I posted here.

biobeat Post author May 13, 2015 at 9:20 pm

Hello Alicia,

Sorry for getting back to you late. I have been quite busy for some time and didn’t pay attention to my blog. I have not used KEGG for quite some time, so I don’t remember the technical details, but I believe there is a way to solve your problem with the KEGG REST service. You may start by looking at the pathways that involve this enzyme. Alternatively, you may also look for an enzyme-specific database, which may already have a direct answer to your question. Hope this helps. Sorry that I have no time to explore it for now.

Kevin Lee September 30, 2016 at 1:17 am

This is a great post! Really useful!

Umesh Kathad July 26, 2017 at 11:06 am

It’s a very useful and nice post with examples.
I have a question regarding how to get the list of genes (names) involved in a specific pathway.
Thank you in advance

biobeat Post author July 27, 2017 at 8:11 am

The list of genes is available in the first column of the pathway link output. For example, look at this: http://rest.kegg.jp/link/pathway/hsa/
