Tag Archives: R

Search KEGG Pathway Maps by Keyword in R

I was recently asked about how to list KEGG pathways not specific to any organism, which was not addressed in my previous post on accessing KEGG pathway from R.  Here, I post one example that shows you how to search KEGG pathway by keyword from R.

First, let’s define the function getKeggPathwayTable:

getKeggPathwayTable <- function(){
  pathway_link_REST_url <- "http://rest.kegg.jp/list/pathway/"
  kegg_pathways<- data.frame()

  current_row = 1
  for (line in readLines(pathway_link_REST_url)) {
    tmp <- strsplit(line, "\t")[[1]]
    map <- tmp[1]
    map <- strsplit(map, ":")[[1]][2]  
    pathway_name<- tmp[2]
    kegg_pathways[current_row, 1]=map
    kegg_pathways[current_row, 2]=pathway_name
    current_row = current_row + 1
  names(kegg_pathways) <- c("map_id","name")

Running the above function will give a data frame that lists all the current KEGG pathway IDs and names.

Next, lets define a function to search the data frame returned by getKeggPathwayTable by keyword.

searchKeggPathwayByKeyword <- function(keyword) {
  kegg_pathways <- getKeggPathwayTable()
  hits <- grep(keyword, kegg_pathways$name, ignore.case=TRUE)

Note that I use “ignore.case=TRUE” to make sure that the search is not case sensitive. Now let’s try it with the keyword “vitamin” and “sucrose.”

> searchKeggPathwayByKeyword("vitamin")
      map_id                             name
110 map00750            Vitamin B6 metabolism
303 map04977 Vitamin digestion and absorption

> searchKeggPathwayByKeyword("sucrose")
     map_id                          name
60 map00500 Starch and sucrose metabolism

Merge pathway name and pathway ID from KEGG database

 If an organism is listed in KEGG database, one can easily get a list of its pathways and map a list of genes to the pathways (see here for an example about how to do it in R/bioconductor).

However, the pathway mapping result only provides the pathway IDs and not pathway names. To merge pathway IDs with pathway names, we only need to add  one more column of pathway names for each pathway ID listed in the same row of the mapping result dataframe. The R code below provides an example.

my_organism.gene_pathway <- mapGeneToPathway(organism) # map each gene to pathway ID.
my_organism.pathway_id_name <- mapPathwayToName(organism) # get the pathway name for each pathway ID.

# adding pathway names to mapped pathway IDs.
for (gene in row.names(my_orgranism.gene_pathway)) { 
  pathway_id <- my_orgranism.gene_pathway[gene, 'pathway_id']
  pathway_id <- strsplit(pathway_id, ";")[[1]]
  if (length(pathway_id) == 1) {        # if a gene maps to a single pathway
    pathway_name <- my_orgranism.pathway_id_name[pathway_id, "pathway_name"]
    my_orgranism.gene_pathway[gene,"pathway_name"] = pathway_name
  } else {                              # if a gene mapes to multiple pathways
    for (a_id in pathway_id) {
      if (is.null(my_orgranism.gene_pathway[gene, "pathway_name"])) {
        my_orgranism.gene_pathway[gene, "pathway_name"] = my_orgranism.pathway_id_name[a_id, "pathway_name"]
      } else {
        name <- my_orgranism.gene_pathway[gene, "pathway_name"]
        if (is.na(name)) {
          name <- my_orgranism.pathway_id_name[a_id, "pathway_name"]
        } else {
          name <- paste(name, my_orgranism.pathway_id_name[a_id, "pathway_name"], sep=";")
        my_orgranism.gene_pathway[gene, "pathway_name"] <- name

Accessing KEGG database from R/Bioconductor

KEGG database is a great resource for biological pathway information, which is an essential part of genome/transcriptome analysis where biological interpretation are formed.

For high throughput studies, it is preferred to access KEGG database programmatically. KEGG recently released the REST service to accomodate such needs. More information about REST service in KEGG can be found from here.

There is also a KEGGREST Bioconductor package available. Since KEGGREST is not working well under my computing environment, I wrote my own code. Below are some examples.

First, a simple example of getting an organism’s basic information.

getOrganismInfo <- function(organism){
  KEGG_INFO_BASE <- "http://rest.kegg.jp/info/"
  info_REST_url <- paste(KEGG_INFO_BASE, organism, sep="")
  info <- readLines(info_REST_url)

Calling the above function for human (“hsa” as the organism code), you will get the following:

> getOrganismInfo(“hsa”)
[1] “T01001 Homo sapiens (human) KEGG Genes Database”
[2] “hsa Release 65.0+/02-21, Feb 13″
[3] ” Kanehisa Laboratories”
[4] ” 26,241 entries”

The second example demonstrate how to list KEGG pathway ids and names for a specified organism:

mapPathwayToName <- function(organism) {
  KEGG_PATHWAY_LIST_BASE <- "http://rest.kegg.jp/list/pathway/"
  pathway_list_REST_url <- paste(KEGG_PATHWAY_LIST_BASE, organism, sep="")

  pathway_id_name <- data.frame()

  for (line in readLines(pathway_list_REST_url)) {
    tmp <- strsplit(line, "\t")[[1]]
    pathway_id <- strsplit(tmp[1], organism)[[1]][2]
    pathway_name <- tmp[2]
    pathway_name <- strsplit(pathway_name, "\\s+-\\s+")[[1]][1]
    pathway_id_name[pathway_id, 1] = pathway_name


  names(pathway_id_name) <- "pathway_name"

Running this code with “hsa” lists all the human pathways stored in KEGG.

> human_pathways head(human_pathways)
00010 Glycolysis / Gluconeogenesis
00020 Citrate cycle (TCA cycle)
00030 Pentose phosphate pathway
00040 Pentose and glucuronate interconversions
00051 Fructose and mannose metabolism
00052 Galactose metabolism

The third example shows how to map a list of genes to pathways. Note since not all the genes have functions assigned, we’ll only get pathways for some of the genes.

mapGeneToPathway <- function(organism) {
  KEGG_PATHWAY_LINK_BASE <- "http://rest.kegg.jp/link/pathway/"
  pathway_link_REST_url <- paste(KEGG_PATHWAY_LINK_BASE, organism, sep="")
  gene_pathway <- data.frame()

  for (line in readLines(pathway_link_REST_url)) {
    tmp <- strsplit(line, "\t")[[1]]
    gene <- tmp[1]
    gene <- strsplit(gene, ":")[[1]][2]  
    pathway_id<- strsplit(tmp[2], organism)[[1]][2]

    if (is.null(gene_pathway[gene, 1])) {
      gene_pathway[gene,1] = pathway_id
    } else {
      if (is.na(gene_pathway[gene,1])) {
        gene_pathway[gene,1] = pathway_id
      } else {
        gene_pathway[gene,1] = paste(gene_pathway[gene, 1], pathway_id, sep=";")
  names(gene_pathway) <- "pathway_id"

Try it with “hsa”:

> gene_to_pathway <- mapGeneToPathway("hsa")
> head(gene_to_pathway, n=3)
10 00232;00983;01100;05204
100 00230;01100;05340
1000 04514;05412

Note that some gene (10) are mapped to 4 pathway id. Combine this function with the second example, you will be able to add pathway names to the table.

Retrieve short gene names in cummeRbund

CummeRbund is a R package for analyzing cufflink results. By default, it uses XLOC values as the identifiers for each gene. All the gene information is stored in CuffGene instances, including the short name (if available). To get short gene names instead of XLOC values, the following R code is helpful.

>diffGenes <- getGenes(cuff_data, sig_gene_data$gene_id)

>head( featureNames(diffGenes), n=3)

tracking_id gene_short_name
1 XLOC_000186 GRHL3
2 XLOC_000302 NDUFS5
3 XLOC_000374 UQCRH

> head(featureNames(diffGenes)$gene_short_name, n=3)

*** Levels …..

There is some recommended method for newer version of CummeRbund as discussed here, but the idea is similar (use featureNames on CuffGeneSet instance).

Aquamacs + ESS + remote R session

I have been using Aquamacs as my default editor and the ESS (Emacs Speaks Statistics) as the default environment for all my R projects. Today I came across a nice tip about using Aquamacs and ESS to work on a remote R session. The original post by Matthew Keller can be found from here.

Steps to run Aquamacs and ESS to talk to a remote R session

1. Open a terminal in Aquamacs (M-x shell);

2. In the terminal type: ssh -XC name@server (-XC for data compression and X11 forwarding);

3. Start R in the terminal;

4. M-x ess-remote (let Aquamacs ESS talk to remote R session)

5. options(device=’x11′) (in R command line, but not needed if your server is running Linux).


Plot Genome Coverage Graphs using grid package in R

It is a common task to visualize the genome coverages across different conditions. This usually requires a multiple panel graph with some shared axis labels and a title shared by all the panels. One easy way to do this is using the grid packages in R. The code below shows how to do it.

coverages <- read.csv(‘Ambion_coverage_small.csv’, header=TRUE, sep=’,’)


## 26 Ambion
title = expression(paste(“26”, degree, “C rep 1”))
p23 <- ggplot(coverages, aes(x=position, y=NBA.23)) + geom_line()
p23 <- p23 + theme(axis.title.x = element_blank(),
axis.title.y = element_blank())
p23 <- p23 + ggtitle(title)

title = expression(paste(“26”, degree, “C rep 2”))
p25 <- ggplot(coverages, aes(x=position, y=NBA.25)) + geom_line()
p25 <- p25 + theme(axis.title.x = element_blank(),
axis.title.y = element_blank())
p25 <- p25 + ggtitle(title)

title = expression(paste(“37”, degree, “C rep 1”))
p20 <- ggplot(coverages, aes(x=position, y=NBA.20)) + geom_line()
p20 <- p20 + theme(axis.title.x = element_blank(),
axis.title.y = element_blank())
p20 <- p20 + ggtitle(title)

title = expression(paste(“37”, degree, “C rep 2”))
p21 <- ggplot(coverages, aes(x=position, y=NBA.21)) + geom_line()
p21 <- p21 + theme(axis.title.x = element_blank(),
axis.title.y = element_blank())
p21 <- p21 + ggtitle(title)

title = expression(paste(“37”, degree, “C rep 3”))
p22 <- ggplot(coverages, aes(x=position, y=NBA.22)) + geom_line()
p22 <- p22 + theme(axis.title.x = element_blank(),
axis.title.y = element_blank())
p22 <- p22 + ggtitle(title)


pushViewport(viewport(layout=grid.layout(6,2, heights= unit(c(0.8, 5, 5, 5, 5, 5), ‘null’), widths=unit(c(0.5, 5), ‘null’))))

grid.text(“Genome Coverages of Ambion Libraries”, vp = viewport(layout.pos.row = 1)) # a title for the entire grid

grid.text(“Row read counts”, rot=90, vp = viewport(layout.pos.col = 1)) # a title for the entire grid
print(p23, vp = viewport(layout.pos.row=2, layout.pos.col=2))
print(p25, vp = viewport(layout.pos.row=3, layout.pos.col=2))
print(p20, vp = viewport(layout.pos.row=4, layout.pos.col=2))
print(p21, vp = viewport(layout.pos.row=5, layout.pos.col=2))
print(p22, vp = viewport(layout.pos.row=6, layout.pos.col=2))

The above code uses a grid of 6 rows and 2 columns. The first column and the first row are used for shared y-axis and a title for the entire grid. The rest 5 rows in 1 column are used to plot all the panels. There are also some tricks for hide x- and y-axis labels for every panel, and a way to print the degree sign along with other text as panel titles.

These code generate the following figure.


Mapping genes to pathways using biomaRt

One common task in RNA-seq analysis is to map significantly regulated genes into pathways. Pathway information are available from various databases, e.g. reactome, uniprot, etc. The code below shows how to map genes of interest (as reported by cufflinks/cummeRbund) to human pathways in uniprot.

data <- readCufflinks()
gene_list <- genes(data)
gene_diff_data <- diffData(gene_list)
gene_list <- features(gene_list)
gene_list <- subset(gene_list, select=c('gene_id', 'gene_short_name'))
sig_gene_data <- subset(gene_diff_data, (significant=='yes'))
unimart = useMart('unimart', dataset='uniprot')
pathway.data <- data.frame('gene_name', 'go_name')
for (i in sig_gene_data$gene_id) {
 gene_name = subset(gene_list, gene_id == i, select='gene_short_name')
 tmp <- getBM(attributes=c('gene_name','organism', 'go_name'),
 tmp <- subset(tmp, tmp$organism=='Homo sapiens',
 select=c('gene_name', 'go_name'))
 tmp <- tmp[grep('^P:', tmp$go_name),]
 colnames(tmp) <- colnames(pathway.data)
 pathway.data <- rbind(pathway.data, tmp)
write.table(pathway.data, 'sig_gene_pathway.csv', sep=',', row.names=FALSE)