Monthly Archives: February 2013

Download remote files using wget

The Unix/Linux command line tool wget is a very handy tool for downloading files. Just like many other powerful command line tools, it is much faster and easier to get the job down using wget than its GUI alternatives (e.g. FileZilla). For example the following command can download/mirror a remote ftp directory for a user.

wget -m

Find organism code used in KEGG reference pathways

In my recent blog on accessing KEGG database from R/bioconductor, I used examples from human data, which is identified as “hsa” in KEGG. One has to know the organism code to retrieve information from KEGG for that specified organism. The table below lists the first three codes used in KEGG.

eco Escherichia coli K-12 MG1655
ecj Escherichia coli K-12 W3110
ecd Escherichia coli K-12 DH10B

The first text string of 3 or 4 characters in the table are the code, followed by the organism name.

This table generated from the source code of a pathway page in KEGG, which lists organism code for the organism name table.

<form method="get" name="selmenu"><select name="org_name"><option value="map">Reference pathway</option><option value="ko">Reference pathway (KO)</option><option value="set_cookie">-----< Set personalized menu >-----</option><option value="has.sort_alp">-----< Sort below by alphabet >-----</option><option value="eco">Escherichia coli K-12 MG1655</option><option value="ecj">Escherichia coli K-12 W3110</option><option value="ecd">Escherichia coli K-12 DH10B</option></select>
<select name="org_name">...</select>
<select name="org_name">

Accessing KEGG database from R/Bioconductor

KEGG database is a great resource for biological pathway information, which is an essential part of genome/transcriptome analysis where biological interpretation are formed.

For high throughput studies, it is preferred to access KEGG database programmatically. KEGG recently released the REST service to accomodate such needs. More information about REST service in KEGG can be found from here.

There is also a KEGGREST Bioconductor package available. Since KEGGREST is not working well under my computing environment, I wrote my own code. Below are some examples.

First, a simple example of getting an organism’s basic information.

getOrganismInfo <- function(organism){
  info_REST_url <- paste(KEGG_INFO_BASE, organism, sep="")
  info <- readLines(info_REST_url)

Calling the above function for human (“hsa” as the organism code), you will get the following:

> getOrganismInfo(“hsa”)
[1] “T01001 Homo sapiens (human) KEGG Genes Database”
[2] “hsa Release 65.0+/02-21, Feb 13″
[3] ” Kanehisa Laboratories”
[4] ” 26,241 entries”

The second example demonstrate how to list KEGG pathway ids and names for a specified organism:

mapPathwayToName <- function(organism) {
  pathway_list_REST_url <- paste(KEGG_PATHWAY_LIST_BASE, organism, sep="")

  pathway_id_name <- data.frame()

  for (line in readLines(pathway_list_REST_url)) {
    tmp <- strsplit(line, "\t")[[1]]
    pathway_id <- strsplit(tmp[1], organism)[[1]][2]
    pathway_name <- tmp[2]
    pathway_name <- strsplit(pathway_name, "\\s+-\\s+")[[1]][1]
    pathway_id_name[pathway_id, 1] = pathway_name


  names(pathway_id_name) <- "pathway_name"

Running this code with “hsa” lists all the human pathways stored in KEGG.

> human_pathways head(human_pathways)
00010 Glycolysis / Gluconeogenesis
00020 Citrate cycle (TCA cycle)
00030 Pentose phosphate pathway
00040 Pentose and glucuronate interconversions
00051 Fructose and mannose metabolism
00052 Galactose metabolism

The third example shows how to map a list of genes to pathways. Note since not all the genes have functions assigned, we’ll only get pathways for some of the genes.

mapGeneToPathway <- function(organism) {
  pathway_link_REST_url <- paste(KEGG_PATHWAY_LINK_BASE, organism, sep="")
  gene_pathway <- data.frame()

  for (line in readLines(pathway_link_REST_url)) {
    tmp <- strsplit(line, "\t")[[1]]
    gene <- tmp[1]
    gene <- strsplit(gene, ":")[[1]][2]  
    pathway_id<- strsplit(tmp[2], organism)[[1]][2]

    if (is.null(gene_pathway[gene, 1])) {
      gene_pathway[gene,1] = pathway_id
    } else {
      if ([gene,1])) {
        gene_pathway[gene,1] = pathway_id
      } else {
        gene_pathway[gene,1] = paste(gene_pathway[gene, 1], pathway_id, sep=";")
  names(gene_pathway) <- "pathway_id"

Try it with “hsa”:

> gene_to_pathway <- mapGeneToPathway("hsa")
> head(gene_to_pathway, n=3)
10 00232;00983;01100;05204
100 00230;01100;05340
1000 04514;05412

Note that some gene (10) are mapped to 4 pathway id. Combine this function with the second example, you will be able to add pathway names to the table.

Batch trimming Poly-A tails

To assemble transcripts from RNA-seq short reads, one needs to trim the poly-A tails first. Several packages exits for that needs, including trimpoly in the seqclean package, and trimest from the EMBOSS package. I prefer trimest.

Like described in my previous blogs, I like to store a list of fastq files in a text file so that the entire project can be easily wrapped into some workflow. The following bash code does the poly-A trimming for every fastq file in the list.

while read line; do
   name=`basename $line`;
   echo Trimming $name;
   trimest $line $output;
done &lt; trimmed_fastq_list.txt

Emacs color theme for eye ease

Emacs is my favorite editor, however the default color theme comes from emacs (Aquamacs) is not eye friendly, if you have to stare at it for a long time. A gallery of emacs color themes can be found from here, where I found Alato Light theme by Jari Aalto is nice theme for my eyes. Adding the following code to the .emacs file makes it the default color theme when emacs starts.

;; Set my favorite color theme
(require ‘color-theme)

To try it yourself before setting it as the default theme, you can do the following from emacs:

M-x color-theme-select

, then select Aalto Light.