We just moved to a new site.

Hello All,

After being silent in posting for a long time, I have finally decided to resume my knowledge sharing practice. If you are interested in seeing the new posts, please go to the new site: http://C2Bio.com/blog

All the old posts have been ported to the new site. You are welcome to leave comments there and I will reply them. I am not going to reply to comments in this old site. Thanks.


New Illumina platforms and software support

It has been a while since I blogged last time. Life has been really busy after my relocation and still trying to adjust to Atlanta. But here is my latest post: I just attended an Illumina platform demonstration seminar and found a few interesting things.

1) 2X150 bp PE reads from NextSeq 500 has roughly 75% or more reads above Q30, which is similar to the MiSeq and HiSeq platforms introduced years back.

2) NextSeq 500 uses fast 2-channel SBS technology for base detection. So instead of using 4 colors (as in MiSeq and HiSeq), it uses only red or green for base color detection. Unlike MiSeq there is no analysis software installed on the machine. Analysis software is external (and supported by BaseCloud platform, maybe others too).

3) NextSeq uses a base-calling software similar to current MiSeq software.

4) The MiSeq software was written from scratch using C# and will replace the historical CASAVA software written mainly in C.

It is interesting that the C#-based code will replace CASAVA for future Illumina platforms and HiSeq does not use the fast 2-channel SBS technology.

Other interesting things:

5) Illumina recommends to use BWA and GACT for variant calling, although this decision was based on the prefernce of the external user community and Illumina is aware that GACT was designed for diploid chromosomes.

6) Illumina has no answer about mixed read sequence interpretation.

Adding LaTex styles to user directory

I just learned the following tricks about how to add LaTex styles to a user directory (with no admin privileges).

1) Use the following command in shell to check out TeX variable $TEXINPTUS:

$ kpsewhich -expand-var ‘$TEXINPUTS’
, which generates output likes the following:


Please notice that that “/users/biobetat/texmf” is used.

2) Now we can add all the style file collections there, or a subdirectory of that path. Below is where I put my files.

$ ls ~/texmf/tex/latex/mycls/
spbasic.bst spmpsci.bst spphys.bst svind.ist svmono.cls

One-line script for checking BAM alignment ratio

It is a common practice to check the percentage of reads aligned/mapped to a reference genome. For indexed BAM files, the following one-line script solves the problem.


[biobeat mapping]$ for i in `ls *.bam`; do samtools idxstats $i|awk -v file="$i" '{mapped+=$3; unmapped+=$4} END{ print file, 100*mapped/(mapped+unmapped)}'; done
Study_10_L005.bam 60.8237
Study_11_L001.bam 10.2694
Study_11_L005.bam 10.311
Study_13_L006.bam 97.686
Study_14_L006.bam 23.2649
Study_15_L004.bam 0.86006
Study_16_L004.bam 4.99298
Study_1_L005.bam 0.144142
Study_20_L003.bam 99.6083
Study_21_L003.bam 98.424
Study_22_L003.bam 21.2799
Study_23_L002.bam 2.48101
Study_24_L002.bam 0.0676792
Study_2_L005.bam 5.7271
Study_3_L005.bam 0.360636
Study_9_L001.bam 99.422
Study_9_L005.bam 99.443

Search KEGG Pathway Maps by Keyword in R

I was recently asked about how to list KEGG pathways not specific to any organism, which was not addressed in my previous post on accessing KEGG pathway from R.  Here, I post one example that shows you how to search KEGG pathway by keyword from R.

First, let’s define the function getKeggPathwayTable:

getKeggPathwayTable <- function(){
  pathway_link_REST_url <- "http://rest.kegg.jp/list/pathway/"
  kegg_pathways<- data.frame()

  current_row = 1
  for (line in readLines(pathway_link_REST_url)) {
    tmp <- strsplit(line, "\t")[[1]]
    map <- tmp[1]
    map <- strsplit(map, ":")[[1]][2]  
    pathway_name<- tmp[2]
    kegg_pathways[current_row, 1]=map
    kegg_pathways[current_row, 2]=pathway_name
    current_row = current_row + 1
  names(kegg_pathways) <- c("map_id","name")

Running the above function will give a data frame that lists all the current KEGG pathway IDs and names.

Next, lets define a function to search the data frame returned by getKeggPathwayTable by keyword.

searchKeggPathwayByKeyword <- function(keyword) {
  kegg_pathways <- getKeggPathwayTable()
  hits <- grep(keyword, kegg_pathways$name, ignore.case=TRUE)

Note that I use “ignore.case=TRUE” to make sure that the search is not case sensitive. Now let’s try it with the keyword “vitamin” and “sucrose.”

> searchKeggPathwayByKeyword("vitamin")
      map_id                             name
110 map00750            Vitamin B6 metabolism
303 map04977 Vitamin digestion and absorption

> searchKeggPathwayByKeyword("sucrose")
     map_id                          name
60 map00500 Starch and sucrose metabolism

How to draw Venn diagrams from differential gene expression data

Venn diagrams is commonly used to visualize the overlapping among data sets, including differential gene expression data under various condition. Here I shaw you one way to draw venn diagram (see below) in R using VennDiagram package using data generated from DESeq.


A.cds.sig <- subset(A.cds.result, pval<THRESHOLD) ## A.cds.result was generated from DESeq 
B.cds.sig <- subset(B.cds.result, pval<THRESHOLD) ## B.cds.result was generated from DESeq 

venn.plot <- venn.diagram(list(A.cds.sig$id, B.cds.sig$id), NULL, fill=c("red", "green"), alpha=c(0.5,0.5), cex = 2, cat.fontface=4, category.names=c("Condition A", "Condition B"))

Here I used grid::grid.draw function to generate the PDF, because the venn.diagram complains generating PDF. I have manually changed the position and font size of the category legends in the figure above. However, there are some parameters you can adjust in the venn.diagram function to set the legends, layout, transparency (alpha-channel), etc.

Benefits of Paired-end Directional RNA-seq

The most frequently asked questions for my poster presentation at ASM general meeting was the benefits of PE directional RNA-seq over single-end and/or non-directional sequencing. 

PE sequencing has the following advantages over SE:

1) Better mapping results.

With PE reads, you know roughly the range of the distance of two reads. This may help you when you map the reads back to a reference genome. For example, with splicing junction events, two exons maybe joined together. If mate 1 and mate 2 are mapped to a genome with 1kb between them, while you expect the distance between them in the transcript is only about 500bp at maximum, then one intron around 500 bp may be moved. You don’t get such information from SE. Similarly, PE may also solve some non-uniq (secondary) mapping problem, if one of the mates come from a repetitive region and another comes from a non-repetitive region.

2) Improved directional sequencing accuracy

Because PE reads point to opposite directions, we can extract all the read pairs of opposite directions to get better confidence in directional RNA-seq. If, for example, you have some read pairs that point to the same direction, then you can ignore those reads for directional information extraction. With SE, you jut cannot tell when there is such a sequencing error.