Category Archives: Tips

We just moved to a new site.

Hello All,

After being silent in posting for a long time, I have finally decided to resume my knowledge sharing practice. If you are interested in seeing the new posts, please go to the new site:

All the old posts have been ported to the new site. You are welcome to leave comments there and I will reply them. I am not going to reply to comments in this old site. Thanks.



New Illumina platforms and software support

It has been a while since I blogged last time. Life has been really busy after my relocation and still trying to adjust to Atlanta. But here is my latest post: I just attended an Illumina platform demonstration seminar and found a few interesting things.

1) 2X150 bp PE reads from NextSeq 500 has roughly 75% or more reads above Q30, which is similar to the MiSeq and HiSeq platforms introduced years back.

2) NextSeq 500 uses fast 2-channel SBS technology for base detection. So instead of using 4 colors (as in MiSeq and HiSeq), it uses only red or green for base color detection. Unlike MiSeq there is no analysis software installed on the machine. Analysis software is external (and supported by BaseCloud platform, maybe others too).

3) NextSeq uses a base-calling software similar to current MiSeq software.

4) The MiSeq software was written from scratch using C# and will replace the historical CASAVA software written mainly in C.

It is interesting that the C#-based code will replace CASAVA for future Illumina platforms and HiSeq does not use the fast 2-channel SBS technology.

Other interesting things:

5) Illumina recommends to use BWA and GACT for variant calling, although this decision was based on the prefernce of the external user community and Illumina is aware that GACT was designed for diploid chromosomes.

6) Illumina has no answer about mixed read sequence interpretation.

One-line script for checking BAM alignment ratio

It is a common practice to check the percentage of reads aligned/mapped to a reference genome. For indexed BAM files, the following one-line script solves the problem.


[biobeat mapping]$ for i in `ls *.bam`; do samtools idxstats $i|awk -v file="$i" '{mapped+=$3; unmapped+=$4} END{ print file, 100*mapped/(mapped+unmapped)}'; done
Study_10_L005.bam 60.8237
Study_11_L001.bam 10.2694
Study_11_L005.bam 10.311
Study_13_L006.bam 97.686
Study_14_L006.bam 23.2649
Study_15_L004.bam 0.86006
Study_16_L004.bam 4.99298
Study_1_L005.bam 0.144142
Study_20_L003.bam 99.6083
Study_21_L003.bam 98.424
Study_22_L003.bam 21.2799
Study_23_L002.bam 2.48101
Study_24_L002.bam 0.0676792
Study_2_L005.bam 5.7271
Study_3_L005.bam 0.360636
Study_9_L001.bam 99.422
Study_9_L005.bam 99.443

Search KEGG Pathway Maps by Keyword in R

I was recently asked about how to list KEGG pathways not specific to any organism, which was not addressed in my previous post on accessing KEGG pathway from R.  Here, I post one example that shows you how to search KEGG pathway by keyword from R.

First, let’s define the function getKeggPathwayTable:

getKeggPathwayTable <- function(){
  pathway_link_REST_url <- ""
  kegg_pathways<- data.frame()

  current_row = 1
  for (line in readLines(pathway_link_REST_url)) {
    tmp <- strsplit(line, "\t")[[1]]
    map <- tmp[1]
    map <- strsplit(map, ":")[[1]][2]  
    pathway_name<- tmp[2]
    kegg_pathways[current_row, 1]=map
    kegg_pathways[current_row, 2]=pathway_name
    current_row = current_row + 1
  names(kegg_pathways) <- c("map_id","name")

Running the above function will give a data frame that lists all the current KEGG pathway IDs and names.

Next, lets define a function to search the data frame returned by getKeggPathwayTable by keyword.

searchKeggPathwayByKeyword <- function(keyword) {
  kegg_pathways <- getKeggPathwayTable()
  hits <- grep(keyword, kegg_pathways$name,

Note that I use “” to make sure that the search is not case sensitive. Now let’s try it with the keyword “vitamin” and “sucrose.”

> searchKeggPathwayByKeyword("vitamin")
      map_id                             name
110 map00750            Vitamin B6 metabolism
303 map04977 Vitamin digestion and absorption

> searchKeggPathwayByKeyword("sucrose")
     map_id                          name
60 map00500 Starch and sucrose metabolism

How to draw Venn diagrams from differential gene expression data

Venn diagrams is commonly used to visualize the overlapping among data sets, including differential gene expression data under various condition. Here I shaw you one way to draw venn diagram (see below) in R using VennDiagram package using data generated from DESeq.


A.cds.sig <- subset(A.cds.result, pval<THRESHOLD) ## A.cds.result was generated from DESeq 
B.cds.sig <- subset(B.cds.result, pval<THRESHOLD) ## B.cds.result was generated from DESeq 

venn.plot <- venn.diagram(list(A.cds.sig$id, B.cds.sig$id), NULL, fill=c("red", "green"), alpha=c(0.5,0.5), cex = 2, cat.fontface=4, category.names=c("Condition A", "Condition B"))

Here I used grid::grid.draw function to generate the PDF, because the venn.diagram complains generating PDF. I have manually changed the position and font size of the category legends in the figure above. However, there are some parameters you can adjust in the venn.diagram function to set the legends, layout, transparency (alpha-channel), etc.

bowtie2, htseq-count, and reference seqeunces downloaded from NCBI

In a typical alignment-based RNA-seq differential gene expression analysis, a bioinformatician downloads reference data (.fna, .gbk, and .gff files) from NCBI, indexes the .fna files for aligners (e.g. bowtie2, bwa, etc.), uses htseq-count to count reads mapped to a specific gene/exon, do statistical analysis to extract significantly regulated genes, and add annotation from the gbk file to those genes. There are several traps in this process. I list a few here.

1) reference names .fna and .gff files

Below are lines extracted from .fna and .gff files downloaded from NCBI:

$ head -3 NC_003143.fna
>gi|16120353|ref|NC_003143.1| Yersinia pestis CO92 chromosome, complete genome

$ head -8 NC_003143.gff
##gff-version 3
#!gff-spec-version 1.20
#!processor NCBI annotwriter
##sequence-region NC_003143.1 1 4653728
NC_003143.1	RefSeq	region	1	4653728	.	+	.	ID=id0;Dbxref=taxon:214092;Is_circular=true;biovar=Orientalis;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=CO92
NC_003143.1	RefSeq	gene	271	711	.	-	.	ID=gene0;Name=YPO0001;Dbxref=GeneID:1172846;gbkey=Gene;locus_tag=YPO0001
NC_003143.1	RefSeq	CDS	271	711	.	-	0	ID=cds0;Name=YP_002345097.1;Parent=gene0;Note=An electron-transfer protein%3B flavodoxin binds one FMN molecule%2C which serves as a redox-active prosthetic group;Dbxref=InterPro:IPR001094,InterPro:IPR008254,UniProtKB%2FTrEMBL:Q7CLF0,Genbank:YP_002345097.1,GeneID:1172846;gbkey=CDS;product=flavodoxin;protein_id=YP_002345097.1;transl_table=11

Notice that in gff file, the reference sequence name is “NC_003143.1” and in .fna file the name is “gi|16120353|ref|NC_003143.1|”. The latter sequence name is also used by bowtie2 when you index that fna file. The discrepancy in sequence names cause error when you run htseq-count, because htseq-count will get the reference sequence names from the SAM file (, which falls back to the bowtie2 index sequence name), and will try to find that same reference names in the gff file, which do not exist. I usually use some code like below to “fix” the gff file.

$sed -i s/^NC_003143.1/gi\|16120353\|ref\|NC_003143.1\|/g YP_combined.gff 

With the above fix, htseq-count may work. But to get the right results, you still need to feed the right argument to this program.

2) arguments in htseq-count

By default, htseq-count uses “exon” as the feature type for counting reads. While this may work perfectly with Ensembl GTF files, it may not always be the right one for your study. For example, in Y. pestis only 97 exons entries are available in the gff file. But there are over 4305 genes (see below).[/p]

$ cut -f3 ~/scratch/db/src/Yersinia_pestis_CO92_uid57621/YP_combined.gff |sort|uniq -c
      4 ###
   4067 CDS
     97 exon
      2 five_prime_UTR
   4305 gene
      4 #!gff-spec-version 1.20
      4 ##gff-version 3
      6 ncRNA
      4 #!processor NCBI annotwriter
      1 promoter
   9583 region
    413 repeat_region
    203 ribosome_entry_site
     19 rRNA
      1 ##sequence-region NC_003131.1 1 70305
      1 ##sequence-region NC_003132.1 1 9612
      1 ##sequence-region NC_003134.1 1 96210
      1 ##sequence-region NC_003143.1 1 4653728
      3 sequence_variant
      4 ##species
      1 stem_loop
      1 tmRNA
      1 transcript
     70 tRNA

Thus, it may be safer to use “gene” as the feature name, although the best approach is always check how many “exon” entries are there first.

The above code also shows 19 rRNAs in the reference. In case you are interested in evaluating how successful your rRNA removal step worked, you can use “rRNA” as the feature name to let htseq-count count the rRNA reads.

3) Strand direction in htseq-count

One nice feature of htseq-count is that it takes into account the directions of the read and the reference so that a read from the wrong direction, even if it is mapped to the right place of an exon/gene, it will not be counted as a read from that exon/gene. The option “-s” tells htseq-count that you are doing directional RNA-seq and the direction each mate shall be aligned in a pair. The default value is suitable for Illumina platform.

4) ID in htseq-count

I found use “locus_tag” as the ID string for htseq-count makes annotation easier than “gene_id”, at least for bacteria .gff files downloaded from NCBI. This can be set by the “-i locus_tag” option.

In practice I use the following job script for counting expressions, after making sure that the bowtie2 and gff are using the same sequence names. This code count all the gene hits sequentially. You can also parallel the counting for each bam file to save time.

#$ -N htseq-count
#$ -pe smp 4 -l dev -l h_vmem=8G
#$ -o  htseq_count.out
#$ -e  htseq_count.err
#$ -S /bin/bash

for i in `ls *.bam`; 
    echo -e "Counting "$bam
    samtools view $bam|htseq-count -t gene -i locus_tag -s yes -q - genome_plasmids_combined.gff > $count