Monthly Archives: May 2013

Search KEGG Pathway Maps by Keyword in R

I was recently asked about how to list KEGG pathways not specific to any organism, which was not addressed in my previous post on accessing KEGG pathway from R.  Here, I post one example that shows you how to search KEGG pathway by keyword from R.

First, let’s define the function getKeggPathwayTable:

getKeggPathwayTable <- function(){
  pathway_link_REST_url <- "http://rest.kegg.jp/list/pathway/"
  kegg_pathways<- data.frame()

  current_row = 1
  for (line in readLines(pathway_link_REST_url)) {
    tmp <- strsplit(line, "\t")[[1]]
    map <- tmp[1]
    map <- strsplit(map, ":")[[1]][2]  
    pathway_name<- tmp[2]
    kegg_pathways[current_row, 1]=map
    kegg_pathways[current_row, 2]=pathway_name
    current_row = current_row + 1
  }
  names(kegg_pathways) <- c("map_id","name")
  kegg_pathways
}

Running the above function will give a data frame that lists all the current KEGG pathway IDs and names.

Next, lets define a function to search the data frame returned by getKeggPathwayTable by keyword.

searchKeggPathwayByKeyword <- function(keyword) {
  kegg_pathways <- getKeggPathwayTable()
  hits <- grep(keyword, kegg_pathways$name, ignore.case=TRUE)
  kegg_pathways[hits,]
}

Note that I use “ignore.case=TRUE” to make sure that the search is not case sensitive. Now let’s try it with the keyword “vitamin” and “sucrose.”

> searchKeggPathwayByKeyword("vitamin")
      map_id                             name
110 map00750            Vitamin B6 metabolism
303 map04977 Vitamin digestion and absorption

> searchKeggPathwayByKeyword("sucrose")
     map_id                          name
60 map00500 Starch and sucrose metabolism

How to draw Venn diagrams from differential gene expression data

Venn diagrams is commonly used to visualize the overlapping among data sets, including differential gene expression data under various condition. Here I shaw you one way to draw venn diagram (see below) in R using VennDiagram package using data generated from DESeq.

Image

THRESHOLD <- 0.05
A.cds.sig <- subset(A.cds.result, pval<THRESHOLD) ## A.cds.result was generated from DESeq 
B.cds.sig <- subset(B.cds.result, pval<THRESHOLD) ## B.cds.result was generated from DESeq 

library(VennDiagram)
pdf("venn_diagram.pdf")
venn.plot <- venn.diagram(list(A.cds.sig$id, B.cds.sig$id), NULL, fill=c("red", "green"), alpha=c(0.5,0.5), cex = 2, cat.fontface=4, category.names=c("Condition A", "Condition B"))
grid.draw(venn.plot)
dev.off()

Here I used grid::grid.draw function to generate the PDF, because the venn.diagram complains generating PDF. I have manually changed the position and font size of the category legends in the figure above. However, there are some parameters you can adjust in the venn.diagram function to set the legends, layout, transparency (alpha-channel), etc.

Benefits of Paired-end Directional RNA-seq

The most frequently asked questions for my poster presentation at ASM general meeting was the benefits of PE directional RNA-seq over single-end and/or non-directional sequencing. 

PE sequencing has the following advantages over SE:

1) Better mapping results.

With PE reads, you know roughly the range of the distance of two reads. This may help you when you map the reads back to a reference genome. For example, with splicing junction events, two exons maybe joined together. If mate 1 and mate 2 are mapped to a genome with 1kb between them, while you expect the distance between them in the transcript is only about 500bp at maximum, then one intron around 500 bp may be moved. You don’t get such information from SE. Similarly, PE may also solve some non-uniq (secondary) mapping problem, if one of the mates come from a repetitive region and another comes from a non-repetitive region.

2) Improved directional sequencing accuracy

Because PE reads point to opposite directions, we can extract all the read pairs of opposite directions to get better confidence in directional RNA-seq. If, for example, you have some read pairs that point to the same direction, then you can ignore those reads for directional information extraction. With SE, you jut cannot tell when there is such a sequencing error.

bowtie2, htseq-count, and reference seqeunces downloaded from NCBI

In a typical alignment-based RNA-seq differential gene expression analysis, a bioinformatician downloads reference data (.fna, .gbk, and .gff files) from NCBI, indexes the .fna files for aligners (e.g. bowtie2, bwa, etc.), uses htseq-count to count reads mapped to a specific gene/exon, do statistical analysis to extract significantly regulated genes, and add annotation from the gbk file to those genes. There are several traps in this process. I list a few here.

1) reference names .fna and .gff files

Below are lines extracted from .fna and .gff files downloaded from NCBI:

$ head -3 NC_003143.fna
>gi|16120353|ref|NC_003143.1| Yersinia pestis CO92 chromosome, complete genome
GATCTTTTTATTTAAACGATCTCTTTATTAGATCTCTTATTAGGATCATGATCCTCTGTGGATAAGTGAT
TATTCACATGGCAGATCATATAATTAAGGAGGATCGTTTGTTGTGAGTGACCGGTGATCGTATTGCGTAT

$ head -8 NC_003143.gff
##gff-version 3
#!gff-spec-version 1.20
#!processor NCBI annotwriter
##sequence-region NC_003143.1 1 4653728
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=214092
NC_003143.1	RefSeq	region	1	4653728	.	+	.	ID=id0;Dbxref=taxon:214092;Is_circular=true;biovar=Orientalis;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=CO92
NC_003143.1	RefSeq	gene	271	711	.	-	.	ID=gene0;Name=YPO0001;Dbxref=GeneID:1172846;gbkey=Gene;locus_tag=YPO0001
NC_003143.1	RefSeq	CDS	271	711	.	-	0	ID=cds0;Name=YP_002345097.1;Parent=gene0;Note=An electron-transfer protein%3B flavodoxin binds one FMN molecule%2C which serves as a redox-active prosthetic group;Dbxref=InterPro:IPR001094,InterPro:IPR008254,UniProtKB%2FTrEMBL:Q7CLF0,Genbank:YP_002345097.1,GeneID:1172846;gbkey=CDS;product=flavodoxin;protein_id=YP_002345097.1;transl_table=11

Notice that in gff file, the reference sequence name is “NC_003143.1” and in .fna file the name is “gi|16120353|ref|NC_003143.1|”. The latter sequence name is also used by bowtie2 when you index that fna file. The discrepancy in sequence names cause error when you run htseq-count, because htseq-count will get the reference sequence names from the SAM file (, which falls back to the bowtie2 index sequence name), and will try to find that same reference names in the gff file, which do not exist. I usually use some code like below to “fix” the gff file.

$sed -i s/^NC_003143.1/gi\|16120353\|ref\|NC_003143.1\|/g YP_combined.gff 

With the above fix, htseq-count may work. But to get the right results, you still need to feed the right argument to this program.

2) arguments in htseq-count

By default, htseq-count uses “exon” as the feature type for counting reads. While this may work perfectly with Ensembl GTF files, it may not always be the right one for your study. For example, in Y. pestis only 97 exons entries are available in the gff file. But there are over 4305 genes (see below).[/p]

$ cut -f3 ~/scratch/db/src/Yersinia_pestis_CO92_uid57621/YP_combined.gff |sort|uniq -c
      4 ###
   4067 CDS
     97 exon
      2 five_prime_UTR
   4305 gene
      4 #!gff-spec-version 1.20
      4 ##gff-version 3
      6 ncRNA
      4 #!processor NCBI annotwriter
      1 promoter
   9583 region
    413 repeat_region
    203 ribosome_entry_site
     19 rRNA
      1 ##sequence-region NC_003131.1 1 70305
      1 ##sequence-region NC_003132.1 1 9612
      1 ##sequence-region NC_003134.1 1 96210
      1 ##sequence-region NC_003143.1 1 4653728
      3 sequence_variant
      4 ##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=214092
      1 stem_loop
      1 tmRNA
      1 transcript
     70 tRNA

Thus, it may be safer to use “gene” as the feature name, although the best approach is always check how many “exon” entries are there first.

The above code also shows 19 rRNAs in the reference. In case you are interested in evaluating how successful your rRNA removal step worked, you can use “rRNA” as the feature name to let htseq-count count the rRNA reads.

3) Strand direction in htseq-count

One nice feature of htseq-count is that it takes into account the directions of the read and the reference so that a read from the wrong direction, even if it is mapped to the right place of an exon/gene, it will not be counted as a read from that exon/gene. The option “-s” tells htseq-count that you are doing directional RNA-seq and the direction each mate shall be aligned in a pair. The default value is suitable for Illumina platform.

4) ID in htseq-count

I found use “locus_tag” as the ID string for htseq-count makes annotation easier than “gene_id”, at least for bacteria .gff files downloaded from NCBI. This can be set by the “-i locus_tag” option.

In practice I use the following job script for counting expressions, after making sure that the bowtie2 and gff are using the same sequence names. This code count all the gene hits sequentially. You can also parallel the counting for each bam file to save time.

#!/bin/bash
#$ -N htseq-count
#$ -pe smp 4 -l dev -l h_vmem=8G
#$ -o  htseq_count.out
#$ -e  htseq_count.err
#$ -S /bin/bash

date
for i in `ls *.bam`; 
do
    bam=$PWD/$i;
    count=${bam/.bam/_htseq.count}
    echo -e "Counting "$bam
    samtools view $bam|htseq-count -t gene -i locus_tag -s yes -q - genome_plasmids_combined.gff > $count
done
date

Directional RNA-seq— Part 2: Explore SAM flags using samtools

In Directional RNA-seq— Part 1: SAM file flags I showed you what flags in SAM file are important for directional RNA-seq analysis. In this blog, I will show you how to identify flags from SAM file using samtools. 

We will start with how to view a BAM file. BAM file is binary equivalent of SAM file. Compared to the text-based SAM file, BAM file is smaller, can be sorted, and indexed for fast access.

The following command uses the view tool in samtools to show two sam records:

$ samtools view my_sorted.bam |head -2

HISEQ:166:D1M8RACXX:5:1102:6585:48925	99	myref	1	1101M	=	5	105	GTGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCCTGATTCAGGAGAGTTTATGGTCACTTTTNN	@?@DDEFA;DHFH+AFEEEHIDE>F<*?1?FF@GH=DAFGGIGGEGGEHGAA@7D6:?3;C@@C@@CD@:>A:AC5;AA((8<+++(:A>:@C>CCC####	AS:i:-2	XS:i:-2	XN:i:0	XM:i:2	XO:i:0	XG:i:0	NM:i:2	MD:Z:99G0A0	YS:i:0	YT:Z:CP
HISEQ:166:D1M8RACXX:5:2207:3731:104462	99	myref	1	1101M	=	4	104	GTGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCCTGATTCAGGAGAGTTTATGGTCACTTTTNN	@CCDDFADFDHHD<CFGGHHCBHHDHGE>GHIHIIGAHGGGIIIIG=FHDE@DHEDHFCBCCDCDCCCCCCCCCCCDC:>C?8<+:>CC>:>4:ACC>AC?	AS:i:-2	XS:i:-2	XN:i:0	XM:i:2	XO:i:0	XG:i:0	NM:i:2	MD:Z:99G0A0	YS:i:-4	YT:Z:CP

In each sam record there are a few columns separated by tabs. The first column is the read id, which tells you from which location in the flow cell a read is generated. The second column is the SAM flags that we are focusing on. Here, both reads show 99, which indicates these two reads were mate 1s and the template transcript has an orientation that goes from 5′->3′.

To get all the mapped reads come from the 5′->3′ template transcript, we can use the following command:

$ samtools view my_sorted.bam |awk '{if ($2==99||$2=147) print $1, $2}'

We can also use the the filtering option in samtools:

$ samtools view -f 99  EHC_9_D1M8RACXX_GCCAAT_L005__yp_sorted.bam

In the above code, argument -f specifies that we only want to view reads with flag 99. More arguments are available from $ samtools view -?. However, the current filter function does not support boolean operators, so you cannot filter all the 5′->3′ transcripts in one run. You will need to use -f 99 then -f 147. The AWK way seems more convenient.

To split a BAM file to 2 BAM files based on the transcript orientation, we just need to pipe the reads to samtools view and ask it to output bam file. And if your original BAM files have been sorted already, there is no need to sort the new bam files again, since the order have been retained. However, you will need to index them again.

$ samtools view -h my_sorted.bam | awk '{if (NR<7) print $0}; {if($2==99||$2==147) print $0}'|head
@HD	VN:1.0	SO:unsorted
@SQ	SN:my_ref_1	LN:72305
@SQ	SN:my_ref_2	LN:9012
@SQ	SN:my_ref_3	LN:5623228
@SQ	SN:my_ref_4	LN:86410
@PG	ID:bowtie2	PN:bowtie2	VN:2.0.0-beta7
HISEQ:166:D1M8RACXX:1:1308:16355:134016	99	my_ref_1	1	1101M	=	4	104	GTGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCCTGATTCAGGAGAGTTTATGGTCACTTTTGN	8??DBD:?1?D@<EE6@EAEE?+?4?*99D=DD?B?BDD>?9=B9)8)7@A@C;)'''39>AAA>:>5>A>A>A:AA>A>A?10+:3>AA###########	AS:i:-1	XS:i:-1	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:100A0	YS:i:-11	YT:Z:CP
HISEQ:166:D1M8RACXX:1:1106:12252:127620	99	my_ref_1	2	1101M	=	237	336	TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCCTGATTCAGGAGAGTTTATGGTCACTTTTGAN	CC@FFFFFHHHHHIJJJJIIHEHIJJIIIGIGIIJJJJJIJIJJIJJIIJIIGJIHF3;CDEDEDDDDDDDDDDDDDDDDDAB@C>ACDCACDDDDDDC??	AS:i:-1	XS:i:-1	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:100G0	YS:i:-19	YT:Z:CP
HISEQ:166:D1M8RACXX:1:1202:9208:148034	99	my_ref_1	2	1101M	=	236	335	TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCCTGATTCAGGAGAGTTTATGGTCACTTTTGAN	@CBFDDFFHHHHCFGHIJFIFHHGIIEIGEHIIIJJIHEHIGIIJJEIJJIGEGIHF9>ADECEDDDDCCDCDDD@CCC><8<:C>:>A@:>ACCCCCC@@	AS:i:-1	XS:i:-1	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:100G0	YS:i:-20	YT:Z:CP
HISEQ:166:D1M8RACXX:1:1202:19344:148604	99	my_ref_1	2	1101M	=	237	336	TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCCTGATTCAGGAGAGTTTATGGTCACTTTTGAN	?@8?BDD7:CFFD;)AEE<E<AA?ED<FEFFEEFBGFI<FFFEFE==@;@FFEFIFD',;;>@BB>ABBA::5(:>:@B######################	AS:i:-1	XS:i:-1	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:100G0	YS:i:-11	YT:Z:CP

The example above shows the code that we can use to pipe to samtools view to get a BAM file. A few things need to be noted here: 1) The -h arugment in samtools view ask it to print out header, which is needed for samtools view to encode a SAM file to BAM file; 2) In the AWK code “{if (NR<7) print $0}" allows the header lines (6 lines in the example) be printed out instead of being filtered out by the following awk line; 3) In the header, there are 4 lines start with "SN:". Each of these lines specifies a reference that was used in the alignment. Here, I mapped all the reads to one genome and three plasmids. In your case, it may be be different, thus you need to change the NR number accordingly.

With all these explained, it is easy to code the output to a BAM file—a BAM file that only has the reads generated from template transcripts from 5′->3′.

$ samtools view -h my_sorted.bam | awk '{if (NR<7) print $0}; {if ($2==99||$2==147) print $0}'|samtools view -bS - > my_sorted_plus_strand.bam

Here arugments -bS tells samtools view to read in SAM format, and output BAM. Option “-” tells it the input is from stdin—the pipe. Finally “>” tells the program to direct its output, a BAM file in this case, to a file.

In my next post, I will show you how to generate coverage plots with strand information.