bowtie2, htseq-count, and reference seqeunces downloaded from NCBI

In a typical alignment-based RNA-seq differential gene expression analysis, a bioinformatician downloads reference data (.fna, .gbk, and .gff files) from NCBI, indexes the .fna files for aligners (e.g. bowtie2, bwa, etc.), uses htseq-count to count reads mapped to a specific gene/exon, do statistical analysis to extract significantly regulated genes, and add annotation from the gbk file to those genes. There are several traps in this process. I list a few here.

1) reference names .fna and .gff files

Below are lines extracted from .fna and .gff files downloaded from NCBI:

$ head -3 NC_003143.fna
>gi|16120353|ref|NC_003143.1| Yersinia pestis CO92 chromosome, complete genome

$ head -8 NC_003143.gff
##gff-version 3
#!gff-spec-version 1.20
#!processor NCBI annotwriter
##sequence-region NC_003143.1 1 4653728
NC_003143.1	RefSeq	region	1	4653728	.	+	.	ID=id0;Dbxref=taxon:214092;Is_circular=true;biovar=Orientalis;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=CO92
NC_003143.1	RefSeq	gene	271	711	.	-	.	ID=gene0;Name=YPO0001;Dbxref=GeneID:1172846;gbkey=Gene;locus_tag=YPO0001
NC_003143.1	RefSeq	CDS	271	711	.	-	0	ID=cds0;Name=YP_002345097.1;Parent=gene0;Note=An electron-transfer protein%3B flavodoxin binds one FMN molecule%2C which serves as a redox-active prosthetic group;Dbxref=InterPro:IPR001094,InterPro:IPR008254,UniProtKB%2FTrEMBL:Q7CLF0,Genbank:YP_002345097.1,GeneID:1172846;gbkey=CDS;product=flavodoxin;protein_id=YP_002345097.1;transl_table=11

Notice that in gff file, the reference sequence name is “NC_003143.1” and in .fna file the name is “gi|16120353|ref|NC_003143.1|”. The latter sequence name is also used by bowtie2 when you index that fna file. The discrepancy in sequence names cause error when you run htseq-count, because htseq-count will get the reference sequence names from the SAM file (, which falls back to the bowtie2 index sequence name), and will try to find that same reference names in the gff file, which do not exist. I usually use some code like below to “fix” the gff file.

$sed -i s/^NC_003143.1/gi\|16120353\|ref\|NC_003143.1\|/g YP_combined.gff 

With the above fix, htseq-count may work. But to get the right results, you still need to feed the right argument to this program.

2) arguments in htseq-count

By default, htseq-count uses “exon” as the feature type for counting reads. While this may work perfectly with Ensembl GTF files, it may not always be the right one for your study. For example, in Y. pestis only 97 exons entries are available in the gff file. But there are over 4305 genes (see below).[/p]

$ cut -f3 ~/scratch/db/src/Yersinia_pestis_CO92_uid57621/YP_combined.gff |sort|uniq -c
      4 ###
   4067 CDS
     97 exon
      2 five_prime_UTR
   4305 gene
      4 #!gff-spec-version 1.20
      4 ##gff-version 3
      6 ncRNA
      4 #!processor NCBI annotwriter
      1 promoter
   9583 region
    413 repeat_region
    203 ribosome_entry_site
     19 rRNA
      1 ##sequence-region NC_003131.1 1 70305
      1 ##sequence-region NC_003132.1 1 9612
      1 ##sequence-region NC_003134.1 1 96210
      1 ##sequence-region NC_003143.1 1 4653728
      3 sequence_variant
      4 ##species
      1 stem_loop
      1 tmRNA
      1 transcript
     70 tRNA

Thus, it may be safer to use “gene” as the feature name, although the best approach is always check how many “exon” entries are there first.

The above code also shows 19 rRNAs in the reference. In case you are interested in evaluating how successful your rRNA removal step worked, you can use “rRNA” as the feature name to let htseq-count count the rRNA reads.

3) Strand direction in htseq-count

One nice feature of htseq-count is that it takes into account the directions of the read and the reference so that a read from the wrong direction, even if it is mapped to the right place of an exon/gene, it will not be counted as a read from that exon/gene. The option “-s” tells htseq-count that you are doing directional RNA-seq and the direction each mate shall be aligned in a pair. The default value is suitable for Illumina platform.

4) ID in htseq-count

I found use “locus_tag” as the ID string for htseq-count makes annotation easier than “gene_id”, at least for bacteria .gff files downloaded from NCBI. This can be set by the “-i locus_tag” option.

In practice I use the following job script for counting expressions, after making sure that the bowtie2 and gff are using the same sequence names. This code count all the gene hits sequentially. You can also parallel the counting for each bam file to save time.

#$ -N htseq-count
#$ -pe smp 4 -l dev -l h_vmem=8G
#$ -o  htseq_count.out
#$ -e  htseq_count.err
#$ -S /bin/bash

for i in `ls *.bam`; 
    echo -e "Counting "$bam
    samtools view $bam|htseq-count -t gene -i locus_tag -s yes -q - genome_plasmids_combined.gff > $count

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )


Connecting to %s