In a typical alignment-based RNA-seq differential gene expression analysis, a bioinformatician downloads reference data (.fna, .gbk, and .gff files) from NCBI, indexes the .fna files for aligners (e.g. bowtie2, bwa, etc.), uses htseq-count to count reads mapped to a specific gene/exon, do statistical analysis to extract significantly regulated genes, and add annotation from the gbk file to those genes. There are several traps in this process. I list a few here.
1) reference names .fna and .gff files
Below are lines extracted from .fna and .gff files downloaded from NCBI:
$ head -3 NC_003143.fna
>gi|16120353|ref|NC_003143.1| Yersinia pestis CO92 chromosome, complete genome
$ head -8 NC_003143.gff
#!processor NCBI annotwriter
##sequence-region NC_003143.1 1 4653728
NC_003143.1 RefSeq region 1 4653728 . + . ID=id0;Dbxref=taxon:214092;Is_circular=true;biovar=Orientalis;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=CO92
NC_003143.1 RefSeq gene 271 711 . - . ID=gene0;Name=YPO0001;Dbxref=GeneID:1172846;gbkey=Gene;locus_tag=YPO0001
NC_003143.1 RefSeq CDS 271 711 . - 0 ID=cds0;Name=YP_002345097.1;Parent=gene0;Note=An electron-transfer protein%3B flavodoxin binds one FMN molecule%2C which serves as a redox-active prosthetic group;Dbxref=InterPro:IPR001094,InterPro:IPR008254,UniProtKB%2FTrEMBL:Q7CLF0,Genbank:YP_002345097.1,GeneID:1172846;gbkey=CDS;product=flavodoxin;protein_id=YP_002345097.1;transl_table=11
Notice that in gff file, the reference sequence name is “NC_003143.1” and in .fna file the name is “gi|16120353|ref|NC_003143.1|”. The latter sequence name is also used by bowtie2 when you index that fna file. The discrepancy in sequence names cause error when you run htseq-count, because htseq-count will get the reference sequence names from the SAM file (, which falls back to the bowtie2 index sequence name), and will try to find that same reference names in the gff file, which do not exist. I usually use some code like below to “fix” the gff file.
$sed -i s/^NC_003143.1/gi\|16120353\|ref\|NC_003143.1\|/g YP_combined.gff
With the above fix, htseq-count may work. But to get the right results, you still need to feed the right argument to this program.
2) arguments in htseq-count
By default, htseq-count uses “exon” as the feature type for counting reads. While this may work perfectly with Ensembl GTF files, it may not always be the right one for your study. For example, in Y. pestis only 97 exons entries are available in the gff file. But there are over 4305 genes (see below).[/p]
$ cut -f3 ~/scratch/db/src/Yersinia_pestis_CO92_uid57621/YP_combined.gff |sort|uniq -c
4 #!gff-spec-version 1.20
4 ##gff-version 3
4 #!processor NCBI annotwriter
1 ##sequence-region NC_003131.1 1 70305
1 ##sequence-region NC_003132.1 1 9612
1 ##sequence-region NC_003134.1 1 96210
1 ##sequence-region NC_003143.1 1 4653728
4 ##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=214092
Thus, it may be safer to use “gene” as the feature name, although the best approach is always check how many “exon” entries are there first.
The above code also shows 19 rRNAs in the reference. In case you are interested in evaluating how successful your rRNA removal step worked, you can use “rRNA” as the feature name to let htseq-count count the rRNA reads.
3) Strand direction in htseq-count
One nice feature of htseq-count is that it takes into account the directions of the read and the reference so that a read from the wrong direction, even if it is mapped to the right place of an exon/gene, it will not be counted as a read from that exon/gene. The option “-s” tells htseq-count that you are doing directional RNA-seq and the direction each mate shall be aligned in a pair. The default value is suitable for Illumina platform.
4) ID in htseq-count
I found use “locus_tag” as the ID string for htseq-count makes annotation easier than “gene_id”, at least for bacteria .gff files downloaded from NCBI. This can be set by the “-i locus_tag” option.
In practice I use the following job script for counting expressions, after making sure that the bowtie2 and gff are using the same sequence names. This code count all the gene hits sequentially. You can also parallel the counting for each bam file to save time.
#$ -N htseq-count
#$ -pe smp 4 -l dev -l h_vmem=8G
#$ -o htseq_count.out
#$ -e htseq_count.err
#$ -S /bin/bash
for i in `ls *.bam`;
echo -e "Counting "$bam
samtools view $bam|htseq-count -t gene -i locus_tag -s yes -q - genome_plasmids_combined.gff > $count