bowtie2, htseq-count, and reference seqeunces downloaded from NCBI

In a typical alignment-based RNA-seq differential gene expression analysis, a bioinformatician downloads reference data (.fna, .gbk, and .gff files) from NCBI, indexes the .fna files for aligners (e.g. bowtie2, bwa, etc.), uses htseq-count to count reads mapped to a specific gene/exon, do statistical analysis to extract significantly regulated genes, and add annotation from the gbk file to those genes. There are several traps in this process. I list a few here.

1) reference names .fna and .gff files

Below are lines extracted from .fna and .gff files downloaded from NCBI:

$ head -3 NC_003143.fna
>gi|16120353|ref|NC_003143.1| Yersinia pestis CO92 chromosome, complete genome
GATCTTTTTATTTAAACGATCTCTTTATTAGATCTCTTATTAGGATCATGATCCTCTGTGGATAAGTGAT
TATTCACATGGCAGATCATATAATTAAGGAGGATCGTTTGTTGTGAGTGACCGGTGATCGTATTGCGTAT

$ head -8 NC_003143.gff
##gff-version 3
#!gff-spec-version 1.20
#!processor NCBI annotwriter
##sequence-region NC_003143.1 1 4653728
##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=214092
NC_003143.1	RefSeq	region	1	4653728	.	+	.	ID=id0;Dbxref=taxon:214092;Is_circular=true;biovar=Orientalis;gbkey=Src;genome=chromosome;mol_type=genomic DNA;strain=CO92
NC_003143.1	RefSeq	gene	271	711	.	-	.	ID=gene0;Name=YPO0001;Dbxref=GeneID:1172846;gbkey=Gene;locus_tag=YPO0001
NC_003143.1	RefSeq	CDS	271	711	.	-	0	ID=cds0;Name=YP_002345097.1;Parent=gene0;Note=An electron-transfer protein%3B flavodoxin binds one FMN molecule%2C which serves as a redox-active prosthetic group;Dbxref=InterPro:IPR001094,InterPro:IPR008254,UniProtKB%2FTrEMBL:Q7CLF0,Genbank:YP_002345097.1,GeneID:1172846;gbkey=CDS;product=flavodoxin;protein_id=YP_002345097.1;transl_table=11

Notice that in gff file, the reference sequence name is “NC_003143.1” and in .fna file the name is “gi|16120353|ref|NC_003143.1|”. The latter sequence name is also used by bowtie2 when you index that fna file. The discrepancy in sequence names cause error when you run htseq-count, because htseq-count will get the reference sequence names from the SAM file (, which falls back to the bowtie2 index sequence name), and will try to find that same reference names in the gff file, which do not exist. I usually use some code like below to “fix” the gff file.

$sed -i s/^NC_003143.1/gi\|16120353\|ref\|NC_003143.1\|/g YP_combined.gff 

With the above fix, htseq-count may work. But to get the right results, you still need to feed the right argument to this program.

2) arguments in htseq-count

By default, htseq-count uses “exon” as the feature type for counting reads. While this may work perfectly with Ensembl GTF files, it may not always be the right one for your study. For example, in Y. pestis only 97 exons entries are available in the gff file. But there are over 4305 genes (see below).[/p]

$ cut -f3 ~/scratch/db/src/Yersinia_pestis_CO92_uid57621/YP_combined.gff |sort|uniq -c
      4 ###
   4067 CDS
     97 exon
      2 five_prime_UTR
   4305 gene
      4 #!gff-spec-version 1.20
      4 ##gff-version 3
      6 ncRNA
      4 #!processor NCBI annotwriter
      1 promoter
   9583 region
    413 repeat_region
    203 ribosome_entry_site
     19 rRNA
      1 ##sequence-region NC_003131.1 1 70305
      1 ##sequence-region NC_003132.1 1 9612
      1 ##sequence-region NC_003134.1 1 96210
      1 ##sequence-region NC_003143.1 1 4653728
      3 sequence_variant
      4 ##species http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=214092
      1 stem_loop
      1 tmRNA
      1 transcript
     70 tRNA

Thus, it may be safer to use “gene” as the feature name, although the best approach is always check how many “exon” entries are there first.

The above code also shows 19 rRNAs in the reference. In case you are interested in evaluating how successful your rRNA removal step worked, you can use “rRNA” as the feature name to let htseq-count count the rRNA reads.

3) Strand direction in htseq-count

One nice feature of htseq-count is that it takes into account the directions of the read and the reference so that a read from the wrong direction, even if it is mapped to the right place of an exon/gene, it will not be counted as a read from that exon/gene. The option “-s” tells htseq-count that you are doing directional RNA-seq and the direction each mate shall be aligned in a pair. The default value is suitable for Illumina platform.

4) ID in htseq-count

I found use “locus_tag” as the ID string for htseq-count makes annotation easier than “gene_id”, at least for bacteria .gff files downloaded from NCBI. This can be set by the “-i locus_tag” option.

In practice I use the following job script for counting expressions, after making sure that the bowtie2 and gff are using the same sequence names. This code count all the gene hits sequentially. You can also parallel the counting for each bam file to save time.

#!/bin/bash
#$ -N htseq-count
#$ -pe smp 4 -l dev -l h_vmem=8G
#$ -o  htseq_count.out
#$ -e  htseq_count.err
#$ -S /bin/bash

date
for i in `ls *.bam`; 
do
    bam=$PWD/$i;
    count=${bam/.bam/_htseq.count}
    echo -e "Counting "$bam
    samtools view $bam|htseq-count -t gene -i locus_tag -s yes -q - genome_plasmids_combined.gff > $count
done
date

Directional RNA-seq— Part 2: Explore SAM flags using samtools

In Directional RNA-seq— Part 1: SAM file flags I showed you what flags in SAM file are important for directional RNA-seq analysis. In this blog, I will show you how to identify flags from SAM file using samtools. 

We will start with how to view a BAM file. BAM file is binary equivalent of SAM file. Compared to the text-based SAM file, BAM file is smaller, can be sorted, and indexed for fast access.

The following command uses the view tool in samtools to show two sam records:

$ samtools view my_sorted.bam |head -2

HISEQ:166:D1M8RACXX:5:1102:6585:48925	99	myref	1	1101M	=	5	105	GTGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCCTGATTCAGGAGAGTTTATGGTCACTTTTNN	@?@DDEFA;DHFH+AFEEEHIDE>F<*?1?FF@GH=DAFGGIGGEGGEHGAA@7D6:?3;C@@C@@CD@:>A:AC5;AA((8<+++(:A>:@C>CCC####	AS:i:-2	XS:i:-2	XN:i:0	XM:i:2	XO:i:0	XG:i:0	NM:i:2	MD:Z:99G0A0	YS:i:0	YT:Z:CP
HISEQ:166:D1M8RACXX:5:2207:3731:104462	99	myref	1	1101M	=	4	104	GTGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCCTGATTCAGGAGAGTTTATGGTCACTTTTNN	@CCDDFADFDHHD<CFGGHHCBHHDHGE>GHIHIIGAHGGGIIIIG=FHDE@DHEDHFCBCCDCDCCCCCCCCCCCDC:>C?8<+:>CC>:>4:ACC>AC?	AS:i:-2	XS:i:-2	XN:i:0	XM:i:2	XO:i:0	XG:i:0	NM:i:2	MD:Z:99G0A0	YS:i:-4	YT:Z:CP

In each sam record there are a few columns separated by tabs. The first column is the read id, which tells you from which location in the flow cell a read is generated. The second column is the SAM flags that we are focusing on. Here, both reads show 99, which indicates these two reads were mate 1s and the template transcript has an orientation that goes from 5′->3′.

To get all the mapped reads come from the 5′->3′ template transcript, we can use the following command:

$ samtools view my_sorted.bam |awk '{if ($2==99||$2=147) print $1, $2}'

We can also use the the filtering option in samtools:

$ samtools view -f 99  EHC_9_D1M8RACXX_GCCAAT_L005__yp_sorted.bam

In the above code, argument -f specifies that we only want to view reads with flag 99. More arguments are available from $ samtools view -?. However, the current filter function does not support boolean operators, so you cannot filter all the 5′->3′ transcripts in one run. You will need to use -f 99 then -f 147. The AWK way seems more convenient.

To split a BAM file to 2 BAM files based on the transcript orientation, we just need to pipe the reads to samtools view and ask it to output bam file. And if your original BAM files have been sorted already, there is no need to sort the new bam files again, since the order have been retained. However, you will need to index them again.

$ samtools view -h my_sorted.bam | awk '{if (NR<7) print $0}; {if($2==99||$2==147) print $0}'|head
@HD	VN:1.0	SO:unsorted
@SQ	SN:my_ref_1	LN:72305
@SQ	SN:my_ref_2	LN:9012
@SQ	SN:my_ref_3	LN:5623228
@SQ	SN:my_ref_4	LN:86410
@PG	ID:bowtie2	PN:bowtie2	VN:2.0.0-beta7
HISEQ:166:D1M8RACXX:1:1308:16355:134016	99	my_ref_1	1	1101M	=	4	104	GTGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCCTGATTCAGGAGAGTTTATGGTCACTTTTGN	8??DBD:?1?D@<EE6@EAEE?+?4?*99D=DD?B?BDD>?9=B9)8)7@A@C;)'''39>AAA>:>5>A>A>A:AA>A>A?10+:3>AA###########	AS:i:-1	XS:i:-1	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:100A0	YS:i:-11	YT:Z:CP
HISEQ:166:D1M8RACXX:1:1106:12252:127620	99	my_ref_1	2	1101M	=	237	336	TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCCTGATTCAGGAGAGTTTATGGTCACTTTTGAN	CC@FFFFFHHHHHIJJJJIIHEHIJJIIIGIGIIJJJJJIJIJJIJJIIJIIGJIHF3;CDEDEDDDDDDDDDDDDDDDDDAB@C>ACDCACDDDDDDC??	AS:i:-1	XS:i:-1	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:100G0	YS:i:-19	YT:Z:CP
HISEQ:166:D1M8RACXX:1:1202:9208:148034	99	my_ref_1	2	1101M	=	236	335	TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCCTGATTCAGGAGAGTTTATGGTCACTTTTGAN	@CBFDDFFHHHHCFGHIJFIFHHGIIEIGEHIIIJJIHEHIGIIJJEIJJIGEGIHF9>ADECEDDDDCCDCDDD@CCC><8<:C>:>A@:>ACCCCCC@@	AS:i:-1	XS:i:-1	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:100G0	YS:i:-20	YT:Z:CP
HISEQ:166:D1M8RACXX:1:1202:19344:148604	99	my_ref_1	2	1101M	=	237	336	TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCCTGATTCAGGAGAGTTTATGGTCACTTTTGAN	?@8?BDD7:CFFD;)AEE<E<AA?ED<FEFFEEFBGFI<FFFEFE==@;@FFEFIFD',;;>@BB>ABBA::5(:>:@B######################	AS:i:-1	XS:i:-1	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:100G0	YS:i:-11	YT:Z:CP

The example above shows the code that we can use to pipe to samtools view to get a BAM file. A few things need to be noted here: 1) The -h arugment in samtools view ask it to print out header, which is needed for samtools view to encode a SAM file to BAM file; 2) In the AWK code “{if (NR<7) print $0}" allows the header lines (6 lines in the example) be printed out instead of being filtered out by the following awk line; 3) In the header, there are 4 lines start with "SN:". Each of these lines specifies a reference that was used in the alignment. Here, I mapped all the reads to one genome and three plasmids. In your case, it may be be different, thus you need to change the NR number accordingly.

With all these explained, it is easy to code the output to a BAM file—a BAM file that only has the reads generated from template transcripts from 5′->3′.

$ samtools view -h my_sorted.bam | awk '{if (NR<7) print $0}; {if ($2==99||$2==147) print $0}'|samtools view -bS - > my_sorted_plus_strand.bam

Here arugments -bS tells samtools view to read in SAM format, and output BAM. Option “-” tells it the input is from stdin—the pipe. Finally “>” tells the program to direct its output, a BAM file in this case, to a file.

In my next post, I will show you how to generate coverage plots with strand information.

Directional RNA-seq— Part 1: SAM file flags

Unlike traditional RNA-seq protocols, strand information in directional RNA-seq is maintained. The strand information of a transcript can help us to tell whether that transcript come from the coding sequence or a non-coding sequence in the reverse strand. In some cases, two coding sequences may exist in different direction with an overlap region, and directional RNA-seq can also tell us whether a read is coming from which coding sequence.

 

In the discussion below, I use data generated from ScriptSeq library kit for paired-end sequencing on the Illumina platform, where the first mate in the pair (PE-1) has the same direction as the original transcript (5’->3’), and the second mate (PE-2) has a orientation that is reverse to the original transcript. If you use different kit for directional RNA-seq library creation, you’ll need to check the directions of your read accordingly.

Before we discuss about strand information extraction, let’s first look at SAM file format. SAM file is the de facto short read alignment format and is the default format used by most popular alignment software like Bowtie2 and BWA. From SAM file format documentation, we know the second column of a SAM file has the FLAG information. A table lists all the commonly defined flags can be found on page 4 of its documentation. Specifically, the presence of bit 0x10 or 0x20 tell a sequence (or its pair, respectively) maps to the reverse complement of the reference genome. The corresponding decimal codes for 0x10 and 0x20 are 16 and 32.

The table below lists the flags that are important for directional RNA-seq analysis.Image

Let’s taking a closer look at the flags listed in the table. All the four flags (red numbers) can be decomposed to elementary flags in the column named “composition”. Notice that they all share flag value: 1and 2, which indicate these reads are coming from paired-end sequencing (1/0x1), and both mate in a pair were properly aligned (2/0x2). Flag 64 (0x40) tells us a read is from the first mate, and 128 (0x80) the second mate. Flag 16 (0x10) or 32 (0x20) tells us the read or its mate is reversely mapped. Thus, a combination of 64, 32, 2, and 1, which gives a FLAG=99, indicates this read comes from the PE-1, it is aligned to the reference genome. Its mate (PE-2) is aligned reversely to the genome. Remember PE-2 always point to the opposite direction of the original transcript (and has an opposite direction compared to PE-1 in Illumina), we know that the original transcript where PE-1 and PE-2 were generated came from the positive strand (5’->3’). As summarized in the table, flags 99 and 147 come from the positive strand, and flags 83 and 163 come from the negative strand. For directional paired-end sequencing, we only look for these 4 flags.

If we only look at properly paired mates (both mates are mapped and transcript distance is within the maximum expected insertion size, which can be specified in mapping software, -X for Bowtie2), there are always equal numbers of reads with flags 99 and 147. The same holds for 83 and 163.

In my next blog, I will demonstrate how to extract flags from samfiles efficiently from command line.

Write Technical Notes with listings Package in LaTex

LaTex is an ideal tool for writing technical notes. There is even a package, named listings,  to add non-formatted text such as computer source code. This URL shows you how to use it. And, R, Python, C/C++, and Perl are all supported.

To use listing package with the beamer presentation environment of LaTex, one needs to use \begin{frame}[fragile] (note the added [fragile] option) to make it work.

Separating bam files to forward and reverse strand files.

When mapping RNA-seq reads to a reference genome, sequences from both directions in the reference are used. To separate the reads in a bam file map to either the forward or reverse strand specifically, we can use the following trick.

samtools view -F 0x10  -h input.bam | samtools view -bS - > forward.bam
samtools view -f 0x10  -h input.bam | samtools view -bS - > forward.bam

Here 0x10 is a flag to indicate whether the reads map to the reverse complement or not.

Merge pathway name and pathway ID from KEGG database

 If an organism is listed in KEGG database, one can easily get a list of its pathways and map a list of genes to the pathways (see here for an example about how to do it in R/bioconductor).

However, the pathway mapping result only provides the pathway IDs and not pathway names. To merge pathway IDs with pathway names, we only need to add  one more column of pathway names for each pathway ID listed in the same row of the mapping result dataframe. The R code below provides an example.

my_organism.gene_pathway <- mapGeneToPathway(organism) # map each gene to pathway ID.
my_organism.pathway_id_name <- mapPathwayToName(organism) # get the pathway name for each pathway ID.

# adding pathway names to mapped pathway IDs.
for (gene in row.names(my_orgranism.gene_pathway)) { 
  pathway_id <- my_orgranism.gene_pathway[gene, 'pathway_id']
  pathway_id <- strsplit(pathway_id, ";")[[1]]
  
  if (length(pathway_id) == 1) {        # if a gene maps to a single pathway
    pathway_name <- my_orgranism.pathway_id_name[pathway_id, "pathway_name"]
    my_orgranism.gene_pathway[gene,"pathway_name"] = pathway_name
  } else {                              # if a gene mapes to multiple pathways
    for (a_id in pathway_id) {
      if (is.null(my_orgranism.gene_pathway[gene, "pathway_name"])) {
        my_orgranism.gene_pathway[gene, "pathway_name"] = my_orgranism.pathway_id_name[a_id, "pathway_name"]
      } else {
        name <- my_orgranism.gene_pathway[gene, "pathway_name"]
        if (is.na(name)) {
          name <- my_orgranism.pathway_id_name[a_id, "pathway_name"]
        } else {
          name <- paste(name, my_orgranism.pathway_id_name[a_id, "pathway_name"], sep=";")
        }
        my_orgranism.gene_pathway[gene, "pathway_name"] <- name
      }
    }
  }
}