Unlike traditional RNA-seq protocols, strand information in directional RNA-seq is maintained. The strand information of a transcript can help us to tell whether that transcript come from the coding sequence or a non-coding sequence in the reverse strand. In some cases, two coding sequences may exist in different direction with an overlap region, and directional RNA-seq can also tell us whether a read is coming from which coding sequence.
In the discussion below, I use data generated from ScriptSeq library kit for paired-end sequencing on the Illumina platform, where the first mate in the pair (PE-1) has the same direction as the original transcript (5’->3’), and the second mate (PE-2) has a orientation that is reverse to the original transcript. If you use different kit for directional RNA-seq library creation, you’ll need to check the directions of your read accordingly.
Before we discuss about strand information extraction, let’s first look at SAM file format. SAM file is the de facto short read alignment format and is the default format used by most popular alignment software like Bowtie2 and BWA. From SAM file format documentation, we know the second column of a SAM file has the FLAG information. A table lists all the commonly defined flags can be found on page 4 of its documentation. Specifically, the presence of bit 0x10 or 0x20 tell a sequence (or its pair, respectively) maps to the reverse complement of the reference genome. The corresponding decimal codes for 0x10 and 0x20 are 16 and 32.
Let’s taking a closer look at the flags listed in the table. All the four flags (red numbers) can be decomposed to elementary flags in the column named “composition”. Notice that they all share flag value: 1and 2, which indicate these reads are coming from paired-end sequencing (1/0x1), and both mate in a pair were properly aligned (2/0x2). Flag 64 (0x40) tells us a read is from the first mate, and 128 (0x80) the second mate. Flag 16 (0x10) or 32 (0x20) tells us the read or its mate is reversely mapped. Thus, a combination of 64, 32, 2, and 1, which gives a FLAG=99, indicates this read comes from the PE-1, it is aligned to the reference genome. Its mate (PE-2) is aligned reversely to the genome. Remember PE-2 always point to the opposite direction of the original transcript (and has an opposite direction compared to PE-1 in Illumina), we know that the original transcript where PE-1 and PE-2 were generated came from the positive strand (5’->3’). As summarized in the table, flags 99 and 147 come from the positive strand, and flags 83 and 163 come from the negative strand. For directional paired-end sequencing, we only look for these 4 flags.
If we only look at properly paired mates (both mates are mapped and transcript distance is within the maximum expected insertion size, which can be specified in mapping software, -X for Bowtie2), there are always equal numbers of reads with flags 99 and 147. The same holds for 83 and 163.
In my next blog, I will demonstrate how to extract flags from samfiles efficiently from command line.