Tag Archives: directional RNA-seq

Benefits of Paired-end Directional RNA-seq

The most frequently asked questions for my poster presentation at ASM general meeting was the benefits of PE directional RNA-seq over single-end and/or non-directional sequencing. 

PE sequencing has the following advantages over SE:

1) Better mapping results.

With PE reads, you know roughly the range of the distance of two reads. This may help you when you map the reads back to a reference genome. For example, with splicing junction events, two exons maybe joined together. If mate 1 and mate 2 are mapped to a genome with 1kb between them, while you expect the distance between them in the transcript is only about 500bp at maximum, then one intron around 500 bp may be moved. You don’t get such information from SE. Similarly, PE may also solve some non-uniq (secondary) mapping problem, if one of the mates come from a repetitive region and another comes from a non-repetitive region.

2) Improved directional sequencing accuracy

Because PE reads point to opposite directions, we can extract all the read pairs of opposite directions to get better confidence in directional RNA-seq. If, for example, you have some read pairs that point to the same direction, then you can ignore those reads for directional information extraction. With SE, you jut cannot tell when there is such a sequencing error.

Directional RNA-seq— Part 2: Explore SAM flags using samtools

In Directional RNA-seq— Part 1: SAM file flags I showed you what flags in SAM file are important for directional RNA-seq analysis. In this blog, I will show you how to identify flags from SAM file using samtools. 

We will start with how to view a BAM file. BAM file is binary equivalent of SAM file. Compared to the text-based SAM file, BAM file is smaller, can be sorted, and indexed for fast access.

The following command uses the view tool in samtools to show two sam records:

$ samtools view my_sorted.bam |head -2

HISEQ:166:D1M8RACXX:5:1102:6585:48925	99	myref	1	1101M	=	5	105	GTGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCCTGATTCAGGAGAGTTTATGGTCACTTTTNN	@?@DDEFA;DHFH+AFEEEHIDE>F<*?1?FF@GH=DAFGGIGGEGGEHGAA@7D6:?3;C@@C@@CD@:>A:AC5;AA((8<+++(:A>:@C>CCC####	AS:i:-2	XS:i:-2	XN:i:0	XM:i:2	XO:i:0	XG:i:0	NM:i:2	MD:Z:99G0A0	YS:i:0	YT:Z:CP
HISEQ:166:D1M8RACXX:5:2207:3731:104462	99	myref	1	1101M	=	4	104	GTGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCCTGATTCAGGAGAGTTTATGGTCACTTTTNN	@CCDDFADFDHHD<CFGGHHCBHHDHGE>GHIHIIGAHGGGIIIIG=FHDE@DHEDHFCBCCDCDCCCCCCCCCCCDC:>C?8<+:>CC>:>4:ACC>AC?	AS:i:-2	XS:i:-2	XN:i:0	XM:i:2	XO:i:0	XG:i:0	NM:i:2	MD:Z:99G0A0	YS:i:-4	YT:Z:CP

In each sam record there are a few columns separated by tabs. The first column is the read id, which tells you from which location in the flow cell a read is generated. The second column is the SAM flags that we are focusing on. Here, both reads show 99, which indicates these two reads were mate 1s and the template transcript has an orientation that goes from 5′->3′.

To get all the mapped reads come from the 5′->3′ template transcript, we can use the following command:

$ samtools view my_sorted.bam |awk '{if ($2==99||$2=147) print $1, $2}'

We can also use the the filtering option in samtools:

$ samtools view -f 99  EHC_9_D1M8RACXX_GCCAAT_L005__yp_sorted.bam

In the above code, argument -f specifies that we only want to view reads with flag 99. More arguments are available from $ samtools view -?. However, the current filter function does not support boolean operators, so you cannot filter all the 5′->3′ transcripts in one run. You will need to use -f 99 then -f 147. The AWK way seems more convenient.

To split a BAM file to 2 BAM files based on the transcript orientation, we just need to pipe the reads to samtools view and ask it to output bam file. And if your original BAM files have been sorted already, there is no need to sort the new bam files again, since the order have been retained. However, you will need to index them again.

$ samtools view -h my_sorted.bam | awk '{if (NR<7) print $0}; {if($2==99||$2==147) print $0}'|head
@HD	VN:1.0	SO:unsorted
@SQ	SN:my_ref_1	LN:72305
@SQ	SN:my_ref_2	LN:9012
@SQ	SN:my_ref_3	LN:5623228
@SQ	SN:my_ref_4	LN:86410
@PG	ID:bowtie2	PN:bowtie2	VN:2.0.0-beta7
HISEQ:166:D1M8RACXX:1:1308:16355:134016	99	my_ref_1	1	1101M	=	4	104	GTGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCCTGATTCAGGAGAGTTTATGGTCACTTTTGN	8??DBD:?1?D@<EE6@EAEE?+?4?*99D=DD?B?BDD>?9=B9)8)7@A@C;)'''39>AAA>:>5>A>A>A:AA>A>A?10+:3>AA###########	AS:i:-1	XS:i:-1	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:100A0	YS:i:-11	YT:Z:CP
HISEQ:166:D1M8RACXX:1:1106:12252:127620	99	my_ref_1	2	1101M	=	237	336	TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCCTGATTCAGGAGAGTTTATGGTCACTTTTGAN	CC@FFFFFHHHHHIJJJJIIHEHIJJIIIGIGIIJJJJJIJIJJIJJIIJIIGJIHF3;CDEDEDDDDDDDDDDDDDDDDDAB@C>ACDCACDDDDDDC??	AS:i:-1	XS:i:-1	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:100G0	YS:i:-19	YT:Z:CP
HISEQ:166:D1M8RACXX:1:1202:9208:148034	99	my_ref_1	2	1101M	=	236	335	TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCCTGATTCAGGAGAGTTTATGGTCACTTTTGAN	@CBFDDFFHHHHCFGHIJFIFHHGIIEIGEHIIIJJIHEHIGIIJJEIJJIGEGIHF9>ADECEDDDDCCDCDDD@CCC><8<:C>:>A@:>ACCCCCC@@	AS:i:-1	XS:i:-1	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:100G0	YS:i:-20	YT:Z:CP
HISEQ:166:D1M8RACXX:1:1202:19344:148604	99	my_ref_1	2	1101M	=	237	336	TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCCTGATTCAGGAGAGTTTATGGTCACTTTTGAN	?@8?BDD7:CFFD;)AEE<E<AA?ED<FEFFEEFBGFI<FFFEFE==@;@FFEFIFD',;;>@BB>ABBA::5(:>:@B######################	AS:i:-1	XS:i:-1	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:100G0	YS:i:-11	YT:Z:CP

The example above shows the code that we can use to pipe to samtools view to get a BAM file. A few things need to be noted here: 1) The -h arugment in samtools view ask it to print out header, which is needed for samtools view to encode a SAM file to BAM file; 2) In the AWK code “{if (NR<7) print $0}" allows the header lines (6 lines in the example) be printed out instead of being filtered out by the following awk line; 3) In the header, there are 4 lines start with "SN:". Each of these lines specifies a reference that was used in the alignment. Here, I mapped all the reads to one genome and three plasmids. In your case, it may be be different, thus you need to change the NR number accordingly.

With all these explained, it is easy to code the output to a BAM file—a BAM file that only has the reads generated from template transcripts from 5′->3′.

$ samtools view -h my_sorted.bam | awk '{if (NR<7) print $0}; {if ($2==99||$2==147) print $0}'|samtools view -bS - > my_sorted_plus_strand.bam

Here arugments -bS tells samtools view to read in SAM format, and output BAM. Option “-” tells it the input is from stdin—the pipe. Finally “>” tells the program to direct its output, a BAM file in this case, to a file.

In my next post, I will show you how to generate coverage plots with strand information.