Directional RNA-seq— Part 2: Explore SAM flags using samtools

In Directional RNA-seq— Part 1: SAM file flags I showed you what flags in SAM file are important for directional RNA-seq analysis. In this blog, I will show you how to identify flags from SAM file using samtools. 

We will start with how to view a BAM file. BAM file is binary equivalent of SAM file. Compared to the text-based SAM file, BAM file is smaller, can be sorted, and indexed for fast access.

The following command uses the view tool in samtools to show two sam records:

$ samtools view my_sorted.bam |head -2

HISEQ:166:D1M8RACXX:5:1102:6585:48925	99	myref	1	1101M	=	5	105	GTGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCCTGATTCAGGAGAGTTTATGGTCACTTTTNN	@?@DDEFA;DHFH+AFEEEHIDE>F<*?1?FF@GH=DAFGGIGGEGGEHGAA@7D6:?3;C@@C@@CD@:>A:AC5;AA((8<+++(:A>:@C>CCC####	AS:i:-2	XS:i:-2	XN:i:0	XM:i:2	XO:i:0	XG:i:0	NM:i:2	MD:Z:99G0A0	YS:i:0	YT:Z:CP
HISEQ:166:D1M8RACXX:5:2207:3731:104462	99	myref	1	1101M	=	4	104	GTGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCCTGATTCAGGAGAGTTTATGGTCACTTTTNN	@CCDDFADFDHHD<CFGGHHCBHHDHGE>GHIHIIGAHGGGIIIIG=FHDE@DHEDHFCBCCDCDCCCCCCCCCCCDC:>C?8<+:>CC>:>4:ACC>AC?	AS:i:-2	XS:i:-2	XN:i:0	XM:i:2	XO:i:0	XG:i:0	NM:i:2	MD:Z:99G0A0	YS:i:-4	YT:Z:CP

In each sam record there are a few columns separated by tabs. The first column is the read id, which tells you from which location in the flow cell a read is generated. The second column is the SAM flags that we are focusing on. Here, both reads show 99, which indicates these two reads were mate 1s and the template transcript has an orientation that goes from 5′->3′.

To get all the mapped reads come from the 5′->3′ template transcript, we can use the following command:

$ samtools view my_sorted.bam |awk '{if ($2==99||$2=147) print $1, $2}'

We can also use the the filtering option in samtools:

$ samtools view -f 99  EHC_9_D1M8RACXX_GCCAAT_L005__yp_sorted.bam

In the above code, argument -f specifies that we only want to view reads with flag 99. More arguments are available from $ samtools view -?. However, the current filter function does not support boolean operators, so you cannot filter all the 5′->3′ transcripts in one run. You will need to use -f 99 then -f 147. The AWK way seems more convenient.

To split a BAM file to 2 BAM files based on the transcript orientation, we just need to pipe the reads to samtools view and ask it to output bam file. And if your original BAM files have been sorted already, there is no need to sort the new bam files again, since the order have been retained. However, you will need to index them again.

$ samtools view -h my_sorted.bam | awk '{if (NR<7) print $0}; {if($2==99||$2==147) print $0}'|head
@HD	VN:1.0	SO:unsorted
@SQ	SN:my_ref_1	LN:72305
@SQ	SN:my_ref_2	LN:9012
@SQ	SN:my_ref_3	LN:5623228
@SQ	SN:my_ref_4	LN:86410
@PG	ID:bowtie2	PN:bowtie2	VN:2.0.0-beta7
HISEQ:166:D1M8RACXX:1:1308:16355:134016	99	my_ref_1	1	1101M	=	4	104	GTGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCCTGATTCAGGAGAGTTTATGGTCACTTTTGN	8??DBD:?1?D@<EE6@EAEE?+?4?*99D=DD?B?BDD>?9=B9)8)7@A@C;)'''39>AAA>:>5>A>A>A:AA>A>A?10+:3>AA###########	AS:i:-1	XS:i:-1	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:100A0	YS:i:-11	YT:Z:CP
HISEQ:166:D1M8RACXX:1:1106:12252:127620	99	my_ref_1	2	1101M	=	237	336	TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCCTGATTCAGGAGAGTTTATGGTCACTTTTGAN	CC@FFFFFHHHHHIJJJJIIHEHIJJIIIGIGIIJJJJJIJIJJIJJIIJIIGJIHF3;CDEDEDDDDDDDDDDDDDDDDDAB@C>ACDCACDDDDDDC??	AS:i:-1	XS:i:-1	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:100G0	YS:i:-19	YT:Z:CP
HISEQ:166:D1M8RACXX:1:1202:9208:148034	99	my_ref_1	2	1101M	=	236	335	TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCCTGATTCAGGAGAGTTTATGGTCACTTTTGAN	@CBFDDFFHHHHCFGHIJFIFHHGIIEIGEHIIIJJIHEHIGIIJJEIJJIGEGIHF9>ADECEDDDDCCDCDDD@CCC><8<:C>:>A@:>ACCCCCC@@	AS:i:-1	XS:i:-1	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:100G0	YS:i:-20	YT:Z:CP
HISEQ:166:D1M8RACXX:1:1202:19344:148604	99	my_ref_1	2	1101M	=	237	336	TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCCTGATTCAGGAGAGTTTATGGTCACTTTTGAN	?@8?BDD7:CFFD;)AEE<E<AA?ED<FEFFEEFBGFI<FFFEFE==@;@FFEFIFD',;;>@BB>ABBA::5(:>:@B######################	AS:i:-1	XS:i:-1	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:100G0	YS:i:-11	YT:Z:CP

The example above shows the code that we can use to pipe to samtools view to get a BAM file. A few things need to be noted here: 1) The -h arugment in samtools view ask it to print out header, which is needed for samtools view to encode a SAM file to BAM file; 2) In the AWK code “{if (NR<7) print $0}" allows the header lines (6 lines in the example) be printed out instead of being filtered out by the following awk line; 3) In the header, there are 4 lines start with "SN:". Each of these lines specifies a reference that was used in the alignment. Here, I mapped all the reads to one genome and three plasmids. In your case, it may be be different, thus you need to change the NR number accordingly.

With all these explained, it is easy to code the output to a BAM file—a BAM file that only has the reads generated from template transcripts from 5′->3′.

$ samtools view -h my_sorted.bam | awk '{if (NR<7) print $0}; {if ($2==99||$2==147) print $0}'|samtools view -bS - > my_sorted_plus_strand.bam

Here arugments -bS tells samtools view to read in SAM format, and output BAM. Option “-” tells it the input is from stdin—the pipe. Finally “>” tells the program to direct its output, a BAM file in this case, to a file.

In my next post, I will show you how to generate coverage plots with strand information.

Advertisements

4 thoughts on “Directional RNA-seq— Part 2: Explore SAM flags using samtools

    1. biobeat Post author

      Thanks for your comment. Although I have never done it, but one shall be able to split a paired-end BAM file into R1 and R2 files by checking SAM flags and extracting read sequences and qualities. For example, FLAG 0x40 and 0x80 tells whether a sequence comes from R1 or R2. And SEQ and QUAL fields in SAM tell you the segment sequence and Phred-scaled base quality. The first column/field tells you the location where a read was originally generated in a flow cell, e.g. HISEQ:166:D1M8RACXX:5:1102:3731:104462.

      Reply

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s