Tag Archives: alignment

One-line script for checking BAM alignment ratio

It is a common practice to check the percentage of reads aligned/mapped to a reference genome. For indexed BAM files, the following one-line script solves the problem.

 

[biobeat mapping]$ for i in `ls *.bam`; do samtools idxstats $i|awk -v file="$i" '{mapped+=$3; unmapped+=$4} END{ print file, 100*mapped/(mapped+unmapped)}'; done
Study_10_L005.bam 60.8237
Study_11_L001.bam 10.2694
Study_11_L005.bam 10.311
Study_13_L006.bam 97.686
Study_14_L006.bam 23.2649
Study_15_L004.bam 0.86006
Study_16_L004.bam 4.99298
Study_1_L005.bam 0.144142
Study_20_L003.bam 99.6083
Study_21_L003.bam 98.424
Study_22_L003.bam 21.2799
Study_23_L002.bam 2.48101
Study_24_L002.bam 0.0676792
Study_2_L005.bam 5.7271
Study_3_L005.bam 0.360636
Study_9_L001.bam 99.422
Study_9_L005.bam 99.443
Advertisements

Directional RNA-seq— Part 2: Explore SAM flags using samtools

In Directional RNA-seq— Part 1: SAM file flags I showed you what flags in SAM file are important for directional RNA-seq analysis. In this blog, I will show you how to identify flags from SAM file using samtools. 

We will start with how to view a BAM file. BAM file is binary equivalent of SAM file. Compared to the text-based SAM file, BAM file is smaller, can be sorted, and indexed for fast access.

The following command uses the view tool in samtools to show two sam records:

$ samtools view my_sorted.bam |head -2

HISEQ:166:D1M8RACXX:5:1102:6585:48925	99	myref	1	1101M	=	5	105	GTGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCCTGATTCAGGAGAGTTTATGGTCACTTTTNN	@?@DDEFA;DHFH+AFEEEHIDE>F<*?1?FF@GH=DAFGGIGGEGGEHGAA@7D6:?3;C@@C@@CD@:>A:AC5;AA((8<+++(:A>:@C>CCC####	AS:i:-2	XS:i:-2	XN:i:0	XM:i:2	XO:i:0	XG:i:0	NM:i:2	MD:Z:99G0A0	YS:i:0	YT:Z:CP
HISEQ:166:D1M8RACXX:5:2207:3731:104462	99	myref	1	1101M	=	4	104	GTGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCCTGATTCAGGAGAGTTTATGGTCACTTTTNN	@CCDDFADFDHHD<CFGGHHCBHHDHGE>GHIHIIGAHGGGIIIIG=FHDE@DHEDHFCBCCDCDCCCCCCCCCCCDC:>C?8<+:>CC>:>4:ACC>AC?	AS:i:-2	XS:i:-2	XN:i:0	XM:i:2	XO:i:0	XG:i:0	NM:i:2	MD:Z:99G0A0	YS:i:-4	YT:Z:CP

In each sam record there are a few columns separated by tabs. The first column is the read id, which tells you from which location in the flow cell a read is generated. The second column is the SAM flags that we are focusing on. Here, both reads show 99, which indicates these two reads were mate 1s and the template transcript has an orientation that goes from 5′->3′.

To get all the mapped reads come from the 5′->3′ template transcript, we can use the following command:

$ samtools view my_sorted.bam |awk '{if ($2==99||$2=147) print $1, $2}'

We can also use the the filtering option in samtools:

$ samtools view -f 99  EHC_9_D1M8RACXX_GCCAAT_L005__yp_sorted.bam

In the above code, argument -f specifies that we only want to view reads with flag 99. More arguments are available from $ samtools view -?. However, the current filter function does not support boolean operators, so you cannot filter all the 5′->3′ transcripts in one run. You will need to use -f 99 then -f 147. The AWK way seems more convenient.

To split a BAM file to 2 BAM files based on the transcript orientation, we just need to pipe the reads to samtools view and ask it to output bam file. And if your original BAM files have been sorted already, there is no need to sort the new bam files again, since the order have been retained. However, you will need to index them again.

$ samtools view -h my_sorted.bam | awk '{if (NR<7) print $0}; {if($2==99||$2==147) print $0}'|head
@HD	VN:1.0	SO:unsorted
@SQ	SN:my_ref_1	LN:72305
@SQ	SN:my_ref_2	LN:9012
@SQ	SN:my_ref_3	LN:5623228
@SQ	SN:my_ref_4	LN:86410
@PG	ID:bowtie2	PN:bowtie2	VN:2.0.0-beta7
HISEQ:166:D1M8RACXX:1:1308:16355:134016	99	my_ref_1	1	1101M	=	4	104	GTGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCCTGATTCAGGAGAGTTTATGGTCACTTTTGN	8??DBD:?1?D@<EE6@EAEE?+?4?*99D=DD?B?BDD>?9=B9)8)7@A@C;)'''39>AAA>:>5>A>A>A:AA>A>A?10+:3>AA###########	AS:i:-1	XS:i:-1	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:100A0	YS:i:-11	YT:Z:CP
HISEQ:166:D1M8RACXX:1:1106:12252:127620	99	my_ref_1	2	1101M	=	237	336	TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCCTGATTCAGGAGAGTTTATGGTCACTTTTGAN	CC@FFFFFHHHHHIJJJJIIHEHIJJIIIGIGIIJJJJJIJIJJIJJIIJIIGJIHF3;CDEDEDDDDDDDDDDDDDDDDDAB@C>ACDCACDDDDDDC??	AS:i:-1	XS:i:-1	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:100G0	YS:i:-19	YT:Z:CP
HISEQ:166:D1M8RACXX:1:1202:9208:148034	99	my_ref_1	2	1101M	=	236	335	TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCCTGATTCAGGAGAGTTTATGGTCACTTTTGAN	@CBFDDFFHHHHCFGHIJFIFHHGIIEIGEHIIIJJIHEHIGIIJJEIJJIGEGIHF9>ADECEDDDDCCDCDDD@CCC><8<:C>:>A@:>ACCCCCC@@	AS:i:-1	XS:i:-1	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:100G0	YS:i:-20	YT:Z:CP
HISEQ:166:D1M8RACXX:1:1202:19344:148604	99	my_ref_1	2	1101M	=	237	336	TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCCTGATTCAGGAGAGTTTATGGTCACTTTTGAN	?@8?BDD7:CFFD;)AEE<E<AA?ED<FEFFEEFBGFI<FFFEFE==@;@FFEFIFD',;;>@BB>ABBA::5(:>:@B######################	AS:i:-1	XS:i:-1	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:100G0	YS:i:-11	YT:Z:CP

The example above shows the code that we can use to pipe to samtools view to get a BAM file. A few things need to be noted here: 1) The -h arugment in samtools view ask it to print out header, which is needed for samtools view to encode a SAM file to BAM file; 2) In the AWK code “{if (NR<7) print $0}" allows the header lines (6 lines in the example) be printed out instead of being filtered out by the following awk line; 3) In the header, there are 4 lines start with "SN:". Each of these lines specifies a reference that was used in the alignment. Here, I mapped all the reads to one genome and three plasmids. In your case, it may be be different, thus you need to change the NR number accordingly.

With all these explained, it is easy to code the output to a BAM file—a BAM file that only has the reads generated from template transcripts from 5′->3′.

$ samtools view -h my_sorted.bam | awk '{if (NR<7) print $0}; {if ($2==99||$2==147) print $0}'|samtools view -bS - > my_sorted_plus_strand.bam

Here arugments -bS tells samtools view to read in SAM format, and output BAM. Option “-” tells it the input is from stdin—the pipe. Finally “>” tells the program to direct its output, a BAM file in this case, to a file.

In my next post, I will show you how to generate coverage plots with strand information.