Tag Archives: RNA-seq

Separating bam files to forward and reverse strand files.

When mapping RNA-seq reads to a reference genome, sequences from both directions in the reference are used. To separate the reads in a bam file map to either the forward or reverse strand specifically, we can use the following trick.

samtools view -F 0x10  -h input.bam | samtools view -bS - > forward.bam
samtools view -f 0x10  -h input.bam | samtools view -bS - > forward.bam

Here 0x10 is a flag to indicate whether the reads map to the reverse complement or not.


Batch trimming Poly-A tails

To assemble transcripts from RNA-seq short reads, one needs to trim the poly-A tails first. Several packages exits for that needs, including trimpoly in the seqclean package, and trimest from the EMBOSS package. I prefer trimest.

Like described in my previous blogs, I like to store a list of fastq files in a text file so that the entire project can be easily wrapped into some workflow. The following bash code does the poly-A trimming for every fastq file in the list.

while read line; do
   name=`basename $line`;
   echo Trimming $name;
   trimest $line $output;
done < trimmed_fastq_list.txt

Scripts for spike-in control statistics

Spike-in RNA-seq controls provides an important QC information. A simple QC use case is to calculate the ratio of the reads from spike-in sequences to that of the total reads in one experiment. The following code uses the idxstat function of samtools to do this.

echo "bam_file Spike-in_reads total_reads spike-in/total_reads"
for i in `ls mapping/*.bam`;
  echo -n $i "";
  samtools idxstats $i|awk '{s+=$3; total+=$3 + $4} END {print s, total, s/total}';

Mapping genes to pathways using biomaRt

One common task in RNA-seq analysis is to map significantly regulated genes into pathways. Pathway information are available from various databases, e.g. reactome, uniprot, etc. The code below shows how to map genes of interest (as reported by cufflinks/cummeRbund) to human pathways in uniprot.

data <- readCufflinks()
gene_list <- genes(data)
gene_diff_data <- diffData(gene_list)
gene_list <- features(gene_list)
gene_list <- subset(gene_list, select=c('gene_id', 'gene_short_name'))
sig_gene_data <- subset(gene_diff_data, (significant=='yes'))
unimart = useMart('unimart', dataset='uniprot')
pathway.data <- data.frame('gene_name', 'go_name')
for (i in sig_gene_data$gene_id) {
 gene_name = subset(gene_list, gene_id == i, select='gene_short_name')
 tmp <- getBM(attributes=c('gene_name','organism', 'go_name'),
 tmp <- subset(tmp, tmp$organism=='Homo sapiens',
 select=c('gene_name', 'go_name'))
 tmp <- tmp[grep('^P:', tmp$go_name),]
 colnames(tmp) <- colnames(pathway.data)
 pathway.data <- rbind(pathway.data, tmp)
write.table(pathway.data, 'sig_gene_pathway.csv', sep=',', row.names=FALSE)