Monthly Archives: March 2013

Separating bam files to forward and reverse strand files.

When mapping RNA-seq reads to a reference genome, sequences from both directions in the reference are used. To separate the reads in a bam file map to either the forward or reverse strand specifically, we can use the following trick.

samtools view -F 0x10  -h input.bam | samtools view -bS - > forward.bam
samtools view -f 0x10  -h input.bam | samtools view -bS - > forward.bam

Here 0x10 is a flag to indicate whether the reads map to the reverse complement or not.

Merge pathway name and pathway ID from KEGG database

 If an organism is listed in KEGG database, one can easily get a list of its pathways and map a list of genes to the pathways (see here for an example about how to do it in R/bioconductor).

However, the pathway mapping result only provides the pathway IDs and not pathway names. To merge pathway IDs with pathway names, we only need to add  one more column of pathway names for each pathway ID listed in the same row of the mapping result dataframe. The R code below provides an example.

my_organism.gene_pathway <- mapGeneToPathway(organism) # map each gene to pathway ID.
my_organism.pathway_id_name <- mapPathwayToName(organism) # get the pathway name for each pathway ID.

# adding pathway names to mapped pathway IDs.
for (gene in row.names(my_orgranism.gene_pathway)) { 
  pathway_id <- my_orgranism.gene_pathway[gene, 'pathway_id']
  pathway_id <- strsplit(pathway_id, ";")[[1]]
  
  if (length(pathway_id) == 1) {        # if a gene maps to a single pathway
    pathway_name <- my_orgranism.pathway_id_name[pathway_id, "pathway_name"]
    my_orgranism.gene_pathway[gene,"pathway_name"] = pathway_name
  } else {                              # if a gene mapes to multiple pathways
    for (a_id in pathway_id) {
      if (is.null(my_orgranism.gene_pathway[gene, "pathway_name"])) {
        my_orgranism.gene_pathway[gene, "pathway_name"] = my_orgranism.pathway_id_name[a_id, "pathway_name"]
      } else {
        name <- my_orgranism.gene_pathway[gene, "pathway_name"]
        if (is.na(name)) {
          name <- my_orgranism.pathway_id_name[a_id, "pathway_name"]
        } else {
          name <- paste(name, my_orgranism.pathway_id_name[a_id, "pathway_name"], sep=";")
        }
        my_orgranism.gene_pathway[gene, "pathway_name"] <- name
      }
    }
  }
}

submit batch GRID jobs using BASH

The GRID engine provides a nice way to manage computing resources and is commonly used in bioinformatics labs. In the following example, I give an example about how to submit a batch of jobs in a BASH script.

#!/bin/bash
template="#!/bin/bash\n
#$ -N SAMPLE_NAME\n
#$ -pe smp 8 -l dev -l h_vmem=8G\n
#$ -o SAMPLE_OUT\n
#$ -e SAMPLE_ERR\n
#$ -cwd\n
#$ -S /bin/bash\n

date\n
/user/bin/sortmerna --I SAMPLE_FASTQ -n 1 --db /db/src/SILVA_version_111.fasta --accept SAMPLE_ACCEPT -a 8 --other SAMPLE_REJECT --log SAMPLE_LOG --paired-in\n
date\n"

while read line;
do 
    starts_with=${line:0:1}
    if [ "$starts_with" != "#" ]; then # skip commented out entries
     output_base=`basename $line`
     output_base=$PWD/$output_base
     output_out=${output_base/".trimmed.fastq"/".out"}
     output_err=${output_base/".trimmed.fastq"/".err"}
     output_accepted_reads=${output_base/".trimmed.fastq"/"_accepted_reads"}
     output_rejected_reads=${output_base/".trimmed.fastq"/"_rejected_reads"}
     output_log=${output_base/".trimmed.fastq"/""}
     

     job=`basename $line`
     job=SORT_${job:0:6}
     job=${template/"SAMPLE_NAME"/$job}
     job=${job/"SAMPLE_OUT"/$output_out}
     job=${job/"SAMPLE_ERR"/$output_err}
     job=${job/"SAMPLE_FASTQ"/$line}
     job=${job/"SAMPLE_ACCEPT"/$output_accepted_reads}
     job=${job/"SAMPLE_REJECT"/$output_rejected_reads}
     job=${job/"SAMPLE_LOG"/$output_log}

     job_file=${output_log/".log"/".qsub"}
     echo -e $job > $job_file
     qsub $job_file
    fi
done

The above shell script generates a job (.qsub) file for each line from the standard input and submits the job using qsub. The program used in the jobs is sortmerna, a rRNA reads filtering tool.

One can run the above code using the following syntax:

[me@bioinfo sortmerna]$ batch_sortmerna_submit.bash < trimmed_fastq_list.txt