Category Archives: Shell

One-line script for checking BAM alignment ratio

It is a common practice to check the percentage of reads aligned/mapped to a reference genome. For indexed BAM files, the following one-line script solves the problem.

 

[biobeat mapping]$ for i in *.bam; do samtools idxstats "$i"|awk -v file="$i" '{mapped+=$3; unmapped+=$4} END{ print file, 100*mapped/(mapped+unmapped)}'; done
Study_10_L005.bam 60.8237
Study_11_L001.bam 10.2694
Study_11_L005.bam 10.311
Study_13_L006.bam 97.686
Study_14_L006.bam 23.2649
Study_15_L004.bam 0.86006
Study_16_L004.bam 4.99298
Study_1_L005.bam 0.144142
Study_20_L003.bam 99.6083
Study_21_L003.bam 98.424
Study_22_L003.bam 21.2799
Study_23_L002.bam 2.48101
Study_24_L002.bam 0.0676792
Study_2_L005.bam 5.7271
Study_3_L005.bam 0.360636
Study_9_L001.bam 99.422
Study_9_L005.bam 99.443
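The awk aggregation above can be sanity-checked on mock idxstats output (the reference names and counts below are made up): each row of samtools idxstats is reference name, reference length, mapped reads, unmapped reads.

```shell
# Mock idxstats rows: ref, length, mapped, unmapped (made-up numbers)
printf 'chr1\t1000\t60\t40\nchr2\t2000\t30\t70\n*\t0\t0\t0\n' \
  | awk '{mapped+=$3; unmapped+=$4} END{print 100*mapped/(mapped+unmapped)}'
# 90 mapped out of 200 reads -> prints 45
```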

Submit batch GRID jobs using BASH

The Grid Engine provides a nice way to manage computing resources and is commonly used in bioinformatics labs. The following example shows how to submit a batch of jobs from a BASH script.

#!/bin/bash
template="#!/bin/bash\n
#$ -N SAMPLE_NAME\n
#$ -pe smp 8 -l dev -l h_vmem=8G\n
#$ -o SAMPLE_OUT\n
#$ -e SAMPLE_ERR\n
#$ -cwd\n
#$ -S /bin/bash\n

date\n
/user/bin/sortmerna --I SAMPLE_FASTQ -n 1 --db /db/src/SILVA_version_111.fasta --accept SAMPLE_ACCEPT -a 8 --other SAMPLE_REJECT --log SAMPLE_LOG --paired-in\n
date\n"

while read -r line
do 
    starts_with=${line:0:1}
    if [ "$starts_with" != "#" ]; then # skip commented-out entries
     output_base=`basename $line`
     output_base=$PWD/$output_base
     output_out=${output_base/".trimmed.fastq"/".out"}
     output_err=${output_base/".trimmed.fastq"/".err"}
     output_accepted_reads=${output_base/".trimmed.fastq"/"_accepted_reads"}
     output_rejected_reads=${output_base/".trimmed.fastq"/"_rejected_reads"}
     output_log=${output_base/".trimmed.fastq"/""}
     

     job=`basename $line`
     job=SORT_${job:0:6}
     job=${template/"SAMPLE_NAME"/$job}
     job=${job/"SAMPLE_OUT"/$output_out}
     job=${job/"SAMPLE_ERR"/$output_err}
     job=${job/"SAMPLE_FASTQ"/$line}
     job=${job/"SAMPLE_ACCEPT"/$output_accepted_reads}
     job=${job/"SAMPLE_REJECT"/$output_rejected_reads}
     job=${job/"SAMPLE_LOG"/$output_log}

     # output_log has no .log suffix, so derive the .qsub name from output_base
     job_file=${output_base/".trimmed.fastq"/".qsub"}
     echo -e "$job" > "$job_file"
     qsub "$job_file"
    fi
done

The above shell script generates a job (.qsub) file for each line read from standard input and submits it with qsub. The program run in each job is sortmerna, an rRNA read-filtering tool.

One can run the above code using the following syntax:

[me@bioinfo sortmerna]$ batch_sortmerna_submit.bash < trimmed_fastq_list.txt 
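The core trick in the script is bash pattern substitution on a template string; a minimal sketch with a hypothetical job name and output path:

```shell
# Fill template placeholders with bash pattern substitution
# (SORT_Study1 and /tmp/Study1.out are hypothetical values)
template='#$ -N SAMPLE_NAME\n#$ -o SAMPLE_OUT\n'
out_file=/tmp/Study1.out
job=${template/SAMPLE_NAME/SORT_Study1}   # set the job name
job=${job/SAMPLE_OUT/$out_file}           # set the stdout path
echo -e "$job"                            # \n in the template becomes real newlines
```

Note that `${var/pattern/replacement}` replaces only the first match, which is why each placeholder appears once in the template.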

Download remote files using wget

The Unix/Linux command line tool wget is a very handy tool for downloading files. Like many other powerful command line tools, it is much faster and easier to get the job done with wget than with its GUI alternatives (e.g. FileZilla). For example, the following command downloads/mirrors a remote ftp directory for a user.

wget -m ftp://username:password@my.ftp.org/
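A caveat: putting the password in the URL exposes it in the shell history and process list. wget can instead read credentials from a ~/.netrc file (the host and credentials below are placeholders; the file should be readable only by you, e.g. chmod 600):

```
# ~/.netrc -- placeholder values
machine my.ftp.org
login username
password password
```

With that in place, `wget -m ftp://my.ftp.org/` works without credentials on the command line.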

Batch trimming Poly-A tails

To assemble transcripts from RNA-seq short reads, one needs to trim the poly-A tails first. Several packages exist for that need, including trimpoly from the seqclean package and trimest from the EMBOSS package. I prefer trimest.

As described in my previous posts, I like to store a list of fastq files in a text file so that the entire project can easily be wrapped into a workflow. The following bash code does the poly-A trimming for every fastq file in the list.

while read -r line; do
   name=$(basename "$line");
   echo "Trimming $name";
   output=${name/trimmed.fastq/poly_AT_trimmed.fasta};
   trimest "$line" "$output";
done < trimmed_fastq_list.txt

Scripts for spike-in control statistics

Spike-in RNA-seq controls provide important QC information. A simple QC use case is to calculate the ratio of reads from the spike-in sequences to the total reads in an experiment. The following code uses the idxstats command of samtools to do this.

echo "bam_file Spike-in_reads total_reads spike-in/total_reads"
for i in `ls mapping/*.bam`;
do 
  echo -n $i "";
  samtools idxstats $i|awk '{s+=$3; total+=$3 + $4} END {print s, total, s/total}';
done
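The aggregation can be checked on mock idxstats output (the counts are made up; ERCC-00002 is one example spike-in name, and the `*` row is where samtools reports reads with no alignment):

```shell
# Mock idxstats for a spike-in-only alignment: 500 reads hit the
# ERCC-00002 spike-in, 9500 reads are unmapped (the "*" row)
printf 'ERCC-00002\t1061\t500\t0\n*\t0\t0\t9500\n' \
  | awk '{s+=$3; total+=$3+$4} END {print s, total, s/total}'
# prints: 500 10000 0.05
```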

Generate fastq_list file from shell

In practice, I find it more convenient to put the absolute paths of the fastq files in a single text file and feed that text file to an analysis pipeline. The following BASH code shows an example of generating such a fastq_list file that includes all fastq files larger than 1,000,000 bytes.

for i in *.fastq;
 do size=$(stat -c%s "$i");
 if [ "$size" -gt 1000000 ]; then 
   echo /home/me/reads/$i;
 fi;
done > fastq_list.txt
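The size filter can be tried on throwaway files first (the names are hypothetical; note that `stat -c%s` is the GNU coreutils form, while BSD/macOS stat uses `stat -f%z`):

```shell
# Create one file over and one file under a 1000-byte threshold,
# then list only the large one
dir=$(mktemp -d)
head -c 2000 /dev/zero > "$dir/big.fastq"
head -c 100  /dev/zero > "$dir/small.fastq"
for i in "$dir"/*.fastq; do
  size=$(stat -c%s "$i")
  if [ "$size" -gt 1000 ]; then
    echo "$i"        # only big.fastq passes the threshold
  fi
done
```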

Inflate gzipped fastq files

Fastq files are commonly gzipped after they go through the de-multiplexing step on the sequencer. For analysis programs that do not support zipped files, a bioinformatician needs to inflate those files first. Suppose we only want to inflate certain zipped fastq files from a server directory into a user directory; the following bash script may help.

#!/bin/bash
PROJECT_NAME_LOW=1			# we only work on PROJECT_NAME-1-10
PROJECT_NAME_HIGH=10
OUT_DIR=/users/me/projects/PROJECT_NAME/reads

for f in /server_home/PROJECT_NAME/reads_origin/*.gz;
do
 STEM=$(basename "${f}" .gz)

 # extract the sample id from a name like PROJECT_NAME-7_L001.fastq
 sample=${STEM%%_*}			# keep the text before the first "_"
 sample_id=${sample#PROJECT_NAME-}	# remove PROJECT_NAME-
 
 if [ "$sample_id" -ge $PROJECT_NAME_LOW ] && [ "$sample_id" -le $PROJECT_NAME_HIGH ]
 then
     output=$OUT_DIR/$STEM
     echo Inflating $STEM
     gunzip -c "$f" > "$output"

 fi
done