Monthly Archives: January 2013

Scripts for spike-in control statistics

Spike-in RNA-seq controls provides an important QC information. A simple QC use case is to calculate the ratio of the reads from spike-in sequences to that of the total reads in one experiment. The following code uses the idxstat function of samtools to do this.

echo "bam_file Spike-in_reads total_reads spike-in/total_reads"
for i in `ls mapping/*.bam`;
  echo -n $i "";
  samtools idxstats $i|awk '{s+=$3; total+=$3 + $4} END {print s, total, s/total}';

Generate fastq_list file from shell

In practice, I feel it is more convenient to put the absolute path of fastq files in a single text file, and put this text file to some analysis pipeline. The following BASH code shows an example of generate such a fastq_list file that includes all the fastq files has more than 1,000,000 bytes.

for i in *.fastq;
 do size=$(stat -c%s "$i");
 if [ $size -gt 1000000 ]; then 
   echo /home/me/reads/$i;
done> fastq_list.txt

Inflate gzipped fastq files

Fastq files are commonly zipped after they go through the de-multiplexing steps from sequencers.  For analysis programs that do not support those zipped files, bioinformatician needs to inflate those files first. Suppose we only want to inflate a certain zipped fastq files stored on a server directory to a user directory, the following bash script may help.

PROJECT_NAME_LOW=1			# we only work on PROJECT_NAME-1-10

for f in /server_home/PROJECT_NAME/reads_origin/*.gz;
 STEM=$(basename "${f}" .gz)

 index=$(echo $STEM|tr "_" "\n")
 sample=$(echo $index|cut -f 1 -d " ")
 sample_id=${sample[@]#PROJECT_NAME-}		# remove PROJECT_NAME-
 if [ "$sample_id" -ge $PROJECT_NAME_LOW ] && [ "$sample_id" -le $PROJECT_NAME_HIGH ]
     echo Inflating $STEM
     gunzip -c $f > $output


Retrieve short gene names in cummeRbund

CummeRbund is a R package for analyzing cufflink results. By default, it uses XLOC values as the identifiers for each gene. All the gene information is stored in CuffGene instances, including the short name (if available). To get short gene names instead of XLOC values, the following R code is helpful.

>diffGenes <- getGenes(cuff_data, sig_gene_data$gene_id)

>head( featureNames(diffGenes), n=3)

tracking_id gene_short_name
1 XLOC_000186 GRHL3
2 XLOC_000302 NDUFS5
3 XLOC_000374 UQCRH

> head(featureNames(diffGenes)$gene_short_name, n=3)

*** Levels …..

There is some recommended method for newer version of CummeRbund as discussed here, but the idea is similar (use featureNames on CuffGeneSet instance).

GRID job with dependencies

In bioinformatics, it is common submit jobs to grid servers. In many cases, those grid jobs have dependencies on other jobs: one job shall only start after certain jobs have finished. For example, in the Tuxedo workflow for RNA-seq, cufflinks always run after (successful) tophat alignment and cuffdiff always runs after cuffdiff. The -hold_jid option in qsub command help to solve the dependency needs:

$qsub ../scripts/cuffmerge_7_9.qsub 

Your job 737624 (“CM_HCC_7_9”) has been submitted

$qsub -hold_jid 737624 ../scripts/cuffdiff_7_9.qsub

Your job 737625 (“CD_HCC_7_9”) has been submitted

$ q
737624 0.50284 CM_HCC_7_9 me r 01/18/2013 09:53:50 96mem_dev.q@n0204.myclust 8
737625 0.00000 CD_HCC_7_9 me hqw 01/18/2013 09:54:28

Note the “hqw” status in the second job indicates this job is on hold. In case the dependency is on more than one job, a list of jobs can be put in the -hold_jid option.

More information about this option and the grid environment can found from here.

Aquamacs + ESS + remote R session

I have been using Aquamacs as my default editor and the ESS (Emacs Speaks Statistics) as the default environment for all my R projects. Today I came across a nice tip about using Aquamacs and ESS to work on a remote R session. The original post by Matthew Keller can be found from here.

Steps to run Aquamacs and ESS to talk to a remote R session

1. Open a terminal in Aquamacs (M-x shell);

2. In the terminal type: ssh -XC name@server (-XC for data compression and X11 forwarding);

3. Start R in the terminal;

4. M-x ess-remote (let Aquamacs ESS talk to remote R session)

5. options(device=’x11′) (in R command line, but not needed if your server is running Linux).