Monthly Archives: November 2012

samtools tips

To calculate the total number of reads in a bam file:

$samtools view -c mapping.bam
119244250

To calculate only mapped reads:

$samtools view -c -F 4 mapping.bam
103642171

Here “-F 4” is the flag used in the sam format. Similarly we can use “-f 4” to get the number of unmapped reads.

$samtools view -f4 mapping.bam
15602079

The sum of the mapped and unmapped reads equals the total reads:

$echo "103642171 + 15602079" |bc
119244250
Advertisements

Read non-standard GenBank file using BioPython SeqIO

A very convenient tool in Biopython (http://biopython.org) is the SeqIO module, which allows one to read and write multiple file formats, including GenBank (.gbk) files. A limitation of the SeqIO is that it has a strict format requirement, which insists index 9 of a gbk sequence line must be the sequence position (line 923-928 in Scanner.py distributed in EPD Python 7.3-2):

if len(line) > 9 and  line[9:10]!=' ': 
   raise ValueError("Sequence line mal-formed, '%s'" % line)
seq_lines.append(line[10:]) #remove spaces later
line = self.handle.readline()

This has been noticed by biopython users and reported as bug (#3309) in 2011 (http://biopython.org/pipermail/biopython-dev/2011-October/009270.html). In the bug report, the author (Liam Childs) also reported a simple fix, which is the following (line #914-928):

line = self.line
idx = line.find('1') + 1 # added to debug #3309
while True:
    if not line:
        raise ValueError("Premature end of file in sequence data")
    line = line.rstrip()
    if not line:
        import warnings
        warnings.warn("Blank line in sequence data")
        line = self.handle.readline()
        continue
    if line=='//':
        break
    if line.find('CONTIG')==0:
        break
    # if len(line) > 9 and  line[9:10]!=' ': # removed to debug #3309
    if len(line) > idx and  line[idx:idx + 1]!=' ': # added to debug 3309
        raise ValueError("Sequence line mal-formed, '%s'" % line)
    seq_lines.append(line[10:]) #remove spaces later
    line = self.handle.readline()

Compile C++ code with gsl installed in user directory

GSL (http://www.gnu.org/software/gsl/) is a great C/C++ library for scientific computing. I always install a current copy of GSL in the server that I use. However, I don’t have admin access on those servers, and, as a result, I  cannot use ldconfig to add gsl library.  The following trick help me to compile my gsl C++ code.

1). Define environment variable $LD_LIBRARY_PATH (in BASH);

export LD_LIBRARY_PATH=$HOME/ext/lib

, where $HOME/ext/lib is where gsl library are installed and contains “libgsl.so.0” and other files.

2). Compile the C++ code and link them to gsl.

g++ gen_data.cpp -o gen_data -I/home/biobeatext/include -L /home/biobeat/ext/lib -lgsl -lgslcblas

3). Now ./gen_data will execute the code.

Without defining $LD_LIBRARY_PATH in Step 1, the following error message will be displayed when executing gen_data:

./gen_data: error while loading shared libraries: libgsl.so.0: cannot open shared object file: No such file or directory

This is because the shell cannot find libgsl.so.0 in ldconfig path, which can only be set by admin.

Prefix figure numbers in Appendix in LaTex

The following code will use “Figure S[1, 2, ..]” for figures in appendix.

\makeatletter
\renewcommand{\thefigure}{S\@arabic\c@figure}
\makeatletter

If “Supplementary Figure” instead of “Figure” is preferred, one can use the following command:

\renewcommand{\figurename}{Supplementary Figure}

Similarly, to put “S” before equation numbers (e.g. [S1], [S2], instead of [1], [2]), one can use the following command:

\makeatletter

\renewcommand{\theequation}{S\@arabic\c@equation}

\makeatletter