Reading non-standard GenBank files with Biopython SeqIO

A very convenient tool in Biopython (http://biopython.org) is the SeqIO module, which reads and writes many sequence file formats, including GenBank (.gbk) files. One limitation is that its GenBank parser is strict about the layout of sequence lines: the character at index 9 of every sequence line must be a space, which means the position number has to fit in the first nine columns (lines 923-928 of Scanner.py as distributed with EPD Python 7.3-2):

if len(line) > 9 and line[9:10] != ' ':
    raise ValueError("Sequence line mal-formed, '%s'" % line)
seq_lines.append(line[10:])  # remove spaces later
line = self.handle.readline()
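
To see what this check accepts and rejects, here is a minimal standalone sketch. It does not call Biopython; `check_sequence_line` is a hypothetical helper that only mimics the two lines above, and both sample lines are made up for illustration.

```python
def check_sequence_line(line):
    """Mimic the strict check: the character at index 9 must be a space."""
    if len(line) > 9 and line[9:10] != " ":
        raise ValueError("Sequence line mal-formed, '%s'" % line)
    # Everything from index 10 on is sequence; drop the spacer blanks.
    return line[10:].replace(" ", "")


# Standard layout: position number right-justified in columns 1-9.
standard = "        1 gatcctccat atacaacggt"
# Hypothetical non-standard layout: everything shifted one column right.
shifted = "         1 gatcctccat atacaacggt"

print(check_sequence_line(standard))
try:
    check_sequence_line(shifted)
except ValueError as err:
    print("rejected:", err)
```

The shifted line fails only because the digit `1` lands on index 9, where the parser expects a space.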

This has been noticed by Biopython users and was reported as a bug (#3309) in 2011 (http://biopython.org/pipermail/biopython-dev/2011-October/009270.html). In the bug report, the author (Liam Childs) also proposed a simple fix, shown below (lines 914-928):

line = self.line
idx = line.find('1') + 1 # added to debug #3309
while True:
    if not line:
        raise ValueError("Premature end of file in sequence data")
    line = line.rstrip()
    if not line:
        import warnings
        warnings.warn("Blank line in sequence data")
        line = self.handle.readline()
        continue
    if line=='//':
        break
    if line.find('CONTIG')==0:
        break
    # if len(line) > 9 and line[9:10] != ' ':  # removed to debug #3309
    if len(line) > idx and line[idx:idx + 1] != ' ':  # added to debug #3309
        raise ValueError("Sequence line mal-formed, '%s'" % line)
    seq_lines.append(line[10:]) #remove spaces later
    line = self.handle.readline()
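
The idea behind the fix is that `idx` is derived from the first sequence line: `line.find('1') + 1` is the column right after the first position number, so the space check follows whatever indentation the file uses. Below is a rough standalone sketch of the same loop, written against a plain file-like object instead of Biopython's scanner. The function name `read_sequence` and the sample text are hypothetical. Note one deliberate difference: the patch above still slices with `line[10:]`, whereas this sketch slices at `idx + 1`, which is only my assumption about what a shifted file needs. It also assumes the first line it reads is the sequence line starting at position 1.

```python
import io
import warnings


def read_sequence(handle):
    """Collect sequence text up to '//' or CONTIG, tolerating shifted columns."""
    line = handle.readline()
    # Column right after the first position number ("1") on the first line.
    idx = line.find("1") + 1
    seq_lines = []
    while True:
        if not line:
            raise ValueError("Premature end of file in sequence data")
        line = line.rstrip()
        if not line:
            warnings.warn("Blank line in sequence data")
            line = handle.readline()
            continue
        if line == "//" or line.startswith("CONTIG"):
            break
        if len(line) > idx and line[idx:idx + 1] != " ":
            raise ValueError("Sequence line mal-formed, '%s'" % line)
        # Assumption: slice after the detected separator, not at a fixed column.
        seq_lines.append(line[idx + 1:])
        line = handle.readline()
    return "".join(seq_lines).replace(" ", "")


# Hypothetical non-standard block: position numbers end at column 10, not 9.
shifted_block = (
    "         1 gatcctccat atacaacggt\n"
    "        21 atctccacct\n"
    "//\n"
)
print(read_sequence(io.StringIO(shifted_block)))
```

A standard block goes through the same function unchanged, because `idx` then evaluates to 9 and the slice starts at index 10 as before.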