Parsing FASTA files
What, again?
FASTA files are a bit more complex than what I talked about a couple weeks ago. That FASTA parser won't handle some FASTA files. It assumes there is a blank line between each record. I made that assumption because it's easier to write a parser when there is a well-defined "end of record" identifier. Real FASTA files don't always have that blank line, as with the following.
>YAL059W-36310.36514 Putative promoter sequence AAATAATATTTGGGGCCCCTCGCGGCTCATTTGTAGTATCTAAGATTATGTATTTTCTTT TATAATATTTGTTGTTATGAAACAGACAGAAGTAAGTTTCTGCGACTATATTATTTTTTT TTTTCTTCTTTTTTTTTCCTTTATTCAACTTGGCGATGAGCTGAAAATTTTTTTGGTTAA GGACCCTTTAGAAGTATTGAATGTG >YAL058W-37154.37469 Putative promoter sequence TTTCATATGAAAGGTCCTAGGAATACACGATTCTTGTACGCATTCTTCTTTTTTCTATCT TCTTTCATTCTTTGTACATTAGATAACATGGTTTTAGCTTAGTTTTATTTTATTTTTTAT ATATCTGGATGTATACTATTATTGAAAAACTTCATTAATAGTTACAACTTTTTCAATATC AAGTTGATTAAGAAAAAGAAAATTATTATGGGTTAGCTGAAAACCGTGTGATGCATGTCG TTTAAGGATTGTGTAAAAAAGTGAACGGCAACGCATTTCTAATATAGATAACGGCCACAC AAAGTAGTACTATGAA >YAL056W-38979.39264 Putative promoter sequence AAGTGTTAGTTTATAACATGGTCTCAATAATTGCACCACAACGGCTTCTCTTTTATAGAT GGTTAACATTATAGTATCAATATTATCATCATGATTAAATGATGATGTATAATACTTACC CGATGTTAAATCTTATTTTTTCATGCAGTAAGTAATCATGCAACAAGAAAAACCCGTAAT TAAGCGAACATAGAACAACTAGCATCCCCGATAAGACGGAATAGAATAGTAAAGATTGTG ATTCATTGGCAGGTCCATTGTCGCATTACTAAATCATAGGCATGGALess often you'll come across FASTA files with extra blank lines between records or at the beginning or end of the file, but I've never seen one with an extra blank line inside of a record. I expect that many parsers won't handle that case correctly.
What happens if I use the old FASTA parser to read these FASTA records? I'll save them to the file "test.fasta" and use fasta_reader.py as a library to read the first record in the file.
>>> import fasta_reader
>>> rec = fasta_reader.read_fasta_record(open("test.fasta"))
>>> rec.title
'YAL059W-36310.36514 Putative promoter sequence'
>>> rec.sequence
'AAATAATATTTGGGGCCCCTCGCGGCTCATTTGTAGTATCTAAGATTATGTATTTTCTTTTATAATATT
TGTTGTTATGAAACAGACAGAAGTAAGTTTCTGCGACTATATTATTTTTTTTTTTCTTCTTTTTTTTTCC
TTTATTCAACTTGGCGATGAGCTGAAAATTTTTTTGGTTAAGGACCCTTTAGAAGTATTGAATGTG>
YAL058W-37154.37469 Putative promoter sequenceTTTCATATGAAAGGTCCTAGGAAT
ACACGATTCTTGTACGCATTCTTCTTTTTTCTATCTTCTTTCATTCTTTGTACATTAGATAACATGGTTT
TAGCTTAGTTTTATTTTATTTTTTATATATCTGGATGTATACTATTATTGAAAAACTTCATTAATAGTTA
CAACTTTTTCAATATCAAGTTGATTAAGAAAAAGAAAATTATTATGGGTTAGCTGAAAACCGTGTGATGC
ATGTCGTTTAAGGATTGTGTAAAAAAGTGAACGGCAACGCATTTCTAATATAGATAACGGCCACACAAAG
TAGTACTATGAA>YAL056W-38979.39264 Putative promoter sequenceAAGTGTTA
GTTTATAACATGGTCTCAATAATTGCACCACAACGGCTTCTCTTTTATAGATGGTTAACATTATAGTATC
AATATTATCATCATGATTAAATGATGATGTATAATACTTACCCGATGTTAAATCTTATTTTTTCATGCAG
TAAGTAATCATGCAACAAGAAAAACCCGTAATTAAGCGAACATAGAACAACTAGCATCCCCGATAAGACG
GAATAGAATAGTAAAGATTGTGATTCATTGGCAGGTCCATTGTCGCATTACTAAATCATAGGCATGGA'
>>>
There is no blank line so the code that reads the sequence lines kept
on reading lines until it reached the end of the file. All of the
rest of the file was interpreted as sequence, so the other two title
files were included in the sequence output.
You know how I keep talking about sanity checking? I should have done a bit more checking to make sure that I didn't use a line starting with a ">" as a sequence line. The change would occur in the following of the existing code
sequence_lines = []
while 1:
line = infile.readline().rstrip()
if line == "":
break
sequence_lines.append(line)
and the code with the change would look like
sequence_lines = []
while 1:
line = infile.readline().rstrip()
if line == "":
break
elif line.startswith(">"):
raise TypeError("Read title line when expecting sequence line")
sequence_lines.append(line)
I'm not going to show the full program with this change because I'm
instead going to handle that case correctly.
Look-ahead
The problem with this more complex FASTA format is that there's no way to identify the end of a record except by reading the start of the next record. If I read a line after the title line it could be sequence data or it could be a blank line (marking the end of the sequence data) or it could be the next record's title line (implicitly marking the end of the sequence data and the start of the next record). In computer science terms, it needs to look-ahead to see if it's at the end of a record.
My fasta_record_reader() function reads a function and leaves the input file positioned ready to read the next record. After it reads a record it's ready to read the next record. If there's no blank line then it will read the first line of the next record, notice it's finished with the original record and return it to the caller. But in this case it's read one line too many. The file pointer will be on the sequence data line after the title line of the next record.
title = infile.readline()
if not title: return None # End-of-file?
data = []
while 1:
line = infile.readline()
if line.isspace():
return FastaRecord(title, "".join(data))
if line.startswith(">"):
# Throw away the contents of the line?
return FastaRecord(title, "".join(data))
data.append(line.rstrip("\n"))
|
♦>YAL059W-36310.36514 Putative promoter sequence GGACCCTTTAGAAGTATTGAATGTG >YAL058W-37154.37469 Putative promoter sequence CGCATTCTTCTTTTTTCTATCT | ||
| title == ">YAL059W-36310.36514 Putative promoter sequence\n" | |||
title = infile.readline()
if not title: return None # End-of-file?
data = []
while 1:
line = infile.readline()
if line.isspace():
return FastaRecord(title, "".join(data))
if line.startswith(">"):
# Throw away the contents of the line?
return FastaRecord(title, "".join(data))
data.append(line.rstrip("\n"))
|
>YAL059W-36310.36514 Putative promoter sequence ♦GGACCCTTTAGAAGTATTGAATGTG >YAL058W-37154.37469 Putative promoter sequence CGCATTCTTCTTTTTTCTATCT | ||
|
| |||
title = infile.readline()
if not title: return None # End-of-file?
data = []
while 1:
line = infile.readline()
if line.isspace():
return FastaRecord(title, "".join(data))
if line.startswith(">"):
# Throw away the contents of the line?
return FastaRecord(title, "".join(data))
data.append(line.rstrip("\n"))
|
>YAL059W-36310.36514 Putative promoter sequence ♦GGACCCTTTAGAAGTATTGAATGTG >YAL058W-37154.37469 Putative promoter sequence CGCATTCTTCTTTTTTCTATCT | ||
| line == "GGACCCTTTAGAAGTATTGAATGT" | |||
title = infile.readline()
if not title: return None # End-of-file?
data = []
while 1:
line = infile.readline()
if line.isspace():
return FastaRecord(title, "".join(data))
if line.startswith(">"):
# Throw away the contents of the line?
return FastaRecord(title, "".join(data))
data.append(line.rstrip("\n"))
|
>YAL059W-36310.36514 Putative promoter sequence GGACCCTTTAGAAGTATTGAATGTG ♦>YAL058W-37154.37469 Putative promoter sequence CGCATTCTTCTTTTTTCTATCT | ||
title = infile.readline()
if not title: return None # End-of-file?
data = []
while 1:
line = infile.readline()
if line.isspace():
return FastaRecord(title, "".join(data))
if line.startswith(">"):
# Throw away the contents of the line?
return FastaRecord(title, "".join(data))
data.append(line.rstrip("\n"))
|
>YAL059W-36310.36514 Putative promoter sequence GGACCCTTTAGAAGTATTGAATGTG ♦>YAL058W-37154.37469 Putative promoter sequence CGCATTCTTCTTTTTTCTATCT | ||
|
| |||
title = infile.readline()
if not title: return None # End-of-file?
data = []
while 1:
line = infile.readline()
if line.isspace():
return FastaRecord(title, "".join(data))
if line.startswith(">"):
# Throw away the contents of the line?
return FastaRecord(title, "".join(data))
data.append(line.rstrip("\n"))
|
>YAL059W-36310.36514 Putative promoter sequence GGACCCTTTAGAAGTATTGAATGTG ♦>YAL058W-37154.37469 Putative promoter sequence CGCATTCTTCTTTTTTCTATCT | ||
| line == ">YAL058W-37154.37469 Putative promoter sequence\n" | |||
title = infile.readline()
if not title: return None # End-of-file?
data = []
while 1:
line = infile.readline()
if line.isspace():
return FastaRecord(title, "".join(data))
if line.startswith(">"):
# Throw away the contents of the line?
return FastaRecord(title, "".join(data))
data.append(line.rstrip("\n"))
|
>YAL059W-36310.36514 Putative promoter sequence GGACCCTTTAGAAGTATTGAATGTG >YAL058W-37154.37469 Putative promoter sequence ♦CGCATTCTTCTTTTTTCTATCT | ||
title = infile.readline()
if not title: return None # End-of-file?
data = []
while 1:
line = infile.readline()
if line.isspace():
return FastaRecord(title, "".join(data))
if line.startswith(">"):
# Throw away the contents of the line?
return FastaRecord(title, "".join(data))
data.append(line.rstrip("\n"))
|
>YAL059W-36310.36514 Putative promoter sequence GGACCCTTTAGAAGTATTGAATGTG >YAL058W-37154.37469 Putative promoter sequence ♦CGCATTCTTCTTTTTTCTATCT | ||
Possible solution #1 - seek
Files support a seek function, which moves the location of the file pointer. Here's how to move backwards in the file:
title = infile.readline()
if not title: return None # End-of-file?
data = []
while 1:
line = infile.readline()
if line.isspace():
return FastaRecord(title, "".join(data))
if line.startswith(">"):
infile.seek(-len(line), 1) # the 1 means "relative to current position"
return FastaRecord(title, "".join(data))
data.append(line.rstrip("\n"))
Python supports many file-like objects but not all are seekable. One example is a network connection
>>> import urllib2 >>> f = urllib2.urlopen("http://www.dalkescientific.com/writings/NBN/data/ls_orchid.fasta") >>> f.read(40) '>gi|2765658|emb|Z78533.1|CIZ78533 C.irap' >>> f.read(40) 'eanum 5.8S rRNA gene and ITS1 and ITS2 D' >>> f.read(40) 'NA\nCGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGAT' >>> f.read(40) 'CATTGATGAGACCGTGGAATAAACGATCGAGTG\nAATCCG' >>> f.seek(-5, 1) Traceback (most recent call last): File "<stdin>", line 1, in ? AttributeError: addinfourl instance has no attribute 'seek' >>>while another is program input from a terminal or unix pipe connected to sys.stdin.
Because of the limitations in different file-like objects it's best to have a solution which works without the ability to seek.
Possible solution #2 - a new file-like object
It's easy to make a file-like object. Actually, it's easy to make a sufficiently file-like object. The FASTA record reader only uses the readline method of the file object. I can implement as a wrapper to an existing file. Here's one which is useful for debugging.
class DebugFile(object):
def __init__(self, infile):
self.infile = infile
def readline(self):
line = self.infile.readline()
print "Read:", repr(line)
return line
>>> f = DebugFile(open("test.fasta"))
>>> s = f.readline()
Read: '>YAL059W-36310.36514 Putative promoter sequence\n'
>>> s
'>YAL059W-36310.36514 Putative promoter sequence\n'
>>>
I'll modify this to have an UndoFile with the new method saveline which takes a string. The string is saved so the next readline returns that string instead of reading the next line of the file. Once returned it's reset to None to mean that no line is saved.
class UndoFile:
def __init__(self, infile):
self.infile = infile
self._saved = None
def readline(self):
if self._saved is not None:
s = self._saved
self._saved = None
return s
return self.infile.readline()
def saveline(self, s):
if self._saved is not None:
raise TypeError("Only one line may be saved")
self._saved = s
Here it is in action on a file-like object which doesn't support seeks.
>>> import urllib2
>>> infile = urllib2.urlopen("http://www.dalkescientific.com/writings/NBN/data/test.fasta")
>>> f = UndoFile(infile)
>>> f.readline()
'>YAL059W-36310.36514 Putative promoter sequence\n'
>>> f.readline()
'AAATAATATTTGGGGCCCCTCGCGGCTCATTTGTAGTATCTAAGATTATGTATTTTCTTT\n'
>>> f.readline()
'TATAATATTTGTTGTTATGAAACAGACAGAAGTAAGTTTCTGCGACTATATTATTTTTTT\n'
>>> f.readline()
'TTTTCTTCTTTTTTTTTCCTTTATTCAACTTGGCGATGAGCTGAAAATTTTTTTGGTTAA\n'
>>> f.readline()
'GGACCCTTTAGAAGTATTGAATGTG\n'
>>> f.readline()
'>YAL058W-37154.37469 Putative promoter sequence\n'
>>> s = _
>>> s
'>YAL058W-37154.37469 Putative promoter sequence\n'
>>> f.saveline(s)
>>> f.readline()
'>YAL058W-37154.37469 Putative promoter sequence\n'
>>>
and here's how to use it to read a FASTA record
def read_fasta_record(infile):
line = infile.readline()
if not line:
return # End-of-file
# Double-check that it's a title line
if not line.startswith(">"):
raise TypeError(
"The title line must start with a '>': %r" % line)
title = line[1:].rstrip("\n")
sequences = []
while 1:
line = infile.readline()
if line.isspace():
break
if line.startswith(">"):
# Read the first line of the next record.
# Save for next time and go with what we've had
infile.saveline(line)
break
sequences.append(line)
return FastaRecord(title, "".join(sequences))
This is a good general-purpose solution. You can use the UndoFile with anything that parses files using readline and you can use it with anything that understands that new "saveline" protocol. Protocol is the Python word which says that there's an agreement on what a given objects must do to be labled "*-like". The more an object implements the file protocol the more it can be used any place that a file can be used.
Some of the Biopython parsers use this approach. The biggest problem is the performance. Every readline() first goes to the UndoHandle (that's the Biopython version of UndoFile) and nearly all then go to actual file object. Function calls in Python are somewhat slow.
There's also the complication in the caller of wrapping the actual file into an undo-able object.
The iterator protocol
The for-loop works on lists, dictionaries, strings, files, generator objects and other data types. The technical term for the things you iterate over is container; you iterator over elements in the container.
You can make your own objects work in a for-loop by implementing the iterator protocol. The for-loop starts by asking the object for its iterator. The object can be a container (like a string or dictionary) or an iterator. If it's an iterator it should return itself. The for-loop gets the iterator by calling the special method __iter__().
The iterator must implement two methods: __iter__() and next(). The first almost certainly returns itself. The second is more interesting. When the for-loop needs the next element from the iterator it gets it by calling next() and using the return value. If there are no more values then next() must raise an exception named StopIteration.
To see that in action, here's an iterator which counts from a given number down to 0.
>>> class Countdown(object): ... def __init__(self, value): ... self.value = value ... def __iter__(self): ... return self ... def next(self): ... if self.value < 0: ... raise StopIteration ... value = self.value ... self.value = value-1 ... return value ... >>> for i in Countdown(5): ... print i ... 5 4 3 2 1 0 >>>In detail here's what the for-loop uses from the container and the iterator.
>>> container = Countdown(5) >>> iter(container) <__main__.Countdown object at 0x554f0> >>> iterobj = iter(container) >>> iterobj.next() 5 >>> iterobj.next() 4 >>> iterobj.next() 3 >>> iterobj.next() 2 >>> iterobj.next() 1 >>> iterobj.next() 0 >>> iterobj.next() Traceback (most recent call last): File "<stdin>", line 1, in ? File "<stdin>", line 8, in next StopIteration >>>
Possible solution #3 - a FASTA record iterator
I could make an iterator over the records in a FASTA file pretty easily if I have a read_fasta_record function which reads a record and has the file handle set up for the next record.
import fasta_reader
class FastaRecordReader(object):
def __init__(self, infile):
self.infile = infile
def __iter__(self):
return self
def next(self):
x = fasta_reader.read_fasta_record(self.infile)
if x is None:
raise StopIteration
return x
>>> for rec in FastaRecordReader(open("simple.fasta")):
... print rec.title
...
first sequence record
second sequence record
third sequence record
>>>
This is effectively identical to the generator solution of
import fasta_reader
def read_fasta_records(infile):
while 1:
rec = fasta_reader.read_fasta_record(infile)
if rec is None:
break
yield rec
What's useful about the FastaRecordReader class solution is that the object provides a place to store information needed to read the next record, so long as there isn't lookahead information which needs to be shared with other parsers which expect a standard file object.
Instead of saving the look-ahead information in its own class, I'll merge it with the record reading code to get a new FastaRecordReader. The interface is unchanged, except that this will correctly handle cases where there is no blank line to end a record.
class FastaRecordReader(object):
def __init__(self, infile):
self.infile = infile
self._saved = None
def __iter__(self):
return self
def next(self):
# The first line could come from the lookahead
if self._saved is not None:
line, self._saved = self._saved, None
else:
line = self.infile.readline()
if not line:
# end-of-file
raise StopIteration
# Double-check that it's a title line
if not line.startswith(">"):
raise TypeError(
"The title line must start with a '>': %r" % line)
title = line[1:].rstrip("\n")
sequences = []
while 1:
line = self.infile.readline()
if line.isspace():
break
if line.startswith(">"):
self._saved = line
break
sequences.append(line)
return FastaRecord(title, "".join(sequences))
Possible solution #4 - a generator
I showed how the first FastaRecordReader class was "effectively identical to the generator implementation of read_fasta_records." The same is true with the improved FastaRecordReader – I can make an improved generator version. The code only changes in a few places; I use the local variable "saved" instead of the instance attribute "self._saved" and I "yield" items in a while loop instead of "return"ing items each time the function is called.
def read_fasta_records(infile):
saved = None
while 1:
# Look for the start of a record
if saved is not None:
line = saved
saved = None
else:
line = infile.readline()
if not line:
return
# skip blank lines
if line.isspace():
continue
# Double-check that it's a title line
if not line.startswith(">"):
raise TypeError(
"The title line must start with a '>': %r" % line)
title = line.rstrip()
# Read the sequence lines until the next record, a blank
# line, or the end of file
sequences = []
while 1:
line = infile.readline()
if not line or line.isspace():
break
if line.startswith(">"):
# The start of the next record
saved = line
break
sequences.append(line.rstrip("\n"))
yield FastaRecord(title, "".join(sequences))
The only major difference is in how they save state. The class approach uses instance variables while the generator uses local variables. The technical term for saving state in local variables is closure. Some people prefer closures over objects and classes.
I've found that the generator code for examples like this are easier than a class-based solution, but mostly because everything is in one function instead of in two methods of a class. In theory anything one one form has a solution in the other form and the translation from one to the other is mechanical. My experience says that the differences are also very minor.
(The more complex translations - though still mechanical - occur when the closure solution yields when inside of several loops. Bioinformatics data files are made of records and they can be processed a record at a time so this doesn't happen frequently. I've only needed the more complex class solution once.)
Which is best?
That depends on your style and what you need it for. I usually prefer the generator function first (easiest for me to use), the look-ahead solution second (very general and easiest to undestand), and the class-based one last.
Copyright © 2001-2008 Dalke Scientific Software, LLC.


