parsing a fasta file using a generator ( python )

I am trying to parse a large fasta file and I am encountering out of memory errors. Some suggestions to improve the data handling would be appreciated. Currently the program correctly prints out the names however partially through the file I get a MemoryError

Here is the generator

def readFastaEntry( fp ):
    name = ""
    seq = ""
    for line in fp:
        if line.startswith( ">" ):
            tmp = []
            tmp.append( name )
            tmp.append( seq )
            name = line
            seq = ""
            yield tmp
        else:
            seq = seq.join( line )

and here is the caller stub more will be added after this part works

fp = open( sys.argv[1], 'r' )

for seq in readFastaEntry( fp ) :
    print seq[0]

For those not fimilar with the fasta format here is an example

>1 (PB2)
AATATATTCAATATGGAGAGAATAAAAGAACTAAGAGATCTAATGTCACAGTCTCGCACTCGCGAGATAC
TCACCAAAACCACTGTGGACCACATGGCCATAATCAAAAAGTACACATCAGGAAGGCAAGAGAAGAACCC
TGCACTCAGGATGAAGTGGATGATG
>2 (PB1)
AACCATTTGAATGGATGTCAATCCGACTTTACTTTTCTTGAAAGTTCCAGCGCAAAATGCCATAAGCACC
ACATTTCCCTATACTGGAGACCCTCC

each entry starts with a “>” stating the name etc then the next N lines are data. There is no defined ending of the data other than the next line having a “>” at the beginning.

Contents hide

Answers:

Method 1

Method 2

Method 3

Method 4

Answers:

Thank you for visiting the Q&A section on Magenaut. Please note that all the answers may not help you solve the issue immediately. So please treat them as advisements. If you found the post helpful (or not), leave a comment & I’ll get back to you as soon as possible.

Method 1

Have you considered using BioPython. They have a sequence reader that can read fasta files. And if you are interested in coding one yourself, you can take a look at BioPython’s code.

Edit: Code added

def read_fasta(fp):
    name, seq = None, []
    for line in fp:
        line = line.rstrip()
        if line.startswith(">"):
            if name: yield (name, ''.join(seq))
            name, seq = line, []
        else:
            seq.append(line)
    if name: yield (name, ''.join(seq))

with open('f.fasta') as fp:
    for name, seq in read_fasta(fp):
        print(name, seq)

Method 2

A pyparsing parser for this format is only a few lines long. See the annotations in the following code:

data = """>1 (PB2) 
AATATATTCAATATGGAGAGAATAAAAGAACTAAGAGATCTAATGTCACAGTCTCGCACTCGCGAGATAC 
TCACCAAAACCACTGTGGACCACATGGCCATAATCAAAAAGTACACATCAGGAAGGCAAGAGAAGAACCC 
TGCACTCAGGATGAAGTGGATGATG 
>2 (PB1) 
AACCATTTGAATGGATGTCAATCCGACTTTACTTTTCTTGAAAGTTCCAGCGCAAAATGCCATAAGCACC 
ACATTTCCCTATACTGGAGACCCTCC"""

from pyparsing import Word, nums, QuotedString, Combine, OneOrMore

# define some basic forms
integer = Word(nums)
key = QuotedString("(", endQuoteChar=")")

# sequences are "words" made up of the characters A, G, C, and T
# we want to match one or more of them, and have the parser combine
# them into a single string (Combine by default requires all of its
# elements to be adjacent within the input string, but we want to allow
# for the intervening end of lines, so we add adjacent=False)
sequence = Combine(OneOrMore(Word("AGCT")), adjacent=False)

# define the overall pattern to scan for - attach results names
# to each matched element
seqEntry = ">" + integer("index") + key("key") + sequence("sequence")

for seq,s,e in seqEntry.scanString(data):
    # just dump out the matched data
    print seq.dump()
    # could also access fields as seq.index, seq.key and seq.sequence

Prints:

['>', '1', 'PB2', 'AATATATTCAATATGGAGAGAATAAAAGAACTAAGAGATCTAATGTCACAGTCTCGCACTCGCGAGATACTCACCAAAACCACTGTGGACCACATGGCCATAATCAAAAAGTACACATCAGGAAGGCAAGAGAAGAACCCTGCACTCAGGATGAAGTGGATGATG']
- index: 1
- key: PB2
- sequence: AATATATTCAATATGGAGAGAATAAAAGAACTAAGAGATCTAATGTCACAGTCTCGCACTCGCGAGATACTCACCAAAACCACTGTGGACCACATGGCCATAATCAAAAAGTACACATCAGGAAGGCAAGAGAAGAACCCTGCACTCAGGATGAAGTGGATGATG
['>', '2', 'PB1', 'AACCATTTGAATGGATGTCAATCCGACTTTACTTTTCTTGAAAGTTCCAGCGCAAAATGCCATAAGCACCACATTTCCCTATACTGGAGACCCTCC']
- index: 2
- key: PB1
- sequence: AACCATTTGAATGGATGTCAATCCGACTTTACTTTTCTTGAAAGTTCCAGCGCAAAATGCCATAAGCACCACATTTCCCTATACTGGAGACCCTCC

Method 3

Without having a great understanding of what you are doing, I would have written the code like this:

def readFastaEntry( fp ):
    name = ""
    while True:
        line = name or f.readline()
        if not line:
            break
        seq = []
        while True:
            name = f.readline()
            if not name or name.startswith(">"):
                break
            else:
                seq.append(name)
        yield (line, "".join(seq))

This gathers up the data after a starting line up to the next starting line. Making seq an array means that you minimize the string joining until the last possible moment. Yielding a tuple makes more sense than a list.

Method 4

def read_fasta(filename):
    name = None
    with open(filename) as file:
        for line in file:
            if line[0] == ">":
                if name:
                    yield (name, seq)
                name = line[1:-1].split("|")[0]
                seq = ""
            else:
                seq += line[:-1]
        yield (name, seq)

All methods was sourced from stackoverflow.com or stackexchange.com, is licensed under cc by-sa 2.5, cc by-sa 3.0 and cc by-sa 4.0

0 0 votes

Article Rating