cc @peterjc
I came across some GenBank (.gb) files in which the BASE COUNT line (normally a sequence header) is located above the FEATURES table, which breaks parsing of the file. SnapGene and other parsers handle this properly, but Biopython raises an error. Dummy example below:
LOCUS name 136 bp DNA linear UNK 01-JAN-1980
DEFINITION description.
ACCESSION id
VERSION id
KEYWORDS .
SOURCE .
ORGANISM .
.
BASE COUNT 1284 a 1068 c 1078 g 1308 t
FEATURES Location/Qualifiers
protein_bind 1..34
/label="loxP"
protein_bind 35..68
/label="lox66"
protein_bind complement(69..102)
/label="lox66"
protein_bind 69..102
/label="lox71"
protein_bind complement(35..68)
/label="lox71"
protein_bind 103..136
/label="loxP_mutant"
ORIGIN
1 ataacttcgt atattttatt ttatacgaag ttatataact tcgtatattt tattttatac
61 gaacggtata ccgttcgtat attttatttt atacgaagtt attaccgttc gtatatttta
121 ttttatacga acggta
//
As far as I can tell, the value of BASE COUNT is not included anywhere in the parsed SeqRecord, so the line could simply be dropped. I will open a PR for this.
As ever with GenBank files, in an ideal world they would all be properly formatted, but other tools often produce malformed ones.
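Until this is handled in the parser, a possible workaround is to strip the line before parsing. The sketch below is not the change proposed for the PR; `drop_base_count` is a hypothetical helper name, and it assumes the BASE COUNT line starts at the beginning of a line and that its value is not needed downstream.

```python
import io


def drop_base_count(text):
    """Return GenBank text with every line starting with 'BASE COUNT' removed.

    Assumption: the BASE COUNT value is not needed downstream, so it is
    safe to discard wherever it appears in the record.
    """
    kept = [
        line
        for line in text.splitlines(keepends=True)
        if not line.startswith("BASE COUNT")
    ]
    return "".join(kept)


def cleaned_handle(text):
    """Wrap the cleaned text in a file-like object for a parser to read."""
    return io.StringIO(drop_base_count(text))


# Possible usage with Biopython (requires Biopython to be installed):
#   from Bio import SeqIO
#   with open("example.gb") as fh:
#       record = SeqIO.read(cleaned_handle(fh.read()), "genbank")
```

Filtering on the line prefix rather than on position keeps the helper independent of where the tool that wrote the file put the line.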