In my plotting lecture I showed how to plot the hydrophobicity profile for the bacteriorhodopsin record gi:66360541. I used hard-coded information; the actual sequence data must be in bacteriorhodopsin.fasta and I manually copied over the secondary structure elements.
I want my program to handle any GenBank record with seconary structure information. That is, I want it to read a GenPept file to get the sequence and SecStr fields then plot the hydrophobicity profile and highlight the helix and sheet regions.
I need a data file so I went to that GenBank like and changed the "Send to" to "File". Here's the record, which I'll save as "bR.gp".
LOCUS 1XJI_A 247 aa linear BCT 23-SEP-2004 DEFINITION Chain A, Bacteriorhodopsin Crystallized In Bicelles At Room Temperature. ACCESSION 1XJI_A VERSION 1XJI_A GI:66360541 DBSOURCE pdb: molecule 1XJI, chain 65, release Sep 23, 2004; deposition: Sep 23, 2004; class: Membrane Protein; source: Mol_id: 1; Organism_scientific: Halobacterium Salinarium; Organism_common: Bacteria; Strain: L33; Exp. method: X-Ray Diffraction. KEYWORDS . SOURCE Halobacterium salinarum ORGANISM Halobacterium salinarum Archaea; Euryarchaeota; Halobacteria; Halobacteriales; Halobacteriaceae; Halobacterium. REFERENCE 1 (residues 1 to 247) AUTHORS Faham,S., Boulting,G.L., Massey,E.A., Yohannan,S., Yang,D. and Bowie,J.U. TITLE Crystallization of bacteriorhodopsin from bicelle formulations at room temperature JOURNAL Protein Sci. 14 (3), 836-840 (2005) PUBMED 15689517 REFERENCE 2 (residues 1 to 247) AUTHORS Faham,S., Boulting,G.L., Massey,E.A., Yohannan,S., Yang,D. and Bowie,J.U. TITLE Direct Submission JOURNAL Submitted (23-SEP-2004) COMMENT Revision History: APR 19 5 Initial Entry. FEATURES Location/Qualifiers source 1..247 /organism="Halobacterium salinarum" /db_xref="taxon:2242" SecStr 9..32 /sec_str_type="helix" /note="helix 1" SecStr 36..61 /sec_str_type="helix" /note="helix 2" SecStr 64..71 /sec_str_type="sheet" /note="strand 1" SecStr 72..79 /sec_str_type="sheet" /note="strand 2" SecStr 82..100 /sec_str_type="helix" /note="helix 3" SecStr 104..126 /sec_str_type="helix" /note="helix 4" SecStr 130..159 /sec_str_type="helix" /note="helix 5" SecStr 164..190 /sec_str_type="helix" /note="helix 6" SecStr 200..224 /sec_str_type="helix" /note="helix 7" Het bond(215) /heterogen="(RET, 301 )" ORIGIN 1 aqitgrpewi wlalgtalmg lgtlyflvkg mgvsdpdakk fyaittlvpa iaftmylsml 61 lgygltmvpf ggeqnpiywa ryadwlfttp lllldlallv dadqgtilal vgadgimigt 121 glvgaltkvy syrfvwwais taamlyilyv lffgftskae smrpevastf kvlrnvtvvl 181 wsaypvvwli gsegagivpl nietllfmvl dvsakvgfgl illrsraifg eaeapepsag 241 dgaaats //
Biopython include a GenBank parser which supports GenPept. The parser is in Bio.GenBank and uses the same style as the Biopython FASTA parser. You need to create the parser first then use the parser to parse the opened input file.
>>> from Bio import GenBank >>> parser = GenBank.RecordParser() >>> record = parser.parse(open("bR.gp")) >>> record <Bio.GenBank.Record.Record instance at 0x13332b0> >>>A GenBank record has many fields and Biopython parsers just about all of them. For interactive use you can get a list of object properties using the dir() function.
>>> dir(record) ['BASE_FEATURE_FORMAT', 'BASE_FORMAT', 'GB_BASE_INDENT', 'GB_FEATURE_INDENT', 'GB_FEATURE_INTERNAL_INDENT', 'GB_INTERNAL_INDENT', 'GB_LINE_LENGTH', 'GB_OTHER_INTERNAL_INDENT', 'GB_SEQUENCE_INDENT', 'INTERNAL_FEATURE_FORMAT', 'INTERNAL_FORMAT', 'OTHER_INTERNAL_FORMAT', 'SEQUENCE_FORMAT', '__doc__', '__init__', '__module__', '__str__', '_accession_line', '_base_count_line', '_comment_line', '_contig_line', '_db_source_line', '_definition_line', '_features_line', '_keywords_line', '_locus_line', '_nid_line', '_organism_line', '_origin_line', '_pid_line', '_segment_line', '_sequence_line', '_source_line', '_version_line', 'accession', 'base_counts', 'comment', 'contig', 'data_file_division', 'date', 'db_source', 'definition', 'features', 'gi', 'keywords', 'locus', 'nid', 'organism', 'origin', 'pid', 'primary', 'references', 'residue_type', 'segment', 'sequence', 'size', 'source', 'taxonomy', 'version'] >>>The inspect module provides other ways to do introspection. (Introspection is the programming term for figuring out what's in a data structure using functions in the programming language instead of by looking at the source code.) Here I'll use it to show all the attributes and methods of a Bio.GenBank.Record.Record.
>>> import inspect >>> for (name, value) in inspect.getmembers(record): ... print repr(name), "==", repr(str(value)[:50]) ... 'BASE_FEATURE_FORMAT' == '%-21s' 'BASE_FORMAT' == '%-12s' 'GB_BASE_INDENT' == '12' 'GB_FEATURE_INDENT' == '21' 'GB_FEATURE_INTERNAL_INDENT' == '5' 'GB_INTERNAL_INDENT' == '2' 'GB_LINE_LENGTH' == '79' 'GB_OTHER_INTERNAL_INDENT' == '3' 'GB_SEQUENCE_INDENT' == '9' 'INTERNAL_FEATURE_FORMAT' == ' %-16s' 'INTERNAL_FORMAT' == ' %-10s' 'OTHER_INTERNAL_FORMAT' == ' %-9s' 'SEQUENCE_FORMAT' == '%9s' '__doc__' == 'Hold GenBank information in a format similar to th' '__init__' == '<bound method Record.__init__ of <Bio.GenBank.Reco' '__module__' == 'Bio.GenBank.Record' '__str__' == '<bound method Record.__str__ of <Bio.GenBank.Recor' '_accession_line' == '<bound method Record._accession_line of <Bio.GenBa' '_base_count_line' == '<bound method Record._base_count_line of <Bio.GenB' '_comment_line' == '<bound method Record._comment_line of <Bio.GenBank' '_contig_line' == '<bound method Record._contig_line of <Bio.GenBank.' '_db_source_line' == '<bound method Record._db_source_line of <Bio.GenBa' '_definition_line' == '<bound method Record._definition_line of <Bio.GenB' '_features_line' == '<bound method Record._features_line of <Bio.GenBan' '_keywords_line' == '<bound method Record._keywords_line of <Bio.GenBan' '_locus_line' == '<bound method Record._locus_line of <Bio.GenBank.R' '_nid_line' == '<bound method Record._nid_line of <Bio.GenBank.Rec' '_organism_line' == '<bound method Record._organism_line of <Bio.GenBan' '_origin_line' == '<bound method Record._origin_line of <Bio.GenBank.' '_pid_line' == '<bound method Record._pid_line of <Bio.GenBank.Rec' '_segment_line' == '<bound method Record._segment_line of <Bio.GenBank' '_sequence_line' == '<bound method Record._sequence_line of <Bio.GenBan' '_source_line' == '<bound method Record._source_line of <Bio.GenBank.' '_version_line' == '<bound method Record._version_line of <Bio.GenBank' 'accession' == "['1XJI_A']" 'base_counts' == '' 'comment' == 'Revision History:\nAPR 19 5 Initial Entry.' 'contig' == '' 'data_file_division' == 'BCT' 'date' == '23-SEP-2004' 'db_source' == 'pdb: molecule 1XJI, chain 65, release Sep 23, 2004' 'definition' == 'Chain A, Bacteriorhodopsin Crystallized In Bicelle' 'features' == '[<Bio.GenBank.Record.Feature instance at 0x13697d8' 'gi' == '66360541' 'keywords' == "['']" 'locus' == '1XJI_A' 'nid' == '' 'organism' == 'Halobacterium salinarum' 'origin' == '' 'pid' == '' 'primary' == '[]' 'references' == '[<Bio.GenBank.Record.Reference instance at 0x133d3' 'residue_type' == 'linear' 'segment' == '' 'sequence' == 'AQITGRPEWIWLALGTALMGLGTLYFLVKGMGVSDPDAKKFYAITTLVPA' 'size' == '247' 'source' == 'Halobacterium salinarum' 'taxonomy' == "['Archaea', 'Euryarchaeota', 'Halobacteria', 'Halo" 'version' == '1XJI_A' >>>
Introspection lets you get an idea of the data structure without needing to track down the source code. The interactive shell supports introspection to let you know the list of available properties. If you're using the shell through the terminal and you have readline support (so that the up- and down-arrows let you scroll through the history) then you can enable tab completion using the rlcompleter module. Here's what it looks like for me; the <<tab>> shows where I pressed the tab key.
>>> record.<<tab>> record.BASE_FEATURE_FORMAT record._pid_line record.BASE_FORMAT record._segment_line record.GB_BASE_INDENT record._sequence_line record.GB_FEATURE_INDENT record._source_line record.GB_FEATURE_INTERNAL_INDENT record._version_line record.GB_INTERNAL_INDENT record.accession record.GB_LINE_LENGTH record.base_counts record.GB_OTHER_INTERNAL_INDENT record.comment record.GB_SEQUENCE_INDENT record.contig record.INTERNAL_FEATURE_FORMAT record.data_file_division record.INTERNAL_FORMAT record.date record.OTHER_INTERNAL_FORMAT record.db_source record.SEQUENCE_FORMAT record.definition record.__class__ record.features record.__doc__ record.gi record.__init__ record.keywords record.__module__ record.locus record.__str__ record.nid record._accession_line record.organism record._base_count_line record.origin record._comment_line record.pid record._contig_line record.primary record._db_source_line record.references record._definition_line record.residue_type record._features_line record.segment record._keywords_line record.sequence record._locus_line record.size record._nid_line record.source record._organism_line record.taxonomy record._origin_line record.version >>>Here are some of the fields in the record
>>> record.sequence 'AQITGRPEWIWLALGTALMGLGTLYFLVKGMGVSDPDAKKFYAITTLVPAIAFTMYLSMLLGYGLTMV PFGGEQNPIYWARYADWLFTTPLLLLDLALLVDADQGTILALVGADGIMIGTGLVGALTKVYSYRFVWW AISTAAMLYILYVLFFGFTSKAESMRPEVASTFKVLRNVTVVLWSAYPVVWLIGSEGAGIVPLNIETLL FMVLDVSAKVGFGLILLRSRAIFGEAEAPEPSAGDGAAATS' >>> record.features [<<tab>> feature.__class__ feature.__init__ feature.__str__ feature.location feature.__doc__ feature.__module__ feature.key feature.qualifiers >>> feature.key 'source' >>> feature = record.features[1] >>> feature.key 'SecStr' >>> feature.location '9..32' >>> feature.qualifiers [<Bio.GenBank.Record.Qualifier instance at 0x13414b8>, <Bio.GenBank. Record.Qualifier instance at 0x13414e0>] >>> feature.qualifiers[0] <Bio.GenBank.Record.Qualifier instance at 0x13414b8> >>> feature.qualifiers[0].key '/sec_str_type=' >>> feature.qualifiers[0].value '"helix"' >>>
In case you didn't catch it in the above, here's how to get the
sequence information and the secondary structure element information.
## print_secstr
# print the secondary structure fields from a GenBank record
from Bio import GenBank
parser = GenBank.RecordParser()
record = parser.parse(open("bR.gp"))
print "sequence", repr(record.sequence)
for feature in record.features:
if feature.key == "SecStr":
locations = feature.location.split("..")
start = int(locations[0])
end = int(locations[1])
sec_str_type = "unknown"
for qualifier in feature.qualifiers:
if qualifier.key == "/sec_str_type=":
# get rid of the quotes around the qualifier
sec_str_type = qualifier.value[1:-1]
print sec_str_type, "starts at", start, "and ends at", end
(I don't like how the qualifier information contains the quotes. I
think the quotes are part of the syntax of the file format and aren't
part of the actual data. Oh well.)
Here's the output
sequence 'AQITGRPEWIWLALGTALMGLGTLYFLVKGMGVSDPDAKKFYAITTLVP AIAFTMYLSMLLGYGLTMVPFGGEQNPIYWARYADWLFTTPLLLLDLALLVDADQGTIL ALVGADGIMIGTGLVGALTKVYSYRFVWWAISTAAMLYILYVLFFGFTSKAESMRPEVA STFKVLRNVTVVLWSAYPVVWLIGSEGAGIVPLNIETLLFMVLDVSAKVGFGLILLRSR AIFGEAEAPEPSAGDGAAATS' helix starts at 9 and ends at 32 helix starts at 36 and ends at 61 sheet starts at 64 and ends at 71 sheet starts at 72 and ends at 79 helix starts at 82 and ends at 100 helix starts at 104 and ends at 126 helix starts at 130 and ends at 159 helix starts at 164 and ends at 190 helix starts at 200 and ends at 224If you look at the data file (which you really should do) then you'll see that the Biopython reader converted the lower-case sequence letters into upper-case.
For the assignment you'll need to know one more Python feature. So far your programs have used a hard-coded filename for input data. I'll want you to take a name from the command-line. These values are available to Python progams using the sys.argv variable. That's has a list of all the parameter strings passed on the command line. The first argument (sys.argv[0]) is special. It's the actual name of the program being run.
Here are some examples to show you what I mean. I'll create a program called print_args.py which prints the list of command-line arguments.
## show_args.py # Print the list of command-line arguments import sys print sys.argvI'll run it a few different ways from the command-line. A couple of times I'll use quoted text to show that that's converted into a single string by the shell.
% python print_args.py ['print_args.py'] % python print_args.py one two three ['print_args.py', 'one', 'two', 'three'] python print_args.py "Who are you?" ['print_args.py', 'Who are you?'] python print_args.py say "Polly want a cracker?" ['print_args.py', 'say', 'Polly want a cracker?'] %
Here's a program which prints the sum of the floating point values passed to it.
# sum the list of values entered on the command-line import sys total = 0.0 for arg in sys.argv[1:]: value = float(arg) total = total + value print "Sum:", totalalong with some examples of it in action.
% python sum.py Sum: 0.0 % python sum.py 1 Sum: 1.0 % python sum.py 23 -8 Sum: 15.0 % python sum.py 9.5 1.2 -3.4 2E3 Sum: 2007.3 %
Finally, here's a program which prints the repr() of the first line of the filename passed to it.
## first_line.py import sys # The length of argv is always at least 1 because sys.argv[0] contains # the name of the program. If there's a command-line argument then it # will have more than 1 value. Use that to check that the filename # was given. if len(sys.argv) == 1: sys.exit("No filename given") # The first element is the name of the program being read. # The second element is the first command-line parameter filename = sys.argv[1] line = open(filename).readline() print repr(line)Some examples
% python first_line.py No filename given % python first_line.py sum.py '# sum the list of values entered on the command-line\n' % python first_line.py br_sequences.fasta '>gi|4376110|emb|CAA49762.1| bacterioopsin [synthetic construct]\n' %
Copyright © 2001-2020 Andrew Dalke Scientific AB