Problem Statement
You have the mRNA sequence that results from the transcription of the Homo sapiens Hemoglobin subunit beta gene. Knowing that the 5' and 3' ends of the mRNA are processed post-transcriptionally, you know that the start codon and termination codon lie somewhere inside the sequence. A manual inspection of the mRNA sequence should reveal the locations of the start and stop codons, but to ensure you don't miss anything you decide to write a Python script to analyze the mRNA sequence and find the positions of both codons.
You have the mRNA sequence, in the 5' to 3' direction, in a text file:
From the lecture you know that the canonical start codon is AUG, and you know the 3 stop codons are UAA, UAG, and UGA.
Requirements
We have covered enough Python to accomplish this task. The basic idea is to store the mRNA sequence as a string value, then take advantage of the string's find() function to locate the start codon and the first stop codon. We haven't covered how to read data from a file in Python yet, but you can copy and paste the sequence into a script.
Your script's output should follow the template shown below:
Homo sapiens HBB mRNA:
found at position
# of amino acids in the HBB protein:
Be sure to comment your code meaningfully! This not only helps you to understand your code, it also helps me understand your thought processes, which is important for awarding partial credit when necessary. Commenting is also one of the rubric items, so if you do not comment your code you will lose points.
Data Storage in the Script
How you store the data inside a script is very important. In general, you want to minimize hardcoding data values, especially if they will be used repeatedly. “Hardcoding” means to use the literal data value in your code instead of storing it in a variable. Every place where a data value is hardcoded represents a potential source of error. If that data value has to be changed, and it is hardcoded, every instance of that value in the script must be changed to avoid errors. If you instead store the value in a variable, and use the variable name in the script instead of the data value itself, you only have to change the data value once, where the variable is initialized.
With this in mind, you should store the following initial data at the top of your script, to be used later:
The mRNA sequence should be stored as a single-line string with no whitespace or line feed characters, in a variable named “HBB_CDS” (short for HBB CoDing Sequence). Although Python has a syntax for storing multiline strings, do not use it, since the line feed characters will be included when you perform searches on the sequence.
The codon length, 3, should be stored in a variable named “Codon_length”.
The value of the start codon, “aug”, should be stored in a variable named “Start_codon”.
The 3 stop codons, “uaa”, “uag”, and “uga”, should be stored in a list. You could use 3 separate variables and store each stop codon separately, but using a list only requires 1 variable, and you can use list indexing to retrieve individual values.
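A minimal sketch of what this data block might look like at the top of a script. The sequence shown is a short hypothetical placeholder, not the real HBB mRNA, and the list name Stop_codons is an assumption, since the assignment does not specify a name for the list.

```python
# Hypothetical placeholder, NOT the real HBB mRNA sequence.
# Paste the real sequence here as one unbroken line.
HBB_CDS = "gggaugacccagaaauaa"

Codon_length = 3      # number of bases in one codon
Start_codon = "aug"   # canonical translation start codon

# "Stop_codons" is an assumed name; the assignment leaves the name open.
Stop_codons = ["uaa", "uag", "uga"]

# List indexing retrieves individual stop codons (indexes 0, 1, 2).
print(Stop_codons[0])
print(Stop_codons[2])
```

Because every value is defined once at the top, changing a value later means editing a single line.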
Use of Upper vs. Lower Case
Whether you use upper case or lower case for the sequence data and codons is entirely up to you. Just be sure that you are consistent throughout your script.
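Consistency matters because find() is case-sensitive. The short sketch below illustrates this with a hypothetical mixed-case fragment; the use of the string's lower() method to normalize case is an assumption about one way to stay consistent, not a requirement of the assignment.

```python
# Hypothetical fragment with mixed case, for illustration only.
seq_mixed = "GGGaugACC"

# Searching with the wrong case fails: find() returns -1 when nothing matches.
upper_search = seq_mixed.find("AUG")

# Converting the sequence to lower case first makes the search succeed.
lower_search = seq_mixed.lower().find("aug")

print(upper_search)   # -1
print(lower_search)   # 3
```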
Displaying the mRNA Sequence
The first item in the output is the display of the mRNA sequence itself. On one line you should display the species, Homo sapiens, followed by the abbreviation (“HBB”) of the gene. Below this line the mRNA sequence itself should be displayed at 60 bases per line, which is the same convention used by GenBank. You do not need to include numbering or spaces every 10 bases like GenBank does, however.
Hint: Use a for loop combined with the range function with an increment of 60, and print each line as a slice, or substring, of 60 bases beginning with the current position in the loop.
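The hint above can be sketched as follows. This is only an illustration under stated assumptions: demo_seq is a made-up 160-base string rather than the HBB sequence, and the names Line_width and lines are hypothetical.

```python
# Made-up 160-base sequence, for illustration only.
demo_seq = "acgu" * 40

Line_width = 60   # assumed name for the number of bases per output line

print("Homo sapiens HBB mRNA:")

# "lines" is kept only so the result can be inspected afterwards.
lines = []

# range() steps through the sequence 60 positions at a time: 0, 60, 120, ...
for pos in range(0, len(demo_seq), Line_width):
    # Slice out up to 60 bases starting at the current position.
    # A slice that runs past the end of the string simply stops at the end.
    line = demo_seq[pos:pos + Line_width]
    lines.append(line)
    print(line)
```

The last line is shorter than 60 bases whenever the sequence length is not a multiple of 60, and slicing handles that without any extra checks.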
Finding and Displaying the Position of the Translation Start Codon
The translation start codon will be the first occurrence of the canonical start codon, AUG, as the mRNA is read from left to right. You can use the string's find() function, which we covered in the Module 1 Python lecture, to do this. One important thing to keep in mind is that Python treats strings as 0-based in terms of indexing, meaning the first base in the mRNA is at position 0, not 1. When you display the position of the start codon you must remember to add 1 to the position returned by the find() function, since we read nucleotide sequences as 1-based, with the first base starting at position 1.
Caution: Do not add 1 to the position of the codon when you store it, or you will run the risk of error when you use the position for searches, etc. Only add 1 to the position when you are displaying the codon's position; e.g.:
print("Translation start:", start_codon_pos + 1)
In the example above, the variable start_codon_pos is not changed. Python evaluates the expression start_codon_pos + 1 to a temporary value, passes that value as an argument to the print() function, and discards it once the print() function is done.
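Putting the search and the display together, a minimal sketch might look like this. The sequence is a short hypothetical example, not the HBB mRNA, and the check for -1 is an added safeguard that the assignment does not ask for.

```python
# Short hypothetical sequence, for illustration only.
demo_seq = "gggaugacccagaaauaa"
Start_codon = "aug"

# find() returns the 0-based index of the first match, or -1 if there is none.
start_codon_pos = demo_seq.find(Start_codon)

if start_codon_pos == -1:
    print("No start codon found")
else:
    # Add 1 only for display; start_codon_pos itself stays 0-based.
    print("Translation start:", start_codon_pos + 1)
```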
Finding and Displaying the Position of the Translation Stop Codon
There are 3 possible stop codons, UAA, UAG, and UGA, and any one of these will signal translation to terminate. You can find the stop codon using a similar approach to finding the start codon. There are a few things to bear in mind, however:
Translation begins at the position of the first AUG codon, so the stop codon must come after the start codon.
You don't know beforehand which stop codon will be the first one encountered, so you must check for all 3 of them. Whichever of the 3 stop codons occurs first after the start codon will be the one that terminates protein synthesis.
Translation reads the mRNA as codons, not individual bases, and codons do not overlap each other. Therefore, when you look for the stop codon you must read the sequence 1 codon, or 3 bases, at a time, with the first codon being the one immediately following the start codon. So if you have the following sequence:
the start codon is at position 4 (in a Python string it will be position 3 since strings are 0-based). Reading the sequence 1 codon at a time to find the stop codon would result in the sequence being read as follows:
ggg aug acc cag aaa uaa
The stop codon, UAA, would thus be found at position 16 (index 15 in the Python string).
If you were to read the sequence one base at a time instead of one codon at a time, you would find a stop codon, UGA, at position 5 (index 4 in the Python string). This is incorrect, because that UGA straddles the start codon and the codon that follows it.
Be sure to store the position of the stop codon in a variable so you can display it after you have found it.
Hint: This is another good use of a loop with the range function. The range function should begin at the first codon after the start codon, and use an increment of 3 to read the sequence one codon at a time. Inside the loop use an if-elif-elif block to check for each of the stop codons. The stop codons are stored in a list, so you can use list indexes (0, 1, 2) to access individual stop codons. Once a stop codon is found, use the break statement to terminate the loop immediately. Don't forget to store the position of the stop codon in a variable, since you will need to display it.
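A sketch of the loop described in the hint, run on the codons from the example reading above joined into one string. The names Stop_codons and stop_codon_pos, the -1 "not found" marker, and the output label "Translation stop:" are assumptions, not part of the assignment.

```python
# The codons from the example reading, joined into one string.
demo_seq = "gggaugacccagaaauaa"

Codon_length = 3
Start_codon = "aug"
Stop_codons = ["uaa", "uag", "uga"]   # assumed list name

start_codon_pos = demo_seq.find(Start_codon)

# -1 is used here as a marker meaning "no stop codon found yet".
stop_codon_pos = -1

# Begin at the codon right after the start codon and move 3 bases at a time,
# so the sequence is always read in the same frame as the start codon.
for pos in range(start_codon_pos + Codon_length, len(demo_seq), Codon_length):
    codon = demo_seq[pos:pos + Codon_length]
    if codon == Stop_codons[0]:
        stop_codon_pos = pos
        break
    elif codon == Stop_codons[1]:
        stop_codon_pos = pos
        break
    elif codon == Stop_codons[2]:
        stop_codon_pos = pos
        break

if stop_codon_pos == -1:
    print("No in-frame stop codon found")
else:
    # Add 1 only for display, as with the start codon.
    print("Translation stop:", stop_codon_pos + 1)
```

Stepping by Codon_length is what keeps the search from matching a stop codon that overlaps two neighboring codons.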
Calculating and Displaying the Number of Amino Acids in the HBB Protein
Once you have the positions of both the start and stop codons, you can calculate how many amino acids are encoded by the HBB mRNA. Keep in mind that the difference between the start and stop codon positions gives the length of the coding region in bases, not codons, and that the number of amino acids equals the number of codons from the start codon up to, but not including, the stop codon, since the stop codon does not encode an amino acid.
Hint: The math involved here is pretty straightforward, but Python's division operator, /, always returns a floating point value, even when the result is a whole number. To convert the floating point value to an integer, use Python's int() function:
int_value = int(floating_point_value)
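As a rough sketch of the arithmetic, using hypothetical 0-based positions rather than the real HBB values. It assumes the start codon's amino acid is counted and the stop codon is not; the variable names are illustrative only.

```python
# Hypothetical 0-based positions, for illustration only.
start_codon_pos = 3
stop_codon_pos = 15
Codon_length = 3

# Bases from the first base of the start codon up to, but not including,
# the first base of the stop codon.
coding_bases = stop_codon_pos - start_codon_pos

# The / operator returns a float even when the division is exact.
codon_count = coding_bases / Codon_length

# Convert the float to an integer for display.
num_amino_acids = int(codon_count)

print("# of amino acids in the HBB protein:", num_amino_acids)
```

With these made-up positions, the 12 coding bases correspond to 4 codons and therefore 4 amino acids.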