Homework Problem:
This problem compares DNA sequences and measures the differences
between them. This type of analysis is common in computational
biology. There are three parts to this assignment. You must provide
functions with the given names and parameters. You may name any other
supporting functions at your discretion, but each name should reflect
what the function does.
Measuring DNA Similarity
DNA is the hereditary material in humans and other species. Almost
every cell in a person’s body has the same DNA. All information in
DNA is stored as a code in four chemical bases: adenine (A), guanine
(G), cytosine (C) and thymine (T). The differences in the order of these
bases provide a means of specifying different information.
In the lectures there have been examples of string representation and use
in C++. In this assignment we will create strings that represent the
possible DNA sequences. DNA sequences use a limited alphabet
(A,C,G,T) within a string. Here we will be implementing a number of
functions that are used to search for substrings that may represent genes
and protein sequences within the DNA.
One of the challenges in computational biology is determining what the
codons in a DNA sequence represent. A codon is a sequence of three
nucleotides that form a unit of genetic code in a DNA or RNA molecule.
For example, given the sequence GGGA, the codon could be GGG or
GGA, depending on where the gene begins within the sequence. Clues
about how to interpret a DNA sequence can be found by comparing an
unknown DNA sequence to a known sequence and measuring their
similarity. If sequences are similar, then it can be hypothesized that they
have similar functions and proteins.
Another common DNA analysis function is to compare two sequences to
determine the similarity of newly discovered DNA to the large databases
of known DNA sequences.
Hamming distance and similarity between two strings
Hamming distance is one of the most common ways to measure the
similarity between two strings of the same length. Hamming distance is
a position-by-position comparison that counts the number of positions in
which the corresponding characters in the two strings are different. Two
strings with a small Hamming distance are more similar than two strings
with a larger Hamming distance.
Example:
first string = “ACCT”
second string = “ACCG”
ACCT
|||*
ACCG
In this example, there are three matching characters and one mismatch,
so the Hamming distance is one.
The similarity score for two sequences is then calculated as follows:
similarity_score = (string length - Hamming distance) / string length
similarity_score = (4 - 1) / 4 = 3/4 = 0.75
Two sequences with a high similarity score are more similar than two
sequences with a lower similarity score. The two strings are always the
same length when calculating a Hamming distance.
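As a rough illustration of the position-by-position comparison described above, here is a minimal C++ sketch. The helper name hammingDistance and its -1 return value for strings of unequal length are assumptions made for this example; the assignment does not name or require such a helper.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not named in the assignment): counts the positions
// at which two equal-length strings differ.
// Assumption: returns -1 when the lengths differ, because the Hamming
// distance is only defined for strings of the same length.
int hammingDistance(const std::string& first, const std::string& second) {
    if (first.length() != second.length()) {
        return -1;
    }
    int mismatches = 0;
    for (std::size_t i = 0; i < first.length(); ++i) {
        if (first[i] != second[i]) {
            ++mismatches;  // characters at position i differ
        }
    }
    return mismatches;
}
```

With the strings from the example, hammingDistance("ACCT", "ACCG") gives 1, which yields the similarity score (4 - 1) / 4 = 0.75.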
Assignment Details:
In this assignment, you will search a string looking for exact matches to
a given codon, calculate the Hamming distance between two strings, and
calculate the similarity scores for sample DNA sequences compared to
known DNA sequences.
Here we have provided a small portion of a DNA sequence from a
human, a mouse, and an unknown species. Smaller DNA sequences will
be compared to each of these larger DNA sequences to determine which
has the best match. Each of the DNA sequences can be copied from this
writeup and stored in a variable in your program.
Part 1
Your program will ask the user for a codon sequence (three characters).
You will search each of the given species (human, mouse, unknown) to
find all the locations that match that codon. Your program will pass each
of the predefined sequences along with the codon to the
listCodonPositions() function, which does not return a value. Repeat
until the user enters the single character ‘*’ as the codon.
The output should correspond to the following:
Enter codon:
CCG
Human: 11 61 98 165 179
Mouse:
Unknown: 11 179
Enter codon:
*
void listCodonPositions(string1, string2)
The listCodonPositions() function will take two arguments that are both
strings and print each of the positions where string1 has an exact match
to string2. The function will print the locations of all exact matches
separated by spaces.
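One possible way to structure this function is sketched below. It is not the required solution. The helper findCodonPositions is a hypothetical name introduced here so the search can be tested separately from the printing. The sketch also assumes positions are reported starting from 1, which appears to be what the sample output uses (CCG is listed at position 11 in the human sequence), and that overlapping matches are all reported.

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical helper: collects the start position of every exact
// occurrence of codon in sequence, including overlapping occurrences.
// Assumption: positions are 1-based, to agree with the sample output.
std::vector<std::size_t> findCodonPositions(const std::string& sequence,
                                            const std::string& codon) {
    std::vector<std::size_t> positions;
    if (codon.empty()) {
        return positions;  // nothing to search for
    }
    std::size_t found = sequence.find(codon);
    while (found != std::string::npos) {
        positions.push_back(found + 1);  // convert 0-based index to 1-based
        found = sequence.find(codon, found + 1);
    }
    return positions;
}

// Prints every match position followed by a space; returns nothing.
void listCodonPositions(const std::string& sequence, const std::string& codon) {
    for (std::size_t position : findCodonPositions(sequence, codon)) {
        std::cout << position << ' ';
    }
}
```

Under these assumptions, calling listCodonPositions("test string", "t") prints 1 4 7, each followed by a space.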
Part 2
Your program will ask the user for two sequences that will be used in
calculating a Hamming distance. Pass the two sequences to a function
named calcSimilarity that returns a floating-point result (the similarity
score described above). Repeat until sequence 1 is the single character
‘*’.
The output should correspond to the following:
Enter sequence 1:
CCGCCGCCGA
Enter sequence 2:
CCTCCTCCTA
Similarity: 0.7
Enter sequence 1:
*
float calcSimilarity(string1, string2)
The calcSimilarity() function will take two arguments that are both
strings. The function calculates the Hamming distance and returns the
similarity score. This function should only calculate the similarity if
the two strings are the same length; otherwise it should return 0.
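The description above could be implemented along the following lines. This is a sketch, not the definitive solution. Passing the strings by const reference and returning 0 for two empty strings (to avoid dividing by zero) are assumptions that the assignment does not spell out.

```cpp
#include <cstddef>
#include <string>

// Returns (length - Hamming distance) / length for two equal-length strings.
// Returns 0 when the lengths differ, as the assignment requires.
// Assumption: two empty strings also return 0, which avoids a division by zero.
float calcSimilarity(const std::string& first, const std::string& second) {
    if (first.length() != second.length() || first.empty()) {
        return 0.0f;
    }
    std::size_t hamming = 0;
    for (std::size_t i = 0; i < first.length(); ++i) {
        if (first[i] != second[i]) {
            ++hamming;  // count each mismatching position
        }
    }
    const std::size_t length = first.length();
    return static_cast<float>(length - hamming) / static_cast<float>(length);
}
```

For the sample input, CCGCCGCCGA and CCTCCTCCTA differ at three of ten positions, so the function returns 7/10, matching the 0.7 shown in the expected output.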
Part 3
Your program will ask the user for a sequence that will be compared to
each of the sequences (Human, Mouse, Unknown) to find the best
matching sequence. The user sequence will be passed to compareDNA()
to find the best similarity score within each predefined sequence. Your
program will output the name of the sequence with the best match. Repeat
until the user sequence is the single character ‘*’.
The output should correspond to the following:
Enter user sequence:
GTAGTTTAAA
Human
Enter user sequence:
TTTTAATAT
Mouse
Enter user sequence:
*
float compareDNA(dbSequence, userSequence)
The compareDNA() function should take two arguments that are both
strings. The function should calculate the similarity score for each
substring of dbSequence (each substring should be the same length as
userSequence) and return the best similarity score found across all the
possible substrings. Use the calcSimilarity() function described above.
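The sliding-window search described above might look like the sketch below. A compact calcSimilarity() is repeated only so the example compiles on its own. Returning 0 when userSequence is empty or longer than dbSequence is an assumption, since the assignment does not say what should happen in those cases.

```cpp
#include <cstddef>
#include <string>

// Compact version of calcSimilarity(), included so this sketch is
// self-contained: 0 for unequal lengths (or empty input, an assumption).
float calcSimilarity(const std::string& first, const std::string& second) {
    if (first.length() != second.length() || first.empty()) {
        return 0.0f;
    }
    std::size_t matches = 0;
    for (std::size_t i = 0; i < first.length(); ++i) {
        if (first[i] == second[i]) {
            ++matches;
        }
    }
    return static_cast<float>(matches) / static_cast<float>(first.length());
}

// Scores every substring of dbSequence that has the same length as
// userSequence and returns the highest similarity score found.
// Assumption: returns 0 if userSequence is empty or longer than dbSequence.
float compareDNA(const std::string& dbSequence, const std::string& userSequence) {
    const std::size_t window = userSequence.length();
    if (window == 0 || window > dbSequence.length()) {
        return 0.0f;
    }
    float best = 0.0f;
    for (std::size_t start = 0; start + window <= dbSequence.length(); ++start) {
        const float score = calcSimilarity(dbSequence.substr(start, window),
                                           userSequence);
        if (score > best) {
            best = score;
        }
    }
    return best;
}
```

For example, compareDNA("AAACCGTTT", "CCG") returns 1 because the database string contains an exact copy of the user sequence.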
Getting started
Let’s talk about implementation and testing your solutions. We should
approach the design of algorithms or programs from the top down.
Begin by writing the most abstract descriptions first and then repeatedly
writing more detailed descriptions of the complex abstractions. Each of
your abstractions can be its own function. Once you have reached a level
of detail that is well understood, you can begin to implement (write the
code) for those functions. This is best done from the bottom up, meaning
you implement the base functions (those that don’t call other functions)
first and test them until you are satisfied you have the correct code for
that function. Next you can implement the next layer of abstractions that
use the base functions to complete their algorithm. Again, test the new
functions until you are satisfied they perform as required. This
bottom-up method progressively adds functionality until the complete
algorithm is implemented.
Write your listCodonPositions() function and test it by calling it from
main() and passing it two known strings. For example:
string string1 = "test string";
string string2 = "t";
cout << "Test:";
listCodonPositions(string1, string2);
cout << endl;
Using a simple example to verify that the function is working will be
easier than testing it on a longer string. Once you’re confident the
function works, call it from main() using each of the following strings
as the first input parameter. Each sequence is one continuous string;
the line breaks below come only from text wrapping.
humanDNA =
"CGCAAATTTGCCGGATTTCCTTTGCTGTTCCTGCATGTAGTTTAAAC
GAGATTGCCAG
CACCGGGTATCATTCACCATTTTTCTTTTCGTTAACTTGCCGTCAGCC
TTTTCTTTGAC
CTCTTCTTTCTGTTCATGTGTATTTGCTGTCTCTTAGCCCAGACTTCC
CGTGTCCTTTC
CACCGGGCCTTTGAGAGGTCACAGGGTCTTGATGCTGTGGTCTTCATC
TGCAGGTGTCT
GACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCAGCTGAGCACTGGA
GTGGAGTTTTC
CTGTGGAGAGGAGCCATGCCTAGAGTGGGATGGGCCATTGTTCATG"
mouseDNA =
"CGCAATTTTTACTTAATTCTTTTTCTTTTAATTCATATATTTTTAAT
ATGTTTACTAT
TAATGGTTATCATTCACCATTTAACTATTTGTTATTTTGACGTCATTT
TTTTCTATTTC
CTCTTTTTTCAATTCATGTTTATTTTCTGTATTTTTGTTAAGTTTTCA
CAAGTCTAATA
TAATTGTCCTTTGAGAGGTTATTTGGTCTATATTTTTTTTTCTTCATC
TGTATTTTTAT
GATTTCATTTAATTGATTTTCATTGACAGGGTTCTGCTGTGTTCTGGA
TTGTATTTTTC
TTGTGGAGAGGAACTATTTCTTGAGTGGGATGTACCTTTGTTCTTG"
unknownDNA =
"CGCATTTTTGCCGGTTTTCCTTTGCTGTTTATTCATTTATTTTAAAC
GATATTTATAT
CATCGGGTTTCATTCACTATTTTTCTTTTCGATAAATTTTTGTCAGCA
TTTTCTTTTAC
CTCTTCTTTCTGTTTATGTTAATTTTCTGTTTCTTAACCCAGTCTTCT
CGATTCTTATC
TACCGGACCTATTATAGGTCACAGGGTCTTGATGCTTTGGTTTTCATC
TGCAAGAGTCT
GACTTCCTGCTAATGCTGTTCTGTGTCAGGGTGCATCTGAGCACTGAT
GTGGAGTTTTC
TTGTGGATATGAGCCATTCATAGTGTGGGATGTGCCATAGTTCATG"
Once you can find subsequences in the larger string, use “ATG” as the
second parameter (ATG is the sequence that can signify the beginning of
a gene in a DNA sequence.)
Now you can write your calcSimilarity() function and test it by calling it
in main(). Pass in two small strings where you can easily calculate the
similarity score by hand to verify that your calculations are correct.
Once you’re confident in the function’s correctness, call compareDNA()
from main() to compare the Human, Mouse, and Unknown DNA strings to
a given test string.
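Once compareDNA() has produced a best score for each of the three sequences, the program still has to decide which name to print. The sketch below shows one way to do that. The helper name bestMatchName is hypothetical, and the tie-breaking rule (the earlier sequence wins when scores are equal) is an assumption, since the assignment does not say how ties should be resolved.

```cpp
#include <string>

// Hypothetical helper: takes the best scores returned by compareDNA() for
// the human, mouse, and unknown sequences and returns the label to print.
// Assumption: on a tie, the earlier label (Human, then Mouse) is kept.
std::string bestMatchName(float humanScore, float mouseScore, float unknownScore) {
    std::string name = "Human";
    float best = humanScore;
    if (mouseScore > best) {
        name = "Mouse";
        best = mouseScore;
    }
    if (unknownScore > best) {
        name = "Unknown";
        best = unknownScore;
    }
    return name;
}
```

In main(), the three arguments would be compareDNA(humanDNA, userSequence), compareDNA(mouseDNA, userSequence), and compareDNA(unknownDNA, userSequence), and the returned name would be printed on its own line.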