ComputerScienceExpert

Posted 12 May 2017

Measuring DNA Similarity

Homework Problem:

This problem will be comparing and measuring the difference between

DNA sequences. This type of analysis is common in the Computational

Biology field. There are three parts to this assignment. You must provide

functions with the given names and parameters. The names of any other

supporting functions are named at your discretion. The names should be

reflective of the function they perform.

 

Measuring DNA Similarity

DNA is the hereditary material in humans and other species. Almost

every cell in a person’s body has the same DNA. All information in

DNA is stored as a code in four chemical bases: adenine (A), guanine

(G), cytosine (C) and thymine (T). The differences in the order of these

bases provide a means of specifying different information.

In the lectures there have been examples of string representation and use

in C++. In this assignment we will create strings that represent the

possible DNA sequences. DNA sequences use a limited alphabet

(A,C,G,T) within a string. Here we will be implementing a number of

functions that are used to search for substrings that may represent genes

and protein sequences within the DNA.

One of the challenges in computational biology is determining what the

codons in a DNA sequence represent. A codon is a sequence of three

nucleotides that form a unit of genetic code in a DNA or RNA molecule.

For example, given the sequence GGGA, the codon could be a GGG or a

GGA, depending on where the gene begins within the sequence. Clues

about how to interpret a DNA sequence can be found by comparing an

unknown DNA sequence to a known sequence and measuring their

similarity. If sequences are similar, then it can be hypothesized that they

have similar functions and proteins.

Another common DNA analysis function is to compare two sequences to

determine the similarity of newly discovered DNA to the large databases

of known DNA sequences.

 

Hamming distance and similarity between two strings

Hamming distance is one of the most common ways to measure the

similarity between two strings of the same length. Hamming distance is

a position-by-position comparison that counts the number of positions in

which the corresponding characters in the two strings differ. Two

strings with a small Hamming distance are more similar than two strings

with a larger Hamming distance.

Example:

first string = “ACCT”

second string = “ACCG”

ACCT

|||*

ACCG

In this example, there are three matching characters and one mismatch,

so the Hamming distance is one.

The similarity score for two sequences is then calculated as follows:

similarity_score = (string length - Hamming distance) / string length

similarity_score = (4 - 1) / 4 = 3/4 = 0.75

Two sequences with a high similarity score are more similar than two

sequences with a lower similarity score. The two strings are always the

same length when calculating a Hamming distance.
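
The position-by-position count described above can be sketched in C++ as follows. This is a minimal illustration, not a required part of the assignment: the function name hammingDistance and the choice to return -1 for strings of different lengths are assumptions of this sketch.

```cpp
#include <cstddef>
#include <string>

// Counts the positions at which two equal-length strings differ.
// Returns -1 when the lengths differ, because the Hamming distance is
// undefined in that case (the -1 convention is an assumption of this sketch).
int hammingDistance(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) {
        return -1;
    }
    int distance = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i]) {
            ++distance;  // one more mismatching position
        }
    }
    return distance;
}
```

For the example above, hammingDistance("ACCT", "ACCG") gives 1, which leads to the similarity score (4 - 1) / 4 = 0.75.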

 

Assignment Details:

In this assignment, you will search a string looking for exact matches to

a given codon, calculate the Hamming distance between two strings, and

calculate the similarity scores for sample DNA sequences compared to

known DNA sequences.

Here we have provided a small portion of a DNA sequence from a

human, a mouse, and an unknown species. Smaller DNA sequences will

be compared to each of these larger DNA sequences to determine which

has the best match. Each of the DNA sequences can be copied from this

write-up and stored in a variable in your program.

 

Part 1

Your program will ask the user for a codon sequence (three characters).

You will search each of the given species (human, mouse, unknown) to

find all the locations that match that codon. Your program will pass each

of the predefined sequences along with the codon to the

listCodonPositions() function that does not return any value. Repeat

until the codon given is a single character ‘*’.

The output should correspond to the following:

Enter codon:

CCG

Human: 11 61 98 165 179

Mouse:

Unknown: 11 179

Enter codon:

*

void listCodonPositions(string1, string2)

The listCodonPositions()

function will take two arguments that are both strings and print each of

the positions where string1 has an exact match to string2. The function

will print the locations of all exact matches separated by spaces.

 

Part 2

Your program will ask the user for two sequences that will be used in

calculating a Hamming distance. Pass the two sequences to a function

named calcSimilarity that returns a floating-point result (the similarity score

described above). Repeat until sequence 1 is a single character

‘*’.

The output should correspond to the following:

Enter sequence 1:

CCGCCGCCGA

Enter sequence 2:

CCTCCTCCTA

Similarity: 0.7

Enter sequence 1:

*

float calcSimilarity(string1, string2)

The calcSimilarity() function will

take two arguments that are both strings. The function calculates the

Hamming distance and returns the similarity score. This function should

only calculate the similarity if the two strings are the same length,

otherwise return 0.

 

Part 3

Your program will ask the user for a sequence that will be compared to

each of the sequences (Human, Mouse, Unknown) to find the best

matching sequence. The user sequence will be passed to the

compareDNA() function to find the best similarity score within each predefined

sequence. Your program will output the name of the sequence with the

best match. Repeat until the user sequence given is a single character

‘*’.

The output should correspond to the following:

 

Enter user sequence:

GTAGTTTAAA

Human

Enter user sequence:

TTTTAATAT

Mouse

Enter user sequence:

*

float compareDNA(dbSequence, userSequence)

The compareDNA()

function should take two arguments that are both strings. The function

should calculate the similarity score for each substring of the

dbSequence (substring should be same length as userSequence) and

return the best similarity score found across all the possible substrings.

Use the calcSimilarity() function described above.

 

Getting started

Let’s talk about implementation and testing your solutions. We should

approach the design of algorithms or programs from the top down.

Begin by writing the most abstract descriptions first and then repeatedly

writing more detailed descriptions of the complex abstractions. Each of

your abstractions can be its own function. Once you have reached a level

of detail that is well understood, you can begin to implement (write the

code) for those functions. This is best done from the bottom up, meaning

you implement the base functions (those that don’t call other functions)

first and test them until you are satisfied you have the correct code for

that function. Next you can implement the next layer of abstractions that

use the base functions to complete their algorithm. Again, test the new

functions until you are satisfied they perform as required. This bottom-up

method progressively adds functionality until the complete

algorithm is implemented.

Write your listCodonPositions() function and test it by calling it from

main() and passing it two known strings. For example:

string string1 = "test string";

string string2 = "t";

cout << "Test:";

listCodonPositions(string1, string2);

cout << endl;

Using a simple example to verify that the function is working will be

easier than testing it on a longer string. Once you’re confident the

function works, call the

function from main() using the following strings as the first input

parameter to the function:

humanDNA =

"CGCAAATTTGCCGGATTTCCTTTGCTGTTCCTGCATGTAGTTTAAAC

GAGATTGCCAG

CACCGGGTATCATTCACCATTTTTCTTTTCGTTAACTTGCCGTCAGCC

TTTTCTTTGAC

CTCTTCTTTCTGTTCATGTGTATTTGCTGTCTCTTAGCCCAGACTTCC

CGTGTCCTTTC

CACCGGGCCTTTGAGAGGTCACAGGGTCTTGATGCTGTGGTCTTCATC

TGCAGGTGTCT

GACTTCCAGCAACTGCTGGCCTGTGCCAGGGTGCAGCTGAGCACTGGA

GTGGAGTTTTC

CTGTGGAGAGGAGCCATGCCTAGAGTGGGATGGGCCATTGTTCATG"

mouseDNA =

"CGCAATTTTTACTTAATTCTTTTTCTTTTAATTCATATATTTTTAAT

ATGTTTACTAT

TAATGGTTATCATTCACCATTTAACTATTTGTTATTTTGACGTCATTT

TTTTCTATTTC

CTCTTTTTTCAATTCATGTTTATTTTCTGTATTTTTGTTAAGTTTTCA

CAAGTCTAATA

TAATTGTCCTTTGAGAGGTTATTTGGTCTATATTTTTTTTTCTTCATC

TGTATTTTTAT

GATTTCATTTAATTGATTTTCATTGACAGGGTTCTGCTGTGTTCTGGA

TTGTATTTTTC

TTGTGGAGAGGAACTATTTCTTGAGTGGGATGTACCTTTGTTCTTG"

 

unknownDNA =

"CGCATTTTTGCCGGTTTTCCTTTGCTGTTTATTCATTTATTTTAAAC

GATATTTATAT

CATCGGGTTTCATTCACTATTTTTCTTTTCGATAAATTTTTGTCAGCA

TTTTCTTTTAC

CTCTTCTTTCTGTTTATGTTAATTTTCTGTTTCTTAACCCAGTCTTCT

CGATTCTTATC

TACCGGACCTATTATAGGTCACAGGGTCTTGATGCTTTGGTTTTCATC

TGCAAGAGTCT

GACTTCCTGCTAATGCTGTTCTGTGTCAGGGTGCATCTGAGCACTGAT

GTGGAGTTTTC

TTGTGGATATGAGCCATTCATAGTGTGGGATGTGCCATAGTTCATG"

Once you can find subsequences in the larger string, use “ATG” as the

second parameter (ATG is the sequence that can signify the beginning of

a gene in a DNA sequence.)

Now you can write your calcSimilarity() function and test it by calling it

in main(). Pass in two small strings where you can easily calculate the

similarity score by hand to verify that your calculations are correct.

Once you’re confident in the function’s correctness, call compareDNA()

from main() to compare the Human, Mouse, and Unknown DNA strings to

a given test string.
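
Once compareDNA() has produced a score for each of the three sequences, Part 3 still needs to pick the name to print. One way to do that is sketched below. The SpeciesScore struct and the bestSpecies() function are hypothetical names introduced for illustration, and the tie-breaking rule (the first entry listed wins) is an assumption, since the assignment does not say how to handle ties.

```cpp
#include <string>
#include <vector>

// Hypothetical record pairing a sequence name with its best similarity score.
struct SpeciesScore {
    std::string name;
    float score;
};

// Returns the name with the highest score. Ties go to the entry listed
// first, and an empty list yields an empty string (both are assumptions).
std::string bestSpecies(const std::vector<SpeciesScore>& scores) {
    std::string bestName;
    float bestScore = -1.0f;  // below any valid similarity score (0 to 1)
    for (const SpeciesScore& entry : scores) {
        if (entry.score > bestScore) {
            bestScore = entry.score;
            bestName = entry.name;
        }
    }
    return bestName;
}
```

In main(), you could fill the vector with entries such as {"Human", compareDNA(humanDNA, userSequence)} for each of the three sequences and print the returned name.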
