ComputerScienceExpert

Category > Programming Posted 04 May 2017 My Price 9.00

CS 4390: Introduction to Bioinformatics


Due Date: 02/14/2017

 

You can use any high-level programming language, and you can adopt either a procedural or an object-oriented programming technique. You need to submit the code and a Word document reporting your findings.

 

1. Converting gene exons to protein sequences (20 points)

You are given a simple file (“exons.fa”). This is a FASTA file containing exon sequences. Each exon sequence follows its exon ID line, which always starts with “>”. The exon sequences are written from 5’ to 3’. Implement a program that takes these exon sequences as input and generates all possible protein sequences. Note that you have to take alternative splicing into account, and that at least 2 exons must be joined before the result can be translated to a protein sequence. For example, if a gene has 3 exons (exon1, exon2, and exon3), alternative splicing can generate the following mRNAs:

  • exon1 exon2
  • exon1 exon3
  • exon2 exon3
  • exon1 exon2 exon3

Note that transcription is directional, so exon order is preserved in any spliced mRNA: exon1 must come before exon2 and exon3, and exon2 must come before exon3.
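The enumeration above can be sketched in Python as follows. This is only one possible approach: the function name spliced_transcripts and the toy exon sequences are made up for illustration and are not taken from exons.fa. It relies on itertools.combinations, which keeps items in their input order, so the 5’ to 3’ exon order is never violated.

```python
from itertools import combinations

def spliced_transcripts(exons, min_exons=2):
    """Return every order-preserving exon combination with at least min_exons exons.

    exons is a list of (exon_id, sequence) pairs given in 5' to 3' order.
    """
    transcripts = []
    for size in range(min_exons, len(exons) + 1):
        # combinations() keeps the input order, so exon1 never follows exon2.
        for combo in combinations(exons, size):
            ids = [exon_id for exon_id, _ in combo]
            sequence = "".join(seq for _, seq in combo)
            transcripts.append((ids, sequence))
    return transcripts

# Hypothetical toy exons for illustration only.
example = [("exon1", "ATGGC"), ("exon2", "CTTA"), ("exon3", "GTAA")]
for ids, sequence in spliced_transcripts(example):
    print(" ".join(ids), sequence)
```

For 3 exons this yields the four combinations listed above; for k exons it yields 2^k - k - 1 transcripts, which grows quickly as k increases.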

Use the “codon to amino acid” conversion code given on page 67 of the book. You can treat ‘U’ as ‘T’. Remember that protein translation runs from a start codon to a stop codon.
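A minimal translation sketch in Python is shown below. It assumes the table on page 67 is the standard genetic code, which is what CODON_TABLE encodes here; check it against the book before relying on it. It also assumes translation starts at the first ATG in the sequence and that a sequence with no start codon, or no in-frame stop codon, yields no protein. The function name translate is illustrative.

```python
from itertools import product

# Standard genetic code, with codons ordered by base in T, C, A, G order.
# Assumption: this matches the table in the book; "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(mrna):
    """Translate from the first ATG to the first in-frame stop codon.

    Returns the protein string, or None if there is no start codon,
    no in-frame stop codon, or an unrecognized codon (for example one with N).
    """
    seq = mrna.upper().replace("U", "T")  # treat U as T
    start = seq.find("ATG")
    if start == -1:
        return None
    protein = []
    for i in range(start, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[i:i + 3])
        if amino_acid is None:
            return None
        if amino_acid == "*":
            return "".join(protein)
        protein.append(amino_acid)
    return None  # reached the end without a stop codon

print(translate("CCATGGCCTTATAAGG"))
```

Whether to also try later start codons or report truncated proteins is a design choice the assignment leaves open; this sketch takes the strictest reading.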

Report all the possible protein sequences.

 

 

2. Counting l-mer frequencies (20 points)

Implement a program that counts the number of occurrences of each l-mer in a string of length n. Run it on the Ebola virus genome to build the l-mer frequency table, using l = 4.
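One way to sketch the counting step in Python is a sliding window over the string, which visits n - l + 1 overlapping l-mers. The helper names fasta_sequence and count_lmers are illustrative, and the FASTA lines in the example are invented rather than taken from the Ebola genome file.

```python
from collections import Counter

def fasta_sequence(lines):
    """Concatenate the sequence lines of a single-record FASTA file."""
    return "".join(
        line.strip() for line in lines if line.strip() and not line.startswith(">")
    )

def count_lmers(sequence, l):
    """Count every overlapping l-mer; a string of length n has n - l + 1 of them."""
    sequence = sequence.upper()
    return Counter(sequence[i:i + l] for i in range(len(sequence) - l + 1))

# Invented FASTA content for illustration only.
toy_lines = [">toy record", "ACGTAC", "GT"]
counts = count_lmers(fasta_sequence(toy_lines), 4)
print(counts)
```

For a real file, the lines could come from an open file handle, for example `with open(path) as handle: counts = count_lmers(fasta_sequence(handle), 4)`, where path is wherever the genome file was saved.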

The Ebola virus genome is on Blackboard as KM034562v1.fa. You can also find the genomes of many other species at http://hgdownload.cse.ucsc.edu/downloads.html

 

  • Report the 10 l-mers with the highest frequency and the 10 l-mers with the lowest frequency.
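A possible way to produce this report in Python is sketched below. It assumes that l-mers which never occur should count as frequency 0 and therefore be eligible for the lowest-frequency list; the assignment does not say either way, so state your choice in the report. Ties are broken alphabetically, and the name rank_lmers is illustrative.

```python
from collections import Counter
from itertools import product

def rank_lmers(counts, l, top=10, alphabet="ACGT"):
    """Return (highest, lowest) lists of (l-mer, count) pairs.

    Every possible l-mer over the alphabet is included, so l-mers that
    never occur appear with a count of 0. Observed l-mers containing
    other symbols (such as N) are ignored.
    """
    full = {}
    for letters in product(alphabet, repeat=l):
        lmer = "".join(letters)
        full[lmer] = counts.get(lmer, 0)
    highest = sorted(full.items(), key=lambda item: (-item[1], item[0]))[:top]
    lowest = sorted(full.items(), key=lambda item: (item[1], item[0]))[:top]
    return highest, lowest

# Tiny made-up example with l = 2 over a two-letter alphabet.
toy_counts = Counter({"AA": 3, "AC": 1, "CA": 2})
highest, lowest = rank_lmers(toy_counts, 2, top=2, alphabet="AC")
print(highest, lowest)
```

With l = 4 and the DNA alphabet there are 4^4 = 256 possible l-mers, so building the full table is cheap.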

3. Predicting motifs of transcription factors (60 points)

 

You are given a fasta file (gata6.fa). This file represents 200 DNA sequences (each sequence has length = 200 bases) which are bound by a transcription factor called GATA6 in mouse extraembryonic endoderm stem cells. You can find the actual paper and data source here https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?.

 

Implement a greedy motif-finding algorithm (Section 5.5 of the book) to identify the best motif of this transcription factor, using motif length = 7. Once you have found the best motif, extract its best-matching occurrence from each of the first 50 DNA sequences to build the alignment profile. Upload the alignment profile to http://weblogo.berkeley.edu/logo.cgi to create the motif logo.
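The sketch below shows one simplified greedy strategy in Python, offered as a starting point and not as the book's exact procedure; Section 5.5 may seed and score differently, so compare it against the text. Here every l-mer of the first sequence is tried as a seed, each later sequence contributes the l-mer that most improves the consensus score (the sum over columns of the most frequent base count), and the best-scoring set is kept. All function names and the toy sequences are illustrative.

```python
from collections import Counter

def consensus_score(lmers):
    """Sum, over alignment columns, of the count of the most common base."""
    return sum(Counter(column).most_common(1)[0][1] for column in zip(*lmers))

def greedy_motif_search(sequences, l):
    """Simplified greedy search: seed from sequence 1, extend one sequence at a time."""
    best_motifs, best_score = None, -1
    first = sequences[0]
    for start in range(len(first) - l + 1):
        motifs = [first[start:start + l]]
        for seq in sequences[1:]:
            candidates = [seq[i:i + l] for i in range(len(seq) - l + 1)]
            # Keep the l-mer that gives the highest score with the motifs so far.
            motifs.append(max(candidates, key=lambda c: consensus_score(motifs + [c])))
        score = consensus_score(motifs)
        if score > best_score:
            best_motifs, best_score = motifs, score
    return best_motifs, best_score

def alignment_profile(motifs, alphabet="ACGT"):
    """Per-position base counts for a list of equal-length motif occurrences."""
    return {base: [column.count(base) for column in zip(*motifs)] for base in alphabet}

# Made-up toy sequences that share the 4-mer ACGT.
toy = ["TTACGTTT", "GGGACGTG", "ACGTCCCC"]
motifs, score = greedy_motif_search(toy, 4)
print(motifs, score)
print(alignment_profile(motifs))
```

Rescoring the whole motif list for every candidate keeps the code short but is slow on 200 sequences of 200 bases; keeping running per-column base counts and updating them incrementally avoids that. For the logo step, the occurrences chosen for the first 50 sequences (for example motifs[:50]) can be printed one per line as an alignment.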
