CS 4390: Introduction to Bioinformatics
Due Date: 02/14/2017
You can use any high-level programming language, and you can adopt either a procedural or an object-oriented programming technique. You need to submit the code and a Word document reporting your findings.
You are given a simple file (“exons.fa”). This is a FASTA file containing exon sequences. Each exon sequence follows its exon id, which always starts with “>”. The exon sequences are given 5’ to 3’. Implement a program that takes these exon sequences as input and generates all possible protein sequences. You must take alternative splicing into account, and at least 2 exons must be joined before translating to a protein sequence. For example, if a gene has 3 exons (exon1, exon2, and exon3), alternative splicing can generate the following mRNAs: exon1-exon2, exon1-exon3, exon2-exon3, and exon1-exon2-exon3.
Note that because transcription is directional, in any spliced mRNA exon1 must come before exon2 and exon3, and exon2 must come before exon3.
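The ordering rule above means that every valid mRNA is an order-preserving subset of at least two exons, so n exons give 2^n - n - 1 candidates. A minimal sketch of that enumeration follows; the function names `parse_fasta` and `spliced_mrnas` are hypothetical, and the code assumes the exons appear in the file in their genomic order.

```python
from itertools import combinations


def parse_fasta(lines):
    """Parse FASTA lines into an ordered list of (id, sequence) pairs."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header line: start a new record named after the text following ">".
            records.append([line[1:].strip(), ""])
        elif records:
            # Sequence line: append to the most recent record.
            records[-1][1] += line.upper()
    return [(name, seq) for name, seq in records]


def spliced_mrnas(exons, min_exons=2):
    """Yield (exon ids, joined sequence) for each order-preserving exon subset.

    itertools.combinations keeps the input order, so exon1 always comes
    before exon2, and exon2 before exon3, in every emitted mRNA.
    """
    for size in range(min_exons, len(exons) + 1):
        for subset in combinations(exons, size):
            ids = tuple(name for name, _ in subset)
            yield ids, "".join(seq for _, seq in subset)
```

Because `parse_fasta` accepts any iterable of lines, an open file handle for exons.fa can be passed to it directly, and its result can be fed to `spliced_mrnas`.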
Use the “codon to amino acid” conversion code given on page 67 of the book. You can treat ‘U’ as ‘T’. Remember that protein translation runs from a start codon to a stop codon.
Report all the possible protein sequences.
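The translation step can be sketched as below. This is only one reading of the task, under several assumptions: the table is the standard genetic code (the book's table on page 67 is not reproduced here), ATG is the only start codon, every ATG in the spliced mRNA is treated as a candidate start, and reading frames that never reach a stop codon are discarded. The names `CODON_TABLE`, `translate_from`, and `proteins` are hypothetical.

```python
# Standard genetic code, built from the conventional TCAG ordering.
# "*" marks the three stop codons (TAA, TAG, TGA).
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"   # codons starting with T
         "LLLLPPPPHHQQRRRR"   # codons starting with C
         "IIIMTTTTNNKKSSRR"   # codons starting with A
         "VVVVAAAADDEEGGGG")  # codons starting with G
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate_from(mrna, start):
    """Translate from index `start` up to the first in-frame stop codon.

    Returns None if the sequence ends, or an unknown codon is met,
    before a stop codon is reached.
    """
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE.get(mrna[i:i + 3])
        if amino_acid is None:
            return None
        if amino_acid == "*":
            return "".join(protein)
        protein.append(amino_acid)
    return None


def proteins(mrna):
    """Return the distinct proteins found in one mRNA, treating U as T."""
    mrna = mrna.upper().replace("U", "T")
    found = []
    pos = mrna.find("ATG")
    while pos != -1:
        protein = translate_from(mrna, pos)
        if protein is not None and protein not in found:
            found.append(protein)
        pos = mrna.find("ATG", pos + 1)
    return found
```

Scanning every ATG also reports shorter proteins that begin at an internal start codon in the same frame; restricting the search to the first ATG is an equally defensible choice if only the longest product is wanted.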
2. Counting l-mer frequencies (20 points)
Implement a program that counts the number of occurrences of each l-mer in a string of length n. Run it over the Ebola virus genome to construct the l-mer frequencies, using l = 4.
The Ebola virus genome is on BlackBoard as KM034562v1.fa. You can also find the genomes of many other species at http://hgdownload.cse.ucsc.edu/downloads.html
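A string of length n contains n - l + 1 overlapping l-mers, so counting them is a single sliding-window pass. The sketch below assumes the genome has already been read into one string with the FASTA header removed, and it skips any window containing a character other than A, C, G, or T (such as N); both choices are assumptions, and `count_lmers` is a hypothetical name.

```python
from collections import Counter


def count_lmers(sequence, l=4):
    """Count overlapping l-mers in `sequence`, ignoring ambiguous windows."""
    sequence = sequence.upper()
    valid = set("ACGT")
    counts = Counter()
    for i in range(len(sequence) - l + 1):
        lmer = sequence[i:i + l]
        # Skip windows that contain anything other than A, C, G, or T.
        if set(lmer) <= valid:
            counts[lmer] += 1
    return counts
```

For l = 4 there are at most 4^4 = 256 distinct keys, and `counts.most_common()` lists them from most to least frequent.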
3. Predicting motifs of transcription factors (60 points)
You are given a FASTA file (gata6.fa). This file contains 200 DNA sequences (each 200 bases long) that are bound by a transcription factor called GATA6 in mouse extraembryonic endoderm stem cells. You can find the original paper and data source here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?.
Implement a greedy motif finding algorithm (Section 5.5 from the book) to identify the best motif (choose motif length = 7) of this transcription factor. Once you find the best motif, extract the best representation of that motif in the first 50 DNA sequences to get the alignment profile. Upload the alignment profile to http://weblogo.berkeley.edu/logo.cgi to create the motif logo.
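One way to sketch a greedy motif search is shown below. It is not a copy of the book's pseudocode: it assumes the usual consensus score (for each column, the count of the most frequent base, summed over columns), picks the best pair of l-mers from the first two sequences exhaustively, and then adds one l-mer per remaining sequence, always choosing the one that maximizes the score of the alignment built so far. All function names are hypothetical, and ties are broken in favor of the first candidate found.

```python
def windows(seq, l):
    """All overlapping l-mers of `seq`, left to right."""
    return [seq[i:i + l] for i in range(len(seq) - l + 1)]


def empty_counts(l):
    """One {base: count} dictionary per alignment column."""
    return [dict.fromkeys("ACGT", 0) for _ in range(l)]


def add_motif(counts, motif):
    """Add one aligned l-mer to the column counts (non-ACGT bases ignored)."""
    for column, base in zip(counts, motif):
        if base in column:
            column[base] += 1


def score_with(counts, motif):
    """Consensus score of the alignment if `motif` were added to `counts`."""
    return sum(
        max(column[b] + (b == base) for b in "ACGT")
        for column, base in zip(counts, motif)
    )


def greedy_motif_search(seqs, l=7):
    """Return (chosen l-mers, column counts) for a list of DNA strings."""
    seqs = [s.upper() for s in seqs]
    # Step 1: exhaustive search for the best pair in the first two sequences.
    best_pair, best_score = None, -1
    for a in windows(seqs[0], l):
        seed = empty_counts(l)
        add_motif(seed, a)
        for b in windows(seqs[1], l):
            s = score_with(seed, b)
            if s > best_score:
                best_pair, best_score = (a, b), s
    motifs = list(best_pair)
    counts = empty_counts(l)
    for m in motifs:
        add_motif(counts, m)
    # Step 2: greedily extend the alignment one sequence at a time.
    for seq in seqs[2:]:
        best = max(windows(seq, l), key=lambda m: score_with(counts, m))
        motifs.append(best)
        add_motif(counts, best)
    return motifs, counts


def consensus(counts):
    """Most frequent base in each column (ties resolved in A, C, G, T order)."""
    return "".join(max("ACGT", key=column.get) for column in counts)
```

The returned list holds one l-mer per input sequence, in input order, so its first 50 entries are one possible set of aligned sites to paste into the logo tool; whether that matches the intended "best representation" is an assumption.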