In this assignment, we are going to:
1. Align the FASTA sequences using a program called "mafft"
2. Calculate pairwise distances between the aligned sequences using a program called "quicktree"
Your assignment is to write an executable program called:
~/assignments/assignment10/alignAndDist.py
The program should take a single command-line argument: the name of a FASTA-formatted sequence file. As test cases, I have supplied three unaligned FASTA files in:
/home/PyProgBiol_materials/assignment10/
called:
/home/PyProgBiol_materials/assignment10/BC01.fasta
/home/PyProgBiol_materials/assignment10/BC02.fasta
/home/PyProgBiol_materials/assignment10/BC03.fasta
These are the same as the correct output files from assignment 9. For testing purposes, I suggest copying these files into your ~/assignments/assignment10 directory. Your program should then work on any one of these input files when you type:
./alignAndDist.py BC01.fasta
The program does not need to print anything to the screen, although you are free to print progress messages (or anything else you want) if you'd like.
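As a minimal sketch of the single-argument requirement, the helper below pulls the FASTA file name out of an argument list. The function name get_input_path is hypothetical, and exiting with a usage message on a wrong argument count is my assumption, not something the assignment requires.

```python
import sys


def get_input_path(argv):
    """Return the single FASTA file name given on the command line.

    argv is expected to look like sys.argv: the program name followed
    by exactly one argument. Anything else stops the program with a
    usage message (an assumed behavior, not specified by the assignment).
    """
    if len(argv) != 2:
        raise SystemExit(f"usage: {argv[0]} <input.fasta>")
    return argv[1]
```

In the real script you would call get_input_path(sys.argv) near the top of the program.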
The required output will be 3 files, in this case named:
BC01.fasta.mafft
BC01.fasta.tab
BC01.fasta.dists
That is, the first one is the user-supplied file name with an added ".mafft" extension. This file will contain the mafft-aligned sequences.
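One possible way to produce the ".mafft" file from Python is sketched below. It relies on mafft writing the alignment to standard output, which is then redirected into the output file. The helper names run_to_file and align_with_mafft are hypothetical, and mafft is called with no extra options, so adjust the command if the podcasts suggest otherwise.

```python
import subprocess


def run_to_file(command, out_path):
    """Run an external command and save its standard output in out_path.

    check=True makes Python raise an error if the command fails.
    """
    with open(out_path, "w") as out:
        subprocess.run(command, stdout=out, check=True)


def align_with_mafft(fasta_path):
    """Align fasta_path with mafft and return the name of the aligned file."""
    aligned_path = fasta_path + ".mafft"  # e.g. BC01.fasta -> BC01.fasta.mafft
    run_to_file(["mafft", fasta_path], aligned_path)
    return aligned_path
```

mafft reports its progress on standard error, so those messages still reach the screen. Passing stderr=subprocess.DEVNULL to subprocess.run would hide them.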
The second file is a conversion of BC01.fasta.mafft to tab-delimited format. To accomplish this task, you may want to use the program:
/home/PyProgBiol_materials/assignment10/fastaToTab.py
You are free to copy this to your ~/assignments/assignment10 directory, or use it where it is. I have also placed a copy of this program in /usr/local/bin, so it is in your executable PATH. You can use it by just typing:
fastaToTab.py
I have also written this as a Python module, so you can "import" it in your program, should you wish to take that route.
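If you would rather see what such a conversion involves, here is a rough pure-Python sketch. It is not the supplied fastaToTab.py: the function name fasta_to_tab is hypothetical, and I am assuming the tab-delimited layout is one "name<TAB>sequence" line per record, with the name taken as the first word of the FASTA header.

```python
def fasta_to_tab(fasta_text):
    """Convert FASTA text to lines of 'name<TAB>sequence' (assumed layout)."""
    records = []
    name, chunks = None, []
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records.append((name, "".join(chunks)))
            words = line[1:].split()
            name, chunks = (words[0] if words else ""), []
        elif name is not None:
            # Sequence lines of one record are joined into a single string.
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return "".join(f"{rec_name}\t{seq}\n" for rec_name, seq in records)
```

For example, fasta_to_tab(">a\nAC-\nGT\n") gives "a\tAC-GT\n", with the alignment gap characters kept in place.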
Finally, the last required output file, BC01.fasta.dists, contains a matrix of pairwise distances between all the aligned sequences in BC01.fasta.tab. This can be produced from BC01.fasta.tab by the "quicktree" program.
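A hedged sketch of the quicktree step follows. As I understand quicktree's usage message, "-in a" says the input is an alignment and "-out m" asks for a distance matrix instead of a tree, with the result printed to standard output. Treat those flags as an assumption and check them against the usage message on the server. The function names are hypothetical.

```python
import subprocess


def quicktree_command(alignment_path):
    """Build the quicktree command line for a distance-matrix run.

    Assumed flags: -in a (input is an alignment), -out m (output a matrix).
    """
    return ["quicktree", "-in", "a", "-out", "m", alignment_path]


def write_distances(tab_path, dists_path):
    """Run quicktree on tab_path and save the matrix in dists_path."""
    with open(dists_path, "w") as out:
        subprocess.run(quicktree_command(tab_path), stdout=out, check=True)
```

Calling write_distances("BC01.fasta.tab", "BC01.fasta.dists") would then create the third required file.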
So, the order of operations goes:
XX.fasta (user-input file) -> XX.fasta.mafft (aligned sequences) -> XX.fasta.tab (convert aligned sequences to tab-delimited) -> XX.fasta.dists (pairwise distances)
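The chain above can also be written down as data, which makes it easy to loop over the steps in order. This is only an illustrative sketch: the function names are hypothetical, and the tool listed for each step is the one named in this assignment.

```python
def pipeline_files(fasta_path):
    """Return the input file followed by the three output files, in order."""
    return [fasta_path] + [f"{fasta_path}.{ext}" for ext in ("mafft", "tab", "dists")]


def pipeline_steps(fasta_path):
    """Return (tool, source file, destination file) for each step."""
    files = pipeline_files(fasta_path)
    tools = ["mafft", "fastaToTab.py", "quicktree"]
    # Pair each tool with the file it reads and the file it should produce.
    return list(zip(tools, files, files[1:]))
```

For BC01.fasta, the first step is ("mafft", "BC01.fasta", "BC01.fasta.mafft") and the last is ("quicktree", "BC01.fasta.tab", "BC01.fasta.dists").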
Both mafft and quicktree are in /usr/local/bin on our server, so they are available by typing:
mafft
or
quicktree
I will go over using these in the podcasts, although you are welcome to try on your own and/or consult the internet.
Also, for testing purposes, all the required output files for each of the supplied .fasta input files can be found in:
/home/PyProgBiol_materials/assignment10/
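One way to use those reference files is to compare them byte for byte with your own output. The sketch below assumes the reference files carry the same names as the required outputs (for example BC01.fasta.mafft). The function name mismatched_outputs is hypothetical.

```python
import filecmp
import os

REFERENCE_DIR = "/home/PyProgBiol_materials/assignment10"


def mismatched_outputs(fasta_path, reference_dir=REFERENCE_DIR, out_dir="."):
    """Return the output file names that are missing or differ from the reference."""
    base = os.path.basename(fasta_path)
    problems = []
    for ext in (".mafft", ".tab", ".dists"):
        mine = os.path.join(out_dir, base + ext)
        reference = os.path.join(reference_dir, base + ext)
        same = (
            os.path.exists(mine)
            and os.path.exists(reference)
            # shallow=False compares file contents, not just size and timestamps.
            and filecmp.cmp(mine, reference, shallow=False)
        )
        if not same:
            problems.append(base + ext)
    return problems
```

An empty list from mismatched_outputs("BC01.fasta") means all three files match. A mismatch may still be harmless if, say, a different program version formats numbers slightly differently, so inspect the files before assuming a bug.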
This assignment does a lot of computation with minimal Python code, and highlights writing scripts that make use of existing 3rd-party command-line applications. This represents about 90% (well, a large bit, anyway) of what "bioinformatics" really is: piecing together programming bits to perform complex analysis "pipelines."