Monday, January 27, 2020

Sequence Alignment and Dynamic Programming

Sequence Alignment and Dynamic Programming Introduction Sequence alignment Sequence alignment is a standard method to compare two or more sequences by looking for a series of individual characters or character patterns that are in the same order in the sequences [1]. Also, it is a way of arranging two or more sequences of characters to recognize regions of similarity [2]. Importance of sequence alignment Sequence alignment is significant because in bimolecular sequences (DNA, RNA, or protein), high sequence similarity usually implies important functional or structural similarity that is the first step of many biological analysis [3]. Besides, sequence alignment can address significant questions such as detecting gene sequences that cause disease or susceptibility to disease, identifying changes in gene sequences that cause evolution, finding the relationship between various gene sequences that can indicate the common ancestry [4], detecting functionally important sites, and demonstrating mutation events [5]. Analysis of the alignment can reveal important information. It is possible to identify the parts of the sequences that are likely to be important for the function, if the proteins are involved in similar processes .The random mutations can accumulate more easily in parts of the sequence of a protein which are not very essential for its function. In the parts of the sequence that are essential for the function hardly any mutations will be accepted because approximately all changes in such regions will destroy the function [6]. Moreover, Sequence alignment is important for assigning function to unknown proteins [7]. Protein alignment of two residues implies that those residues perform similar roles in the two different proteins [8]. Methods The main purpose of sequence alignments methods is finding maximum degree of similarities and minimum evolutionary distance. Generally, computational approaches to solve sequence alignment problems can be divided into two categories: global alignments and local alignments. Global alignments traverse the entire length of all query sequences, and match as many characters as possible from end to end. These alignment methods are most useful when the sequences have approximately the same size or they are similar. The alignment is performed from beginning of the sequence to end of the sequences to find out the best possible alignment. On the other hand, Local alignments find the local regions with high level of similarity. They are more useful for sequences that are suspected to contain regions of similarity within their larger sequence context. [9] Besides, pairwise sequence alignment is used to find the regions of similarity between two sequences. As the number of sequences increases, comparing each and every sequence to every other may be impossible. So, we need multiple sequence alignment, where all similar sequences can be compared in one single figure or table. The basic idea is that the sequences are aligned on top of each other, so that a coordinate system is set up, where each row is the sequence for one protein, and each column is the same position in each sequence. [10] There are many different approaches and implementations of the methods to perform sequence alignment. These include techniques such as dynamic programming , heuristic algorithms (BLAST and FASTA similarity searching), probabilistic methods, dot-matrix methods, progressive methods, ClustalW , MUSCLE , T-Coffee , and DIALIGN. Dynamic programming Dynamic programming (DP) is a problem solving method for a class of problems that can be solved by dividing them down into simpler sub-problems. It finds the alignment by giving some scores for matches and mismatches (Scoring matrices).This method is widely used in sequence alignments problems. [11] However, when the number of the sequences is more than two, multiple dimensional Dynamic programming in infeasible because of the large storage and computational complexities.[16] Dynamic programming algorithms use gap penalties to increase the biological meaning [9]. There are different gap penalties such as linear gap, constant gap, gap open and gap extension. The gap score is a penalty given to alignment when there is insertion or deletion. There may be a case where there are continuous gaps all along the sequence during the evolution, so the linear gap penalty would not be suitable for the alignment. Therefore, gap opening penalty and gap extension penalty has been introduced when there are continuous gaps. The gap opening penalty is applied at the start of the gap, and then the other gap following it is given with a gap extension penalty which will be less compared to the open penalty. Different gap penalty functions require different dynamic programming algorithms [12]. Also; there is a substitution matrix to score alignments. The mainly used predefined scoring matrices for sequence alignment are PAM (Point Accepted Mutation) and BLOSUM (Blocks Substitut ion Matrix). The two algorithms, Smith-Waterman for local alignment and Needleman-Wunsch for global alignment, are based on dynamic programming. Needleman-Wunsch algorithm requires alignment score for a pair of residues to be equal or more than zero. No gap penalty is required, and score cannot decrease between two cells of pathway. Smith-Waterman requires a gap penalty to work efficiently. Residue alignment score may be positive or negative .Score can increase, decrease, or stay level between two cells of pathway [13]. Sequence Alignment Problems For an n-character sequence s, and an m-character sequence t , we construct an (n+1)Ãâ€"(m+1)matrix . Global alignment: F ( i, j ) = score of the best alignment of s[1i ] with t[1j] Local alignment: F ( i, j ) = score of the best alignment of a suffix of s[1i ] and a suffix of t[1j] There are three steps in the sequence alignments algorithms: Initialization In the initialization phase, we assign values for the first row and column of the alignment matrix .The next step of the algorithm depends on this. Fill In the fill stage, the entire matrix is filled with scores from top to bottom, left to right with appropriate values that depend on the gap penalties and scoring matrix. Trace back For each F ( i, j ), save pointers to cell that resulted in best score . For global alignment, we trace pointers back from F (m, n) to F(0, 0) to recover sequence alignments . For local alignment, we are looking for the maximum value of the F (i, j) that can be anywhere in the matrix. We trace pointers back from F (i, j) and stop when we get to a cell with value 0. Local alignment with scoring matrix After creating and initializing the alignment matrix ( F ) and trace back matrix, the score of F (i, j) for every cell is calculated as follows: For i = 1 to n+1 For j = 1 to m+1 left_score= F[i][ j-1] gap, diagonal_score=F[i-1[ j-1] + PAM250(s[i], t[j]), up_score= F[i-1][ j] gap scores=max[ 0, left_score, diagonal_score, up_score] Also, we should keep the reference to each cell to perform backtracking. traceback_matrix[i][j]= scores.index(F[i][j]) After filling the F matrix, we find the optimal alignment score and the optimal end points by finding the highest scoring cell, maxi,jF(i , j) . best_score has a default value equals to -1 . if F [i][j] > best_score: best_score= F [i][j] i_maximum_score, j_maximum_score = i, j To recover the optimal alignment, we trace back from i_maximum_score, j_maximum_score position , terminating the trace back when we reach a cell with score 0 . The time and space complexity of this algorithm is O(mn) which m is the length of sequence s , and n is the length of sequence t. Local alignment with affine gap penalty For this problem, there are gap opening penalty and gap extension penalty. The gap opening penalty is applied at the start of the gap, and then the other gap following it is given with a gap extension penalty. Initialization: There are Four different matrices: up_score , left_score ,m_score , trace_back Filling matrix: For i = 1 to n+1: up_score[i][0] = -gap_opening_penalty-(i-1)*gap_extension_penalty For j = 1 to m+1: left_score[0][j] = -gap_opening_penalty-(j-1)*gap_extension_penalty For i = 1 to n+1: For j = 1 to m+1: up_score [i][j] = max( [up_score [i][j-1] gap_extension_penalty, m_score[i][j-1] gap_opening_penalty] ) Left_score[i][j] = max( [left_score[i-1][j] gap_extension_penalty, m_score[i-1][j] gap_opening_penalty] ) m_score[i][j] = BLOSUM62 (s[i], t[j])) +max( m_score [i-1][j-1], left_score [i-1][j-1], up_score [i-1][j-1] ) scores = [left_score[i-1][j-1], m_score[i-1][j-1] ,up_score[i-1][j-1], 0] We find the highest scoring cell, the position of that cell,and the best alignment by following the same steps as we accomplished in the previous problem. The time and space complexity of this algorithm is O(mn). Global alignment with constant gap penalty In this case every gap receives a fixed score, regardless of the gap length For i = 1 to m+1: alignment_matrix[i][0] = -gap_penalty For i = 1 to n+1: alignment_matrix[0][j] = -gap_penalty For i = 1 to n+1: For j = 1 to m+1: scores = [alignment_matrix[i][j-1] gap_penalty,alignment_matrix[i-1][j] gap_penalty, alignment_matrix[i-1][j-1] + BLOSUM62 (s[i], t[j]),) alignment_matrix[i][j] = max(scores) alignment_matrix[m][n] holds the optimal alignment score. The time and space complexity of this algorithm is O(mn) which m is the length of sequence s , and n is the length of sequence t. Global alignment with scoring matrix In this problem there is a linear gap that each inserted or deleted symbol is charged g; as a result, if the length of the gap L; the total gap penalty would be the product of the two gL. For i = 1 to m+1: alignment_matrix[i][0] = -i*gap_penalty For i = 1 to n+1: alignment_matrix[0][j] = -j*gap_penalty scores = [alignment_matrix[i][j-1] gap_penalty,alignment_matrix[i-1][j] gap_penalty, alignment_matrix[i-1][j-1] + BLOSUM62 (s[i], t[j]),) alignment_matrix[i][j] = max(scores) alignment_matrix[m][n] holds the optimal alignment score. The time and space complexity of this algorithm is O(mn) which m is the length of sequence s , and n is the length of sequence t. Global alignment with scoring matrix and affine gap penalty There are Four different matrices: up_score , left_score ,m_score , trace_back Filling matrix: For i = 1 to n+1: up_score[i][0] = -gap_opening_penalty-(i-1)*gap_extension_penalty For j = 1 to m+1: left_score[0][j] = -gap_opening_penalty-(j-1)*gap_extension_penalty For i = 1 to n+1: For j = 1 to m+1: up_score [i][j] = max( [up_score [i][j-1] gap_extension_penalty, m_score[i][j-1] gap_opening_penalty] ) Left_score[i][j] = max( [left_score[i-1][j] gap_extension_penalty, m_score[i-1][j] gap_opening_penalty] ) m_score[i][j] = BLOSUM62 (s[i], t[j])) +max( m_score [i-1][j-1], left_score [i-1][j-1], up_score [i-1][j-1] ) maximum_alignment_score = max(m_score[m][n], left_score[m][n], up_score[m][n]) The time and space complexity of this algorithm is O(mn) which m is the length of sequence s , and n is the length of sequence t. The above algorithms require too much time for searching large databases so we cannot use these algorithms. There are several methods to overcome this problem. Heuristic Method It is an algorithm that gives only approximate solution to a problem. Sometimes we are not able to formally prove that this solution actually solves the problem, but since heuristic methods are much faster than exact algorithms, they are commonly used . FASTA is a heuristic method for sequence alignment .The main idea of this method is choosing regions of the two sequences that have some degree of similarity, and using dynamic programming to compute local alignment in these regions. The disadvantage of using these methods is losing significant amount of sensitivity. Parallelization is a possible solution for solving this problem.[14] Parallel Algorithm In this paper [ 15 ] a parallel method is introduced to reduce the complexity of the dynamic programming algorithm for pairwise sequence alignment. The time consumption of sequential algorithm mainly depends on the computation of the score matrix .For calculating the score of each cell, the computation of F(i,j) can be started only when F(i-1,j-1), F(i-1,j) and F(i,j-1) acquire their values. Consequently, it is possible to conduct the computation of score matrix sequentially in order of anti-diagonals .So, the values in the same anti-diagonal can be calculated simultaneously. ( Figure 1 ) Figure1 .Computing score matrix in parallel manner .The values of the cells marked by à ¯Ã†â€™Ã‚   can be computed simultaneously. There are two models for problem solving using parallel method that improve the performance of the pairwise alignment algorithm. Pipeline model: Each row of the score matrix is computed successively by a processor, which blocks itself until the required values in the above row are computed. Anti-diagonal model: From the left-top corner to the right-bottom corner of score matrix, all processors compute concurrently along an anti-diagonal of the matrix. Each idle processor selects a cell from the current anti-diagonal and computes its value. When all values in current anti-diagonal are computed, the computation moves on to next anti-diagonal. In the algorithm that is based on the pipeline model, the score matrix is partitioned into several blocks by column and several bands by row. All the bands distributed to multiple processors, and each processor computes the block in its own band simultaneously. By applying parallel algorithm, The time complexity is O(n) when n processor is used. [15] Progressive Method For solving multiple sequence alignment problems, the most common algorithm used is progressive method. This algorithm consists of three main stapes. First, comparing all the sequences with each other, and producing similarity scores ( distance matrix) . This stage is parallelized. The second stapes groups the most similar sequences together using the similarity scores and a clustering method such as Neighbor-Joining to create a guide tree. Finally, the third stage sequentially aligns the most similar sequences and groups of sequences until all the sequences are aligned. Before alignment with a pairwise dynamic programming algorithm, groups of aligned sequences are converted into profiles. A profile represents the character frequencies for each column in an alignment. In the final stage, for aligning groups of sequences, trace back information from full pairwise alignment is required.[ 17 ] ClustalW This algorithm that has become the most popular for multiple sequence alignment implements progressive method. The time complexity of this method is O (N 4 + L 2) and the space complexity is O (N2 + L 2). [18] Conclusion By comparing the different methods to implement pairwise sequence alignment and multiple sequence alignment , we can conclude that using parallel algorithms that implement pipeline model or anti-diagonal model are effective algorithm for performing pairwise sequence alignments. The algorithms that implement progressive method such as ClustalW are effective algorithm for solving multiple sequence alignments problems. References Robert F. Murphy, Computational Biology, Carnegie Mellon University www.cmu.edu/bio//LecturesPart03.ppt http://en.wikipedia.org/wiki/Sequence_alignment Dan Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology (Cambridge University Press, 1997). http://cs.calvin.edu/activities/blasted/intro03.html http://www.embl.de/~seqanal/courses/commonCourseContent/commonMsaExercises.html Per Kraulis , Stockholm Bioinformatics Center, SBC ,http://www.avatar.se/molbioinfo2001/seqali-why.html http://iitb.vlab.co.in/?sub=41brch=118sim=656cnt=1 Andreas D. Baxevanis, B. F. Francis Ouellett ,Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins http://amrita.vlab.co.in/?sub=3brch=274sim=1433cnt=1 David S.Moss, Sibila Jelaska, Sandor Pongor, Essays in Bioinformatics, ISB 1-58603-539-8 http://amrita.vlab.co.in/?sub=3brch=274sim=1431cnt=1 Burr Settles, Sequence Alignment, IBS Summer Research Program 2008, http://pages.cs.wisc.edu/~bsettles/ibs08/lectures/02-alignment.pdf Aoife McLysaght, Biological Sequence Comparision/Database Homology Searching, The University of Dublin, http://www.maths.tcd.ie/~lily/pres2/sld001.htm Rapid alignment methods FASTA and BLAST http://www.cs.helsinki.fi/bioinformatiikka/mbi/courses/07-08/itb/slides/itb0708_slides_83-116.pdf Yang Chen, Songnian Yu, Ming Ling, Parallel Sequence Alignment Algorithm For Clustering System, School of Computer Enginnering and science, Shanghai University Heitor S. Lope, Carlos R ,Erig Lima , Guilherme L. Morit , A Parallel Algorithm for Large-Scale Multiple Sequence Alignment , Bioinformatics Laboratory/CPGE Federal University of Technology – Paran ÃÅ'  Scott Lloyd, Quinn O Snel , Accelerated large-scale multiple sequence alignment Kridsadakorn Chaichoompu, Surin Kittitornkun, and Sissades Tongsima ,MT-ClustalW: Multithreading Multiple Sequence Alignment

Saturday, January 18, 2020

History Vietnam Controlled Assessment Essay

P1 Para-starter: Use of Defoliation to win hearts and minds. Point 1: Causes birth defects  Evidence: â€Å"Agent Orange is fifty times more concentrated than normal agricultural herbicides; this extreme intensity completely destroyed all plants in the area. Agent Orange not only had devastating effects on agriculture but also on people and animals. The Vietnam Red Cross recorded over 4.8 million deaths and 400,000 children born with birth defects due to exposure to Agent Orange. (http://vietnamawbb.weebly.com/napalm-agent-orange.html) Explanation: So it affected South Vietnam negatively, caused them to hate US and feel sympathy for VC. Evidence 2: Point 2: Use of napalm strikes Evidence: â€Å"Fail grey smoke where they’d burnt off the rice fields, brilliant white smoke from phosphorus, and deep black smoke from napalm. They said that if you stood at the base of the column it would suck the air right out of your lungs.† (Sauvain, Philip: Vietnam) Explanation: Consequently shows how bad it is from American POV so would be worse for random civilian. Evidence 2: Para-ender: Overall, defoliation is bad because it makes civilians hate US. P2 Para-starter: Use of Search and Destroy to win hearts and minds Point 1: Ruthless aggression of Americans Evidence: â€Å"Frustrated and frightened American troops settled on searching villages and destroying those instead. In most cases these villages played no role in supporting the VC.†(Demarco, Neil: Vietnam) Explanation: Because of this Americans would kill innocents (Refer to My Lai Massacre and Zippo lighters) Evidence 2: Para-ender: As a result, brutality of US caused the civilains to hate the US P3 Para-Starter: In addition, use of Search and Destroy to counter VC Point 1: VC were well prepared Evidence: â€Å"Such missions were ineffective because at the slightest hint of american activity the communist forces would slip away into the jungle.†(Bircher, Rob and May-History controlled assessment) Explanation: Shows how well prepared VC were compared to americans Evidence 2: â€Å"60%of US casualties from the war came from traps and mines† Explanation: Shows how vulnerable Americans were, demoralized American troops and failed to counter VC P4 Para-starter: Finally, Defoliation counters VC Point 1: Successful to an extent Evidence: â€Å"It is estimated that approximately 77 million litres of this acid was sprayed over Vietnam† (Rob Bircher and Steve May – History Controlled Assessment) Evidence 2:†Nearly 5.5 million acres of South Vietnamese forest and cropland†(Gibson, Michael – The war in Vietnam) Explanation: initial plan to uncover Ho Chi Minh trail, but not fully achieved. Para-ender: In Addition, they couldn’t do more damage cause communists are supported by USSR and China.

Friday, January 10, 2020

My Hometown Essay

Changing is inevitable part in our life. We daily see the evidences of changing anywhere. In fact, the changing is nature phenomenon. Also, sometimes the differences which have been resulted of changing can be wide and distinct, other times it can be unnoticeable. There is an ancient quote which says â€Å"if people have finished changing, they will finish,† this say implies the idea that the variability is a continuous process. My hometown varies a lot for the last ten years compared to when I was a child. The obvious changing has been occurred in people income rate, public services, and education. First, the annual income for each people has increased in the last several years many time than the past. Many available jobs opportunities are provided. Because there are many oil companies branches has been opened in my hometown, a lot of people involve in these companies and get sufficient salaries, while then, there were limited numbers of small stores which required few workers. Also, because people in the past used to work in farms and agricultural production, most of them spent a lot labor with merely money. Moreover, now days, the agricultural yield has significantly increased since the technology was adapted in production. Conversely, the production was so limited that people did not get enough food for their families because old methods of management and the harmful effects of pests were predominant in agricultural production. Moreover, by the time the companies are established in the city, many banks branches have been opened ,and individuals can acquire the benefits of banks services. For instance, they can easily get loans and financial support or invest their money. Whereas, in the past, may hometown did not have banks. Therefore, people used to go to another cities which had banks. Al in all, the financial abilities for people has changed a lot. Second, the public services which have been offered for people are immensely available in these days. To illustrate, there are many health centers and hospitals which are opened to take care of people. Conversely, my city had a small health center which was designed for limited population in the past. Some people who had serious disease did not find the suitable therapy or the first aids. Furthermore, right now, my city has a convenient transportation even though they do not have modern transitions such as subways and monorail. Many traffic highways are built and communicated with most of the other cities. On the other hand, the roads were inconvenient and most of them did not pave in past; therefore, people were obligated to walk in mud in winter. In addition, a big bridge is built on Tigris river which follows through the city and divides it into two parts in order to communicate both cities part, while the old woody bridge was unsafe and causes many accident in which many people were died. In addition, many parks, lots, and railroad are built to absorb the increasing in population. Finally, the developing in education is noticeable. For example, many education institutes are recently built such as primary, secondary, and high schools. A lot of students annually attend the schools, whereas, there were limited number of schools which was built in old design. Also, there is a big university has opened for 8 years. Therefore, the graduated high schools students can complete their academic study. Furthermore, not only government but also private schools are opened which have different specialists and fields. Thus, the students will have options when they choose their career; conversely, students used to travel to another cities in order to attend universities. In short, today, may hometown is different to what it was in the past in three important sectors finance, public facilities, and education sectors. Although no all changing presents positive results, it is still better than the last time. In my opinion my hometown will continue developing although that is might produce new challenges.

Thursday, January 2, 2020

Solutions for Poor Contries in Bottom Billion by Paul Collier

The Bottom Billion by Paul Collier discusses why the poorest countries are failing and then offers some insights and solutions to the problem. He says the four major problems in developing nations are: conflict, natural resources, bad neighbors, and bad governments. The conflicts are usually civil wars which have huge costs and the situation just becomes worse the longer the conflicts drag on. Collier states that countries rich in natural resources are often worse off than countries that are not, he attributes this problem to several different factors. One of the factors is that the resources open the possibility for conflict over the resources. Another factor is that if a country strictly focuses it’s on a specific natural resource then the other resources and industries might get forgotten and lose value. Being landlocked with bad neighbors can also be a large problem because it makes it almost impossible to be a part of world trade, so these landlocked countries have to d epend on their neighbors for most of the trade and materials. A bad government can also be very destructive to a country’s economy, if they create unreasonable and restrictive policies. The smaller countries are also at a disadvantage because it is hard for them to get any investors, because the investors would much rather invest in well-known countries like India or China. After Collier stated all the problems he also offered up some possible solutions. He believed that aid agencies should concentrate