# Projects

## Project1:

• Implement either Knuth-Morris-Pratt (5 points) or Boyer-Moore (7 points) algorithm for pattern matching. Run your algorithm on the following input. Take text from here (by CNN). Each student has his own pattern to match. Find your pattern from the table below. Your program should report all matches for the pattern and total number of comparisons performed.
• Bonus points: 5 for KMP and 7 for BM. If both algorithms implemented, then you will get 7 points
• INPUT: TEXT and personal patterns from the table
• OUTPUT: comma-separated list of positions of all pattern matches and also total number of comparisons
• CLOSING DATE: 20.01.2018
• Good luck!

## Project2:

• Implement Ukkonen algorithm for constructing a suffix tree. Each student gets the individual input text from the table below.
• Bonus points: 8
• INPUT: individual text for each student from the column “Assigment 2”. The analysis should be case-sensitive (i.e., do not ignore letter case) and also consider all non-alphabet characters occurring in the text, like spaces, commas, exclamation marks, etc. Make sure that you add at the end of your text character “\$” that should not occur anywhere else in the text.
• OUTPUT: suffix tree stored as a list of labeled edges and suffix links stored in CSV table files.
• FORMAT:
• Table of labeled edges should contain columns “src_node”, “head_node”, “left_label_pos”, “right_label_pos”. The root node of the suffix tree should have label “R”, leafs should have indices 1, 2, …, m assigned by Ukkonen algorithm, the intermediate nodes may have arbitrary labels. Column “src_node” contains the label of the source node of the edge, “head_node” contains the label of the head node of the edge, “left_label_pos” contains the left-most position of the edge label in the text, “right_label_pos” contains the right-most position of the edge label in the text.
• OPEN DATE: 03.10.2014
• CLOSING DATE: 20.01.2018
• Good luck!

## Project 3:

• SUBTASK A: Basing on the Ukkonen algorithm from Project 2, implement the algorithm searching for a generalized suffix tree that is indexing two strings (see Lecture 9)
• Bonus points: 4
• INPUT: individual two texts S1 and S2 (from Wikipedia) for each student from the column “Assigment 3”. The analysis should be case-sensitive (i.e., do not ignore letter case) and also consider all non-alphabet characters occurring in the text, like spaces, commas, exclamation marks, etc. Make sure that you add at the end of your text character “\$” that should not occur anywhere else in the text.
• OUTPUT:  suffix tree stored as a list of labeled edges and list of leafs where each leaf v contains information about in which string(s) and at what position(s) the corresponding suffix L(v) starts
• FORMAT:
• Table of labeled edges should contain columns “src_node”, “head_node”, “left_label_pos”, “right_label_pos”, “string_ID”. The root node of the suffix tree should have label “R”, leafs and intermediate nodes may have arbitrary labels. Column “src_node” contains the label of the source node of the edge, “head_node” contains the label of the head node of the edge, “left_label_pos” contains the left-most position of the edge label in the text, “right_label_pos” contains the right-most position of the edge label in the text. “string_ID” sontains the ID of the string the edge label was taken from. Here we have two string labels “S1” and “S2”.
• The table of leafs should contain columns “node”, “suffix_pos”, “string_id”. Column “node” stays for the leaf node label (say v), “suffix_pos” contains the starting (left-most) position of the suffix L(v) represented by the leaf v and column “string_id” indicates in what string that suffix occurs. For instance, row “v 3 S1” indicates that the suffix L(v) associated to the leaf v starts at position 3 in string S1. Note that the same suffix L(v) may start in both strings S1 and S2 at two different positions.
• SUBTASK B: Implement the algorithm which basing on the generalized suffix tree obtained in SUBTASK A will compute the longest common substring for strings S1 and S2
• Bonus points: 4
• OUTPUT: the longest common substring in S1 and S2
• SUBTASK C: Compute the edit distance and optimal transcript for strings S1 and S2 with penalty score 1 for insertions, deletions and substitutions (see Lecture 10).
• Bonus points: 4