↑ Return to Advanced Text Algorithms

Projects

Project1:

  • Implement either Knuth-Morris-Pratt (5 points) or Boyer-Moore (7 points) algorithm for pattern matching. Run your algorithm on the following input. Take text from here (by CNN). Each student has his own pattern to match. Find your pattern from the table below. Your program should report all matches for the pattern and total number of comparisons performed.
  • Bonus points: 5 for KMP and 7 for BM. If both algorithms implemented, then you will get 7 points
  • INPUT: TEXT and personal patterns from the table
  • OUTPUT: comma-separated list of positions of all pattern matches and also total number of comparisons
  • CLOSING DATE: 20.01.2018
  • Good luck!

Project2:

  • Implement Ukkonen algorithm for constructing a suffix tree. Each student gets the individual input text from the table below.
  • Bonus points: 8
  • INPUT: individual text for each student from the column “Assigment 2”. The analysis should be case-sensitive (i.e., do not ignore letter case) and also consider all non-alphabet characters occurring in the text, like spaces, commas, exclamation marks, etc. Make sure that you add at the end of your text character “$” that should not occur anywhere else in the text.
  • OUTPUT: suffix tree stored as a list of labeled edges and suffix links stored in CSV table files.
  • FORMAT:
    • Table of labeled edges should contain columns “src_node”, “head_node”, “left_label_pos”, “right_label_pos”. The root node of the suffix tree should have label “R”, leafs should have indices 1, 2, …, m assigned by Ukkonen algorithm, the intermediate nodes may have arbitrary labels. Column “src_node” contains the label of the source node of the edge, “head_node” contains the label of the head node of the edge, “left_label_pos” contains the left-most position of the edge label in the text, “right_label_pos” contains the right-most position of the edge label in the text.
    • Table of suffix links contains two columns “src” and “head”, where “src” contains label of the source node of the link, while “head” contains label of the head node of the link.
  • OPEN DATE: 03.10.2014
  • CLOSING DATE: 20.01.2018
  • Good luck!

Project 3:

  • SUBTASK A: Basing on the Ukkonen algorithm from Project 2, implement the algorithm searching for a generalized suffix tree that is indexing two strings (see Lecture 9)
    • Bonus points: 4
    • INPUT: individual two texts S1 and S2 (from Wikipedia) for each student from the column “Assigment 3”. The analysis should be case-sensitive (i.e., do not ignore letter case) and also consider all non-alphabet characters occurring in the text, like spaces, commas, exclamation marks, etc. Make sure that you add at the end of your text character “$” that should not occur anywhere else in the text.
    • OUTPUT:  suffix tree stored as a list of labeled edges and list of leafs where each leaf v contains information about in which string(s) and at what position(s) the corresponding suffix L(v) starts
    • FORMAT:
      • Table of labeled edges should contain columns “src_node”, “head_node”, “left_label_pos”, “right_label_pos”, “string_ID”. The root node of the suffix tree should have label “R”, leafs and intermediate nodes may have arbitrary labels. Column “src_node” contains the label of the source node of the edge, “head_node” contains the label of the head node of the edge, “left_label_pos” contains the left-most position of the edge label in the text, “right_label_pos” contains the right-most position of the edge label in the text. “string_ID” sontains the ID of the string the edge label was taken from. Here we have two string labels “S1” and “S2”.
      • The table of leafs should contain columns “node”, “suffix_pos”, “string_id”. Column “node” stays for the leaf node label (say v), “suffix_pos” contains the starting (left-most) position of the suffix L(v) represented by the leaf v and column “string_id” indicates in what string that suffix occurs. For instance, row “v 3 S1” indicates that the suffix L(v) associated to the leaf v starts at position 3 in string S1. Note that the same suffix L(v) may start in both strings S1 and S2 at two different positions.
  • SUBTASK B: Implement the algorithm which basing on the generalized suffix tree obtained in SUBTASK A will compute the longest common substring for strings S1 and S2
    • Bonus points: 4
    • INPUT: from SUBTASK A
    • OUTPUT: the longest common substring in S1 and S2
  • SUBTASK C: Compute the edit distance and optimal transcript for strings S1 and S2 with penalty score 1 for insertions, deletions and substitutions (see Lecture 10).
    • Bonus points: 4
    • INPUT: from SUBTASK A
    • OUTPUT: edit distance and sequence of letters (I, D, R) representing the optimal transcript of S1 into S2, where I stays for the insertion, D stays for the deletion and R stays for the replacement
  • SUBTASK D: Implement an algorithm solving the longest common subsequence problem for strings S1 and S2 (See Lecture 10).
    • Bonus points: 6
    • INPUT: from SUBTASK A
    • OUTPUT: a string representing the longest common subsequence S from S1 and S2, vector of integers V1 representing occurrences of S in S1 and vector V2 representing occurrences in S2
    • FORMAT: V1 and V2 two one-line separate files, where integers are comma-separated.
  • OPEN DATE: 18.10.2014
  • CLOSING DATE: 20.01.2018
  • Good luck!

 

In the table below you will find your gained scores and currently opened projects.