I've designed an algorithm to find the longest common subsequence. these are steps:
Starts with i = 0
- Picks the first letter from the first string start from ith letter.
- Go to the second string looking for that picked letter.
- If not found return to the first string and picks the next letter and repeat 1 to 3 until it finds a letter that is in the second string.
- Now that found a common letter in the second string, adds that to
$common_subsequence. - Store its position in
$index. - Picks next letter from the first string and do step 2 but this time starts from
$index. - Repeat 3 to 6 until reached end of string 1 or string 2.
- If length
$common_subsequenceis greater than length of common subsequence so far add that change lcs to the$common_subsequence. - Add 1 to the i and repeat 1 to 9 while i is less that length of the first string.
This is an example:
X=A, B, C, B, D, A, B Y=B, D, C, A, B, A - First pick
A. - Look for
AinY. - Now that found
Aadd that to the$common_subsequence. - Then pick
BfromX. - Look for
BinYbut this time start searching fromA. - Now pick
C. It isn't there in string 2, so pick the next letter inXthat isB.
...
...
...
The complexity of this algorithm is theta(n*m).
I implemented it on the two methods. The second one uses a hash table, but after implementing I found that it's much slower compared to the first algorithm. I cant undrestand why.
Here is my implementation:
First algorithm:
import time def lcs(xstr, ystr): if not (xstr and ystr): return # if string is empty lcs = [''] # longest common subsequence lcslen = 0 # length of longest common subsequence so far for i in xrange(len(xstr)): cs = '' # common subsequence start = 0 # start position in ystr for item in xstr[i:]: index = ystr.find(item, start) # position at the common letter if index != -1: # if common letter has found cs += item # add common letter to the cs start = index + 1 if index == len(ystr) - 1: break # if reached end of the ystr # update lcs and lcslen if found better cs if len(cs) > lcslen: lcs, lcslen = [cs], len(cs) elif len(cs) == lcslen: lcs.append(cs) return lcs file1 = open('/home/saji/file1') file2 = open('/home/saji/file2') xstr = file1.read() ystr = file2.read() start = time.time() lcss = lcs(xstr, ystr) elapsed = (time.time() - start) print elapsed Second one using hash table:
import time from collections import defaultdict def lcs(xstr, ystr): if not (xstr and ystr): return # if strings are empty lcs = [''] # longest common subsequence lcslen = 0 # length of longest common subsequence so far location = defaultdict(list) # keeps track of items in the ystr i = 0 for k in ystr: location[k].append(i) i += 1 for i in xrange(len(xstr)): cs = '' # common subsequence index = -1 reached_index = defaultdict(int) for item in xstr[i:]: for new_index in location[item][reached_index[item]:]: reached_index[item] += 1 if index < new_index: cs += item # add item to the cs index = new_index break if index == len(ystr) - 1: break # if reached end of the ystr # update lcs and lcslen if found better cs if len(cs) > lcslen: lcs, lcslen = [cs], len(cs) elif len(cs) == lcslen: lcs.append(cs) return lcs file1 = open('/home/saji/file1') file2 = open('/home/saji/file2') xstr = file1.read() ystr = file2.read() start = time.time() lcss = lcs(xstr, ystr) elapsed = (time.time() - start) print elapsed Asked By : Sajjad
Answered By : Yuval Filmus
One optimization is to replace the access to reached_index[item] by a local variable which is initialized from the hash table at the beginning of the for item loop, and stored back to the hash table at the end of the loop.
Another optimization which will speed up things is to replace the hash tables with arrays of size 256.
Best Answer from StackOverflow
Question Source : http://cs.stackexchange.com/questions/7851
0 comments:
Post a Comment
Let us know your responses and feedback