World's most popular travel blog for travel bloggers.

Dynamic programming table for finding similar substrings is too large

, , No Comments
Problem Detail: 

Substring Diff
Given two strings of length $n$, $P = p_1\dots p_n$ and $Q = q_1 \dots q_n$, we define $M(i, j, L)$ as the number of mismatches between $p_i \dots p_{i+L-1}$ and $q_j \dots q_{j+L-1}$. In set notation, $M(i, j, L)$ refers to the size of the set $\{0 \leq x < L \mid p_{i + x} \neq q_{j + x}\}$.

Given an integer $K$, your task is to find the maximum length $L$ such that there exists pair of indices $(i,j)$ for which we have $M(i, j, L) \leq K$. Of course, we should also have $i + L - 1 \leq n$ and $j + L - 1 \leq n$.

Constraints

  • $0 \leq K \leq |P|$
  • Both $P$ & $Q$ would have the same length
  • The size of each of the string would be at the max 1500
  • All characters in $P$ and $Q$ are lower-case English letters.

The recursive function will have the form:

longest(string1, string2, allowed_mismatches) =      {         ... (something :P )     } 

The state space then has size $K^3$. With an upper bound on $K$ of 1500, the running time and space usage will be terrible... So direct dynamic programming will not work without some additional property to reduce the state space.

Ideas?

UPDATE

Using the ideas suggested by both Yuval and Vor, I came up with the following solution that works like a charm, running in $O(K^2)$ time and using $K$ space.

def longest_range_min_sum(str1, str2, start1, start2, slice_size, max_sum):     longest = 0     i = 0     running_sum = 0     while i + longest < slice_size:         if str1[start1 + i + longest] != str2[start2 + i + longest]:             running_sum += 1         if running_sum > max_sum:             if str1[start1 + i] != str2[start2 + i]:                 running_sum -= 1             i += 1         else:             longest += 1     return longest  import sys  data = sys.stdin.readlines() num_cases = int(data.pop(0)) for ignore in xrange(num_cases):     max_mismatches, str1, str2 = data.pop(0).split()     max_mismatches = int(max_mismatches)     m = n = len(str1)     longest = 0     for i in xrange(m + n + 1):         if i > n:             slice_size = m - (i - n)         else:             slice_size = min(i, m)         if slice_size == 0:             continue         end1 = max(m, m - i)         if i > n:             end1 = m - (i - n)         start1 = end1 - slice_size         end2 = min(i, n)         start2 = end2 - slice_size         #print zeros_and_ones          #print str1[start1:end1], ' - ', str2[start2:end2]         longest_in_sub = longest_range_min_sum(str1, str2, start1, start2, slice_size, max_mismatches)         #print longest_in_sub         longest = max(longest, longest_in_sub)     print longest 
Asked By : Alexandre

Answered By : Yuval Filmus

One can reduce your problem to the following. Given a sequence of $N$ numbers, find a contiguous subsequence of length at most $K+1$ having maximal sum. This problem, in turn, is solvable in time $O(N)$.

What's the connection between your problem and mine? Let the positions of the mistakes be $I_1,\ldots,I_t$, and add $I_0 = 0$, $I_{t+1} = N+1$. The sequence in question is $J_1 = I_1 - I_0,\ldots,J_{t+1}=I_{t+1}-I_t$. Every $K+1$ consecutive numbers $J_a,\ldots,J_{a+K}$ correspond to a maximal solution for your problem of length $J_a + \cdots + J_{a+K} - 1$. The entire algorithm takes linear time.

Edit: This calculates the maximal $L$ such that $M(i,i,L) \leq K$ for some $i$. The actual problem wanted to find the maximal $L$ such that $M(i,j,L) \leq K$ for some $i,j$. By considering all possible shifts, we can solve this in $O(N^2)$ time and $O(K)$ space.

Best Answer from StackOverflow

Question Source : http://cs.stackexchange.com/questions/4994

0 comments:

Post a Comment

Let us know your responses and feedback