I need to find the substring that is from a 100,000 characters this substring must be most repeated and it need to be longest substring for example
TYUFRIETEYM0SQZLHBCTN0W1KA9HELAT4LTQ14W7ZW484GSK1XTNOBJ2R6AMGW9KU36G7ITMPF315Y7ESYPR1XE2C1953J0DXUNBJLNTDG7IHS63854SGSS7YDEFJYSFP0DLL54GK6NUZ5UU5FRIETEYCPNGHIJOX23QOVSCBYHKE7HRIETEYV0H49I5SX9CW967CDGKX3TYCVNVBNCFGGDGDGDDFIIPGDSDVGDDSRGDGVCZAQRIOPKLMVFGCDGDTYGSDCBGDUSLVAQEFCGDGRIETEYDGDFG
In above character set there is two substring one is GD and other is RIETEY algorithm able to identify RIETEY because it is longest the substring, also pattern must occur at least twice to be considered a recurrence and patterns will not overlap.
I found a algorithm but it only work for less that 100,000 characters
any suggestion for this problem ?
Asked By : Nuwan Indika
Answered By : Raphael
This problem is well-studied; it's aptly called longest repeated substring problem.
It can be solved in linear time by creating a suffix tree with Ukkonen's algorithm; the longest repeat corresponds to the labelling of the longest path from the root to an inner node which you find using breadth-first search.
This does not exclude overlapping substrings. Keep track of the smallest $m$ and largest $M$ starting index of sequences nodes represent while creating the tree; a node at depth $n$ (counted in symbols on the path) represents a non-overlapping repeat if and only if $M - m \geq n$.
Best Answer from StackOverflow
Question Source : http://cs.stackexchange.com/questions/55575
0 comments:
Post a Comment
Let us know your responses and feedback