World's most popular travel blog for travel bloggers.

[Solved]: How to find longest recurring pattern from lage string data set?

, , No Comments
Problem Detail: 

I need to find the substring that is from a 100,000 characters this substring must be most repeated and it need to be longest substring for example

TYUFRIETEYM0SQZLHBCTN0W1KA9HELAT4LTQ14W7ZW484GSK1XTNOBJ2R6AMGW9KU36G7ITMPF315Y7ESYPR1XE2C1953J0DXUNBJLNTDG7IHS63854SGSS7YDEFJYSFP0DLL54GK6NUZ5UU5FRIETEYCPNGHIJOX23QOVSCBYHKE7HRIETEYV0H49I5SX9CW967CDGKX3TYCVNVBNCFGGDGDGDDFIIPGDSDVGDDSRGDGVCZAQRIOPKLMVFGCDGDTYGSDCBGDUSLVAQEFCGDGRIETEYDGDFG

In above character set there is two substring one is GD and other is RIETEY algorithm able to identify RIETEY because it is longest the substring, also pattern must occur at least twice to be considered a recurrence and patterns will not overlap.

I found a algorithm but it only work for less that 100,000 characters

any suggestion for this problem ?

Asked By : Nuwan Indika

Answered By : Raphael

This problem is well-studied; it's aptly called longest repeated substring problem.

It can be solved in linear time by creating a suffix tree with Ukkonen's algorithm; the longest repeat corresponds to the labelling of the longest path from the root to an inner node which you find using breadth-first search.

This does not exclude overlapping substrings. Keep track of the smallest $m$ and largest $M$ starting index of sequences nodes represent while creating the tree; a node at depth $n$ (counted in symbols on the path) represents a non-overlapping repeat if and only if $M - m \geq n$.

Best Answer from StackOverflow

Question Source : http://cs.stackexchange.com/questions/55575

 Ask a Question

 Download Related Notes/Documents

0 comments:

Post a Comment

Let us know your responses and feedback