World's most popular travel blog for travel bloggers.

[Solved]: How does the runtime of the Ukkonen's algorithm depend on the alphabet size?

, , No Comments
Problem Detail: 

I am concerned with the question of the asymptotic running time of the Ukkonen's algorithm, perhaps the most popular algorithm for constructing suffix trees in linear (?) time.

Here is a citation from the book "Algorithms on strings, trees and sequences" by Dan Gusfield (section 6.5.1):

"... the Aho-Corasick, Weiner, Ukkonen and McCreight algorithms all either require $\Theta(m|\Sigma|)$ space, or the $O(m)$ time bound should be replaced with the minimum of $O(m \log m)$ and $O(m \log|\Sigma|)$".

[$m$ is the string length and $\Sigma$ is the size of the alphabet]

I don't understand why that is true.

  • Space: well, in case we represent branches out of the nodes using arrays of size $\Theta(|\Sigma|)$, then, indeed, we end up with $\Theta(m|\Sigma|)$ space usage. However, as far as I can see, it is also possible to store the branches using hash tables (say, dictionaries in Python). We would then have only $\Theta(m)$ pointers stored in all hash tables altogether (since there are $\Theta(m)$ edges in the tree), while still being able to access the children nodes in $O(1)$ time, as fast as when using arrays.
  • Time: as mentioned above, using hash tables allows us to access the outgoing branches of any node in $O(1)$ time. Since the Ukkonen's algorithm requires $O(m)$ operations (including accessing children nodes), the overall running time then would be also $O(m)$.

I would be very grateful to you for any hints on why I am wrong in my conclusions and why Gusfield is right about the dependence of the Ukkonen's algorithm on the alphabet.

Asked By : Mikhail Dubov

Answered By : FrankW

As @jogojapan mentions in the comments, hasing generally is only amortized $O(1)$, so you would only get amortized bounds for the algorithm. However, I think you do not even get these: In order to get amortized $O(1)$ hashing, the hash tables have to be of size $\Omega(\Sigma)$, so you still have $\Theta(m\Sigma)$ space (and the same time requirement for initialization).

What's more, in practice the time for setting up all these hash tables will be much higher than the time to set up arrays.

You might fare better with using a global hash table that is indexed with (node, character)-pairs, but at least the "only amortized" argument will remain.

Best Answer from StackOverflow

Question Source : http://cs.stackexchange.com/questions/9820

0 comments:

Post a Comment

Let us know your responses and feedback