I'm thinking about the optimal algorithm for the following problem:
Input data:
- a text, say an article of 5-50 pages.
- a set of n-grams (n-gram strings with n > 2) of arbitrary length; the set could contain more than 20k n-grams.
The algorithm should output the following:
- a dictionary of all n-grams found in the text together with their counts; it should also take into account that n-grams can partially overlap or contain one another (like 'probability density', 'probability density function', 'probability density distribution').
So the question is what would be the most time-efficient algorithm to compute this?
All words, both in the text and in the n-grams, are reduced to their canonical forms.
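To make the requirement about nested n-grams concrete, here is a minimal brute-force sketch (not the asked-for efficient algorithm, just an illustration of the desired output): each n-gram is counted independently by sliding a window over the word list, so a phrase and a longer phrase containing it are both reported.

```python
def naive_counts(words, ngrams):
    """Brute-force counter: slide a window over the word list for each n-gram.

    Nested n-grams are counted independently, so 'probability density' is
    counted even where it occurs inside 'probability density function'.
    """
    counts = {}
    for g in ngrams:
        gw = g.split()
        k = len(gw)
        counts[g] = sum(
            words[i:i + k] == gw for i in range(len(words) - k + 1)
        )
    return counts

words = "the probability density function of x".split()
print(naive_counts(words, ["probability density",
                           "probability density function"]))
# both n-grams are counted once, including the nested pair
```

This runs in O(|text| * |ngrams|) word comparisons, which is exactly what a time-efficient algorithm should improve on for 20k+ n-grams.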
Asked By : dragoon
Answered By : Raphael
What you want to solve is a string matching problem. Wikipedia (and any textbook on the matter) contains a rich list of algorithms with their respective runtimes.
Be aware that these are worst-case runtimes. Different algorithms behave differently on natural-language text: some perform better because of the large alphabet, others worse because of its repetitive nature. Therefore, you should benchmark the candidate algorithms on data similar to what you expect to process.
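Since the words are already canonicalized, one practical option (an assumption on my part, not part of the answer above) is to match at the word level rather than the character level: index the n-grams by their first word, then at each text position test only the pattern lengths that can start there. This is a simplified, dictionary-based variant of multi-pattern matching in the spirit of Aho-Corasick.

```python
from collections import Counter

def count_ngrams(text_words, ngrams):
    """Count occurrences of each n-gram (a space-separated word sequence).

    Overlapping and nested n-grams are counted independently.
    """
    patterns = set()           # all n-grams as word tuples
    lengths_by_first = {}      # first word -> set of pattern lengths
    for g in ngrams:
        words = tuple(g.split())
        patterns.add(words)
        lengths_by_first.setdefault(words[0], set()).add(len(words))

    counts = Counter()
    for i, w in enumerate(text_words):
        # Only lengths of patterns that begin with this word are tried.
        for k in lengths_by_first.get(w, ()):
            candidate = tuple(text_words[i:i + k])
            if len(candidate) == k and candidate in patterns:
                counts[" ".join(candidate)] += 1
    return counts

words = "the probability density function and the probability density".split()
print(count_ngrams(words, ["probability density",
                           "probability density function"]))
```

Per text position this does at most one hash lookup per distinct pattern length starting with that word, so the cost is roughly O(|text| * L) where L is the number of distinct lengths per first word; for a final choice you should still benchmark against a proper Aho-Corasick implementation on your own data, as the answer suggests.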
Best Answer from Computer Science Stack Exchange
Question Source : http://cs.stackexchange.com/questions/8972