I'm thinking about the optimal algorithm for the following problem:
Input data:
- a text, say an article of 5-50 pages.
- a set of n-grams (n-gram strings with n > 2) of arbitrary length; the set could contain more than 20k n-grams.
The algorithm should output the following:
- a dictionary of all n-grams found in the text together with their counts; it should also take into account that n-grams can partially overlap or contain one another (like 'probability density', 'probability density function', 'probability density distribution').
So the question is what would be the most time-efficient algorithm to compute this?
All words, both in the text and in the n-grams, are reduced to their canonical forms.
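To make the requirement about nested n-grams concrete, here is a minimal brute-force sketch (not the asked-for efficient algorithm, just an illustration of the desired output): each n-gram is counted independently by sliding a window over the word list, so a phrase and a longer phrase containing it are both reported.

```python
def naive_counts(words, ngrams):
    """Brute-force counter: slide a window over the word list for each n-gram.

    Nested n-grams are counted independently, so 'probability density' is
    counted even where it occurs inside 'probability density function'.
    """
    counts = {}
    for g in ngrams:
        gw = g.split()
        k = len(gw)
        counts[g] = sum(
            words[i:i + k] == gw for i in range(len(words) - k + 1)
        )
    return counts

words = "the probability density function of x".split()
print(naive_counts(words, ["probability density",
                           "probability density function"]))
# both n-grams are counted once, including the nested pair
```

This runs in O(|text| * |ngrams|) word comparisons, which is exactly what a time-efficient algorithm should improve on for 20k+ n-grams.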
Asked By : dragoon
Answered By : Raphael
What you want to solve is a string matching problem. Wikipedia (and any textbook on the matter) contains a rich list of algorithms with their respective runtimes.
Be aware that these are worst-case runtimes. Different algorithms behave differently on natural-language text: some perform better because of the large alphabet, others worse because of its repetitive nature. Therefore, you should benchmark the candidate algorithms on data similar to what you expect to process.
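Since the words are already canonicalized, one practical option (an assumption on my part, not part of the answer above) is to match at the word level rather than the character level: index the n-grams by their first word, then at each text position test only the pattern lengths that can start there. This is a simplified, dictionary-based variant of multi-pattern matching in the spirit of Aho-Corasick.

```python
from collections import Counter

def count_ngrams(text_words, ngrams):
    """Count occurrences of each n-gram (a space-separated word sequence).

    Overlapping and nested n-grams are counted independently.
    """
    patterns = set()           # all n-grams as word tuples
    lengths_by_first = {}      # first word -> set of pattern lengths
    for g in ngrams:
        words = tuple(g.split())
        patterns.add(words)
        lengths_by_first.setdefault(words[0], set()).add(len(words))

    counts = Counter()
    for i, w in enumerate(text_words):
        # Only lengths of patterns that begin with this word are tried.
        for k in lengths_by_first.get(w, ()):
            candidate = tuple(text_words[i:i + k])
            if len(candidate) == k and candidate in patterns:
                counts[" ".join(candidate)] += 1
    return counts

words = "the probability density function and the probability density".split()
print(count_ngrams(words, ["probability density",
                           "probability density function"]))
```

Per text position this does at most one hash lookup per distinct pattern length starting with that word, so the cost is roughly O(|text| * L) where L is the number of distinct lengths per first word; for a final choice you should still benchmark against a proper Aho-Corasick implementation on your own data, as the answer suggests.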
Best Answer from Computer Science Stack Exchange
Question Source : http://cs.stackexchange.com/questions/8972