I have a text file with all possible 5-grams (26^5 = 11.881.376) organized in rows as:
aaaaa
aaaab
aaaac
aaaad
.....
and I have a txt file (organized in rows) with all English words. I have to find how much time every 5-gram is at the beginning of a word. I'm using this code:
def frequency(five-grams, dict): result = {} for gram in five-grams: freq = 0 for word in dict: if word.startswith(gram): freq = freq + 1 result[gram] = freq return result
However this method takes too long (after hours the algorithm is still running). Is there a more efficient and faster method?
Asked By : Zirko88
Answered By : Tom van der Zanden
There is a 11.881.376 times faster method. Instead of for every 5-gram looping over the entire dictionary, loop over the dictionary once, and for every word determine with what 5-gram it starts and then increment the counter for that particular 5-gram.
Best Answer from StackOverflow
Question Source : http://cs.stackexchange.com/questions/37277
0 comments:
Post a Comment
Let us know your responses and feedback