I'm not sure whether this question belongs here, and I apologize if it doesn't. What I am looking to do is develop a programmatic way to probabilistically determine whether a given string "belongs" in a bag of strings. For example, if I have a bag of 10,000 US city names and then I have the string "Philadelphia", I would like some quantitative measure of how likely it is that "Philadelphia" is a US city name, based on the US city names I already know. While I know I won't be able to separate real city names from fake city names in this context, I would at least expect strings such as "123.75" and "The quick red fox jumped over the lazy brown dogs" to be excluded given some threshold.
To get started, I've looked at Levenshtein distance and poked around a bit at how it has been applied to problems at least somewhat similar to the one I'm trying to solve. One interesting application I found was plagiarism detection, with one paper describing how Levenshtein distance was used with a modified Smith-Waterman algorithm to score papers based on how likely they were to be a plagiarized version of a given base paper. My question is whether anyone could point me in the right direction to other established algorithms or methodologies that might help. I get the feeling this is a problem someone has tried to solve before, but so far my Google-fu has failed me.
Asked By : Andrew
Answered By : Yuval Filmus
Two better statistics to consider are word length and $n$-gram analysis. For word length, you can collect statistics on the distribution of lengths of city names and compare them to the length of the string you are given. $n$-gram analysis looks at the distribution of sequences of $n$ letters in your sample text (say $n=2$). Both approaches can be combined.
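For concreteness, here is a minimal Python sketch of the statistics-collection step, assuming character bigrams ($n=2$) and a plain list of city-name strings as training data; the function names and the start/end padding markers are illustrative choices, not part of the answer:

```python
from collections import Counter

def char_bigrams(text):
    """Yield overlapping character bigrams, padded with start/end markers."""
    padded = "^" + text.lower() + "$"
    for i in range(len(padded) - 1):
        yield padded[i:i + 2]

def collect_statistics(city_names):
    """Count word lengths and character bigrams over the known city names."""
    length_counts = Counter(len(name) for name in city_names)
    bigram_counts = Counter(bg for name in city_names for bg in char_bigrams(name))
    return length_counts, bigram_counts
```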
Given these statistics, you can use likelihood to get a score which would (hopefully) be higher for your sample data than for other text. To determine a reasonable threshold, you can perform cross-validation. Pick a set of sample phrases which are not city names. Divide the city names into two parts, a large part (say 80%) and a small part (say 20%). Train your model on the large part (that is, collect statistics on the large part), then evaluate it on the small part and on the sample of bad phrases. Determine whether there is a reasonable threshold that passes most city names but only a small number of bad phrases.
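Continuing the sketch above (and reusing its `char_bigrams` and `collect_statistics`), one possible way to turn the bigram counts into a per-string likelihood score and to pick a threshold via the 80/20 split described here might look like this; the add-one smoothing constant and the midpoint threshold rule are assumptions for illustration:

```python
import math
import random

def score(candidate, bigram_counts, smoothing=1.0):
    """Average smoothed log-probability of the candidate's bigrams,
    so long strings are not penalized simply for being long."""
    total = sum(bigram_counts.values())
    vocab = len(bigram_counts) + 1
    bigrams = list(char_bigrams(candidate))
    log_prob = sum(
        math.log((bigram_counts[bg] + smoothing) / (total + smoothing * vocab))
        for bg in bigrams
    )
    return log_prob / max(len(bigrams), 1)

def choose_threshold(city_names, bad_phrases):
    """Train on 80% of the city names, score the held-out 20% and the bad
    phrases, and place the threshold between the two average scores."""
    names = list(city_names)
    random.shuffle(names)
    split = int(0.8 * len(names))
    _, bigram_counts = collect_statistics(names[:split])

    good = [score(s, bigram_counts) for s in names[split:]]
    bad = [score(s, bigram_counts) for s in bad_phrases]
    return (sum(good) / len(good) + sum(bad) / len(bad)) / 2
```

The midpoint rule is a crude placeholder: in practice you would sweep the threshold over the held-out scores and pick the value that gives an acceptable trade-off between accepted city names and rejected bad phrases.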
Best Answer from StackExchange
Question Source : http://cs.stackexchange.com/questions/2777