[Solved]: What are the k characters which make the most complete words?

Problem Detail: 

Given a word list of $N$ words formed from a language of $M$ characters, where each word is composed of $n \geq 1$ not necessarily distinct characters, how can I find the best set of $k < M$ characters to learn, where "best" is defined as the set of $k$ characters with which the most complete words can be spelled? I.e., if I know these $k$ characters, I can spell $N_k$ words. What is the maximum $N_k$ for each $k$, and which characters should I choose?

Checking a given set of $k$ characters is equivalent to a Scrabble problem, but searching the space of candidate sets becomes very hard very fast. This problem is of limited interest in English, where $M = 26$, but is more important in Chinese or Japanese, where $M \sim 10^3$.
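(For concreteness, checking a single word is just a subset test, since a known character can be reused freely, unlike a Scrabble tile. A minimal sketch:

    def spellable(word, known_chars):
        # A word is spellable iff every distinct character in it is known;
        # repetitions cost nothing, unlike Scrabble tiles.
        return set(word) <= known_chars

    # N_k for a candidate set of characters would then be:
    # n_k = sum(spellable(w, chosen) for w in words)

The hard part is choosing which $k$ characters to test, not the test itself.)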

I thought of considering the bipartite graph of characters and the words they form, but I'm not sure what the best search strategy is. I'm somewhat pessimistic that this problem is exactly solvable in reasonable time, so I am willing to try heuristic or stochastic methods.

Asked By : mmdanziger

Answered By : D.W.

As Lopsy states, your problem is NP-hard, so you shouldn't expect an efficient algorithm for your problem.

Your problem is an instance of the maximum coverage problem (which is also NP-hard). Framing it that way may help you find existing algorithms that are useful in your setting.

There are standard algorithms for the maximum coverage problem. In particular, there is a natural greedy algorithm, which achieves an approximation ratio of $1 - 1/e \approx 0.63$.
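For your variant, the natural greedy step is to repeatedly add the character that completes the most additional words. Here is a minimal sketch (function and variable names are my own, and note that the $1 - 1/e$ guarantee for standard maximum coverage does not automatically carry over to this variant):

    from collections import Counter

    def greedy_chars(words, k):
        """Pick k characters greedily: at each step, take the character
        that newly completes the most words (a heuristic sketch)."""
        remaining = [set(w) for w in words]   # words not yet fully spellable
        chosen = set()
        for _ in range(k):
            gains = Counter()
            for s in remaining:
                missing = s - chosen
                if len(missing) == 1:         # one more character finishes this word
                    gains[next(iter(missing))] += 1
            if gains:
                best, _ = gains.most_common(1)[0]
            else:
                # No single character completes a word outright; fall back to
                # the character appearing in the most unfinished words.
                freq = Counter(c for s in remaining for c in s - chosen)
                if not freq:
                    break                     # everything is already spellable
                best, _ = freq.most_common(1)[0]
            chosen.add(best)
            remaining = [s for s in remaining if not s <= chosen]
        return chosen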

Perhaps a better approach is to formulate this as an instance of integer linear programming (ILP) and then throw an off-the-shelf ILP solver at it (e.g., lp_solve, CPLEX). See Wikipedia for a description of how to do that. For your English example (alphabet size 26), it's possible this will give you an optimal solution, but for the Chinese-language problem you shouldn't expect it to terminate with a provably optimal solution within your lifetime. Instead, I suggest you let it run for a bounded time and take the best it can do within that bound. Standard ILP solvers let you specify a time bound: e.g., run for 5 minutes and then give me the best solution found so far.
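For concreteness, here is one way the standard formulation could look using the PuLP modeling library with its bundled CBC solver (my choice of front end; any MILP solver works the same way): a binary $x_j$ marks character $j$ as learned, a binary $y_i$ marks word $i$ as spellable, and $y_i \le x_j$ for every character $j$ appearing in word $i$.

    import pulp

    def best_k_chars_ilp(words, k, time_limit=300):
        """Maximize the number of fully spellable words using at most
        k characters, solved with a time bound (a sketch)."""
        char_sets = [set(w) for w in words]
        chars = sorted(set().union(*char_sets))
        idx = {c: j for j, c in enumerate(chars)}

        prob = pulp.LpProblem("k_characters", pulp.LpMaximize)
        # x[j] = 1 if character j is learned; y[i] = 1 if word i is spellable.
        x = [pulp.LpVariable(f"x{j}", cat="Binary") for j in range(len(chars))]
        y = [pulp.LpVariable(f"y{i}", cat="Binary") for i in range(len(words))]

        prob += pulp.lpSum(y)              # objective: number of words spelled
        prob += pulp.lpSum(x) <= k         # learn at most k characters
        for i, s in enumerate(char_sets):
            for c in s:
                prob += y[i] <= x[idx[c]]  # word i needs every one of its characters

        # Bounded-time solve (timeLimit is the recent-PuLP keyword), per the
        # advice above: stop after time_limit seconds and keep the best found.
        prob.solve(pulp.PULP_CBC_CMD(timeLimit=time_limit, msg=False))
        return {c for c in chars if x[idx[c]].value() > 0.5}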

You will probably want to apply the following simple optimization. Replace each word with the set of letters it contains (so order doesn't matter and repetitions are ignored). Then, de-dup these sets, and include a weight for each set that counts how many times it appears. Thus, GRIN and RING map to the same set, {G,I,N,R}. This will reduce the problem size and might make the ILP solver run faster.
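A minimal sketch of that reduction in Python:

    from collections import Counter

    # Collapse each word to its set of distinct characters; identical sets
    # merge into one weighted entry, e.g., GRIN and RING both map to
    # frozenset({'G', 'I', 'N', 'R'}) with weight 2.
    def dedup(words):
        return Counter(frozenset(w) for w in words)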

Depending upon your application, you might also want to weight the words by their frequency of occurrence in natural language and then solve the weighted maximum coverage problem, so that you place an extra emphasis on covering common words and a lower priority on covering rare words.
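Assuming per-word frequencies are available in a dict `freq` (a hypothetical input), the de-dup step above generalizes directly: each distinct character set accumulates the total frequency of the words that map to it, and those weights become the coefficients on the $y_i$ variables in the ILP objective.

    from collections import Counter

    def dedup_weighted(words, freq):
        """Weight each distinct character set by the total corpus
        frequency of the words that reduce to it."""
        weights = Counter()
        for w in words:
            weights[frozenset(w)] += freq.get(w, 1)
        return weights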

You can also search the literature for other algorithms for the maximum coverage problem, but I suspect the pragmatic answer is going to be to use ILP with a reasonable time bound.

Question Source : http://cs.stackexchange.com/questions/41850
