
# Classifier optimization

Problem Detail:

Suppose we have a set E of entities. Each entity is described by a set P of binary properties (i.e. each element e of E has a defined true/false value for each element p of P).

|P| >> |E|

We now want to select a subset of fixed size (e.g., 10) of P that will enable us to distinguish the elements of E as accurately as possible.

In extensions of this basic scenario, the properties could assume numeric values.

Has this problem been studied? Under what name?

A practical example: there is a set of 100 bacterial species, each characterized by the presence or absence of 1000 genes. We now want to select a subset of 10 genes that, upon typing a novel sample, will tell us which bacterial species that sample represents.

If you need the optimal answer, the best solution I know is exhaustive search: try all ${|P| \choose 10}$ different subsets, and see which is best. That is $O(|P|^{10})$ subsets to evaluate, though, which is probably too many to be feasible.
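A minimal sketch of the exhaustive approach, under some assumptions not in the question: entities are tuples of 0/1 values, and "best" is measured by how many distinct signatures (value patterns on the chosen properties) the subset induces, i.e. how many entities it can tell apart. The subset size is kept small here since the search is exponential in it.

```python
from itertools import combinations

def distinguished(entities, subset):
    # Number of distinct signatures the chosen properties induce on E;
    # |E| means the subset separates every entity from every other.
    return len({tuple(e[p] for p in subset) for e in entities})

def best_subset_exhaustive(entities, num_props, k):
    # Try every k-subset of properties and keep the one that
    # separates the most entities (ties broken arbitrarily).
    return max(combinations(range(num_props), k),
               key=lambda s: distinguished(entities, s))

# Toy example: 4 entities over 5 binary properties, subsets of size 2.
E = [(1, 0, 0, 1, 0),
     (1, 1, 0, 0, 0),
     (0, 1, 1, 0, 1),
     (0, 0, 1, 1, 1)]
best = best_subset_exhaustive(E, 5, 2)
```

Any scoring function can be swapped in for `distinguished`; the structure of the search is the same.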

Given this, you will probably need to accept heuristic solutions that are not guaranteed to give an absolutely optimal answer.

One standard approach is to use a greedy algorithm: you iteratively build up a set of properties, one by one. At each step, if your set is currently $S$, you choose the property $p$ that makes $S \cup \{p\}$ as accurate as possible, and then add $p$ to $S$. To turn this into a full algorithm, you need to decide how you want to measure/evaluate each candidate $S \cup \{p\}$.
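The greedy scheme above could be sketched as follows. As one possible choice of evaluation metric (an assumption, not the only option), each candidate set is scored by how many entities it separates; information gain, discussed below, would work just as well.

```python
def distinguished(entities, subset):
    # Number of distinct signatures the chosen properties induce on E.
    return len({tuple(e[p] for p in subset) for e in entities})

def greedy_select(entities, num_props, k):
    # Grow S one property at a time: at each step add the property p
    # that makes S ∪ {p} score highest under the chosen metric.
    chosen = []
    for _ in range(k):
        candidates = [p for p in range(num_props) if p not in chosen]
        chosen.append(max(candidates,
                          key=lambda p: distinguished(entities, chosen + [p])))
    return chosen

# Same toy data as before: 4 entities over 5 binary properties.
E = [(1, 0, 0, 1, 0),
     (1, 1, 0, 0, 0),
     (0, 1, 1, 0, 1),
     (0, 0, 1, 1, 1)]
S = greedy_select(E, 5, 2)
```

Each greedy step costs $O(|P|)$ evaluations, so the whole run is $O(k \cdot |P|)$ evaluations instead of $O(|P|^k)$, at the price of possibly missing the optimum.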

For comparison, you can also look at the ID3 algorithm. Rather than trying to pick a set of size 10, it tries to pick a decision tree of depth 10, so it's not solving exactly the same problem: but it is similar. The metric used at each step to evaluate the candidates is the information gain; you could do the same, but for a set rather than a tree.
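For a set rather than a tree, the information gain of a property subset can be computed by grouping entities that share the same signature on the subset. This sketch assumes each entity carries a class label (the species, in the bacteria example); the data here is illustrative.

```python
from collections import Counter
from math import log2

def entropy(labels):
    # Shannon entropy of the label distribution, in bits.
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(entities, labels, subset):
    # H(labels) minus the expected entropy of labels within each
    # group of entities sharing the same signature on `subset`.
    groups = {}
    for e, y in zip(entities, labels):
        groups.setdefault(tuple(e[p] for p in subset), []).append(y)
    n = len(labels)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Toy data: 4 entities, 5 binary properties, 4 distinct classes.
E = [(1, 0, 0, 1, 0),
     (1, 1, 0, 0, 0),
     (0, 1, 1, 0, 1),
     (0, 0, 1, 1, 1)]
labels = ["a", "b", "c", "d"]
```

Here properties 0 and 1 together fully determine the label, so the subset `(0, 1)` achieves the maximum possible gain of $H(\text{labels}) = 2$ bits, while property 0 alone gains only 1 bit.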

In the machine learning literature, there is a lot of work on feature selection: given a large number of possible features, the goal is to pick a subset of the features that makes the classifier as accurate as possible. You could explore that literature and see if any of those methods are effective in your domain.
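The simplest family of feature-selection methods are univariate "filter" methods: score each property on its own against the labels and keep the top $k$. A hypothetical scoring rule is used here (how many differently-labeled entity pairs a property tells apart); libraries such as scikit-learn offer ready-made versions of this idea (e.g. `SelectKBest`), along with wrapper and embedded methods that consider properties jointly.

```python
from itertools import combinations

def pairs_separated(entities, labels, p):
    # Score one property by how many differently-labeled entity
    # pairs it distinguishes (a simple filter criterion).
    return sum(1
               for (e1, y1), (e2, y2) in combinations(zip(entities, labels), 2)
               if y1 != y2 and e1[p] != e2[p])

def top_k_filter(entities, labels, num_props, k):
    # Keep the k highest-scoring properties.
    return sorted(range(num_props),
                  key=lambda p: pairs_separated(entities, labels, p),
                  reverse=True)[:k]

# Toy data as before.
E = [(1, 0, 0, 1, 0),
     (1, 1, 0, 0, 0),
     (0, 1, 1, 0, 1),
     (0, 0, 1, 1, 1)]
labels = ["a", "b", "c", "d"]
top2 = top_k_filter(E, labels, 5, 2)
```

Filter methods are fast but ignore interactions between properties: two individually weak properties can be jointly very informative, which is exactly what the greedy and exhaustive approaches above can capture.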