
[Solved]: How to handle missing continuous attribute values in ID3 (Iterative Dichotomiser 3)?

Problem Detail: 

I'm implementing the ID3 algorithm (Iterative Dichotomiser 3). I have an attribute that is continuous (values like 12.21, 3.01, etc.) and also has missing values, which are marked as "NA".

How I'm discretizing the data: I find the split that results in the maximum information gain. How I'm dealing with missing values: I replace each "NA" with the most probable attribute value.

Of course, I can perform these two steps in either order, and this is where my confusion arises. Is there a correct way to handle this?
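The discretization step described in the question can be sketched roughly as follows. This is only an illustration, not the asker's actual code: the names `entropy` and `best_threshold` are hypothetical, and skipping the "NA" rows while searching for the threshold is an assumption (one possible policy among several).

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels, missing="NA"):
    """Return (threshold, gain) for the binary split `value <= threshold`
    with the highest information gain.

    Assumption: rows whose value equals `missing` are simply skipped here.
    Returns (None, 0.0) if no split improves on the unsplit data.
    """
    pairs = sorted(
        ((float(v), y) for v, y in zip(values, labels) if v != missing),
        key=lambda p: p[0],
    )
    ys = [y for _, y in pairs]
    n = len(pairs)
    base = entropy(ys)
    best = (None, 0.0)
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # equal values cannot be separated by a threshold
        gain = base - (i / n) * entropy(ys[:i]) - ((n - i) / n) * entropy(ys[i:])
        if gain > best[1]:
            best = ((lo + hi) / 2, gain)  # midpoint between neighbours
    return best

values = [12.21, 3.01, "NA", 7.5, 2.0]
labels = ["yes", "no", "yes", "yes", "no"]
threshold, gain = best_threshold(values, labels)
```

Running the same search once on the raw data and once after filling in the "NA" entries is a simple way to see whether the order of the two steps changes the chosen threshold.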

Asked By : pierop

Answered By : Evil

I would like to suggest a paper about ID3 and its successors, and about decision tree algorithms in general.

Using the mean, median, mode, etc. is very tempting and works to some degree, but of course the outcome depends on the values inserted in place of the missing (NA) data.

The mean has a nice property in many statistics: an inserted mean acts much like a missing value, because it leaves the mean itself unchanged and only increases the weight of the other values (they are counted with +1/N weight).
In decision trees the effect is bigger, because the inserted values can change the classifier itself, so there is one big idea: try all possible values for the missing entries :-/.
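As a small illustration of the point about the mean, here is a minimal sketch of mean imputation. The helper name `impute_mean` is hypothetical, and it assumes missing entries are marked with the string "NA" and that at least one value is observed.

```python
def impute_mean(values, missing="NA"):
    """Replace every `missing` marker with the mean of the observed values."""
    observed = [float(v) for v in values if v != missing]
    mean = sum(observed) / len(observed)  # assumes at least one observed value
    return [mean if v == missing else float(v) for v in values]

filled = impute_mean([1.0, "NA", 3.0, "NA"])
# The mean of `filled` equals the mean of the observed values alone.
```

The mean stays the same, but the filled-in rows still carry class labels, so they can move the split point a decision tree picks for this attribute.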

There are also three easier techniques:

  • apply mean and do not care
  • reconstruct data to fit the classifier better (very often trial and error, but because the continuous data is discretized, only values that differ by a multiple of $\epsilon$ need to be checked)
  • try to reconstruct data

The last one should yield the best results, but it is not always possible, and even then the reconstructed values are not exact.

If you can predict the most probable value and use it to replace the missing ones, this is the best way to do it.
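One simple way to approximate "the most probable value" is to use the class label of each training row. The sketch below is an assumption on my part, not a method named in the answer: it fills a missing entry with the mean of the observed values that share the row's class, and falls back to the overall mean when that class has no observed values. The name `impute_by_class` is hypothetical.

```python
from collections import defaultdict
from statistics import mean

def impute_by_class(values, labels, missing="NA"):
    """Fill `missing` entries with the mean of observed values of the same class.

    Assumption: class-conditional mean stands in for 'most probable value'.
    Requires at least one observed value overall.
    """
    by_class = defaultdict(list)
    for v, y in zip(values, labels):
        if v != missing:
            by_class[y].append(float(v))
    overall = mean(x for xs in by_class.values() for x in xs)

    filled = []
    for v, y in zip(values, labels):
        if v != missing:
            filled.append(float(v))
        elif by_class.get(y):
            filled.append(mean(by_class[y]))  # same-class observations exist
        else:
            filled.append(overall)  # fallback: no observed value for this class
    return filled

filled = impute_by_class([2.0, 4.0, "NA", 10.0, "NA"],
                         ["no", "no", "no", "yes", "maybe"])
```

Note that this only works on training data, where the class is known; at prediction time a different rule (for example the overall mean) would be needed.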

Best Answer from StackOverflow
