1 June, 2020 # DECISION TREE LEARNING

Decision tree learning is a standout amongst the most generally utilized and pragmatic techniques for
inductive derivation. It is a technique for approximating discrete-esteemed capacities that
is strong to boisterous information and equipped for learning disjunctive articulations. This part
portrays a group of decision tree learning calculations that incorporates generally utilized
calculations, for example, ID3, Collaborator, and C4.5. These decision tree learning techniques
look through a totally expressive speculation space and along these lines maintain a strategic distance from the troubles
of limited theory spaces. Their inductive inclination is an inclination for little trees
over huge trees.

## INTRODUCTION (decision tree learning)

Decision tree learning is a strategy for approximating discrete-esteemed target capacities,
in which the scholarly capacity is spoken to by a decision tree. Learned trees
can likewise be re-spoken to as sets of if-then standards to enhance human clarity.
These learning techniques are among the most well known of inductive surmising calculations
what’s more, have been effectively connected to a wide scope of undertakings from learning
to analyze therapeutic cases to figuring out how to survey credit danger of advance candidates.

## THE BASIC DECISION TREE LEARNING ALGORITHM

Which Attribute Is the Best Classifier ?
The central choice in the ID3 algorithm is selecting which attribute to test at
each node in the tree. We would like to select the attribute that is most useful
for classifying examples. What is a good quantitative measure of the worth of
an attribute? We will define a statistical property, called informution gain, that
measures how well a given attribute separates the training examples according to
their target classification. ID3 uses this information gain measure to select among
the candidate attributes at each step while growing the tree.

## INDUCTIVE BIAS IN DECISION TREE LEARNING

What is the policy by which ID3 generalizes from observed training examples
to classify unseen instances? In other words, what is its inductive bias? that inductive bias is the set of assumptions that, together with
the training data, deductively justify the classifications assigned by the learner to
future instances.

Given a collection of training examples, there are typically many decision
trees consistent with these examples. Describing the inductive bias of ID3 therefore
consists of describing the basis by which it chooses one of these consistent
hypotheses over the others. Which of these decision trees does ID3 choose?
It chooses the first acceptable tree it encounters in its simple-to-complex, hillclimbing
search through the space of possible trees. Roughly speaking, then, the
ID3 search strategy (a) selects in favor of shorter trees over longer ones, and
(b) selects trees that place the attributes with highest information gain closest to
the root.

Because of the subtle interaction between the attribute selection heuristic
used by ID3 and the particular training examples it encounters, it is difficult
to characterize precisely the inductive bias exhibited by ID3. However, we can
approximately characterize its bias as a preference for short decision trees over
complex trees.

Approximate inductive bias of ID3: Shorter trees are preferred over larger trees.
In fact, one could imagine an algorithm similar to ID3 that exhibits precisely
this inductive bias. Consider an algorithm that begins with the empty tree and
searches breadth First through progressively more complex trees, first considering
all trees of depth 1, then all trees of depth 2, etc.
Once it finds a decision tree

consistent with the training data, it returns the smallest consistent tree at that
search depth (e.g., the tree with the fewest nodes). Let us call this breadth-first
search algorithm BFS-ID3. BFS-ID3 finds a shortest decision tree and thus exhibits
precisely the bias “shorter trees are preferred over longer trees.” ID3 can be
viewed as an efficient approximation to BFS-ID3, using a greedy heuristic search
to attempt to find the shortest tree without conducting the entire breadth-first
search through the hypothesis space.

Because ID3 uses the information gain heuristic and a hill climbing strategy,
it exhibits a more complex bias than BFS-ID3. In particular, it does not always
find the shortest consistent tree, and it is biased to favor trees that place attributes
with high information gain closest to the root.

A closer approximation to the inductive bias of ID3: Shorter trees are preferred
over longer trees. Trees that place high information gain attributes close to the root
are preferred over those that do not.

Restriction Biases and Preference Biases (decision tree learning)
There is an interesting difference between the types of inductive bias exhibited
by ID3.

Consider the difference between the hypothesis space search in these two approaches:
ID3 searches a complete hypothesis space (i.e., one capable of expressing
any finite discrete-valued function). It searches incompletely through this
space, from simple to complex hypotheses, until its termination condition is
met (e.g., until it finds a hypothesis consistent with the data). Its inductive
bias is solely a consequence of the ordering of hypotheses by its search
strategy. Its hypothesis space introduces no additional bias.

0 The version space CANDIDATE-ELIMINATION algorithm searches an incomplete
hypothesis space (i.e., one that can express only a subset of the potentially
teachable concepts). It searches this space completely, finding every
hypothesis consistent with the training data. Its inductive bias is solely a
consequence of the expressive power of its hypothesis representation. Its
search strategy introduces no additional bias.

In brief, the inductive bias of ID3 follows from its search strategy, whereas
the inductive bias of the CANDIDATE-ELIMINATION algorithm follows from the definition
of its search space.

The inductive bias of ID3 is thus a preference for certain hypotheses over
others (e.g., for shorter hypotheses), with no hard restriction on the hypotheses that
can be eventually enumerated. This form of bias is typically called a preference
bias (or, alternatively, a search bias). In contrast, the bias of the CANDIDATE

ELIMINATION algorithm of decision tree learning is in the form of a categorical restriction on the set of
hypotheses considered. This form of bias is typically called a restriction bias (or,
alternatively, a language bias).
Given that some form of inductive bias is required in order to generalize
beyond the training data.

### which type of inductive bias shall weprefer; a preference bias or restriction bias?

Typically, a preference bias is more desirable than a restriction bias, because
it allows the learner to work within a complete hypothesis space that is
assured to contain the unknown target function. In contrast, a restriction bias that
strictly limits the set of potential hypotheses is generally less desirable, because
it introduces the possibility of excluding the unknown target function altogether.
Whereas ID3 exhibits a purely preference bias and CANDIDATE-ELIMINATION
a purely restriction bias, some learning systems combine both. Consider, for example,
the program described for learning a numerical evaluation
function for game playing.

In this case, the learned evaluation function is represented
by a linear combination of a fixed set of board features, and the learning
algorithm adjusts the parameters of this linear combination to best fit the available
training data. In this case, the decision to use a linear function to represent the evaluation
function introduces a restriction bias (nonlinear evaluation functions cannot
be represented in this form). At the same time, the choice of a particular parameter
tuning method (the LMS algorithm in this case) introduces a preference bias stemming
from the ordered search through the space of all possible parameter values.

## Why Prefer Short Hypotheses in decision tree learning?

Is ID3’s inductive bias favoring shorter decision trees a sound basis for generalizing
beyond the training data? Philosophers and others have debated this question
for centuries, and the debate remains unresolved to this day. William of Occam
was one of the first to discusst the question, around the year 1320, so this bias
often goes by the name of Occam’s razor.

Occam’s razor
Prefer the simplest hypothesis that fits the data.
Of course giving an inductive bias a name does not justify it. Why should one
prefer simpler hypotheses? Notice that scientists sometimes appear to follow this
inductive bias. Physicists, for example, prefer simple explanations for the motions
of the planets, over more complex explanations. Why?
One argument is that because there are fewer short hypotheses than long ones (based on straightforward combinatorial arguments), it is less likely that one will find a short hypothesis that
coincidentally fits the training data. In contrast there are often many very complex
hypotheses that fit the current training data but fail to generalize correctly to
subsequent data. Consider decision tree hypotheses, for example.
There are many
more 500-node decision trees than 5-node decision trees. Given a small set of
20 training examples, we might expect to be able to find many 500-node decision
trees consistent with these, whereas we would be more surprised if a 5-node
decision tree could perfectly fit this data. We might therefore believe the 5-node
tree is less likely to be a statistical coincidence and prefer this hypothesis over
the 500-node hypothesis.
Upon closer examination, it turns out there is a major difficulty with the
above argument. By the same reasoning we could have argued that one should
prefer decision trees containing exactly 17 leaf nodes with 11 nonleaf nodes, that
use the decision attribute A1 at the root, and test attributes A2 through All, in
numerical order. There are relatively few such trees, and we might argue (by the
same reasoning as above) that our a priori chance of finding one consistent with
an arbitrary set of data is therefore small. The difficulty here is that there are very
many small sets of hypotheses that one can define-most of them rather arcane.

### Why should we believe that the small set of hypotheses consisting of decisiontrees with short descriptions should be any more relevant than the multitude ofother small sets of hypotheses that we might define?

A second problem with the above argument for Occam’s razor is that the size
of a hypothesis is determined by the particular representation used internally by
the learner. Two learners using different internal representations could therefore
anive at different hypotheses, both justifying their contradictory conclusions by
Occam’s razor!
For example, the function represented by the learned decision
tree could be represented as a tree with just one decision node, by a
learner that uses the boolean attribute XYZ, where we define the attribute XYZ to be true for instances that are classified positive by the decision tree and false otherwise.
Thus, two learners, both applying Occam’s razor, would generalize in different ways if one used the XYZ attribute to describe its examples
and the other used only the attributes Outlook, Temperature, Humidity, and Wind.
This last argument shows that Occam’s razor will produce two different
hypotheses from the same training examples when it is applied by two learners
that perceive these examples in terms of different internal representations. On this
basis we might be tempted to reject Occam’s razor altogether. However, consider
the following scenario that examines the question of which internal representations
might arise from a process of evolution and natural selection. Imagine a
population of artificial learning agents created by a simulated evolutionary process
involving reproduction, mutation, and natural selection of these agents.
Let us assume that this evolutionary process can alter the perceptual systems of these
agents from generation to generation, thereby changing the internal attributes by
which they perceive their world. For the sake of argument, let us also assume that
the learning agents employ a fixed learning algorithm (say ID3) that cannot be
altered by evolution.

It is reasonable to assume that over time evolution will produce
internal representation that make these agents increasingly successful within
their environment. Assuming that the success of an agent depends highly on its
ability to generalize accurately, we would therefore expect evolution to develop
internal representations that work well with whatever learning algorithm and inductive
bias is present. If the species of agents employs a learning algorithm whose
inductive bias is Occam’s razor, then we expect evolution to produce internal representations
for which Occam’s razor is a successful strategy.
The essence of the
argument here is that evolution will create internal representations that make the
learning algorithm’s inductive bias a self-fulfilling prophecy, simply because it
can alter the representation easier than it can alter the learning algorithm.
For now, we leave the debate regarding Occam’s razor. We will revisit it where we discuss the Minimum Description Length principle, a version of Occam’s razor that can be interpreted within a Bayesian framework.

## ISSUES IN DECISION TREE LEARNING

Practical issues in learning decision trees include determining how deeply to grow
the decision tree, handling continuous attributes, choosing an appropriate attribute
selection measure, andling training data with missing attribute values, handling
attributes with differing costs, and improving computational efficiency. Below
we discuss each of these issues and extensions to the basic ID3 algorithm that
address them. ID3 has itself been extended to address most of these issues, with
the resulting system renamed C4.5 (Quinlan 1993).
Avoiding Overfitting the Data
Reduced error Pruning
Rule Post-Pruning