# DECISION TREE LEARNING

Decision tree learning is one of the most widely used and practical methods for

inductive inference. It is a method for approximating discrete-valued functions that

is robust to noisy data and capable of learning disjunctive expressions. This chapter

describes a family of decision tree learning algorithms that includes widely used

algorithms such as ID3, ASSISTANT, and C4.5. These decision tree learning methods

search a completely expressive hypothesis space and thereby avoid the difficulties

of restricted hypothesis spaces. Their inductive bias is a preference for small trees

over large trees.

## INTRODUCTION (decision tree learning)

Decision tree learning is a method for approximating discrete-valued target functions,

in which the learned function is represented by a decision tree. Learned trees

can also be re-represented as sets of if-then rules to improve human readability.

These learning methods are among the most popular of inductive inference algorithms

and have been successfully applied to a broad range of tasks, from learning

to diagnose medical cases to learning to assess the credit risk of loan applicants.

## DECISION TREE REPRESENTATION

## THE BASIC DECISION TREE LEARNING ALGORITHM

**Which Attribute Is the Best Classifier?**

The central choice in the ID3 algorithm is selecting which attribute to test at

each node in the tree. We would like to select the attribute that is most useful

for classifying examples. What is a good quantitative measure of the worth of

an attribute? We will define a statistical property, called information gain, that

measures how well a given attribute separates the training examples according to

their target classification. ID3 uses this information gain measure to select among

the candidate attributes at each step while growing the tree.
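Information gain is defined in terms of entropy: the gain of an attribute is the entropy of the whole collection minus the weighted entropy of the partitions the attribute induces. A minimal Python sketch of both quantities (the example layout, with each example as a dict of attribute values, is illustrative, not from the text):

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy of a collection of class labels, in bits."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(examples, attribute, labels):
    """Expected reduction in entropy from partitioning `examples`
    on `attribute`. Each example is a dict of attribute -> value."""
    total = len(examples)
    # Group the labels by the value each example takes for `attribute`.
    by_value = {}
    for ex, label in zip(examples, labels):
        by_value.setdefault(ex[attribute], []).append(label)
    remainder = sum((len(subset) / total) * entropy(subset)
                    for subset in by_value.values())
    return entropy(labels) - remainder
```

For instance, on four examples with labels `['+', '+', '+', '-']`, a `Wind` attribute that splits them into a pure `Weak` group and a mixed `Strong` group yields a gain of about 0.31 bits.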

## INDUCTIVE BIAS IN DECISION TREE LEARNING

What is the policy by which ID3 generalizes from observed training examples

to classify unseen instances? In other words, what is its inductive bias? Recall that inductive bias is the set of assumptions that, together with

the training data, deductively justify the classifications assigned by the learner to

future instances.

**Given a collection of training examples**, there are typically many decision

trees consistent with these examples. Describing the inductive bias of ID3 therefore

consists of describing the basis by which it chooses one of these consistent

hypotheses over the others. Which of these decision trees does ID3 choose?

It chooses the first acceptable tree it encounters in its simple-to-complex, hill-climbing

search through the space of possible trees. Roughly speaking, then, the

ID3 search strategy (a) selects in favor of shorter trees over longer ones, and

(b) selects trees that place the attributes with highest information gain closest to

the root.

Because of the subtle interaction between the attribute selection heuristic

used by ID3 and the particular training examples it encounters, it is difficult

to characterize precisely the inductive bias exhibited by ID3. However, we can

approximately characterize its bias as a preference for short decision trees over

complex trees.

Approximate inductive bias of ID3: Shorter trees are preferred over larger trees.

In fact, one could imagine an algorithm similar to ID3 that exhibits precisely

this inductive bias. Consider an algorithm that begins with the empty tree and

searches breadth first through progressively more complex trees, first considering

all trees of depth 1, then all trees of depth 2, etc.

Once it finds a decision tree

consistent with the training data, it returns the smallest consistent tree at that

search depth (e.g., the tree with the fewest nodes). Let us call this breadth-first

search algorithm BFS-ID3. BFS-ID3 finds a shortest decision tree and thus exhibits

precisely the bias “shorter trees are preferred over longer trees.” ID3 can be

viewed as an efficient approximation to BFS-ID3, using a greedy heuristic search

to attempt to find the shortest tree without conducting the entire breadth-first

search through the hypothesis space.
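The greedy search that distinguishes ID3 from BFS-ID3 can be made concrete with a short recursive sketch: at each node it commits to the single attribute with highest information gain rather than enumerating all trees of a given depth (a simplified illustration, not Mitchell's actual implementation; examples are dicts of attribute values):

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def id3(examples, labels, attributes):
    """Greedy top-down induction in the style of ID3.
    Returns a class label, or a nested dict {attribute: {value: subtree}}."""
    # Base cases: a pure node, or no attributes left (take the majority label).
    if len(set(labels)) == 1:
        return labels[0]
    if not attributes:
        return Counter(labels).most_common(1)[0][0]

    def gain(a):
        groups = {}
        for ex, y in zip(examples, labels):
            groups.setdefault(ex[a], []).append(y)
        remainder = sum(len(g) / len(labels) * entropy(g)
                        for g in groups.values())
        return entropy(labels) - remainder

    # Greedy, hill-climbing step: pick the best attribute and never backtrack.
    best = max(attributes, key=gain)
    tree = {best: {}}
    for value in set(ex[best] for ex in examples):
        sub = [(ex, y) for ex, y in zip(examples, labels) if ex[best] == value]
        sub_ex, sub_y = [e for e, _ in sub], [y for _, y in sub]
        tree[best][value] = id3(sub_ex, sub_y,
                                [a for a in attributes if a != best])
    return tree
```

Because the choice at each node is never revisited, the search is efficient but carries no guarantee of returning the globally shortest consistent tree.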

Because ID3 uses the information gain heuristic and a hill climbing strategy,

it exhibits a more complex bias than BFS-ID3. In particular, it does not always

find the shortest consistent tree, and it is biased to favor trees that place attributes

with high information gain closest to the root.

A closer approximation to the inductive bias of ID3: Shorter trees are preferred

over longer trees. Trees that place high information gain attributes close to the root

are preferred over those that do not.

**Restriction Biases and Preference Biases (decision tree learning)**

There is an interesting difference between the types of inductive bias exhibited

by ID3.

**Consider the difference between the hypothesis space search in these two approaches:**

ID3 searches a complete hypothesis space (i.e., one capable of expressing

any finite discrete-valued function). It searches incompletely through this

space, from simple to complex hypotheses, until its termination condition is

met (e.g., until it finds a hypothesis consistent with the data). Its inductive

bias is solely a consequence of the ordering of hypotheses by its search

strategy. Its hypothesis space introduces no additional bias.

The version space CANDIDATE-ELIMINATION algorithm searches an incomplete

hypothesis space (i.e., one that can express only a subset of the potentially

teachable concepts). It searches this space completely, finding every

hypothesis consistent with the training data. Its inductive bias is solely a

consequence of the expressive power of its hypothesis representation. Its

search strategy introduces no additional bias.

In brief, the inductive bias of ID3 follows from its search strategy, whereas

the inductive bias of the CANDIDATE-ELIMINATION algorithm follows from the definition

of its search space.

The inductive bias of ID3 is thus a preference for certain hypotheses over

others (e.g., for shorter hypotheses), with no hard restriction on the hypotheses that

can be eventually enumerated. This form of bias is typically called a preference

bias (or, alternatively, a search bias). In contrast, the bias of the CANDIDATE-ELIMINATION algorithm is in the form of a categorical restriction on the set of

hypotheses considered. This form of bias is typically called a restriction bias (or,

alternatively, a language bias).

Given that some form of inductive bias is required in order to generalize

beyond the training data, **which type of inductive bias shall we**

**prefer: a preference bias or a restriction bias?**

Typically, a preference bias is more desirable than a restriction bias, because

it allows the learner to work within a complete hypothesis space that is

assured to contain the unknown target function. In contrast, a restriction bias that

strictly limits the set of potential hypotheses is generally less desirable, because

it introduces the possibility of excluding the unknown target function altogether.

Whereas ID3 exhibits a purely preference bias and CANDIDATE-ELIMINATION

a purely restriction bias, some learning systems combine both. Consider, for example,

the program described for learning a numerical evaluation

function for game playing.

In this case, the learned evaluation function is represented

by a linear combination of a fixed set of board features, and the learning

algorithm adjusts the parameters of this linear combination to best fit the available

training data. In this case, the decision to use a linear function to represent the evaluation

function introduces a restriction bias (nonlinear evaluation functions cannot

be represented in this form). At the same time, the choice of a particular parameter

tuning method (the LMS algorithm in this case) introduces a preference bias stemming

from the ordered search through the space of all possible parameter values.
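The two biases in this example are easy to separate in code: the restriction bias is the commitment to a linear form, and the preference bias is in how LMS orders its search through weight space, one small error-correcting step at a time (a minimal sketch; the feature values, target, and learning rate below are illustrative):

```python
def lms_update(weights, features, target, learning_rate=0.01):
    """One LMS step: nudge each weight to reduce the squared error
    between the linear estimate and the training value."""
    # Restriction bias: the evaluation function is forced to be linear.
    estimate = sum(w * x for w, x in zip(weights, features))
    error = target - estimate
    # Preference bias: gradient-style steps order the search through
    # parameter space, favoring weights near the current ones.
    return [w + learning_rate * error * x
            for w, x in zip(weights, features)]
```

Repeatedly applying the update on a training position drives the linear estimate toward the training value, but no sequence of updates can ever leave the space of linear functions.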

## **Why Prefer Short Hypotheses in decision tree learning?**

Is ID3’s inductive bias favoring shorter decision trees a sound basis for generalizing

beyond the training data? Philosophers and others have debated this question

for centuries, and the debate remains unresolved to this day. William of Occam

was one of the first to discuss the question, around the year 1320, so this bias

often goes by the name of Occam’s razor.

**Occam’s razor**

Prefer the simplest hypothesis that fits the data.

Of course giving an inductive bias a name does not justify it. Why should one

prefer simpler hypotheses? Notice that scientists sometimes appear to follow this

inductive bias. Physicists, for example, prefer simple explanations for the motions

of the planets, over more complex explanations. Why?

One argument is that because there are fewer short hypotheses than long ones (based on straightforward combinatorial arguments), it is less likely that one will find a short hypothesis that

coincidentally fits the training data. In contrast there are often many very complex

hypotheses that fit the current training data but fail to generalize correctly to

subsequent data. Consider decision tree hypotheses, for example.

There are many

more 500-node decision trees than 5-node decision trees. Given a small set of

20 training examples, we might expect to be able to find many 500-node decision

trees consistent with these, whereas we would be more surprised if a 5-node

decision tree could perfectly fit this data. We might therefore believe the 5-node

tree is less likely to be a statistical coincidence and prefer this hypothesis over

the 500-node hypothesis.

**Upon closer examination**, it turns out there is a major difficulty with the

above argument. By the same reasoning we could have argued that one should

prefer decision trees containing exactly 17 leaf nodes with 11 nonleaf nodes, that

use the decision attribute A1 at the root, and test attributes A2 through A11, in

numerical order. There are relatively few such trees, and we might argue (by the

same reasoning as above) that our a priori chance of finding one consistent with

an arbitrary set of data is therefore small. The difficulty here is that there are very

many small sets of hypotheses that one can define, most of them rather arcane.

**Why should we believe that the small set of hypotheses consisting of decision**

**trees with short descriptions should be any more relevant than the multitude of**

**other small sets of hypotheses that we might define?**
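The counting claim behind the first argument (many more 500-node trees than 5-node trees) can be made concrete with Catalan numbers, which count distinct binary tree shapes; this sketch is an illustration not taken from the text, and it ignores attribute and label assignments, which only enlarge the counts:

```python
from math import comb

def catalan(n):
    """Number of distinct binary tree shapes with n internal nodes."""
    return comb(2 * n, n) // (n + 1)

# Shape counts grow explosively with tree size:
# 2 shapes with 2 internal nodes, 42 with 5, and an
# astronomically large number with 250.
```

This is exactly why a small tree that fits 20 training examples is more surprising, and hence more credible, than a large one: the small-tree "bucket" contains far fewer hypotheses that could fit by coincidence.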

A second problem with the above argument for Occam’s razor is that the size

of a hypothesis is determined by the particular representation used internally by

the learner. Two learners using different internal representations could therefore

arrive at different hypotheses, both justifying their contradictory conclusions by

Occam’s razor!

For example, the function represented by the learned decision

tree could be represented as a tree with just one decision node, by a

learner that uses the boolean attribute XYZ, where we define the attribute XYZ to be true for instances that are classified positive by the decision tree and false otherwise. **Thus**, two learners, both applying Occam’s razor, would generalize in different ways if one used the XYZ attribute to describe its examples

and the other used only the attributes Outlook, Temperature, Humidity, and Wind.

This last argument shows that Occam’s razor will produce two different

hypotheses from the same training examples when it is applied by two learners

that perceive these examples in terms of different internal representations. On this

basis we might be tempted to reject Occam’s razor altogether. However, consider

the following scenario that examines the question of which internal representations

might arise from a process of evolution and natural selection. Imagine a

population of artificial learning agents created by a simulated evolutionary process

involving reproduction, mutation, and natural selection of these agents.

Let us assume that this evolutionary process can alter the perceptual systems of these

agents from generation to generation, thereby changing the internal attributes by

which they perceive their world. For the sake of argument, let us also assume that

the learning agents employ a fixed learning algorithm (say ID3) that cannot be

altered by evolution.

It is reasonable to assume that over time evolution will produce

internal representations that make these agents increasingly successful within

their environment. Assuming that the success of an agent depends highly on its

ability to generalize accurately, we would therefore expect evolution to develop

internal representations that work well with whatever learning algorithm and inductive

bias is present. If the species of agents employs a learning algorithm whose

inductive bias is Occam’s razor, then we expect evolution to produce internal representations

for which Occam’s razor is a successful strategy.

The essence of the

argument here is that evolution will create internal representations that make the

learning algorithm’s inductive bias a self-fulfilling prophecy, simply because it

can alter the representation more easily than it can alter the learning algorithm.

For now, we leave the debate regarding Occam’s razor. We will revisit it where we discuss the Minimum Description Length principle, a version of Occam’s razor that can be interpreted within a Bayesian framework.

## **ISSUES IN DECISION TREE LEARNING**

Practical issues in learning decision trees include determining how deeply to grow

the decision tree, handling continuous attributes, choosing an appropriate attribute

selection measure, handling training data with missing attribute values, handling

attributes with differing costs, and improving computational efficiency. Below

we discuss each of these issues and extensions to the basic ID3 algorithm that

address them. ID3 has itself been extended to address most of these issues, with

the resulting system renamed C4.5 (Quinlan 1993).

**Avoiding Overfitting the Data**

**Reduced-Error Pruning**

**Rule Post-Pruning**
