Hey Developer, let’s get more insight’s of Decision Tree learning about :

1. Decision Tree
2. Various Types of Decision Tree
3. Different Algorithms Used For Decision Tree
4. How to build Decision Trees ?
5. Entropy and Information Gain Calculations
6. How to avoid overfitting in Decision Trees

Let’s start then,

List of the parts of Decision Tree Series :

### Introduction To Decision Trees

• Decision Tree learning is one of the most widely used and practical methods for inductive inference(It is the process of reaching a general conclusion from specific examples, the conclusion is then applied to unseen data)
• It is a method for approximating discrete-valued functions, in which the learned functions is represented by a decision tree.
• It is robust to noisy data and capable of learning disjunctive expressions( A disjunction is a compound statement formed by joining two statements with the connector OR ).
• Learned trees can also be represented as sets of if-then rules to improve readability.

### Let’s understand with a small dataset…

• Attributes (Features): Outlook, temp, humidity, wind
• Values: Description of features, Eg: Outlook Values:- sunny, cloudy, rainy
• Target: Play Tennis (We have to predict this), It represents the output of the model.
• Instances: Examples of D1 to D14 of the dataset
• Concept: Learn to decide whether to play tennis i.e., find h from the given dataset.

### What is Decision Tree?

• An internal node is a test on an attribute.
• A branch represents an outcome of the test, eg., outlook=windy
• A leaf node represents a class label or class label distribution
• At each node, one attribute is chosen to split training examples into distinct classes as much as possible.
• A new case is classified by following a matching path to a leaf node.

### Decision Tree Types:

• Binary Trees: Only two choices in each split. Can be non-uniform(uneven) in depth.
• N-way trees or ternary trees: three or more choices in at least one of its splits(3-way,4-way etc.).

### Decision Tree Algorithms

• Hunt’s Algorithm( One of the earliest)
• CART
• ID3
• C4.5
• SLIQ
• SPRINT
• CHAID

### How to build a decision tree(Algorithm)?

#### Basic algorithm(a greedy algorithm)

• The tree is constructed in a top-down recursive divide and conquer manner
• At the start, all training examples are the root
• Attributes are categorical(if continuous-values, they are discretized in advance)
• Partition the examples recursively by chosing one attribute each time on the basis of a heuristic or statistical measure(eg., information gain)

#### Conditions for stopping partitioning

• All samples for a given node belong to the same class
• If there are no remaining attributes for further partitioning, majority voting is applied which is done through the Random Forest Algorithm(Ensemble Methods).
• Remove subtrees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases.

### Forming Rules From Decision Tree:

• Example of concept: ‘Should I play tennis today’
– Takes inputs (set of attributes)
– Outputs a decision (say YES/NO)
• Each non-leaf node is an attribute
– The first non-leaf node is root node
• Each leaf node is either Yes or No
• Each link (branch) is labeled with
possible values of the associated attribute
• Rule formation
– A decision tree can be expressed as a disjunction of conjunctions
• PLAY tennis IF (Outlook = sunny) ∧ (Humidity = normal) ∨ (Outlook=Cloudy)
∨ (Outlook = Rainy) ∧ (Wind=Weak)
• ∨ is disjunction operator (OR)
• ∧ is conjunction operator (AND)

### Which attribute is the Best Classifier or How to choose the splitting attribute?

• At each node, available attributes are evaluated on the basis of separating the classes of the training examples
• A goodness(purity) function is used for this purpose
• Typical goodness function:
1. Information Gain(ID3/C4.5)
2. Information Gain Ratio
3. Gini Index

### How to Measure Impurity?

Entropy Impurity: Most Popular Measure Of Impurity

Gini Impurity: Generalization Of Variance Impurity, Applicable to two or more categories.

Misclassification Impurity: Minimum Probability that a training pattern would be misclassified at N. Most strongly peaked at equal probabilities.

### Obtaining DT through top-down induction

• How can we obtain a DT?
• Perform a top-down search, through the space of possible decision trees
• Determine the attribute that best classifies the training data
• Use this attribute as the root of the tree
• Repeat this process for each branch from left to right
• Proceed to the next level and determine the next best feature
• Repeat until a leaf is reached.
• How to choose the best attribute?
• Choose the attribute that will yield more information (i.e. the attribute with the highest information gain)

### Information gain

• Information gain – > a reduction of Entropy, E
• But what is Entropy?
– Is the amount of energy that cannot be used to do work
– Measured in bits
– A measure of disorder in a system (high entropy = disorder)

– where:
– S is the training data set
– c is the number of target classes
– pi is the proportion of examples in S belonging to target class i
– Note: if your calculator doesn’t do log2, use log2(x)=1.443 ln(x) or 3.322 log10(x). For even better accuracy, use log2(x)=ln(x)/ln(2) or log2(x)=log10(x)/log10(2)

### Entropy example

• A coin is flipped
– If the coin was fair -> 50% chance of head
– Now, let us rig the coin -> so that 99% of the time head comes up
• Let’s look at this in terms of entropy:
– E(0.5, 0.5)= – 0.5 log2(0.5) – (0.5) log2(0.5) = 1 bit – E(0.01, 0.99) = – 0.01 log2(0.01) – 0.99 log2 (0.99) = 0.08 bit
• If the probability of heads =1, then entropy=0 – E(0, 1.0) = – 0 log2(0) – 1.0 log2 (1.0) = 0 bit

### Information Gain

• Information Gain, G will be defined as:

where
– Values (A) is the set of all possible values of attribute A
– Sv is the subset of S for which A has a value v
– |S| is the size of S and Sv is the size of Sv
The information gain is the expected reduction in entropy
caused by knowing the value of attribute A

### Example – entropy calculation

• Compute the entropy of the play-tennis example:
• We have two classes, YES and NO
• We have 14 instances with 9 classified as YES and 5 as NO
– i.e. no. of classes, c=2
• EYES = – (9/14) log2 (9/14) = 0.41
• ENO = – (5/14) log2 (5/14) = 0.53
• E(S) = EYES + ENO = 0.94

### Example – information gain calculation

• Compute the information gain for the attributes wind in the
play-tennis data set:

– |S|=14
– Attribute wind
• Two values: weak and strong
• |Sweak| = 8
• |Sstrong| = 6
• Now, let us determine E(Sweak)
• Instances=8, YES=6, NO=2
• [6+,2-]
• E(Sweak)=-(6/8)log2(6/8)-(2/8)log2(2/8)=0.81
• Now, let us determine E(Sstrong)
• Instances=6, YES=3, NO=3
• [3+,3-]
• E(Sstrong)=-(3/6)log2(3/6)-(3/6)log2(3/6)=1.0
• Note, do not waste time if pYES=pNO
• Going back to information gain computation for the attribut wind:

= 0.94 – (8/14) 0.81 – (6/14)1.00
= 0.048

• Same as calculate for all the attributes

Now, compute the information gain for the attribute outlook and
temperature in the play-tennis data set:
• Attribute outlook:

• Attribute humidity:

• Attribute temperature:

and we get all the Information Gain of Attributes

• Gain(S, outlook)=0.25
• Gain(S, temp)=0.03
• Gain(S, humidity)=0.15
• Gain(S, wind)=0.048
• So, attribute with the highest info. gain
• OUTLOOK, therefore use outlook as the root node

### Decision Tree On Next Level

After determining OUTLOOK as the root node, we need to expand the tree

• E(Ssunny)=-(2/5)log2(2/5)-(3/5)log2(3/5)=0.97

• Entropy (Ssunny)=0.97

• Gain(Ssunny, Humidity)=0.97-(3/5) 0.0 – (2/5) 0.0=0.97
• Gain (Ssunny, Wind)= 0.97– (3/5) 0.918 – (2/5) 1.0 = 0.019
• Gain(Ssunny, Temperature)=0.97-(2/5) 0.0 – (2/5) 1.0 – (1/5) 0.0 = 0.57
• Highest information gain is humidity, so use this attribute

Continue until all the examples are classified…

– Gain (Srainy, Wind), Gain (Srainy, Humidity),Gain (Srainy, Temp)
– Gain (Srainy, Wind) is the highest

• All leaf nodes are associated with training examples from the
same class (entropy=0)
• The attribute temperature is not used

### How to avoid overfitting in Decision Trees:

• Pre-Pruning
• Post-Pruning

There are more topics in Decision Tree-like how to handle numeric values?

Do let me know so that, I can expand this series of Decision Trees…

In the next blog post, I ‘ll be going into the coding part of the decision tree…

Thank You!

List of the parts of Decision Tress Series :

Connect with me on :