Hey Developer, let’s get more insight into Decision Tree learning. We will cover:
- Decision Tree
- Various Types of Decision Tree
- Different Algorithms Used For Decision Tree
- How to build Decision Trees?
- Entropy and Information Gain Calculations
- How to avoid overfitting in Decision Trees
Let’s start then,
List of the parts of the Decision Tree series:
- http://34.74.145.68/2020/03/23/introduction-to-decision-trees—part-1/
- http://34.74.145.68/2020/03/23/going-deeper-into-decision-trees—part-2/
Introduction To Decision Trees
- Decision Tree learning is one of the most widely used and practical methods for inductive inference (the process of reaching a general conclusion from specific examples; the conclusion is then applied to unseen data).
- It is a method for approximating discrete-valued functions, in which the learned function is represented by a decision tree.
- It is robust to noisy data and capable of learning disjunctive expressions (a disjunction is a compound statement formed by joining two statements with the connector OR).
- Learned trees can also be represented as sets of if-then rules to improve readability.
Let’s understand this with a small dataset…
(Table: the play-tennis dataset, with 14 examples D1 to D14, the attributes Outlook, Temp, Humidity, and Wind, and the target PlayTennis.)
- Attributes (Features): Outlook, temp, humidity, wind
- Values: The possible values of a feature, e.g., Outlook has the values sunny, cloudy, and rainy
- Target: PlayTennis (what we have to predict); it represents the output of the model
- Instances: The examples D1 to D14 of the dataset
- Concept: Learn to decide whether to play tennis, i.e., find the hypothesis h from the given dataset
What is a Decision Tree?
- An internal node is a test on an attribute.
- A branch represents an outcome of the test, e.g., Outlook = Sunny
- A leaf node represents a class label or class label distribution
- At each node, one attribute is chosen to split training examples into distinct classes as much as possible.
- A new case is classified by following a matching path to a leaf node (a minimal sketch of this structure follows below).
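To make the node, branch, and leaf terminology concrete, here is a minimal sketch (not code from the original post; the class name, attribute names, and labels are illustrative):

```python
class Node:
    """An internal node tests one attribute; a leaf stores a class label."""
    def __init__(self, attribute=None, label=None):
        self.attribute = attribute  # e.g., "Outlook" (None for a leaf)
        self.label = label          # e.g., "Yes" / "No" (None for an internal node)
        self.children = {}          # branch value -> child Node, e.g., "Sunny" -> Node(...)

def classify(node, instance):
    """Classify a new case by following the matching path to a leaf node."""
    while node.label is None:             # keep descending until a leaf is reached
        value = instance[node.attribute]  # outcome of the test at this node
        node = node.children[value]       # follow the branch labeled with that value
    return node.label
```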
Decision Tree Types:
- Binary Trees: only two choices at each split. Can be non-uniform (uneven) in depth.
- N-way or ternary trees: three or more choices in at least one of the splits (3-way, 4-way, etc.).
Decision Tree Algorithms
- Hunt’s Algorithm (one of the earliest)
- CART
- ID3
- C4.5
- SLIQ
- SPRINT
- CHAID
How to build a decision tree (algorithm)?
Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner
- At the start, all training examples are at the root
- Attributes are categorical (if continuous-valued, they are discretized in advance)
- Partition the examples recursively by choosing one attribute each time on the basis of a heuristic or statistical measure (e.g., information gain); a small code sketch of the whole procedure follows the stopping conditions below
Conditions for stopping partitioning
- All samples for a given node belong to the same class
- If there are no remaining attributes for further partitioning, majority voting is applied to label the leaf with the most common class (the same voting idea also appears in ensemble methods such as Random Forests)
- Pruning: remove subtrees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases
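As a minimal sketch of this greedy, top-down procedure (ID3-style), assuming the training examples are plain Python dictionaries and using the information-gain heuristic that is worked out later in this post; the function names are illustrative, not a library API:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """E(S) = -sum(p_i * log2(p_i)) over the classes present in the label list."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def build_tree(examples, attributes, target):
    """Top-down, recursive, divide-and-conquer construction of a decision tree."""
    labels = [ex[target] for ex in examples]
    # Stopping condition 1: all samples at this node belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stopping condition 2: no remaining attributes -> majority voting.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]

    def gain(attr):
        # Information gain = entropy of the node minus the weighted entropy of its splits.
        subsets = {}
        for ex in examples:
            subsets.setdefault(ex[attr], []).append(ex[target])
        remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
        return entropy(labels) - remainder

    # Greedy step: choose the attribute with the highest information gain.
    best = max(attributes, key=gain)
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]
    for value in {ex[best] for ex in examples}:
        subset = [ex for ex in examples if ex[best] == value]
        tree[best][value] = build_tree(subset, remaining, target)
    return tree
```

Because the attributes are assumed categorical (discretized in advance, as noted above), every split is a simple equality test on one attribute value.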
Forming Rules From a Decision Tree:
- Example of a concept: ‘Should I play tennis today?’
– Takes inputs (a set of attributes)
– Outputs a decision (say YES/NO)
- Each non-leaf node is an attribute
– The first non-leaf node is the root node
- Each leaf node is either Yes or No
- Each link (branch) is labeled with the possible values of the associated attribute
- Rule formation
– A decision tree can be expressed as a disjunction of conjunctions (shown as a small code example below)
• PLAY tennis IF (Outlook = Sunny) ∧ (Humidity = Normal) ∨ (Outlook = Cloudy) ∨ (Outlook = Rainy) ∧ (Wind = Weak)
• ∨ is the disjunction operator (OR)
• ∧ is the conjunction operator (AND)
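Purely as an illustration (not code from the original post), the disjunction of conjunctions above maps directly onto a boolean expression:

```python
def play_tennis(outlook, humidity, wind):
    """The rule read off the tree: a disjunction (or) of conjunctions (and)."""
    return ((outlook == "Sunny" and humidity == "Normal")
            or outlook == "Cloudy"
            or (outlook == "Rainy" and wind == "Weak"))

print(play_tennis("Sunny", "Normal", "Strong"))  # True: the first conjunction matches
print(play_tennis("Rainy", "High", "Strong"))    # False: no conjunction matches
```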
Which attribute is the Best Classifier or How to choose the splitting attribute?
- At each node, available attributes are evaluated on the basis of separating the classes of the training examples
- A goodness (purity) function is used for this purpose
- Typical goodness functions:
- Information Gain(ID3/C4.5)
- Information Gain Ratio
- Gini Index
How to Measure Impurity?
Entropy Impurity: the most popular measure of impurity
i(N) = -Σj p_j log2(p_j)
Gini Impurity: a generalization of variance impurity, applicable to two or more categories
i(N) = 1 - Σj p_j^2
Misclassification Impurity: the minimum probability that a training pattern would be misclassified at node N; most strongly peaked at equal probabilities
i(N) = 1 - maxj p_j
(Here p_j is the fraction of training patterns at node N that belong to class j.)
Impurity of the Two-Category Case: with class probabilities (p, 1 - p), all three measures are zero at p = 0 or p = 1 and maximal at p = 0.5 (a small numeric illustration follows below).
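A small sketch of the three measures, assuming a node is summarized by its list of class probabilities p_j (the function names are mine, not from a library):

```python
from math import log2

def entropy_impurity(probs):
    """i(N) = -sum(p_j * log2(p_j)); highest when all classes are equally likely."""
    return -sum(p * log2(p) for p in probs if p > 0)

def gini_impurity(probs):
    """i(N) = 1 - sum(p_j^2); zero when a single class has probability 1."""
    return 1 - sum(p * p for p in probs)

def misclassification_impurity(probs):
    """i(N) = 1 - max(p_j): the minimum probability of misclassifying a pattern at N."""
    return 1 - max(probs)

# Two-category case at equal probabilities, the most impure point for all three:
print(entropy_impurity([0.5, 0.5]))            # 1.0
print(gini_impurity([0.5, 0.5]))               # 0.5
print(misclassification_impurity([0.5, 0.5]))  # 0.5
```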
Obtaining a DT through top-down induction
- How can we obtain a DT?
- Perform a top-down search through the space of possible decision trees
- Determine the attribute that best classifies the training data
- Use this attribute as the root of the tree
- Repeat this process for each branch from left to right
- Proceed to the next level and determine the next best feature
- Repeat until a leaf is reached.
- How to choose the best attribute?
- Choose the attribute that will yield more information (i.e. the attribute with the highest information gain)
Information gain
- Information gain -> a reduction in entropy, E
- But what is Entropy?
– In thermodynamics, it is the amount of energy that cannot be used to do work
– In information theory, it is measured in bits
– It is a measure of disorder in a system (high entropy = high disorder)
E(S) = -Σ(i=1 to c) p_i log2(p_i)
– where:
– S is the training data set
– c is the number of target classes
– p_i is the proportion of examples in S belonging to target class i
– Note: if your calculator doesn’t have log2, use log2(x) ≈ 1.443 ln(x) or 3.322 log10(x); for exact values, use log2(x) = ln(x)/ln(2) or log2(x) = log10(x)/log10(2)
Entropy example
- A coin is flipped
– If the coin is fair -> 50% chance of heads
– Now, let us rig the coin so that heads comes up 99% of the time
- Let’s look at this in terms of entropy (checked in the snippet below):
– Two outcomes: heads, tails
– Probabilities: p_head, p_tail
– E(0.5, 0.5) = -0.5 log2(0.5) - 0.5 log2(0.5) = 1 bit
– E(0.01, 0.99) = -0.01 log2(0.01) - 0.99 log2(0.99) = 0.08 bit
- If the probability of heads = 1, then entropy = 0
– E(0, 1.0) = -0 log2(0) - 1.0 log2(1.0) = 0 bit
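The same numbers, checked with a few lines of Python (a quick sanity check, not part of the original post):

```python
from math import log2

def coin_entropy(p_head):
    """Entropy of a two-outcome coin; terms with probability 0 contribute nothing."""
    p_tail = 1 - p_head
    return -sum(p * log2(p) for p in (p_head, p_tail) if p > 0)

print(round(coin_entropy(0.5), 2))   # 1.0  bit: fair coin, maximum disorder
print(round(coin_entropy(0.99), 2))  # 0.08 bit: rigged coin, nearly predictable
print(round(coin_entropy(1.0), 2))   # 0.0  bit: the outcome is certain
```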
Information Gain
- Information Gain, G, is defined as:
Gain(S, A) = E(S) - Σ(v ∈ Values(A)) (|S_v| / |S|) E(S_v)
where
– Values(A) is the set of all possible values of attribute A
– S_v is the subset of S for which A has the value v
– |S| is the size of S and |S_v| is the size of S_v
The information gain is the expected reduction in entropy caused by knowing the value of attribute A (a minimal code sketch follows below).
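A minimal sketch of this formula, assuming each example is a dictionary such as {"Wind": "Weak", "PlayTennis": "Yes"} (the keys are illustrative):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """E(S) = -sum(p_i * log2(p_i)) over the target classes present in S."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(examples, attribute, target):
    """Gain(S, A) = E(S) - sum over v in Values(A) of (|S_v| / |S|) * E(S_v)."""
    labels = [ex[target] for ex in examples]
    subsets = {}
    for ex in examples:
        subsets.setdefault(ex[attribute], []).append(ex[target])
    remainder = sum(len(s_v) / len(labels) * entropy(s_v) for s_v in subsets.values())
    return entropy(labels) - remainder
```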
Example – entropy calculation
- Compute the entropy of the play-tennis example:
- We have two classes, YES and NO
- We have 14 instances with 9 classified as YES and 5 as NO
– i.e., the number of classes c = 2
E(S) = -(9/14) log2(9/14) - (5/14) log2(5/14)
- E_YES = -(9/14) log2(9/14) = 0.41
- E_NO = -(5/14) log2(5/14) = 0.53
- E(S) = E_YES + E_NO = 0.94
Example – information gain calculation
- Compute the information gain for the attribute wind in the play-tennis data set:
– |S| = 14
– Attribute wind has two values: weak and strong
- |S_weak| = 8
- |S_strong| = 6
- Now, let us determine E(S_weak)
- Instances = 8, YES = 6, NO = 2
- [6+, 2-]
- E(S_weak) = -(6/8) log2(6/8) - (2/8) log2(2/8) = 0.81
- Now, let us determine E(S_strong)
- Instances = 6, YES = 3, NO = 3
- [3+, 3-]
- E(S_strong) = -(3/6) log2(3/6) - (3/6) log2(3/6) = 1.0
- Note: when p_YES = p_NO, the entropy is exactly 1, so there is no need to compute it
- Going back to the information gain computation for the attribute wind (verified in the snippet below):
Gain(S, wind) = E(S) - (|S_weak| / |S|) E(S_weak) - (|S_strong| / |S|) E(S_strong)
= 0.94 - (8/14) 0.81 - (6/14) 1.00
= 0.048
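Using only the counts quoted above ([9+, 5-] overall, [6+, 2-] for weak wind, [3+, 3-] for strong wind), the same gain can be reproduced numerically:

```python
from math import log2

def entropy_from_counts(pos, neg):
    """Entropy of a node containing pos positive and neg negative examples."""
    total = pos + neg
    return -sum((n / total) * log2(n / total) for n in (pos, neg) if n > 0)

e_s      = entropy_from_counts(9, 5)  # whole set S:   [9+, 5-] -> about 0.94
e_weak   = entropy_from_counts(6, 2)  # wind = weak:   [6+, 2-] -> about 0.81
e_strong = entropy_from_counts(3, 3)  # wind = strong: [3+, 3-] -> exactly 1.00

gain_wind = e_s - (8 / 14) * e_weak - (6 / 14) * e_strong
print(round(gain_wind, 3))  # 0.048
```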
- Calculate the gain in the same way for all the remaining attributes
Now, compute the information gain for the attributes outlook, humidity, and temperature in the play-tennis data set:
• Attribute outlook:
• Attribute humidity:
• Attribute temperature:
and we get the information gain for all the attributes:
• Gain(S, outlook)=0.25
• Gain(S, temp)=0.03
• Gain(S, humidity)=0.15
• Gain(S, wind)=0.048
• So, the attribute with the highest information gain is OUTLOOK; therefore, use outlook as the root node
Decision Tree at the Next Level
After determining OUTLOOK as the root node, we need to expand the tree:

• E(S_sunny) = -(2/5) log2(2/5) - (3/5) log2(3/5) = 0.97
• Entropy(S_sunny) = 0.97

• Gain(S_sunny, Humidity) = 0.97 - (3/5) 0.0 - (2/5) 0.0 = 0.97
• Gain(S_sunny, Wind) = 0.97 - (3/5) 0.918 - (2/5) 1.0 = 0.019
• Gain(S_sunny, Temperature) = 0.97 - (2/5) 0.0 - (2/5) 1.0 - (1/5) 0.0 = 0.57
• The highest information gain is for Humidity, so use this attribute (the arithmetic is checked in the snippet below)
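The same three gains, reproduced from the weights and branch entropies quoted above (the helper function is mine, just to show the arithmetic):

```python
def gain(parent_entropy, weighted_branches):
    """Gain = E(parent) - sum(weight_v * E(branch_v))."""
    return parent_entropy - sum(w * e for w, e in weighted_branches)

e_sunny = 0.97  # entropy of the five 'sunny' days: [2+, 3-]

print(round(gain(e_sunny, [(3/5, 0.0), (2/5, 0.0)]), 3))             # Humidity    -> 0.97
print(round(gain(e_sunny, [(3/5, 0.918), (2/5, 1.0)]), 3))           # Wind        -> 0.019
print(round(gain(e_sunny, [(2/5, 0.0), (2/5, 1.0), (1/5, 0.0)]), 3)) # Temperature -> 0.57
```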
Continue until all the examples are classified…
– Gain(S_rainy, Wind), Gain(S_rainy, Humidity), Gain(S_rainy, Temp)
– Gain(S_rainy, Wind) is the highest
(Final tree: Outlook at the root; the Sunny branch splits on Humidity, the Rainy branch splits on Wind, and the Cloudy branch is a pure Yes leaf.)
• All leaf nodes are associated with training examples from the same class (entropy = 0)
• The attribute temperature is not used
How to avoid overfitting in Decision Trees:
- Pre-Pruning: stop growing the tree before it perfectly fits the training data
- Post-Pruning: grow the full tree first, then prune back weak branches (a brief illustration follows below)
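As a rough illustration of the two ideas (not a full treatment), scikit-learn’s DecisionTreeClassifier exposes both through its parameters; the specific values below are arbitrary examples:

```python
from sklearn.tree import DecisionTreeClassifier

# Pre-pruning: stop growing the tree early by constraining its size while it is built.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)

# Post-pruning: grow the full tree first, then cut back weak branches afterwards
# (cost-complexity pruning, controlled here by ccp_alpha).
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01)
```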
To learn more about these methods, please let us know; otherwise, this blog will become too lengthy.
There are more topics in Decision Trees, like how to handle numeric values.
Do let me know, so that I can expand this series on Decision Trees…
In the next blog post, I’ll be going into the coding part of the decision tree…
Thank You!
List of the parts of the Decision Tree series:
- http://34.74.145.68/2020/03/23/introduction-to-decision-trees—part-1/
- http://34.74.145.68/2020/03/23/going-deeper-into-decision-trees—part-2/
Connect with me on :
Linkedin: https://www.linkedin.com/in/patidarparas13/
Twitter: https://twitter.com/patidarparas13
Github: https://github.com/patidarparas13
MLAIT On Twitter: https://twitter.com/mlait1908
MLAIT On Linkedin: https://www.linkedin.com/company/mlait1908