Hey Developer, let’s get more insight’s of Decision Tree learning about :

  1. Decision Tree
  2. Various Types of Decision Tree
  3. Different Algorithms Used For Decision Tree
  4. How to build Decision Trees ?
  5. Entropy and Information Gain Calculations
  6. How to avoid overfitting in Decision Trees

Let’s start then,

List of the parts of Decision Tree Series :


Introduction To Decision Trees

  • Decision Tree learning is one of the most widely used and practical methods for inductive inference(It is the process of reaching a general conclusion from specific examples, the conclusion is then applied to unseen data)
  • It is a method for approximating discrete-valued functions, in which the learned functions is represented by a decision tree.
  • It is robust to noisy data and capable of learning disjunctive expressions( A disjunction is a compound statement formed by joining two statements with the connector OR ).
  • Learned trees can also be represented as sets of if-then rules to improve readability.

Let’s understand with a small dataset…

  • Attributes (Features): Outlook, temp, humidity, wind
  • Values: Description of features, Eg: Outlook Values:- sunny, cloudy, rainy
  • Target: Play Tennis (We have to predict this), It represents the output of the model.
  • Instances: Examples of D1 to D14 of the dataset
  • Concept: Learn to decide whether to play tennis i.e., find h from the given dataset.

What is Decision Tree?

  • An internal node is a test on an attribute.
  • A branch represents an outcome of the test, eg., outlook=windy
  • A leaf node represents a class label or class label distribution
  • At each node, one attribute is chosen to split training examples into distinct classes as much as possible.
  • A new case is classified by following a matching path to a leaf node.

Decision Tree Types:

  • Binary Trees: Only two choices in each split. Can be non-uniform(uneven) in depth.
  • N-way trees or ternary trees: three or more choices in at least one of its splits(3-way,4-way etc.).

Decision Tree Algorithms

  • Hunt’s Algorithm( One of the earliest)
  • CART
  • ID3
  • C4.5
  • SLIQ

How to build a decision tree(Algorithm)?

Basic algorithm(a greedy algorithm)

  • The tree is constructed in a top-down recursive divide and conquer manner
  • At the start, all training examples are the root
  • Attributes are categorical(if continuous-values, they are discretized in advance)
  • Partition the examples recursively by chosing one attribute each time on the basis of a heuristic or statistical measure(eg., information gain)

Conditions for stopping partitioning

  • All samples for a given node belong to the same class
  • If there are no remaining attributes for further partitioning, majority voting is applied which is done through the Random Forest Algorithm(Ensemble Methods).
  • Remove subtrees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases.

Forming Rules From Decision Tree:

  • Example of concept: ‘Should I play tennis today’
    – Takes inputs (set of attributes)
    – Outputs a decision (say YES/NO)
  • Each non-leaf node is an attribute
    – The first non-leaf node is root node
  • Each leaf node is either Yes or No
  • Each link (branch) is labeled with
    possible values of the associated attribute
  • Rule formation
    – A decision tree can be expressed as a disjunction of conjunctions
    • PLAY tennis IF (Outlook = sunny) ∧ (Humidity = normal) ∨ (Outlook=Cloudy)
    ∨ (Outlook = Rainy) ∧ (Wind=Weak)
    • ∨ is disjunction operator (OR)
    • ∧ is conjunction operator (AND)

Which attribute is the Best Classifier or How to choose the splitting attribute?

  • At each node, available attributes are evaluated on the basis of separating the classes of the training examples
  • A goodness(purity) function is used for this purpose
  • Typical goodness function:
  1. Information Gain(ID3/C4.5)
  2. Information Gain Ratio
  3. Gini Index

How to Measure Impurity?

Entropy Impurity: Most Popular Measure Of Impurity

Gini Impurity: Generalization Of Variance Impurity, Applicable to two or more categories.

Misclassification Impurity: Minimum Probability that a training pattern would be misclassified at N. Most strongly peaked at equal probabilities.

Impurity of Two Category Case:

Obtaining DT through top-down induction

  • How can we obtain a DT?
  • Perform a top-down search, through the space of possible decision trees
  • Determine the attribute that best classifies the training data
  • Use this attribute as the root of the tree
  • Repeat this process for each branch from left to right
  • Proceed to the next level and determine the next best feature
  • Repeat until a leaf is reached.
  • How to choose the best attribute?
  • Choose the attribute that will yield more information (i.e. the attribute with the highest information gain)

Information gain

  • Information gain – > a reduction of Entropy, E
  • But what is Entropy?
    – Is the amount of energy that cannot be used to do work
    – Measured in bits
    – A measure of disorder in a system (high entropy = disorder)

– where:
– S is the training data set
– c is the number of target classes
– pi is the proportion of examples in S belonging to target class i
– Note: if your calculator doesn’t do log2, use log2(x)=1.443 ln(x) or 3.322 log10(x). For even better accuracy, use log2(x)=ln(x)/ln(2) or log2(x)=log10(x)/log10(2)

Entropy example

  • A coin is flipped
    – If the coin was fair -> 50% chance of head
    – Now, let us rig the coin -> so that 99% of the time head comes up
  • Let’s look at this in terms of entropy:
    – Two outcomes: head, tail
    – Probability: phead, ptail
    – E(0.5, 0.5)= – 0.5 log2(0.5) – (0.5) log2(0.5) = 1 bit – E(0.01, 0.99) = – 0.01 log2(0.01) – 0.99 log2 (0.99) = 0.08 bit
  • If the probability of heads =1, then entropy=0 – E(0, 1.0) = – 0 log2(0) – 1.0 log2 (1.0) = 0 bit

Information Gain

  • Information Gain, G will be defined as:

– Values (A) is the set of all possible values of attribute A
– Sv is the subset of S for which A has a value v
– |S| is the size of S and Sv is the size of Sv
The information gain is the expected reduction in entropy
caused by knowing the value of attribute A

Example – entropy calculation

  • Compute the entropy of the play-tennis example:
  • We have two classes, YES and NO
  • We have 14 instances with 9 classified as YES and 5 as NO
    – i.e. no. of classes, c=2
This image has an empty alt attribute; its file name is Screenshot-693.png
  • EYES = – (9/14) log2 (9/14) = 0.41
  • ENO = – (5/14) log2 (5/14) = 0.53
  • E(S) = EYES + ENO = 0.94

Example – information gain calculation

  • Compute the information gain for the attributes wind in the
    play-tennis data set:

    – |S|=14
    – Attribute wind
  • Two values: weak and strong
  • |Sweak| = 8
  • |Sstrong| = 6
  • Now, let us determine E(Sweak)
  • Instances=8, YES=6, NO=2
  • [6+,2-]
  • E(Sweak)=-(6/8)log2(6/8)-(2/8)log2(2/8)=0.81
  • Now, let us determine E(Sstrong)
  • Instances=6, YES=3, NO=3
  • [3+,3-]
  • E(Sstrong)=-(3/6)log2(3/6)-(3/6)log2(3/6)=1.0
  • Note, do not waste time if pYES=pNO
  • Going back to information gain computation for the attribut wind:

= 0.94 – (8/14) 0.81 – (6/14)1.00
= 0.048

  • Same as calculate for all the attributes

Now, compute the information gain for the attribute outlook and
temperature in the play-tennis data set:
• Attribute outlook:

• Attribute humidity:

• Attribute temperature:

and we get all the Information Gain of Attributes

• Gain(S, outlook)=0.25
• Gain(S, temp)=0.03
• Gain(S, humidity)=0.15
• Gain(S, wind)=0.048
• So, attribute with the highest info. gain
• OUTLOOK, therefore use outlook as the root node

Decision Tree On Next Level

After determining OUTLOOK as the root node, we need to expand the tree

• E(Ssunny)=-(2/5)log2(2/5)-(3/5)log2(3/5)=0.97

• Entropy (Ssunny)=0.97

• Gain(Ssunny, Humidity)=0.97-(3/5) 0.0 – (2/5) 0.0=0.97
• Gain (Ssunny, Wind)= 0.97– (3/5) 0.918 – (2/5) 1.0 = 0.019
• Gain(Ssunny, Temperature)=0.97-(2/5) 0.0 – (2/5) 1.0 – (1/5) 0.0 = 0.57
• Highest information gain is humidity, so use this attribute

Continue until all the examples are classified…

– Gain (Srainy, Wind), Gain (Srainy, Humidity),Gain (Srainy, Temp)
– Gain (Srainy, Wind) is the highest

• All leaf nodes are associated with training examples from the
same class (entropy=0)
• The attribute temperature is not used

How to avoid overfitting in Decision Trees:

  • Pre-Pruning
  • Post-Pruning

To learn more about these methods please let us know otherwise this blog will become more lengthy.

There are more topics in Decision Tree-like how to handle numeric values?

Do let me know so that, I can expand this series of Decision Trees…

In the next blog post, I ‘ll be going into the coding part of the decision tree…

Thank You!

List of the parts of Decision Tress Series :


Connect with me on :

Linkedin: https://www.linkedin.com/in/patidarparas13/

Twitter: https://twitter.com/patidarparas13

Github: https://github.com/patidarparas13

MLAIT On Twitter: https://twitter.com/mlait1908

MLAIT On Linkedin: https://www.linkedin.com/company/mlait1908