Hey Developer, let’s get more insight’s of Decision Tree learning about :

- Decision Tree
- Various Types of Decision Tree
- Different Algorithms Used For Decision Tree
- How to build Decision Trees ?
- Entropy and Information Gain Calculations
- How to avoid overfitting in Decision Trees

Let’s start then,

List of the parts of Decision Tree Series :

- https://mlait.in/2020/03/23/introduction-to-decision-trees—part-1/
- https://mlait.in/2020/03/23/going-deeper-into-decision-trees—part-2/

### Introduction To Decision Trees

- Decision Tree learning is one of the most widely used and practical methods for inductive inference(It is the process of reaching a general conclusion from specific examples, the conclusion is then applied to unseen data)
- It is a method for approximating discrete-valued functions, in which the learned functions is represented by a decision tree.
- It is robust to noisy data and capable of learning disjunctive expressions( A
**disjunction**is a compound**statement**formed by joining two**statements**with the connector OR ). - Learned trees can also be represented as sets of if-then rules to improve readability.

### Let’s understand with a small dataset…

**Attributes (Features)**: Outlook, temp, humidity, wind**Values**: Description of features, Eg: Outlook Values:- sunny, cloudy, rainy**Target:**Play Tennis (We have to predict this), It represents the output of the model.**Instances:**Examples of D1 to D14 of the dataset**Concept:**Learn to decide whether to play tennis i.e., find**h**from the given dataset.

### What is Decision Tree?

- An internal node is a test on an attribute.
- A branch represents an outcome of the test, eg., outlook=windy
- A leaf node represents a class label or class label distribution
- At each node, one attribute is chosen to split training examples into distinct classes as much as possible.
- A new case is classified by following a matching path to a leaf node.

**Decision Tree Types:**

**Binary Trees:**Only two choices in each split. Can be non-uniform(uneven) in depth.**N-way trees or ternary trees:**three or more choices in at least one of its splits(3-way,4-way etc.).

**Decision Tree Algorithms**

- Hunt’s Algorithm( One of the earliest)
- CART
- ID3
- C4.5
- SLIQ
- SPRINT
- CHAID

### How to build a decision tree(Algorithm)?

#### Basic algorithm(a greedy algorithm)

- The tree is constructed in a top-down recursive divide and conquer manner
- At the start, all training examples are the root
- Attributes are categorical(if continuous-values, they are discretized in advance)
- Partition the examples recursively by chosing one attribute each time on the basis of a heuristic or statistical measure(eg., information gain)

#### Conditions for stopping partitioning

- All samples for a given node belong to the same class
- If there are no remaining attributes for further partitioning, majority voting is applied which is done through the
**Random Forest Algorithm(Ensemble Methods).** - Remove subtrees or branches, in a bottom-up manner, to improve the estimated accuracy on new cases.

### Forming Rules From Decision Tree:

- Example of concept: ‘Should I play tennis today’

– Takes inputs (set of attributes)

– Outputs a decision (say YES/NO)

- Each non-leaf node is an attribute

– The first non-leaf node is root node - Each leaf node is either Yes or No
- Each link (branch) is labeled with

possible values of the associated attribute - Rule formation

– A decision tree can be expressed as a disjunction of conjunctions

• PLAY tennis IF (Outlook = sunny) ∧ (Humidity = normal) ∨ (Outlook=Cloudy)

∨ (Outlook = Rainy) ∧ (Wind=Weak)

• ∨ is disjunction operator (OR)

• ∧ is conjunction operator (AND)

### Which attribute is the Best Classifier or How to choose the splitting attribute?

- At each node, available attributes are evaluated on the basis of separating the classes of the training examples
- A goodness(purity) function is used for this purpose
**Typical goodness function:**

- Information Gain(ID3/C4.5)
- Information Gain Ratio
- Gini Index

### How to Measure Impurity?

**Entropy Impurity**: Most Popular Measure Of Impurity

**Gini Impurity:** Generalization Of Variance Impurity, Applicable to two or more categories.

**Misclassification Impurity:** Minimum Probability that a training pattern would be misclassified at N. Most strongly peaked at equal probabilities.

### Impurity of Two Category Case:

### Obtaining DT through top-down induction

- How can we obtain a DT?
- Perform a top-down search, through the space of possible decision trees
- Determine the attribute that best classifies the training data
- Use this attribute as the root of the tree
- Repeat this process for each branch from left to right
- Proceed to the next level and determine the next best feature
- Repeat until a leaf is reached.
- How to choose the best attribute?
- Choose the attribute that will yield more information (i.e. the attribute with the
**highest information gain**)

### Information gain

- Information gain – > a reduction of Entropy, E
- But what is Entropy?

– Is the amount of energy that cannot be used to do work

– Measured in bits

– A measure of disorder in a system (high entropy = disorder)

– where:

– S is the training data set

– c is the number of target classes

– pi is the proportion of examples in S belonging to target class i

– Note: if your calculator doesn’t do log2, use log2(x)=1.443 ln(x) or 3.322 log10(x). For even better accuracy, use log2(x)=ln(x)/ln(2) or log2(x)=log10(x)/log10(2)

### Entropy example

- A coin is flipped

– If the coin was fair -> 50% chance of head

– Now, let us rig the coin -> so that 99% of the time head comes up - Let’s look at this in terms of entropy:

– Two outcomes: head, tail

– Probability: phead, ptail

– E(0.5, 0.5)= – 0.5 log2(0.5) – (0.5) log2(0.5) = 1 bit – E(0.01, 0.99) = – 0.01 log2(0.01) – 0.99 log2 (0.99) = 0.08 bit - If the probability of heads =1, then entropy=0 – E(0, 1.0) = – 0 log2(0) – 1.0 log2 (1.0) = 0 bit

### Information Gain

- Information Gain, G will be defined as:

where

– Values (A) is the set of all possible values of attribute A

– Sv is the subset of S for which A has a value v

– |S| is the size of S and Sv is the size of Sv

The information gain is the expected reduction in entropy

caused by knowing the value of attribute A

### Example – entropy calculation

- Compute the entropy of the play-tennis example:
- We have two classes, YES and NO
- We have 14 instances with 9 classified as YES and 5 as NO

– i.e. no. of classes, c=2

- EYES = – (9/14) log2 (9/14) = 0.41
- ENO = – (5/14) log2 (5/14) = 0.53
- E(S) = EYES + ENO = 0.94

### Example – information gain calculation

**Compute the information gain for the attributes wind in the**

play-tennis data set:

– |S|=14

– Attribute wind- Two values: weak and strong
- |Sweak| = 8
- |Sstrong| = 6

**Now, let us determine E(Sweak)**- Instances=8, YES=6, NO=2
- [6+,2-]
- E(Sweak)=-(6/8)log2(6/8)-(2/8)log2(2/8)=0.81
**Now, let us determine E(Sstrong)**- Instances=6, YES=3, NO=3
- [3+,3-]
- E(Sstrong)=-(3/6)log2(3/6)-(3/6)log2(3/6)=1.0
- Note, do not waste time if pYES=pNO
**Going back to information gain computation for the attribut wind:**

= 0.94 – (8/14) 0.81 – (6/14)1.00

= 0.048

**Same as calculate for all the attributes**

Now, compute the information gain for the attribute outlook and

temperature in the play-tennis data set:

• Attribute outlook:

• Attribute humidity:

• Attribute temperature:

**and we get all the Information Gain of Attributes**

• Gain(S, outlook)=0.25

• Gain(S, temp)=0.03

• Gain(S, humidity)=0.15

• Gain(S, wind)=0.048** • So, attribute with the highest info. gain • OUTLOOK, therefore use outlook as the root node**

### Decision Tree On Next Level

After determining **OUTLOOK** as the root node, we need to expand the tree

• E(Ssunny)=-(2/5)log2(2/5)-(3/5)log2(3/5)=0.97

• Entropy (Ssunny)=0.97

• Gain(Ssunny, Humidity)=0.97-(3/5) 0.0 – (2/5) 0.0=0.97

• Gain (Ssunny, Wind)= 0.97– (3/5) 0.918 – (2/5) 1.0 = 0.019

• Gain(Ssunny, Temperature)=0.97-(2/5) 0.0 – (2/5) 1.0 – (1/5) 0.0 = 0.57

• Highest information gain is humidity, so use this attribute

** Continue until all the examples are classified…**

– Gain (Srainy, Wind), Gain (Srainy, Humidity),Gain (Srainy, Temp)

– Gain (Srainy, Wind) is the highest

• All leaf nodes are associated with training examples from the

same class (entropy=0)

• The attribute temperature is not used

### How to avoid overfitting in Decision Trees:

- Pre-Pruning
- Post-Pruning

To learn more about these methods please let us know otherwise this blog will become more lengthy.

There are more topics in Decision Tree-like **how to handle numeric values**?

Do let me know so that, I can expand this series of Decision Trees…

In the next blog post, I ‘ll be going into the coding part of the decision tree…

Thank You!

List of the parts of Decision Tress Series :

- https://mlait.in/2020/03/23/introduction-to-decision-trees—part-1/
- https://mlait.in/2020/03/23/going-deeper-into-decision-trees—part-2/

Connect with me on :

Linkedin: https://www.linkedin.com/in/patidarparas13/

Twitter: https://twitter.com/patidarparas13

Github: https://github.com/patidarparas13

MLAIT On Twitter: https://twitter.com/mlait1908

MLAIT On Linkedin: https://www.linkedin.com/company/mlait1908