- Sequence of if-else questions
- Consists of a hierarchy of nodes. Each node either asks a question or makes a prediction.
- Root node : No parent
- Internal node : Has parent, has children
- Leaf node : Has no children. It is where predictions are made
- Goal : Find patterns (splits) that produce the purest possible leaves, so that each leaf is dominated by one label.
- Information Gain : Reduction in impurity achieved by a split, i.e. the impurity of the parent node minus the size-weighted average impurity of its children. At each node, find the feature and split point that maximise information gain. If the node is already pure, or no split gives a positive information gain, the pattern is captured and the node becomes a leaf. Otherwise keep splitting recursively (the recursion can also be stopped early by specifying a maximum depth).
- Measure of impurity in a node:
- Gini index: For classification
- Entropy: For classification
- MSE : For regression
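The impurity measures above, and the information gain built on them, can be sketched in plain Python. This is a minimal illustration, not a library implementation: the function names (`gini`, `entropy`, `mse`, `information_gain`) are chosen here for the example, a split is assumed to be binary, and every node is assumed to be non-empty.

```python
import math
from collections import Counter

def gini(labels):
    """Gini index: 1 - sum(p_k^2) over the class proportions p_k."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k))."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def mse(values):
    """Mean squared error around the node mean (impurity for regression)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def information_gain(parent, left, right, impurity=entropy):
    """Parent impurity minus the size-weighted impurity of the two children."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted

parent = ["a", "a", "b", "b"]
print(gini(parent))                                      # 0.5
print(entropy(parent))                                   # 1.0
print(information_gain(parent, ["a", "a"], ["b", "b"]))  # 1.0
```

A perfectly mixed two-class node has the maximum entropy of 1 bit, and a split that separates the classes completely recovers all of it as information gain.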
- Can capture non-linear relationships between features and labels / real values
- Do not require feature scaling
- At each split, only one feature is involved
- Decision region : Region of the feature space where all instances are assigned the same label / value
- Decision Boundary : Surface that separates different decision regions
- Steps of building a decision tree:
1. Choose an attribute (column) of the dataset
2. Measure how significant that attribute is for splitting the data, using entropy.
A good split leaves less entropy (disorder / randomness) in the resulting subsets.
3. Find the attribute with the highest significance and use that attribute
to split the data
4. For each branch, repeat the process (recursive partitioning), each time choosing
the split with the best information gain (the largest reduction in entropy).
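The steps above can be sketched as a small recursive builder. This is a simplified illustration under stated assumptions, not how any particular library does it: features are numeric, every split is a binary threshold test `feature <= threshold`, and the names `best_split`, `build_tree`, `predict` and the toy data are made up for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy in bits of a non-empty list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels):
    """Steps 1-3: try every feature and threshold, keep the highest information gain."""
    n = len(labels)
    base = entropy(labels)
    best = (0.0, None, None)  # (gain, feature index, threshold)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue  # not a real split
            gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
            if gain > best[0]:
                best = (gain, f, t)
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    """Step 4: recursive partitioning until no split helps or max_depth is reached."""
    gain, f, t = best_split(rows, labels)
    if f is None or depth >= max_depth:
        return Counter(labels).most_common(1)[0][0]  # leaf: dominant label
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }

def predict(tree, row):
    """Follow the if-else questions from the root down to a leaf."""
    while isinstance(tree, dict):
        tree = tree["left"] if row[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree

# Hypothetical toy data: two numeric features, two labels.
rows = [[1.0, 5.0], [2.0, 4.0], [3.0, 1.0], [4.0, 2.0]]
labels = ["no", "no", "yes", "yes"]
tree = build_tree(rows, labels)
print(predict(tree, [1.5, 0.0]))  # no
print(predict(tree, [3.5, 9.0]))  # yes
```

Internal nodes are stored as dictionaries and leaves as bare labels, which keeps the sketch short. The exhaustive search over every observed value as a threshold is the simplest choice, not the fastest.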
- Limitations:
- Can only produce orthogonal (axis-aligned) decision boundaries, because each split tests a single feature
- Sensitive to small variations in training set
- High variance : An unconstrained tree tends to overfit the training set
- Solution : Ensemble learning
- Train different models on same dataset
- Let each model make its prediction
- Aggregate the predictions of the individual models (e.g. hard voting, where the majority label wins)
- One model's weakness is covered by another model's strength in that particular task
- Final model is a combination of models that are skillful in different ways
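Hard voting as described above can be illustrated with a few lines of Python. This is only a sketch: the helper name `hard_vote` and the three lists of predictions are invented for the example and stand in for the outputs of three already-trained models.

```python
from collections import Counter

def hard_vote(predictions_per_model):
    """Return, for each sample, the label predicted by the most models."""
    return [
        Counter(votes).most_common(1)[0][0]   # most frequent label for this sample
        for votes in zip(*predictions_per_model)
    ]

# Hypothetical predictions of three models on the same three samples.
model_a = ["cat", "dog", "dog"]
model_b = ["cat", "cat", "dog"]
model_c = ["dog", "cat", "dog"]

print(hard_vote([model_a, model_b, model_c]))  # ['cat', 'cat', 'dog']
```

Each model is wrong on at most one sample here, yet the vote can still settle on a single answer per sample because the errors fall on different samples. With an even number of models a tie is possible; in this sketch the label seen first among the tied ones wins.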