Unit 4: Decision Trees, CART, Ensemble Learning, Bagging, Boosting & Nearest Neighbour

Machine Learning Techniques (MCA556)

These notes follow the course syllabus.

---

Learning with Trees

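Decision Trees are among the most popular machine learning algorithms.

They make a prediction by asking a sequence of questions about the input features, arranged in a tree-like structure: each answer selects a branch, and the last branch ends in a prediction.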

Example:

Study Hours?

More than 5 hours → Pass

5 hours or fewer → Fail

---

Components of a Decision Tree

Root Node

Starting point of the tree. It holds the whole dataset and asks the first question.

Example:

Study Hours?

---

Internal Node

Represents a condition on one feature and sends each sample down the matching branch.

Example:

Attendance > 75%?

---

Leaf Node

Final prediction. A leaf has no further splits.

Example:

Pass

Fail

---

Advantages of Decision Trees

Easy to understand

Easy to visualize

Works with numerical and categorical data

Requires little data preparation

---

Disadvantages

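Can overfit, especially when the tree is grown deep without pruning

Unstable: a small change in the training data can produce a very different tree

Large trees become complex and hard to interpret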

---

Constructing Decision Trees

Steps:

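1. Select the best feature (and split point) using an impurity measure such as Gini impurity or information gain

2. Split the dataset on that feature

3. Create a branch for each outcome of the split

4. Repeat recursively on each branch

5. Stop when a node is pure (all its samples share one class), no useful split remains, or a limit such as maximum depth is reached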
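The steps above can be sketched in plain Python. This is a minimal illustration, not a production implementation: it assumes numeric features, two-way splits of the form "feature <= threshold", and Gini impurity as the selection measure. The function names and the student data are hypothetical.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means the node is pure)."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels):
    """Step 1: try every feature and threshold, keep the lowest weighted impurity."""
    best = None  # (weighted_gini, feature_index, threshold)
    for f in range(len(rows[0])):
        for t in sorted({row[f] for row in rows}):
            left = [y for row, y in zip(rows, labels) if row[f] <= t]
            right = [y for row, y in zip(rows, labels) if row[f] > t]
            if not left or not right:
                continue  # a split must send samples down both branches
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

def build_tree(rows, labels, depth=0, max_depth=3):
    majority = Counter(labels).most_common(1)[0][0]
    # Step 5: stop when the node is pure or the depth limit is reached.
    if len(set(labels)) == 1 or depth == max_depth:
        return majority
    split = best_split(rows, labels)
    if split is None:  # no threshold separates the samples
        return majority
    _, f, t = split
    # Steps 2 and 3: split the dataset and create one branch per side.
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    # Step 4: repeat recursively on each branch.
    return {
        "feature": f,
        "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }

def predict(tree, row):
    """Walk from the root to a leaf; a leaf is stored as a plain label."""
    while isinstance(tree, dict):
        side = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[side]
    return tree

# Hypothetical training data: [study_hours, attendance_percent]
X = [[2, 60], [3, 80], [4, 70], [6, 65], [7, 90], [8, 85]]
y = ["Fail", "Fail", "Fail", "Pass", "Pass", "Pass"]
tree = build_tree(X, y)
print(tree)
print(predict(tree, [7, 70]), predict(tree, [3, 90]))
```

On this data the root splits on study hours, because that feature separates the two classes perfectly while attendance does not.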

---

Classification and Regression Trees (CART)

CART stands for Classification And Regression Trees.

It builds a binary tree (every split has exactly two branches) and is used for two kinds of task:

Classification

Output is a category. Splits are usually chosen to minimise Gini impurity.

Examples:

Pass/Fail

Spam/Not Spam

---

Regression

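Output is a numerical value. Splits are usually chosen to minimise squared error, and each leaf predicts the mean of its training samples.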

Examples:

Salary prediction

House price prediction
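For the regression case, a single split can be sketched as follows. This is a simplified illustration that assumes one numeric feature and squared error as the split criterion; the function names and the salary figures are made up for the example.

```python
def sse(values):
    """Sum of squared errors of the values around their own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_regression_split(xs, ys):
    """Return (cost, threshold, left_mean, right_mean) for the best single split."""
    best = None
    # Skip the largest x: splitting there would leave the right side empty.
    for t in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, t, sum(left) / len(left), sum(right) / len(right))
    return best

# Hypothetical data: years of experience -> salary (in thousands)
experience = [1, 2, 3, 8, 9, 10]
salary = [30, 32, 34, 80, 82, 84]

cost, threshold, left_mean, right_mean = best_regression_split(experience, salary)
print(threshold, left_mean, right_mean)
```

Here the split falls at 3 years, and the two leaves predict 32 and 82, the mean salary on each side.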

---

Ensemble Learning

Ensemble learning combines the predictions of multiple models to create a stronger model.

Idea:

Weak Learners

↓

Combine

↓

Strong Learner

Benefits:

Higher accuracy

Better generalization

Reduced overfitting

---

Types of Ensemble Learning

Bagging

Boosting

---

Bagging (Bootstrap Aggregating)

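Multiple models are trained independently, each on a bootstrap sample (a random sample drawn with replacement from the dataset), and their predictions are combined.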

Process:

Dataset

↓

Random Samples

↓

Many Models

↓

Voting (classification) / Averaging (regression)

↓

Final Prediction
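The process above can be sketched with a very weak model, a single threshold on study hours. This is a toy illustration under stated assumptions: the data, the helper names, and the choice of 25 models are all hypothetical.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items at random WITH replacement."""
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    """Weak model: predict Pass when hours > threshold.
    Returns the threshold that makes the fewest mistakes on the sample."""
    best_t, best_err = None, None
    for t, _ in sample:
        err = sum((hours > t) != passed for hours, passed in sample)
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t

def bagging_predict(thresholds, hours):
    """Each model votes; the majority wins."""
    votes = Counter(hours > t for t in thresholds)
    return votes.most_common(1)[0][0]

# Hypothetical data: (study_hours, passed)
data = [(1, False), (2, False), (3, False), (4, False),
        (6, True), (7, True), (8, True), (9, True)]

rng = random.Random(0)  # fixed seed so the run is repeatable
models = [train_stump(bootstrap_sample(data, rng)) for _ in range(25)]

print(sorted(models))                # the thresholds differ from sample to sample
print(bagging_predict(models, 9.5))  # True: every possible threshold is below 9.5
print(bagging_predict(models, 0.5))  # False: every possible threshold is above 0.5
```

Individual thresholds vary because each model sees a different sample, but the vote smooths this out. For regression the vote would be replaced by an average.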

---

Advantages of Bagging

Reduces variance

Reduces overfitting

Improves stability

---

Example

Random Forest

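The best-known bagging-based algorithm.

Random Forest:

Uses many Decision Trees, each trained on a bootstrap sample and restricted to a random subset of features at each split

Final answer through majority voting (classification) or averaging (regression)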

---

Boosting

Boosting trains weak models sequentially, with each new model concentrating on the mistakes of the models before it.

Idea:

Model 1

↓

Fix Mistakes

↓

Model 2

↓

Fix Mistakes

↓

Model 3

↓

Final Strong Model

---

Advantages of Boosting

High accuracy

Handles complex problems

Improves weak learners

---

Popular Boosting Algorithms

AdaBoost

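Adaptive Boosting. After each round it increases the weights of the misclassified samples, so the next weak learner concentrates on them.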
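A minimal AdaBoost sketch, assuming labels of +1 and -1, one numeric feature, and threshold "stumps" as the weak learners. The data and function names are hypothetical; the weight update follows the standard AdaBoost formula, where a learner with weighted error err gets the vote weight alpha = 0.5 * ln((1 - err) / err).

```python
import math

def stump_predict(x, threshold, sign):
    """Predict `sign` when x > threshold, otherwise the opposite label."""
    return sign if x > threshold else -sign

def best_stump(xs, ys, weights):
    """Weak learner: the (threshold, sign) pair with the lowest weighted error."""
    best = None
    for t in xs:
        for sign in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, t, sign) != y)
            if best is None or err < best[0]:
                best = (err, t, sign)
    return best

def adaboost(xs, ys, rounds=3):
    n = len(xs)
    weights = [1.0 / n] * n          # start with equal sample weights
    ensemble = []                    # list of (alpha, threshold, sign)
    for _ in range(rounds):
        err, t, sign = best_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # vote weight of this learner
        ensemble.append((alpha, t, sign))
        # Misclassified samples (y * prediction == -1) get heavier,
        # correctly classified ones get lighter.
        weights = [w * math.exp(-alpha * y * stump_predict(x, t, sign))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all weak learners."""
    score = sum(alpha * stump_predict(x, t, sign) for alpha, t, sign in ensemble)
    return 1 if score > 0 else -1

# Hypothetical data that no single threshold can separate.
xs = [1, 2, 3, 4, 5, 6]
ys = [1, 1, -1, -1, 1, 1]

model = adaboost(xs, ys, rounds=3)
print([predict(model, x) for x in xs])  # [1, 1, -1, -1, 1, 1]
```

Each stump alone misclassifies at least two of the six points, yet the weighted vote of three stumps classifies all of them correctly.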

---

Gradient Boosting

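Each new model is fitted to the residual errors of the current ensemble (for squared error, the residuals are the negative gradient of the loss), so every step moves the predictions closer to the targets.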
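A rough sketch of the idea for regression with squared error, where "minimizing errors" means repeatedly fitting a small model to what is still unexplained. The stump learner, the learning rate of 0.5, and the salary data are illustrative assumptions.

```python
def fit_stump(xs, residuals):
    """Regression stump: one threshold, each side predicts its mean residual."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the largest x: the right side would be empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        cost = (sum((r - left_mean) ** 2 for r in left)
                + sum((r - right_mean) ** 2 for r in right))
        if best is None or cost < best[0]:
            best = (cost, t, left_mean, right_mean)
    return best[1:]

def gradient_boost(xs, ys, rounds=20, learning_rate=0.5):
    base = sum(ys) / len(ys)         # initial model: predict the mean
    preds = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        # Residuals are the errors of the ensemble built so far.
        residuals = [y - p for y, p in zip(ys, preds)]
        t, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((t, left_mean, right_mean))
        # Take a small step towards the targets.
        preds = [p + learning_rate * (left_mean if x <= t else right_mean)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps

def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(lm if x <= t else rm for t, lm, rm in stumps)

def mse(model, xs, ys):
    return sum((y - predict(model, x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical data: years of experience -> salary (in thousands)
experience = [1, 2, 3, 8, 9, 10]
salary = [30, 32, 34, 80, 82, 84]

for rounds in (0, 1, 20):
    model = gradient_boost(experience, salary, rounds=rounds)
    print(rounds, round(mse(model, experience, salary), 3))
```

The training error shrinks as rounds are added. A smaller learning rate needs more rounds but usually generalizes better.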

---

XGBoost

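Extreme Gradient Boosting: an optimised, regularised implementation of gradient boosting and one of the most widely used boosting libraries.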

Applications:

Data science competitions

Industry projects

---

Difference Between Bagging and Boosting

| Bagging | Boosting |
| --- | --- |
| Models trained independently | Models trained sequentially |
| Reduces variance | Reduces bias |
| Faster | Slower |
| Random Forest | AdaBoost, XGBoost |

---

Probability and Learning

Machine Learning often uses probability.

Probability helps:

Handle uncertainty

Make predictions

Estimate outcomes

Applications:

Spam filtering

Disease prediction

Recommendation systems

---

Data into Probabilities

Example:

80 students passed

20 students failed

Probability of passing:

80 / (80 + 20) = 80/100 = 0.8, i.e. 80%
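The same calculation in code: count how often each outcome occurs and divide by the total. The list of outcomes simply recreates the 80 passes and 20 fails from the example.

```python
from collections import Counter

# Recreate the example: 80 students passed, 20 failed.
outcomes = ["Pass"] * 80 + ["Fail"] * 20

counts = Counter(outcomes)
total = sum(counts.values())

# Relative frequency of each outcome is its estimated probability.
probabilities = {label: count / total for label, count in counts.items()}
print(probabilities)  # {'Pass': 0.8, 'Fail': 0.2}
```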

---

Basic Statistics

Statistics helps understand data.

Important terms:

---

Mean

Average value.

$$\bar{x} = \frac{\sum x}{n}$$

---

Median

Middle value after sorting.

---

Mode

Most frequent value.

---

Variance

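Measures the spread of the data around the mean. The formula below is the population variance (it divides by n); the sample variance divides by n - 1 instead.

$$\sigma^2 = \frac{\sum (x-\bar{x})^2}{n}$$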
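These four measures can be checked with Python's standard `statistics` module. The marks are made-up values chosen so the arithmetic is easy to follow by hand.

```python
import statistics

# Hypothetical marks of five students
marks = [40, 50, 50, 60, 100]

mean_value = statistics.mean(marks)      # (40 + 50 + 50 + 60 + 100) / 5, equals 60
median_value = statistics.median(marks)  # middle value after sorting, equals 50
mode_value = statistics.mode(marks)      # most frequent value, equals 50

# Population variance: divides by n, as in the formula above.
population_variance = statistics.pvariance(marks)  # 2200 / 5, equals 440

# Sample variance: divides by n - 1.
sample_variance = statistics.variance(marks)       # 2200 / 4, equals 550

# The same population variance computed directly from the formula.
manual_variance = sum((x - mean_value) ** 2 for x in marks) / len(marks)

print(mean_value, median_value, mode_value)
print(population_variance, sample_variance, manual_variance)
```

Note how the single high mark of 100 pulls the mean above the median.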

---

Gaussian Mixture Models (GMM)

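A probabilistic clustering algorithm: each point gets a probability of belonging to each cluster (soft clustering).

Assumption: Data is generated from a mixture of several Gaussian distributions. The parameters are usually estimated with the Expectation-Maximization (EM) algorithm.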

Advantages:

Flexible cluster shapes

Often fits better than K-Means when clusters are elongated, overlapping, or of different sizes

Applications:

Image processing

Speech recognition

Pattern recognition
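A GMM is commonly fitted with the Expectation-Maximization (EM) algorithm. The sketch below is a simplified one-dimensional version, assuming two components and hand-picked starting means; the data and function names are hypothetical, and real libraries add safeguards that are omitted here.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def fit_gmm_1d(data, initial_means, iterations=50):
    """EM for a 1D Gaussian mixture with one component per initial mean."""
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: probability that each point belongs to each component.
        resp = []
        for x in data:
            dens = [weights[j] * gaussian_pdf(x, means[j], variances[j]) for j in range(k)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate weight, mean and variance of each component.
        # (Assumes no component ends up with zero total responsibility.)
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(data)
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, 1e-6)  # keep the variance strictly positive
    return weights, means, variances

# Hypothetical data drawn around two centres, roughly 1 and 5.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]

weights, means, variances = fit_gmm_1d(data, initial_means=[0.0, 6.0])
print([round(w, 3) for w in weights])
print([round(m, 3) for m in means])
print([round(v, 3) for v in variances])
```

The fitted means settle close to 1 and 5 with roughly equal weights. Unlike K-Means, every point keeps a probability for both clusters instead of a hard assignment.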

---

Nearest Neighbour Methods

One of the simplest ML techniques.

Most common:

K-Nearest Neighbour (KNN)

Idea:

Find the K training points closest to the new point (for example by Euclidean distance) and assign the class that the majority of those neighbours have.

Example:

New Student

↓

Find 5 nearest students

↓

Majority Vote

↓

Prediction
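The flow above can be sketched in a few lines, assuming Euclidean distance and a simple majority vote. The student data and the function name are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=5):
    """Classify `query` by majority vote among its k nearest training points."""
    # Distance from the query to every stored training point.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical students: (study_hours, attendance_percent)
students = [(2, 60), (3, 65), (4, 70), (3, 55),
            (7, 85), (8, 90), (6, 80), (9, 95)]
results = ["Fail", "Fail", "Fail", "Fail",
           "Pass", "Pass", "Pass", "Pass"]

print(knn_predict(students, results, (7, 88), k=5))  # Pass
print(knn_predict(students, results, (3, 62), k=5))  # Fail
```

Attendance spans a much larger range than study hours here, so it dominates the distance. In practice features are usually scaled to comparable ranges before applying KNN.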

---

Advantages of KNN

Easy to understand

No explicit training phase (a lazy learner: it simply stores the training data)

Good for small datasets

---

Disadvantages of KNN

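Slow at prediction time for large datasets, because distances to all stored points must be computed

Sensitive to irrelevant features and to differences in feature scale

Requires choosing a suitable value of K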

---

Applications of KNN

Recommendation systems

Image recognition

Medical diagnosis

Pattern recognition

---

Important Exam Questions

Short Questions

1. What is a Decision Tree?

2. Define CART.

3. What is Ensemble Learning?

4. Define Bagging.

5. Define Boosting.

6. What is Random Forest?

7. What is KNN?

8. What is GMM?

---

Long Questions

1. Explain Decision Tree construction.

2. Discuss CART with examples.

3. Explain Ensemble Learning.

4. Differentiate Bagging and Boosting.

5. Explain KNN algorithm.

6. Explain Gaussian Mixture Models.

---

Quick Revision

Decision Tree = Tree-based prediction model.

CART = Classification and Regression Trees.

Ensemble Learning = Combining multiple models.

Bagging = Independent model training on bootstrap samples, combined by voting or averaging.

Random Forest = Bagging-based algorithm.

Boosting = Sequential improvement of models.

KNN = Classification by majority vote of the K nearest neighbours.

GMM = Probabilistic clustering with a mixture of Gaussians.

Next Unit 5:

PCA, LDA, Factor Analysis, ICA, Isomap, Genetic Algorithms, Evolutionary Learning, Reinforcement Learning, Markov Decision Process (MDP) — the final unit of Machine Learning and often asked in theory exams.
