Saturday, June 13, 2026

Unit 2: Evaluation Metrics, K-Means, Bayes Learning, Clustering & Feature Reduction

 



From your syllabus.


Evaluation Metrics

Evaluation metrics help us measure how good a machine learning model is.


Confusion Matrix

A table that counts correct and incorrect predictions for each class. Used for classification problems.

Actual \ Predicted   Positive   Negative
Positive             TP         FN
Negative             FP         TN

Where:

  • TP = True Positive
  • TN = True Negative
  • FP = False Positive
  • FN = False Negative
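
The four counts can be tallied directly from paired labels. This is a minimal sketch, assuming binary labels where 1 marks the positive class; the function name and the sample lists are made up for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (TP, FN, FP, TN) for two equal-length lists of binary labels."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive and p == positive:
            tp += 1          # actual positive, predicted positive
        elif a == positive:
            fn += 1          # actual positive, predicted negative
        elif p == positive:
            fp += 1          # actual negative, predicted positive
        else:
            tn += 1          # actual negative, predicted negative
    return tp, fn, fp, tn

# Hypothetical labels, for illustration only.
actual    = [1, 1, 1, 0, 0, 0, 1, 0]
predicted = [1, 0, 1, 0, 1, 0, 1, 0]
print(confusion_counts(actual, predicted))  # (3, 1, 1, 3)
```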

Precision

Measures the fraction of predicted positives that are actually positive: Precision = TP / (TP + FP).

Example: If 100 emails are predicted as spam and 90 are actually spam:

Precision = 90 / 100 = 90%


Recall

Measures the fraction of actual positives that were correctly identified: Recall = TP / (TP + FN).

Example: Out of 100 spam emails, if the system detects 80:

Recall = 80 / 100 = 80%


F1 Score

The harmonic mean of Precision and Recall: F1 = 2 × Precision × Recall / (Precision + Recall).

A higher F1 Score means the model balances the two better. The maximum is 1.
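
The three metrics can be checked with a few lines of code. This sketch reuses the two spam examples from this section (90 correct out of 100 predicted, 80 detected out of 100 actual); combining them into one F1 value is only for illustration, since they describe two separate situations. The function names are hypothetical.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Fraction of actual positives that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0

p = precision(tp=90, fp=10)   # 100 predicted spam, 90 correct
r = recall(tp=80, fn=20)      # 100 actual spam, 80 detected
print(p, r, round(f1_score(p, r), 3))  # 0.9 0.8 0.847
```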


Mean Squared Error (MSE)

Used in regression models.

Measures the average of the squared differences between predicted and actual values.

A smaller MSE means the predictions are closer to the actual values.
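
A minimal sketch of the calculation, with made-up values for illustration:

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between actual and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical actual and predicted values.
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]
print(mean_squared_error(y_true, y_pred))  # (1 + 0 + 4 + 1) / 4 = 1.5
```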


Flexibility vs Interpretability

Flexible Models

Examples:

  • Neural Networks
  • Deep Learning

Advantages:

  • Can reach high accuracy on complex patterns

Disadvantages:

  • Hard to interpret (behave like black boxes)

Interpretable Models

Examples:

  • Linear Regression
  • Decision Trees

Advantages:

  • Easy to understand

Disadvantages:

  • Sometimes less accurate

Reducible and Irreducible Error

Reducible Error

Can be reduced by:

  • Better data
  • Better algorithms

Irreducible Error

Cannot be eliminated.

Caused by:

  • Randomness
  • Noise in data
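
The split between the two errors is often written as a formula. The version below is a sketch under common textbook assumptions that are not stated in these notes: the true relationship is Y = f(X) + ε with zero-mean noise ε, the model's prediction is f̂(X), and both f̂ and X are held fixed.

```latex
E\big[(Y - \hat{f}(X))^2\big]
  = \underbrace{\big[f(X) - \hat{f}(X)\big]^2}_{\text{reducible}}
  + \underbrace{\mathrm{Var}(\varepsilon)}_{\text{irreducible}}
```

A better model shrinks the first term. The second term stays no matter how good the model is.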

Unsupervised Learning

Learning from unlabeled data.

Goal:

  • Discover hidden patterns

K-Means Clustering

One of the most widely used clustering algorithms.

Purpose:

  • Divide data into K groups.

Steps

  1. Choose the number of clusters, K.
  2. Choose K initial centroids.
  3. Assign each point to its nearest centroid.
  4. Move each centroid to the mean of the points assigned to it.
  5. Repeat steps 3 and 4 until the assignments no longer change.

Example:

Students grouped by marks:
Cluster 1 → High Performers
Cluster 2 → Average
Cluster 3 → Low Performers
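
The steps above can be sketched on the marks example. This is a simplified one-dimensional version, not a full implementation: the marks and the starting centroids are made up, and the starting centroids are fixed by hand so the result is repeatable (real implementations usually pick them at random).

```python
def kmeans_1d(points, centroids, max_iter=100):
    """Cluster numbers around the given starting centroids (K = len(centroids))."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for x in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        # Step 4: move each centroid to the mean of its points
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        # Step 5: stop when nothing changes.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical marks and starting centroids (K = 3).
marks = [92, 88, 95, 61, 58, 65, 30, 35, 28]
centroids, clusters = kmeans_1d(marks, centroids=[95, 65, 35])
print(clusters)  # [[92, 88, 95], [61, 58, 65], [30, 35, 28]]
```

The three lists correspond to High Performers, Average and Low Performers.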

Advantages:

  • Simple
  • Fast

Disadvantages:

  • Need to choose K beforehand

Vector Quantization

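Technique for compressing data by replacing each input vector with the nearest entry in a small codebook of representative vectors.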

Applications:

  • Image compression
  • Signal processing
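
A minimal sketch of the idea, assuming a hand-picked codebook of four grey levels (in practice the codebook is usually learned, for example with K-Means). Each pixel is stored as a small index instead of its full value. All names and numbers here are hypothetical.

```python
def quantize(samples, codebook):
    """Replace each sample with the index of its nearest codeword."""
    return [
        min(range(len(codebook)), key=lambda i: abs(s - codebook[i]))
        for s in samples
    ]

def reconstruct(indices, codebook):
    """Approximate the original samples from the stored indices."""
    return [codebook[i] for i in indices]

codebook = [0, 85, 170, 255]           # hypothetical 4-level codebook (2 bits per pixel)
pixels = [12, 200, 90, 250, 130, 40]   # hypothetical 8-bit grey values

indices = quantize(pixels, codebook)
print(indices)                          # [0, 2, 1, 3, 2, 0]
print(reconstruct(indices, codebook))   # [0, 170, 85, 255, 170, 0]
```

The reconstruction is lossy: values are close to the originals but not identical.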

Self Organizing Feature Map (SOFM)

Neural network used for:

  • Visualization
  • Clustering
  • Pattern recognition

Developed by: Teuvo Kohonen

Also called: Kohonen Map


Instance Based Learning

Stores the training examples and classifies a new example by comparing it with the stored ones.

Example:

  • K-Nearest Neighbour (KNN)

Advantages:

  • Simple

Disadvantages:

  • Slow for large datasets
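
A minimal KNN sketch, assuming 2-D points, Euclidean distance and majority voting. The training points and labels are made up for illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """train is a list of (point, label) pairs; return the majority label
    among the k training points closest to query."""
    neighbours = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical training data: two groups of points.
train = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B"),
]
print(knn_predict(train, (2, 2)))  # A
print(knn_predict(train, (6, 5)))  # B
```

Every prediction measures the distance to all stored points, which is why KNN slows down on large datasets.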

Feature Reduction

Reducing the number of features while keeping important information.

Benefits:

  • Faster training
  • Reduced storage
  • Less overfitting
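
One simple way to reduce features is to drop columns that barely change, since they carry little information. The sketch below uses variance thresholding; this is just one illustrative method (others, such as PCA, build new combined features instead). The threshold and the data are assumptions made for the example.

```python
from statistics import pvariance

def drop_low_variance(rows, threshold=0.01):
    """Keep only the columns whose variance exceeds threshold.
    Returns (kept column indices, reduced rows)."""
    columns = list(zip(*rows))
    keep = [i for i, col in enumerate(columns) if pvariance(col) > threshold]
    reduced = [[row[i] for i in keep] for row in rows]
    return keep, reduced

# Hypothetical dataset: the first feature is constant, so it is useless.
rows = [
    [1.0, 5.0, 0.2],
    [1.0, 3.0, 0.9],
    [1.0, 8.0, 0.4],
    [1.0, 6.0, 0.7],
]
keep, reduced = drop_low_variance(rows)
print(keep)     # [1, 2]
print(reduced)  # [[5.0, 0.2], [3.0, 0.9], [8.0, 0.4], [6.0, 0.7]]
```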

Probability in Machine Learning

Probability measures uncertainty.

Range:

0 ≤ Probability ≤ 1
  • 0 = Impossible
  • 1 = Certain

Bayes Learning

Based on Bayes' Theorem: P(A|B) = P(B|A) × P(A) / P(B).

One of the central probability concepts in ML.

Used in:

  • Spam detection
  • Disease prediction
  • Recommendation systems
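
Bayes' Theorem, P(A|B) = P(B|A) × P(A) / P(B), can be tried on the spam detection use case. The probabilities below are invented for illustration: 20% of emails are spam, the word "free" appears in 60% of spam and in 5% of normal emails.

```python
def posterior(prior, likelihood, likelihood_if_not):
    """P(A|B) from P(A), P(B|A) and P(B|not A), using Bayes' Theorem."""
    # P(B) by the law of total probability.
    evidence = likelihood * prior + likelihood_if_not * (1 - prior)
    return likelihood * prior / evidence

# Hypothetical numbers: A = "email is spam", B = "email contains 'free'".
p_spam_given_free = posterior(prior=0.2, likelihood=0.6, likelihood_if_not=0.05)
print(round(p_spam_given_free, 2))  # 0.75
```

Seeing the word "free" raises the probability of spam from 20% to about 75%.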

Clustering

Grouping similar data points.

Applications:

  • Customer segmentation
  • Image processing
  • Market analysis

Adaptive Hierarchical Clustering

Builds a tree of nested clusters (a dendrogram).

Types:

Agglomerative

Start with each point as its own cluster and repeatedly merge the two closest clusters.

Divisive

Start with all points in one cluster and repeatedly split it into smaller clusters.
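
A minimal sketch of the agglomerative approach on one-dimensional data. It assumes single linkage (the distance between two clusters is the distance between their closest members), which is one of several possible choices, and it stops when a requested number of clusters remains. The data is made up.

```python
def agglomerative_1d(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]   # start: every point is its own cluster
    while len(clusters) > n_clusters:
        best = None                    # (distance, i, j) of the closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: closest pair of members.
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]   # merge j into i
        del clusters[j]
    return clusters

points = [1, 2, 9, 10, 25]  # hypothetical values
print(agglomerative_1d(points, 3))  # [[1, 2], [9, 10], [25]]
print(agglomerative_1d(points, 2))  # [[1, 2, 9, 10], [25]]
```

Unlike K-Means, the merge order gives a full hierarchy, so K can be chosen afterwards.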


Gaussian Mixture Model (GMM)

A probabilistic clustering technique that gives each point a soft (fractional) membership in every cluster.

Assumes data is generated from multiple Gaussian distributions.

Advantages:

  • Flexible clusters
  • Can model overlapping or elongated clusters that K-Means handles poorly

Applications:

  • Pattern recognition
  • Speech processing
  • Image segmentation
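
The soft assignment is the key difference from K-Means. The sketch below shows only that one step: given two Gaussians with known parameters, it computes how strongly each one "claims" a point. A full GMM also learns the parameters (typically with the EM algorithm), which is not shown here. All parameter values are assumptions for illustration.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def responsibilities(x, weights, means, stds):
    """Probability that x belongs to each Gaussian component."""
    weighted = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(weighted)
    return [v / total for v in weighted]

# Hypothetical mixture: two equally weighted Gaussians centred at 0 and 4.
weights, means, stds = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]

print(responsibilities(2.0, weights, means, stds))  # halfway: [0.5, 0.5]
print([round(r, 3) for r in responsibilities(0.5, weights, means, stds)])
```

A point halfway between the two centres belongs equally to both, while a point near 0 belongs almost entirely to the first component.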

Important Exam Questions

Short Questions

  1. Define Precision.
  2. Define Recall.
  3. What is F1 Score?
  4. What is MSE?
  5. What is K-Means?
  6. What is Feature Reduction?
  7. State Bayes Theorem.
  8. What is GMM?

Long Questions

  1. Explain Precision, Recall and F1 Score.
  2. Explain K-Means Clustering with steps.
  3. Discuss Bayes Learning.
  4. Explain Gaussian Mixture Models.
  5. Explain Feature Reduction.
  6. Compare K-Means and Hierarchical Clustering.

Quick Revision

  • Precision = Fraction of positive predictions that are correct.
  • Recall = Fraction of actual positives that are found.
  • F1 Score = Balance of Precision and Recall.
  • MSE = Regression error measure.
  • K-Means = Popular clustering algorithm.
  • Bayes Theorem = Probability-based learning.
  • GMM = Advanced clustering method.
  • Feature Reduction = Fewer but important features.

Next Unit 3:

Logistic Regression, Support Vector Machine (SVM), Kernel Functions, Perceptron, Neural Networks, Backpropagation, Deep Neural Networks — the most important ML unit for exams and interviews.
