A central objective in machine learning is to reduce the overall prediction error, which can be decomposed into three parts: bias error, variance error, and irreducible error. Irreducible error represents the unavoidable noise in the data that we cannot eliminate, but we can reduce bias and variance by carefully selecting and tuning our models.
A well-optimized model balances bias and variance to reach the lowest possible total error. This balance can be achieved with techniques such as regularization (L1 and L2 penalties), cross-validation, and ensemble methods. Regularization controls model complexity by adding a penalty for higher complexity, while cross-validation tests the model’s performance on different subsets of the data, giving a more accurate picture of its ability to generalize.
Total error = Bias² + Variance + Irreducible error
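As a rough illustration of the two tools mentioned above, the sketch below (assuming scikit-learn and NumPy are available, and using synthetic house-size data invented for the example) fits an L2-regularized linear model and estimates its generalization error with 5-fold cross-validation. The alpha value and dataset are illustrative assumptions, not figures from this discussion.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(500, 3500, size=(200, 1))             # house size in square feet (synthetic)
y = 50 * X.ravel() + rng.normal(0, 20000, size=200)   # price with noise (synthetic)

# Ridge adds an L2 penalty on the coefficients; a larger alpha means a
# stronger penalty and therefore a simpler (higher-bias, lower-variance) model.
model = Ridge(alpha=1.0)

# 5-fold cross-validation: each fold is held out once as a test set,
# giving a more reliable estimate of generalization error than a single split.
scores = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
print("Mean cross-validated MSE:", -scores.mean())
```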
When developing a machine learning model to forecast house prices from house size, we frequently run into the bias-variance tradeoff, which is essential for assessing the model's performance. Picture a dataset where blue dots represent training samples and orange dots represent test samples.
Suppose we train a model that perfectly fits the blue training dots, capturing every detail and every bit of noise in the data. This is known as an overfitted model. An overfit model may perform exceptionally well on the training data because it tries to match the data exactly, driving the training error to almost zero. However, it performs poorly on test data such as the orange dots because it fails to generalize: the model becomes overly tailored to the training set, which causes a significant increase in test error.
For example, say we calculate the error for one of the orange test points; this error is represented by a gray dotted line (the difference between the actual and predicted values). If you average the errors across all the test points, you might find a high test error of around 100. This high error occurs because the overfitted model is too sensitive to the specifics of the training data and cannot predict new data accurately.
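The following sketch reproduces this scenario on made-up data: a high-degree polynomial is fit to a small training set, and its training and test errors are compared. The dataset, the choice of degree 15, and the resulting error values are illustrative assumptions, not figures from the example above.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(42)
sqft = rng.uniform(1.0, 3.0, size=30)                  # house size (1000s of sq ft, synthetic)
price = 100 + 80 * sqft + rng.normal(0, 10, size=30)   # price (in $1000s) plus noise

# The first 20 points play the role of the blue training dots,
# the remaining 10 the orange test dots.
X_train, y_train = sqft[:20, None], price[:20]
X_test, y_test = sqft[20:, None], price[20:]

# A degree-15 polynomial is flexible enough to chase the noise in the training set.
overfit = make_pipeline(PolynomialFeatures(degree=15), LinearRegression())
overfit.fit(X_train, y_train)

print("Training MSE:", mean_squared_error(y_train, overfit.predict(X_train)))  # very low
print("Test MSE:", mean_squared_error(y_test, overfit.predict(X_test)))        # much larger
```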
Suppose a friend also builds a model using the same methodology but selects a different set of training samples. Both of you fit your models too closely to your respective training sets, so each has almost zero training error. However, on the test data your model might have an average error of 100, while your friend's model has an error of only 27. This difference arises because the test error fluctuates depending on which training points were chosen, a telltale sign of high variance.
High variance means that small changes in the training data lead to large differences in model performance. Even though both models were built with the same approach, their performance on unseen data fluctuates because each learned the specific quirks of its own training set rather than general patterns. This variability in test error is a hallmark of overfitting and is undesirable because it makes the model's predictions unreliable on new data.
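To see this variability concretely, the sketch below fits the same flexible model (a fully grown decision tree, chosen here only for illustration) on two different random training subsets and evaluates both on a fixed test set; the test error typically differs noticeably between the two runs. The data and subset sizes are arbitrary assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(7)
X = rng.uniform(1.0, 3.0, size=(300, 1))                 # house size (synthetic)
y = 100 + 80 * X.ravel() + rng.normal(0, 10, size=300)   # noisy price (synthetic)

X_test, y_test = X[200:], y[200:]                        # fixed held-out test set

# Two "friends" draw two different training samples from the same pool
# and fit the same fully grown (very flexible) decision tree.
for seed in (0, 1):
    idx = np.random.default_rng(seed).choice(200, size=40, replace=False)
    model = DecisionTreeRegressor().fit(X[idx], y[idx])
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"Training subset {seed}: test MSE = {test_mse:.1f}")
```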
Some ML algorithms, such as k-Nearest Neighbors (k-NN), Decision Trees, and Support Vector Machines (SVMs), are known for their low bias and high variance. Because they can capture complex patterns in the data, they easily overfit if not carefully tuned, which makes them quite sensitive to variations in the training data. In contrast, algorithms like Linear Regression and Logistic Regression are characterized by high bias and low variance. These models make stronger assumptions about the data, which can lead to underfitting in some situations, but they tend to be more stable across different training sets, keeping variance low.
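A quick way to observe this difference is to refit each algorithm on many small, resampled training sets and look at the spread of its test errors, as in the hedged sketch below. The dataset, sample sizes, and hyperparameters are arbitrary choices for illustration rather than a definitive benchmark.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(1.0, 3.0, size=(500, 1))                 # house size (synthetic)
y = 100 + 80 * X.ravel() + rng.normal(0, 10, size=500)   # noisy price (synthetic)
X_test, y_test = X[400:], y[400:]                        # fixed test set

models = {
    "k-NN (k=1)": KNeighborsRegressor(n_neighbors=1),   # low bias, high variance
    "Decision tree": DecisionTreeRegressor(),           # low bias, high variance
    "Linear regression": LinearRegression(),            # higher bias, low variance
}

for name, model in models.items():
    errors = []
    for seed in range(20):                               # 20 different training samples
        idx = np.random.default_rng(seed).choice(400, size=50, replace=False)
        model.fit(X[idx], y[idx])
        errors.append(mean_squared_error(y_test, model.predict(X_test)))
    # A larger standard deviation across resamples indicates higher variance.
    print(f"{name}: mean test MSE = {np.mean(errors):.1f}, std = {np.std(errors):.1f}")
```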