How to Achieve a Balance Between Bias and Variance for Improved Model Performance

Nov 1 / Swapnil Srivastava

Introduction

The tradeoff between bias and variance is a crucial concept in machine learning: it helps us understand and manage a model's errors. These errors come from two primary sources, bias and variance, and together they determine a model's performance and its ability to work well with new, unseen data. Finding the right balance between the two is essential for choosing the model complexity that delivers the best performance.

What is Bias? 

Bias refers to the mistake that arises when a model oversimplifies reality. This happens due to the use of simplistic assumptions that hinder the model's ability to grasp the complexities present in the data. In technical terms, bias can be described as the difference between the average predictions made by the model and the actual ground truth.
For instance, imagine you have data that follows a curved trend, but you fit a straight line to it. This linear model would ignore the nuances in the data, leading to large errors in prediction. This model exhibits a significant bias due to its overly simplistic nature, preventing it from accurately representing the underlying patterns. As a result, it often leads to underfitting, where the model does not adequately capture crucial relationships within the data.

The Mathematics of Bias

Suppose we have a set of actual target values Y, and the predicted values by our model are denoted as Ŷ. Bias reflects the difference between the expected prediction of the model and the actual target values. Mathematically, bias can be expressed as:
Bias = E[Ŷ] − Y
Here:
● E[Ŷ] represents the expected value of the model's predictions, averaged over different training datasets.
● Y is the actual value of the target variable.
Bias measures how far the predicted values Ŷ deviate, on average, from the actual target values Y.
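To make this concrete, here is a minimal sketch in Python (using NumPy and scikit-learn on synthetic data, not anything from this article) that estimates E[Ŷ] by averaging a linear model's predictions over many training sets drawn from a known quadratic relationship, then subtracts the true targets to get the bias.

```python
# Hedged sketch: estimating Bias = E[Ŷ] − Y on synthetic quadratic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x_test = np.linspace(-3, 3, 50).reshape(-1, 1)
y_true = x_test.ravel() ** 2                     # actual targets Y (noise-free)

preds = []
for _ in range(200):                             # many independent training sets
    x_train = rng.uniform(-3, 3, (100, 1))
    y_train = x_train.ravel() ** 2 + rng.normal(0, 1, 100)
    preds.append(LinearRegression().fit(x_train, y_train).predict(x_test))

expected_pred = np.mean(preds, axis=0)           # E[Ŷ] at each test point
bias = expected_pred - y_true                    # Bias = E[Ŷ] − Y
print("mean squared bias:", np.mean(bias ** 2))  # stays large: a line can't track x²
```

Because a straight line cannot follow the quadratic curve, the squared bias stays large no matter how many training sets are averaged.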

How Bias Affects Model Performance

A model with high bias performs poorly on the training data and struggles with new, unseen data. By oversimplifying the data, it overlooks essential patterns and makes generalizations that often don’t hold. This can be likened to solving a complex problem with an overly simplistic method — it may sometimes work but fails to deliver in a broader context. For example, using a linear regression model on a dataset with a non-linear relationship between variables will result in high bias, as the model’s assumptions (linearity) are incorrect. This oversimplification can cause the model to consistently underpredict or overpredict, regardless of the training data it receives.
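As a rough illustration of that scenario (synthetic data, not a prescribed implementation), fitting a plain linear regression to a sine-shaped relationship leaves a large error on the training set itself as well as on held-out data:

```python
# Underfitting sketch: a straight line fit to a curved (sine) trend.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)
X = rng.uniform(0, 6, (200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)   # non-linear relationship

X_train, X_test, y_train, y_test = X[:150], X[150:], y[:150], y[150:]
linear = LinearRegression().fit(X_train, y_train)

# Both errors stay high because the linearity assumption itself is wrong.
print("train MSE:", mean_squared_error(y_train, linear.predict(X_train)))
print("test MSE: ", mean_squared_error(y_test, linear.predict(X_test)))
```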

What is Variance?

In machine learning, variance describes the variability in a model's predictions: how much those predictions change when the model is trained on different subsets of the training dataset. Ideally, a model should generalize well to unseen data. A model with high variance is overly sensitive to minor details in the training data, treating noise as a genuine signal. This results in overfitting, where the model matches the training data almost perfectly but struggles to perform well on new, unseen data.
Consider training a sophisticated decision tree model that can account for even the most minor variations in the data. While the model might achieve perfect accuracy on the training data, its performance could drop significantly when presented with new data. This is because the model "memorizes" the training data instead of learning general patterns, making it fragile to variations in future data.
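A quick sketch of that behaviour on made-up data, using an unconstrained scikit-learn decision tree: it reproduces its training set almost exactly but does noticeably worse on a held-out split.

```python
# Memorization sketch: a fully grown decision tree vs. held-out data.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, (300, 1))
y = X.ravel() ** 2 + rng.normal(0, 1.0, 300)      # quadratic signal + noise

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
tree = DecisionTreeRegressor(random_state=0).fit(X_tr, y_tr)  # no depth limit

print("train MSE:", mean_squared_error(y_tr, tree.predict(X_tr)))  # close to 0
print("test MSE: ", mean_squared_error(y_te, tree.predict(X_te)))  # noticeably larger
```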

The Mathematics of Variance

To understand variance more formally, let's define it mathematically. Suppose we have the actual values of the target variable denoted as Y and the predicted values of the target variable as Ŷ. The variance of the model is the expected value of the squared difference between the predicted values Ŷ and the expected value of those predicted values, E[Ŷ].
In mathematical terms, this is written as:
Variance = E[(Ŷ − E[Ŷ])²]
E[Ŷ] represents the average of the predicted values across all training data subsets. The expected value represents the "central tendency" of the predictions, while the variance indicates the extent to which the model's predictions vary from this central value. A model with high variance will exhibit considerable fluctuations in predictions across various data subsets, whereas a model with low variance will demonstrate greater consistency.
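The same quantity can be estimated by simulation. The sketch below (synthetic data; the unconstrained decision tree is just an example of a flexible model) trains the same model on many freshly drawn training sets and measures how much its predictions at fixed test points move around their average.

```python
# Variance sketch: Variance = E[(Ŷ − E[Ŷ])²], estimated over 200 training sets.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
x_test = np.linspace(-3, 3, 50).reshape(-1, 1)

preds = []
for _ in range(200):                              # a new training set each time
    x_train = rng.uniform(-3, 3, (100, 1))
    y_train = x_train.ravel() ** 2 + rng.normal(0, 1, 100)
    preds.append(DecisionTreeRegressor(random_state=0)
                 .fit(x_train, y_train).predict(x_test))

preds = np.array(preds)                           # shape (200, 50)
expected_pred = preds.mean(axis=0)                # E[Ŷ] at each test point
variance = ((preds - expected_pred) ** 2).mean(axis=0)
print("average prediction variance:", variance.mean())
```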

How Variance Affects Model Performance

High variance typically arises when a model is overly complex and tries to fit every slight variation in the training data. While this can yield high accuracy on the training set, the model generalizes poorly to new data. Overfitting frequently arises in models such as decision trees, neural networks, or polynomial regression when there are too many parameters relative to the amount and complexity of the data. For instance, a high-degree polynomial may fit the training data almost perfectly, yet produce erratic predictions on new test data. This occurs because the model concentrates on the details and noise in the training data instead of capturing the genuine underlying patterns.
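A small illustrative experiment (hypothetical data; the degree-11 polynomial is chosen only to force near-interpolation of 12 points) shows the pattern: training error collapses while test error typically blows up.

```python
# Overfitting sketch: low-degree vs. near-interpolating polynomial regression.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(4)
X = np.sort(rng.uniform(0, 1, (12, 1)), axis=0)             # only 12 training points
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 12)
X_test = rng.uniform(0, 1, (200, 1))
y_test = np.sin(2 * np.pi * X_test).ravel() + rng.normal(0, 0.2, 200)

for degree in (1, 11):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X, y)
    train_mse = mean_squared_error(y, model.predict(X))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```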

Understanding the Tradeoff

The bias-variance tradeoff emphasizes the importance of balancing model complexity to achieve optimal performance. This tradeoff arises from the need to minimize both bias and variance while avoiding overfitting and underfitting. Understanding this balance is essential for creating models that perform well on new, unseen data.
At its core, the tradeoff between bias and variance revolves around model complexity. A model cannot simultaneously be highly complex and highly simple. When we increase a model's complexity to lower bias, we unintentionally raise variance, and the opposite is also true. The main challenge in machine learning is identifying the appropriate complexity level that harmonizes these two conflicting forces.
Relying on a very complex model can lead to overfitting, where the model excels on training data but performs poorly on test data because of its elevated variance. Conversely, an overly simple model may underfit, making overly strong assumptions about the data and missing crucial patterns, which results in high bias.
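One way to see the tradeoff is to sweep complexity and watch the two errors diverge. The sketch below (synthetic sine data, arbitrary polynomial degrees) typically shows training error falling steadily while test error falls and then rises again, the classic U-shape.

```python
# Complexity-sweep sketch: training error vs. test error as degree grows.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(5)
X = rng.uniform(0, 1, (20, 1))
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.2, 20)
X_test = rng.uniform(0, 1, (200, 1))
y_test = np.sin(2 * np.pi * X_test).ravel() + rng.normal(0, 0.2, 200)

for degree in (1, 3, 5, 10, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X, y)
    print(f"degree {degree:2d}: "
          f"train {mean_squared_error(y, model.predict(X)):.3f}  "
          f"test {mean_squared_error(y_test, model.predict(X_test)):.3f}")
```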

Minimizing Total Error

The main objective in machine learning is to minimize the total error, which can be divided into three parts: bias error, variance error, and irreducible error. Irreducible error represents the unavoidable noise in the data that we cannot eliminate, but we can reduce bias and variance by carefully selecting and tuning our models.
A well-optimized model achieves this balance, minimizing bias and variance to reach the lowest total error possible. This balance can be achieved through regularization (like L1 and L2), cross-validation, and ensemble methods. Regularization helps control the model complexity by adding a penalty for higher complexity. At the same time, cross-validation ensures that the model’s performance is tested on different subsets of the data, giving a more accurate reflection of its generalization ability.
Total error = Bias² + Variance + Irreducible error
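This decomposition can be checked by simulation when the true function and noise level are known. In the sketch below (synthetic quadratic data with noise variance 1 and a deliberately biased linear model, all assumptions of this example), the sum bias² + variance + irreducible error comes out close to the test error measured directly.

```python
# Decomposition sketch: Total error ≈ Bias² + Variance + Irreducible error.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(6)
x_test = np.linspace(-3, 3, 50).reshape(-1, 1)
f_test = x_test.ravel() ** 2                      # true, noise-free function
noise_var = 1.0                                   # irreducible error (known here)

preds, test_mses = [], []
for _ in range(500):                              # many training/test realizations
    x_tr = rng.uniform(-3, 3, (100, 1))
    y_tr = x_tr.ravel() ** 2 + rng.normal(0, 1, 100)
    p = LinearRegression().fit(x_tr, y_tr).predict(x_test)
    preds.append(p)
    y_te = f_test + rng.normal(0, 1, 50)          # fresh noisy test targets
    test_mses.append(np.mean((y_te - p) ** 2))

preds = np.array(preds)
bias_sq = np.mean((preds.mean(axis=0) - f_test) ** 2)
variance = np.mean(preds.var(axis=0))
print("bias² + variance + noise:", bias_sq + variance + noise_var)
print("measured total test MSE :", np.mean(test_mses))   # should be close
```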

Real World Examples

When developing a machine learning model to forecast house prices based on house size, we frequently face the bias-variance tradeoff, which is essential for assessing the model's performance. Picture a dataset where blue dots signify training samples and orange dots indicate test samples.

Overfitting and Nonlinear Models

Suppose we train a model that perfectly fits these blue training dots, capturing every detail and noise in the data. This is known as an overfitted model. An overfit model may perform exceptionally well on the training data because it tries to match the data exactly, leading to almost zero training error. However, this model performs poorly on test data, such as the orange dots, because it fails to generalize. The model becomes overly tailored to the training set, which causes a significant increase in test error.
For example, let's say we calculate the error for one of the orange test points, represented by a gray dotted line (the difference between the actual and predicted values). If you average the errors across all the test points, you might find a high test error of around 100. This high error occurs because the overfitted model is too sensitive to the specifics of the training data and cannot predict new data accurately.
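A hedged sketch of this picture, on made-up house-size data (the exact error values will not match the illustrative 100 above): a very high-degree polynomial fit to a handful of training houses reaches a near-zero training error and a much larger test error.

```python
# House-price overfitting sketch (hypothetical data, size in 1000s of sq ft,
# price in $1000s); the degree-12 polynomial all but interpolates 15 points.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(7)
size = rng.uniform(0.5, 3.5, (15, 1))                      # "blue" training houses
price = 50 + 100 * size.ravel() + rng.normal(0, 20, 15)
size_test = rng.uniform(0.5, 3.5, (15, 1))                 # "orange" test houses
price_test = 50 + 100 * size_test.ravel() + rng.normal(0, 20, 15)

overfit = make_pipeline(PolynomialFeatures(12), LinearRegression()).fit(size, price)
print("train MSE:", mean_squared_error(price, overfit.predict(size)))            # ~0
print("test MSE: ", mean_squared_error(price_test, overfit.predict(size_test)))  # large
```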

High Variance in Overfitting

Now suppose a friend builds a model using the same methodology but selects a different set of training samples. Both of you fit your models very closely to your respective training sets, leading to almost no errors during training. However, when you evaluate the test data, your model could have an average error of 100, while your friend's model might only have an error of 27. This difference arises because test error can fluctuate depending on the choice of training data points, a well-known sign of high variance.
High variance means that small changes in the training data lead to significant differences in model performance. Even though both models are built using the same approach, their performance on unseen data fluctuates because they have learned specific patterns rather than general patterns in their respective training sets. This variability in test error is a common issue with overfitting and is undesirable because it makes the model's predictions unreliable on new data.
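The same experiment can be simulated: train the identical overfitting recipe on two different training draws and score both against one shared test set. The exact numbers will differ from the 100-versus-27 illustration above, but the gap between the two test errors is the point.

```python
# Variance-across-training-sets sketch (hypothetical house-price data).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(8)

def sample(n):
    size = rng.uniform(0.5, 3.5, (n, 1))                   # size in 1000s of sq ft
    price = 50 + 100 * size.ravel() + rng.normal(0, 20, n)
    return size, price

size_test, price_test = sample(50)                         # shared test set
for who in ("your model", "friend's model"):
    size_tr, price_tr = sample(15)                         # a different training draw
    model = make_pipeline(PolynomialFeatures(12),
                          LinearRegression()).fit(size_tr, price_tr)
    print(who, "test MSE:", mean_squared_error(price_test, model.predict(size_test)))
```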

Using Linear Models to Reduce Variance

One way to tame this variance is to fit a simpler model, such as a straight line, to the same data. A linear model will not pass through every blue training dot, so its training error is higher than the overfitted model's. In exchange, it is far less sensitive to which training samples happen to be drawn: if you and your friend each fit a line to your respective training sets, the two lines, and their test errors, will end up much closer to each other. This stability is the hallmark of low variance, bought at the price of some additional bias, and it is usually the better trade when the extra bias is small.
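Here is a minimal counterpart to the previous sketch, using the same hypothetical house-price draws but fitting a plain straight line instead of the degree-12 polynomial; the two test errors are higher than zero but land close together.

```python
# Low-variance sketch: the same two training draws, fitted with a straight line.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(8)                             # same draws as above

def sample(n):
    size = rng.uniform(0.5, 3.5, (n, 1))
    price = 50 + 100 * size.ravel() + rng.normal(0, 20, n)
    return size, price

size_test, price_test = sample(50)
for who in ("your model", "friend's model"):
    size_tr, price_tr = sample(15)
    model = LinearRegression().fit(size_tr, price_tr)
    print(who, "test MSE:", mean_squared_error(price_test, model.predict(size_test)))
```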

Examples of Bias and Variance in Machine Learning Algorithms

Some ML algorithms, such as k-Nearest Neighbors (k-NN), Decision Trees, and Support Vector Machines (SVMs), are recognized for their low bias and high variance. These algorithms can identify complex patterns within the data, so they easily overfit if not carefully tuned. Consequently, they exhibit low bias but high variance, making them quite sensitive to variations in the training data. In contrast, algorithms like Linear Regression and Logistic Regression are characterized by high bias and low variance. These models make stronger assumptions about the data, which can lead to underfitting in some situations, but they tend to be more stable across different training sets, thereby minimizing variance.
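As a rough check of these tendencies (synthetic non-linear data; a fully grown tree stands in for the flexible family and ordinary least squares for the rigid one, both assumptions of this example), the sketch below estimates bias² and variance for each by retraining the model on many fresh training sets.

```python
# Bias/variance comparison sketch: flexible tree vs. rigid linear model.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(9)
x_test = np.linspace(-3, 3, 100).reshape(-1, 1)
f_test = 3 * np.sin(x_test).ravel()                        # true non-linear function

models = {
    "decision tree": lambda: DecisionTreeRegressor(random_state=0),
    "linear regression": lambda: LinearRegression(),
}
for name, make_model in models.items():
    preds = []
    for _ in range(200):                                   # fresh training set each run
        x_tr = rng.uniform(-3, 3, (100, 1))
        y_tr = 3 * np.sin(x_tr).ravel() + rng.normal(0, 0.5, 100)
        preds.append(make_model().fit(x_tr, y_tr).predict(x_test))
    preds = np.array(preds)
    bias_sq = np.mean((preds.mean(axis=0) - f_test) ** 2)
    variance = np.mean(preds.var(axis=0))
    print(f"{name:17s}  bias² {bias_sq:.3f}   variance {variance:.3f}")
```

The tree typically prints a small bias² with a larger variance, while linear regression prints the opposite, matching the qualitative description above.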