Dec 7 / Kumar Satyam

Understanding Overfitting and Underfitting in Machine Learning Models

Introduction

The art of machine learning lies in designing systems that accurately predict the outcomes of previously unseen cases. This ability to work well with new data is called generalization. To achieve it, models learn from past data by identifying patterns and relationships within it, and these patterns help the model make predictions on future data. It is essential, however, to keep this learning process in equilibrium. A model can become so familiar with the training data that it learns even the noise and other non-essential details, a phenomenon known as overfitting. An overfitted model may excel within the confines of the training data but will struggle to generalize beyond it and make correct predictions on new, unseen datasets. On the other hand, underfitting occurs when a model is so simple that it cannot absorb sufficient detail from the training data; such a model performs poorly both on the data it was trained on and on new data. Avoiding both overfitting and underfitting is essential for building a reliable machine-learning model. The aim is a scenario in which the model is neither too complex nor too simple, but falls within the range that allows it to make correct predictions about new data.

Overfitting

Overfitting refers to a situation where a model learns the training data too closely, capturing even trivial details such as noise or outliers. As a result, the model becomes tailored to the training data and loses its ability to generalize to new, unseen data. It may perform excellently on the training data but fail badly elsewhere because it has not internalized the broad patterns. Overfitting therefore impedes the development of predictive models that are reliable outside the training environment.
Causes of Overfitting:
  1. Too Complex Models: Overfitting often happens when models are too complex. For example, using a high-degree polynomial in regression tasks can make the model fit the training data almost perfectly, yet perform poorly on new data (the sketch after this list makes this concrete). Complex models have many parameters and can capture every detail, including irrelevant noise that doesn't reflect the actual pattern.
  2. Insufficient Training Data: If there is not enough training data, a complex model may overfit by latching onto the unique quirks of the limited data available. Instead of learning general patterns, the model memorizes the training examples.
  3. Excessive Training: Training a model for too long can also cause overfitting. As the model keeps learning, it might start memorizing specific details and noise in the training data instead of understanding the general patterns.
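
To make the first cause concrete, here is a minimal sketch, assuming a synthetic noisy sine wave and scikit-learn purely for illustration. It fits the same 30 points with a degree-3 and a degree-15 polynomial; the higher degree typically drives the training error toward zero while the test error grows:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))                       # small training set
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=30)     # noisy sine wave
X_test = rng.uniform(-3, 3, size=(200, 1))                 # fresh, unseen data
y_test = np.sin(X_test).ravel() + rng.normal(scale=0.3, size=200)

for degree in (3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X, y)
    train_mse = mean_squared_error(y, model.predict(X))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    # The degree-15 model fits the 30 training points almost perfectly,
    # but its test error is typically much worse than the degree-3 model's.
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```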

Overfitting Examples: 

It occurs when a model pays too much attention to the training set, remembering every detail in it and thus performing poorly in new situations. Take a model that has learned to detect dogs in images. If most of the photos it was trained on show dogs in a park, the model might learn that any picture containing grass has a dog, causing it to overlook dogs in images taken indoors.
A different scenario is a model estimating a student's academic performance using factors like parental earnings and educational level. When the model is trained with data that includes only a few specific gender or ethnic groups, it is often valid only for that subgroup. Therefore, it may not be able to predict students' performance from different groups correctly, as it was not trained to generalize out of this particular set.

Underfitting

In machine learning, underfitting refers to a model's inability to capture the underlying trends in the data because it is too simple. Such a model performs poorly on both familiar and unfamiliar data, as it has not learned enough to make accurate forecasts. This is common when the model cannot represent the true complexity of the data. Because an underfitted model has not captured any workable patterns, it has little practical use.
Causes of Underfitting:
  1. Too Simple Models: When the model is too basic for the complexity of the data, underfitting usually follows. For instance, fitting a linear model to a complex, non-linear dataset results in underfitting because the model is not flexible enough to accommodate the relationships in the data (see the sketch after this list).
  2. Insufficient Features or Wrong Choice of Features: Underfitting can occur if the model doesn't have enough features or uses the wrong ones. The model will struggle to make correct predictions if the chosen features don’t represent the data well. This can happen if important details are omitted or irrelevant information is included, leading to poor predictions.
  3. Not Enough Training Time or Under-Trained Models: Underfitting can also happen when the model is not given sufficient time to learn. If the training duration is short or the number of iterations over the dataset is low, the model may fail to pick up the underlying patterns. This issue is especially pronounced for advanced architectures such as deep networks, which usually require longer training to reach good performance.
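
A minimal sketch of the first cause, assuming synthetic quadratic data purely for illustration: a straight line fitted to a curved relationship scores poorly even on its own training data, which is the hallmark of underfitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.2, size=200)  # quadratic relationship

# A linear model cannot bend to follow the curve, so even the
# training R^2 stays near zero on this symmetric quadratic data.
model = LinearRegression().fit(X, y)
print(f"train R^2 = {r2_score(y, model.predict(X)):.3f}")
```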

Techniques to Reduce Underfitting:

  • Make the Model More Complex: Choose a more expressive model that can better capture the patterns in the data.
  • Add More Features: Include relevant features or improve the current ones to give the model better information (sketched after this list).
  • Clean the Data: Remove any irrelevant or misleading data so the model can focus on the essential patterns.
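
Picking up the underfitting sketch above, the snippet below illustrates the "add more features" remedy: the same linear model, given a squared feature via scikit-learn's PolynomialFeatures (one possible choice, assumed here), now captures the curvature.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(200, 1))
y = X.ravel() ** 2 + rng.normal(scale=0.2, size=200)

# Same linear model as before, but with an added x^2 feature:
# the fit improves from near-zero R^2 to close to 1.0.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
print(f"train R^2 = {r2_score(y, model.predict(X)):.3f}")
```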

Identifying Overfitting and Underfitting:

There are two primary types of identification techniques:

 Visualization Techniques:

Learning Curves: Plotting learning curves, which show how the model's error or accuracy on the training and validation sets changes over the course of training, helps identify overfitting and underfitting (one way to compute them is sketched after the points below).
  • Overfitting: If the training error is low but the validation error stays high or rises as training continues, the model is overfitting. It has learned the training data well but does not generalize to new data.
  • Underfitting: The model is underfitting if both the training and validation errors are high and change little over time. This means the model is too simplistic and fails to represent the essential trends in the data.
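
One way to compute such curves is sketched below, using scikit-learn's learning_curve helper with its digits dataset and an unpruned decision tree, all assumed purely for illustration. A persistent gap between training and validation accuracy points to overfitting; two low, flat curves would point to underfitting.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# Accuracy on the training folds vs. the validation folds
# as the amount of training data grows.
sizes, train_scores, val_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    cv=5, train_sizes=np.linspace(0.1, 1.0, 5),
)
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    # Training accuracy near 1.0 with a much lower validation
    # accuracy is the classic overfitting signature.
    print(f"n={n:4d}  train acc={tr:.3f}  val acc={va:.3f}")
```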
Validation Curves: Another helpful instrument is the validation curve, which plots a performance measure such as error or accuracy against model complexity, for instance the degree of a polynomial in regression or the number of layers in a neural network (a sketch follows the points below).
  • Overfitting: If the validation curve shows a big difference between training and validation errors—low error on the training data but high error on the validation data—as the model becomes more complex, it indicates overfitting. This means the model is too focused on the training data and doesn’t generalize to new data.
  • Underfitting: If the error is high on both the training and validation data no matter how complex the model gets, it likely indicates underfitting. This suggests the model is too simple and doesn't capture the data's key patterns.
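
A minimal validation-curve sketch, sweeping tree depth as the complexity knob (scikit-learn's validation_curve and digits dataset, assumed here for illustration): shallow trees score poorly on both sets (underfitting), while very deep trees open a gap between training and validation accuracy (overfitting).

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# Sweep model complexity (tree depth) and compare
# training vs. validation accuracy at each setting.
depths = [1, 2, 4, 8, 16, 32]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5,
)
for d, tr, va in zip(depths, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"max_depth={d:2d}  train acc={tr:.3f}  val acc={va:.3f}")
```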

 Performance Metrics:

  • Overfitting: Overfitting happens when a model is trained to perfection on the training dataset but fails to reproduce the same performance level on the validation or test datasets. This is because the model has memorized the training dataset, including its noise, and hence cannot apply what it has learned to any new data.
  • Underfitting: A model with low accuracy on the training and validation/test data likely indicates underfitting. This suggests the model is too simple and isn't capturing the essential patterns in the data, leading to poor performance overall (see the sketch below).
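
In practice, the simplest check is to compare the same metric on the training set and on held-out data. A minimal sketch, assuming scikit-learn's digits dataset and an unpruned decision tree purely for illustration:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unpruned tree memorizes the training set: training accuracy
# is ~1.0, while test accuracy is noticeably lower -> overfitting.
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(f"train acc = {model.score(X_train, y_train):.3f}")
print(f"test acc  = {model.score(X_test, y_test):.3f}")
```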

How do I avoid overfitting in my models?

  1. Simplify the Model: Use a simpler model with fewer parameters, as less complex models are less likely to overfit. For example, choose a linear model instead of a complicated one if the data doesn't exhibit complex relationships.
  2. Increase Training Data: Provide more varied and representative training data so the model learns general patterns instead of just memorizing specifics.
  3. Cross-Validation: Use k-fold cross-validation to test the model on different parts of the data, ensuring it performs well overall and not just on one subset (sketched after this list).
  4. Early Stopping: Watch the model’s performance on validation data during training and stop when validation error starts rising, which helps prevent overfitting.
  5. Data Augmentation: In areas like image recognition, create variations of your data (like rotating or flipping images) to help the model generalize better.
  6. Ensemble Methods: Combine predictions from multiple models, as in random forests, to reduce overfitting by averaging out the noise.
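
To make the cross-validation item concrete, here is a minimal sketch of 5-fold cross-validation, assuming scikit-learn's digits dataset and a depth-limited decision tree purely for illustration. Each fold serves once as held-out data, so a model that merely memorizes one split cannot score well everywhere.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)

# 5-fold cross-validation: five train/validation splits,
# five accuracy scores, one per held-out fold.
model = DecisionTreeClassifier(max_depth=8, random_state=0)
scores = cross_val_score(model, X, y, cv=5)
print(f"fold accuracies: {scores.round(3)}")
print(f"mean = {scores.mean():.3f} +/- {scores.std():.3f}")
```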