Benefits of cross-validation in model selection

Oct 29 / Swapnil Srivastava

What is cross-validation?

In simple terms, cross-validation is an indispensable method in machine learning for estimating how well a model will perform on previously unseen data. It aims to ensure that the results of a statistical analysis generalize beyond the given dataset, leading to better and more dependable predictive modeling.
In cross-validation, the available data is split into several parts, known as "folds." First, the data is divided into k roughly equal folds. In each round of the cross-validation process, k-1 folds are used for training while the remaining fold serves as the validation set.
This yields k iterations in total, with each fold used as the validation set exactly once. Once this is complete, the metrics calculated during each validation step, such as accuracy, precision, recall, and F1 score, are averaged across the k iterations to give a more stable estimate of model performance than any single train-test split would.
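As a concrete illustration, here is a minimal sketch of k-fold cross-validation with scikit-learn; the dataset is synthetic and the logistic regression model is just a placeholder for whatever estimator you want to evaluate:

```python
# Minimal sketch of k-fold cross-validation (synthetic data, placeholder model).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

model = LogisticRegression(max_iter=1000)
kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Each of the k iterations trains on k-1 folds and scores on the held-out fold.
scores = cross_val_score(model, X, y, cv=kfold, scoring="accuracy")

print("Per-fold accuracy:", scores)
print("Mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))
```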

Why is Cross-Validation Important?

The biggest question when examining the performance of an ML model is which data was used to evaluate it. Evaluating a model on the same data it was trained on is not a good idea, for several reasons. The main problem is that doing so implicitly assumes the training set covers everything the model will encounter in the real world, which is never the case. No dataset represents all the possible situations your model may experience in production. The goal is for the model to make good predictions on new data it has never seen, not just to do well on the data it already knows.
Even though it comes from the real world, training data is only a tiny sample of all possible data points. If we test a model on the same data it was trained on, we are likely to overestimate its actual performance: the model may have memorized patterns, trends, or specific examples from the training data instead of generalizing to new, out-of-sample scenarios. To properly assess a model's performance, it should be tested on new data. This new data, often called the "testing set" or test data, is a more accurate indicator of how well the model will work in real-world applications.
This usually means dividing the complete dataset into one part used for training and another reserved for testing. A standard split is a 70–30 ratio, with the larger portion used for training and the smaller one for testing, though the exact ratio can vary depending on the dataset size and model complexity.
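For reference, a 70–30 split like this can be done in one call with scikit-learn's train_test_split; the synthetic data below is only for illustration:

```python
# Sketch of a 70-30 train/test split on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Hold out 30% of the data for testing; stratify keeps class proportions similar in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)
print(X_train.shape, X_test.shape)  # (700, 20) (300, 20)
```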
Holding out data also helps prevent fitting the training data too closely, better known as overfitting, since we still have a test set to evaluate how well the predictions hold up. A final evaluation on the test set gives an unbiased measure of how well our model will perform when it sees new data.
Cross-validation imparts robustness to our evaluation beyond a single train/validation/test split. K-fold cross-validation breaks the data into k subsets and trains the model k times, each time using one subset for validation or testing and the remaining subsets for training. This prevents the result from being biased toward one particular train-test split.

Benefits of Cross-Validation

The first and most important benefit of cross-validation is that it helps detect and prevent overfitting. Overfitting occurs when a model is so closely optimized to the training data that it cannot generalize beyond what it has learned. Cross-validation gives a better estimate of how the model will perform on new data: we train the model on one portion of the dataset, evaluate it on the rest, and repeat this process across the folds. The result is a more realistic expectation of what happens when the model is applied to new, unseen data.
The second benefit is model selection. When training multiple models, or tuning different versions of the same model, cross-validation helps identify the one that consistently performs best across various subsets of the data. This consistency matters because it indicates the model has not overfit a particular subset but has instead learned generalizable patterns. Rather than picking the model that happens to do best on a single test set, cross-validation lets us select the one that performs well on average across multiple splits, which makes the choice more robust.
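A rough sketch of this kind of comparison, assuming scikit-learn and a synthetic dataset, might look like this:

```python
# Sketch: compare candidate models by their average cross-validated score,
# rather than by a single train/test split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    # Pick the model that performs best on average, not on one lucky split.
    print(f"{name}: mean AUROC = {scores.mean():.3f} (std = {scores.std():.3f})")
```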
A significant advantage of cross-validation is its importance in hyperparameter tuning. Hyperparameters are settings external to the model, like the regularization strength in logistic regression or the number of trees in a random forest. These settings aren't derived from the training data but must be established before the learning process. Cross-validation aids in fine-tuning these hyperparameters by allowing the evaluation of various configurations on validation sets, helping to identify those that yield the best performance.
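As an illustration, here is a minimal sketch of tuning the regularization strength of a logistic regression with scikit-learn's GridSearchCV; the parameter grid and data are placeholders:

```python
# Sketch of hyperparameter tuning with cross-validation via GridSearchCV.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}  # candidate regularization strengths
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid,
    cv=5,                 # 5-fold cross-validation for each candidate setting
    scoring="accuracy",
)
search.fit(X, y)

print("Best hyperparameters:", search.best_params_)
print("Best cross-validated accuracy: %.3f" % search.best_score_)
```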
Cross-validation also makes effective use of the available data. Traditional validation methods typically split the dataset into static training and testing sets, which restricts the data available for training; cross-validation, by cycling through different subsets, allows the entire dataset to contribute to both training and validation. This makes better use of the data and ensures the model encounters a broader range of patterns during training.

Practical Tips for Implementing Cross-Validation

When preparing data for machine learning models, it's crucial to split the dataset logically to prevent data leakage, ensure robust validation, and maximize model performance. Let's discuss several considerations for separating data and selecting the correct cross-validation (CV) methods based on various data types.

Logical Splitting of Data

The splitting method you use must make sense within the context of your data. For example, a simple random split between training, validation, and test sets is generally sufficient when working with non-time-series data. However, if your dataset has inherent groupings, such as data from individual subjects or images from multiple categories, you must ensure these groupings don't cause leakage.

Cross-Validation (CV) Methods

When selecting the proper CV method, the choice should be tailored to the structure and characteristics of your data. Standard K-fold CV works well for balanced, independent datasets. However, this method can introduce bias in structured datasets like time series or grouped data. For grouped data, grouped k-fold cross-validation should be employed to ensure the groups (like patient IDs or company accounts) are kept intact during the splits. This method prevents the same group from being present in both the training and validation sets, thus avoiding leakage.
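A minimal sketch of grouped k-fold with scikit-learn's GroupKFold, assuming each sample carries a group label such as a patient ID, could look like this:

```python
# Sketch of grouped k-fold cross-validation: samples sharing a group label
# (e.g., a patient ID) never appear in both training and validation folds.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GroupKFold, cross_val_score

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
groups = np.random.RandomState(0).randint(0, 30, size=300)  # e.g., 30 distinct patient IDs

cv = GroupKFold(n_splits=5)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, groups=groups, cv=cv)
print("Grouped CV accuracy per fold:", scores)
```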

Special Considerations for Image Data

When working with image data, especially when cropping patches from larger images, it is essential to split the data based on the ID of the original large image. Patches from the same image share spatial and contextual information, so if patches from one image appear in both the training and test sets, the model may effectively memorize them, leading to poor generalization to new images.
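One way to enforce this, sketched below with scikit-learn's GroupShuffleSplit, is to treat the source image ID as the grouping variable; the patch and label arrays here are hypothetical placeholders:

```python
# Sketch: split image patches by source image ID so patches from the same
# large image never land in both the training and test sets.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.RandomState(0)
patches = rng.rand(1000, 64)               # 1000 patches, flattened to 64 features (placeholder)
labels = rng.randint(0, 2, size=1000)      # binary labels (placeholder)
image_ids = rng.randint(0, 50, size=1000)  # which of 50 large images each patch came from

splitter = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=0)
train_idx, test_idx = next(splitter.split(patches, labels, groups=image_ids))

# No image ID is shared between the two index sets.
assert set(image_ids[train_idx]).isdisjoint(set(image_ids[test_idx]))
```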

Case Study in Cross-Validation: In-Hospital Mortality and Length of Stay Prediction

In this study, we aimed to predict in-hospital mortality and length of stay using patient visit records. We focused on time-invariant features such as age, sex, and race alongside binary indicators for prior diagnoses. The diagnoses were organized into 25 higher-order categories according to the International Classification of Diseases codes, which were grouped using Clinical Classifications Software (CCS) codes.
In-hospital mortality was analyzed as a binary classification problem, with "1" representing death during the hospital stay and "0" representing survival. Length of stay (LOS) was modeled as a continuous outcome for a separate regression prediction task.
Preprocessing steps involved imputing missing continuous features with the median, capping outlier age values at 110 years, and standardizing all numerical features. We applied a feature selection process to reduce dimensionality, selecting the top 10 features for in-hospital mortality prediction and 30 or 50 features (depending on the chosen hyperparameter) for length of stay prediction. Logistic regression was used as the classifier for in-hospital mortality, and hyperparameters were optimized through grid search, exploring least absolute shrinkage and selection operator (LASSO, L1), ridge (L2), and no regularization. For length of stay prediction, we employed random forest regression and tuned hyperparameters such as the number of estimators and maximum tree depth.
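A simplified sketch of such a pipeline, using scikit-learn components and illustrative grid values rather than the study's actual code, might look like this:

```python
# Simplified sketch of the mortality pipeline: median imputation, standardization,
# top-k feature selection, and logistic regression with the penalty chosen by grid search.
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # fill missing continuous values with the median
    ("scale", StandardScaler()),                   # standardize numerical features
    ("select", SelectKBest(f_classif, k=10)),      # keep the top 10 features for mortality prediction
    ("clf", LogisticRegression(solver="saga", max_iter=5000)),
])

param_grid = [
    {"clf__penalty": ["l1", "l2"], "clf__C": [0.01, 0.1, 1.0]},
    {"clf__penalty": [None]},  # no regularization (recent scikit-learn; older versions use "none")
]
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="roc_auc")
# search.fit(X_train, y_train)  # X_train, y_train would be the visit-level features and labels
```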
To ensure robust model performance estimates, we adopted a nested cross-validation approach. This method performs hyperparameter tuning and model selection within an "inner" cross-validation loop, mitigating optimistic bias when the same data is used for tuning and evaluation. Such bias, a form of overfitting, inflates the model's apparent performance due to random variations in the data and learning algorithm. The nested cross-validation procedure helps reduce this bias by separating the tuning and evaluation processes.
In nonnested cross-validation, model performance is evaluated on the test fold at each split, and the average performance is used to select the optimal model. However, this approach risks optimistic bias as it evaluates performance on the same data used for tuning. In contrast, nested cross-validation separates the tuning and evaluation processes, providing a more reliable estimate of out-of-sample performance.
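In code, nested cross-validation can be sketched by wrapping a GridSearchCV (the inner tuning loop) inside cross_val_score (the outer evaluation loop); the data and parameter grid below are illustrative:

```python
# Sketch of nested cross-validation: hyperparameters are tuned in the inner loop,
# performance is estimated in the outer loop, so the data used to pick a
# configuration is never the data used to score it.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)

inner_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {"C": [0.01, 0.1, 1.0, 10.0]},
    cv=inner_cv,
    scoring="roc_auc",
)

# The outer loop treats the whole tuning procedure as the "model" being evaluated.
nested_scores = cross_val_score(inner_search, X, y, cv=outer_cv, scoring="roc_auc")
print("Nested CV AUROC: %.3f (+/- %.3f)" % (nested_scores.mean(), nested_scores.std()))
```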
We further demonstrated the optimism that can arise from improper validation strategies, such as using nonnested cross-validation for both model selection and evaluation. We split the dataset into an 80% training sample and a 20% withheld validation set. Because the MIMIC dataset has no naturally bounded subsets (e.g., by site or clinical setting), the withheld validation set served as a holdout to simulate ground truth. We compared the model's performance on the cross-validation test folds with its performance on the held-out validation set to highlight discrepancies caused by biased validation.
Performance metrics included discrimination measures like the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPR) for in-hospital mortality. We used mean absolute error (MAE) and median absolute error (MedAE) as the primary metrics for length of stay. Additionally, computational time was compared across cross-validation methods, which can significantly impact model training in large-scale data analyses.
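For reference, these metrics map directly onto scikit-learn functions; the toy arrays below are placeholders rather than results from the study:

```python
# Sketch of the evaluation metrics named above, computed with scikit-learn on toy data.
import numpy as np
from sklearn.metrics import (average_precision_score, mean_absolute_error,
                             median_absolute_error, roc_auc_score)

# Binary in-hospital mortality: discrimination metrics on predicted probabilities.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.8, 0.65, 0.2, 0.9])
print("AUROC:", roc_auc_score(y_true, y_score))
print("AUPR:", average_precision_score(y_true, y_score))

# Length of stay regression: absolute-error metrics on predicted days.
los_true = np.array([2.0, 5.0, 1.0, 7.0])
los_pred = np.array([2.5, 4.0, 1.5, 9.0])
print("MAE:", mean_absolute_error(los_true, los_pred))
print("MedAE:", median_absolute_error(los_true, los_pred))
```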