What is cross-validation?
Cross-validation works by splitting the dataset into k equally sized folds and running k iterations, with each fold used as the validation set exactly once while the remaining k − 1 folds are used for training. Once this is complete, the metrics computed at each validation step, such as accuracy, precision, recall, and F1 score, are averaged across the k iterations to give a more stable estimate of model performance than any single train-test split could.
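As a concrete illustration, here is a minimal sketch of that loop using scikit-learn's KFold splitter; the synthetic dataset, logistic regression model, and choice of k = 5 are assumptions made for the example, not part of the text above:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import KFold

X, y = make_classification(n_samples=300, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
accuracies, f1s = [], []
for train_idx, val_idx in kf.split(X):
    # Each fold serves as the validation set exactly once.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    preds = model.predict(X[val_idx])
    accuracies.append(accuracy_score(y[val_idx], preds))
    f1s.append(f1_score(y[val_idx], preds))

# Averaging over the k folds gives the more stable estimate.
print(f"mean accuracy: {np.mean(accuracies):.3f}")
print(f"mean F1:       {np.mean(f1s):.3f}")
```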
Why is Cross-Validation Important?
The most tempting way to examine the performance of an ML model is to test it on the same data that was used for training. However, this is a bad idea for several reasons. The main problem is that testing on the training data implicitly assumes our training set covers everything the model will face in the real world. We know this is nonsense: no dataset can represent all the possible situations your model may experience in production. The goal is for the model to make good predictions on new data it has never seen, not just to do well on the data it already knows.
Even though it comes from the real world, training data is a tiny sample of all the possible data points out there. If we test a model on the same data it was trained on, we risk overestimating its actual performance: the model may have memorized patterns, trends, or specific examples from the training data instead of generalizing to new, out-of-sample scenarios. To properly examine the performance of a model, it must be tested on new data. This new data, often called the "test set," is a more accurate indicator of how well the model will work in real-world applications.
This usually means dividing the complete available dataset into one part used for training and another reserved for testing. A standard split is a 70–30 ratio, with 70% of the data used for training and 30% for testing, though the exact proportions can differ depending on the dataset size and model complexity.
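For reference, a hold-out split like this is a one-liner in scikit-learn; the 70–30 ratio below mirrors the text, while the synthetic dataset and random seed are assumptions of the sketch:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)

# Reserve 30% of the data as the held-out test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)  # (700, 20) (300, 20)
```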
Holding out a test set also helps guard against fitting the training data too closely, better known as overfitting, since the model's predictions are ultimately evaluated on data it has never seen. A final evaluation on the test set gives an unbiased measure of how well our model will perform on new data.
Cross-validation adds robustness to this evaluation beyond a single train-test or validation split. K-fold cross-validation breaks the data into k subsets and trains a model k times, each time using one subset for validation and the remaining k − 1 subsets for training. This prevents the evaluation from being biased by one particular train-test split.
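In practice the whole procedure is often a single call. Here is a minimal sketch with scikit-learn's cross_val_score; the model, dataset, and k = 5 are again assumptions made for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

# One score per fold: each fold serves as the validation set exactly once.
scores = cross_val_score(
    LogisticRegression(max_iter=1000),
    X, y,
    cv=KFold(n_splits=5, shuffle=True, random_state=0),
)
print(scores)          # five per-fold accuracies
print(scores.mean())   # the averaged, more stable estimate
```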
Benefits of Cross-Validation
The first and most important benefit of using cross-validation is that it helps prevent overfitting. Overfitting occurs when a model is so well optimized on the training data that it cannot generalize beyond what it has learned. Cross-validation gives us a better estimate of how the model will perform on new data: we train the model on one portion of the dataset, evaluate it on the rest, and repeat this for every fold. The result is a better expectation of what happens when we apply our model to new, unseen data.
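One way to see this benefit is to compare training accuracy against cross-validated accuracy. In the sketch below, an unconstrained decision tree (an assumed example model, not one named in the text) scores nearly perfectly on its own training data, while cross-validation exposes a much humbler estimate:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, n_informative=5, random_state=0)

# An unconstrained tree can memorize the training set almost perfectly.
tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(f"training accuracy:        {tree.score(X, y):.3f}")

# Cross-validation scores only on held-out folds, revealing the optimism.
cv_scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print(f"cross-validated accuracy: {cv_scores.mean():.3f}")
```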
The second benefit is model selection. When training multiple models, or tuning different versions of the same model, cross-validation helps identify the ones that consistently perform well across various subsets of the data. This consistency is important because it means the model has not overfit a particular subset of data but has instead learned generalizable patterns. Instead of just taking the model that works best on one test dataset, cross-validation allows us to select the model that performs best on average across multiple splits, which is a more robust way of choosing the best model.
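A sketch of that comparison: two candidate models (chosen arbitrarily here) are scored with the same cross-validation procedure, and the one with the better average, rather than the better single split, would be preferred:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "random forest": RandomForestClassifier(random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    # Mean measures overall skill; std measures consistency across folds.
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```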
A significant advantage of cross-validation is its importance in hyperparameter tuning. Hyperparameters are settings external to the model, like the regularization strength in logistic regression or the number of trees in a random forest. These settings aren't derived from the training data but must be established before the learning process. Cross-validation aids in fine-tuning these hyperparameters by allowing the evaluation of various configurations on validation sets, helping to identify those that yield the best performance.
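Scikit-learn's GridSearchCV wraps exactly this loop. The sketch below tunes the regularization strength C of a logistic regression; the grid values are assumptions chosen for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

# Each candidate value of C is scored with 5-fold cross-validation,
# and the best-performing configuration is retained.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1, 10, 100]},
    cv=5,
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```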
Cross-validation also makes effective use of the available data. Traditional validation methods split the dataset into static training and testing sets, which restricts how much data the model can learn from. By cycling through different subsets, cross-validation lets every data point serve in both training and validation, which means the model encounters a broader range of patterns during the evaluation process.
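One way to see this property: scikit-learn's cross_val_predict returns a prediction for every single sample, each made by a model that never saw that sample during training (the dataset and model below are again illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_predict

X, y = make_classification(n_samples=500, random_state=0)

# Every sample lands in a validation fold exactly once, so every sample
# gets an out-of-sample prediction: no data point is wasted.
preds = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=5)
print(preds.shape)               # (500,): one prediction per sample
print(accuracy_score(y, preds))  # overall out-of-sample accuracy
```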