Data Leakage in Machine Learning

Data leakage occurs when a model is trained on information that would not be available at prediction time in real-world use. The model can then appear highly accurate during training or validation, yet fail once it is deployed.

There are two main types of leakage. Target leakage occurs when predictors include information that only becomes available after the target is known (e.g., features recorded after the event being predicted). Train-test contamination occurs when validation or test data influences training (e.g., fitting a preprocessing step on the full dataset before splitting it).
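To make train-test contamination concrete, here is a minimal sketch using only the Python standard library. The toy values, the split point, and the `standardize` helper are all hypothetical and chosen purely for illustration. It compares scaling statistics computed on every row (leaky) with statistics computed on the training rows alone (clean).

```python
from statistics import mean, pstdev

# Hypothetical toy feature; the last two rows are held out as the test set.
values = [1.0, 2.0, 3.0, 4.0, 100.0, 120.0]
train, test = values[:4], values[4:]


def standardize(xs, mu, sigma):
    """Scale xs with the supplied mean and standard deviation."""
    return [(x - mu) / sigma for x in xs]


# Leaky: the statistics are computed on all rows, so the held-out
# test values shape the features the model is trained on.
leaky_mu, leaky_sigma = mean(values), pstdev(values)
leaky_train = standardize(train, leaky_mu, leaky_sigma)

# Clean: the statistics come from the training rows only and are
# then reused, unchanged, to transform the test rows.
clean_mu, clean_sigma = mean(train), pstdev(train)
clean_train = standardize(train, clean_mu, clean_sigma)
clean_test = standardize(test, clean_mu, clean_sigma)

print("mean used by leaky scaling:", leaky_mu)
print("mean used by clean scaling:", clean_mu)
```

In the leaky version the two large held-out values pull the mean far above every training value, so the training features already encode something about data the model is supposed to have never seen.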

Leakage can be prevented by splitting training and validation data before any preprocessing, excluding features that are created or updated after the target is known, and using pipelines so that preprocessing is fitted on training data only. Removing leaky features may lower apparent accuracy, but the resulting scores are a more honest estimate of how the model will perform on new data.
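One way to put the pipeline advice into practice is sketched below. It assumes scikit-learn is installed, which the post itself does not specify, and the synthetic dataset and the choice of scaler and classifier are placeholders. Because the scaler lives inside the pipeline, cross-validation refits it on the training folds of each split, so the validation fold never influences the preprocessing.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real dataset (an assumption for this sketch).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# The scaler and the classifier are bundled into one estimator, so the
# scaler is fitted only on whatever data the pipeline is fitted on.
model = make_pipeline(StandardScaler(), LogisticRegression())

# For each of the 5 splits, the whole pipeline is refitted on the
# training folds and scored on the held-out fold.
scores = cross_val_score(model, X, y, cv=5)
print("mean validation accuracy:", scores.mean())
```

The contrast is with calling `StandardScaler().fit_transform(X)` on the full dataset first and cross-validating afterwards, which lets every validation fold contribute to the scaling statistics.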
