Data Leakage in Machine Learning
Data leakage occurs when a model is trained on information that will not be available at prediction time in the real world. Leaky models can appear highly accurate during training and validation, only to fail once deployed.
There are two main types of leakage: target leakage, where predictors contain information generated after the target is determined (e.g., a feature recorded only after the outcome occurred), and train-test contamination, where validation or test data influences training (e.g., fitting preprocessing steps on the full dataset before splitting).
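The train-test contamination case can be made concrete with a small sketch, assuming scikit-learn is available: fitting a scaler on the full dataset lets test-set statistics leak into the training features, whereas splitting first keeps them separate.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# Leaky: the scaler sees the test rows before the split,
# so test-set statistics influence the training features.
leaky_scaler = StandardScaler().fit(X)
X_scaled = leaky_scaler.transform(X)
X_train_leaky, X_test_leaky = train_test_split(X_scaled, random_state=0)

# Correct: split first, then fit the scaler on training data only.
X_train, X_test = train_test_split(X, random_state=0)
scaler = StandardScaler().fit(X_train)
X_train_ok = scaler.transform(X_train)
X_test_ok = scaler.transform(X_test)
```

The two versions produce different training features because the leaky scaler's mean and variance were computed over rows that belong in the test set.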
Leakage can be prevented by splitting data before any preprocessing, excluding features created after the target is known, and wrapping preprocessing in pipelines so it is fit only on training folds. Removing leaky features may lower apparent accuracy, but it ensures the reported performance reflects what the model can actually achieve on new data.
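As a minimal sketch of the pipeline approach, again assuming scikit-learn: bundling the scaler and the model into one pipeline means cross-validation refits the preprocessing inside each fold, so validation rows never influence it.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data for illustration only.
X, y = make_classification(n_samples=200, random_state=0)

# The pipeline is treated as a single estimator: on each CV fold,
# the scaler is fit on that fold's training rows alone.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
scores = cross_val_score(pipe, X, y, cv=5)
```

Passing the pipeline, rather than pre-scaled data, to `cross_val_score` is what prevents the contamination shown earlier.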