Data Leakage in Machine Learning



Data leakage occurs when a model is trained on information that would not be available at prediction time in real-world use. A leaky model can look highly accurate during training or validation, then fail once it is deployed.

There are two main types of leakage. Target leakage occurs when predictors include information that only becomes available after the target is known (e.g., post-event features). Train-test contamination occurs when validation or test data influences training (e.g., fitting a preprocessing step on the full dataset before splitting).
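To make train-test contamination concrete, here is a minimal sketch using only the Python standard library. The dataset, the split point, and the helper names `fit_scaler` and `scale` are assumptions made up for illustration. It contrasts scaling statistics computed on all rows (leaky) with statistics computed on the training rows only (clean).

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Return (mean, std) computed from the given values only."""
    return mean(values), pstdev(values)


def scale(values, stats):
    """Standardize values using previously fitted (mean, std)."""
    mu, sigma = stats
    return [(v - mu) / sigma for v in values]


# Hypothetical one-feature dataset; the last two rows are held out as the test set.
data = [1.0, 2.0, 3.0, 4.0, 100.0, 120.0]
train, test = data[:4], data[4:]

# Leaky: statistics are computed on all rows, so the test values
# shape the features the model is trained on.
leaky_stats = fit_scaler(data)
leaky_train = scale(train, leaky_stats)

# Clean: statistics come from the training rows only and are then
# reused unchanged on the test rows.
clean_stats = fit_scaler(train)
clean_train = scale(train, clean_stats)
clean_test = scale(test, clean_stats)

print("mean used by leaky scaler:", leaky_stats[0])
print("mean used by clean scaler:", clean_stats[0])
```

In the leaky version the large test values pull the mean and standard deviation far from what the training rows alone would give, so the training features already encode something about the held-out data.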

Leakage can be prevented by carefully separating training and validation data, excluding features created after the target is known, and using pipelines so that preprocessing is fit only on the training data. Removing leaky features may lower apparent accuracy, but the resulting estimate better reflects how the model will perform on new data.
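The pipeline idea can be sketched as a cross-validation loop in which the preprocessing step is refit inside every fold. This is an illustrative sketch in plain Python, not a definitive implementation: `Standardizer`, `ThresholdClassifier`, `cross_validate`, and the toy data are all hypothetical names and values. Libraries such as scikit-learn offer a `Pipeline` class that serves the same purpose.

```python
from statistics import mean, pstdev


class Standardizer:
    """Hypothetical preprocessing step: learns mean/std in fit, applies them in transform."""

    def fit(self, xs):
        self.mu = mean(xs)
        self.sigma = pstdev(xs) or 1.0  # avoid division by zero on constant input
        return self

    def transform(self, xs):
        return [(x - self.mu) / self.sigma for x in xs]


class ThresholdClassifier:
    """Hypothetical model: predicts 1 above the midpoint between the two class means."""

    def fit(self, xs, ys):
        pos = [x for x, y in zip(xs, ys) if y == 1]
        neg = [x for x, y in zip(xs, ys) if y == 0]
        self.threshold = (mean(pos) + mean(neg)) / 2
        return self

    def predict(self, xs):
        return [1 if x > self.threshold else 0 for x in xs]


def cross_validate(xs, ys, k):
    """Return one accuracy per fold; the scaler never sees the held-out rows."""
    scores = []
    for fold in range(k):
        test_idx = set(range(fold, len(xs), k))
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        test_x = [x for i, x in enumerate(xs) if i in test_idx]
        test_y = [y for i, y in enumerate(ys) if i in test_idx]

        # Preprocessing is fit on the training fold only, then applied to both parts.
        scaler = Standardizer().fit(train_x)
        model = ThresholdClassifier().fit(scaler.transform(train_x), train_y)
        preds = model.predict(scaler.transform(test_x))

        scores.append(mean([1.0 if p == y else 0.0 for p, y in zip(preds, test_y)]))
    return scores


# Hypothetical toy data: one feature, two well-separated classes.
xs = [1.0, 1.5, 2.0, 2.5, 8.0, 8.5, 9.0, 9.5]
ys = [0, 0, 0, 0, 1, 1, 1, 1]
scores = cross_validate(xs, ys, k=4)
print(scores)
```

Bundling the scaler and the model into one fit-then-predict unit per fold means there is no point in the loop where statistics from held-out rows can reach the training step.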

