Handling Missing Values in Machine Learning
Why Missing Values Matter
Datasets often contain missing values (NaN), e.g., a two-bedroom house has no value for the size of a third bedroom, or a survey respondent skips a question. Most machine learning models cannot handle missing values, so we must process them before training.
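To see why this matters in practice: scikit-learn estimators raise an error when they are fit on data containing NaN. A minimal sketch (using `LinearRegression` as an arbitrary example model):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [np.nan], [3.0]])
y = np.array([1.0, 2.0, 3.0])

try:
    LinearRegression().fit(X, y)
    fit_succeeded = True
except ValueError as err:  # scikit-learn rejects NaN inputs outright
    fit_succeeded = False
    print("Fit failed:", err)
```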
Three Approaches
1. Drop Columns with Missing Values
Remove every column that contains any missing entries. This is the easiest option but can discard important data.
# Identify columns with missing values
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]
# Drop these columns
X_train_reduced = X_train.drop(cols_with_missing, axis=1)
X_valid_reduced = X_valid.drop(cols_with_missing, axis=1)
Result: MAE = 183,550 → worse performance due to lost information.
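The MAE figures quoted in this section assume some fixed model and scoring routine that the text does not show. As a sketch of what that could look like (the helper name `score_dataset` and the random-forest settings are assumptions, not taken from the text):

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def score_dataset(X_train, X_valid, y_train, y_valid):
    """Fit a random forest on the training data and return validation MAE."""
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
```

Scoring each preprocessing variant with the same function makes the three approaches directly comparable.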
2. Imputation (Recommended)
Replace each missing value with a substitute such as the column mean, median, or mode (SimpleImputer defaults to the mean). This usually outperforms dropping columns.
import pandas as pd
from sklearn.impute import SimpleImputer
# Create imputer to fill missing values
imputer = SimpleImputer()
# Fit on training data and transform both train and validation
X_train_imputed = pd.DataFrame(imputer.fit_transform(X_train))
X_valid_imputed = pd.DataFrame(imputer.transform(X_valid))
# Restore column names
X_train_imputed.columns = X_train.columns
X_valid_imputed.columns = X_valid.columns
Result: MAE = 178,166 → better performance than dropping columns.
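The choice of substitute is controlled by SimpleImputer's `strategy` parameter. A minimal sketch on a toy array (the variable names are illustrative):

```python
import numpy as np
from sklearn.impute import SimpleImputer

data = np.array([[1.0], [np.nan], [3.0]])

# strategy defaults to 'mean'; 'median' is more robust to outliers,
# and 'most_frequent' also works for categorical columns.
median_imputer = SimpleImputer(strategy='median')
filled = median_imputer.fit_transform(data)  # NaN -> 2.0, the median of 1 and 3
```

Which strategy works best depends on the column's distribution; trying each and comparing validation error is a reasonable default workflow.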
3. Imputation + Indicator Columns
Impute missing values and also track which values were originally missing. This can sometimes improve predictions if the missingness itself is informative.
# Copy original data
X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()
# Add indicator columns
for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()
# Impute missing values (fresh imputer, so this approach stands alone)
imputer = SimpleImputer()
X_train_plus_imputed = pd.DataFrame(imputer.fit_transform(X_train_plus))
X_valid_plus_imputed = pd.DataFrame(imputer.transform(X_valid_plus))
# Restore column names
X_train_plus_imputed.columns = X_train_plus.columns
X_valid_plus_imputed.columns = X_valid_plus.columns
Result: MAE = 178,927 → slightly worse than simple imputation in this dataset.
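scikit-learn can also build the indicator columns for you: passing `add_indicator=True` to SimpleImputer appends one binary column per feature that had missing values during fit, replacing the manual `_was_missing` loop. A minimal sketch on toy data:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [4.0, 5.0, 6.0]})

# add_indicator=True appends a 0/1 indicator column for each feature
# that contained missing values when the imputer was fitted.
imputer = SimpleImputer(add_indicator=True)
result = imputer.fit_transform(X)
# columns: imputed 'a', 'b', and the missing-value indicator for 'a'
```

This keeps the imputation and the missingness tracking in a single transformer, which is convenient inside a Pipeline.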
Summary
- Dropping columns loses valuable information → higher error.
- Imputation improves performance → best default approach.
- Tracking missing values may help in some cases, but not always.
- Always check how many missing values exist per column:
# Check number of missing values per column
X_train.isnull().sum()[X_train.isnull().sum() > 0]
Use imputation (Approach 2) as the default strategy. Only consider Approach 3 if the pattern of missing values might carry meaningful information.