Problem 1: Bias-Variance in Ridge Regression (23 pt)
Setup
Consider scalar features $x_1, \dots, x_n$ (fixed, non-random) with labels:
$$y_i = w x_i + \epsilon_i, \qquad i = 1, \dots, n,$$
where $w$ is fixed (unknown) and the $\epsilon_i$ are i.i.d. with $\mathbb{E}[\epsilon_i] = 0$ and $\mathrm{Var}(\epsilon_i) = \sigma^2$.
Ridge regression with regularization parameter $\lambda \ge 0$:
$$\hat{w} = \arg\min_{v \in \mathbb{R}} \; \sum_{i=1}^{n} (y_i - v x_i)^2 + \lambda v^2$$
Closed-form solution:
$$\hat{w} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2 + \lambda}$$
Let $S := \sum_{i=1}^{n} x_i^2$.
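As a quick sanity check of the closed form, here is a minimal NumPy sketch; the values of $w$, $\sigma$, $\lambda$, and the feature vector are illustrative assumptions, not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choices (not specified by the assignment)
n, w_true, sigma, lam = 50, 2.0, 1.0, 5.0
x = rng.uniform(-1, 1, size=n)             # fixed scalar features
y = w_true * x + rng.normal(0, sigma, n)   # noisy labels

S = np.sum(x ** 2)
w_hat = np.sum(x * y) / (S + lam)          # closed-form scalar ridge solution
print(f"S = {S:.3f}, ridge estimate w_hat = {w_hat:.3f}")
```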
1.1 Expected Label and Noise
(a) Prove $\mathbb{E}[y_i] = w x_i$.
Since $w$ and $x_i$ are fixed and $\mathbb{E}[\epsilon_i] = 0$, linearity of expectation gives $\mathbb{E}[y_i] = w x_i + \mathbb{E}[\epsilon_i] = w x_i$.
(b) Prove Noise $= \sigma^2$.
The noise term of the bias-variance decomposition is $\mathbb{E}\big[(y_i - \mathbb{E}[y_i])^2\big] = \mathbb{E}[\epsilon_i^2] = \mathrm{Var}(\epsilon_i) = \sigma^2$.
1.2 Expected Predictor
Prove: $\mathbb{E}[\hat{w}] = \dfrac{S}{S + \lambda}\, w$.
Because the $x_i$ are fixed, linearity of expectation gives
$$\mathbb{E}[\hat{w}] = \frac{\sum_{i} x_i\, \mathbb{E}[y_i]}{S + \lambda} = \frac{\sum_{i} x_i (w x_i)}{S + \lambda} = \frac{S}{S + \lambda}\, w.$$
1.3 Squared Bias
Prove: $\mathrm{Bias}^2 = \left(\dfrac{\lambda}{S + \lambda}\right)^{2} w^2$.
Using the expected predictor from 1.2,
$$\mathrm{Bias} = \mathbb{E}[\hat{w}] - w = \frac{S}{S + \lambda}\,w - w = -\frac{\lambda}{S + \lambda}\,w
\quad\Longrightarrow\quad
\mathrm{Bias}^2 = \left(\frac{\lambda}{S + \lambda}\right)^{2} w^2.$$
1.4 Variance
Prove: $\mathrm{Var}(\hat{w}) = \dfrac{\sigma^2 S}{(S + \lambda)^2}$.
Computing $\mathrm{Var}\!\left(\sum_i x_i y_i\right)$: the $y_i$ are independent with $\mathrm{Var}(y_i) = \sigma^2$, so
$$\mathrm{Var}\!\left(\sum_{i=1}^{n} x_i y_i\right) = \sum_{i=1}^{n} x_i^2\, \mathrm{Var}(y_i) = \sigma^2 S.$$
Therefore:
$$\mathrm{Var}(\hat{w}) = \frac{\mathrm{Var}\!\left(\sum_i x_i y_i\right)}{(S + \lambda)^2} = \frac{\sigma^2 S}{(S + \lambda)^2}.$$
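As a sanity check, a small Monte Carlo simulation can be compared against the bias and variance formulas above; the specific values of $w$, $\sigma$, $\lambda$, and the features are again illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, w_true, sigma, lam, trials = 50, 2.0, 1.0, 5.0, 100_000

x = rng.uniform(-1, 1, size=n)            # fixed features, reused in every trial
S = np.sum(x ** 2)

# Draw many datasets and compute the closed-form ridge estimate for each
eps = rng.normal(0, sigma, size=(trials, n))
y = w_true * x + eps                      # shape (trials, n)
w_hat = (y @ x) / (S + lam)               # one estimate per simulated dataset

print("empirical bias^2  :", (w_hat.mean() - w_true) ** 2)
print("formula   bias^2  :", (lam / (S + lam)) ** 2 * w_true ** 2)
print("empirical variance:", w_hat.var())
print("formula   variance:", sigma ** 2 * S / (S + lam) ** 2)
```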
1.5 Behavior as $\lambda$ Changes
Monotonicity Analysis
Bias$^2$ is increasing in $\lambda$:
- Writing $\frac{\lambda}{S+\lambda} = 1 - \frac{S}{S+\lambda}$ shows the fraction grows as $\lambda$ increases
- The fraction approaches 1 as $\lambda \to \infty$, so $\mathrm{Bias}^2 \to w^2$
Variance is decreasing in $\lambda$:
- Only the denominator $(S+\lambda)^2$ grows with $\lambda$
- This shrinks the variance toward 0 (see the derivative check below)
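Both claims follow directly from differentiating the expressions derived in 1.3 and 1.4:
$$\frac{d}{d\lambda}\,\mathrm{Bias}^2 = \frac{d}{d\lambda}\!\left[\frac{\lambda^2 w^2}{(S+\lambda)^2}\right] = \frac{2\lambda S\, w^2}{(S+\lambda)^3} \;\ge\; 0,
\qquad
\frac{d}{d\lambda}\,\mathrm{Var}(\hat{w}) = \frac{d}{d\lambda}\!\left[\frac{\sigma^2 S}{(S+\lambda)^2}\right] = -\frac{2\sigma^2 S}{(S+\lambda)^3} \;\le\; 0.$$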
Limiting Behavior
(a) When $\lambda = 0$:
- $\mathrm{Bias}^2 = 0$ (unbiased estimator)
- $\mathrm{Var}(\hat{w}) = \sigma^2 / S$ (standard OLS variance)
This is the OLS solution with no regularization.
(b) When $\lambda \to \infty$:
- $\mathrm{Bias}^2 \to w^2$ (maximum bias)
- $\mathrm{Var}(\hat{w}) \to 0$ (minimum variance)
Since $\hat{w} \to 0$, the predictor becomes constant (heavily regularized).
Interpretation: Increasing $\lambda$ creates a bias-variance tradeoff:
- Higher $\lambda$: lower variance, higher bias
- Lower $\lambda$: higher variance, lower bias
Since we don't know $w$ or the true distribution of the noise $\epsilon_i$, we use model selection (e.g., cross-validation) to find the optimal $\lambda$; the sweep below illustrates the tradeoff numerically.
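A short sweep over $\lambda$ using the closed-form expressions makes the tradeoff concrete (same illustrative $w$, $\sigma$, and features as in the earlier sketches):

```python
import numpy as np

rng = np.random.default_rng(2)
n, w_true, sigma = 50, 2.0, 1.0
x = rng.uniform(-1, 1, size=n)
S = np.sum(x ** 2)

print(f"{'lambda':>8} {'bias^2':>10} {'variance':>10} {'total':>10}")
for lam in [0.0, 0.1, 1.0, 10.0, 100.0, 1000.0]:
    bias2 = (lam / (S + lam)) ** 2 * w_true ** 2   # increases with lambda
    var = sigma ** 2 * S / (S + lam) ** 2          # decreases with lambda
    print(f"{lam:8.1f} {bias2:10.4f} {var:10.4f} {bias2 + var:10.4f}")
```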
1.6 Triangle Inequality Bound
Prove: $\|\hat{w}_D - \bar{w}\| \le 2B$ for the constraint $\|\hat{w}\| \le B$, where $\hat{w}_D$ is the estimator fit on dataset $D$ and $\bar{w} := \mathbb{E}_D[\hat{w}_D]$ is the expected predictor.
Given that the regularization constraint forces $\hat{w}$ inside a ball of radius $B$:
- For every dataset $D$: $\|\hat{w}_D\| \le B$
- For the expected predictor: $\|\bar{w}\| = \|\mathbb{E}_D[\hat{w}_D]\| \le \mathbb{E}_D\|\hat{w}_D\| \le B$ (by Jensen's inequality)
By the triangle inequality:
$$\|\hat{w}_D - \bar{w}\| \le \|\hat{w}_D\| + \|\bar{w}\| \le B + B = 2B.$$
Therefore: $\|\hat{w}_D - \bar{w}\| \le 2B$ for every dataset $D$.
Interpretation: The maximum Euclidean distance between any two points in a ball of radius $B$ is at most $2B$.
1.7 Variance Bound
Prove: $\mathrm{Variance} \le 4B^2$.
Rewrite the variance formula:
$$\mathrm{Variance} = \mathbb{E}_D\!\left[\|\hat{w}_D - \bar{w}\|^2\right].$$
Using the bound $\|\hat{w}_D - \bar{w}\| \le 2B$ from 1.6:
$$\mathrm{Variance} \le \mathbb{E}_D\!\left[(2B)^2\right] = 4B^2.$$
Note: This bound doesn't depend on the noise level $\sigma^2$ or the data, but it can be looser than the actual variance value from 1.4.
Problem 2: Optimal Classifier under Squared Loss (12 pt)
2.1 Find the Optimal Classifier
Goal: Minimize the expected squared error loss:
$$R(f) = \mathbb{E}_{X, Y}\!\left[(Y - f(X))^2\right].$$
Answer: $f^*(x) = \mathbb{E}[Y \mid X = x]$.
Derivation
Using the law of total expectation:
$$R(f) = \mathbb{E}_X\!\left[\,\mathbb{E}\!\left[(Y - f(X))^2 \mid X\right]\right].$$
For a fixed $x$ and $c = f(x)$, expand the inner expectation:
$$\mathbb{E}\!\left[(Y - c)^2 \mid X = x\right] = \left(\mathbb{E}[Y \mid X = x] - c\right)^2 + \mathrm{Var}(Y \mid X = x),$$
where the cross term vanishes because $\mathbb{E}\!\left[Y - \mathbb{E}[Y \mid X = x] \mid X = x\right] = 0$.
Thus:
$$R(f) = \mathbb{E}_X\!\left[\left(\mathbb{E}[Y \mid X] - f(X)\right)^2\right] + \mathbb{E}_X\!\left[\mathrm{Var}(Y \mid X)\right].$$
The second term is independent of $f$. To minimize $R(f)$, we minimize the first term by setting:
$$f^*(x) = \mathbb{E}[Y \mid X = x].$$
This is the conditional mean, or regression function.
2.2 Find the Optimal Error Rate
Answer: $R(f^*) = \mathbb{E}_X\!\left[\mathrm{Var}(Y \mid X)\right]$.
Plugging $f^*(x) = \mathbb{E}[Y \mid X = x]$ back into $R(f)$, the first term vanishes, leaving
$$R(f^*) = \mathbb{E}_X\!\left[\mathrm{Var}(Y \mid X)\right].$$
Interpretation: This is the irreducible error, or Bayes error. It represents the inherent noise in the data that cannot be reduced by any predictor, since it comes from the variance of $Y$ given $X$.
No predictor can achieve an error lower than this, as it is determined by the randomness in the labels given the features.
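A quick simulation illustrates both results: the conditional mean achieves the lowest squared error among a few candidate predictors, and that error matches $\mathbb{E}_X[\mathrm{Var}(Y \mid X)]$. The data-generating model here is an arbitrary illustrative choice, not part of the assignment.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 500_000

# Illustrative model: E[Y|X] = sin(X), with heteroscedastic noise std 0.3 + 0.1|X|
X = rng.uniform(-2, 2, size=N)
Y = np.sin(X) + rng.normal(0, 0.3 + 0.1 * np.abs(X))

def risk(pred):
    """Empirical squared-error risk of a predictor pred(x)."""
    return np.mean((Y - pred(X)) ** 2)

print("conditional mean f*(x) = sin(x):", risk(np.sin))
print("linear predictor f(x) = 0.8 x  :", risk(lambda x: 0.8 * x))
print("constant predictor f(x) = 0    :", risk(lambda x: np.zeros_like(x)))

# Irreducible error: E_X[Var(Y | X)] = E_X[(0.3 + 0.1|X|)^2]
print("E_X[Var(Y|X)]                  :", np.mean((0.3 + 0.1 * np.abs(X)) ** 2))
```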
Problem 3: Model Selection (19 pt)
3.1 K-Fold Cross-Validation Implementation
Task: Implement k-fold cross-validation to estimate model performance.
Algorithm:
- Split the data into folds using `KFold(n_splits=k, shuffle=True, random_state=42)`
- For each fold:
  - Use the remaining k-1 folds for training
  - Use the held-out fold for validation
  - Train a copy of the model (using `deepcopy`)
  - Compute the validation MSE
- Return the mean and standard deviation of the validation errors
Key Implementation Points:
- Always create a fresh copy of the model for each fold using `deepcopy(model)`
- Calculate the MSE on the validation fold: $\frac{1}{m}\sum_{j=1}^{m}(y_j - \hat{y}_j)^2$
- Return both the average and the standard deviation for statistical analysis
3.2 Model Selection
Task: Select best model across polynomial degrees and regularization strengths.
Search Space:
- Polynomial degrees: [1, 2, 3, 4, 5, 6, 7, 8]
- Regularization strengths (alpha): [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
Models Evaluated:
- Linear Regression: No regularization
- Ridge Regression: L2 regularization with various $\alpha$ values
- Lasso Regression: L1 regularization with various $\alpha$ values
Procedure:
- For each polynomial degree:
  - Transform features using `PolynomialFeatures(degree)`
  - Standardize using `StandardScaler()`
  - Evaluate Linear Regression
  - Evaluate Ridge with each $\alpha$
  - Evaluate Lasso with each $\alpha$
- Select the model with the lowest cross-validation error
- Return the trained model pipeline
Code Implementation
K-Fold Cross-Validation (hw4_q3.py)
```python
import numpy as np
from copy import deepcopy
from sklearn.model_selection import KFold


def cross_validate_model(X, y, model, k_folds=5):
    """
    Perform k-fold cross-validation and return average validation error.

    Args:
        X: Training features (n_samples, n_features)
        y: Training labels (n_samples,)
        model: Sklearn model object
        k_folds: Number of folds for cross-validation

    Returns:
        avg_val_error: Average validation MSE across all folds
        std_val_error: Standard deviation of validation MSE
    """
    val_errors = []
    kf = KFold(n_splits=k_folds, shuffle=True, random_state=42)

    for train_idx, validate_idx in kf.split(X):
        # Clone the model for this fold so folds do not share fitted state
        model_clone = deepcopy(model)

        # Split data
        cur_train_x, cur_validate_x = X[train_idx], X[validate_idx]
        cur_train_y, cur_validate_y = y[train_idx], y[validate_idx]

        # Train on the k-1 training folds and evaluate on the held-out fold
        model_clone.fit(cur_train_x, cur_train_y)
        y_pred = model_clone.predict(cur_validate_x)
        val_errors.append(np.mean((cur_validate_y - y_pred) ** 2))

    # Calculate statistics across folds
    avg_val_error = np.mean(val_errors)
    std_val_error = np.std(val_errors)
    return avg_val_error, std_val_error
```
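For reference, a hypothetical call on synthetic data (the arrays below are made up purely for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic 1-D regression problem (illustrative only)
rng = np.random.default_rng(42)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = 1.5 * X_demo[:, 0] + rng.normal(0, 0.5, size=200)

avg_mse, std_mse = cross_validate_model(X_demo, y_demo, Ridge(alpha=1.0), k_folds=5)
print(f"5-fold CV MSE: {avg_mse:.4f} +/- {std_mse:.4f}")
```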
Model Selection (hw4_q3.py)
```python
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler

# create_polynomial_features and ModelPipeline are helpers defined elsewhere in hw4_q3.py


def select_best_model(X_train, y_train):
    """
    Select the best model using cross-validation.

    Args:
        X_train: Training features
        y_train: Training labels

    Returns:
        returned_best_model: Trained best model pipeline
    """
    degrees = [1, 2, 3, 4, 5, 6, 7, 8]
    alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

    cur_best_model = {
        "error": np.inf,
        "model": None,
    }
    best_degree = None

    for degree in degrees:
        # Create polynomial features and standardize them
        X_poly = create_polynomial_features(X_train, degree)
        scaler = StandardScaler()
        X_poly_scaled = scaler.fit_transform(X_poly)

        # Models to evaluate at this degree
        all_models = {
            "Linear": [LinearRegression()],
            "Ridge": [Ridge(alpha=a, random_state=42) for a in alphas],
            "Lasso": [Lasso(alpha=a, random_state=42, max_iter=2000) for a in alphas],
        }

        # Evaluate all models via k-fold cross-validation
        for name, models in all_models.items():
            print(f"Evaluating {name} Regression with degree {degree}")
            results = [
                evaluate_model(X_poly_scaled, y_train, model)
                for model in models
            ]
            errors = [result[0] for result in results]
            min_idx = np.argmin(errors)

            # Update the running best if this family/degree has a lower CV error
            if errors[min_idx] < cur_best_model["error"]:
                cur_best_model["error"] = errors[min_idx]
                cur_best_model["model"] = results[min_idx][1]
                best_degree = degree

    # Wrap the winning degree and model into the assignment's pipeline helper
    best_model = cur_best_model["model"]
    returned_best_model = ModelPipeline(best_degree, best_model, StandardScaler())
    return returned_best_model


def evaluate_model(X, y, model):
    """Helper: cross-validate a single model and return (average CV error, model)."""
    avg_err, std_err = cross_validate_model(X, y, model)
    return avg_err, model
```
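And a hypothetical end-to-end call, assuming the `create_polynomial_features` and `ModelPipeline` helpers from the assignment scaffold are available; the synthetic data is illustrative only:

```python
import numpy as np

# Synthetic cubic data with noise (illustrative only)
rng = np.random.default_rng(0)
X_train = rng.uniform(-2, 2, size=(300, 1))
y_train = 0.5 * X_train[:, 0] ** 3 - X_train[:, 0] + rng.normal(0, 0.3, size=300)

best_pipeline = select_best_model(X_train, y_train)
print("Selected pipeline:", best_pipeline)
```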
Key Takeaways
Bias-Variance Tradeoff in Ridge Regression
- As $\lambda$ increases:
  - Bias increases (the estimator moves away from the true $w$)
  - Variance decreases (predictions become more stable)
- Optimal $\lambda$:
  - Balances bias and variance
  - Found through cross-validation
  - Different for each dataset
- Variance bounds:
  - Regularization provides distribution-free bounds
  - The actual variance depends on the data, the noise level, and the true parameters
Optimal Predictors
- Under squared loss:
  - The optimal predictor is the conditional mean: $f^*(x) = \mathbb{E}[Y \mid X = x]$
  - The irreducible error is the conditional variance: $\mathbb{E}_X[\mathrm{Var}(Y \mid X)]$
- Implications:
  - No predictor can beat the Bayes error
  - Focus on approximating the conditional mean well
  - Understand the sources of irreducible noise
Model Selection Best Practices
- Cross-validation:
  - Provides a nearly unbiased estimate of generalization error
  - K-fold balances the bias and variance of the error estimate
  - Must use independent folds (no data leakage)
- Hyperparameter tuning:
  - Search over multiple dimensions (degree, regularization)
  - Use the validation set to select, the test set to evaluate
  - Report both the mean and the variance of CV scores
- Model complexity:
  - Higher polynomial degrees increase model capacity
  - Regularization controls effective complexity
  - The goal is to match model complexity to data complexity