Advanced Machine Learning Techniques

Real-world machine learning problems rarely yield to simple algorithms applied to clean data. You’ll encounter high-dimensional datasets, imbalanced classes, missing values, and complex relationships that require sophisticated approaches. Advanced techniques help you handle these challenges systematically.

The key insight about advanced ML is knowing when complexity is justified. Adding ensemble methods or dimensionality reduction should solve specific problems, not just make your pipeline look more impressive.

Ensemble Methods for Robust Predictions

Ensemble methods combine multiple models to create predictions that are often more accurate and stable than any individual model. The principle is simple: when several experts make different mistakes, averaging their opinions tends to cancel those mistakes out, so the combined view is usually better than any single expert’s.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler

# Create sample classification dataset
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15, 
                          n_redundant=5, n_classes=2, random_state=42)

# Individual models
models = {
    'Logistic Regression': LogisticRegression(random_state=42),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=42),
    'Gradient Boosting': GradientBoostingClassifier(random_state=42),
    'SVM': SVC(probability=True, random_state=42)
}

# Evaluate individual models
individual_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5, scoring='accuracy')
    individual_scores[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")

# Create ensemble
ensemble = VotingClassifier([
    ('lr', LogisticRegression(random_state=42)),
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('gb', GradientBoostingClassifier(random_state=42))
], voting='soft')

# Evaluate ensemble
ensemble_scores = cross_val_score(ensemble, X, y, cv=5, scoring='accuracy')
print(f"\nEnsemble: {ensemble_scores.mean():.3f} ± {ensemble_scores.std():.3f}")

Ensemble methods work best when the individual models make different types of errors. Combining a linear model, a tree-based model, and a neural network often produces better results than combining three similar algorithms, because similar algorithms tend to fail on the same samples.
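One rough way to check whether candidate models really make different errors is to compare their out-of-fold mistakes. The sketch below is an illustration, not part of the original pipeline: the `error_overlap` helper is hypothetical, the overlap measure (shared errors divided by errors made by either model) is one choice among several, and `max_iter=1000` is set only to avoid convergence warnings.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Same synthetic data as the ensemble example
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, n_classes=2, random_state=42)

candidates = {
    'lr': LogisticRegression(max_iter=1000, random_state=42),
    'rf': RandomForestClassifier(n_estimators=100, random_state=42),
    'gb': GradientBoostingClassifier(random_state=42),
}

# Out-of-fold predictions: every sample is predicted by a model that never saw it.
# Each entry is a boolean array marking the samples that model got wrong.
errors = {
    name: cross_val_predict(model, X, y, cv=5) != y
    for name, model in candidates.items()
}


def error_overlap(a, b):
    """Fraction of samples missed by either model that both models miss."""
    either = np.logical_or(a, b).sum()
    return np.logical_and(a, b).sum() / either if either else 0.0


names = list(errors)
overlaps = {}
for i, first in enumerate(names):
    for second in names[i + 1:]:
        overlaps[(first, second)] = error_overlap(errors[first], errors[second])
        print(f"{first} vs {second}: error overlap = {overlaps[(first, second)]:.2f}")
```

A lower overlap suggests the pair has more to gain from being combined; an overlap close to 1 means the two models fail on nearly the same samples, so adding the second one is unlikely to help much.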

Dimensionality Reduction Techniques

High-dimensional data creates computational challenges and can lead to overfitting. Dimensionality reduction techniques help by identifying the most important patterns in your data while discarding noise.

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Apply PCA for dimensionality reduction
pca = PCA()
X_pca = pca.fit_transform(X)

# Analyze explained variance
cumsum_variance = np.cumsum(pca.explained_variance_ratio_)
n_components_95 = np.argmax(cumsum_variance >= 0.95) + 1

print(f"Components needed for 95% variance: {n_components_95}")

# Visualize explained variance
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1), 
         pca.explained_variance_ratio_, 'bo-')
plt.xlabel('Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Individual Component Variance')

plt.subplot(1, 2, 2)
plt.plot(range(1, len(cumsum_variance) + 1), cumsum_variance, 'ro-')
plt.axhline(y=0.95, color='k', linestyle='--', alpha=0.7)
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.title('Cumulative Explained Variance')

plt.tight_layout()
plt.show()

# Use reduced dimensions for modeling
X_reduced = PCA(n_components=n_components_95).fit_transform(X)
print(f"Original dimensions: {X.shape[1]}")
print(f"Reduced dimensions: {X_reduced.shape[1]}")
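When the reduced features feed a model, it is usually safer to fit PCA inside the cross-validation loop rather than on the full dataset. The sketch below assumes a scikit-learn pipeline is acceptable for this; the `StandardScaler` step and the `LogisticRegression` classifier are choices made for illustration. Passing a float to `n_components` asks PCA to keep just enough components to reach that share of explained variance.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same synthetic data as the PCA example
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, n_classes=2, random_state=42)

# Scaling and PCA are refit on the training folds only, so the held-out fold
# never influences which directions are kept.
pca_pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95, svd_solver='full'),  # keep 95% of the variance
    LogisticRegression(max_iter=1000, random_state=42),
)

pca_scores = cross_val_score(pca_pipeline, X, y, cv=5, scoring='accuracy')
print(f"PCA pipeline: {pca_scores.mean():.3f} ± {pca_scores.std():.3f}")

# Inspect how many components the 95% threshold selects on the full data
pca_pipeline.fit(X, y)
n_kept = pca_pipeline.named_steps['pca'].n_components_
print(f"Components kept: {n_kept} of {X.shape[1]}")
```

Because the features are standardized first, the number of components kept here may differ from the count reported by the unscaled analysis.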

Handling Imbalanced Datasets

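Many real-world problems involve imbalanced classes where one outcome is much rarer than others. Accuracy becomes misleading in this setting: on a 90/10 split, a model that always predicts the majority class reaches 90% accuracy without ever detecting the minority class. Models trained on such data tend to do exactly that unless you correct for the imbalance.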

from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, confusion_matrix
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
from collections import Counter

# Create imbalanced dataset
X_imb, y_imb = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1],
                                  n_features=20, n_informative=15, random_state=42)

print("Original class distribution:")
print(Counter(y_imb))

# Resampling techniques
smote = SMOTE(random_state=42)
X_smote, y_smote = smote.fit_resample(X_imb, y_imb)

undersampler = RandomUnderSampler(random_state=42)
X_under, y_under = undersampler.fit_resample(X_imb, y_imb)

print("\nAfter SMOTE:")
print(Counter(y_smote))
print("\nAfter undersampling:")
print(Counter(y_under))

# Compare model performance with different resampling strategies.
# Split first and resample only the training data, so every model is
# evaluated on the same untouched, still-imbalanced test set.
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, roc_auc_score

X_train, X_test, y_train, y_test = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=42, stratify=y_imb
)

samplers = {
    'Original': None,
    'SMOTE': SMOTE(random_state=42),
    'Undersampled': RandomUnderSampler(random_state=42)
}

for name, sampler in samplers.items():
    if sampler is None:
        X_fit, y_fit = X_train, y_train
    else:
        X_fit, y_fit = sampler.fit_resample(X_train, y_train)

    model = RandomForestClassifier(random_state=42)
    model.fit(X_fit, y_fit)
    y_pred = model.predict(X_test)

    # F1 and AUC are computed for the minority (positive) class
    f1 = f1_score(y_test, y_pred)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

    print(f"\n{name} Dataset:")
    print(f"F1 Score: {f1:.3f}")
    print(f"AUC-ROC: {auc:.3f}")
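The point about misleading accuracy can be checked directly with a baseline that ignores the features entirely. This is a small sketch using scikit-learn's `DummyClassifier`; the baseline is an addition for illustration and the variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, recall_score
from sklearn.model_selection import train_test_split

# Same imbalanced data as the resampling example (roughly 90% / 10%)
X_imb, y_imb = make_classification(n_samples=1000, n_classes=2, weights=[0.9, 0.1],
                                   n_features=20, n_informative=15, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X_imb, y_imb, test_size=0.2, random_state=42, stratify=y_imb
)

# Always predict the most frequent class seen during training
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_train, y_train)
y_base = baseline.predict(X_test)

baseline_accuracy = accuracy_score(y_test, y_base)
baseline_recall = recall_score(y_test, y_base, zero_division=0)
baseline_f1 = f1_score(y_test, y_base, zero_division=0)

print(f"Baseline accuracy: {baseline_accuracy:.3f}")
print(f"Baseline minority recall: {baseline_recall:.3f}")
print(f"Baseline F1: {baseline_f1:.3f}")
```

The baseline never predicts the minority class, so its recall and F1 for that class are zero even though its accuracy looks respectable. Any real model on this data should be judged against that baseline rather than against 50%.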

Feature Selection and Engineering

Automated feature selection helps identify the most predictive variables while reducing overfitting and computational costs. Different selection methods capture different types of relationships.

from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Univariate feature selection: keep the 10 features with the highest ANOVA F-score
selector_univariate = SelectKBest(score_func=f_classif, k=10)

# Recursive feature elimination: repeatedly drop the feature with the weakest coefficient
selector_rfe = RFE(LogisticRegression(random_state=42), n_features_to_select=10)

# Compare feature selection methods. Each selector sits inside a pipeline so it is
# refit on the training folds only; selecting on the full dataset first would leak
# information from the held-out folds into the scores.
methods = {
    'All Features': LogisticRegression(random_state=42),
    'Univariate Selection': make_pipeline(selector_univariate,
                                          LogisticRegression(random_state=42)),
    'RFE Selection': make_pipeline(selector_rfe,
                                   LogisticRegression(random_state=42))
}

for name, estimator in methods.items():
    scores = cross_val_score(estimator, X, y, cv=5, scoring='accuracy')
    print(f"{name}: {scores.mean():.3f} ± {scores.std():.3f}")
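To see that the two selection methods capture different relationships, it can help to compare which columns each one keeps. The sketch below refits both selectors on the full synthetic dataset purely for inspection, not for scoring; `max_iter=1000` is an assumption added to avoid convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

# Same synthetic data as the feature selection example
X, y = make_classification(n_samples=1000, n_features=20, n_informative=15,
                           n_redundant=5, n_classes=2, random_state=42)

univariate = SelectKBest(score_func=f_classif, k=10).fit(X, y)
rfe = RFE(LogisticRegression(max_iter=1000, random_state=42),
          n_features_to_select=10).fit(X, y)

# get_support(indices=True) returns the column indices each selector kept
kept_univariate = {int(i) for i in univariate.get_support(indices=True)}
kept_rfe = {int(i) for i in rfe.get_support(indices=True)}

shared = sorted(kept_univariate & kept_rfe)
only_univariate = sorted(kept_univariate - kept_rfe)
only_rfe = sorted(kept_rfe - kept_univariate)

print(f"Kept by both methods: {shared}")
print(f"Only univariate selection: {only_univariate}")
print(f"Only RFE: {only_rfe}")
```

Features that appear in only one list are worth a closer look: univariate scoring judges each feature in isolation, while RFE judges it in the context of the other features in the model.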

Model Interpretability and Explainability

Understanding why models make specific predictions becomes crucial for high-stakes decisions. SHAP (SHapley Additive exPlanations) provides a unified framework for model interpretation.

# Note: This requires 'pip install shap'
try:
    import shap
    
    # Train a model for interpretation
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    model.fit(X, y)
    
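    # Create SHAP explainer
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X[:100])  # Explain first 100 samples

    # Depending on the SHAP version, a binary classifier yields either a list with
    # one array per class or a single (n_samples, n_features, n_classes) array
    if isinstance(shap_values, list):
        shap_positive = shap_values[1]
    else:
        shap_positive = np.asarray(shap_values)
        if shap_positive.ndim == 3:
            shap_positive = shap_positive[:, :, 1]

    print("SHAP analysis completed - would show feature importance plots")
    print("Mean absolute SHAP values (feature importance):")

    # Calculate mean absolute SHAP values for the positive class
    mean_shap = np.mean(np.abs(shap_positive), axis=0)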
    feature_importance = pd.DataFrame({
        'feature': [f'feature_{i}' for i in range(len(mean_shap))],
        'importance': mean_shap
    }).sort_values('importance', ascending=False)
    
    print(feature_importance.head())
    
except ImportError:
    print("SHAP not installed - skipping interpretability analysis")
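The idea behind Shapley values can be shown without the `shap` library on a model small enough to enumerate. The sketch below is a brute-force illustration under simplifying assumptions: `toy_model` is a hypothetical scoring function, and "removing" a feature means replacing it with a single fixed baseline value. This is not how `TreeExplainer` computes its values; it only illustrates the attribution principle.

```python
from itertools import permutations


def toy_model(x):
    """Hypothetical scoring function with one interaction term (x0 * x2)."""
    return 2.0 * x[0] + 1.0 * x[1] + 0.5 * x[0] * x[2]


def shapley_values(model, instance, baseline):
    """Average each feature's marginal contribution over all feature orderings."""
    n = len(instance)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)           # start with every feature "absent"
        previous_output = model(current)
        for feature in order:
            current[feature] = instance[feature]   # switch this feature on
            new_output = model(current)
            totals[feature] += new_output - previous_output
            previous_output = new_output
    return [total / len(orderings) for total in totals]


instance = [1.0, 2.0, 4.0]
baseline = [0.0, 0.0, 0.0]

phi = shapley_values(toy_model, instance, baseline)
print(f"Shapley values: {phi}")  # [3.0, 2.0, 1.0]

# Additivity: the attributions sum to the gap between prediction and baseline
gap = toy_model(instance) - toy_model(baseline)
print(f"Sum of attributions: {sum(phi)}, prediction gap: {gap}")  # 6.0 and 6.0
```

The interaction term contributes 2.0 in total and is split evenly between the two features involved, which is why feature 0 receives 3.0 rather than 2.0. Enumerating every ordering is only feasible for a handful of features, which is why libraries such as SHAP rely on model-specific algorithms or sampling instead.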

Advanced machine learning techniques solve specific problems but add complexity. Use them when simpler approaches prove insufficient, and always validate that the added complexity improves performance on your specific problem.

In our next part, we’ll explore time series analysis and forecasting, learning how to handle temporal data patterns, seasonality, and trend analysis for predictive modeling over time.