Machine Learning Fundamentals with Scikit-learn

Machine learning often gets mystified as some kind of magic, but at its core, it’s about finding patterns in data and using those patterns to make predictions. The real challenge isn’t understanding the algorithms—it’s knowing which problems are suitable for ML, how to prepare your data properly, and how to evaluate whether your model actually works.

Scikit-learn makes machine learning accessible by providing a consistent interface across dozens of algorithms. Once you understand the basic workflow, you can experiment with different approaches without rewriting your entire pipeline.

The Machine Learning Workflow

Every machine learning project follows the same basic pattern: prepare data, train models, evaluate performance, and iterate. Understanding this workflow helps you approach new problems systematically rather than jumping straight to complex algorithms.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
import matplotlib.pyplot as plt

# Create sample dataset for demonstration
np.random.seed(42)
n_samples = 1000

# Generate features with different relationships
X = pd.DataFrame({
    'feature1': np.random.normal(0, 1, n_samples),
    'feature2': np.random.normal(0, 1, n_samples),
    'feature3': np.random.uniform(0, 10, n_samples),
    'feature4': np.random.exponential(2, n_samples)
})

# Create target with known relationships
y = (2 * X['feature1']
     - 1.5 * X['feature2']
     + 0.5 * X['feature3']**2
     + np.log(X['feature4'] + 1)
     + np.random.normal(0, 0.5, n_samples))

print("Dataset shape:", X.shape)
print("Target statistics:")
print(f"Mean: {y.mean():.3f}, Std: {y.std():.3f}")

This synthetic dataset includes linear relationships, polynomial terms, and logarithmic transformations—patterns you’ll encounter in real data. The added noise simulates measurement error and unknown factors.
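Before any modeling, a quick sanity check helps confirm those relationships are actually present. One simple option (not part of the workflow itself, just a spot check) is each feature's linear correlation with the target; keep in mind that feature3's quadratic effect and feature4's log effect will only partially show up in a linear correlation.

# Linear correlation between each feature and the target.
# Nonlinear effects (feature3 squared, log of feature4) appear only partially here.
print("Feature-target correlations:")
print(X.corrwith(y).round(3))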

Data Preprocessing and Feature Engineering

Raw data rarely works well with machine learning algorithms. Preprocessing transforms your data into a format that algorithms can use effectively, while feature engineering creates new variables that capture important patterns.

# Split data before any preprocessing to avoid data leakage
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Feature scaling - important for distance-based and regularized algorithms.
# The model comparison below uses the unscaled engineered features, since
# neither plain linear regression nor random forests require scaling.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse training-set statistics

# Feature engineering - create polynomial features
X_train_engineered = X_train.copy()
X_test_engineered = X_test.copy()

# Add polynomial terms
X_train_engineered['feature3_squared'] = X_train['feature3'] ** 2
X_test_engineered['feature3_squared'] = X_test['feature3'] ** 2

# Add logarithmic transformation
X_train_engineered['feature4_log'] = np.log(X_train['feature4'] + 1)
X_test_engineered['feature4_log'] = np.log(X_test['feature4'] + 1)

# Add interaction terms
X_train_engineered['feature1_x_feature2'] = X_train['feature1'] * X_train['feature2']
X_test_engineered['feature1_x_feature2'] = X_test['feature1'] * X_test['feature2']

print("Original features:", X_train.shape[1])
print("Engineered features:", X_train_engineered.shape[1])

Feature scaling ensures that variables with different units don’t dominate the learning process. Feature engineering incorporates domain knowledge about relationships that might not be obvious to algorithms.
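When you do use a scale-sensitive model, it helps to bundle the scaler and estimator in a Pipeline so that cross-validation refits the scaler on each training fold rather than on all the data. Here's a minimal sketch; the Ridge model and its alpha value are illustrative choices, not part of the comparison below.

from sklearn.pipeline import Pipeline
from sklearn.linear_model import Ridge

# The pipeline refits the scaler inside every CV fold, preventing leakage
scaled_model = Pipeline([
    ('scaler', StandardScaler()),
    ('model', Ridge(alpha=1.0))  # alpha chosen arbitrarily for illustration
])
scaled_model.fit(X_train_engineered, y_train)
print(f"Pipeline test R²: {scaled_model.score(X_test_engineered, y_test):.3f}")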

Model Training and Comparison

Scikit-learn’s consistent API makes it easy to try different algorithms and compare their performance. Start with simple models to establish baselines, then experiment with more complex approaches.

# Define models to compare
models = {
    'Linear Regression': LinearRegression(),
    'Random Forest': RandomForestRegressor(n_estimators=100, random_state=42)
}

# Train and evaluate models
results = {}

for name, model in models.items():
    # Train on engineered features
    model.fit(X_train_engineered, y_train)
    
    # Make predictions
    y_pred_train = model.predict(X_train_engineered)
    y_pred_test = model.predict(X_test_engineered)
    
    # Calculate metrics
    train_mse = mean_squared_error(y_train, y_pred_train)
    test_mse = mean_squared_error(y_test, y_pred_test)
    train_r2 = r2_score(y_train, y_pred_train)
    test_r2 = r2_score(y_test, y_pred_test)
    
    results[name] = {
        'train_mse': train_mse,
        'test_mse': test_mse,
        'train_r2': train_r2,
        'test_r2': test_r2
    }
    
    print(f"\n{name} Results:")
    print(f"Train R²: {train_r2:.3f}, Test R²: {test_r2:.3f}")
    print(f"Train MSE: {train_mse:.3f}, Test MSE: {test_mse:.3f}")

The gap between training and test performance indicates overfitting. Models that perform much better on training data than on test data have memorized noise rather than learned generalizable patterns.
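You can make that gap explicit by computing it from the results collected above. The 0.05 threshold below is an arbitrary rule of thumb, not a standard cutoff.

# Flag models whose training R² far exceeds their test R²
for name, metrics in results.items():
    gap = metrics['train_r2'] - metrics['test_r2']
    flag = "possible overfitting" if gap > 0.05 else "looks reasonable"
    print(f"{name}: R² gap = {gap:.3f} ({flag})")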

Cross-Validation for Robust Evaluation

Single train-test splits can be misleading due to lucky or unlucky data divisions. Cross-validation provides more robust performance estimates by testing on multiple data splits.

from sklearn.model_selection import cross_val_score, KFold

# Set up cross-validation
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Evaluate models with cross-validation
cv_results = {}

for name, model in models.items():
    # Cross-validation scores
    cv_scores = cross_val_score(model, X_train_engineered, y_train, 
                               cv=cv, scoring='r2')
    
    cv_results[name] = {
        'mean_score': cv_scores.mean(),
        'std_score': cv_scores.std(),
        'scores': cv_scores
    }
    
    print(f"\n{name} Cross-Validation:")
    print(f"Mean R²: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
    print(f"Individual scores: {cv_scores}")

Cross-validation reveals model stability. High variance in CV scores suggests the model is sensitive to training data composition, which can indicate overfitting or insufficient data.
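If the fold-to-fold scores vary widely, repeating the procedure with different shuffles gives a steadier estimate of that variance. A quick sketch using RepeatedKFold:

from sklearn.model_selection import RepeatedKFold

# 5 folds x 3 repeats = 15 scores, smoothing out lucky or unlucky splits
rcv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=42)
scores = cross_val_score(models['Random Forest'], X_train_engineered, y_train,
                         cv=rcv, scoring='r2')
print(f"Repeated CV R²: {scores.mean():.3f} ± {scores.std():.3f}")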

Hyperparameter Tuning

Most algorithms have hyperparameters that control their behavior. Grid search systematically tests different parameter combinations to find optimal settings for your specific dataset.

from sklearn.model_selection import GridSearchCV

# Define parameter grids for tuning
param_grids = {
    'Random Forest': {
        'n_estimators': [50, 100, 200],
        'max_depth': [None, 10, 20],
        'min_samples_split': [2, 5, 10]
    }
}

# Perform grid search
tuned_models = {}

for name, model in models.items():
    if name in param_grids:
        print(f"\nTuning {name}...")
        
        grid_search = GridSearchCV(
            model, param_grids[name], 
            cv=3, scoring='r2', n_jobs=-1
        )
        
        grid_search.fit(X_train_engineered, y_train)
        
        tuned_models[name] = grid_search.best_estimator_
        
        print(f"Best parameters: {grid_search.best_params_}")
        print(f"Best CV score: {grid_search.best_score_:.3f}")
    else:
        tuned_models[name] = model

Grid search can be computationally expensive, but it often improves model performance significantly. For large parameter spaces, consider random search or more sophisticated optimization methods.
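As a sketch of the random-search alternative, RandomizedSearchCV samples a fixed number of parameter combinations from distributions instead of testing every one; the ranges below are illustrative, not tuned recommendations.

from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV

# Sample 10 random combinations instead of exhaustively testing the grid
random_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions={
        'n_estimators': randint(50, 300),
        'max_depth': [None, 10, 20, 30],
        'min_samples_split': randint(2, 11)
    },
    n_iter=10, cv=3, scoring='r2', random_state=42, n_jobs=-1
)
random_search.fit(X_train_engineered, y_train)
print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.3f}")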

Feature Importance and Model Interpretation

Understanding which features drive your model’s predictions is crucial for building trust and gaining insights. Different algorithms provide different types of interpretability.

# Feature importance for tree-based models
rf_model = tuned_models['Random Forest']
feature_names = X_train_engineered.columns

# Get feature importances
importances = rf_model.feature_importances_
feature_importance_df = pd.DataFrame({
    'feature': feature_names,
    'importance': importances
}).sort_values('importance', ascending=False)

print("Feature Importance (Random Forest):")
print(feature_importance_df)

# Visualize feature importance
plt.figure(figsize=(10, 6))
plt.barh(feature_importance_df['feature'], feature_importance_df['importance'])
plt.xlabel('Importance')
plt.title('Feature Importance - Random Forest')
plt.gca().invert_yaxis()
plt.tight_layout()
plt.show()

Feature importance helps validate that your model is using sensible patterns. If features you'd expect to be irrelevant dominate the rankings, it might indicate data leakage or spurious correlations.
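Note that impurity-based importances from random forests can be biased toward high-cardinality features. Permutation importance, which measures how much shuffling a feature degrades held-out performance, is a useful model-agnostic cross-check; a minimal sketch:

from sklearn.inspection import permutation_importance

# Shuffle each feature on the test set and measure the resulting drop in R²
perm = permutation_importance(rf_model, X_test_engineered, y_test,
                              n_repeats=10, random_state=42)
perm_df = pd.DataFrame({
    'feature': feature_names,
    'importance': perm.importances_mean
}).sort_values('importance', ascending=False)
print("Permutation Importance (test set):")
print(perm_df)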

Model Validation and Diagnostics

Beyond accuracy metrics, diagnostic plots help you understand model behavior and identify potential problems like heteroscedasticity or systematic bias.

from scipy import stats

# Generate predictions for diagnostic plots
best_model = tuned_models['Random Forest']
y_pred = best_model.predict(X_test_engineered)

# Diagnostic plots
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Predicted vs Actual
axes[0,0].scatter(y_test, y_pred, alpha=0.6)
axes[0,0].plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], 'r--')
axes[0,0].set_xlabel('Actual')
axes[0,0].set_ylabel('Predicted')
axes[0,0].set_title('Predicted vs Actual')

# Residuals vs Predicted
residuals = y_test - y_pred
axes[0,1].scatter(y_pred, residuals, alpha=0.6)
axes[0,1].axhline(y=0, color='r', linestyle='--')
axes[0,1].set_xlabel('Predicted')
axes[0,1].set_ylabel('Residuals')
axes[0,1].set_title('Residuals vs Predicted')

# Residuals histogram
axes[1,0].hist(residuals, bins=30, alpha=0.7)
axes[1,0].set_xlabel('Residuals')
axes[1,0].set_ylabel('Frequency')
axes[1,0].set_title('Residual Distribution')

# Q-Q plot checks whether residuals are approximately normal
stats.probplot(residuals, dist="norm", plot=axes[1,1])
axes[1,1].set_title('Q-Q Plot')

plt.tight_layout()
plt.show()

# Calculate final metrics
final_mse = mean_squared_error(y_test, y_pred)
final_r2 = r2_score(y_test, y_pred)

print(f"\nFinal Model Performance:")
print(f"Test R²: {final_r2:.3f}")
print(f"Test RMSE: {np.sqrt(final_mse):.3f}")

Good residual plots show random scatter around zero. Patterns in residuals indicate model limitations—systematic over- or under-prediction in certain ranges suggests missing features or wrong model assumptions.
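A quick numeric complement to the plots: residuals that still correlate with a feature suggest the model hasn't fully captured that feature's effect. This reuses the residuals computed above.

# Near-zero correlations mean no feature has obvious leftover structure
resid_corr = X_test_engineered.corrwith(residuals)
print("Residual-feature correlations:")
print(resid_corr.round(3))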

Machine learning is an iterative process of experimentation and refinement. Start simple, understand your data thoroughly, and gradually increase complexity only when simpler approaches prove insufficient.

In our next part, we’ll explore advanced machine learning techniques including ensemble methods, dimensionality reduction, and strategies for handling imbalanced datasets and missing data in real-world scenarios.