Statistical Analysis and Hypothesis Testing

Data without context is just noise. Statistical analysis provides the framework for distinguishing meaningful patterns from random variation, helping you make confident decisions based on evidence rather than intuition. The distinction between correlation and causation, or between statistical significance and practical importance, can make or break business decisions worth millions.

Understanding statistics isn’t about memorizing formulas—it’s about developing the intuition to ask the right questions and interpret results correctly. When someone claims their marketing campaign increased sales by 15%, statistical thinking helps you determine whether that’s a real effect or just random fluctuation.

Descriptive Statistics Beyond the Basics

Most people stop at mean, median, and standard deviation, but real insights come from understanding the shape and behavior of your data distributions. Skewness tells you whether extreme values pull your data in one direction, while kurtosis reveals whether you have more or fewer outliers than expected.

import numpy as np
import pandas as pd
import scipy.stats as stats
import matplotlib.pyplot as plt

# Generate sample data with different characteristics
np.random.seed(42)
normal_data = np.random.normal(100, 15, 1000)
skewed_data = np.random.exponential(2, 1000)
bimodal_data = np.concatenate([np.random.normal(80, 10, 500), 
                              np.random.normal(120, 10, 500)])

# Comprehensive descriptive statistics
def analyze_distribution(data, name):
    stats_dict = {
        'mean': np.mean(data),
        'median': np.median(data),
        'std': np.std(data, ddof=1),  # sample (not population) standard deviation
        'skewness': stats.skew(data),
        'kurtosis': stats.kurtosis(data),
        'min': np.min(data),
        'max': np.max(data),
        'q25': np.percentile(data, 25),
        'q75': np.percentile(data, 75)
    }
    
    print(f"\n{name} Distribution:")
    for key, value in stats_dict.items():
        print(f"{key}: {value:.3f}")
    
    return stats_dict

# Analyze different distributions
normal_stats = analyze_distribution(normal_data, "Normal")
skewed_stats = analyze_distribution(skewed_data, "Skewed")
bimodal_stats = analyze_distribution(bimodal_data, "Bimodal")

Skewness near zero indicates symmetric data, while values above 1 or below -1 suggest significant asymmetry. Kurtosis measures tail heaviness; note that SciPy reports excess kurtosis by default, so a normal distribution scores about 0, and positive values mean more extreme outliers than a normal distribution would predict.

Confidence Intervals and Uncertainty

Point estimates like “the average is 100” tell only part of the story. Confidence intervals quantify the uncertainty around them: a 95% interval is produced by a procedure that captures the true population parameter in 95% of repeated samples. This distinction becomes crucial when making decisions based on sample data.

# Calculate confidence intervals for different scenarios
def confidence_interval(data, confidence=0.95):
    """Calculate confidence interval for the mean."""
    n = len(data)
    mean = np.mean(data)
    std_err = stats.sem(data)  # Standard error of the mean
    
    # t-distribution for small samples, normal for large samples
    if n < 30:
        t_val = stats.t.ppf((1 + confidence) / 2, n - 1)
        margin_error = t_val * std_err
    else:
        z_val = stats.norm.ppf((1 + confidence) / 2)
        margin_error = z_val * std_err
    
    return mean - margin_error, mean + margin_error

# Compare confidence intervals for different sample sizes
sample_sizes = [10, 30, 100, 1000]
population = np.random.normal(100, 15, 10000)

print("Confidence Intervals by Sample Size:")
for n in sample_sizes:
    sample = np.random.choice(population, n, replace=False)
    ci_lower, ci_upper = confidence_interval(sample)
    width = ci_upper - ci_lower
    
    print(f"n={n:4d}: [{ci_lower:.2f}, {ci_upper:.2f}] (width: {width:.2f})")

Larger samples produce narrower confidence intervals, giving you more precise estimates. This relationship helps you determine how much data you need to achieve desired precision.
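
You can invert this relationship to plan data collection. The sketch below estimates how many observations you need for a target margin of error, assuming the population standard deviation is known or estimated from a pilot sample (the sigma of 15 and the target of ±1 are illustrative assumptions):

def sample_size_for_margin(sigma, margin, confidence=0.95):
    """Estimate n so the CI half-width is at most `margin` (normal approximation)."""
    z = stats.norm.ppf((1 + confidence) / 2)
    return int(np.ceil((z * sigma / margin) ** 2))

# Illustrative: sigma estimated at 15, target half-width of +/- 1
print(f"Required n: {sample_size_for_margin(15, 1)}")  # about 865 observations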

Hypothesis Testing Framework

Hypothesis testing provides a structured approach to evaluating claims about your data. The key insight is that you’re not proving your hypothesis true—you’re determining whether the evidence is strong enough to reject the null hypothesis.

# A/B test example: comparing two marketing campaigns
def ab_test_analysis(control_data, treatment_data, alpha=0.05):
    """Perform independent t-test for A/B testing."""
    
    # Descriptive statistics
    control_mean = np.mean(control_data)
    treatment_mean = np.mean(treatment_data)
    
    # Statistical test
    t_stat, p_value = stats.ttest_ind(control_data, treatment_data)
    
    # Effect size (Cohen's d)
    pooled_std = np.sqrt(((len(control_data) - 1) * np.var(control_data, ddof=1) + 
                         (len(treatment_data) - 1) * np.var(treatment_data, ddof=1)) / 
                        (len(control_data) + len(treatment_data) - 2))
    cohens_d = (treatment_mean - control_mean) / pooled_std
    
    # Results interpretation
    is_significant = p_value < alpha
    improvement = ((treatment_mean - control_mean) / control_mean) * 100
    
    results = {
        'control_mean': control_mean,
        'treatment_mean': treatment_mean,
        'improvement_pct': improvement,
        't_statistic': t_stat,
        'p_value': p_value,
        'is_significant': is_significant,
        'effect_size': cohens_d,
        'sample_sizes': (len(control_data), len(treatment_data))
    }
    
    return results

# Simulate A/B test data
np.random.seed(42)
control_conversions = np.random.normal(0.12, 0.05, 1000)  # 12% baseline conversion
treatment_conversions = np.random.normal(0.14, 0.05, 1000)  # 14% treatment conversion
# Note: real conversion outcomes are binary (converted or not); this simulation
# treats per-user conversion as a continuous value to keep the t-test example simple.

# Analyze results
ab_results = ab_test_analysis(control_conversions, treatment_conversions)

print("A/B Test Results:")
print(f"Control mean: {ab_results['control_mean']:.4f}")
print(f"Treatment mean: {ab_results['treatment_mean']:.4f}")
print(f"Improvement: {ab_results['improvement_pct']:.2f}%")
print(f"P-value: {ab_results['p_value']:.6f}")
print(f"Statistically significant: {ab_results['is_significant']}")
print(f"Effect size (Cohen's d): {ab_results['effect_size']:.3f}")

Effect size tells you whether a statistically significant result is practically meaningful. A p-value of 0.001 might indicate strong evidence, but if the effect size is tiny, the practical impact could be negligible.
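
A quick simulation makes this concrete. With very large samples, even a trivial difference produces a tiny p-value while Cohen's d stays negligible (the group sizes and the 0.5-unit gap below are arbitrary choices for illustration):

np.random.seed(0)
big_control = np.random.normal(100.0, 15, 200_000)
big_treatment = np.random.normal(100.5, 15, 200_000)  # difference of only 0.5

t_stat, p_value = stats.ttest_ind(big_control, big_treatment)
cohens_d = (big_treatment.mean() - big_control.mean()) / 15  # using the known sigma
print(f"p-value: {p_value:.2e}, Cohen's d: {cohens_d:.3f}")
# Highly "significant", yet d is around 0.03: far below even a small effect.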

Correlation vs Causation

Correlation analysis reveals relationships between variables, but interpreting these relationships requires careful thinking about causation, confounding variables, and the direction of influence.

# Demonstrate correlation analysis and interpretation
def correlation_analysis(df, var1, var2):
    """Comprehensive correlation analysis between two variables."""
    
    # Calculate different correlation coefficients
    pearson_r, pearson_p = stats.pearsonr(df[var1], df[var2])
    spearman_r, spearman_p = stats.spearmanr(df[var1], df[var2])
    
    # Linear regression for trend line
    slope, intercept, r_value, p_value, std_err = stats.linregress(df[var1], df[var2])
    
    print(f"Correlation Analysis: {var1} vs {var2}")
    print(f"Pearson correlation: {pearson_r:.3f} (p={pearson_p:.6f})")
    print(f"Spearman correlation: {spearman_r:.3f} (p={spearman_p:.6f})")
    print(f"R-squared: {r_value**2:.3f}")
    
    # Visualization
    plt.figure(figsize=(10, 6))
    plt.scatter(df[var1], df[var2], alpha=0.6)
    
    # Add trend line
    x_trend = np.linspace(df[var1].min(), df[var1].max(), 100)
    y_trend = slope * x_trend + intercept
    plt.plot(x_trend, y_trend, 'r-', linewidth=2)
    
    plt.xlabel(var1)
    plt.ylabel(var2)
    plt.title(f'Correlation: {var1} vs {var2} (r={pearson_r:.3f})')
    plt.show()
    
    return pearson_r, spearman_r

# Create sample data with different correlation patterns
np.random.seed(42)
sample_df = pd.DataFrame({
    'advertising_spend': np.random.uniform(1000, 10000, 200),
    'sales': np.random.normal(50000, 10000, 200),
    'temperature': np.random.normal(20, 10, 200),
    'ice_cream_sales': np.random.normal(1000, 300, 200)
})

# Add some actual correlation
sample_df['sales'] += sample_df['advertising_spend'] * 2 + np.random.normal(0, 5000, 200)
sample_df['ice_cream_sales'] += sample_df['temperature'] * 50 + np.random.normal(0, 200, 200)

# Analyze correlations
correlation_analysis(sample_df, 'advertising_spend', 'sales')
correlation_analysis(sample_df, 'temperature', 'ice_cream_sales')
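
A classic trap is a confounder that drives two otherwise unrelated variables. As an illustrative extension of the data above, suppose temperature also drives hypothetical fan sales; ice cream sales and fan sales will then correlate strongly with each other even though neither causes the other:

# Hypothetical fan sales, also driven by temperature
fan_sales = sample_df['temperature'] * 30 + np.random.normal(0, 150, 200)

r, p = stats.pearsonr(sample_df['ice_cream_sales'], fan_sales)
print(f"ice_cream_sales vs fan_sales: r={r:.3f} (p={p:.6f})")
# Strong correlation, but the only causal link runs through temperature.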

Statistical Power and Sample Size

Understanding statistical power helps you design experiments that can actually detect the effects you’re looking for. Low power means you might miss real effects, while excessive sample sizes waste resources detecting trivial differences.

# Power analysis for experiment design
def power_analysis(effect_size, alpha=0.05, power=0.8):
    """Calculate required sample size for given effect size and power."""
    from scipy.stats import norm
    
    # Z-scores for alpha and power
    z_alpha = norm.ppf(1 - alpha/2)  # Two-tailed test
    z_beta = norm.ppf(power)
    
    # Sample size calculation (per group)
    n_per_group = 2 * ((z_alpha + z_beta) / effect_size) ** 2
    
    return int(np.ceil(n_per_group))

# Calculate sample sizes for different scenarios
effect_sizes = [0.1, 0.2, 0.5, 0.8]  # very small, small, medium, large (Cohen's conventions)
powers = [0.8, 0.9, 0.95]

print("Sample Size Requirements (per group):")
print("Effect Size | Power=0.8 | Power=0.9 | Power=0.95")
print("-" * 50)

for es in effect_sizes:
    row = f"{es:11.1f} |"
    for power in powers:
        n = power_analysis(es, power=power)
        row += f"{n:10d} |"
    print(row)

This analysis shows why detecting small effects requires large sample sizes. Planning experiments with power analysis prevents the frustration of inconclusive results due to insufficient data.
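
If you have statsmodels available, you can cross-check this normal-approximation formula against its numerical power solver for the two-sample t-test:

from statsmodels.stats.power import TTestIndPower

n = TTestIndPower().solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"n per group: {np.ceil(n):.0f}")  # ~64, close to the approximation above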

Common Statistical Pitfalls

Multiple comparisons, p-hacking, and survivorship bias can lead to false discoveries. Understanding these pitfalls helps you design better analyses and interpret results more critically.

When testing multiple hypotheses simultaneously, adjust your significance threshold to account for increased false positive risk. The Bonferroni correction is conservative but simple: divide your alpha level by the number of tests.
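
A minimal sketch of that correction, using made-up p-values from four hypothetical tests:

p_values = [0.001, 0.012, 0.034, 0.21]  # hypothetical results from 4 tests
alpha = 0.05
corrected_alpha = alpha / len(p_values)  # Bonferroni: 0.0125

for i, p in enumerate(p_values, start=1):
    verdict = "significant" if p < corrected_alpha else "not significant"
    print(f"test {i}: p={p:.3f} -> {verdict} at corrected alpha={corrected_alpha:.4f}")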

Always define your analysis plan before looking at the data. Post-hoc analyses and data dredging can find “significant” patterns in pure noise. Document your methodology and stick to it, or clearly label exploratory analyses as such.

Statistical analysis provides the foundation for data-driven decision making, but it requires careful application and interpretation. The goal isn’t to find statistical significance at any cost, but to extract reliable insights that inform better decisions.

In our next part, we’ll explore machine learning fundamentals with scikit-learn, learning how to build predictive models and evaluate their performance. We’ll see how statistical concepts like bias, variance, and overfitting apply to machine learning algorithms.