Data Visualization with Matplotlib and Seaborn

Staring at spreadsheets full of numbers rarely reveals the patterns hiding in your data. A well-designed chart can expose trends, outliers, and relationships that would take hours to discover through statistical summaries alone.

The key insight is that visualization isn’t about making pretty charts; it’s about visual thinking. When you plot data, you’re translating abstract numbers into patterns your brain can process intuitively. This makes visualization essential for both exploration and communication.

Matplotlib Fundamentals

Matplotlib provides the foundation for most Python plotting. While its default styles aren’t always beautiful, understanding matplotlib’s architecture helps you create exactly the visualization you need.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Create sample data
np.random.seed(42)
x = np.linspace(0, 10, 100)
y = np.sin(x) + np.random.normal(0, 0.1, 100)

# Basic plotting
plt.figure(figsize=(10, 6))
plt.plot(x, y, 'b-', alpha=0.7, label='Noisy sine wave')
plt.xlabel('X values')
plt.ylabel('Y values')
plt.title('Basic Line Plot')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

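The example above uses pyplot’s implicit interface, where calls such as plt.plot act on the current figure. Underneath, every plot is a Figure containing one or more Axes, and working with those objects directly gives you complete control over your plots. I always set the figure size explicitly because the default (6.4 × 4.8 inches) is often too small for presentations or reports.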
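As a minimal sketch, the same kind of plot can be written against explicit Figure and Axes objects, which is the style the multi-panel examples later in this part rely on. The data is regenerated with an arbitrary seed so the block runs on its own, and the title text is illustrative.

```python
import matplotlib.pyplot as plt
import numpy as np

# A noisy sine wave like the one above, regenerated so the block is self-contained
rng = np.random.default_rng(42)
x = np.linspace(0, 10, 100)
y = np.sin(x) + rng.normal(0, 0.1, 100)

# plt.subplots returns the Figure (the canvas) and an Axes (one plot area)
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(x, y, 'b-', alpha=0.7, label='Noisy sine wave')
ax.set_xlabel('X values')
ax.set_ylabel('Y values')
ax.set_title('Basic Line Plot (explicit Axes)')
ax.legend()
ax.grid(True, alpha=0.3)

# In a script or notebook, finish with plt.show()
```

Holding on to `ax` pays off once a figure has several panels, because each call states exactly which panel it modifies.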

Seaborn for Statistical Visualization

Seaborn builds on matplotlib to provide high-level statistical plotting functions. It handles many common visualization tasks with less code and produces more attractive defaults.

import seaborn as sns

# Create sample dataset
data = pd.DataFrame({
    'group': np.random.choice(['A', 'B', 'C'], 300),
    'value': np.random.normal(0, 1, 300),
    'category': np.random.choice(['X', 'Y'], 300)
})

# Statistical plots with seaborn
fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Box plot
sns.boxplot(data=data, x='group', y='value', ax=axes[0,0])
axes[0,0].set_title('Distribution by Group')

# Violin plot
sns.violinplot(data=data, x='group', y='value', hue='category', ax=axes[0,1])
axes[0,1].set_title('Distribution by Group and Category')

# Scatter plot colored by group (y is independent random noise, so no trend is expected)
sns.scatterplot(data=data, x='value', y=np.random.normal(0, 1, 300),
                hue='group', ax=axes[1,0])
axes[1,0].set_title('Scatter Plot with Groups')

# Histogram
sns.histplot(data=data, x='value', hue='group', ax=axes[1,1])
axes[1,1].set_title('Histogram by Group')

plt.tight_layout()
plt.show()

Seaborn’s strength is handling categorical data and statistical relationships automatically. The hue parameter adds a third dimension to your plots without additional complexity.
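To illustrate the point about statistical relationships, here is a hedged sketch using `sns.regplot`, which draws the points together with a fitted linear regression line and a confidence band. The `x` and `y` columns are synthetic, with a made-up slope of 2 chosen only so that there is a trend to find.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data with a known linear trend (slope 2) plus noise; illustrative only
rng = np.random.default_rng(0)
df = pd.DataFrame({'x': rng.uniform(0, 10, 200)})
df['y'] = 2.0 * df['x'] + rng.normal(0, 2, 200)

fig, ax = plt.subplots(figsize=(8, 5))
# regplot draws the points, a fitted line, and a bootstrapped confidence interval
sns.regplot(data=df, x='x', y='y', ax=ax, scatter_kws={'alpha': 0.5})
ax.set_title('Scatter Plot with Linear Fit')

# In a script or notebook, finish with plt.show()
```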

Exploratory Data Analysis Plots

When exploring new datasets, I follow a standard sequence of visualizations that reveal different aspects of the data structure and quality.

# Build a synthetic sales dataset (stands in for data you would normally load)
sales_data = pd.DataFrame({
    'date': pd.date_range('2023-01-01', periods=365),
    'sales': np.random.normal(1000, 200, 365) + 
             50 * np.sin(np.arange(365) * 2 * np.pi / 365),  # Seasonal pattern
    'region': np.random.choice(['North', 'South', 'East', 'West'], 365),
    'product': np.random.choice(['A', 'B', 'C'], 365)
})

# One figure holding a 2x3 grid of panels
plt.figure(figsize=(15, 10))

# Time series plot
plt.subplot(2, 3, 1)
plt.plot(sales_data['date'], sales_data['sales'])
plt.title('Sales Over Time')
plt.xticks(rotation=45)

# Distribution plot
plt.subplot(2, 3, 2)
sns.histplot(sales_data['sales'], bins=30)
plt.title('Sales Distribution')

# Box plot by category
plt.subplot(2, 3, 3)
sns.boxplot(data=sales_data, x='region', y='sales')
plt.title('Sales by Region')

# Correlation heatmap (for numerical data)
plt.subplot(2, 3, 4)
sales_data['month'] = sales_data['date'].dt.month
sales_data['day_of_year'] = sales_data['date'].dt.dayofyear
corr_data = sales_data[['sales', 'month', 'day_of_year']].corr()
sns.heatmap(corr_data, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix')

# Scatter plot of sales against day of year, colored by region
plt.subplot(2, 3, 5)
sns.scatterplot(data=sales_data, x='day_of_year', y='sales', hue='region', alpha=0.6)
plt.title('Sales vs Day of Year')

# Bar plot of averages
plt.subplot(2, 3, 6)
region_avg = sales_data.groupby('region')['sales'].mean()
region_avg.plot(kind='bar')
plt.title('Average Sales by Region')
plt.xticks(rotation=45)

plt.tight_layout()
plt.show()

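Together, these six panels cover the trend over time, the overall distribution, differences between groups, correlations between numeric columns, and group averages. That is usually enough to spot the patterns, outliers, and relationships that guide further analysis.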
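Box plots mark points beyond 1.5 × IQR from the quartiles as outliers by default. A rough sketch of the same rule in pandas can confirm numerically what a panel like "Sales by Region" shows. The data below is regenerated with an arbitrary seed and without the seasonal term, and the `is_outlier` column name is an assumption made for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
sales = pd.DataFrame({
    'sales': rng.normal(1000, 200, 365),
    'region': rng.choice(['North', 'South', 'East', 'West'], 365),
})

# Per-region quartiles, broadcast back to one value per row
grouped = sales.groupby('region')['sales']
q1 = grouped.transform(lambda s: s.quantile(0.25))
q3 = grouped.transform(lambda s: s.quantile(0.75))
iqr = q3 - q1

# Same rule the box plot whiskers use: outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
sales['is_outlier'] = (sales['sales'] < q1 - 1.5 * iqr) | (sales['sales'] > q3 + 1.5 * iqr)

# Count of flagged rows in each region
print(sales.groupby('region')['is_outlier'].sum())
```

Having the flags in a column makes it easy to inspect the flagged rows directly before deciding whether they are data errors or genuine extremes.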

Advanced Visualization Techniques

For complex data stories, you need more sophisticated visualization techniques. These approaches help when simple charts don’t capture the full picture.

# Create complex sample data
np.random.seed(42)
complex_data = pd.DataFrame({
    'x': np.random.normal(0, 1, 1000),
    'y': np.random.normal(0, 1, 1000),
    'size': np.random.uniform(10, 100, 1000),
    'category': np.random.choice(['Type1', 'Type2', 'Type3', 'Type4'], 1000)
})

# Multi-dimensional scatter plot
plt.figure(figsize=(12, 8))

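# Use size and color to encode two extra variables (4 dimensions in total)
category = complex_data['category'].astype('category')
scatter = plt.scatter(complex_data['x'], complex_data['y'],
                     s=complex_data['size'],
                     c=category.cat.codes,
                     alpha=0.6, cmap='viridis')

plt.xlabel('X Dimension')
plt.ylabel('Y Dimension')
plt.title('Multi-dimensional Scatter Plot')
# Label the colorbar ticks with category names instead of raw integer codes
cbar = plt.colorbar(scatter, ticks=range(len(category.cat.categories)), label='Category')
cbar.ax.set_yticklabels(category.cat.categories)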

# Add size legend
sizes = [20, 50, 100]
labels = ['Small', 'Medium', 'Large']
legend_elements = [plt.scatter([], [], s=s, c='gray', alpha=0.6) for s in sizes]
plt.legend(legend_elements, labels, title='Size', loc='upper right')

plt.show()
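A continuous colorbar is an awkward fit for discrete categories, and hand-built size legends are easy to get out of sync with the data. As one possible alternative, matplotlib's `PathCollection.legend_elements` can derive both legends from the scatter itself. This sketch assumes a smaller synthetic dataset with the same columns as above.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n = 300
points = pd.DataFrame({
    'x': rng.normal(0, 1, n),
    'y': rng.normal(0, 1, n),
    'size': rng.uniform(10, 100, n),
    'category': pd.Categorical(rng.choice(['Type1', 'Type2', 'Type3', 'Type4'], n)),
})

fig, ax = plt.subplots(figsize=(10, 7))
scatter = ax.scatter(points['x'], points['y'], s=points['size'],
                     c=points['category'].cat.codes, cmap='viridis', alpha=0.6)

# One legend entry per category code, relabeled with the category names
handles, _ = scatter.legend_elements(prop='colors')
color_legend = ax.legend(handles, list(points['category'].cat.categories),
                         title='Category', loc='upper left')
ax.add_artist(color_legend)  # keep this legend when a second one is added

# A second legend for marker sizes, sampled at a few representative values
size_handles, size_labels = scatter.legend_elements(prop='sizes', num=4, alpha=0.6)
ax.legend(size_handles, size_labels, title='Size', loc='upper right')

# In a script or notebook, finish with plt.show()
```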

Customization and Styling

Professional visualizations require attention to styling and customization. I’ve learned that small details make a big difference in how your audience perceives your analysis.

# Set style for professional appearance
# ('seaborn-v0_8' is the name in matplotlib 3.6+; older releases call it 'seaborn')
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")

# Create a publication-ready plot
fig, ax = plt.subplots(figsize=(10, 6))

# Plot with custom styling
for region in sales_data['region'].unique():
    region_data = sales_data[sales_data['region'] == region]
    monthly_sales = region_data.groupby(region_data['date'].dt.month)['sales'].mean()
    
    ax.plot(monthly_sales.index, monthly_sales.values, 
            marker='o', linewidth=2, markersize=6, label=region)

ax.set_xlabel('Month', fontsize=12, fontweight='bold')
ax.set_ylabel('Average Sales', fontsize=12, fontweight='bold')
ax.set_title('Seasonal Sales Patterns by Region', fontsize=14, fontweight='bold')
ax.legend(title='Region', title_fontsize=12, fontsize=10)
ax.grid(True, alpha=0.3)

# Customize tick labels
ax.set_xticks(range(1, 13))
ax.set_xticklabels(['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
                   'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'])

plt.tight_layout()
plt.show()
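A publication-ready plot usually ends up in a file rather than on screen. The following is a minimal sketch of exporting a figure with `savefig`. The placeholder curve, the temporary output directory, and the file names are all assumptions for illustration.

```python
import os
import tempfile

import matplotlib.pyplot as plt
import numpy as np

# Placeholder figure standing in for the styled plot above
fig, ax = plt.subplots(figsize=(10, 6))
months = np.arange(1, 13)
ax.plot(months, 1000 + 50 * np.sin(months * 2 * np.pi / 12), marker='o')
ax.set_title('Placeholder Seasonal Curve')

# Illustrative output location; substitute your own report directory
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, 'seasonal_sales.png')
pdf_path = os.path.join(out_dir, 'seasonal_sales.pdf')

# dpi controls raster resolution; bbox_inches='tight' trims surrounding whitespace
fig.savefig(png_path, dpi=300, bbox_inches='tight')
# Vector formats such as PDF stay sharp at any zoom level
fig.savefig(pdf_path, bbox_inches='tight')
plt.close(fig)  # free the figure's memory once it is saved
```

Raster output at 300 dpi is a common choice for print, while the vector copy is convenient for documents that may be rescaled.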

Interactive Visualizations

Static plots are great for reports, but interactive visualizations help with exploration and engagement. Plotly provides excellent interactive capabilities that work well in Jupyter notebooks.

import plotly.express as px

# Interactive scatter plot
fig = px.scatter(sales_data, x='day_of_year', y='sales', 
                color='region', size='sales',
                hover_data=['date', 'product'],
                title='Interactive Sales Analysis')

fig.update_layout(
    xaxis_title='Day of Year',
    yaxis_title='Sales Amount',
    font=dict(size=12)
)

# This would show an interactive plot in Jupyter
# fig.show()

Visualization Best Practices

Effective data visualization follows principles that make information clear and actionable. I’ve learned these through years of creating charts that either illuminated insights or confused audiences.

Choose the right chart type for your data: line plots for time series, scatter plots for relationships, bar charts for comparisons, and histograms for distributions. Use color purposefully—to highlight important information, not just for decoration.

Always consider your audience. Technical stakeholders can handle complex multi-panel plots, while executives prefer simple, clear messages. Label everything clearly and provide context that helps viewers understand what they’re seeing.

Most importantly, every visualization should answer a specific question or support a particular argument. If you can’t explain why a chart matters, it probably doesn’t belong in your analysis.

Visualization is both an analytical tool and a communication medium. Master both aspects, and you’ll be able to discover insights in your data and share them effectively with others.

In our next part, we’ll explore statistical analysis and hypothesis testing, learning how to move beyond descriptive statistics to make inferences about populations and test specific hypotheses about your data.