Pandas DataFrames and Data Manipulation
Working with real-world data means dealing with missing values, mixed data types, inconsistent formats, and all the messiness that comes with information collected from different sources. Pure NumPy arrays can’t handle this complexity gracefully, which is why pandas exists. It bridges the gap between raw data and the clean, structured information you need for analysis.
The library’s name comes from “panel data,” but what it really offers is a way to turn raw data into insights: it handles the messy reality of real-world data while providing an intuitive interface that feels familiar to anyone who’s worked with Excel or SQL.
DataFrames vs NumPy Arrays
While NumPy excels at numerical computation, pandas DataFrames excel at handling heterogeneous data with labels. A DataFrame can contain different data types in different columns, handle missing values gracefully, and provide meaningful row and column labels.
import pandas as pd
import numpy as np
# Create a DataFrame from a dictionary
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'age': [25, 30, 35, 28],
    'salary': [50000, 60000, 70000, 55000],
    'department': ['Engineering', 'Sales', 'Engineering', 'Marketing']
}
df = pd.DataFrame(data)
print(df)
print(f"\nData types:\n{df.dtypes}")
This example shows pandas’ strength: mixed data types in a single structure with meaningful column names. NumPy arrays can’t handle this heterogeneity as elegantly.
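For contrast, here’s a small sketch (not part of the original example) of the same kind of records forced into a plain NumPy array: NumPy coerces everything to one common dtype, so the numbers turn into text and the column labels vanish.
# The same records as a plain NumPy array: everything collapses to a single dtype
raw = np.array([['Alice', 25, 50000], ['Bob', 30, 60000]])
print(raw.dtype)            # one string dtype; the ages and salaries are now text
print(df['salary'].mean())  # pandas keeps 'salary' numeric, so arithmetic still works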
Loading and Exploring Data
Real data science work starts with loading data from various sources. Pandas provides readers for most common formats, and I use them daily for everything from CSV files to database connections.
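In a real project the first line is usually one of pandas’ readers; the calls below are illustrative placeholders (the file names and connection don’t exist in this tutorial), so we’ll generate sample data to work with instead.
# Illustrative placeholders - these files and the connection are not part of this tutorial
# df = pd.read_csv('sales.csv', parse_dates=['date'])
# df = pd.read_excel('sales.xlsx', sheet_name='2024')
# df = pd.read_sql('SELECT * FROM sales', connection)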
# For demonstration, create sample data
np.random.seed(42)
sample_data = pd.DataFrame({
    'date': pd.date_range('2024-01-01', periods=100),
    'sales': np.random.normal(1000, 200, 100),
    'region': np.random.choice(['North', 'South', 'East', 'West'], 100),
    'product': np.random.choice(['A', 'B', 'C'], 100)
})
# Essential exploration methods
print("Dataset shape:", sample_data.shape)
print("\nFirst few rows:")
print(sample_data.head())
print("\nData types and memory usage:")
print(sample_data.info())
These exploration methods give you a quick overview of your dataset’s structure, which is essential before diving into analysis.
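Two other calls worth adding to this first pass, sketched here: describe() for a numeric summary and value_counts() for categorical columns.
# Summary statistics for numeric columns (count, mean, std, quartiles)
print(sample_data.describe())
# Frequency of each category in a column
print(sample_data['region'].value_counts())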
Data Selection and Filtering
Pandas provides multiple ways to select data, and choosing the right method makes your code more readable and maintainable. I use different approaches depending on whether I’m selecting by label, position, or condition.
# Column selection
sales_column = sample_data['sales'] # Single column (Series)
subset = sample_data[['sales', 'region']] # Multiple columns (DataFrame)
# Row selection by condition
high_sales = sample_data[sample_data['sales'] > 1200]
north_region = sample_data[sample_data['region'] == 'North']
# Combined conditions
north_high_sales = sample_data[
    (sample_data['region'] == 'North') &
    (sample_data['sales'] > 1200)
]
print(f"High sales records: {len(high_sales)}")
print(f"North region records: {len(north_region)}")
The .loc and .iloc accessors provide more explicit selection methods that I prefer for complex indexing operations.
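A quick sketch of both on the sample data (an illustrative addition, using the standard accessors): .loc selects by label and boolean mask, .iloc by integer position.
# Label-based selection: rows 0-4 (inclusive with .loc) and named columns
print(sample_data.loc[0:4, ['sales', 'region']])
# Position-based selection: first five rows, first two columns
print(sample_data.iloc[:5, :2])
# .loc with a boolean mask reads naturally for conditional selection
print(sample_data.loc[sample_data['sales'] > 1200, ['date', 'sales']])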
Data Cleaning and Transformation
Real-world data is messy, and pandas provides excellent tools for cleaning and transforming it. I spend a significant portion of my time on these operations because clean data is essential for reliable analysis.
# Introduce some missing values for demonstration
sample_data_dirty = sample_data.copy()
sample_data_dirty.loc[5:10, 'sales'] = np.nan
# Check for missing values
print("Missing values per column:")
print(sample_data_dirty.isnull().sum())
# Handle missing values
clean_data = sample_data_dirty.dropna()
filled_data = sample_data_dirty.fillna({
    'sales': sample_data_dirty['sales'].mean()
})
print(f"Original: {len(sample_data_dirty)} rows")
print(f"After dropna: {len(clean_data)} rows")
print(f"After fillna: {len(filled_data)} rows")
Data type conversion is another common cleaning task. Pandas often infers types correctly, but sometimes you need to be explicit.
# Data type conversions and new columns
sample_data['product'] = sample_data['product'].astype('category')
sample_data['date'] = pd.to_datetime(sample_data['date'])
sample_data['month'] = sample_data['date'].dt.month
sample_data['sales_category'] = pd.cut(
    sample_data['sales'],
    bins=[0, 800, 1200, float('inf')],
    labels=['Low', 'Medium', 'High']
)
print("New columns:")
print(sample_data[['date', 'month', 'sales', 'sales_category']].head())
Grouping and Aggregation
GroupBy operations are where pandas really shines for data analysis. They follow the split-apply-combine pattern: split data into groups, apply a function to each group, then combine the results.
# Basic grouping operations
region_stats = sample_data.groupby('region')['sales'].agg([
    'count', 'mean', 'std', 'min', 'max'
])
print("Sales statistics by region:")
print(region_stats.round(2))
# Multiple grouping variables
monthly_region_sales = sample_data.groupby(['month', 'region'])['sales'].sum()
print("\nMonthly sales by region:")
print(monthly_region_sales.head(10))
Custom aggregation functions give you flexibility when built-in functions aren’t sufficient.
# Custom aggregation functions
def sales_range(series):
    return series.max() - series.min()
custom_stats = sample_data.groupby('region')['sales'].agg([
    ('mean', 'mean'),
    ('range', sales_range)
])
print("Custom statistics by region:")
print(custom_stats.round(2))
Merging and Time Series Operations
Real projects often involve combining data from multiple sources. Pandas provides several methods for joining DataFrames, similar to SQL joins but with more flexibility.
# Create sample datasets to merge
customers = pd.DataFrame({
    'customer_id': [1, 2, 3, 4],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana'],
    'city': ['New York', 'London', 'Tokyo', 'Paris']
})
orders = pd.DataFrame({
    'order_id': [101, 102, 103, 104],
    'customer_id': [1, 2, 2, 3],
    'amount': [100, 150, 200, 75]
})
# Inner join (only matching records)
merged = pd.merge(customers, orders, on='customer_id', how='inner')
print("Merged data:")
print(merged)
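The how argument switches the join type. A left join, for example, keeps every customer even when there is no matching order, so Diana appears with missing order details:
# Left join keeps all customers; unmatched ones get NaN in the order columns
left_merged = pd.merge(customers, orders, on='customer_id', how='left')
print(left_merged)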
Pandas excels at time series analysis, providing specialized functionality for working with dates and times.
# Time series operations
sample_data.set_index('date', inplace=True)
# Resampling to different frequencies
weekly_sales = sample_data['sales'].resample('W').mean()
monthly_sales = sample_data['sales'].resample('M').sum()
print("Monthly sales:")
print(monthly_sales)
# Rolling window calculations
sample_data['sales_7day_avg'] = sample_data['sales'].rolling(window=7).mean()
print("\nSales with moving average:")
print(sample_data[['sales', 'sales_7day_avg']].head(10))
Pandas DataFrames provide the foundation for most data science workflows in Python. The operations you’ve learned here—selection, filtering, grouping, and merging—form the building blocks for more complex analyses.
In our next part, we’ll explore data visualization with Matplotlib and Seaborn, learning how to create compelling visual representations of the data patterns we’ve discovered using pandas. Visualization is crucial for both exploratory analysis and communicating insights to stakeholders.