Data Pipeline Architecture
Architectural Patterns
Foundational approaches to data pipeline design:
Batch Processing:
- Processing data in scheduled intervals
- Handling large volumes efficiently
- Optimizing for throughput over latency
- Implementing idempotent operations
- Managing dependencies between jobs
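A minimal sketch of the idempotent-operation point above, using sqlite3 as a stand-in warehouse (table and column names are illustrative): re-running a daily load replaces that day's partition instead of appending duplicates.
import sqlite3

def load_daily_sales(conn: sqlite3.Connection, run_date: str, rows: list) -> None:
    """Idempotent daily load: the target partition is deleted and rewritten
    in one transaction, so re-running the job for the same date is safe."""
    with conn:  # sqlite3 connection as context manager = one transaction
        conn.execute("DELETE FROM sales WHERE sale_date = ?", (run_date,))
        conn.executemany(
            "INSERT INTO sales (sale_date, product_id, amount) VALUES (?, ?, ?)",
            rows,
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (sale_date TEXT, product_id TEXT, amount REAL)")
rows = [("2025-07-01", "p1", 10.0), ("2025-07-01", "p2", 25.5)]
load_daily_sales(conn, "2025-07-01", rows)
load_daily_sales(conn, "2025-07-01", rows)  # retry or backfill: same result
print(conn.execute("SELECT COUNT(*) FROM sales").fetchone())  # (2,)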
Stream Processing:
- Processing data in near real-time
- Handling continuous data flows
- Implementing windowing strategies
- Managing state and checkpointing
- Ensuring exactly-once processing
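As a sketch of the windowing and keyed-state ideas above (pure Python with illustrative event tuples; a production pipeline would use a stream processor such as Flink or Spark Structured Streaming, which also checkpoints this state for fault tolerance):
from collections import defaultdict

WINDOW_SECONDS = 60  # tumbling window size

def window_start(event_time: float) -> float:
    """Assign an event to the tumbling window that contains it."""
    return event_time - (event_time % WINDOW_SECONDS)

def aggregate(events):
    """Sum amounts per (window, product) key, the kind of keyed,
    windowed state a stream processor checkpoints."""
    state = defaultdict(float)
    for event_time, product_id, amount in events:
        state[(window_start(event_time), product_id)] += amount
    return dict(state)

events = [
    (0.0, "p1", 10.0),
    (30.0, "p1", 5.0),
    (65.0, "p2", 7.5),  # falls into the next 60-second window
]
print(aggregate(events))
# {(0.0, 'p1'): 15.0, (60.0, 'p2'): 7.5}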
Lambda Architecture:
- Combining batch and streaming layers
- Providing both accurate and real-time views
- Managing duplicate processing logic
- Reconciling batch and speed layers
- Optimizing for different access patterns
Kappa Architecture:
- Unifying batch and streaming with a single path
- Simplifying maintenance with one codebase
- Leveraging stream processing for all workloads
- Reprocessing historical data through streams
- Reducing architectural complexity
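A sketch of the Kappa idea that one code path serves both live traffic and historical reprocessing (function and field names are illustrative):
def enrich(event: dict) -> dict:
    """Single transformation path used for both live and historical data."""
    return {**event, "amount_usd": round(event["amount"] * event["fx_rate"], 2)}

def run(stream):
    """The same consumer handles a live topic or a replay from offset 0;
    reprocessing history is just re-reading the log through this function."""
    return [enrich(e) for e in stream]

live_events = [{"amount": 10.0, "fx_rate": 1.1}]
historical_replay = [{"amount": 4.0, "fx_rate": 0.9}, {"amount": 2.5, "fx_rate": 1.0}]

print(run(live_events))        # near-real-time path
print(run(historical_replay))  # backfill path, identical logic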
Data Mesh:
- Decentralizing data ownership
- Treating data as a product
- Implementing domain-oriented architecture
- Providing self-serve data infrastructure
- Establishing federated governance
Example Lambda Architecture:
                 ┌───────────────┐
                 │               │
                 │ Data Sources  │
                 │               │
                 └───────┬───────┘
                         │
           ┌─────────────┴─────────────┐
           ▼                           ▼
   ┌───────────────┐           ┌───────────────┐
   │               │           │               │
   │  Batch Layer  │           │  Speed Layer  │
   │               │           │               │
   └───────┬───────┘           └───────┬───────┘
           │                           │
           ▼                           ▼
   ┌───────────────┐           ┌───────────────┐
   │               │           │               │
   │  Batch Views  │           │   Real-time   │
   │               │           │     Views     │
   └───────┬───────┘           └───────┬───────┘
           │                           │
           └─────────────┬─────────────┘
                         ▼
                 ┌───────────────┐
                 │               │
                 │    Serving    │
                 │     Layer     │
                 │               │
                 └───────────────┘
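The serving layer in the diagram reconciles the two paths at query time: batch views are authoritative up to the last completed batch run, and real-time views fill the gap after it. A minimal sketch of that merge (dates and totals are illustrative):
def merged_daily_totals(batch_view: dict, realtime_view: dict, cutoff_date: str) -> dict:
    """Serve batch results up to the last completed batch run (cutoff_date)
    and real-time results after it, so fresh data is visible without
    waiting for the next batch cycle."""
    merged = {d: v for d, v in batch_view.items() if d <= cutoff_date}
    merged.update({d: v for d, v in realtime_view.items() if d > cutoff_date})
    return merged

batch_view = {"2025-07-01": 1200.0, "2025-07-02": 980.0}
realtime_view = {"2025-07-02": 990.0, "2025-07-03": 310.0}  # partial, still accumulating
print(merged_daily_totals(batch_view, realtime_view, cutoff_date="2025-07-02"))
# {'2025-07-01': 1200.0, '2025-07-02': 980.0, '2025-07-03': 310.0}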
ETL vs. ELT
Comparing transformation approaches:
ETL (Extract, Transform, Load):
- Transformation before loading to target
- Data cleansing outside the data warehouse
- Typically uses specialized ETL tools
- Better for complex transformations when target-system compute is limited
- Reduced storage requirements in target systems
ELT (Extract, Load, Transform):
- Loading raw data before transformation
- Leveraging data warehouse compute power
- Enabling exploration of raw data
- Simplifying pipeline architecture
- Supporting iterative transformation development
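A minimal sketch of the ELT flow just listed, with sqlite3 standing in for a cloud warehouse such as Redshift, BigQuery, or Snowflake (table names and data are illustrative): raw data is landed untouched, then transformed in place with SQL.
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: land the raw records as-is.
conn.execute("CREATE TABLE raw_sales (sale_ts TEXT, product_id TEXT, quantity INT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?, ?, ?)",
    [
        ("2025-07-01T09:15:00", "p1", 2, 19.98),
        ("2025-07-01T10:02:00", "p1", 1, 9.99),
        ("2025-07-01T11:40:00", "p2", 3, 44.97),
    ],
)

# Transform: runs inside the warehouse, on the data already loaded.
conn.execute("""
    CREATE TABLE daily_sales_summary AS
    SELECT date(sale_ts) AS sale_date,
           product_id,
           SUM(quantity) AS total_quantity,
           SUM(amount) AS total_amount
    FROM raw_sales
    GROUP BY date(sale_ts), product_id
""")
print(conn.execute("SELECT * FROM daily_sales_summary").fetchall())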
Hybrid Approaches:
- Light transformation during extraction
- Heavy transformation in the warehouse
- Preprocessing for specific use cases
- Optimizing for different data types
- Balancing performance and flexibility
When to Choose ETL:
- Limited data warehouse resources
- Complex transformations requiring specialized tools
- Strict data privacy requirements
- Legacy system integration
- Real-time transformation needs
When to Choose ELT:
- Modern cloud data warehouses with scalable compute
- Exploratory analytics requirements
- Evolving transformation requirements
- Large volumes of structured or semi-structured data
- Self-service analytics environments
Orchestration
Managing pipeline workflows and dependencies:
Orchestration Requirements:
- Dependency management
- Scheduling and triggering
- Error handling and retries
- Monitoring and alerting
- Resource management
Apache Airflow:
- DAG-based workflow definition
- Python-based configuration
- Rich operator ecosystem
- Extensive monitoring capabilities
- Strong community support
Example Airflow DAG:
from datetime import datetime, timedelta

from airflow import DAG
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator
default_args = {
    'owner': 'data_engineering',
    'depends_on_past': False,
    'start_date': datetime(2025, 7, 1),
    'email': ['data-engineering@example.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
}
dag = DAG(
    'sales_data_pipeline',
    default_args=default_args,
    description='Load and transform sales data',
    schedule_interval='0 2 * * *',
    catchup=False,
    tags=['sales', 'production'],
)
# Check if data is available
check_data_available = S3KeySensor(
    task_id='check_data_available',
    bucket_key='sales/{{ ds }}/*.csv',
    wildcard_match=True,
    bucket_name='sales-data',
    aws_conn_id='aws_default',
    timeout=60 * 30,
    poke_interval=60,
    dag=dag,
)
# Load data from S3 to Redshift
load_to_redshift = S3ToRedshiftOperator(
    task_id='load_to_redshift',
    schema='raw',
    table='sales',
    s3_bucket='sales-data',
    s3_key='sales/{{ ds }}/',
    redshift_conn_id='redshift_default',
    aws_conn_id='aws_default',
    copy_options=[
        "DELIMITER ','",
        "IGNOREHEADER 1",
        "DATEFORMAT 'auto'",
    ],
    method='REPLACE',
    dag=dag,
)
# Transform data
transform_data = PostgresOperator(
    task_id='transform_data',
    postgres_conn_id='redshift_default',
    sql="""
        INSERT INTO analytics.daily_sales_summary
        SELECT
            date_trunc('day', sale_timestamp) AS sale_date,
            product_id,
            SUM(quantity) AS total_quantity,
            SUM(amount) AS total_amount,
            COUNT(DISTINCT customer_id) AS unique_customers
        FROM raw.sales
        WHERE DATE(sale_timestamp) = '{{ ds }}'
        GROUP BY 1, 2
    """,
    dag=dag,
)
# Define task dependencies
check_data_available >> load_to_redshift >> transform_data
Other Orchestration Tools:
- Prefect
- Dagster
- Argo Workflows
- Luigi
- AWS Step Functions
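For comparison with the Airflow example, a pipeline of the same shape in Prefect is expressed as decorated functions rather than a DAG object. A minimal sketch, assuming the Prefect 2.x API, with illustrative task bodies:
from prefect import flow, task

@task(retries=3, retry_delay_seconds=300)
def extract(run_date: str) -> list:
    # Placeholder extract step; a real task would read from S3, an API, etc.
    return [{"date": run_date, "product_id": "p1", "amount": 9.99}]

@task
def transform(rows: list) -> dict:
    totals = {}
    for row in rows:
        totals[row["product_id"]] = totals.get(row["product_id"], 0.0) + row["amount"]
    return totals

@task
def load(totals: dict) -> None:
    print(f"loading {totals}")  # stand-in for a warehouse write

@flow(name="sales_data_pipeline")
def sales_data_pipeline(run_date: str = "2025-07-01"):
    rows = extract(run_date)
    load(transform(rows))

if __name__ == "__main__":
    sales_data_pipeline()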
Orchestration Best Practices:
- Define clear task boundaries
- Implement proper error handling
- Use parameterization for reusability
- Monitor pipeline performance
- Implement CI/CD for pipeline code
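One way to apply the parameterization practice above in Airflow is a DAG factory: a single function builds the same pipeline for several sources, so the logic lives in one place. A minimal sketch (the source names and EmptyOperator placeholders are illustrative):
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

def build_source_dag(source: str) -> DAG:
    """Create an identical extract -> load -> transform pipeline per source."""
    dag = DAG(
        dag_id=f"{source}_pipeline",
        start_date=datetime(2025, 7, 1),
        schedule_interval="0 2 * * *",
        catchup=False,
        tags=[source, "generated"],
    )
    with dag:
        extract = EmptyOperator(task_id="extract")
        load = EmptyOperator(task_id="load")
        transform = EmptyOperator(task_id="transform")
        extract >> load >> transform
    return dag

# Register one DAG per source so the Airflow scheduler can discover them.
for source in ("sales", "inventory", "customers"):
    globals()[f"{source}_pipeline"] = build_source_dag(source)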