Data Pipeline Architecture

Architectural Patterns

Foundational approaches to data pipeline design:

Batch Processing:

  • Processing data in scheduled intervals
  • Handling large volumes efficiently
  • Optimizing for throughput over latency
  • Implementing idempotent operations (see the sketch after this list)
  • Managing dependencies between jobs
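
A minimal sketch of an idempotent daily batch load, with Python's built-in sqlite3 standing in for the target store; the daily_sales table, run_date parameter, and sample rows are illustrative. Deleting the run date's partition and reinserting it inside one transaction means a rerun for the same date produces the same result:

import sqlite3
from datetime import date

def run_daily_batch(conn: sqlite3.Connection, run_date: date, rows: list) -> None:
    """Delete-then-insert so reruns for the same date produce the same result."""
    with conn:  # one transaction: either both statements apply or neither
        conn.execute(
            "DELETE FROM daily_sales WHERE sale_date = ?",
            (run_date.isoformat(),),
        )
        conn.executemany(
            "INSERT INTO daily_sales (sale_date, product_id, amount) VALUES (?, ?, ?)",
            rows,
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (sale_date TEXT, product_id TEXT, amount REAL)")
rows = [("2025-07-01", "p1", 10.0), ("2025-07-01", "p2", 4.5)]
run_daily_batch(conn, date(2025, 7, 1), rows)
run_daily_batch(conn, date(2025, 7, 1), rows)  # rerun: still exactly two rows
print(conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone())  # (2,)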

Stream Processing:

  • Processing data in near real-time
  • Handling continuous data flows
  • Implementing windowing strategies (see the sketch after this list)
  • Managing state and checkpointing
  • Ensuring exactly-once processing
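
A minimal sketch of tumbling-window aggregation in plain Python; the event records, the 60-second window size, and the in-memory state dict are illustrative. Production engines such as Flink or Spark Structured Streaming add watermarks, checkpointing, and fault tolerance on top of this basic idea:

from collections import defaultdict

WINDOW_SECONDS = 60

def window_start(event_time: float) -> float:
    """Align an event timestamp to the start of its tumbling window."""
    return event_time - (event_time % WINDOW_SECONDS)

def aggregate(events):
    """Sum amounts per (window, product) as events arrive."""
    state = defaultdict(float)  # keyed state; a real engine checkpoints this
    for event in events:
        key = (window_start(event["ts"]), event["product_id"])
        state[key] += event["amount"]
    return dict(state)

events = [
    {"ts": 10.0, "product_id": "p1", "amount": 5.0},
    {"ts": 55.0, "product_id": "p1", "amount": 3.0},
    {"ts": 70.0, "product_id": "p1", "amount": 2.0},  # falls in the next window
]
print(aggregate(events))
# {(0.0, 'p1'): 8.0, (60.0, 'p1'): 2.0}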

Lambda Architecture:

  • Combining batch and streaming layers
  • Providing both accurate and real-time views
  • Managing duplicate processing logic
  • Reconciling batch and speed layers
  • Optimizing for different access patterns

Kappa Architecture:

  • Unifying batch and streaming with a single path
  • Simplifying maintenance with one codebase
  • Leveraging stream processing for all workloads
  • Reprocessing historical data through streams (see the sketch after this list)
  • Reducing architectural complexity
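
A minimal sketch of the Kappa idea in plain Python: a single processing function serves both live traffic and reprocessing, with an in-memory list standing in for a durable, replayable log such as Kafka; all names are illustrative:

def process(event, state):
    """Single code path used for both live traffic and reprocessing."""
    state[event["product_id"]] = state.get(event["product_id"], 0) + event["amount"]
    return state

event_log = [  # durable, replayable log of raw events
    {"product_id": "p1", "amount": 5},
    {"product_id": "p2", "amount": 7},
    {"product_id": "p1", "amount": 3},
]

# Live processing: consume events as they arrive.
live_state = {}
for event in event_log:
    live_state = process(event, live_state)

# Reprocessing: rebuild state by replaying the log from the start,
# e.g. after fixing a bug in process(); no separate batch code path.
rebuilt_state = {}
for event in event_log:
    rebuilt_state = process(event, rebuilt_state)

assert live_state == rebuilt_state == {"p1": 8, "p2": 7}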

Data Mesh:

  • Decentralizing data ownership
  • Treating data as a product
  • Implementing domain-oriented architecture
  • Providing self-serve data infrastructure
  • Establishing federated governance

Example Lambda Architecture:

┌───────────────┐
│               │
│  Data Sources │
│               │
└───────┬───────┘
        │
        ▼
┌───────────────┐     ┌───────────────┐
│               │     │               │
│  Batch Layer  │     │ Speed Layer   │
│               │     │               │
└───────┬───────┘     └───────┬───────┘
        │                     │
        ▼                     ▼
┌───────────────┐     ┌───────────────┐
│               │     │               │
│  Batch Views  │     │ Real-time     │
│               │     │ Views         │
└───────┬───────┘     └───────┬───────┘
        │                     │
        └─────────┬───────────┘
                  │
                  ▼
          ┌───────────────┐
          │               │
          │  Serving      │
          │  Layer        │
          │               │
          └───────────────┘
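
A minimal sketch of the serving layer's merge step in plain Python; the view contents and the "add speed-layer deltas on top of the batch view" policy are illustrative:

def serve(batch_view: dict, realtime_view: dict) -> dict:
    """Combine precomputed batch results with incremental real-time results."""
    merged = dict(batch_view)  # accurate but stale, recomputed by the batch layer
    for key, value in realtime_view.items():
        merged[key] = merged.get(key, 0) + value  # recent deltas from the speed layer
    return merged

batch_view = {"p1": 100, "p2": 40}       # computed nightly over all history
realtime_view = {"p1": 3, "p3": 1}       # counts since the last batch run
print(serve(batch_view, realtime_view))  # {'p1': 103, 'p2': 40, 'p3': 1}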

ETL vs. ELT

Comparing transformation approaches:

ETL (Extract, Transform, Load):

  • Transformation before loading to target
  • Data cleansing outside the data warehouse
  • Typically uses specialized ETL tools
  • Better when target-system compute is limited and transformations are complex
  • Reduced storage requirements in target systems

ELT (Extract, Load, Transform):

  • Loading raw data before transformation
  • Leveraging data warehouse compute power
  • Enabling exploration of raw data
  • Simplifying pipeline architecture
  • Supporting iterative transformation development
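
A minimal ELT sketch with sqlite3 standing in for the warehouse and hard-coded rows standing in for extracted source data; the table names are illustrative. Raw records land first, then SQL transforms them inside the warehouse:

import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: land the raw records untouched.
conn.execute("CREATE TABLE raw_sales (sale_ts TEXT, product_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?, ?)",
    [("2025-07-01T10:00", "p1", 10.0), ("2025-07-01T12:30", "p1", 5.0)],
)

# Transform: use the warehouse's SQL engine on the already-loaded data.
conn.execute("""
    CREATE TABLE daily_sales AS
    SELECT substr(sale_ts, 1, 10) AS sale_date,
           product_id,
           SUM(amount)            AS total_amount
    FROM raw_sales
    GROUP BY substr(sale_ts, 1, 10), product_id
""")
print(conn.execute("SELECT * FROM daily_sales").fetchall())
# [('2025-07-01', 'p1', 15.0)]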

Hybrid Approaches:

  • Light transformation during extraction
  • Heavy transformation in the warehouse
  • Preprocessing for specific use cases
  • Optimizing for different data types
  • Balancing performance and flexibility

When to Choose ETL:

  • Limited data warehouse resources
  • Complex transformations requiring specialized tools
  • Strict data privacy requirements
  • Legacy system integration
  • Transformations that must run in flight, before data lands

When to Choose ELT:

  • Modern cloud data warehouses with scalable compute
  • Exploratory analytics requirements
  • Evolving transformation requirements
  • Large volumes of structured or semi-structured data
  • Self-service analytics environments

Orchestration

Managing pipeline workflows and dependencies:

Orchestration Requirements:

  • Dependency management
  • Scheduling and triggering
  • Error handling and retries (see the sketch after this list)
  • Monitoring and alerting
  • Resource management
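
A minimal sketch of the retry-with-backoff behavior an orchestrator applies to each task; run_with_retries, extract_sales, and the limits are hypothetical helpers for illustration:

import time

def run_with_retries(task, max_retries=3, base_delay=1.0):
    """Re-run a failing task with exponential backoff before giving up."""
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception:
            if attempt == max_retries:
                raise  # surface the failure for alerting
            time.sleep(base_delay * 2 ** attempt)

def extract_sales():
    ...  # call a flaky source system here

run_with_retries(extract_sales)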

Apache Airflow:

  • DAG-based workflow definition
  • Python-based configuration
  • Rich operator ecosystem
  • Extensive monitoring capabilities
  • Strong community support

Example Airflow DAG:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.providers.postgres.operators.postgres import PostgresOperator

default_args = {
    'owner': 'data_engineering',
    'depends_on_past': False,
    'start_date': datetime(2025, 7, 1),
    'email': ['data-engineering@example.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'sales_data_pipeline',
    default_args=default_args,
    description='Load and transform sales data',
    schedule_interval='0 2 * * *',
    catchup=False,
    tags=['sales', 'production'],
)

# Check if data is available
check_data_available = S3KeySensor(
    task_id='check_data_available',
    bucket_key='sales/{{ ds }}/*.csv',
    wildcard_match=True,
    bucket_name='sales-data',
    aws_conn_id='aws_default',
    timeout=60 * 30,
    poke_interval=60,
    dag=dag,
)

# Load data from S3 to Redshift
load_to_redshift = S3ToRedshiftOperator(
    task_id='load_to_redshift',
    schema='raw',
    table='sales',
    s3_bucket='sales-data',
    s3_key='sales/{{ ds }}/',
    redshift_conn_id='redshift_default',
    aws_conn_id='aws_default',
    copy_options=[
        "DELIMITER ','",
        "IGNOREHEADER 1",
        "DATEFORMAT 'auto'",
    ],
    method='REPLACE',
    dag=dag,
)

# Transform data
transform_data = PostgresOperator(
    task_id='transform_data',
    postgres_conn_id='redshift_default',
    sql="""
    INSERT INTO analytics.daily_sales_summary
    SELECT 
        date_trunc('day', sale_timestamp) as sale_date,
        product_id,
        SUM(quantity) as total_quantity,
        SUM(amount) as total_amount,
        COUNT(DISTINCT customer_id) as unique_customers
    FROM raw.sales
    WHERE DATE(sale_timestamp) = '{{ ds }}'
    GROUP BY 1, 2
    """,
    dag=dag,
)

# Define task dependencies
check_data_available >> load_to_redshift >> transform_data

Other Orchestration Tools:

  • Prefect
  • Dagster
  • Argo Workflows
  • Luigi
  • AWS Step Functions

Orchestration Best Practices:

  • Define clear task boundaries
  • Implement proper error handling
  • Use parameterization for reusability (see the sketch after this list)
  • Monitor pipeline performance
  • Implement CI/CD for pipeline code
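
A hedged sketch of parameterizing a DAG with Airflow's params argument so the same pipeline code can target different schemas; the DAG id, schema names, and SQL are illustrative:

from datetime import datetime
from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    'sales_data_pipeline_parameterized',
    start_date=datetime(2025, 7, 1),
    schedule_interval='0 2 * * *',
    catchup=False,
    params={'target_schema': 'analytics'},  # overridable per run
) as dag:
    transform_data = PostgresOperator(
        task_id='transform_data',
        postgres_conn_id='redshift_default',
        # Jinja templating resolves params at runtime, keeping the SQL reusable.
        sql="""
        INSERT INTO {{ params.target_schema }}.daily_sales_summary
        SELECT date_trunc('day', sale_timestamp), product_id, SUM(amount)
        FROM raw.sales
        WHERE DATE(sale_timestamp) = '{{ ds }}'
        GROUP BY 1, 2
        """,
    )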