Data Pipeline Architecture

Architectural Patterns

Foundational approaches to data pipeline design:

Batch Processing:

  • Processing data in scheduled intervals
  • Handling large volumes efficiently
  • Optimizing for throughput over latency
  • Implementing idempotent operations (see the sketch after this list)
  • Managing dependencies between jobs
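
A minimal sketch of an idempotent daily batch load, with Python's built-in sqlite3 standing in for the target store; the daily_sales table, run_date parameter, and sample rows are illustrative. Deleting the run date's partition and reinserting it inside one transaction means a rerun for the same date produces the same result:

import sqlite3
from datetime import date

def run_daily_batch(conn: sqlite3.Connection, run_date: date, rows: list) -> None:
    """Delete-then-insert so reruns for the same date produce the same result."""
    with conn:  # one transaction: either both statements apply or neither
        conn.execute(
            "DELETE FROM daily_sales WHERE sale_date = ?",
            (run_date.isoformat(),),
        )
        conn.executemany(
            "INSERT INTO daily_sales (sale_date, product_id, amount) VALUES (?, ?, ?)",
            rows,
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_sales (sale_date TEXT, product_id TEXT, amount REAL)")
rows = [("2025-07-01", "p1", 10.0), ("2025-07-01", "p2", 4.5)]
run_daily_batch(conn, date(2025, 7, 1), rows)
run_daily_batch(conn, date(2025, 7, 1), rows)  # rerun: still exactly two rows
print(conn.execute("SELECT COUNT(*) FROM daily_sales").fetchone())  # (2,)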

Stream Processing:

  • Processing data in near real-time
  • Handling continuous data flows
  • Implementing windowing strategies (see the sketch after this list)
  • Managing state and checkpointing
  • Ensuring exactly-once processing
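
A minimal sketch of tumbling-window aggregation in plain Python; the event records, the 60-second window size, and the in-memory state dict are illustrative. Production engines such as Flink or Spark Structured Streaming add watermarks, checkpointing, and fault tolerance on top of this basic idea:

from collections import defaultdict

WINDOW_SECONDS = 60

def window_start(event_time: float) -> float:
    """Align an event timestamp to the start of its tumbling window."""
    return event_time - (event_time % WINDOW_SECONDS)

def aggregate(events):
    """Sum amounts per (window, product) as events arrive."""
    state = defaultdict(float)  # keyed state; a real engine checkpoints this
    for event in events:
        key = (window_start(event["ts"]), event["product_id"])
        state[key] += event["amount"]
    return dict(state)

events = [
    {"ts": 10.0, "product_id": "p1", "amount": 5.0},
    {"ts": 55.0, "product_id": "p1", "amount": 3.0},
    {"ts": 70.0, "product_id": "p1", "amount": 2.0},  # falls in the next window
]
print(aggregate(events))
# {(0.0, 'p1'): 8.0, (60.0, 'p1'): 2.0}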

Lambda Architecture:

  • Combining batch and streaming layers
  • Providing both accurate and real-time views
  • Managing duplicate processing logic
  • Reconciling batch and speed layers
  • Optimizing for different access patterns

Kappa Architecture:

  • Unifying batch and streaming with a single path
  • Simplifying maintenance with one codebase
  • Leveraging stream processing for all workloads
  • Reprocessing historical data through streams (see the sketch after this list)
  • Reducing architectural complexity
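
A minimal sketch of the Kappa idea in plain Python: a single processing function serves both live traffic and reprocessing, with an in-memory list standing in for a durable, replayable log such as Kafka; all names are illustrative:

def process(event, state):
    """Single code path used for both live traffic and reprocessing."""
    state[event["product_id"]] = state.get(event["product_id"], 0) + event["amount"]
    return state

event_log = [  # durable, replayable log of raw events
    {"product_id": "p1", "amount": 5},
    {"product_id": "p2", "amount": 7},
    {"product_id": "p1", "amount": 3},
]

# Live processing: consume events as they arrive.
live_state = {}
for event in event_log:
    live_state = process(event, live_state)

# Reprocessing: rebuild state by replaying the log from the start,
# e.g. after fixing a bug in process(); no separate batch code path.
rebuilt_state = {}
for event in event_log:
    rebuilt_state = process(event, rebuilt_state)

assert live_state == rebuilt_state == {"p1": 8, "p2": 7}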

Data Mesh:

  • Decentralizing data ownership
  • Treating data as a product
  • Implementing domain-oriented architecture
  • Providing self-serve data infrastructure
  • Establishing federated governance

Example Lambda Architecture:

┌───────────────┐
│               │
│  Data Sources │
│               │
└───────┬───────┘
        │
        ▼
┌───────────────┐     ┌───────────────┐
│               │     │               │
│  Batch Layer  │     │ Speed Layer   │
│               │     │               │
└───────┬───────┘     └───────┬───────┘
        │                     │
        ▼                     ▼
┌───────────────┐     ┌───────────────┐
│               │     │               │
│  Batch Views  │     │ Real-time     │
│               │     │ Views         │
└───────┬───────┘     └───────┬───────┘
        │                     │
        └─────────┬───────────┘
                  │
                  ▼
          ┌───────────────┐
          │               │
          │  Serving      │
          │  Layer        │
          │               │
          └───────────────┘
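
A minimal sketch of the serving layer's merge step in plain Python; the view contents and the "add speed-layer deltas on top of the batch view" policy are illustrative:

def serve(batch_view: dict, realtime_view: dict) -> dict:
    """Combine precomputed batch results with incremental real-time results."""
    merged = dict(batch_view)  # accurate but stale, recomputed by the batch layer
    for key, value in realtime_view.items():
        merged[key] = merged.get(key, 0) + value  # recent deltas from the speed layer
    return merged

batch_view = {"p1": 100, "p2": 40}       # computed nightly over all history
realtime_view = {"p1": 3, "p3": 1}       # counts since the last batch run
print(serve(batch_view, realtime_view))  # {'p1': 103, 'p2': 40, 'p3': 1}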

ETL vs. ELT

Comparing transformation approaches:

ETL (Extract, Transform, Load):

  • Transformation before loading to target
  • Data cleansing outside the data warehouse
  • Typically uses specialized ETL tools
  • Better when target-system compute is limited and transformations are complex
  • Reduced storage requirements in target systems

ELT (Extract, Load, Transform):

  • Loading raw data before transformation
  • Leveraging data warehouse compute power
  • Enabling exploration of raw data
  • Simplifying pipeline architecture
  • Supporting iterative transformation development
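
A minimal ELT sketch with sqlite3 standing in for the warehouse and hard-coded rows standing in for extracted source data; the table names are illustrative. Raw records land first, then SQL transforms them inside the warehouse:

import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: land the raw records untouched.
conn.execute("CREATE TABLE raw_sales (sale_ts TEXT, product_id TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_sales VALUES (?, ?, ?)",
    [("2025-07-01T10:00", "p1", 10.0), ("2025-07-01T12:30", "p1", 5.0)],
)

# Transform: use the warehouse's SQL engine on the already-loaded data.
conn.execute("""
    CREATE TABLE daily_sales AS
    SELECT substr(sale_ts, 1, 10) AS sale_date,
           product_id,
           SUM(amount)            AS total_amount
    FROM raw_sales
    GROUP BY substr(sale_ts, 1, 10), product_id
""")
print(conn.execute("SELECT * FROM daily_sales").fetchall())
# [('2025-07-01', 'p1', 15.0)]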

Hybrid Approaches:

  • Light transformation during extraction
  • Heavy transformation in the warehouse
  • Preprocessing for specific use cases
  • Optimizing for different data types
  • Balancing performance and flexibility

When to Choose ETL:

  • Limited data warehouse resources
  • Complex transformations requiring specialized tools
  • Strict data privacy requirements
  • Legacy system integration
  • Transformations that must run in flight, before data lands

When to Choose ELT:

  • Modern cloud data warehouses with scalable compute
  • Exploratory analytics requirements
  • Evolving transformation requirements
  • Large volumes of structured or semi-structured data
  • Self-service analytics environments

Orchestration

Managing pipeline workflows and dependencies:

Orchestration Requirements:

  • Dependency management
  • Scheduling and triggering
  • Error handling and retries (see the sketch after this list)
  • Monitoring and alerting
  • Resource management
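
A minimal sketch of the retry-with-backoff behavior an orchestrator applies to each task; run_with_retries, extract_sales, and the limits are hypothetical helpers for illustration:

import time

def run_with_retries(task, max_retries=3, base_delay=1.0):
    """Re-run a failing task with exponential backoff before giving up."""
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception:
            if attempt == max_retries:
                raise  # surface the failure for alerting
            time.sleep(base_delay * 2 ** attempt)

def extract_sales():
    ...  # call a flaky source system here

run_with_retries(extract_sales)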

Apache Airflow:

  • DAG-based workflow definition
  • Python-based configuration
  • Rich operator ecosystem
  • Extensive monitoring capabilities
  • Strong community support

Example Airflow DAG:

from datetime import datetime, timedelta
from airflow import DAG
from airflow.providers.amazon.aws.transfers.s3_to_redshift import S3ToRedshiftOperator
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.providers.postgres.operators.postgres import PostgresOperator

default_args = {
    'owner': 'data_engineering',
    'depends_on_past': False,
    'start_date': datetime(2025, 7, 1),
    'email': ['data-engineering@example.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
}

dag = DAG(
    'sales_data_pipeline',
    default_args=default_args,
    description='Load and transform sales data',
    schedule_interval='0 2 * * *',
    catchup=False,
    tags=['sales', 'production'],
)

# Check if data is available
check_data_available = S3KeySensor(
    task_id='check_data_available',
    bucket_key='sales/{{ ds }}/*.csv',
    wildcard_match=True,
    bucket_name='sales-data',
    aws_conn_id='aws_default',
    timeout=60 * 30,
    poke_interval=60,
    dag=dag,
)

# Load data from S3 to Redshift
load_to_redshift = S3ToRedshiftOperator(
    task_id='load_to_redshift',
    schema='raw',
    table='sales',
    s3_bucket='sales-data',
    s3_key='sales/{{ ds }}/',
    redshift_conn_id='redshift_default',
    aws_conn_id='aws_default',
    copy_options=[
        "DELIMITER ','",
        "IGNOREHEADER 1",
        "DATEFORMAT 'auto'",
    ],
    method='REPLACE',
    dag=dag,
)

# Transform data
transform_data = PostgresOperator(
    task_id='transform_data',
    postgres_conn_id='redshift_default',
    sql="""
    INSERT INTO analytics.daily_sales_summary
    SELECT 
        date_trunc('day', sale_timestamp) as sale_date,
        product_id,
        SUM(quantity) as total_quantity,
        SUM(amount) as total_amount,
        COUNT(DISTINCT customer_id) as unique_customers
    FROM raw.sales
    WHERE DATE(sale_timestamp) = '{{ ds }}'
    GROUP BY 1, 2
    """,
    dag=dag,
)

# Define task dependencies
check_data_available >> load_to_redshift >> transform_data

Other Orchestration Tools:

  • Prefect
  • Dagster
  • Argo Workflows
  • Luigi
  • AWS Step Functions

Orchestration Best Practices:

  • Define clear task boundaries
  • Implement proper error handling
  • Use parameterization for reusability (see the sketch after this list)
  • Monitor pipeline performance
  • Implement CI/CD for pipeline code
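
A hedged sketch of parameterizing a DAG with Airflow's params argument so the same pipeline code can target different schemas; the DAG id, schema names, and SQL are illustrative:

from datetime import datetime
from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    'sales_data_pipeline_parameterized',
    start_date=datetime(2025, 7, 1),
    schedule_interval='0 2 * * *',
    catchup=False,
    params={'target_schema': 'analytics'},  # overridable per run
) as dag:
    transform_data = PostgresOperator(
        task_id='transform_data',
        postgres_conn_id='redshift_default',
        # Jinja templating resolves params at runtime, keeping the SQL reusable.
        sql="""
        INSERT INTO {{ params.target_schema }}.daily_sales_summary
        SELECT date_trunc('day', sale_timestamp), product_id, SUM(amount)
        FROM raw.sales
        WHERE DATE(sale_timestamp) = '{{ ds }}'
        GROUP BY 1, 2
        """,
    )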