Data Warehouse Design

Structuring data for analytical workloads:

Schema Design Approaches:

  • Star schema
  • Snowflake schema (a normalized sketch follows the star schema example below)
  • Data vault
  • One Big Table (OBT)
  • Hybrid approaches

Example Star Schema:

-- Dimension tables (created first so the fact table's foreign keys resolve;
-- dim_customer and dim_store are defined along the same lines)
CREATE TABLE dim_date (
    date_id INT PRIMARY KEY,
    date_actual DATE NOT NULL,
    day_of_week VARCHAR(10) NOT NULL,
    month_actual INT NOT NULL,
    month_name VARCHAR(10) NOT NULL,
    quarter_actual INT NOT NULL,
    year_actual INT NOT NULL,
    is_weekend BOOLEAN NOT NULL,
    is_holiday BOOLEAN NOT NULL
);

CREATE TABLE dim_product (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(100) NOT NULL,
    category VARCHAR(50) NOT NULL,
    subcategory VARCHAR(50),
    unit_cost DECIMAL(10,2) NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL
);

-- Fact table: one row per sale, with a foreign key into each dimension
CREATE TABLE fact_sales (
    sale_id INT PRIMARY KEY,
    date_id INT NOT NULL REFERENCES dim_date(date_id),
    product_id INT NOT NULL REFERENCES dim_product(product_id),
    customer_id INT NOT NULL REFERENCES dim_customer(customer_id),
    store_id INT NOT NULL REFERENCES dim_store(store_id),
    quantity INT NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL,
    discount_amount DECIMAL(10,2) NOT NULL,
    sales_amount DECIMAL(10,2) NOT NULL,
    cost_amount DECIMAL(10,2) NOT NULL,
    profit_amount DECIMAL(10,2) NOT NULL
);
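
For contrast, the snowflake schema from the list above normalizes dimension hierarchies into their own tables. A minimal sketch splitting the product hierarchy out of dim_product (dim_category and its columns are illustrative, not part of the example above):

-- Snowflake variant: the category hierarchy becomes its own table
CREATE TABLE dim_category (
    category_id INT PRIMARY KEY,
    category_name VARCHAR(50) NOT NULL,
    subcategory_name VARCHAR(50)
);

-- The product dimension references the hierarchy instead of embedding it
CREATE TABLE dim_product_snowflaked (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(100) NOT NULL,
    category_id INT NOT NULL REFERENCES dim_category(category_id),
    unit_cost DECIMAL(10,2) NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL
);

Queries pay an extra join for the normalization, which is why read-heavy analytics usually favor the star form.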

Partitioning and Clustering:

  • Time-based partitioning
  • Range partitioning
  • List partitioning
  • Hash partitioning
  • Clustering keys

Example BigQuery Partitioning and Clustering:

-- BigQuery partitioned and clustered table
CREATE OR REPLACE TABLE `project.dataset.fact_sales`
(
  sale_id STRING,
  sale_timestamp TIMESTAMP,
  customer_id STRING,
  product_id STRING,
  store_id STRING,
  quantity INT64,
  unit_price NUMERIC,
  sales_amount NUMERIC
)
PARTITION BY DATE(sale_timestamp)
CLUSTER BY store_id, product_id;
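
BigQuery derives partitions implicitly from the PARTITION BY expression. Engines with declarative partitioning spell the ranges out instead; a minimal PostgreSQL sketch of time-based range partitioning (table and partition names are illustrative):

-- PostgreSQL: parent table partitioned by range on the sale date
CREATE TABLE fact_sales_by_date (
    sale_id BIGINT NOT NULL,
    sale_date DATE NOT NULL,
    store_id INT NOT NULL,
    sales_amount DECIMAL(10,2) NOT NULL
) PARTITION BY RANGE (sale_date);

-- One partition per month; filters on sale_date prune to the matching partition
CREATE TABLE fact_sales_2024_01 PARTITION OF fact_sales_by_date
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

List and hash partitioning follow the same pattern with PARTITION BY LIST (...) and PARTITION BY HASH (...).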

Data Lake Organization

Structuring raw and processed data:

Data Lake Zones:

  • Landing zone (raw data)
  • Bronze zone (validated raw data)
  • Silver zone (transformed/enriched data)
  • Gold zone (business-ready data)
  • Sandbox zone (exploration area)

Example Data Lake Structure:

data_lake/
├── landing/                 # Raw ingested data
│   ├── sales/
│   │   └── YYYY-MM-DD/      # Partitioned by ingestion date
│   ├── customers/
│   └── products/
├── bronze/                  # Validated raw data
│   ├── sales/
│   │   └── YYYY-MM-DD/      # Partitioned by ingestion date
│   ├── customers/
│   └── products/
├── silver/                  # Transformed data
│   ├── sales/
│   │   └── YYYY/MM/DD/      # Partitioned by business date
│   ├── customers/
│   └── products/
└── gold/                    # Business-ready data
    ├── analytics/
    │   ├── customer_360/
    │   └── sales_performance/
    └── reporting/
        ├── daily_sales_summary/
        └── monthly_kpis/

File Format Considerations:

  • Parquet for analytical workloads (columnar, compressed, broad engine support)
  • Avro for schema evolution and row-oriented record exchange
  • ORC for columnar storage in Hive-centric stacks
  • JSON for flexible, semi-structured payloads
  • CSV for simplicity and broad compatibility
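
To make the format choice concrete, a silver-zone table can be declared directly over Parquet files. A minimal Spark SQL sketch (the lake path and columns are illustrative):

-- Spark SQL: silver-zone sales table stored as Parquet,
-- partitioned by business date to match the layout above
CREATE TABLE silver_sales (
    sale_id STRING,
    customer_id STRING,
    sales_amount DECIMAL(10,2),
    sale_date DATE
)
USING PARQUET
PARTITIONED BY (sale_date)
LOCATION 's3://data-lake/silver/sales/';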

Streaming Data Processing

Stream Processing Patterns

Handling real-time data flows:

Event Streaming Architecture:

  • Producer/consumer model
  • Pub/sub messaging
  • Stream processing topologies
  • State management
  • Exactly-once processing
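
On the producer side this model is a thin client publishing events to a topic, with consumers and stream processors subscribing independently. A minimal sketch with the Java Kafka client (topic name, key, and payload are illustrative):

// Minimal Kafka producer publishing a sales event
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
// acks=all waits for all in-sync replicas: higher latency, stronger durability
props.put(ProducerConfig.ACKS_CONFIG, "all");

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    // Keying by store routes all of a store's events to the same partition,
    // preserving per-store ordering
    producer.send(new ProducerRecord<>("sales-events", "store-42", "{\"amount\": 19.99}"));
}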

Common Stream Processing Operations:

  • Filtering and routing
  • Enrichment and transformation
  • Aggregation and windowing
  • Pattern detection
  • Joining streams

Stream Processing Technologies:

  • Apache Kafka Streams
  • Apache Flink
  • Apache Spark Structured Streaming
  • AWS Kinesis Data Analytics
  • Google Dataflow

Example Kafka Streams Application:

// Kafka Streams application for real-time sales analytics
import java.time.Duration;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;

// Create topology
StreamsBuilder builder = new StreamsBuilder();

// Read from the sales topic using a custom serde for Sale events
KStream<String, Sale> salesStream = builder.stream(
    "sales-events",
    Consumed.with(Serdes.String(), SaleSerdes.Sale())
);

// Calculate revenue by product category with 5-minute tumbling windows
salesStream
    // Re-keying by category triggers a repartition topic; supplying serdes
    // here avoids relying on the application's defaults
    .groupBy(
        (key, sale) -> sale.getProductCategory(),
        Grouped.with(Serdes.String(), SaleSerdes.Sale())
    )
    // 5-minute tumbling windows, no grace period for late records
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
    .aggregate(
        () -> 0.0,                                       // initializer
        (key, sale, total) -> total + sale.getAmount(),  // adder
        Materialized.with(Serdes.String(), Serdes.Double())
    )
    .toStream()
    .map((windowedKey, revenue) -> {
        String category = windowedKey.key();
        long windowStart = windowedKey.window().start();
        long windowEnd = windowedKey.window().end();
        return KeyValue.pair(
            category,
            new CategoryRevenue(category, revenue, windowStart, windowEnd)
        );
    })
    .to(
        "category-revenue",
        Produced.with(Serdes.String(), CategoryRevenueSerdes.CategoryRevenue())
    );
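
Note that without suppression each window emits an updated total as events arrive, so consumers of category-revenue see incremental results rather than one final value per window.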

Stream Processing Best Practices:

  • Design for fault tolerance
  • Implement proper error handling
  • Consider state management carefully
  • Plan for data reprocessing
  • Monitor stream lag and throughput
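
Several of these practices reduce to configuration in Kafka Streams. A minimal sketch of the fault-tolerance-related settings (the application id and bootstrap servers are illustrative):

// Fault-tolerance-oriented Kafka Streams configuration
import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.errors.LogAndContinueExceptionHandler;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sales-analytics");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

// Exactly-once processing semantics (requires brokers on 2.5+)
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

// Log and skip records that fail to deserialize instead of crashing the app
props.put(
    StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG,
    LogAndContinueExceptionHandler.class.getName()
);

// Keep a standby replica of each state store for faster failover
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);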