Data Warehouse Design

Structuring data for analytical workloads:

Schema Design Approaches:

  • Star schema
  • Snowflake schema (a normalized sketch follows the star schema example below)
  • Data vault
  • One Big Table (OBT)
  • Hybrid approaches

Example Star Schema:

-- Dimension tables (created first so the fact table's foreign keys resolve;
-- dim_customer and dim_store are defined along the same lines)
CREATE TABLE dim_date (
    date_id INT PRIMARY KEY,
    date_actual DATE NOT NULL,
    day_of_week VARCHAR(10) NOT NULL,
    month_actual INT NOT NULL,
    month_name VARCHAR(10) NOT NULL,
    quarter_actual INT NOT NULL,
    year_actual INT NOT NULL,
    is_weekend BOOLEAN NOT NULL,
    is_holiday BOOLEAN NOT NULL
);

CREATE TABLE dim_product (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(100) NOT NULL,
    category VARCHAR(50) NOT NULL,
    subcategory VARCHAR(50),
    unit_cost DECIMAL(10,2) NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL
);

-- Fact table: one row per sale, with a foreign key into each dimension
CREATE TABLE fact_sales (
    sale_id INT PRIMARY KEY,
    date_id INT NOT NULL REFERENCES dim_date(date_id),
    product_id INT NOT NULL REFERENCES dim_product(product_id),
    customer_id INT NOT NULL REFERENCES dim_customer(customer_id),
    store_id INT NOT NULL REFERENCES dim_store(store_id),
    quantity INT NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL,
    discount_amount DECIMAL(10,2) NOT NULL,
    sales_amount DECIMAL(10,2) NOT NULL,
    cost_amount DECIMAL(10,2) NOT NULL,
    profit_amount DECIMAL(10,2) NOT NULL
);
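
For contrast, the snowflake schema from the list above normalizes dimension hierarchies into their own tables. A minimal sketch splitting the product hierarchy out of dim_product (dim_category and its columns are illustrative, not part of the example above):

-- Snowflake variant: the category hierarchy becomes its own table
CREATE TABLE dim_category (
    category_id INT PRIMARY KEY,
    category_name VARCHAR(50) NOT NULL,
    subcategory_name VARCHAR(50)
);

-- The product dimension references the hierarchy instead of embedding it
CREATE TABLE dim_product_snowflaked (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(100) NOT NULL,
    category_id INT NOT NULL REFERENCES dim_category(category_id),
    unit_cost DECIMAL(10,2) NOT NULL,
    unit_price DECIMAL(10,2) NOT NULL
);

Queries pay an extra join for the normalization, which is why read-heavy analytics usually favor the star form.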

Partitioning and Clustering:

  • Time-based partitioning
  • Range partitioning
  • List partitioning
  • Hash partitioning
  • Clustering keys

Example BigQuery Partitioning and Clustering:

-- BigQuery partitioned and clustered table
CREATE OR REPLACE TABLE `project.dataset.fact_sales`
(
  sale_id STRING,
  sale_timestamp TIMESTAMP,
  customer_id STRING,
  product_id STRING,
  store_id STRING,
  quantity INT64,
  unit_price NUMERIC,
  sales_amount NUMERIC
)
PARTITION BY DATE(sale_timestamp)
CLUSTER BY store_id, product_id;
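
BigQuery derives partitions implicitly from the PARTITION BY expression. Engines with declarative partitioning spell the ranges out instead; a minimal PostgreSQL sketch of time-based range partitioning (table and partition names are illustrative):

-- PostgreSQL: parent table partitioned by range on the sale date
CREATE TABLE fact_sales_by_date (
    sale_id BIGINT NOT NULL,
    sale_date DATE NOT NULL,
    store_id INT NOT NULL,
    sales_amount DECIMAL(10,2) NOT NULL
) PARTITION BY RANGE (sale_date);

-- One partition per month; filters on sale_date prune to the matching partition
CREATE TABLE fact_sales_2024_01 PARTITION OF fact_sales_by_date
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

List and hash partitioning follow the same pattern with PARTITION BY LIST (...) and PARTITION BY HASH (...).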

Data Lake Organization

Structuring raw and processed data:

Data Lake Zones:

  • Landing zone (raw data)
  • Bronze zone (validated raw data)
  • Silver zone (transformed/enriched data)
  • Gold zone (business-ready data)
  • Sandbox zone (exploration area)

Example Data Lake Structure:

data_lake/
├── landing/                 # Raw ingested data
│   ├── sales/
│   │   └── YYYY-MM-DD/      # Partitioned by ingestion date
│   ├── customers/
│   └── products/
├── bronze/                  # Validated raw data
│   ├── sales/
│   │   └── YYYY-MM-DD/      # Partitioned by ingestion date
│   ├── customers/
│   └── products/
├── silver/                  # Transformed data
│   ├── sales/
│   │   └── YYYY/MM/DD/      # Partitioned by business date
│   ├── customers/
│   └── products/
└── gold/                    # Business-ready data
    ├── analytics/
    │   ├── customer_360/
    │   └── sales_performance/
    └── reporting/
        ├── daily_sales_summary/
        └── monthly_kpis/

File Format Considerations:

  • Parquet for analytical workloads (columnar, compressed, broad engine support)
  • Avro for schema evolution and row-oriented record exchange
  • ORC for columnar storage in Hive-centric stacks
  • JSON for flexible, semi-structured payloads
  • CSV for simplicity and broad compatibility
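
To make the format choice concrete, a silver-zone table can be declared directly over Parquet files. A minimal Spark SQL sketch (the lake path and columns are illustrative):

-- Spark SQL: silver-zone sales table stored as Parquet,
-- partitioned by business date to match the layout above
CREATE TABLE silver_sales (
    sale_id STRING,
    customer_id STRING,
    sales_amount DECIMAL(10,2),
    sale_date DATE
)
USING PARQUET
PARTITIONED BY (sale_date)
LOCATION 's3://data-lake/silver/sales/';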

Streaming Data Processing

Stream Processing Patterns

Handling real-time data flows:

Event Streaming Architecture:

  • Producer/consumer model
  • Pub/sub messaging
  • Stream processing topologies
  • State management
  • Exactly-once processing
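
On the producer side this model is a thin client publishing events to a topic, with consumers and stream processors subscribing independently. A minimal sketch with the Java Kafka client (topic name, key, and payload are illustrative):

// Minimal Kafka producer publishing a sales event
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

Properties props = new Properties();
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
// acks=all waits for all in-sync replicas: higher latency, stronger durability
props.put(ProducerConfig.ACKS_CONFIG, "all");

try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
    // Keying by store routes all of a store's events to the same partition,
    // preserving per-store ordering
    producer.send(new ProducerRecord<>("sales-events", "store-42", "{\"amount\": 19.99}"));
}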

Common Stream Processing Operations:

  • Filtering and routing
  • Enrichment and transformation
  • Aggregation and windowing
  • Pattern detection
  • Joining streams

Stream Processing Technologies:

  • Apache Kafka Streams
  • Apache Flink
  • Apache Spark Structured Streaming
  • AWS Kinesis Data Analytics
  • Google Dataflow

Example Kafka Streams Application:

// Kafka Streams application for real-time sales analytics
import java.time.Duration;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;

// Create topology
StreamsBuilder builder = new StreamsBuilder();

// Read from the sales topic using a custom serde for Sale events
KStream<String, Sale> salesStream = builder.stream(
    "sales-events",
    Consumed.with(Serdes.String(), SaleSerdes.Sale())
);

// Calculate revenue by product category with 5-minute tumbling windows
salesStream
    // Re-keying by category triggers a repartition topic; supplying serdes
    // here avoids relying on the application's defaults
    .groupBy(
        (key, sale) -> sale.getProductCategory(),
        Grouped.with(Serdes.String(), SaleSerdes.Sale())
    )
    // 5-minute tumbling windows, no grace period for late records
    .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5)))
    .aggregate(
        () -> 0.0,                                       // initializer
        (key, sale, total) -> total + sale.getAmount(),  // adder
        Materialized.with(Serdes.String(), Serdes.Double())
    )
    .toStream()
    .map((windowedKey, revenue) -> {
        String category = windowedKey.key();
        long windowStart = windowedKey.window().start();
        long windowEnd = windowedKey.window().end();
        return KeyValue.pair(
            category,
            new CategoryRevenue(category, revenue, windowStart, windowEnd)
        );
    })
    .to(
        "category-revenue",
        Produced.with(Serdes.String(), CategoryRevenueSerdes.CategoryRevenue())
    );
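
Note that without suppression each window emits an updated total as events arrive, so consumers of category-revenue see incremental results rather than one final value per window.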

Stream Processing Best Practices:

  • Design for fault tolerance
  • Implement proper error handling
  • Consider state management carefully
  • Plan for data reprocessing
  • Monitor stream lag and throughput
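
Several of these practices reduce to configuration in Kafka Streams. A minimal sketch of the fault-tolerance-related settings (the application id and bootstrap servers are illustrative):

// Fault-tolerance-oriented Kafka Streams configuration
import java.util.Properties;

import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.errors.LogAndContinueExceptionHandler;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sales-analytics");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

// Exactly-once processing semantics (requires brokers on 2.5+)
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

// Log and skip records that fail to deserialize instead of crashing the app
props.put(
    StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG,
    LogAndContinueExceptionHandler.class.getName()
);

// Keep a standby replica of each state store for faster failover
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);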