Data Warehouse Design
Structuring data for analytical workloads:
Schema Design Approaches:
- Star schema
- Snowflake schema (see the normalized dimension sketch after the star schema example)
- Data vault
- One Big Table (OBT)
- Hybrid approaches
Example Star Schema:
-- Fact table
CREATE TABLE fact_sales (
    sale_id         INT PRIMARY KEY,
    date_id         INT NOT NULL REFERENCES dim_date(date_id),
    product_id      INT NOT NULL REFERENCES dim_product(product_id),
    customer_id     INT NOT NULL REFERENCES dim_customer(customer_id),
    store_id        INT NOT NULL REFERENCES dim_store(store_id),
    quantity        INT NOT NULL,
    unit_price      DECIMAL(10,2) NOT NULL,
    discount_amount DECIMAL(10,2) NOT NULL,
    sales_amount    DECIMAL(10,2) NOT NULL,
    cost_amount     DECIMAL(10,2) NOT NULL,
    profit_amount   DECIMAL(10,2) NOT NULL
);

-- Dimension tables
CREATE TABLE dim_date (
    date_id        INT PRIMARY KEY,
    date_actual    DATE NOT NULL,
    day_of_week    VARCHAR(10) NOT NULL,
    month_actual   INT NOT NULL,
    month_name     VARCHAR(10) NOT NULL,
    quarter_actual INT NOT NULL,
    year_actual    INT NOT NULL,
    is_weekend     BOOLEAN NOT NULL,
    is_holiday     BOOLEAN NOT NULL
);

CREATE TABLE dim_product (
    product_id   INT PRIMARY KEY,
    product_name VARCHAR(100) NOT NULL,
    category     VARCHAR(50) NOT NULL,
    subcategory  VARCHAR(50),
    unit_cost    DECIMAL(10,2) NOT NULL,
    unit_price   DECIMAL(10,2) NOT NULL
);
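A snowflake schema, by contrast, normalizes dimension attributes into related sub-tables. A minimal sketch of that idea, splitting the category attributes out of the product dimension above (the dim_category table and its columns are illustrative, not part of the original schema):
-- Snowflake variant: category attributes normalized out of the product dimension
CREATE TABLE dim_category (
    category_id   INT PRIMARY KEY,
    category_name VARCHAR(50) NOT NULL,
    subcategory   VARCHAR(50)
);

CREATE TABLE dim_product_snowflaked (
    product_id   INT PRIMARY KEY,
    product_name VARCHAR(100) NOT NULL,
    category_id  INT NOT NULL REFERENCES dim_category(category_id),
    unit_cost    DECIMAL(10,2) NOT NULL,
    unit_price   DECIMAL(10,2) NOT NULL
);
Queries now reach category attributes through an extra join, trading some query simplicity for less redundancy in the dimension data.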
Partitioning and Clustering:
- Time-based partitioning
- Range partitioning
- List partitioning
- Hash partitioning
- Clustering keys
Example BigQuery Partitioning and Clustering:
-- BigQuery partitioned and clustered table
CREATE OR REPLACE TABLE `project.dataset.fact_sales`
(
    sale_id        STRING,
    sale_timestamp TIMESTAMP,
    customer_id    STRING,
    product_id     STRING,
    store_id       STRING,
    quantity       INT64,
    unit_price     NUMERIC,
    sales_amount   NUMERIC
)
PARTITION BY DATE(sale_timestamp)
CLUSTER BY store_id, product_id;
Data Lake Organization
Structuring raw and processed data:
Data Lake Zones:
- Landing zone (raw data)
- Bronze zone (validated raw data)
- Silver zone (transformed/enriched data)
- Gold zone (business-ready data)
- Sandbox zone (exploration area)
Example Data Lake Structure:
data_lake/
├── landing/                    # Raw ingested data
│   ├── sales/
│   │   └── YYYY-MM-DD/         # Partitioned by ingestion date
│   ├── customers/
│   └── products/
├── bronze/                     # Validated raw data
│   ├── sales/
│   │   └── YYYY-MM-DD/         # Partitioned by ingestion date
│   ├── customers/
│   └── products/
├── silver/                     # Transformed data
│   ├── sales/
│   │   └── YYYY/MM/DD/         # Partitioned by business date
│   ├── customers/
│   └── products/
└── gold/                       # Business-ready data
    ├── analytics/
    │   ├── customer_360/
    │   └── sales_performance/
    └── reporting/
        ├── daily_sales_summary/
        └── monthly_kpis/
File Format Considerations:
- Parquet for columnar storage and analytical query workloads (see the external-table sketch after this list)
- Avro for row-oriented storage and schema evolution
- ORC for columnar storage, particularly in Hive-centric ecosystems
- JSON for flexible, semi-structured data
- CSV for simplicity and broad tool compatibility
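Lake files become most useful once they are queryable in place. As one hedged sketch, the silver-zone Parquet files above could be exposed through a BigQuery external table; the bucket, path, and table names here are placeholders, not part of the original layout:
-- External table over silver-zone Parquet files (bucket, path, and table names are hypothetical)
CREATE OR REPLACE EXTERNAL TABLE `project.dataset.silver_sales`
OPTIONS (
    format = 'PARQUET',
    uris   = ['gs://my-data-lake/silver/sales/*.parquet']
);
The same pattern applies to the other zones, and engines such as Spark, Trino, or Snowflake offer equivalent external or unmanaged table definitions.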
Streaming Data Processing
Stream Processing Patterns
Handling real-time data flows:
Event Streaming Architecture:
- Producer/consumer model
- Pub/sub messaging
- Stream processing topologies
- State management
- Exactly-once processing
Common Stream Processing Operations (filtering, enrichment, and joins are sketched after the Kafka Streams example below):
- Filtering and routing
- Enrichment and transformation
- Aggregation and windowing
- Pattern detection
- Joining streams
Stream Processing Technologies:
- Apache Kafka Streams
- Apache Flink
- Apache Spark Structured Streaming
- AWS Kinesis Data Analytics
- Google Dataflow
Example Kafka Streams Application:
// Kafka Streams application for real-time sales analytics
import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.*;

// Create topology
StreamsBuilder builder = new StreamsBuilder();

// Read from the sales topic using a custom serde for Sale events (defined elsewhere)
KStream<String, Sale> salesStream = builder.stream(
    "sales-events",
    Consumed.with(Serdes.String(), SaleSerdes.Sale())
);

// Calculate revenue by product category with 5-minute tumbling windows
salesStream
    // Re-key each sale by product category; Grouped supplies serdes for the repartition topic
    .groupBy((key, sale) -> sale.getProductCategory(),
             Grouped.with(Serdes.String(), SaleSerdes.Sale()))
    .windowedBy(TimeWindows.of(Duration.ofMinutes(5)))
    .aggregate(
        () -> 0.0,                                        // initializer
        (key, sale, total) -> total + sale.getAmount(),   // adder
        Materialized.with(Serdes.String(), Serdes.Double())
    )
    .toStream()
    // Unwrap the windowed key into a value that carries the window boundaries
    .map((windowedKey, revenue) -> {
        String category = windowedKey.key();
        long windowStart = windowedKey.window().start();
        long windowEnd = windowedKey.window().end();
        return KeyValue.pair(
            category,
            new CategoryRevenue(category, revenue, windowStart, windowEnd)
        );
    })
    .to(
        "category-revenue",
        Produced.with(Serdes.String(), CategoryRevenueSerdes.CategoryRevenue())
    );
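The aggregation above covers windowing; the filtering, enrichment, and join operations listed earlier follow the same topology style. A hedged sketch reusing the builder and salesStream from the example above; the customers topic, the Customer and EnrichedSale types, and their serdes are assumptions, not part of the original application:
// Filter out zero-quantity sales and enrich each sale with customer attributes
// via a stream-table join (topic name, types, and serdes below are assumptions)
KTable<String, Customer> customers = builder.table(
    "customers",
    Consumed.with(Serdes.String(), CustomerSerdes.Customer())
);

salesStream
    .filter((key, sale) -> sale.getQuantity() > 0)       // filtering
    .selectKey((key, sale) -> sale.getCustomerId())      // re-key so the join can co-partition by customer id
    .join(
        customers,
        (sale, customer) -> new EnrichedSale(sale, customer),   // enrichment
        Joined.with(Serdes.String(), SaleSerdes.Sale(), CustomerSerdes.Customer())
    )
    .to(
        "enriched-sales",
        Produced.with(Serdes.String(), EnrichedSaleSerdes.EnrichedSale())
    );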
Stream Processing Best Practices:
- Design for fault tolerance (see the configuration sketch after this list)
- Implement proper error handling
- Consider state management carefully
- Plan for data reprocessing
- Monitor stream lag and throughput
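As a sketch of the fault-tolerance and error-handling points above, the configuration below enables exactly-once processing and logs-and-skips records that fail deserialization, reusing the builder from the earlier example; the application id and bootstrap servers are placeholders:
// Fault-tolerance oriented configuration for the streams application above
import java.util.Properties;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.errors.LogAndContinueExceptionHandler;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "sales-analytics");      // placeholder id
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");    // placeholder brokers
// Exactly-once processing semantics
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
// Log and skip records that cannot be deserialized instead of crashing the application
props.put(StreamsConfig.DEFAULT_DESERIALIZATION_EXCEPTION_HANDLER_CLASS_CONFIG,
          LogAndContinueExceptionHandler.class);
// Keep a standby replica of local state stores for faster failover
props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);

KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();
Reprocessing is typically handled by resetting the application's consumer offsets (for example with the kafka-streams-application-reset tool), and lag and throughput are monitored from consumer-group lag and processing-rate metrics.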