Real-Time Data Processing Fundamentals
Core Concepts and Terminology
Understanding the building blocks of real-time systems:
Real-Time Processing vs. Batch Processing:
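- Real-time: Each event is processed as it arrives, with latency typically measured in milliseconds to seconds
- Batch: Accumulated data is processed periodically (for example hourly or nightly)
- Micro-batch: Small batches processed at short, frequent intervals, trading some latency for throughput
- Near real-time: Latency is low (seconds to minutes) but not immediate
- Stream processing: Continuous computation over an unbounded flow of events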
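The difference between per-event and micro-batch processing can be sketched in a few lines. This is a minimal illustration only; the function names `per_event` and `micro_batches` and the sample data are made up for this example and do not come from any particular framework.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def per_event(events: Iterable[int]) -> Iterator[int]:
    """Stream style: emit an updated running total after every event."""
    total = 0
    for value in events:
        total += value
        yield total


def micro_batches(events: Iterable[int], batch_size: int) -> Iterator[List[int]]:
    """Micro-batch style: collect up to batch_size events, then hand them on."""
    iterator = iter(events)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


events = [3, 1, 4, 1, 5]

# One result per event: lowest latency, most frequent updates.
running_totals = list(per_event(events))

# One result per small batch: fewer updates, each slightly delayed.
batch_totals = [sum(batch) for batch in micro_batches(events, batch_size=2)]
```

Real micro-batch engines usually cut batches by time interval rather than by a fixed count; a count is used here only to keep the sketch deterministic.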
Key Concepts:
- Events: Discrete data records representing occurrences
- Streams: Unbounded sequences of events
- Producers: Systems generating event data
- Consumers: Systems processing event data
- Topics/Channels: Named streams for event organization
- Partitions: Subdivisions of streams for parallelism
- Offsets: Positions within event streams
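To make topics, partitions, and offsets concrete, here is a small in-memory sketch. `InMemoryTopic` is a hypothetical class written for this example, not the API of Kafka or any other broker, and the CRC32-based partition choice is an assumption standing in for whatever hash a real system uses.

```python
import zlib
from typing import Any, List, Tuple


class InMemoryTopic:
    """A toy topic: a fixed number of append-only partitions."""

    def __init__(self, num_partitions: int) -> None:
        self.partitions: List[List[Tuple[str, Any]]] = [
            [] for _ in range(num_partitions)
        ]

    def produce(self, key: str, value: Any) -> Tuple[int, int]:
        """Append a record and return (partition, offset).

        Records with the same key always land in the same partition,
        which preserves their relative order.
        """
        partition = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        offset = len(self.partitions[partition]) - 1
        return partition, offset

    def consume(self, partition: int, offset: int, max_records: int = 10):
        """Read up to max_records from a partition, starting at offset."""
        return self.partitions[partition][offset:offset + max_records]


topic = InMemoryTopic(num_partitions=3)
p1, o1 = topic.produce("user-42", "login")
p2, o2 = topic.produce("user-42", "click")

# A consumer that remembers its offset can resume exactly where it stopped.
records = topic.consume(p1, o1)
```

Because the consumer tracks its own position, re-reading from an earlier offset replays the same records in the same order, which is what later makes reprocessing possible.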
Processing Semantics:
- At-most-once: Events may be lost but never processed twice
- At-least-once: Events are never lost but may be processed multiple times
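- Exactly-once: Each event affects the result exactly once, even if it is delivered or retried more than once
- Processing guarantees vs. delivery guarantees: A message can be delivered more than once while its effect on state is still applied only once
- End-to-end exactly-once semantics: Requires the source, the processor, and the sink to cooperate, typically through a replayable source, checkpointed state, and idempotent or transactional writes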
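One common way to get an exactly-once effect is to combine at-least-once delivery with an idempotent consumer. The sketch below assumes every event carries a unique ID; `IdempotentCounter` and the sample deliveries are hypothetical and exist only for illustration.

```python
class IdempotentCounter:
    """Applies each event ID at most once, no matter how often it is delivered."""

    def __init__(self) -> None:
        self.seen_ids = set()
        self.total = 0

    def apply(self, event_id: str, amount: int) -> bool:
        """Return True if the event changed state, False if it was a duplicate."""
        if event_id in self.seen_ids:
            return False
        self.seen_ids.add(event_id)
        self.total += amount
        return True


# At-least-once delivery: "e1" and "e2" are redelivered after a retry.
deliveries = [("e1", 10), ("e2", 5), ("e1", 10), ("e3", 7), ("e2", 5)]

# A naive consumer counts the duplicates as well.
naive_total = sum(amount for _, amount in deliveries)

# An idempotent consumer ignores them.
counter = IdempotentCounter()
for event_id, amount in deliveries:
    counter.apply(event_id, amount)
```

In a production system the set of seen IDs would have to be stored durably and updated atomically with the result; the in-memory set here only shows the idea.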
Time Concepts in Streaming:
- Event time: When the event actually occurred
- Processing time: When the system processes the event
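- Ingestion time: When the event enters the streaming system
- Watermarks: Markers of event-time progress; a watermark at time t asserts that no further events with an earlier timestamp are expected
- Windows: Bounded groupings of events over which results are computed, usually defined by time (e.g., tumbling, sliding, or session windows)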
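The interplay of event time, watermarks, and windows is easier to see in code. The following sketch counts events in tumbling event-time windows and uses a simple "highest timestamp seen minus allowed lateness" watermark. The constants, the function name `tumbling_window_counts`, and the policy of dropping late events are assumptions made for this example; real engines offer several watermark and lateness strategies.

```python
from collections import defaultdict

WINDOW_SIZE = 10   # window length in seconds of event time (assumed)
MAX_LATENESS = 5   # how far out of order events may arrive (assumed)


def tumbling_window_counts(events):
    """Count events per tumbling window.

    events: iterable of (event_time, payload) in arrival order.
    Returns (emitted, dropped), where emitted is a list of
    (window_start, count) in the order the windows were closed.
    """
    open_windows = defaultdict(int)
    emitted, dropped = [], []
    watermark = float("-inf")

    for event_time, payload in events:
        window_start = (event_time // WINDOW_SIZE) * WINDOW_SIZE
        if window_start + WINDOW_SIZE <= watermark:
            # The window this event belongs to has already been closed.
            dropped.append((event_time, payload))
        else:
            open_windows[window_start] += 1

        # Advance the watermark; it never moves backwards.
        watermark = max(watermark, event_time - MAX_LATENESS)

        # Close every window that ends at or before the watermark.
        for start in sorted(open_windows):
            if start + WINDOW_SIZE <= watermark:
                emitted.append((start, open_windows.pop(start)))

    # End of input: flush whatever is still open.
    for start in sorted(open_windows):
        emitted.append((start, open_windows.pop(start)))
    return emitted, dropped


arrivals = [(1, "a"), (4, "b"), (12, "c"), (8, "d"), (17, "e"), (3, "f"), (25, "g")]
emitted, dropped = tumbling_window_counts(arrivals)
```

In this run the event at time 8 arrives out of order but is still counted, because the watermark has not yet passed the end of its window. The event at time 3 arrives after the watermark has reached 12, so its window is already closed and it is dropped.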
Real-Time Processing Architectures
Common patterns for building real-time systems:
Lambda Architecture:
- Combines batch and stream processing
- Batch layer for accuracy
- Speed layer for low latency
- Serving layer for query access
- Results from the batch and speed layers must be reconciled at query time
- Drawback: Processing logic is duplicated across the batch and speed layers
Example Lambda Architecture:
           ┌───────────────┐
           │               │
           │ Data Sources  │
           │               │
           └───────┬───────┘
                   │
        ┌──────────┴──────────┐
        │                     │
        ▼                     ▼
┌───────────────┐     ┌───────────────┐
│               │     │               │
│  Batch Layer  │     │  Speed Layer  │
│               │     │               │
└───────┬───────┘     └───────┬───────┘
        │                     │
        ▼                     ▼
┌───────────────┐     ┌───────────────┐
│               │     │               │
│  Batch Views  │     │   Real-time   │
│               │     │     Views     │
└───────┬───────┘     └───────┬───────┘
        │                     │
        └──────────┬──────────┘
                   │
                   ▼
           ┌───────────────┐
           │               │
           │    Serving    │
           │     Layer     │
           │               │
           └───────────────┘
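A minimal sketch of how a Lambda serving layer might combine its two views is shown below. It assumes page-view events as `(timestamp, page)` pairs and a cutoff timestamp marking how far the last batch run got; the function names and data are hypothetical.

```python
from collections import Counter

# The full event log as (timestamp, page) pairs (sample data).
events = [(1, "home"), (2, "docs"), (3, "home"), (11, "home"), (12, "pricing")]

# The most recent batch run covered everything before this timestamp.
BATCH_CUTOFF = 10


def build_batch_view(log, cutoff):
    """Batch layer: accurate counts over all data up to the cutoff."""
    return Counter(page for ts, page in log if ts < cutoff)


def build_realtime_view(log, cutoff):
    """Speed layer: counts for recent events the batch layer has not seen."""
    return Counter(page for ts, page in log if ts >= cutoff)


def serve(page, batch_view, realtime_view):
    """Serving layer: reconcile both views at query time."""
    return batch_view[page] + realtime_view[page]


batch_view = build_batch_view(events, BATCH_CUTOFF)
realtime_view = build_realtime_view(events, BATCH_CUTOFF)
home_views = serve("home", batch_view, realtime_view)
```

Note that the counting logic appears twice, once per layer. In a real deployment those two implementations often live in different frameworks, which is the duplication drawback listed above.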
Kappa Architecture:
- Stream processing only
- Single processing path
- Historical data is reprocessed by replaying the retained event log
- Simplified maintenance: One codebase instead of two
- Unified programming model
- Reduced operational complexity compared with Lambda
Example Kappa Architecture:
┌───────────────┐
│               │
│ Data Sources  │
│               │
└───────┬───────┘
        │
        ▼
┌───────────────┐
│               │
│    Stream     │
│  Processing   │
│     Layer     │
│               │
└───────┬───────┘
        │
        ▼
┌───────────────┐
│               │
│    Serving    │
│     Layer     │
│               │
└───────────────┘
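Reprocessing in a Kappa-style system amounts to running a new version of the job over the same retained log from the beginning. The sketch below assumes the log is a plain list and that each job version is a function returning a `(key, value)` pair; `run_job`, `transform_v1`, and `transform_v2` are hypothetical names.

```python
def run_job(log, transform, from_offset=0):
    """Replay the log from from_offset and build a view of totals per key."""
    view = {}
    for offset in range(from_offset, len(log)):
        key, value = transform(log[offset])
        view[key] = view.get(key, 0) + value
    return view


# The retained event log (sample data).
log = [
    {"user": "Ann", "amount_cents": 250},
    {"user": "ann", "amount_cents": 100},
    {"user": "bob", "amount_cents": 400},
]


def transform_v1(event):
    """Original logic: treats user names as case-sensitive."""
    return event["user"], event["amount_cents"]


def transform_v2(event):
    """Corrected logic: normalizes user names before aggregating."""
    return event["user"].lower(), event["amount_cents"]


view_v1 = run_job(log, transform_v1)

# Fixing the bug means replaying the same log through the new code
# and swapping the serving layer over to the new view.
view_v2 = run_job(log, transform_v2)
```

There is no separate batch codebase here: the same `run_job` path handles both live processing and historical reprocessing, provided the log retains enough history.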
Modern Event-Driven Architecture:
- Event backbone (e.g., Kafka)
- Event processors (e.g., Flink, Kafka Streams)
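- Event stores: Durable, append-only records of the events that have occurred
- Command Query Responsibility Segregation (CQRS): Separate models for writes (commands) and reads (queries)
- Event sourcing: Current state is derived by replaying stored events rather than stored directly
- Materialized views: Precomputed query results kept up to date from the event stream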
Example Event-Driven Architecture:
┌───────────────┐     ┌───────────────┐     ┌───────────────┐
│               │     │               │     │               │
│     Event     │     │     Event     │     │     Event     │
│   Producers   │────▶│   Backbone    │────▶│  Processors   │
│               │     │               │     │               │
└───────────────┘     └───────────────┘     └───────┬───────┘
                                                    │
                                                    │
                      ┌───────────────┐             │
                      │               │             │
                      │     Query     │◀────────────┘
                      │   Services    │
                      │               │
                      └───────────────┘
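The flow in the diagram can be sketched with an append-only event store that feeds a materialized view. All class names here (`Event`, `EventStore`, `BalanceView`) are hypothetical, and the in-process subscriber list is an assumption standing in for a real event backbone such as Kafka.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    account: str
    kind: str      # "deposited" or "withdrawn"
    amount: int


class EventStore:
    """Append-only log that notifies subscribers of each new event."""

    def __init__(self) -> None:
        self._events = []
        self._subscribers = []

    def subscribe(self, handler) -> None:
        self._subscribers.append(handler)

    def append(self, event: Event) -> None:
        self._events.append(event)
        for handler in self._subscribers:
            handler(event)

    def replay(self):
        """Return every stored event in the order it was appended."""
        return list(self._events)


class BalanceView:
    """Read side: a materialized view of balances per account."""

    def __init__(self) -> None:
        self.balances = {}

    def apply(self, event: Event) -> None:
        sign = 1 if event.kind == "deposited" else -1
        current = self.balances.get(event.account, 0)
        self.balances[event.account] = current + sign * event.amount


store = EventStore()
view = BalanceView()
store.subscribe(view.apply)

# Write side: commands result in events being appended.
store.append(Event("acct-1", "deposited", 100))
store.append(Event("acct-1", "withdrawn", 30))
store.append(Event("acct-2", "deposited", 50))

# Event sourcing: the view can be rebuilt from scratch by replaying the log.
rebuilt = BalanceView()
for event in store.replay():
    rebuilt.apply(event)
```

Queries read only from `BalanceView`, never from the event store directly, which is the separation CQRS describes. Because the view is derived data, it can be discarded and rebuilt, or a second view with a different shape can be added later from the same events.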