Streaming Technologies and Platforms
Event Streaming Platforms
Core technologies for event distribution:
Apache Kafka:
- Distributed log-based messaging
- High throughput and scalability
- Persistent storage
- Topic-based organization
- Consumer groups
- Partitioning for parallelism
- Exactly-once semantics
Example Kafka Architecture:
┌───────────────────────────────────────────────────────────┐
│                       Kafka Cluster                       │
│                                                           │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐   │
│   │  Broker 1   │    │  Broker 2   │    │  Broker 3   │   │
│   └─────────────┘    └─────────────┘    └─────────────┘   │
│                                                           │
└───────────────────────────────────────────────────────────┘
          ▲                                       ▲
          │                                       │
┌─────────┴─────────┐                  ┌─────────┴─────────┐
│     Producers     │                  │     Consumers     │
└───────────────────┘                  └───────────────────┘
Example Kafka Topic Configuration:
# Partition count and replication factor are set at topic creation time;
# the remaining settings are per-topic configuration overrides
kafka-topics.sh --create --topic purchases \
  --partitions 12 \
  --replication-factor 3 \
  --config min.insync.replicas=2 \
  --config retention.ms=604800000 \
  --config cleanup.policy=delete
# retention.ms=604800000 keeps data for 7 days before deletion
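To make consumer groups and delivery guarantees concrete, the sketch below uses the plain Kafka Java client; the broker address, topic name, and group ID are illustrative. The producer enables idempotence (the building block for exactly-once delivery), and the consumer's group.id places it in a consumer group so the topic's partitions are balanced across instances.
Example Kafka Producer and Consumer:
// Kafka client sketch (broker address, topic, and group ID are assumptions)
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;
// Producer: idempotent writes with acks=all for durability
Properties producerProps = new Properties();
producerProps.put("bootstrap.servers", "kafka:9092");
producerProps.put("key.serializer", StringSerializer.class.getName());
producerProps.put("value.serializer", StringSerializer.class.getName());
producerProps.put("enable.idempotence", "true");
producerProps.put("acks", "all");
KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps);
producer.send(new ProducerRecord<>("purchases", "user-42", "{\"amount\": 19.99}"));
producer.close();
// Consumer: group.id joins a consumer group, so partitions are shared across members
Properties consumerProps = new Properties();
consumerProps.put("bootstrap.servers", "kafka:9092");
consumerProps.put("group.id", "purchase-processor");
consumerProps.put("key.deserializer", StringDeserializer.class.getName());
consumerProps.put("value.deserializer", StringDeserializer.class.getName());
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
consumer.subscribe(List.of("purchases"));
ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
for (ConsumerRecord<String, String> record : records) {
    System.out.printf("partition=%d offset=%d value=%s%n",
            record.partition(), record.offset(), record.value());
}
consumer.close();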
Apache Pulsar:
- Multi-tenant architecture
- Tiered storage
- Geo-replication
- Unified messaging model
- Pulsar Functions
- Schema registry
- Strong durability guarantees via Apache BookKeeper storage
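As a rough sketch of Pulsar's unified messaging model, the example below uses the Pulsar Java client; the service URL, topic, and subscription name are illustrative. A shared subscription gives queue-like, competing-consumer semantics on a topic that could equally be consumed in streaming (exclusive) fashion.
Example Pulsar Producer and Consumer:
// Pulsar Java client sketch (service URL, topic, and subscription are assumptions)
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;
import org.apache.pulsar.client.api.SubscriptionType;
// Connect to the cluster
PulsarClient client = PulsarClient.builder()
    .serviceUrl("pulsar://localhost:6650")
    .build();
// Producer with a String schema
Producer<String> producer = client.newProducer(Schema.STRING)
    .topic("persistent://public/default/purchases")
    .create();
producer.send("{\"userId\": \"42\", \"amount\": 19.99}");
// Consumer on a shared subscription (competing consumers, queue-like semantics)
Consumer<String> consumer = client.newConsumer(Schema.STRING)
    .topic("persistent://public/default/purchases")
    .subscriptionName("purchase-processor")
    .subscriptionType(SubscriptionType.Shared)
    .subscribe();
Message<String> msg = consumer.receive();
consumer.acknowledge(msg);
// Clean up
consumer.close();
producer.close();
client.close();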
Cloud-Based Streaming Services:
- Amazon Kinesis
- Azure Event Hubs
- Google Cloud Pub/Sub
- Confluent Cloud
- IBM Event Streams
- Redpanda
Stream Processing Frameworks
Technologies for analyzing and transforming streams:
Apache Flink:
- Stateful stream processing
- Event time processing
- Exactly-once semantics
- Windowing operations
- Checkpointing for fault tolerance
- High throughput and low latency
- SQL interface
Example Flink Streaming Job:
// Flink streaming job example
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Enable checkpointing for exactly-once processing
env.enableCheckpointing(60000); // 60 seconds
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// Configure Kafka source
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "kafka:9092");
properties.setProperty("group.id", "flink-consumer-group");
FlinkKafkaConsumer<String> kafkaSource = new FlinkKafkaConsumer<>(
"input-topic",
new SimpleStringSchema(),
properties
);
kafkaSource.assignTimestampsAndWatermarks(
WatermarkStrategy
.<String>forBoundedOutOfOrderness(Duration.ofSeconds(5))
.withTimestampAssigner((event, timestamp) -> extractTimestamp(event))
);
// Define the processing pipeline
DataStream<String> stream = env.addSource(kafkaSource);
// Parse JSON events
DataStream<Event> events = stream
.map(json -> parseJson(json))
.filter(event -> event.getType().equals("PURCHASE"));
// Process events with 5-minute tumbling windows
DataStream<Result> results = events
.keyBy(event -> event.getUserId())
.window(TumblingEventTimeWindows.of(Time.minutes(5)))
.aggregate(new PurchaseAggregator());
// Write results to Kafka
FlinkKafkaProducer<String> kafkaSink = new FlinkKafkaProducer<>(
"output-topic",
new SimpleStringSchema(),
properties
);
results
.map(result -> convertToJson(result))
.addSink(kafkaSink);
// Execute the job
env.execute("Purchase Processing Job");
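Flink's SQL interface can express the same 5-minute tumbling-window aggregation declaratively. The sketch below assumes a purchases table (with user_id, amount, and an event_time column carrying a watermark) has already been registered over the Kafka topic; table and column names are illustrative.
Example Flink SQL Query:
// Flink SQL sketch of the same windowed aggregation (table/column names are assumptions)
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
tableEnv.executeSql(
    "SELECT " +
    "  user_id, " +
    "  TUMBLE_START(event_time, INTERVAL '5' MINUTE) AS window_start, " +
    "  SUM(amount) AS total_amount, " +
    "  COUNT(*) AS purchase_count " +
    "FROM purchases " +
    "GROUP BY user_id, TUMBLE(event_time, INTERVAL '5' MINUTE)"
).print();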
Apache Spark Structured Streaming:
- Micro-batch processing
- DataFrame and Dataset APIs
- Integration with Spark ecosystem
- Continuous processing mode
- Watermarking support
- Stateful processing
- Machine learning integration
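As a rough sketch of the micro-batch model, the example below reads a Kafka topic as a streaming DataFrame and counts records per 5-minute event-time window with a watermark; the broker address and topic name are illustrative.
Example Spark Structured Streaming Job:
// Spark Structured Streaming sketch (broker and topic names are assumptions)
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.window;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
SparkSession spark = SparkSession.builder()
    .appName("purchase-counts")
    .getOrCreate();
// Read the Kafka topic as an unbounded DataFrame
Dataset<Row> events = spark.readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "input-topic")
    .load();
// Count records per 5-minute event-time window, tolerating 10 minutes of lateness
Dataset<Row> counts = events
    .withWatermark("timestamp", "10 minutes")
    .groupBy(window(col("timestamp"), "5 minutes"))
    .count();
// Emit updated counts to the console in micro-batches
StreamingQuery query = counts.writeStream()
    .outputMode("update")
    .format("console")
    .start();
query.awaitTermination();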
Kafka Streams:
- Lightweight client library
- Stateful processing
- Exactly-once semantics
- Integration with Kafka ecosystem
- No separate cluster required
- Interactive queries
- Processor API and DSL
Example Kafka Streams Application:
// Kafka Streams application example
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "purchase-processor");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
StreamsBuilder builder = new StreamsBuilder();
// Read from input topic
KStream<String, String> inputStream = builder.stream("input-topic");
// Parse JSON and filter events
KStream<String, Purchase> purchaseStream = inputStream
.mapValues(value -> parsePurchase(value))
.filter((key, purchase) -> purchase != null && purchase.getAmount() > 0);
// Group by user ID
KGroupedStream<String, Purchase> groupedByUser = purchaseStream
.groupByKey(Grouped.with(Serdes.String(), purchaseSerde));
// Aggregate purchases in 5-minute windows
TimeWindows timeWindows = TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5));
KTable<Windowed<String>, UserPurchaseSummary> windowedAggregates = groupedByUser
.windowedBy(timeWindows)
.aggregate(
UserPurchaseSummary::new,
(userId, purchase, summary) -> summary.addPurchase(purchase),
Materialized.<String, UserPurchaseSummary, WindowStore<Bytes, byte[]>>as("purchase-store")
.withKeySerde(Serdes.String())
.withValueSerde(userPurchaseSummarySerde)
);
// Convert back to stream and write to output topic
windowedAggregates
.toStream()
.map((windowedKey, summary) -> KeyValue.pair(
windowedKey.key(),
formatSummaryAsJson(windowedKey.key(), summary, windowedKey.window().startTime())
))
.to("output-topic", Produced.with(Serdes.String(), Serdes.String()));
// Build and start the topology
KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();
Other Stream Processing Technologies:
- Apache Samza
- Apache Storm
- Apache Beam
- KSQL/ksqlDB
- Amazon Kinesis Data Analytics
- Azure Stream Analytics
- Google Dataflow