Streaming Technologies and Platforms
Event Streaming Platforms
Core technologies for event distribution:
Apache Kafka:
- Distributed log-based messaging
- High throughput and scalability
- Persistent storage
- Topic-based organization
- Consumer groups
- Partitioning for parallelism
- Exactly-once semantics
Example Kafka Architecture:
┌───────────────────────────────────────────────────────────┐
│                       Kafka Cluster                       │
│                                                           │
│   ┌─────────────┐    ┌─────────────┐    ┌─────────────┐   │
│   │  Broker 1   │    │  Broker 2   │    │  Broker 3   │   │
│   └─────────────┘    └─────────────┘    └─────────────┘   │
│                                                           │
└───────────────────────────────────────────────────────────┘
          ▲                                       ▲
          │                                       │
┌─────────┴─────────┐                  ┌─────────┴─────────┐
│     Producers     │                  │     Consumers     │
└───────────────────┘                  └───────────────────┘
Example Kafka Topic Configuration:
# Partition count and replication factor are set at topic creation time;
# the remaining settings are per-topic configuration overrides
kafka-topics.sh --create --topic purchases \
  --partitions 12 \
  --replication-factor 3 \
  --config min.insync.replicas=2 \
  --config retention.ms=604800000 \
  --config cleanup.policy=delete
# retention.ms=604800000 keeps data for 7 days before deletion
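To make consumer groups and delivery guarantees concrete, the sketch below uses the plain Kafka Java client; the broker address, topic name, and group ID are illustrative. The producer enables idempotence (the building block for exactly-once delivery), and the consumer's group.id places it in a consumer group so the topic's partitions are balanced across instances.
Example Kafka Producer and Consumer:
// Kafka client sketch (broker address, topic, and group ID are assumptions)
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;
// Producer: idempotent writes with acks=all for durability
Properties producerProps = new Properties();
producerProps.put("bootstrap.servers", "kafka:9092");
producerProps.put("key.serializer", StringSerializer.class.getName());
producerProps.put("value.serializer", StringSerializer.class.getName());
producerProps.put("enable.idempotence", "true");
producerProps.put("acks", "all");
KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps);
producer.send(new ProducerRecord<>("purchases", "user-42", "{\"amount\": 19.99}"));
producer.close();
// Consumer: group.id joins a consumer group, so partitions are shared across members
Properties consumerProps = new Properties();
consumerProps.put("bootstrap.servers", "kafka:9092");
consumerProps.put("group.id", "purchase-processor");
consumerProps.put("key.deserializer", StringDeserializer.class.getName());
consumerProps.put("value.deserializer", StringDeserializer.class.getName());
KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
consumer.subscribe(List.of("purchases"));
ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
for (ConsumerRecord<String, String> record : records) {
    System.out.printf("partition=%d offset=%d value=%s%n",
            record.partition(), record.offset(), record.value());
}
consumer.close();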
Apache Pulsar:
- Multi-tenant architecture
- Tiered storage
- Geo-replication
- Unified messaging model
- Pulsar Functions
- Schema registry
- Strong durability guarantees via Apache BookKeeper storage
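As a rough sketch of Pulsar's unified messaging model, the example below uses the Pulsar Java client; the service URL, topic, and subscription name are illustrative. A shared subscription gives queue-like, competing-consumer semantics on a topic that could equally be consumed in streaming (exclusive) fashion.
Example Pulsar Producer and Consumer:
// Pulsar Java client sketch (service URL, topic, and subscription are assumptions)
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;
import org.apache.pulsar.client.api.SubscriptionType;
// Connect to the cluster
PulsarClient client = PulsarClient.builder()
    .serviceUrl("pulsar://localhost:6650")
    .build();
// Producer with a String schema
Producer<String> producer = client.newProducer(Schema.STRING)
    .topic("persistent://public/default/purchases")
    .create();
producer.send("{\"userId\": \"42\", \"amount\": 19.99}");
// Consumer on a shared subscription (competing consumers, queue-like semantics)
Consumer<String> consumer = client.newConsumer(Schema.STRING)
    .topic("persistent://public/default/purchases")
    .subscriptionName("purchase-processor")
    .subscriptionType(SubscriptionType.Shared)
    .subscribe();
Message<String> msg = consumer.receive();
consumer.acknowledge(msg);
// Clean up
consumer.close();
producer.close();
client.close();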
Cloud-Based Streaming Services:
- Amazon Kinesis
- Azure Event Hubs
- Google Cloud Pub/Sub
- Confluent Cloud
- IBM Event Streams
- Redpanda
Stream Processing Frameworks
Technologies for analyzing and transforming streams:
Apache Flink:
- Stateful stream processing
- Event time processing
- Exactly-once semantics
- Windowing operations
- Checkpointing for fault tolerance
- High throughput and low latency
- SQL interface
Example Flink Streaming Job:
// Flink streaming job example
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Enable checkpointing for exactly-once processing
env.enableCheckpointing(60000); // 60 seconds
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);
// Configure Kafka source
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "kafka:9092");
properties.setProperty("group.id", "flink-consumer-group");
FlinkKafkaConsumer<String> kafkaSource = new FlinkKafkaConsumer<>(
"input-topic",
new SimpleStringSchema(),
properties
);
kafkaSource.assignTimestampsAndWatermarks(
WatermarkStrategy
.<String>forBoundedOutOfOrderness(Duration.ofSeconds(5))
.withTimestampAssigner((event, timestamp) -> extractTimestamp(event))
);
// Define the processing pipeline
DataStream<String> stream = env.addSource(kafkaSource);
// Parse JSON events
DataStream<Event> events = stream
.map(json -> parseJson(json))
.filter(event -> event.getType().equals("PURCHASE"));
// Process events with 5-minute tumbling windows
DataStream<Result> results = events
.keyBy(event -> event.getUserId())
.window(TumblingEventTimeWindows.of(Time.minutes(5)))
.aggregate(new PurchaseAggregator());
// Write results to Kafka
FlinkKafkaProducer<String> kafkaSink = new FlinkKafkaProducer<>(
"output-topic",
new SimpleStringSchema(),
properties
);
results
.map(result -> convertToJson(result))
.addSink(kafkaSink);
// Execute the job
env.execute("Purchase Processing Job");
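Flink's SQL interface can express the same 5-minute tumbling-window aggregation declaratively. The sketch below assumes a purchases table (with user_id, amount, and an event_time column carrying a watermark) has already been registered over the Kafka topic; table and column names are illustrative.
Example Flink SQL Query:
// Flink SQL sketch of the same windowed aggregation (table/column names are assumptions)
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
tableEnv.executeSql(
    "SELECT " +
    "  user_id, " +
    "  TUMBLE_START(event_time, INTERVAL '5' MINUTE) AS window_start, " +
    "  SUM(amount) AS total_amount, " +
    "  COUNT(*) AS purchase_count " +
    "FROM purchases " +
    "GROUP BY user_id, TUMBLE(event_time, INTERVAL '5' MINUTE)"
).print();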
Apache Spark Structured Streaming:
- Micro-batch processing
- DataFrame and Dataset APIs
- Integration with Spark ecosystem
- Continuous processing mode
- Watermarking support
- Stateful processing
- Machine learning integration
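As a rough sketch of the micro-batch model, the example below reads a Kafka topic as a streaming DataFrame and counts records per 5-minute event-time window with a watermark; the broker address and topic name are illustrative.
Example Spark Structured Streaming Job:
// Spark Structured Streaming sketch (broker and topic names are assumptions)
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.window;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
SparkSession spark = SparkSession.builder()
    .appName("purchase-counts")
    .getOrCreate();
// Read the Kafka topic as an unbounded DataFrame
Dataset<Row> events = spark.readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "input-topic")
    .load();
// Count records per 5-minute event-time window, tolerating 10 minutes of lateness
Dataset<Row> counts = events
    .withWatermark("timestamp", "10 minutes")
    .groupBy(window(col("timestamp"), "5 minutes"))
    .count();
// Emit updated counts to the console in micro-batches
StreamingQuery query = counts.writeStream()
    .outputMode("update")
    .format("console")
    .start();
query.awaitTermination();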
Kafka Streams:
- Lightweight client library
- Stateful processing
- Exactly-once semantics
- Integration with Kafka ecosystem
- No separate cluster required
- Interactive queries
- Processor API and DSL
Example Kafka Streams Application:
// Kafka Streams application example
Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "purchase-processor");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);
StreamsBuilder builder = new StreamsBuilder();
// Read from input topic
KStream<String, String> inputStream = builder.stream("input-topic");
// Parse JSON and filter events
KStream<String, Purchase> purchaseStream = inputStream
.mapValues(value -> parsePurchase(value))
.filter((key, purchase) -> purchase != null && purchase.getAmount() > 0);
// Group by user ID
KGroupedStream<String, Purchase> groupedByUser = purchaseStream
.groupByKey(Grouped.with(Serdes.String(), purchaseSerde));
// Aggregate purchases in 5-minute windows
TimeWindows timeWindows = TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5));
KTable<Windowed<String>, UserPurchaseSummary> windowedAggregates = groupedByUser
.windowedBy(timeWindows)
.aggregate(
UserPurchaseSummary::new,
(userId, purchase, summary) -> summary.addPurchase(purchase),
Materialized.<String, UserPurchaseSummary, WindowStore<Bytes, byte[]>>as("purchase-store")
.withKeySerde(Serdes.String())
.withValueSerde(userPurchaseSummarySerde)
);
// Convert back to stream and write to output topic
windowedAggregates
.toStream()
.map((windowedKey, summary) -> KeyValue.pair(
windowedKey.key(),
formatSummaryAsJson(windowedKey.key(), summary, windowedKey.window().startTime())
))
.to("output-topic", Produced.with(Serdes.String(), Serdes.String()));
// Build and start the topology
KafkaStreams streams = new KafkaStreams(builder.build(), props);
streams.start();
Other Stream Processing Technologies:
- Apache Samza
- Apache Storm
- Apache Beam
- KSQL/ksqlDB
- Amazon Kinesis Data Analytics
- Azure Stream Analytics
- Google Dataflow