Streaming Technologies and Platforms

Event Streaming Platforms

Core technologies for event distribution:

Apache Kafka:

  • Distributed log-based messaging
  • High throughput and scalability
  • Persistent storage
  • Topic-based organization
  • Consumer groups
  • Partitioning for parallelism
  • Exactly-once semantics

Example Kafka Architecture:

┌───────────────────────────────────────────────────────────┐
│                                                           │
│                     Kafka Cluster                         │
│                                                           │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐        │
│  │             │  │             │  │             │        │
│  │  Broker 1   │  │  Broker 2   │  │  Broker 3   │        │
│  │             │  │             │  │             │        │
│  └─────────────┘  └─────────────┘  └─────────────┘        │
│                                                           │
└───────────────────────────────────────────────────────────┘
          ▲                                  ▲
          │                                  │
          │                                  │
┌─────────┴─────────┐              ┌─────────┴─────────┐
│                   │              │                   │
│    Producers      │              │    Consumers      │
│                   │              │                   │
└───────────────────┘              └───────────────────┘
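
Example Kafka Producer and Consumer:

A minimal sketch of the producer and consumer sides of the diagram above; the
broker address, topic name, and group id are illustrative placeholders.

// Kafka producer and consumer example
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

// Producer: records with the same key land on the same partition,
// preserving per-key ordering
Properties producerProps = new Properties();
producerProps.put("bootstrap.servers", "kafka:9092");
producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
    producer.send(new ProducerRecord<>("purchases", "user-42", "{\"amount\": 9.99}"));
}

// Consumer: instances sharing a group.id divide the topic's partitions
// among themselves for parallel consumption
Properties consumerProps = new Properties();
consumerProps.put("bootstrap.servers", "kafka:9092");
consumerProps.put("group.id", "purchase-consumers");
consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
    consumer.subscribe(List.of("purchases"));
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
    for (ConsumerRecord<String, String> record : records) {
        System.out.printf("partition=%d key=%s value=%s%n",
            record.partition(), record.key(), record.value());
    }
}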

Example Kafka Topic Configuration:

# Topic configuration
num.partitions=12
replication.factor=3
min.insync.replicas=2
# Retain messages for 7 days
retention.ms=604800000
cleanup.policy=delete
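
The same settings can be supplied from code when the topic is created. The
sketch below uses Kafka's Java AdminClient; the broker address and topic name
are placeholders.

// Create the topic programmatically with the AdminClient
import java.util.List;
import java.util.Map;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

Properties adminProps = new Properties();
adminProps.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");

try (AdminClient admin = AdminClient.create(adminProps)) {
    NewTopic topic = new NewTopic("purchases", 12, (short) 3)  // partitions, replication factor
        .configs(Map.of(
            "min.insync.replicas", "2",
            "retention.ms", "604800000",   // 7 days
            "cleanup.policy", "delete"));
    admin.createTopics(List.of(topic)).all().get();  // block until the topic exists
}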

Apache Pulsar:

  • Multi-tenant architecture
  • Tiered storage
  • Geo-replication
  • Unified queuing and pub/sub messaging model
  • Pulsar Functions
  • Schema registry
  • Strong durability via Apache BookKeeper segment storage
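
Example Pulsar Producer and Consumer:

A minimal sketch using the Pulsar Java client; the service URL, topic, and
subscription name are illustrative. The Shared subscription type gives
queue-style load balancing, one facet of Pulsar's unified messaging model.

// Pulsar client example
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;
import org.apache.pulsar.client.api.SubscriptionType;

PulsarClient client = PulsarClient.builder()
    .serviceUrl("pulsar://localhost:6650")
    .build();

// Producer with a typed schema
Producer<String> producer = client.newProducer(Schema.STRING)
    .topic("persistent://public/default/purchases")
    .create();
producer.send("purchase-event");

// Shared subscription: messages are load-balanced across consumers,
// giving queue semantics; Exclusive/Failover give streaming semantics
Consumer<String> consumer = client.newConsumer(Schema.STRING)
    .topic("persistent://public/default/purchases")
    .subscriptionName("purchase-processor")
    .subscriptionType(SubscriptionType.Shared)
    .subscribe();

Message<String> msg = consumer.receive();
consumer.acknowledge(msg);

client.close();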

Cloud-Based Streaming Services:

  • Amazon Kinesis
  • Azure Event Hubs
  • Google Cloud Pub/Sub
  • Confluent Cloud
  • IBM Event Streams
  • Redpanda

Stream Processing Frameworks

Technologies for analyzing and transforming streams:

Apache Flink:

  • Stateful stream processing
  • Event time processing
  • Exactly-once semantics
  • Windowing operations
  • Checkpointing for fault tolerance
  • High throughput and low latency
  • SQL interface

Example Flink Streaming Job:

// Flink streaming job example (Event, Result, PurchaseAggregator, and the
// parseJson / extractTimestamp / convertToJson helpers are assumed to be
// defined elsewhere in the application)
import java.time.Duration;
import java.util.Properties;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Enable checkpointing for exactly-once processing
env.enableCheckpointing(60000); // 60 seconds
env.getCheckpointConfig().setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE);

// Configure Kafka source
Properties properties = new Properties();
properties.setProperty("bootstrap.servers", "kafka:9092");
properties.setProperty("group.id", "flink-consumer-group");

FlinkKafkaConsumer<String> kafkaSource = new FlinkKafkaConsumer<>(
    "input-topic",
    new SimpleStringSchema(),
    properties
);
// Assign event-time timestamps, tolerating up to 5 seconds of out-of-order events
kafkaSource.assignTimestampsAndWatermarks(
    WatermarkStrategy
        .<String>forBoundedOutOfOrderness(Duration.ofSeconds(5))
        .withTimestampAssigner((event, timestamp) -> extractTimestamp(event))
);

// Define the processing pipeline
DataStream<String> stream = env.addSource(kafkaSource);

// Parse JSON events
DataStream<Event> events = stream
    .map(json -> parseJson(json))
    .filter(event -> event.getType().equals("PURCHASE"));

// Process events with 5-minute tumbling windows
DataStream<Result> results = events
    .keyBy(event -> event.getUserId())
    .window(TumblingEventTimeWindows.of(Time.minutes(5)))
    .aggregate(new PurchaseAggregator());

// Write results to Kafka (this constructor defaults to at-least-once delivery;
// use the constructor that takes FlinkKafkaProducer.Semantic.EXACTLY_ONCE for
// end-to-end exactly-once)
FlinkKafkaProducer<String> kafkaSink = new FlinkKafkaProducer<>(
    "output-topic",
    new SimpleStringSchema(),
    properties
);

results
    .map(result -> convertToJson(result))
    .addSink(kafkaSink);

// Execute the job
env.execute("Purchase Processing Job");

Apache Spark Structured Streaming:

  • Micro-batch processing
  • DataFrame and Dataset APIs
  • Integration with Spark ecosystem
  • Continuous processing mode
  • Watermarking support
  • Stateful processing
  • Machine learning integration
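
Example Spark Structured Streaming Job:

A minimal sketch in Java, assuming a Kafka broker at kafka:9092 and an
input-topic whose records carry the Kafka message timestamp; topic names and
window sizes are illustrative.

// Spark Structured Streaming example
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.window;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

SparkSession spark = SparkSession.builder()
    .appName("PurchaseCounts")
    .getOrCreate();

// Read a stream of Kafka records as an unbounded DataFrame
Dataset<Row> input = spark.readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "input-topic")
    .load();

// Count records per 5-minute event-time window, using a 10-minute
// watermark to bound how late data may arrive
Dataset<Row> counts = input
    .selectExpr("CAST(value AS STRING) AS json", "timestamp")
    .withWatermark("timestamp", "10 minutes")
    .groupBy(window(col("timestamp"), "5 minutes"))
    .count();

// Emit updated counts to the console in micro-batches
counts.writeStream()
    .outputMode("update")
    .format("console")
    .start()
    .awaitTermination();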

Kafka Streams:

  • Lightweight client library
  • Stateful processing
  • Exactly-once semantics
  • Integration with Kafka ecosystem
  • No separate cluster required
  • Interactive queries
  • Processor API and DSL

Example Kafka Streams Application:

// Kafka Streams application example (Purchase, UserPurchaseSummary, the
// parsePurchase / formatSummaryAsJson helpers, and the purchaseSerde /
// userPurchaseSummarySerde serdes are assumed to be defined elsewhere)
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.*;
import org.apache.kafka.streams.state.WindowStore;

Properties props = new Properties();
props.put(StreamsConfig.APPLICATION_ID_CONFIG, "purchase-processor");
props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092");
props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

StreamsBuilder builder = new StreamsBuilder();

// Read from input topic
KStream<String, String> inputStream = builder.stream("input-topic");

// Parse JSON and filter events
KStream<String, Purchase> purchaseStream = inputStream
    .mapValues(value -> parsePurchase(value))
    .filter((key, purchase) -> purchase != null && purchase.getAmount() > 0);

// Group by user ID
KGroupedStream<String, Purchase> groupedByUser = purchaseStream
    .groupByKey(Grouped.with(Serdes.String(), purchaseSerde));

// Aggregate purchases in 5-minute windows
TimeWindows timeWindows = TimeWindows.ofSizeWithNoGrace(Duration.ofMinutes(5));

KTable<Windowed<String>, UserPurchaseSummary> windowedAggregates = groupedByUser
    .windowedBy(timeWindows)
    .aggregate(
        UserPurchaseSummary::new,
        (userId, purchase, summary) -> summary.addPurchase(purchase),
        Materialized.<String, UserPurchaseSummary, WindowStore<Bytes, byte[]>>as("purchase-store")
            .withKeySerde(Serdes.String())
            .withValueSerde(userPurchaseSummarySerde)
    );

// Convert back to stream and write to output topic
windowedAggregates
    .toStream()
    .map((windowedKey, summary) -> KeyValue.pair(
        windowedKey.key(),
        formatSummaryAsJson(windowedKey.key(), summary, windowedKey.window().startTime())
    ))
    .to("output-topic", Produced.with(Serdes.String(), Serdes.String()));

// Build and start the topology, closing it cleanly on JVM shutdown
KafkaStreams streams = new KafkaStreams(builder.build(), props);
Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
streams.start();

Other Stream Processing Technologies:

  • Apache Samza
  • Apache Storm
  • Apache Beam
  • KSQL/ksqlDB
  • Amazon Kinesis Data Analytics
  • Azure Stream Analytics
  • Google Cloud Dataflow