OpenTelemetry Integration

Implementing the open standard for observability:

OpenTelemetry Components:

  • API: Instrumentation interfaces
  • SDK: Implementation and configuration (see the sketch after this list)
  • Collector: Data processing and export
  • Instrumentation: Language-specific libraries
  • Semantic Conventions: Standardized naming
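
A minimal sketch of the API/SDK split in Python (assuming the opentelemetry-api and opentelemetry-sdk packages; service and span names are illustrative). Application code programs against the API, while the SDK supplies the tracer implementation and the export pipeline:

# Application code depends only on the API; the SDK is wired in once at startup
from opentelemetry import trace                                # API
from opentelemetry.sdk.trace import TracerProvider             # SDK
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)                            # connect SDK to API

tracer = trace.get_tracer("checkout-service")
with tracer.start_as_current_span("process-order"):
    pass  # application logic runs inside the span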

Example OpenTelemetry Collector Configuration:

# OpenTelemetry Collector configuration
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 1s
    send_batch_size: 1024
  
  # Add service name to all telemetry if missing
  resource:
    attributes:
      - key: service.name
        value: "unknown-service"
        action: insert
  
  # Drop spans for health check endpoints (OTTL syntax; a span
  # matching any listed condition is removed)
  filter:
    traces:
      span:
        - 'IsMatch(attributes["http.url"], ".*/health$")'

exporters:
  prometheus:
    endpoint: 0.0.0.0:8889
    namespace: otel
  
  elasticsearch:
    endpoints: ["https://elasticsearch:9200"]
    # The exporter takes a static logs_index; Logstash-style %{...}
    # placeholders are not supported (use logs_dynamic_index for
    # attribute-based routing)
    logs_index: otel-logs
  
  # Recent Collector releases removed the dedicated jaeger exporter;
  # Jaeger ingests OTLP natively on gRPC port 4317
  otlp/jaeger:
    endpoint: jaeger-collector:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      # Batch last, after attribute edits and filtering
      processors: [resource, filter, batch]
      exporters: [otlp/jaeger]
    
    metrics:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [prometheus]
    
    logs:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [elasticsearch]

OpenTelemetry Deployment Models:

  • Agent: Sidecar container or per-host daemon
  • Gateway: Centralized collector per cluster/region (see the sketch after this list)
  • Hierarchical: Multiple collection layers
  • Direct Export: Services export directly to backends
  • Hybrid: Combination based on requirements
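
A minimal sketch of the agent-plus-gateway pattern (hostnames are illustrative): a per-node agent receives telemetry locally, batches it, and forwards everything over OTLP to a central gateway, where heavier processing runs once:

# Per-node agent: receive locally, batch, forward to the gateway
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 1s

exporters:
  otlp:
    endpoint: otel-gateway.observability.svc:4317  # assumed gateway address
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]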

Service Mesh Observability

Leveraging service mesh for enhanced visibility:

Service Mesh Monitoring Features:

  • Automatic metrics collection
  • Distributed tracing integration
  • Traffic visualization
  • Protocol-aware monitoring
  • Zero-code instrumentation

Example Istio Telemetry Configuration:

# Istio telemetry configuration
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  # Configure metrics
  metrics:
    - providers:
        - name: prometheus
      overrides:
        - match:
            metric: REQUEST_COUNT
            mode: CLIENT_AND_SERVER
          disabled: false
        - match:
            metric: REQUEST_DURATION
            mode: CLIENT_AND_SERVER
          disabled: false
  
  # Configure access logs
  accessLogging:
    - providers:
        - name: envoy
      filter:
        expression: "response.code >= 400"
  
  # Configure tracing
  tracing:
    - providers:
        - name: zipkin
      randomSamplingPercentage: 10.0

Service Mesh Observability Benefits:

  • Consistent telemetry across services
  • Protocol-aware metrics (HTTP, gRPC, TCP)
  • Automatic dependency mapping
  • Reduced instrumentation burden
  • Enhanced security visibility

Monitoring Infrastructure

Metrics Collection and Storage

Systems for gathering and storing time-series data:

Metrics Collection Approaches:

  • Pull-based collection (Prometheus)
  • Push-based collection (StatsD, OpenTelemetry OTLP); both models can coexist, as sketched after this list
  • Agent-based collection (Telegraf, collectd)
  • Cloud provider metrics (CloudWatch, Google Cloud Monitoring)
  • Hybrid approaches
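
A minimal Collector sketch combining both models (addresses and job names are illustrative): applications push StatsD packets while the Collector pull-scrapes a Prometheus endpoint:

# Push: apps send StatsD over UDP; pull: the collector scrapes targets
receivers:
  statsd:
    endpoint: 0.0.0.0:8125
  prometheus:
    config:
      scrape_configs:
        - job_name: 'app'
          static_configs:
            - targets: ['app:9090']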

Time-Series Databases:

  • Prometheus
  • InfluxDB
  • TimescaleDB
  • Graphite
  • VictoriaMetrics

Example Prometheus Configuration:

# Prometheus configuration for Kubernetes service discovery
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - action: labelmap
        regex: __meta_kubernetes_pod_label_(.+)
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: kubernetes_namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: kubernetes_pod_name

Metrics Storage Considerations:

  • Retention period requirements
  • Query performance needs
  • Cardinality management
  • High availability setup
  • Long-term storage strategies (see the remote_write sketch below)
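
A common long-term storage approach is Prometheus remote_write to a dedicated store; a minimal sketch (the VictoriaMetrics URL is illustrative, and the relabel rule shows dropping a noisy series to manage cardinality):

# Stream samples to long-term storage as they are ingested
remote_write:
  - url: http://victoriametrics:8428/api/v1/write
    write_relabel_configs:
      # Drop high-churn series before they leave the server
      - source_labels: [__name__]
        regex: go_gc_duration_seconds.*
        action: drop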

Log Management

Collecting, processing, and analyzing log data:

Log Collection Methods:

  • Sidecar containers (Fluent Bit, Filebeat)
  • Node-level agents (Fluentd, Vector)
  • Direct application shipping
  • Log forwarders
  • API-based collection

Example Fluentd Configuration:

# Fluentd configuration for Kubernetes logs
<source>
  @type tail
  path /var/log/containers/*.log
  pos_file /var/log/fluentd-containers.log.pos
  tag kubernetes.*
  read_from_head true
  <parse>
    @type json
    time_format %Y-%m-%dT%H:%M:%S.%NZ
  </parse>
</source>

# Kubernetes metadata enrichment
<filter kubernetes.**>
  @type kubernetes_metadata
  kubernetes_url https://kubernetes.default.svc
  bearer_token_file /var/run/secrets/kubernetes.io/serviceaccount/token
  ca_file /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
</filter>

# Output to Elasticsearch
<match kubernetes.**>
  @type elasticsearch
  host elasticsearch
  port 9200
  logstash_format true
  logstash_prefix k8s-logs
</match>

Log Processing and Analysis:

  • Structured logging formats
  • Log parsing and enrichment (see the sketch after this list)
  • Log aggregation and correlation
  • Full-text search capabilities
  • Log retention and archiving
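
Continuing the Fluentd example above, enrichment can be as simple as a record_transformer filter that stamps shared metadata onto every record (field values are illustrative):

# Enrich every Kubernetes log record before it reaches the output
<filter kubernetes.**>
  @type record_transformer
  <record>
    environment production
    log_source ${tag}
  </record>
</filter>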

Distributed Tracing

Tracking requests across service boundaries:

Tracing System Components:

  • Instrumentation libraries
  • Trace context propagation (W3C Trace Context; example after this list)
  • Sampling strategies
  • Trace collection and storage
  • Visualization and analysis
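
Context propagation most commonly rides on the W3C Trace Context traceparent header, which OpenTelemetry propagates by default (the IDs below are the W3C specification's example values):

# version - trace-id - parent span-id - trace-flags ("01" = sampled)
traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01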

Sampling Strategies:

  • Head-based sampling (decision made when the trace starts)
  • Tail-based sampling (decision made after the trace completes; see the sketch after this list)
  • Rate-limiting sampling
  • Probabilistic sampling
  • Dynamic and adaptive sampling
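
A sketch of tail-based sampling with the Collector's tail_sampling processor (policy names and thresholds are illustrative); policies are OR-ed, so a trace is kept if any policy matches:

# Buffer spans, then decide per complete trace
processors:
  tail_sampling:
    decision_wait: 10s            # wait for late spans before deciding
    policies:
      - name: keep-errors         # always keep traces containing errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow           # always keep unusually slow traces
        type: latency
        latency:
          threshold_ms: 500
      - name: sample-the-rest     # 10% of everything else
        type: probabilistic
        probabilistic:
          sampling_percentage: 10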