
Kubernetes in Production

Observability

Implementing comprehensive logging, monitoring, and tracing for Kubernetes workloads.

The Three Pillars of Observability

Effective observability requires three complementary approaches: logs for detailed events, metrics for aggregated data, and traces for request flows.

📝

Logs

Detailed event records

  • Application errors
  • Request details
  • Audit trails
📊

Metrics

Aggregated time-series data

  • CPU usage
  • Request rates
  • Error counts
🔍

Traces

Request flow tracking

  • Service dependencies
  • Latency breakdown
  • Bottleneck identification

Logging in Kubernetes

Logging Architecture

Node-level logging: Containers write logs to stdout/stderr; the container runtime stores them as files on the node, where the kubelet serves them for kubectl logs

Cluster-level logging: Centralized log aggregation using agents (Fluentd, Fluent Bit, Filebeat)

Storage backends: Elasticsearch, Loki, CloudWatch, Stackdriver

Log rotation: Automatic rotation prevents disk space issues
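Cluster-level aggregation is usually implemented by running the log agent as a DaemonSet that tails /var/log/containers on every node. As a minimal sketch using Fluent Bit's YAML configuration format (the Elasticsearch host name is an assumption for illustration):

```yaml
# fluent-bit.yaml — minimal cluster-logging sketch (Fluent Bit YAML config format)
service:
  flush: 1
  log_level: info

pipeline:
  inputs:
    - name: tail                          # tail container log files on the node
      path: /var/log/containers/*.log
      parser: cri                         # parse the CRI log line format
      tag: kube.*
  filters:
    - name: kubernetes                    # enrich records with pod/namespace/node metadata
      match: kube.*
  outputs:
    - name: es                            # ship to Elasticsearch (host name assumed)
      match: kube.*
      host: elasticsearch
      port: 9200
```

Swapping the output block for Loki or a cloud backend changes only the last few lines; the tail-and-enrich pipeline stays the same.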

Popular Logging Stacks

EFK Stack

Elasticsearch + Fluentd + Kibana

Most common open-source solution, powerful search and visualization

PLG Stack

Promtail + Loki + Grafana

Lightweight, integrates with Prometheus, cost-effective

ELK Stack

Elasticsearch + Logstash + Kibana

Traditional stack, robust parsing and transformation

Cloud Native

CloudWatch, Stackdriver, Azure Monitor

Managed solutions, seamless cloud integration

Logging Best Practices

  • Structured logging: Use JSON format for easy parsing and querying
  • Log levels: Use appropriate levels (DEBUG, INFO, WARN, ERROR)
  • Context enrichment: Include pod name, namespace, node, trace IDs
  • Sampling: Sample high-volume logs to reduce costs
  • Retention policies: Define appropriate retention based on compliance needs
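Putting the first three practices together, a structured log record with context enrichment might look like the following (all field names and values are illustrative, not a fixed schema):

```yaml
# one structured log line (JSON, which is also valid YAML) — illustrative fields
{"ts": "2025-01-15T10:32:07Z", "level": "ERROR", "msg": "payment failed",
 "pod": "checkout-7d4f9", "namespace": "shop", "node": "node-3",
 "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "error": "connection refused"}
```

Because every field is a key, the backend can index and filter on namespace, level, or trace_id without regex parsing.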

Monitoring with Prometheus

Prometheus is the standard monitoring solution for Kubernetes, providing powerful metrics collection, querying, and alerting.

Prometheus Architecture

  • Pull model: Prometheus scrapes metrics from endpoints
  • Service discovery: Automatic target discovery via Kubernetes API
  • Time-series DB: Efficient storage for metrics data
  • PromQL: Powerful query language for metrics analysis
  • Alertmanager: Handles alerts, routing, and deduplication
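The pull model and service discovery combine in the scrape configuration. A common sketch discovers pods via the Kubernetes API and keeps only those that opt in with a prometheus.io/scrape annotation (the job name is an arbitrary choice):

```yaml
# prometheus.yml fragment — Kubernetes pod discovery sketch
scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod                 # discover every pod via the Kubernetes API
    relabel_configs:
      # Scrape only pods annotated prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # Attach namespace and pod name as metric labels
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

New workloads then become scrape targets automatically as soon as they carry the annotation, with no Prometheus restart required.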

Key Metrics to Monitor

Infrastructure

  • Node CPU/memory usage
  • Disk I/O and space
  • Network throughput
  • Pod resource consumption

Application

  • Request rate and latency
  • Error rates
  • Business KPIs
  • Custom application metrics

Kubernetes

  • Pod restarts
  • Node conditions
  • Deployment status
  • Resource quota usage

SLIs/SLOs

  • Availability
  • Latency percentiles
  • Error budget
  • Service level objectives
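The SLI bullets above translate directly into PromQL. A sketch of recording rules for an error ratio and a p99 latency SLI, assuming the application exposes the conventional http_requests_total and http_request_duration_seconds_bucket metrics (metric names are assumptions):

```yaml
# prometheus-rules.yaml — SLI recording-rule sketch
groups:
  - name: sli-recording
    rules:
      # Fraction of requests returning 5xx over the last 5 minutes
      - record: sli:request_error_ratio:rate5m
        expr: |
          sum(rate(http_requests_total{code=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))
      # p99 request latency from the histogram buckets
      - record: sli:request_latency_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```

Recording the SLIs once keeps dashboards and alert rules consistent, since both query the precomputed series instead of repeating the raw expressions.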

Grafana Integration

Grafana provides visualization for Prometheus metrics with rich dashboards and alerting capabilities.

  • Pre-built dashboards for Kubernetes monitoring
  • Custom dashboards with PromQL queries
  • Alert visualization and management
  • Multiple data source support
  • Templating for reusable dashboards
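In Kubernetes, Grafana data sources are typically wired up declaratively rather than by hand. A provisioning sketch that registers Prometheus as the default source (the in-cluster service URL is an assumption):

```yaml
# grafana/provisioning/datasources/prometheus.yaml — provisioning sketch
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                        # Grafana backend proxies the queries
    url: http://prometheus-server:9090   # assumed in-cluster service name
    isDefault: true
```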

Distributed Tracing

Distributed tracing tracks requests as they flow through microservices, helping identify bottlenecks and dependencies.

Jaeger

CNCF project, OpenTracing compatible, excellent UI

  • Trace visualization
  • Service dependency graph
  • Root cause analysis
  • Performance optimization

Zipkin

Originated at Twitter; simple setup, widely adopted

  • Trace collection
  • Storage backends
  • Query interface
  • Minimal overhead

Tempo

Grafana Labs, integrates with Loki and Prometheus

  • Cost-effective
  • Object storage backend
  • Unified observability
  • PromQL-like query language

Implementing Tracing

  • Instrumentation: Add tracing libraries to applications (OpenTelemetry recommended)
  • Context propagation: Pass trace IDs through service calls
  • Sampling strategy: Use intelligent sampling to reduce overhead
  • Service mesh: Istio/Linkerd provide automatic tracing
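One concrete way to implement the sampling strategy above is the tail_sampling processor in the OpenTelemetry Collector's contrib distribution, which decides after a trace completes: keep every error trace, but only a fraction of the healthy ones (the percentages are illustrative):

```yaml
# collector fragment — tail-based sampling sketch (otelcol-contrib)
processors:
  tail_sampling:
    decision_wait: 10s              # wait for all spans of a trace to arrive
    policies:
      - name: keep-errors           # always keep traces containing errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: sample-ten-percent    # keep 10% of everything else
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
```

Tail-based sampling costs more memory than head-based sampling (the collector buffers spans until the decision), but it guarantees the interesting traces survive.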

OpenTelemetry: The Modern Standard (2025)

Why OpenTelemetry (OTel)?

OpenTelemetry (OTel), a CNCF project, is the industry standard for observability instrumentation, unifying metrics, logs, and traces in a single vendor-neutral toolkit. It supersedes the earlier OpenTracing and OpenCensus projects.

Key Benefits

  • Vendor-neutral instrumentation
  • Single SDK for all signals (logs, metrics, traces)
  • Automatic instrumentation for popular frameworks
  • Context propagation across services
  • Export to multiple backends simultaneously

Components

  • SDK: Instrument applications
  • Collector: Receive, process, export telemetry
  • Auto-instrumentation: No-code instrumentation
  • APIs: Language-agnostic specifications
  • Semantic Conventions: Standard attribute names
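On Kubernetes, auto-instrumentation is usually driven by the OpenTelemetry Operator's Instrumentation resource, which injects the SDK into pods that opt in via an annotation. A minimal sketch (the collector endpoint and sampling ratio are illustrative):

```yaml
# instrumentation.yaml — OpenTelemetry Operator auto-instrumentation sketch
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector:4317   # assumed collector service
  propagators:
    - tracecontext                         # W3C trace context headers
    - baggage
  sampler:
    type: parentbased_traceidratio
    argument: "0.25"                       # sample 25% of new traces
```

Workload pods then opt in with an annotation such as instrumentation.opentelemetry.io/inject-java: "true", and get traced without any code changes.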

Example: OpenTelemetry Collector Configuration

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  memory_limiter:
    check_interval: 1s   # required by the memory_limiter processor
    limit_mib: 512
  batch:

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  otlp/jaeger:           # Jaeger ingests OTLP natively; the old jaeger exporter was removed
    endpoint: jaeger:4317
    tls:
      insecure: true
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]   # memory_limiter runs first
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]

Complete Observability Stack (2025)

🔭 Grafana LGTM Stack (Loki, Grafana, Tempo, Mimir)

Unified observability platform from Grafana Labs with excellent Kubernetes integration

Components:

  • Loki: Log aggregation (like Prometheus for logs)
  • Grafana: Visualization and dashboards
  • Tempo: Distributed tracing with object storage
  • Mimir: Prometheus-compatible metrics at scale

Benefits:

  • Cost-effective (uses object storage)
  • Unified correlation across signals
  • PromQL and LogQL for queries
  • Excellent Kubernetes support

eBPF-Based Observability

Extended Berkeley Packet Filter enables kernel-level observability without code changes

Pixie (CNCF)

Instant Kubernetes observability with auto-telemetry

  • No code changes required
  • Real-time application profiling
  • Network monitoring
  • Service map generation

Cilium Hubble

Network observability with eBPF

  • Deep network visibility
  • DNS monitoring
  • Service dependency map
  • Security event tracking

Parca

Continuous profiling for performance optimization

  • CPU and memory profiling
  • Always-on collection
  • Flame graph visualization
  • Low overhead

Commercial APM Solutions

Full-Stack APM:

  • Datadog: Complete observability platform, excellent UX
  • New Relic: APM pioneer, now OpenTelemetry-first
  • Dynatrace: AI-powered observability and automation
  • Elastic APM: Built on Elasticsearch stack

When to Consider:

  • Need for a managed solution
  • Multi-cloud/hybrid environments
  • Advanced analytics and AI features
  • Compliance requirements

Alerting Strategy

Effective Alerting Principles

Alert on symptoms, not causes

Alert when users are affected, not just when something breaks

Actionable alerts only

Every alert should require human intervention; eliminate noise

Multiple severity levels

Critical (page), warning (ticket), info (dashboard)

Context in alerts

Include relevant metrics, logs, runbook links

Common Alert Rules

  • High error rate: Error percentage exceeds threshold
  • Slow response time: p99 latency above SLO
  • Pod crash loop: Container restarting frequently
  • Resource saturation: CPU/memory near limits
  • Node down: Node not ready for extended period
  • Deployment failure: Rollout stuck or failing
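With the Prometheus Operator, rules like these are expressed as PrometheusRule resources, with severity labels for Alertmanager routing and a runbook link for context. A sketch covering the first and third rules above (thresholds, metric names, and the runbook URL are illustrative):

```yaml
# app-alerts.yaml — PrometheusRule sketch (Prometheus Operator CRD)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
spec:
  groups:
    - name: symptoms
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{code=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 0.05
          for: 10m                       # sustained, not a blip
          labels:
            severity: critical           # routes to paging
          annotations:
            summary: "Error rate above 5% for 10 minutes"
            runbook_url: https://runbooks.example.com/high-error-rate
        - alert: PodCrashLooping
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          labels:
            severity: warning            # routes to a ticket, not a page
          annotations:
            summary: "Container restarted more than 3 times in 15 minutes"
```

Note how the critical alert fires on a user-visible symptom (error rate) while the cause-level signal (restarts) is only a warning, matching the principles above.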

Key Takeaways

  • Observability requires logs, metrics, and traces working together
  • Prometheus + Grafana is the standard for Kubernetes monitoring
  • Structured logging with proper context makes debugging much easier
  • Distributed tracing is essential for understanding microservices performance
  • Good alerting is actionable, contextual, and symptom-based