Observability
Implementing comprehensive logging, monitoring, and tracing for Kubernetes workloads.
The Three Pillars of Observability
Effective observability requires three complementary approaches: logs for detailed events, metrics for aggregated data, and traces for request flows.
Logs
Detailed event records
- Application errors
- Request details
- Audit trails
Metrics
Aggregated time-series data
- CPU usage
- Request rates
- Error counts
Traces
Request flow tracking
- Service dependencies
- Latency breakdown
- Bottleneck identification
Logging in Kubernetes
Logging Architecture
Node-level logging: Containers write to stdout/stderr; the container runtime persists these streams as files on the node, which the kubelet exposes to kubectl logs
Cluster-level logging: Node agents (Fluentd, Fluent Bit, Filebeat) ship logs to a centralized backend; a minimal agent sketch follows this list
Storage backends: Elasticsearch, Loki, CloudWatch, Google Cloud Logging (formerly Stackdriver)
Log rotation: Automatic rotation prevents disk space issues
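As an illustration of the agent pattern above, here is a minimal sketch of a Fluent Bit configuration (YAML format): it tails the container log files the runtime writes, enriches records with pod metadata, and ships them to Loki. The Loki host and label are assumptions for this example, and it relies on the stock cri parser that ships with Fluent Bit.

pipeline:
  inputs:
    - name: tail                  # read the log files the container runtime writes
      path: /var/log/containers/*.log
      parser: cri                 # stock parser for the CRI log line format
      tag: kube.*
  filters:
    - name: kubernetes            # enrich records with pod, namespace, and label metadata
      match: kube.*
  outputs:
    - name: loki                  # assumed backend; swap for es, cloudwatch, etc.
      match: '*'
      host: loki.logging.svc      # hypothetical in-cluster Loki service
      port: 3100
      labels: job=fluent-bit

Deployed as a DaemonSet, one agent per node covers every pod scheduled there.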
Popular Logging Stacks
EFK Stack
Elasticsearch + Fluentd + Kibana
Most common open-source solution, powerful search and visualization
PLG Stack
Promtail + Loki + Grafana
Lightweight, integrates with Prometheus, cost-effective
ELK Stack
Elasticsearch + Logstash + Kibana
Traditional stack, robust parsing and transformation
Cloud Native
CloudWatch, Google Cloud Monitoring (formerly Stackdriver), Azure Monitor
Managed solutions, seamless cloud integration
Logging Best Practices
- Structured logging: Use JSON format for easy parsing and querying (see the example after this list)
- Log levels: Use appropriate levels (DEBUG, INFO, WARN, ERROR)
- Context enrichment: Include pod name, namespace, node, trace IDs
- Sampling: Sample high-volume logs to reduce costs
- Retention policies: Define appropriate retention based on compliance needs
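As a concrete example of structured logging with context enrichment, a single log record might look like the following; the field names are illustrative, not a required schema:

{
  "timestamp": "2025-01-15T09:30:00Z",
  "level": "ERROR",
  "message": "payment authorization failed",
  "namespace": "prod",
  "pod": "checkout-7d9f8b6c5-x2k4j",
  "node": "ip-10-0-3-17",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "error": "upstream timeout after 2s"
}

Every field becomes queryable in the backend, and the trace_id links the log line to its distributed trace.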
Monitoring with Prometheus
Prometheus is the standard monitoring solution for Kubernetes, providing powerful metrics collection, querying, and alerting.
Prometheus Architecture
- Pull model: Prometheus scrapes metrics from endpoints
- Service discovery: Automatic target discovery via the Kubernetes API (scrape config sketch after this list)
- Time-series DB: Efficient storage for metrics data
- PromQL: Powerful query language for metrics analysis
- Alertmanager: Handles alerts, routing, and deduplication
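The pull model and service discovery combine in the scrape configuration. A minimal sketch follows; the prometheus.io/scrape annotation convention is widely used but not built in, so treat the relabeling rules as one common pattern rather than a fixed API:

scrape_configs:
  - job_name: kubernetes-pods
    kubernetes_sd_configs:
      - role: pod                  # discover every pod via the Kubernetes API
    relabel_configs:
      # keep only pods annotated prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
      # attach namespace and pod name to every scraped series
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod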
Key Metrics to Monitor
Infrastructure
- Node CPU/memory usage
- Disk I/O and space
- Network throughput
- Pod resource consumption
Application
- Request rate and latency
- Error rates
- Business KPIs
- Custom application metrics
Kubernetes
- Pod restarts
- Node conditions
- Deployment status
- Resource quota usage
SLIs/SLOs
- Availability
- Latency percentiles
- Error budget
- Service level objectives (recording rules sketch after this list)
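SLIs such as error ratio and latency percentiles are typically computed in PromQL and precomputed as recording rules. A sketch, assuming the conventional http_requests_total counter and http_request_duration_seconds histogram are exported by the application:

groups:
  - name: service-slis
    rules:
      # request rate per job
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # ratio of 5xx responses to all responses
      - record: job:http_errors:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (job) (rate(http_requests_total[5m]))
      # p99 latency derived from histogram buckets
      - record: job:http_latency_seconds:p99_5m
        expr: histogram_quantile(0.99, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))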
Grafana Integration
Grafana provides visualization for Prometheus metrics with rich dashboards and alerting capabilities.
- Pre-built dashboards for Kubernetes monitoring
- Custom dashboards with PromQL queries
- Alert visualization and management
- Multiple data source support (provisioning sketch after this list)
- Templating for reusable dashboards
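Data sources can be provisioned as code rather than configured in the UI. A sketch of a Grafana provisioning file; the in-cluster service URLs are assumptions for this example:

apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-server.monitoring.svc:9090   # hypothetical service address
    isDefault: true
  - name: Loki
    type: loki
    url: http://loki.logging.svc:3100                   # hypothetical service address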
Distributed Tracing
Distributed tracing tracks requests as they flow through microservices, helping identify bottlenecks and dependencies.
Jaeger
CNCF graduated project, natively accepts OTLP (formerly OpenTracing-based), excellent UI
Zipkin
Originated at Twitter, simple to set up, widely adopted
Tempo
Grafana Labs, integrates with Loki and Prometheus
Implementing Tracing
- Instrumentation: Add tracing libraries to applications (OpenTelemetry recommended); a no-code Kubernetes option is sketched after this list
- Context propagation: Pass trace IDs through service calls
- Sampling strategy: Use intelligent sampling to reduce overhead
- Service mesh: Istio/Linkerd provide automatic tracing
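On Kubernetes, the OpenTelemetry Operator covers the first three items declaratively: its Instrumentation resource injects auto-instrumentation, configures propagators, and sets the sampling strategy. A sketch, assuming the operator is installed and using an illustrative collector endpoint:

apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
  name: default-instrumentation
spec:
  exporter:
    endpoint: http://otel-collector:4317   # hypothetical in-cluster collector
  propagators:                             # context propagation across service calls
    - tracecontext
    - baggage
  sampler:                                 # head sampling: keep 25% of new traces
    type: parentbased_traceidratio
    argument: "0.25"

Pods then opt in with an annotation such as instrumentation.opentelemetry.io/inject-java: "true" (other languages have equivalent annotations).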
OpenTelemetry: The Modern Standard (2025)
Why OpenTelemetry (OTel)?
OpenTelemetry, a CNCF incubating project, is the industry standard for observability instrumentation, unifying metrics, logs, and traces under a single vendor-neutral specification. It supersedes both OpenTracing and OpenCensus.
Key Benefits
- Vendor-neutral instrumentation
- Single SDK for all signals (logs, metrics, traces)
- Automatic instrumentation for popular frameworks
- Context propagation across services
- Export to multiple backends simultaneously
Components
- SDK: Instrument applications
- Collector: Receive, process, export telemetry
- Auto-instrumentation: No-code instrumentation
- APIs: Language-agnostic specifications
- Semantic Conventions: Standard attribute names
Example: OpenTelemetry Collector Configuration
receivers:
  otlp:                              # accept OTLP over gRPC and HTTP
    protocols:
      grpc:
      http:

processors:
  batch:
  memory_limiter:
    check_interval: 1s               # required when memory_limiter is enabled
    limit_mib: 512

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"
  otlp/jaeger:                       # Jaeger ingests OTLP natively; the dedicated jaeger exporter was retired
    endpoint: jaeger:4317
    tls:
      insecure: true
  loki:
    endpoint: http://loki:3100/loki/api/v1/push

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]

Complete Observability Stack (2025)
🔭 Grafana LGTM Stack (Loki, Grafana, Tempo, Mimir)
Unified observability platform from Grafana Labs with excellent Kubernetes integration
Components:
- Loki: Log aggregation (like Prometheus for logs)
- Grafana: Visualization and dashboards
- Tempo: Distributed tracing with object storage
- Mimir: Prometheus-compatible metrics at scale
Benefits:
- Cost-effective (uses object storage)
- Unified correlation across signals
- PromQL and LogQL for queries
- Excellent Kubernetes support
⚡ eBPF-Based Observability
Extended Berkeley Packet Filter enables kernel-level observability without code changes
Pixie (CNCF)
Instant Kubernetes observability with auto-telemetry
- No code changes required
- Real-time application profiling
- Network monitoring
- Service map generation
Cilium Hubble
Network observability with eBPF
- Deep network visibility
- DNS monitoring
- Service dependency map
- Security event tracking
Parca
Continuous profiling for performance optimization
- CPU and memory profiling
- Always-on collection
- Flame graph visualization
- Low overhead
Commercial APM Solutions
Full-Stack APM:
- Datadog: Complete observability platform, excellent UX
- New Relic: APM pioneer, now OpenTelemetry-first
- Dynatrace: AI-powered observability and automation
- Elastic APM: Built on the Elasticsearch stack
When to Consider:
- Need a managed solution
- Multi-cloud/hybrid environments
- Advanced analytics and AI features
- Compliance requirements
Alerting Strategy
Effective Alerting Principles
Alert on symptoms, not causes
Alert when users are affected, not just when something breaks
Actionable alerts only
Every alert should require human intervention; eliminate noise
Multiple severity levels
Critical (page), warning (ticket), info (dashboard); severity-based routing is sketched below
Context in alerts
Include relevant metrics, logs, runbook links
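Severity levels map naturally onto Alertmanager routing. A sketch, in which the receiver names and endpoints are placeholders:

route:
  receiver: default
  routes:
    - matchers: ['severity="critical"']   # page a human
      receiver: pagerduty
    - matchers: ['severity="warning"']    # open a ticket
      receiver: ticketing
receivers:
  - name: default
  - name: pagerduty
    pagerduty_configs:
      - routing_key: REPLACE_ME           # placeholder; use your PagerDuty integration key
  - name: ticketing
    webhook_configs:
      - url: http://ticket-bridge:8080/alerts   # hypothetical ticketing webhook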
Common Alert Rules
- High error rate: Error percentage exceeds threshold (rule sketch after this list)
- Slow response time: p99 latency above SLO
- Pod crash loop: Container restarting frequently
- Resource saturation: CPU/memory near limits
- Node down: Node not ready for an extended period
- Deployment failure: Rollout stuck or failing
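With the Prometheus Operator, the first and third rules might look like the following sketch; thresholds, metric names, and the runbook URL are placeholders to adapt:

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: service-alerts
spec:
  groups:
    - name: symptoms
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 0.05
          for: 10m                        # sustained, not a blip
          labels:
            severity: critical            # page
          annotations:
            summary: "5xx error ratio above 5% for 10 minutes"
            runbook_url: https://runbooks.example.com/high-error-rate   # placeholder
        - alert: PodCrashLooping
          expr: increase(kube_pod_container_status_restarts_total[15m]) > 3
          labels:
            severity: warning             # ticket
          annotations:
            summary: "Container restarted more than 3 times in 15 minutes"

The kube_pod_container_status_restarts_total metric comes from kube-state-metrics, which must be deployed for the second rule to fire.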
Key Takeaways
- Observability requires logs, metrics, and traces working together
- Prometheus + Grafana is the standard for Kubernetes monitoring
- Structured logging with proper context makes debugging much easier
- Distributed tracing is essential for understanding microservices performance
- Good alerting is actionable, contextual, and symptom-based