This article introduces the foundational concepts of observability: logs, metrics, and traces. It explains how these three telemetry types provide different perspectives on events generated by a running service, enabling engineers to understand system behavior, diagnose issues, and make informed architectural decisions. Understanding these primitives is crucial for designing resilient and maintainable distributed systems.
Read original on ByteByteGo
Observability is a critical aspect of modern system design, especially in distributed environments. It refers to how well you can understand the internal states of a system by examining its external outputs. The three pillars of observability are logs, metrics, and traces, each providing a distinct lens into system operations.
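To make the three lenses concrete, here is a minimal sketch of a single request emitting all three telemetry types. The in-memory sinks (`log_lines`, `request_counter`, `spans`), the `handle_request` function, and the `/checkout` route are hypothetical stand-ins for illustration only; a real service would send this data to dedicated backends.

```python
import json
import time
import uuid
from collections import Counter

# Hypothetical in-memory sinks standing in for real log, metric, and trace backends.
log_lines = []
request_counter = Counter()
spans = []


def handle_request(route: str) -> int:
    """Handle one fake request and emit a log, a metric, and a span for it."""
    trace_id = uuid.uuid4().hex
    start = time.perf_counter()
    status = 200  # assumption: the (omitted) work succeeded
    duration_s = time.perf_counter() - start

    # Log: a detailed, timestamped record of this one event.
    log_lines.append(json.dumps({
        "ts": time.time(),
        "level": "INFO",
        "msg": "request handled",
        "route": route,
        "status": status,
        "trace_id": trace_id,
    }))

    # Metric: an aggregate that only counts events per (route, status).
    request_counter[(route, status)] += 1

    # Trace: a timed span that can be linked to other spans through trace_id.
    spans.append({"trace_id": trace_id, "name": f"GET {route}", "duration_s": duration_s})
    return status


handle_request("/checkout")
handle_request("/checkout")
```

After two calls, the log sink holds two detailed records, the counter holds a single aggregated entry for `("/checkout", 200)`, and each span shares its `trace_id` with the matching log line.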
Cardinality and Cost
A key consideration in system design for observability is cardinality: the number of distinct values a field or label can take and, for metrics, the number of distinct label combinations that result. Logs typically have high cardinality because each entry carries detailed, often unique fields, which makes them expensive to store and query. Metrics, especially aggregated ones, tend to have lower cardinality and are more cost-effective for long-term storage and trend analysis. Trace cardinality varies with how much detail (number of spans, custom tags) is captured per request, which in turn drives storage and processing costs.
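The cost effect of cardinality can be approximated with simple arithmetic: for one metric, the number of distinct time series is bounded above by the product of the number of values each label can take. The sketch below assumes hypothetical labels and value counts (`route`, `method`, `status`, `user_id`); they are illustrative and not taken from any particular system.

```python
from math import prod


def max_series(label_values: dict[str, int]) -> int:
    """Upper bound on distinct time series for one metric:
    the product of the number of possible values per label."""
    return prod(label_values.values())


# Hypothetical label sets for a request-count metric.
bounded = {"route": 20, "method": 4, "status": 5}
unbounded = {**bounded, "user_id": 100_000}

print(max_series(bounded))    # 400
print(max_series(unbounded))  # 40000000
```

Adding a single unbounded label such as a user ID multiplies the worst-case series count by the number of users, which is why per-request identifiers usually fit better in logs or trace attributes than in metric labels.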
When designing a system, it's crucial to instrument applications appropriately from the outset. This involves integrating logging frameworks, metrics exporters (e.g., Prometheus clients), and distributed tracing libraries (e.g., OpenTelemetry SDKs). Architectural decisions around data ingestion, storage (e.g., the ELK stack for logs, Prometheus with Grafana dashboards for metrics, Jaeger or Zipkin for traces), and correlation mechanisms are fundamental to building an observable system. A robust observability strategy helps maintain system reliability, reduce MTTR (Mean Time To Recovery), and ensure operational efficiency.
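One common correlation mechanism is to stamp every log line with the ID of the trace it belongs to, so a log search can jump to the matching trace and back. The following is a minimal sketch using only the Python standard library; `current_trace_id`, `TraceIdFilter`, the `checkout` logger name, and `handle_request` are hypothetical names, and a real deployment would more likely take the ID from a tracing SDK than generate it by hand.

```python
import contextvars
import io
import logging
import uuid

# Hypothetical context variable holding the current request's trace ID.
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the current trace ID onto every record so the formatter can print it."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only annotate it


stream = io.StringIO()  # in-memory sink; a real service would ship logs to a backend
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s"))
handler.addFilter(TraceIdFilter())

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example's output confined to our handler


def handle_request() -> str:
    """Simulate one request: set a trace ID, log inside it, then restore the context."""
    trace_id = uuid.uuid4().hex
    token = current_trace_id.set(trace_id)
    try:
        logger.info("payment authorized")
    finally:
        current_trace_id.reset(token)
    return trace_id


trace_id = handle_request()
```

Because the ID lives in a context variable rather than a function argument, any code that logs during the request picks it up without signature changes.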