ByteByteGo·June 18, 2026

Observability Fundamentals: Logs, Metrics, and Traces in System Design

This article introduces the foundational concepts of observability: logs, metrics, and traces. It explains how these three telemetry types provide different perspectives on events generated by a running service, enabling engineers to understand system behavior, diagnose issues, and make informed architectural decisions. Understanding these primitives is crucial for designing resilient and maintainable distributed systems.


Observability is a critical aspect of modern system design, especially in distributed environments. It refers to how well you can understand the internal states of a system by examining its external outputs. The three pillars of observability are logs, metrics, and traces, each providing a distinct lens into system operations.

The Three Pillars of Observability

  • Logs: These are timestamped, immutable records of discrete events that occur within an application or system. They are often human-readable text and provide detailed context about what happened at a specific point in time, including errors, warnings, and informational messages. Logs are excellent for post-mortem debugging and understanding specific occurrences.
  • Metrics: These are numerical measurements representing the state of a system over time. Metrics are aggregated data points (e.g., CPU utilization, request count, error rates) that are collected at regular intervals. They are ideal for monitoring system health, identifying trends, and alerting on anomalies, providing a quantitative overview rather than detailed event data.
  • Traces: A trace represents the end-to-end journey of a request or transaction as it propagates through multiple services in a distributed system. Each step in the journey is called a 'span', and traces link these spans together, showing the causal relationships and latency contributions of each service. Traces are invaluable for performance profiling, identifying bottlenecks across service boundaries, and understanding complex distributed interactions.
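To make the distinction between the three signals concrete, here is a minimal sketch using only the Python standard library. It is not how any particular observability product works: the `Telemetry` and `Span` classes, the `handle_request` function, and the metric names `requests_total` and `errors_total` are all hypothetical names chosen for illustration. One handled request produces a log line (a discrete event), increments a counter (an aggregate), and records two linked spans (a trace).

```python
import json
import time
import uuid
from collections import Counter
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    trace_id: str           # shared by every span belonging to one request
    name: str
    parent: Optional[str]   # name of the calling span; None for the root span
    start: float
    end: float = 0.0


@dataclass
class Telemetry:
    logs: list = field(default_factory=list)            # discrete, timestamped events
    counters: Counter = field(default_factory=Counter)  # aggregated numbers
    spans: list = field(default_factory=list)           # timed steps linked by trace_id

    def log(self, level, message, **context):
        record = {"ts": time.time(), "level": level, "msg": message, **context}
        self.logs.append(json.dumps(record))

    def count(self, name):
        self.counters[name] += 1

    @contextmanager
    def span(self, trace_id, name, parent=None):
        current = Span(trace_id, name, parent, time.perf_counter())
        try:
            yield current
        finally:
            # Record the span even if the wrapped step returns early or raises.
            current.end = time.perf_counter()
            self.spans.append(current)


def handle_request(telemetry, path, fail=False):
    """Hypothetical request handler that emits all three signal types."""
    trace_id = uuid.uuid4().hex
    telemetry.count("requests_total")
    with telemetry.span(trace_id, "handle_request"):
        with telemetry.span(trace_id, "db_query", parent="handle_request"):
            if fail:
                telemetry.count("errors_total")
                telemetry.log("ERROR", "query failed", path=path, trace_id=trace_id)
                return trace_id
        telemetry.log("INFO", "request served", path=path, trace_id=trace_id)
    return trace_id
```

After two calls, one of them failing, the counters answer "how many requests and errors?", the log lines answer "what exactly happened?", and the spans grouped by `trace_id` answer "where did each request spend its time?".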

Cardinality and Cost

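A key consideration when designing for observability is cardinality: the number of distinct values a field, label, or tag can take. Logs typically have high cardinality because each record carries detailed, often unique values (request IDs, user IDs, free-form messages), which makes them expensive to store and query at volume. Metrics are aggregated, so they tend to have lower cardinality and are more cost-effective for long-term storage and trend analysis, provided their labels stay bounded; in label-based systems such as Prometheus, every distinct label combination becomes its own time series. Traces have varying cardinality depending on how much detail (number of spans, custom tags) is captured per request, which directly affects storage and processing costs.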
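The cost effect of cardinality on metrics can be illustrated with simple arithmetic. The sketch below assumes a label-based metrics system in which each distinct combination of label values is stored as a separate time series. The label names and value counts are made up for illustration, and the result is an upper bound, since only combinations that actually occur are stored.

```python
from math import prod
from typing import Mapping


def max_series(distinct_values_per_label: Mapping[str, int]) -> int:
    """Upper bound on time series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(distinct_values_per_label.values())


# Hypothetical labels on a request counter.
bounded = {"method": 5, "status_class": 5, "region": 4}
# The same metric after adding one unbounded label.
unbounded = {**bounded, "user_id": 100_000}

print(max_series(bounded))    # 100
print(max_series(unbounded))  # 10000000
```

A single per-user label turns a hundred series into as many as ten million, which is why identifiers like user or request IDs usually belong in logs or trace tags rather than in metric labels.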

Designing for Observability

When designing a system, it's crucial to instrument applications appropriately from the outset. This involves integrating logging frameworks, metrics exporters (e.g., Prometheus clients), and distributed tracing libraries (e.g., OpenTelemetry SDKs). Architectural decisions around data ingestion, storage (e.g., ELK stack for logs, Prometheus/Grafana for metrics, Jaeger/Zipkin for traces), and correlation mechanisms are fundamental to building an observable system. A robust observability strategy helps maintain system reliability, reduce MTTR (Mean Time To Recovery), and ensure operational efficiency.
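One simple correlation mechanism is to stamp every log line with the ID of the request that produced it, so logs can later be joined with the matching trace. The sketch below uses only Python's standard `logging` and `contextvars` modules rather than the OpenTelemetry SDK. The logger name `checkout`, the `TraceIdFilter` class, and the `handle_request` function are hypothetical, and a real tracing library would normally generate and propagate the ID for you.

```python
import contextvars
import io
import logging
import uuid

# Holds the trace ID of the request currently being handled ("-" when none).
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copies the active trace ID onto every log record."""

    def filter(self, record):
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only enrich it


def build_logger(stream):
    logger = logging.getLogger("checkout")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
    )
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger, order_id):
    """Hypothetical handler: all logs emitted inside share one trace ID."""
    token = current_trace_id.set(uuid.uuid4().hex)
    try:
        logger.info("order received id=%s", order_id)
        logger.info("payment authorized id=%s", order_id)
        return current_trace_id.get()
    finally:
        current_trace_id.reset(token)  # do not leak the ID into the next request


if __name__ == "__main__":
    buffer = io.StringIO()
    handle_request(build_logger(buffer), 42)
    print(buffer.getvalue(), end="")
```

Because the ID lives in a context variable, application code does not need to pass it through every function call, and searching the log store for one `trace_id` returns every line written while handling that request.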

