ByteByteGo·June 18, 2026

Observability Fundamentals: Logs, Metrics, and Traces in System Design

This article introduces the foundational concepts of observability: logs, metrics, and traces. It explains how these three telemetry types provide different perspectives on events generated by a running service, enabling engineers to understand system behavior, diagnose issues, and make informed architectural decisions. Understanding these primitives is crucial for designing resilient and maintainable distributed systems.



Observability is a critical aspect of modern system design, especially in distributed environments. It refers to how well you can understand the internal states of a system by examining its external outputs. The three pillars of observability are logs, metrics, and traces, each providing a distinct lens into system operations.

The Three Pillars of Observability

Logs: These are timestamped, immutable records of discrete events that occur within an application or system. They are often human-readable text and provide detailed context about what happened at a specific point in time, including errors, warnings, and informational messages. Logs are excellent for post-mortem debugging and understanding specific occurrences.
Metrics: These are numerical measurements representing the state of a system over time. Metrics are aggregated data points (e.g., CPU utilization, request count, error rates) that are collected at regular intervals. They are ideal for monitoring system health, identifying trends, and alerting on anomalies, providing a quantitative overview rather than detailed event data.
Traces: A trace represents the end-to-end journey of a request or transaction as it propagates through multiple services in a distributed system. Each step in the journey is called a 'span', and traces link these spans together, showing the causal relationships and latency contributions of each service. Traces are invaluable for performance profiling, identifying bottlenecks across service boundaries, and understanding complex distributed interactions.
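
To make the distinction concrete, here is a minimal sketch of one request producing all three telemetry types. It uses only the Python standard library. The `Span` class, the `handle_checkout` function, and the in-memory `logs`, `request_count`, and `spans` containers are hypothetical stand-ins for a real logging pipeline, metrics client, and tracing SDK; they are not taken from the original article.

```python
import json
import time
import uuid
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """One timed step of a request; spans sharing a trace_id form a trace."""
    trace_id: str
    name: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    start: float = 0.0
    end: float = 0.0

    @property
    def duration_ms(self) -> float:
        return (self.end - self.start) * 1000.0


# In-memory stand-ins for real backends (an assumption of this sketch).
logs: List[str] = []                 # log: one record per discrete event
request_count: Counter = Counter()   # metric: aggregated count per label set
spans: List[Span] = []               # trace: spans linked by trace_id/parent_id


def handle_checkout(order_id: str) -> str:
    """Handle one hypothetical request and emit a log, a metric, and a trace."""
    trace_id = uuid.uuid4().hex

    # Trace: a root span for the request and a child span for one step of it.
    root = Span(trace_id, "checkout", start=time.monotonic())
    child = Span(trace_id, "charge_card", parent_id=root.span_id,
                 start=time.monotonic())
    # ... the actual payment call would run here ...
    child.end = time.monotonic()
    root.end = time.monotonic()
    spans.extend([root, child])

    # Log: a timestamped record with full context for this single event.
    logs.append(json.dumps({
        "ts": time.time(),
        "level": "INFO",
        "msg": "order charged",
        "order_id": order_id,
        "trace_id": trace_id,
    }))

    # Metric: only a number changes; per-request detail is discarded.
    request_count["checkout|200"] += 1
    return trace_id
```

After two calls, `logs` holds two detailed records, `request_count` holds a single aggregated number, and `spans` holds two parent/child pairs that can be grouped by `trace_id`.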

Cardinality and Cost

A key consideration in system design for observability is cardinality: the number of distinct values a field, or a combination of labels, can take. Logs typically carry high-cardinality fields (request IDs, user IDs, free-form messages), which makes them expensive to store, index, and query. Metrics are aggregated, so they remain cost-effective for long-term storage and trend analysis as long as their labels stay low-cardinality, because each distinct label combination becomes its own time series. The cost of traces varies with how much detail (number of spans, custom tags) is captured per request, which affects both storage and processing.
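
The arithmetic behind metric cardinality can be sketched as follows. The label names and the counts of distinct values are illustrative assumptions, not figures from the article; `max_series` and `observed_series` are hypothetical helper names.

```python
from math import prod
from typing import Dict, Iterable


def max_series(distinct_values_per_label: Dict[str, int]) -> int:
    """Upper bound on time series for one metric: the product of the
    number of distinct values each label can take."""
    return prod(distinct_values_per_label.values())


def observed_series(samples: Iterable[Dict[str, str]]) -> int:
    """Number of distinct label combinations actually seen in the samples."""
    return len({tuple(sorted(labels.items())) for labels in samples})


# Illustrative numbers only.
bounded = {"service": 20, "endpoint": 15, "status_code": 5}
with_user_id = {**bounded, "user_id": 100_000}

print(max_series(bounded))       # 1500
print(max_series(with_user_id))  # 150000000
```

Adding a single unbounded label such as a user ID multiplies the series count by the number of users, which is why identifiers like that usually belong in logs or trace attributes rather than in metric labels.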

Designing for Observability

When designing a system, it's crucial to instrument applications appropriately from the outset. This involves integrating logging frameworks, metrics exporters (e.g., Prometheus clients), and distributed tracing libraries (e.g., OpenTelemetry SDKs). Architectural decisions around data ingestion, storage and querying (e.g., the ELK stack for logs, Prometheus for metrics with Grafana for dashboards, Jaeger or Zipkin for traces), and correlation mechanisms, such as a trace ID shared between a request's logs and its spans, are fundamental to building an observable system. A robust observability strategy helps maintain system reliability, reduce MTTR (Mean Time To Recovery), and ensure operational efficiency.
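
One common correlation mechanism is to stamp every log line with the ID of the trace it belongs to. The sketch below shows the idea with Python's standard `logging` and `contextvars` modules only. The names `current_trace_id`, `TraceIdFilter`, `JsonFormatter`, `build_logger`, and `handle_request` are hypothetical. In a real deployment the trace ID would more likely come from a tracing SDK such as OpenTelemetry, or from an incoming W3C `traceparent` header, rather than being generated locally.

```python
import contextvars
import io
import json
import logging
import uuid
from typing import Optional

# Holds the trace ID of the request currently being handled ("-" if none).
current_trace_id = contextvars.ContextVar("current_trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the active trace ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        return True  # never drop the record, only annotate it


class JsonFormatter(logging.Formatter):
    """Render records as one JSON object per line for easy indexing."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
            "trace_id": getattr(record, "trace_id", "-"),
        })


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes trace-annotated JSON lines to `stream`."""
    logger = logging.getLogger("checkout")
    logger.handlers.clear()
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(TraceIdFilter())
    logger.addHandler(handler)
    return logger


def handle_request(logger: logging.Logger, trace_id: Optional[str] = None) -> str:
    """Set the trace ID for the duration of one request, then restore it."""
    trace_id = trace_id or uuid.uuid4().hex
    token = current_trace_id.set(trace_id)
    try:
        logger.info("payment authorized")
    finally:
        current_trace_id.reset(token)
    return trace_id
```

With this in place, a log search for a given `trace_id` returns exactly the lines written while that request was being handled, and the same ID can be used to look up the trace in the tracing backend.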


Architecture Design

Design this yourself

Design an observability platform capable of ingesting, storing, and visualizing logs, metrics, and traces from a large-scale microservices architecture. Detail the data ingestion pipelines, storage solutions (considering high cardinality for logs vs. time-series for metrics), and correlation mechanisms across these three telemetry types for effective debugging and performance analysis. Consider how to handle high-volume data, ensure data retention, and provide a unified user interface.


Other design angles

· Design the data ingestion pipeline for logs and metrics, focusing on scalability and reliability, capable of handling petabytes of data daily.
· Design a distributed tracing system for a polyglot microservices environment, ensuring low-overhead instrumentation and effective visualization of service dependencies and latencies.
· Design a monitoring and alerting system using metrics, covering key performance indicators (KPIs) and service level objectives (SLOs) for a critical e-commerce application.
