InfoQ Architecture·March 28, 2026

Implementing Distributed Tracing in Elixir Actor Systems at Discord

Discord engineered a custom solution to integrate distributed tracing into their Elixir-based actor model without performance penalties. This article details their "Transport" library, which wraps messages with trace context, addressing the challenge of propagating tracing information in message-passing systems where traditional HTTP header propagation is not applicable. Key design decisions included dynamic sampling, context propagation optimizations, and a gradual rollout strategy to ensure zero-downtime deployment and scalability for millions of concurrent users.


The Challenge of Distributed Tracing in Actor Models

Traditional distributed tracing solutions, like OpenTelemetry, are well-suited for HTTP-based microservices where trace context can be easily passed in request headers. However, actor-based systems, such as those built with Elixir, operate on a message-passing paradigm where processes exchange arbitrary messages without a built-in metadata layer. Discord faced the fundamental challenge of propagating trace context across these messages to achieve end-to-end visibility across their high-scale chat infrastructure, which serves millions of concurrent users. This required a custom approach that would be ergonomic, support various message types, and allow for zero-downtime deployment.

Discord's Solution: The Transport Library and Envelope Primitive

Discord developed a "Transport" library that introduces an "Envelope" primitive. This Envelope is a simple struct that wraps the original message along with a serialized trace carrier. This design allows trace context to travel with the message itself, effectively embedding the metadata required for tracing into the actor communication flow. The library provides drop-in replacements for standard Elixir GenServer functions, automating the wrapping and unwrapping of messages with trace context.

```elixir
defmodule Discord.Transport.Envelope do
  # Wraps the original message together with its serialized trace context.
  defstruct [:message, trace_carrier: []]

  def wrap_message(message) do
    # Inject the current OpenTelemetry trace context into the carrier
    # (requires the opentelemetry_api dependency).
    %__MODULE__{message: message, trace_carrier: :otel_propagator_text_map.inject([])}
  end
end
```
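A hypothetical usage sketch of the drop-in replacements described above. Module names here (`Sketch.*`) are assumptions, and the `Envelope` struct is redeclared with a stubbed carrier so the sketch is self-contained; in the real library, `wrap_message/1` injects the active trace context via `:otel_propagator_text_map.inject/1` as shown in the block above.

```elixir
defmodule Sketch.Envelope do
  defstruct [:message, trace_carrier: []]

  def wrap_message(message) do
    %__MODULE__{message: message, trace_carrier: stub_carrier()}
  end

  # Stub: the real implementation injects the current span's context here.
  defp stub_carrier, do: [{"traceparent", "stub"}]
end

defmodule Sketch.Transport do
  # Drop-in replacement for GenServer.cast/2: the outgoing message is wrapped
  # in an Envelope, so the receiver can restore trace context before handling.
  def cast(server, message) do
    GenServer.cast(server, Sketch.Envelope.wrap_message(message))
  end
end
```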

Gradual Rollout Strategy

A critical aspect of the solution was the ability to handle both old-style bare messages and new Envelope-wrapped messages during deployment. This "normalization" feature allowed Discord to perform a gradual migration without requiring a full system restart or simultaneous updates across their entire fleet, ensuring service continuity.
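The normalization idea can be sketched with two function clauses, one per message shape. Module names here are hypothetical, and `Envelope` is redeclared so the sketch compiles on its own; the real code would also extract the trace context from `trace_carrier` before handling the message.

```elixir
defmodule Rollout.Envelope do
  defstruct [:message, trace_carrier: []]
end

defmodule Rollout.Normalize do
  alias Rollout.Envelope

  # New-style wrapped message: unwrap it (the real library also restores
  # the sender's trace context from trace_carrier here).
  def normalize(%Envelope{message: message}), do: message

  # Old-style bare message from a not-yet-upgraded node: pass through.
  def normalize(message), do: message
end
```

Because both clauses return a plain message, callers are unaffected by whether the sending node has been upgraded yet, which is what makes the gradual rollout safe.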

Optimizations for High-Scale Actor Systems

  • Dynamic Sampling: To prevent overwhelming observability infrastructure, Discord implemented dynamic sampling based on message fanout size. Operations fanning out to many recipients (e.g., millions of guild members) had a significantly lower sampling rate, while single-recipient messages were always sampled. This preserved useful data without generating excessive spans.
  • Propagate Context Only for Sampled Operations: Initial deployments showed performance overhead due to unpacking trace context even for unsampled operations. The fix was to only include trace context in envelopes for operations that were actually sampled, significantly reducing serialization and parsing costs.
  • Session Service Optimization: In the sessions service, capturing new spans during fanout increased CPU usage. By preventing sessions from starting new traces (allowing them only to continue existing ones), CPU overhead was nearly eliminated.
  • gRPC Request Filtering: For inter-service communication with Python APIs, a filter was built to read only the sampling flag from the encoded trace context string without full deserialization. If a trace wasn't sampled, the context was not propagated, drastically reducing processing time.
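The fanout-based sampling decision from the first bullet could look roughly like this. The module name and the target of ~100 sampled recipients per fanout are assumptions for illustration, not Discord's actual numbers.

```elixir
defmodule FanoutSampling do
  # Single-recipient sends are always sampled.
  def sample?(1), do: true

  # Large fanouts are sampled at a rate that shrinks with fanout size,
  # so a million-member guild broadcast produces few spans, not millions.
  def sample?(fanout) when is_integer(fanout) and fanout > 1 do
    :rand.uniform() < min(1.0, 100 / fanout)
  end
end
```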

These optimizations were crucial for making distributed tracing viable at Discord's scale, turning an essential debugging tool into a performant and integral part of their system monitoring. The ability to diagnose complex incidents, such as 16-minute connection delays to a guild, highlights the value of this architectural investment.
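The gRPC filtering idea above can be sketched as a check on just the trailing flags field, assuming the encoded context is a W3C `traceparent` string (the format used by the OpenTelemetry text-map propagator); the module name is hypothetical.

```elixir
defmodule TraceparentFlag do
  # Reads only the flags field of a W3C `traceparent` value
  # ("00-<trace-id>-<span-id>-<flags>") to check the sampled bit,
  # without deserializing the full trace context.
  def sampled?(traceparent) when is_binary(traceparent) do
    with [_version, _trace_id, _span_id, flags] <- String.split(traceparent, "-"),
         {n, ""} <- Integer.parse(flags, 16) do
      Bitwise.band(n, 0x01) == 1
    else
      _ -> false
    end
  end
end
```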


When designing observability for highly concurrent, message-passing systems, consider custom context propagation mechanisms rather than forcing HTTP-centric patterns. Prioritize dynamic sampling and lazy deserialization of trace context to manage overhead at scale.

Tags: distributed tracing, observability, elixir, actor model, performance optimization, sampling, gen_server, message passing
