Discord engineered a custom solution to integrate distributed tracing into their Elixir-based actor model without performance penalties. This article details their "Transport" library, which wraps messages with trace context, addressing the challenge of propagating tracing information in message-passing systems where traditional HTTP header propagation is not applicable. Key design decisions included dynamic sampling, context propagation optimizations, and a gradual rollout strategy to ensure zero-downtime deployment and scalability for millions of concurrent users.
Read original on InfoQ Architecture

Traditional distributed tracing solutions, like OpenTelemetry, are well-suited for HTTP-based microservices where trace context can be easily passed in request headers. However, actor-based systems, such as those built with Elixir, operate on a message-passing paradigm where processes exchange arbitrary messages without a built-in metadata layer. Discord faced the fundamental challenge of propagating trace context across these messages to achieve end-to-end visibility across their high-scale chat infrastructure, which serves millions of concurrent users. This required a custom approach that would be ergonomic, support various message types, and allow for zero-downtime deployment.
Discord developed a "Transport" library that introduces an "Envelope" primitive. This Envelope is a simple struct that wraps the original message along with a serialized trace carrier. This design allows trace context to travel with the message itself, effectively embedding the metadata required for tracing into the actor communication flow. The library provides drop-in replacements for standard Elixir GenServer functions, automating the wrapping and unwrapping of messages with trace context.
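defmodule Discord.Transport.Envelope do
  # Pairs the original message with the serialized trace context.
  defstruct [:message, trace_carrier: []]

  # Injects the calling process's current trace context into an empty carrier.
  def wrap_message(message) do
    %__MODULE__{message: message, trace_carrier: :otel_propagator_text_map.inject([])}
  end
end

Gradual Rollout Strategy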
A critical aspect of the solution was the ability to handle both old-style bare messages and new Envelope-wrapped messages during deployment. This "normalization" feature allowed Discord to perform a gradual migration without requiring a full system restart or simultaneous updates across their entire fleet, ensuring service continuity.
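The normalization idea can be sketched as follows. This is a minimal illustration under stated assumptions, not Discord's actual Transport code: the `TransportSketch` module names, `normalize/1`, and the `call/3` wrapper are hypothetical, and the trace carrier is passed in explicitly so the example runs without the OpenTelemetry dependency.

```elixir
defmodule TransportSketch.Envelope do
  # Same shape as the Envelope struct shown earlier (hypothetical copy).
  defstruct [:message, trace_carrier: []]

  # Wrap a message with a carrier supplied by the caller.
  def wrap(message, carrier \\ []) do
    %__MODULE__{message: message, trace_carrier: carrier}
  end

  # Normalization: new-style envelopes are unwrapped, while old-style
  # bare messages pass through with an empty carrier.
  def normalize(%__MODULE__{message: message, trace_carrier: carrier}) do
    {message, carrier}
  end

  def normalize(bare_message), do: {bare_message, []}
end

defmodule TransportSketch.Echo do
  use GenServer
  alias TransportSketch.Envelope

  def start_link(_opts \\ []), do: GenServer.start_link(__MODULE__, :ok)

  # Hypothetical drop-in for GenServer.call/2 that wraps before sending.
  def call(server, message, carrier \\ []) do
    GenServer.call(server, Envelope.wrap(message, carrier))
  end

  @impl true
  def init(:ok), do: {:ok, nil}

  @impl true
  def handle_call(request, _from, state) do
    # Works for callers on either side of the rollout.
    {message, carrier} = Envelope.normalize(request)
    # A real implementation would attach `carrier` to the process's trace
    # context here; the sketch simply echoes both values back.
    {:reply, {message, carrier}, state}
  end
end
```

Because the receiver normalizes every request, a node running the new code can serve both `TransportSketch.Echo.call(pid, :ping, carrier)` from upgraded callers and a plain `GenServer.call(pid, :ping)` from nodes that have not been upgraded yet.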
Optimizations such as dynamic sampling and cheaper context propagation were crucial for making distributed tracing viable at Discord's scale, turning an essential debugging tool into a performant and integral part of their system monitoring. The ability to diagnose complex incidents, such as 16-minute connection delays to a guild, highlights the value of this architectural investment.
When designing observability for highly concurrent, message-passing systems, consider custom context propagation mechanisms rather than forcing HTTP-centric patterns. Prioritize dynamic sampling and lazy deserialization of trace context to manage overhead at scale.
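As a rough sketch of what these two recommendations could look like in code, consider the module below. Everything in it is an assumption made for illustration: the `TransportSketch.Sampling` module, the fanout thresholds, and the sampling rates are hypothetical and not taken from Discord's implementation. The only external fact relied on is the W3C `traceparent` layout (`version-traceid-spanid-flags`), where the lowest bit of the flags marks a sampled trace.

```elixir
defmodule TransportSketch.Sampling do
  # Hypothetical dynamic sampling policy: the more recipients a message
  # fans out to, the lower the sampling rate. Thresholds are illustrative.
  def sample_rate(recipients) when recipients <= 1, do: 1.0
  def sample_rate(recipients) when recipients <= 100, do: 0.1
  def sample_rate(_recipients), do: 0.001

  # `roll` defaults to a random float in [0.0, 1.0); it is a parameter so
  # the decision can be tested deterministically.
  def sample?(recipients, roll \\ :rand.uniform()) do
    roll < sample_rate(recipients)
  end

  # Lazy check: read only the flags field of the traceparent header and
  # skip building a full span context when the trace is not sampled.
  def carrier_sampled?(carrier) do
    case List.keyfind(carrier, "traceparent", 0) do
      {_key,
       <<_version::binary-size(2), "-", _trace_id::binary-size(32), "-",
         _span_id::binary-size(16), "-", flags::binary-size(2)>>} ->
        case Integer.parse(flags, 16) do
          {value, ""} -> Bitwise.band(value, 1) == 1
          _ -> false
        end

      _ ->
        false
    end
  end
end
```

A receiver could call `carrier_sampled?/1` before doing any further work on the carrier, so that unsampled messages, which are the large majority under aggressive sampling, cost only a key lookup and a short binary match.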