Distributed tracing: OpenTelemetry in practice — the good and the bad


We've spent the last six months rolling out OpenTelemetry across roughly 25 core microservices, and honestly it's been a mixed bag.

On the one hand, distributed tracing is invaluable for debugging complex interactions across our system. Pinpointing latency hotspots and following failure paths that span multiple services has become significantly easier.

On the other hand, the overhead is non-trivial. We're seeing an average 5-10% increase in latency for requests that generate extensive traces. The data volume is enormous too, even with a 1% sampling rate in production, and we're spending a lot of time tuning collector configurations and storage.

It feels like a necessary evil, but I'm curious how other teams are managing the operational challenges and cost of OTel at scale. Are there specific strategies or tools you've found effective at mitigating the downsides while keeping the debugging benefits?
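For anyone who wants to reason about the sampling and volume side, here's a minimal stdlib-only sketch of ratio-based head sampling plus a back-of-envelope volume estimate. It is not our actual setup and it doesn't call the OpenTelemetry SDK; it just mirrors the idea behind the SDK's `TraceIdRatioBased` sampler (sample a trace when the low bits of its trace ID fall under a threshold). The function names and the traffic numbers in the usage lines are made up for illustration.

```python
import random


def make_ratio_sampler(ratio: float):
    """Return a deterministic sampler that keeps roughly `ratio` of traces.

    Hypothetical helper: the decision depends only on the trace ID, so every
    service that sees the same trace ID reaches the same keep/drop decision.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    mask = (1 << 64) - 1              # look only at the low 64 bits of the ID
    bound = round(ratio * (1 << 64))  # IDs below this threshold are sampled

    def should_sample(trace_id: int) -> bool:
        return (trace_id & mask) < bound

    return should_sample


def estimate_daily_bytes(requests_per_day: int, spans_per_trace: int,
                         bytes_per_span: int, ratio: float) -> float:
    """Rough stored volume per day: sampled traces * spans * span size."""
    return requests_per_day * ratio * spans_per_trace * bytes_per_span


if __name__ == "__main__":
    sampler = make_ratio_sampler(0.01)  # the 1% rate from the post
    rng = random.Random(0)
    ids = [rng.getrandbits(128) for _ in range(100_000)]
    kept = sum(sampler(t) for t in ids)
    print(f"kept {kept} of {len(ids)} traces")

    # Assumed numbers, purely illustrative: 10M requests/day,
    # 20 spans per trace, 500 bytes per span.
    print(estimate_daily_bytes(10_000_000, 20, 500, 0.01), "bytes/day")
```

Plugging in your own request rate and span sizes makes it easier to see whether the sampling ratio or the spans-per-trace count is the bigger lever on storage cost.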


Comments