This article outlines a robust push-based observability architecture using AWS services to stream CloudWatch metrics to self-hosted OpenTelemetry collectors within a private VPC. It addresses common challenges of traditional pull-based monitoring at scale, such as API throttling and vendor lock-in, by leveraging CloudWatch Metric Streams, Amazon Kinesis Data Firehose, and AWS Lambda for real-time, cost-efficient data delivery.
Read original on AWS Architecture Blog
Organizations are increasingly adopting open-source observability frameworks like OpenTelemetry to reduce licensing costs and avoid vendor lock-in. The approach suits enterprises that need sub-minute latency for real-time alerting and want to consolidate observability data from multiple sources. Although CloudWatch Metric Streams natively support OpenTelemetry endpoints, self-hosting collectors inside a Virtual Private Cloud (VPC) creates a connectivity problem that requires an intermediary.
The article highlights the drawbacks of traditional pull-based monitoring, exemplified by Prometheus, at scale. Frequent API polling can lead to high costs, API throttling, metric loss, and gaps in observability data, all of which undermine real-time alerting. A push-based architecture, in which metrics are sent to collectors as they are produced, avoids the polling loop entirely and is especially well suited to event-driven systems that need near real-time data.
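To make the polling cost concrete, here is a rough back-of-envelope sketch of how many API requests a pull-based scraper issues per day. The function name and all figures are illustrative assumptions, not numbers from the article; in particular, `metrics_per_request` should be checked against the current limits of whichever API the scraper calls.

```python
import math


def polling_requests_per_day(metric_count: int,
                             interval_seconds: int,
                             metrics_per_request: int) -> int:
    """Estimate daily API requests for a pull-based scraper.

    All three inputs are assumptions supplied by the caller:
    - metric_count: number of distinct metrics being scraped
    - interval_seconds: time between polls
    - metrics_per_request: how many metrics one API call can return
    """
    requests_per_poll = math.ceil(metric_count / metrics_per_request)
    polls_per_day = (24 * 60 * 60) // interval_seconds
    return requests_per_poll * polls_per_day


# Hypothetical workload: 10,000 metrics polled every 60 seconds,
# batched 500 metrics per request.
print(polling_requests_per_day(10_000, 60, 500))
```

Halving the interval doubles the request count, which is why pushing for sub-minute freshness with polling tends to run into throttling first.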
The proposed solution addresses the challenge of streaming CloudWatch metrics to OpenTelemetry collectors running in a private VPC. Amazon Kinesis Data Firehose can deliver to HTTP endpoints, but only public ones, so an intermediary AWS Lambda function bridges the gap to the private collector. Keeping both the metric data and the OpenTelemetry collector inside the customer's VPC lets the architecture meet strict data privacy requirements.
Key Components and Their Roles
The architecture comprises:
1. CloudWatch Metric Streams: Streams metrics in near real-time, configured to output in JSON format to Firehose.
2. Amazon Kinesis Data Firehose: A fully managed service for reliable real-time data capture, transformation, and delivery.
3. AWS Lambda Transform Function: Invoked synchronously by Firehose to push metrics securely through an internal Network Load Balancer (NLB) to the VPC-based collector. This function preprocesses and filters data as needed.
4. OpenTelemetry Collector (on EC2): Runs as a container on EC2 instances within a private subnet, acting as a central hub that receives, processes, and forwards telemetry data (via its receivers, processors, and exporters) to various backends (e.g., Amazon Managed Service for Prometheus, AWS X-Ray, Amazon CloudWatch).
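The Lambda transform step can be sketched roughly as follows. This is a minimal illustration, not the article's implementation: the `COLLECTOR_URL` value, the `ALLOWED_NAMESPACES` filter, and the newline-delimited JSON payload sent to the collector are all assumptions. The parts taken from documented behavior are the Firehose transformation contract (base64-encoded `data`, and a response echoing each `recordId` with a `result` of `Ok`, `Dropped`, or `ProcessingFailed`) and the `namespace` field in the Metric Streams JSON output.

```python
import base64
import json
import os
import urllib.request

# Hypothetical settings; neither value comes from the article.
COLLECTOR_URL = os.environ.get(
    "COLLECTOR_URL", "http://collector.internal.example:8080/metrics"
)
ALLOWED_NAMESPACES = {"AWS/EC2", "AWS/Lambda"}


def post_to_collector(body: bytes) -> None:
    """POST one batch to the collector behind the internal NLB."""
    request = urllib.request.Request(
        COLLECTOR_URL,
        data=body,
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
    # urlopen raises on connection errors and on 4xx/5xx responses.
    with urllib.request.urlopen(request, timeout=5) as response:
        response.read()


def transform(event, send=post_to_collector):
    """Filter each Firehose record and forward the kept metrics."""
    results = []
    for record in event["records"]:
        try:
            raw = base64.b64decode(record["data"]).decode("utf-8")
            # Metric Streams JSON output is one JSON object per line.
            metrics = [json.loads(line) for line in raw.splitlines() if line.strip()]
            kept = [m for m in metrics if m.get("namespace") in ALLOWED_NAMESPACES]
            if kept:
                send("\n".join(json.dumps(m) for m in kept).encode("utf-8"))
            result = "Ok"
        except Exception:
            # Marking the record as failed lets Firehose apply its own error handling.
            result = "ProcessingFailed"
        results.append(
            {"recordId": record["recordId"], "result": result, "data": record["data"]}
        )
    return {"records": results}


def lambda_handler(event, context):
    return transform(event)
```

Passing the sender in as a parameter keeps the filtering logic testable without network access; in a deployed function the default `post_to_collector` would be used, with the Lambda attached to the VPC so it can reach the internal NLB.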
An internal Network Load Balancer (NLB) distributes TCP traffic across the OpenTelemetry collectors running on EC2 instances. This setup provides a scalable and secure way to ingest metrics into a customer's private observability infrastructure and to aggregate metrics from diverse sources into a single view.