InfoQ Cloud · May 26, 2026

Architecting Cloud-Native Kafka with Tiered Storage and Disaggregated Compute

This article explores the evolution of Apache Kafka towards a cloud-native, disaggregated architecture, focusing on tiered storage and the future of diskless operation. It details how architects can leverage new KIPs (Kafka Improvement Proposals) to optimize cost, performance, and operational flexibility in cloud environments. Key themes include the financial implications of cloud object storage, the need for granular cost attribution, and improvements in consumer group rebalancing and parallelism.



The Shift to Cloud-Native Kafka and Economic Considerations

Apache Kafka was originally optimized for bare-metal deployments with local disk storage and is now being re-architected to run well in cloud environments. This shift, often dubbed an "economic operating system" approach, puts cost optimization and resource disaggregation at the center of the design. Moving from fixed infrastructure costs to usage-based cloud billing introduces new cost drivers, particularly network egress and object storage API charges. To manage these variable expenses, platform teams need telemetry-driven chargeback pipelines and cost-aware governance workflows.

Tiered Storage: Decoupling Compute and Capacity

KIP-405 (Kafka Tiered Storage) is a cornerstone of this evolution. It introduces two distinct data retention layers: a latency-optimized local tier on block storage and a capacity-optimized remote tier on object storage such as Amazon S3. An internal Remote Log Manager asynchronously copies closed log segments from local disks to the remote tier, and brokers delete the local copies once the local retention window expires. Cold data therefore sits on cheaper object storage rather than on replicated broker disks, which significantly reduces storage costs.
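As a rough illustration of how the two tiers are expressed in configuration, the sketch below models the topic-level settings for a tiered topic and checks one constraint between them. The config keys are the KIP-405 topic settings; the retention values and the `validate_tiered_config` helper are hypothetical and only illustrate the idea. It also assumes the brokers already have `remote.log.storage.system.enable=true` and a remote storage plugin installed.

```python
DAY_MS = 24 * 60 * 60 * 1000

# Topic-level settings for a tiered topic. Values are illustrative only.
tiered_topic_config = {
    "remote.storage.enable": "true",        # opt this topic into the remote tier
    "retention.ms": str(365 * DAY_MS),      # total retention across both tiers
    "local.retention.ms": str(2 * DAY_MS),  # how long segments stay on broker disks
}


def validate_tiered_config(config):
    """Hypothetical helper: return a list of problems (empty if none found)."""
    problems = []
    if config.get("remote.storage.enable") != "true":
        problems.append("remote.storage.enable must be 'true' for tiering")
    total = int(config.get("retention.ms", "-1"))
    local = int(config.get("local.retention.ms", "-2"))
    # local.retention.ms = -2 means "inherit retention.ms";
    # retention.ms = -1 means unlimited total retention.
    if local != -2 and total != -1 and local > total:
        problems.append("local.retention.ms must not exceed retention.ms")
    return problems


print(validate_tiered_config(tiered_topic_config))
```

Keeping `local.retention.ms` short is what shrinks the block storage footprint; `retention.ms` continues to govern how long data remains readable overall.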


When to Enable Tiered Storage

Tiered storage is most beneficial for Kafka clusters with long retention requirements (e.g., compliance logs kept for seven years) or replay-heavy analytics workloads that scan historical data. In those cases it cuts storage costs by offloading cold data to cheaper object storage. For real-time workloads with short retention, however, the added complexity and object storage API overhead may not yield a positive ROI.
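A back-of-the-envelope comparison can help decide which side of that trade-off a cluster falls on. The sketch below is a simplified model, not a pricing tool: it assumes local data is stored once per replica on block storage, remote data is stored as a single logical copy in object storage, and it ignores API and egress charges. All prices and sizes are hypothetical placeholders.

```python
def monthly_storage_cost(total_gb, hot_gb, replication_factor,
                         block_price_gb, object_price_gb, tiered):
    """Estimate monthly storage cost for one cluster (simplified model)."""
    if not tiered:
        # Everything lives on broker disks, once per replica.
        return total_gb * replication_factor * block_price_gb
    hot = min(hot_gb, total_gb)          # data still within local retention
    cold = max(total_gb - hot_gb, 0)     # data served from the remote tier
    return hot * replication_factor * block_price_gb + cold * object_price_gb


# Hypothetical inputs: 10 TB retained, 200 GB hot, replication factor 3.
untiered = monthly_storage_cost(10_000, 200, 3, 0.08, 0.023, tiered=False)
tiered = monthly_storage_cost(10_000, 200, 3, 0.08, 0.023, tiered=True)
print(f"untiered: {untiered:.2f}, tiered: {tiered:.2f}")
```

When `hot_gb` is close to `total_gb`, as with short-retention real-time topics, the two results converge and the savings no longer justify the extra moving parts.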

FinOps Challenges: Request Amplification and Cost Attribution

While tiered storage reduces block storage costs, it introduces new FinOps risks. Cloud object storage bills not only for data at rest but also for API calls such as GET requests. A misconfigured consumer that fetches large historical datasets in small chunks can cause request amplification, generating thousands of S3 GET requests per second and an unexpected spike in the bill. One mitigation is to tune `max.partition.fetch.bytes` so that fetch sizes align better with remote segment sizes; KIP-1178 proposes a dedicated configuration for remote fetches so that this tuning does not also affect local reads.
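The amplification effect is easy to estimate. The sketch below assumes, as a simplification, that every remote fetch for a partition results in one object storage GET of at most the configured fetch size; real remote storage plugins may batch or cache differently. The per-request price is a hypothetical parameter.

```python
MIB = 1024 * 1024


def estimate_remote_gets(bytes_to_replay, fetch_bytes):
    """Number of GETs needed if each fetch maps to one ranged GET (assumption)."""
    return -(-bytes_to_replay // fetch_bytes)  # integer ceiling division


def estimate_get_cost(num_gets, price_per_1000_gets):
    """Request charges for a replay, given a hypothetical per-1000-GET price."""
    return num_gets / 1000 * price_per_1000_gets


replay_bytes = 1024 * 1024 * MIB  # 1 TiB of historical data
for fetch_bytes in (1 * MIB, 16 * MIB):  # 1 MiB is the consumer default
    gets = estimate_remote_gets(replay_bytes, fetch_bytes)
    print(fetch_bytes // MIB, "MiB fetch ->", gets, "GETs")
```

Under this model, raising the fetch size by a factor of 16 lowers the request count by the same factor, which is why the fetch size is the first setting to inspect when GET charges spike.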

Attributing these new costs was also difficult at first, because traditional Kafka metrics lacked client-level granularity. KIP-1267 addresses this by proposing client-level JMX telemetry (e.g., `RemoteFetchBytesPerSec`, `RemoteFetchRequestsPerSec`), which lets platform teams attribute remote fetch costs to specific applications and charge them back accordingly. This level of attribution is a prerequisite for robust FinOps governance.
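Once per-client remote fetch volumes are available, a simple chargeback step can split the object storage bill proportionally. The following is a minimal sketch under the assumption that the bytes per client have already been collected from telemetry over the billing period; the client IDs and the `allocate_remote_fetch_cost` helper are hypothetical.

```python
def allocate_remote_fetch_cost(bill, bytes_by_client):
    """Split a bill across clients in proportion to their remote fetch bytes."""
    total = sum(bytes_by_client.values())
    if total == 0:
        # Nobody read from the remote tier, so nothing is attributed.
        return {client: 0.0 for client in bytes_by_client}
    return {client: bill * fetched / total
            for client, fetched in bytes_by_client.items()}


# Hypothetical usage for one billing period.
usage = {"analytics-replay": 750, "fraud-detector": 250}
print(allocate_remote_fetch_cost(100.0, usage))
```

A production pipeline would likely weight request counts as well as bytes, since request charges and transfer charges are billed separately.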

Enhancing Operational Agility and Multi-tenancy

Improved Rebalancing Protocol: Kafka's legacy rebalancing protocol paused processing across the consumer group whenever consumers were added or removed. The next-generation protocol (KIP-848) significantly reduces this disruption, making Kubernetes-native autoscaling for Kafka consumers much more practical.
Virtual Clusters: Teams have had to choose between costly dedicated clusters and weak isolation in shared environments. Virtual clusters offer a middle path, providing strict tenant boundaries without duplicating the underlying infrastructure.
Share Groups (KIP-932): Traditionally, Kafka coupled consumer parallelism to partition count, because each partition is assigned to at most one consumer in a group. Share Groups break this constraint, allowing teams to scale consumers independently without expensive re-partitioning of topics, which improves flexibility and resource utilization.
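The parallelism difference between classic consumer groups and Share Groups can be sketched as a small model. This is an illustration of the constraint described above, not Kafka's assignment logic: it ignores broker-side limits on group size and assumes every consumer in a share group can be handed records whenever the topic has at least one partition.

```python
def active_consumers_classic(partitions, consumers):
    """Classic group: each partition goes to at most one consumer."""
    return min(partitions, consumers)


def active_consumers_share_group(partitions, consumers):
    """Share group: consumers may share partitions (simplified model)."""
    return consumers if partitions > 0 else 0


partitions = 6
for consumers in (4, 6, 12):
    print(consumers,
          active_consumers_classic(partitions, consumers),
          active_consumers_share_group(partitions, consumers))
```

In the classic model, scaling beyond six consumers on a six-partition topic leaves the extra instances idle, which is the situation that previously forced re-partitioning.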
