InfoQ Architecture · May 26, 2026

Architecting Cloud-Native Kafka: From Tiered Storage to a Diskless Future

This article explores the architectural evolution of Apache Kafka in cloud-native environments, focusing on the disaggregation of compute and storage through tiered storage and the challenges and solutions related to cost attribution and scaling. It details how Kafka is adapting to cloud economics by moving from local disk dependency towards object storage, addressing FinOps risks, and improving multi-tenancy and consumer scaling capabilities.

Introduction to Cloud-Native Kafka Challenges

Apache Kafka was traditionally optimized for bare-metal deployments: a shared-nothing design that relies on local disk for low-latency, append-only logs. In cloud environments this design faces significant economic and operational challenges. "Lift and shift" approaches lead to high costs, driven by network transfer fees for mirroring data across Availability Zones and by the expense of premium cloud block storage for long retention periods. The article argues that Kafka is evolving into a highly disaggregated architecture to survive in the cloud, and emphasizes the need for an "economic operating system" in which platform teams manage costs through telemetry-driven chargeback and cost-aware workflows.

Tiered Storage: Decoupling Compute and Capacity

KIP-405, Kafka Tiered Storage, is a fundamental architectural shift that decouples data retention into two layers: a latency-optimized local tier (e.g., block storage) and a capacity-optimized remote tier (e.g., object storage like Amazon S3). The Remote Log Manager orchestrates asynchronous movement of log segments from local disk to external storage. This is crucial for reducing block storage costs, especially for workloads with long retention requirements like compliance and audit logs, where cold data can be moved to much cheaper object storage.
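As a rough illustration of the two-layer retention split, the sketch below builds topic-level settings that keep a short hot window on local disk while the full retention period is served from the remote tier. The property names `remote.storage.enable`, `retention.ms`, and `local.retention.ms` are Kafka topic configs; the helper function and its parameters are hypothetical, and the sketch assumes the brokers already run with `remote.log.storage.system.enable=true` and a remote storage plugin installed.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def tiered_topic_config(total_retention_days: float,
                        local_retention_days: float) -> dict[str, str]:
    """Build topic settings for a short local tier and a long remote tier.

    Hypothetical helper: it only assembles a config map and does not
    talk to a cluster.
    """
    if total_retention_days <= 0 or local_retention_days <= 0:
        raise ValueError("retention values must be positive")
    if local_retention_days > total_retention_days:
        raise ValueError("local retention cannot exceed total retention")
    return {
        # Allow closed log segments of this topic to be copied to the remote tier.
        "remote.storage.enable": "true",
        # Total retention across the local and remote tiers.
        "retention.ms": str(int(total_retention_days * MS_PER_DAY)),
        # How long segments stay on local disk before becoming eligible for deletion.
        "local.retention.ms": str(int(local_retention_days * MS_PER_DAY)),
    }


if __name__ == "__main__":
    # Example: 90 days of audit-log retention with a 1-day local hot window.
    for key, value in tiered_topic_config(90, 1).items():
        print(f"{key}={value}")
```

The resulting map could be passed to whatever topic-management tooling a platform team already uses; the 90-day and 1-day values are illustrative only.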

When to Enable Tiered Storage

Architects should enable tiered storage for clusters that retain data well beyond their active processing window (e.g., more than seven days), as the majority of data becomes cold and can be offloaded to object storage for significant cost savings. Workloads with short retention or latency-sensitive hot-read patterns may not see positive ROI due to increased API overhead.
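The storage side of this trade-off can be estimated with simple arithmetic. The sketch below is a simplified model, not a pricing tool: it assumes that data on block storage is paid for once per replica, that the remote tier is billed for a single copy, and that API request charges are ignored. The function name and all prices are hypothetical placeholders.

```python
def monthly_storage_cost(total_gb: float,
                         hot_fraction: float,
                         replication_factor: int,
                         block_price_per_gb: float,
                         object_price_per_gb: float) -> tuple[float, float]:
    """Return (all_local_cost, tiered_cost) per month under the assumptions above."""
    if not 0.0 <= hot_fraction <= 1.0:
        raise ValueError("hot_fraction must be between 0 and 1")
    # Without tiering, every replica keeps the full retained data on block storage.
    all_local = total_gb * replication_factor * block_price_per_gb
    hot_gb = total_gb * hot_fraction
    cold_gb = total_gb - hot_gb
    # With tiering, only the hot window is replicated on block storage;
    # cold data is assumed to be billed once in object storage.
    tiered = (hot_gb * replication_factor * block_price_per_gb
              + cold_gb * object_price_per_gb)
    return all_local, tiered


if __name__ == "__main__":
    # Placeholder numbers: 10 TB retained, 10% hot, RF 3, made-up prices per GB-month.
    local, tiered = monthly_storage_cost(10_000, 0.10, 3, 0.10, 0.02)
    print(f"all local: {local:.2f}, tiered: {tiered:.2f}")
```

As the hot fraction approaches 1 (short retention, mostly hot reads), the two figures converge, which is consistent with the caution above about workloads that may not see positive ROI.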

FinOps Risks: Request Amplification and Cost Attribution

While tiered storage reduces block storage costs, it introduces new FinOps risks, primarily request amplification. Object storage providers bill for API interactions (e.g., GET requests), so a misconfigured Kafka consumer fetching historical data can generate thousands of S3 GET requests per second and cause large spikes in the API bill. To mitigate this, platform engineers should consider aligning the consumer's `max.partition.fetch.bytes` with the broker's remote segment size, although this remains an empirical exercise until KIP-1178 provides a dedicated configuration for remote fetches.

The article also highlights the critical need for cost attribution. KIP-1267 (under discussion) addresses this by proposing granular client-level JMX telemetry (e.g., `RemoteFetchBytesPerSec`) so that remote fetch costs can be attributed accurately to specific consumers, enabling proper chargeback and FinOps governance.
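To make the amplification effect and the chargeback idea concrete, here is a minimal sketch under a deliberately simplified model: each remote fetch is assumed to translate into one object-store GET of at most the fetch size, which real remote storage plugins may not follow. The function names, client names, and dollar amount are hypothetical; the per-client byte counts stand in for the kind of telemetry described above.

```python
def estimate_get_requests(bytes_to_read: int, bytes_per_request: int) -> int:
    """Estimate GET requests needed to replay a byte range (ceiling division)."""
    if bytes_per_request <= 0:
        raise ValueError("bytes_per_request must be positive")
    return (bytes_to_read + bytes_per_request - 1) // bytes_per_request


def attribute_cost(total_cost: float,
                   remote_bytes_by_client: dict[str, int]) -> dict[str, float]:
    """Split a remote-fetch bill across clients in proportion to bytes fetched."""
    total_bytes = sum(remote_bytes_by_client.values())
    if total_bytes == 0:
        return {client: 0.0 for client in remote_bytes_by_client}
    return {client: total_cost * fetched / total_bytes
            for client, fetched in remote_bytes_by_client.items()}


if __name__ == "__main__":
    one_tib = 1024 ** 4
    # Replaying 1 TiB with 1 MiB fetches versus 64 MiB fetches.
    print(estimate_get_requests(one_tib, 1024 ** 2))
    print(estimate_get_requests(one_tib, 64 * 1024 ** 2))
    # Hypothetical bill of 10.0 split between two consumers.
    print(attribute_cost(10.0, {"audit-replay": 300, "dashboard": 100}))
```

Under this model, moving from 1 MiB to 64 MiB fetches reduces the request count by a factor of 64 for the same replay, which is why fetch sizing matters for the API bill.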

Enhanced Scaling and Multi-tenancy with Next-Gen Protocols

Kafka's legacy rebalancing protocol made dynamic consumer scaling disruptive. The next-generation consumer rebalance protocol (KIP-848) significantly reduces processing pauses during scale events, making Kubernetes-native autoscaling much more practical. For multi-tenancy, virtual clusters are proposed as a middle ground between dedicated clusters and weakly isolated shared ones, providing strict tenant boundaries without infrastructure duplication. Share Groups (KIP-932) further break the traditional coupling of partition count to consumer parallelism, allowing teams to scale consumers independently without costly topic re-partitioning. This is crucial for efficient resource utilization in dynamic cloud environments.
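The decoupling of partition count from consumer parallelism can be illustrated with a toy model. It assumes that in a classic consumer group each partition is owned by at most one group member, while in a share group several members may read from the same partition. The function is hypothetical and ignores any broker-side limits.

```python
def active_consumers(partitions: int, consumers: int, share_group: bool) -> int:
    """Return how many consumers can do useful work under the toy model."""
    if partitions < 0 or consumers < 0:
        raise ValueError("counts must be non-negative")
    if partitions == 0:
        return 0
    if share_group:
        # Members of a share group can cooperatively consume the same partition.
        return consumers
    # Classic group: consumers beyond the partition count sit idle.
    return min(partitions, consumers)


if __name__ == "__main__":
    # 6 partitions, autoscaler grows the deployment to 10 consumers.
    print(active_consumers(6, 10, share_group=False))
    print(active_consumers(6, 10, share_group=True))
```

In the classic case, the extra four consumers would only become useful after re-partitioning the topic, which is the cost that Share Groups aim to avoid.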
