Dev.to #architecture · March 13, 2026

Hadoop Architecture Fundamentals: Interview Questions and Design Insights

This article, framed as a set of interview questions, provides a solid overview of Hadoop's core architecture, covering HDFS, YARN, and MapReduce. It delves into critical design considerations like fault tolerance, resource management, data locality, and scalability, offering both basic and advanced insights into building and operating big data systems.


Hadoop remains a foundational technology for big data processing, even with the rise of newer frameworks. Understanding its architecture is crucial for designing scalable data pipelines and distributed storage solutions. This guide, presented as interview questions, breaks down the key components and architectural decisions behind Hadoop.

Core Hadoop Components and Their Roles

  • HDFS (Hadoop Distributed File System): Designed for high-throughput access to large datasets, it uses a block-based storage model (default 128 MB blocks) to reduce NameNode metadata overhead and enable sequential reads. Fault tolerance is achieved through default 3x data replication across DataNodes, with rack-aware placement for resilience against rack failures.
  • YARN (Yet Another Resource Negotiator): Decouples resource management from job scheduling. The ResourceManager allocates containers, and each application runs its own ApplicationMaster, enabling multiple processing frameworks (Spark, Tez, Flink) to share the same cluster.
  • MapReduce: A programming model and framework for processing large datasets in parallel. It consists of Map tasks for parallel processing and Reduce tasks for aggregation, with a crucial shuffle-and-sort phase in between that often determines performance.
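The map/shuffle/reduce flow described above can be sketched in a few lines of Python. This is a single-process simulation for illustration only (real Hadoop distributes splits across DataNodes and spills the shuffle to disk); the function names are ours, not Hadoop API:

```python
from collections import defaultdict
from itertools import chain

# Map task: emit (word, 1) pairs for every word in an input split.
def map_task(split):
    for line in split:
        for word in line.split():
            yield (word.lower(), 1)

# Shuffle-and-sort: group all emitted values by key across map outputs.
def shuffle(map_outputs):
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_outputs):
        groups[key].append(value)
    return sorted(groups.items())

# Reduce task: aggregate the grouped values for each key.
def reduce_task(grouped):
    return {key: sum(values) for key, values in grouped}

splits = [["the quick brown fox"], ["the lazy dog", "the fox"]]
counts = reduce_task(shuffle([map_task(s) for s in splits]))
```

The shuffle step is the expensive part in practice: every distinct key's values must be brought together, which in a real cluster means sorting and moving data over the network.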

Key Architectural Concepts and Design Decisions

ℹ️

HDFS Fault Tolerance

HDFS ensures data durability through block replication. If a DataNode fails, the NameNode detects it via missing heartbeats and automatically triggers re-replication from surviving copies to maintain the configured replication factor. Rack-awareness places replicas across different racks to mitigate rack-level outages.
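The re-replication logic in the callout can be sketched as NameNode-side bookkeeping. The node and block names below are hypothetical, and a real NameNode also applies rack-aware placement rules when picking targets:

```python
REPLICATION_FACTOR = 3

# block_id -> set of DataNodes currently holding a replica (illustrative state).
block_map = {
    "blk_001": {"dn1", "dn2", "dn3"},
    "blk_002": {"dn2", "dn3", "dn4"},
}
live_nodes = {"dn1", "dn2", "dn3", "dn4", "dn5"}

# On missed heartbeats: drop the dead node, then schedule new replicas
# from surviving copies until the configured factor is restored.
def handle_datanode_failure(dead, block_map, live_nodes):
    live_nodes.discard(dead)
    for holders in block_map.values():
        holders.discard(dead)
        candidates = sorted(live_nodes - holders)
        while len(holders) < REPLICATION_FACTOR and candidates:
            holders.add(candidates.pop(0))

handle_datanode_failure("dn2", block_map, live_nodes)
```

After the failure of `dn2`, every block is back at three replicas, copied from the surviving holders rather than from the lost node.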

  • NameNode vs. DataNode: The NameNode manages file system metadata (directory tree, file-to-block mapping) but stores no actual data. DataNodes store data blocks. This separation allows independent scaling of metadata and data storage.
  • HDFS Federation: Addresses NameNode scalability limits by allowing multiple independent NameNodes, each managing a portion of the namespace, all sharing the same DataNode pool.
  • Data Locality: A fundamental optimization in Hadoop, where processing tasks are scheduled on nodes that hold the data blocks, minimizing network transfers and improving performance. The scheduler prioritizes node-local, then rack-local, then any node.
  • High-Availability (HA) NameNode: Achieved with two NameNodes (Active/Standby) sharing an edit log via JournalNodes. ZooKeeper-based failover controllers manage leader election and fence the failed active node to prevent split-brain scenarios.
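The data-locality preference order (node-local, then rack-local, then any node) can be expressed as a simple ranking. This is a sketch of the scheduling idea, not YARN's actual scheduler; the cluster topology below is invented for the example:

```python
# Hypothetical topology: node -> rack.
node_rack = {"n1": "rackA", "n2": "rackA", "n3": "rackB"}

# Classify how close a candidate node is to the block's replicas.
def locality(task_node, replica_nodes, node_rack):
    if task_node in replica_nodes:
        return "node-local"
    if node_rack[task_node] in {node_rack[n] for n in replica_nodes}:
        return "rack-local"
    return "off-rack"

# Pick the free node with the best locality for a given block.
def pick_node(free_nodes, replica_nodes, node_rack):
    rank = {"node-local": 0, "rack-local": 1, "off-rack": 2}
    return min(free_nodes,
               key=lambda n: rank[locality(n, replica_nodes, node_rack)])

# A replica lives on n1: n2 (same rack) beats n3 (other rack).
chosen = pick_node(["n2", "n3"], {"n1"}, node_rack)
```

Each hop down the preference order trades more network transfer for scheduling flexibility, which is why minimizing off-rack reads matters at scale.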

Practical Design Considerations and Trade-offs

The article also touches on practical design choices relevant to modern big data architectures:

  • HDFS vs. Object Storage (e.g., S3): HDFS offers strong data locality for on-cluster processing, while object storage provides infinite scalability and compute-storage separation at the cost of network latency. The choice depends on workload requirements (locality vs. elasticity).
  • Spark vs. MapReduce: Spark excels in iterative algorithms and interactive queries due to in-memory caching. MapReduce is still viable for very large, single-pass ETL jobs where disk-based shuffle is acceptable.
  • Schema Evolution: Critical for data lakes. Using formats like Avro or Parquet with embedded schemas and schema registries allows consumers to handle backward and forward compatibility without breaking pipelines.
  • Cluster Sizing: Involves estimating ingest volume, replication factor, retention, and compression to calculate storage needs, then sizing CPU and memory based on expected concurrent workloads and YARN container requirements.
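The cluster-sizing arithmetic above can be made concrete. The figures here (90-day retention, 2:1 compression, 25% headroom for temporary and shuffle space) are illustrative assumptions, not from the article:

```python
# Back-of-envelope raw-disk sizing for an HDFS cluster.
def raw_storage_tb(ingest_tb_per_day, retention_days,
                   replication=3, compression_ratio=2.0, headroom=0.25):
    # Logical data retained after compression.
    logical = ingest_tb_per_day * retention_days / compression_ratio
    # Multiply by the replication factor, then add working-space headroom.
    return logical * replication * (1 + headroom)

# 10 TB/day, 90-day retention: 10 * 90 / 2 * 3 * 1.25 = 1687.5 TB raw.
needed = raw_storage_tb(10, 90)
```

Even with 2:1 compression, the 3x replication factor means raw disk is several multiples of the logical data volume, which is the main cost lever in HDFS sizing.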
📌

Designing a 10 TB/day Data Pipeline

For continuous ingestion, land data in HDFS with tools like Kafka or Flume, partitioning it by date and source. Process the data hourly with Spark or MapReduce, then promote cleaned output into a curated layer, typically exposed via Hive tables.
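The date/source partitioning mentioned above usually maps to a directory convention in HDFS. The base path and partition key names below are one common layout, chosen for illustration:

```python
from datetime import date

# Build a Hive-style partition path: /base/source=<src>/dt=<YYYY-MM-DD>.
def partition_path(base, source, day):
    return f"{base}/source={source}/dt={day:%Y-%m-%d}"

p = partition_path("/data/raw/events", "clickstream", date(2026, 3, 13))
```

Encoding partitions in the path lets Hive and Spark prune irrelevant directories at query time, so an hourly job only scans the day and source it needs.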

Hadoop · HDFS · YARN · MapReduce · Big Data · Distributed Storage · Data Pipelines · Scalability
