Hadoop is an on-premise cluster framework built on Google's original MapReduce concept (2004 paper). Its philosophy: everything is a file, and applications follow a shared-nothing model — tasks run independently without sharing internal state.
HDFS is the storage backbone of Hadoop and acts as the data lake. It is not POSIX-compliant and uses large block sizes (64 MB by default in Hadoop 1, 128 MB since Hadoop 2).
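A quick back-of-the-envelope sketch of why large blocks matter: fewer blocks per file means less metadata for the NameNode to track. The file and block sizes below are illustrative assumptions, not from the source.

```python
# Sketch: how HDFS block size affects the number of blocks the NameNode
# must track. File size and block sizes are illustrative assumptions.
import math

def num_blocks(file_size_bytes: int, block_size_bytes: int) -> int:
    """Number of HDFS blocks needed to store a file of the given size."""
    return math.ceil(file_size_bytes / block_size_bytes)

TB = 1024 ** 4
MB = 1024 ** 2

print(num_blocks(1 * TB, 128 * MB))  # 8192 blocks (Hadoop 2 default)
print(num_blocks(1 * TB, 64 * MB))   # 16384 blocks (Hadoop 1 default)
```

Halving the block size doubles the NameNode's bookkeeping load, which is why HDFS favors blocks orders of magnitude larger than a local filesystem's.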
Hadoop workloads can be lifted to the cloud by persisting data in GCS (Google Cloud Storage). The data can then be queried in place from BigQuery through external (federated) tables over Cloud Storage.
HDFS Limitations in Cloud:
The hdfs:// scheme can be replaced with gs:// in cloud-native environments.
Note: Cloud Storage doesn't support directories in the same way HDFS does; there is no real directory hierarchy, and "directories" are simulated via object name prefixes.
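The scheme swap above can be sketched as a small URI-rewriting helper. This is a minimal illustration; the bucket name, path, and function name are hypothetical.

```python
# Sketch: translating HDFS URIs to Cloud Storage URIs when lifting a
# workload to the cloud. Bucket and paths are hypothetical examples.

def hdfs_to_gcs(uri: str, bucket: str) -> str:
    """Rewrite an hdfs:// URI to a gs:// URI in the given bucket."""
    prefix = "hdfs://"
    if not uri.startswith(prefix):
        raise ValueError(f"not an HDFS URI: {uri}")
    rest = uri[len(prefix):]
    # Drop the namenode authority (host:port), keep only the path.
    path = rest.split("/", 1)[1] if "/" in rest else ""
    return f"gs://{bucket}/{path}"

print(hdfs_to_gcs("hdfs://namenode:8020/data/logs/2020/01.csv", "my-datalake"))
# gs://my-datalake/data/logs/2020/01.csv
```

In practice the Cloud Storage connector for Hadoop handles this mapping transparently; the sketch only shows that the path structure carries over while the filesystem semantics (real directories, rename, append) do not.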
Common Commands:
$ hadoop fs -lsr /
# hadoop-fuse-dfs dfs://hadoop-hdfs /mnt
$ ls /mnt/
# mount -t nfs -o vers=3,proto=tcp,nolock host:/ /mnt
$ ls /mnt/
MapReduce is a batch-oriented processing framework designed for large-scale data. It brings computation to the data and follows a highly parallelizable model.
Map: processes data by filtering, projecting, and transforming.
  (Key1, Value1) -> List(Key2, Value2)
Shuffle: groups values by keys across partitions, typically sorted.
  List(Key2, Value2) -> Sort(Partition(List(Key2, List(Value2))))
Reduce: aggregates all values for a single key.
  List(Key2, List(Value2)) -> List(Key3, Value3)
All keys and values must be serializable for network transmission.
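The three phases above can be simulated in-process on a word-count example. This is a sketch of the data flow only: in real Hadoop the shuffle happens over the network (hence the serialization requirement), and the function names and input records here are illustrative.

```python
# Sketch: the Map -> Shuffle -> Reduce pipeline simulated in-process on a
# word-count example. In real Hadoop, keys/values cross the network between
# phases, which is why they must be serializable.
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    """(Key1, Value1) -> List(Key2, Value2): emit (word, 1) per word."""
    for _line_no, line in records:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """List(Key2, Value2) -> sorted groups of (Key2, List(Value2))."""
    pairs = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(pairs, key=itemgetter(0)):
        yield (key, [value for _, value in group])

def reduce_phase(grouped):
    """List(Key2, List(Value2)) -> List(Key3, Value3): sum the counts."""
    for key, values in grouped:
        yield (key, sum(values))

records = [(0, "the quick brown fox"), (1, "the lazy dog")]
counts = dict(reduce_phase(shuffle_phase(map_phase(records))))
print(counts)
# {'brown': 1, 'dog': 1, 'fox': 1, 'lazy': 1, 'quick': 1, 'the': 2}
```

Each phase is independently parallelizable: mappers run per input split, and after the sorted shuffle each reducer sees every value for its keys, which is what makes the model scale out.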