Hadoop is an on-premises cluster framework built on Google's original MapReduce concept (the 2004 paper). Its philosophy: everything is a file, and applications share state with one another through files rather than keeping it internal.
HDFS is the storage backbone of Hadoop and acts as the data lake. It is not POSIX-compliant and uses large block sizes (64 MB by default in Hadoop 1.x; 128 MB in Hadoop 2.x and later).
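As a quick illustration of the block math, a minimal Python sketch (the 1 GiB file size is hypothetical):

BLOCK_SIZE = 64 * 1024 * 1024             # Hadoop 1.x default; 128 MiB in 2.x+
file_size = 1 * 1024**3                   # a hypothetical 1 GiB file
num_blocks = -(-file_size // BLOCK_SIZE)  # ceiling division
print(num_blocks)                         # 16 blocks, each tracked by the NameNode

Each block is also replicated (3x by default) and tracked in NameNode memory, which is one reason HDFS favors a few large files over many small ones.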
Hadoop workloads can be lifted to the cloud by persisting data in GCS (Google Cloud Storage). The data can then be queried from BigQuery through external (federated) tables over Cloud Storage; note that EXTERNAL_QUERY is BigQuery's federation mechanism for Cloud SQL, not for Cloud Storage.
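A minimal sketch of such a query from Python, assuming the Hadoop output was exported as CSV under a hypothetical gs://my-bucket/hadoop-export/ prefix (bucket, path, and table name are all illustrative):

from google.cloud import bigquery

client = bigquery.Client()

# Describe an external table backed by files sitting in Cloud Storage.
external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://my-bucket/hadoop-export/*.csv"]
external_config.autodetect = True  # infer the schema from the files

# Attach the definition as a temporary table for this one query.
job_config = bigquery.QueryJobConfig(
    table_definitions={"hadoop_export": external_config}
)
for row in client.query("SELECT COUNT(*) AS n FROM hadoop_export",
                        job_config=job_config).result():
    print(row.n)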
HDFS Limitations in Cloud:
hdfs:// URIs can be replaced with gs:// in cloud-native environments (via the Cloud Storage connector for Hadoop).
Note: Cloud Storage doesn't support directories the way HDFS does: the namespace is flat, and slashes in object names only simulate a hierarchy (see the listing sketch below).
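A minimal sketch of that flat namespace in practice, using the google-cloud-storage Python client (the bucket name and prefix are hypothetical):

from google.cloud import storage

client = storage.Client()

# There is no actual directory "logs/2023/": objects simply have names
# containing slashes, and a prefix filter simulates a directory listing.
for blob in client.list_blobs("my-bucket", prefix="logs/2023/"):
    print(blob.name)  # e.g. logs/2023/01/part-00000

One consequence: renaming a "directory" in Cloud Storage means copying and deleting every object under the prefix, whereas an HDFS rename is a cheap metadata operation.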
Common Commands:
$ hadoop fs -ls -R /                                   # recursive listing of HDFS (-lsr is the deprecated spelling)
# hadoop-fuse-dfs dfs://hadoop-hdfs /mnt               # as root: mount HDFS as a local filesystem via FUSE
$ ls /mnt/                                             # browse HDFS contents through the FUSE mount
# mount -t nfs -o vers=3,proto=tcp,nolock host:/ /mnt  # as root: mount HDFS via the NFS Gateway (NFSv3 only)
$ ls /mnt/                                             # browse HDFS contents through the NFS mount
MapReduce is a batch-oriented processing framework designed for large-scale data. It brings computation to the data and follows a highly parallelizable model.
Phases:
Map: (Key1, Value1) → List(Key2, Value2). Processes data by filtering, projecting, and transforming.
Shuffle/Sort: List(Key2, Value2) → Sort(Partition(List(Key2, List(Value2)))). Groups values by key across partitions, typically sorted.
Reduce: List(Key2, List(Value2)) → List(Key3, Value3). Aggregates all values for a single key.
Note: all keys and values must be serializable for network transmission.
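To make the three signatures concrete, here is a minimal single-process Python sketch of the pipeline as a word count (it illustrates the data flow only, not Hadoop's actual Java API):

from collections import defaultdict

def map_fn(key, line):              # (Key1, Value1) → List(Key2, Value2)
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):                 # group by key, sorted: the shuffle/sort phase
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return sorted(groups.items())   # List(Key2, List(Value2))

def reduce_fn(key, values):         # (Key2, List(Value2)) → (Key3, Value3)
    return (key, sum(values))

splits = {1: "the quick brown fox", 2: "the lazy dog"}  # hypothetical input split
mapped = [kv for k, line in splits.items() for kv in map_fn(k, line)]
for key, values in shuffle(mapped):
    print(reduce_fn(key, values))   # ('brown', 1) ... ('the', 2)

In a real cluster the shuffle moves the grouped pairs across the network between mapper and reducer nodes, which is exactly why the serializability note above matters.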