What is the Hadoop Ecosystem?

Hadoop is an on-premises cluster framework built on Google's original MapReduce paper (2004). Its philosophy echoes Unix: everything is a file, and the ecosystem's applications share state by reading and writing files in a common distributed file system.

Hadoop Ecosystem Components

HDFS (Hadoop Distributed File System)

HDFS is the storage backbone of Hadoop and acts as the data lake; in cloud deployments, object stores such as Amazon S3 or Google Cloud Storage play the equivalent role. HDFS is not POSIX-compliant and uses large block sizes (64 MB by default in Hadoop 1.x, 128 MB in Hadoop 2.x and later).
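
To see the block size in a live cluster (a minimal sketch; /data/events.log is a hypothetical path):

$ hdfs getconf -confKey dfs.blocksize           # configured default block size in bytes (134217728 = 128 MB)
$ hadoop fs -stat "%o %r %n" /data/events.log   # block size, replication factor, and name of one file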

Hadoop workloads can be lifted to the cloud by persisting data in GCS (Google Cloud Storage). BigQuery can then query the data in place through external tables (federated data sources); note that EXTERNAL_QUERY is the federation mechanism for Cloud SQL and Spanner, not for GCS.
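
As a minimal sketch of the external-table approach (dataset, table, and bucket names are hypothetical, and the files are assumed to be Parquet):

$ bq query --use_legacy_sql=false '
  CREATE EXTERNAL TABLE warehouse.hadoop_events
  OPTIONS (
    format = "PARQUET",
    uris = ["gs://my-bucket/events/*.parquet"]
  )'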

HDFS Limitations in Cloud:

hdfs:// paths can be replaced with gs:// paths in cloud-native environments, via the Cloud Storage connector for Hadoop (see the sketch below).

Note: Cloud Storage doesn't support directories the way HDFS does. Its namespace is flat; "directories" are just shared prefixes in object names, so a directory rename is a copy-and-delete rather than an atomic metadata operation.
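
A minimal sketch of the scheme swap, assuming the Cloud Storage connector is installed and a bucket named my-bucket exists; the same hadoop fs tooling works against either scheme:

$ hadoop fs -ls hdfs:///logs/                          # on-premises HDFS
$ hadoop distcp hdfs:///logs/ gs://my-bucket/logs/     # copy the data into Cloud Storage
$ hadoop fs -ls gs://my-bucket/logs/                   # same commands, new scheme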

Common Commands:

$ hadoop fs -lsr /                                       # recursively list HDFS (-lsr is the deprecated form of -ls -R)
# hadoop-fuse-dfs dfs://hadoop-hdfs /mnt                 # mount HDFS as a local file system via FUSE (root prompt)
$ ls /mnt/                                               # browse it with ordinary Unix tools
# mount -t nfs -o vers=3,proto=tcp,nolock host:/ /mnt    # or mount through the HDFS NFSv3 gateway (root prompt)
$ ls /mnt/

MapReduce

MapReduce is a batch-oriented processing framework designed for large-scale data. It brings the computation to the data, exploiting locality instead of shipping data across the network, and its model is highly parallelizable. Jobs are packaged and submitted to the cluster, as sketched below.
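
A typical submission, sketched with hypothetical JAR, class, and path names:

$ hadoop jar wordcount.jar com.example.WordCount /input /output   # run a packaged MapReduce job
$ hadoop fs -cat /output/part-r-00000 | head                      # inspect one reducer's output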

Logical Phases of MapReduce

1. Map

Processes each input record independently, filtering, projecting, and transforming it into intermediate key-value pairs.

2. Shuffle

Redistributes the map output so that all values for the same key land in the same partition, typically sorted by key along the way.

3. Reduce

Aggregates all values for a single key into a final result (illustrated by the pipeline sketch after this list).
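
The three phases map directly onto a Unix pipeline. A minimal sketch, assuming a web-server access log (access.log) whose seventh whitespace-separated field is the requested URL:

$ cat access.log |       # input: one HTTP request per line
    awk '{print $7}' |   # map: project out the requested URL
    sort |               # shuffle: bring identical keys together
    uniq -c |            # reduce: count the values for each key
    sort -rn | head -5   # report the five most-requested URLs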

All keys and values must be serializable for network transmission; in Hadoop's Java API they implement the Writable interface, and keys additionally implement WritableComparable so the shuffle can sort them.