Hadoop is an on-premises cluster. It was built on Google's original MapReduce design from the 2004 white paper.
Everything is a file.
Apps share their internal state.
HDFS = Hadoop Distributed File System
HDFS is the data lake
HDFS: not a POSIX file system
Huge blocks, 64 MB by default: even a 2 MB file still occupies its own block, so lots of small files are wasteful.
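A minimal Java sketch of how a client can ask HDFS which block size and replication it would apply; it assumes a Hadoop client configuration is on the classpath, and the path is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path p = new Path("hdfs://namenode/data/small-file.txt"); // hypothetical path
        FileSystem fs = p.getFileSystem(conf);
        // What the cluster would use for new files under this path.
        System.out.println("default block size:  " + fs.getDefaultBlockSize(p) + " bytes");
        System.out.println("default replication: " + fs.getDefaultReplication(p));
    }
}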
Lifting and shifting Hadoop workloads is very often the first step into the cloud: HDFS can then run in the cloud, and this works without code changes. But HDFS is not meant to stay in such an architecture in the long run, because of how HDFS works on the cluster: the block size, the data locality, and the replication of the data.
You can even persist HDFS data in GCS and query it directly from BigQuery via a federated query (external table).
hdfs:// ==> gs://
A directory rename in HDFS is not the same as in Cloud Storage, because Cloud Storage has no concept of directories.
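A minimal sketch, assuming the Cloud Storage connector is installed and configured; the bucket and paths are hypothetical. The point is that the FileSystem API stays the same when hdfs:// becomes gs://, but rename() is a cheap atomic metadata update on HDFS, while on gs:// the connector has to copy and delete every object under the prefix:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GcsInsteadOfHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Before the lift-and-shift this would have been hdfs://namenode/data/events
        Path input = new Path("gs://my-bucket/data/events");
        FileSystem fs = input.getFileSystem(conf);

        // Listing works the same way on both file systems.
        for (FileStatus status : fs.listStatus(input)) {
            System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
        }

        // Same call on both file systems, very different cost on gs://.
        fs.rename(input, new Path("gs://my-bucket/data/events-archived"));
    }
}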
How do I use it?
Through the Hadoop CLI:
$ hadoop fs -ls -R /
Mounted via FUSE:
# hadoop-fuse-dfs dfs://hadoop-hdfs /mnt
$ ls /mnt/
Mounted via the NFS gateway:
# mount -t nfs -o vers=3,proto=tcp,nolock host:/ /mnt
$ ls /mnt/
Overview
MapReduce is based on the concept of records.
Record = (Key, Value)
Key: comparable, serializable // everything has to be serializable because it is going to be sent over the wire.
Value: serializable
The logical phases of MapReduce are: input, map, shuffle, reduce, output.
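A sketch of what "comparable, serializable" looks like in Hadoop's Java API: a custom key implements org.apache.hadoop.io.WritableComparable, where write()/readFields() is the wire serialization used during the shuffle and compareTo() is the sort order. The key type and its fields here are purely illustrative:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.WritableComparable;

public class UserDayKey implements WritableComparable<UserDayKey> {
    private long userId; // illustrative fields
    private int day;

    @Override
    public void write(DataOutput out) throws IOException {
        // Serialization: this is what gets sent over the wire during the shuffle.
        out.writeLong(userId);
        out.writeInt(day);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        // Deserialization on the receiving side, in the same field order.
        userId = in.readLong();
        day = in.readInt();
    }

    @Override
    public int compareTo(UserDayKey other) {
        // Ordering the shuffle uses when it sorts records by key.
        int c = Long.compare(userId, other.userId);
        return c != 0 ? c : Integer.compare(day, other.day);
    }
}

In practice such a key should also override hashCode() and equals(), because the default partitioner uses hashCode() to decide which reducer a key goes to.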
Map
You take any kind of data from HDFS and do, for example, projections, filtering, or transformations on this data.
Shuffle
You take this data from map, i.e. the key/value pairs, and receive partitions sorted by key.
Reduce
You take all the values corresponding to the same key and run a computation on that data.
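A minimal end-to-end sketch of these phases, using the classic word count and Hadoop's org.apache.hadoop.mapreduce API; the input and output paths are whatever is passed on the command line:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map: project each input line into (word, 1) records.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer it = new StringTokenizer(line.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce: the shuffle has already grouped and sorted by key, so all
    // counts for one word arrive in a single call; sum them up.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            result.set(sum);
            context.write(word, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Works against hdfs:// or gs:// paths alike, e.g.
        // hadoop jar wordcount.jar WordCount /input /output
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}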