What is the Hadoop Ecosystem?

Hadoop is an on-premise cluster. It was built on Google's original MapReduce design from the 2004 white paper.

Everything is a file.

Apps share their internal state.

Hadoop Ecosystem

HDFS

Cloud counterpart: object storage such as Amazon S3.

HDFS = Hadoop Distributed File System
HDFS is the data lake.

HDFS is not a POSIX file system.
Huge blocks (64 MB by default in older versions, 128 MB today): even a 2 MB file occupies a full block entry in the NameNode's metadata, which is why many small files are expensive.
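
A minimal Java sketch of that distinction (assuming a reachable HDFS with default client config; the path /data/example.txt is hypothetical): the block size reported per file is an allocation/metadata unit, independent of the file's actual length.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockSizeDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        FileStatus st = fs.getFileStatus(new Path("/data/example.txt"));
        System.out.println("block size: " + st.getBlockSize()); // e.g. 67108864 (64 MB)
        System.out.println("file size : " + st.getLen());       // actual bytes, e.g. ~2 MB
    }
}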

Lifting and shifting Hadoop workloads is very often the first step into the cloud: HDFS can then be run on cloud VMs, and this works without code changes. But HDFS is not meant for such an architecture in the long run, because of how HDFS works on a cluster: its block size, data locality, and data replication all assume local disks rather than cloud object storage.

You can even persist HDFS data in GCS and query it there directly from BigQuery as an external (federated) data source.

hdfs:// ==> gs://
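
Because Hadoop's FileSystem API is scheme-agnostic, the switch is mostly a path change once the GCS connector is on the classpath. A minimal sketch (namenode host/port and bucket name are hypothetical):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SchemeSwap {
    static void list(String uri) throws Exception {
        FileSystem fs = FileSystem.get(URI.create(uri), new Configuration());
        for (FileStatus st : fs.listStatus(new Path(uri))) {
            System.out.println(st.getPath());
        }
    }

    public static void main(String[] args) throws Exception {
        list("hdfs://namenode:8020/data/");   // on-premise cluster
        list("gs://my-bucket/data/");         // same code against GCS
    }
}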

Directory rename in HDFS is not the same as in Cloud Storage, because Cloud Storage has no concept of directories: what looks like a directory rename becomes a copy plus delete of every object under the prefix.

How do I use it?

Via the CLI (recursive listing; -lsr is the old spelling of -ls -R):
$ hadoop fs -ls -R /

Via a FUSE mount (as root), after which HDFS looks like a local directory tree:
# hadoop-fuse-dfs dfs://hadoop-hdfs /mnt
$ ls /mnt

Via the NFS gateway (as root):
# mount -t nfs -o vers=3,proto=tcp,nolock host:/ /mnt
$ ls /mnt

MapReduce

Overview

MapReduce is based on the concept of records.
Record = (Key, Value)
Key: comparable, serializable // everything has to be serializable because it is going to be sent over the wire
Value: serializable // see the Writable sketch after this list
The logical phases of MapReduce are: input, map, shuffle, reduce, output.
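
The key/value contract above maps onto Hadoop's Writable interfaces; the following is a paraphrased sketch of the two contracts from org.apache.hadoop.io, not the full source:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Values only need to be serializable over the wire.
interface Writable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Keys must additionally be comparable, so the shuffle can sort and partition by key.
interface WritableComparable<T> extends Writable, Comparable<T> {}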

Map
You take any kind of data from HDFS and apply, for example, projections, filtering, or transformations to it.
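
As a sketch of the map phase, here is the canonical word-count mapper in Hadoop's Java API (the class name is mine; with the default TextInputFormat the input key is the line's byte offset, which word count ignores). It transforms each input line into (word, 1) records; the matching reducer appears after the Reduce paragraph below.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        // projection/transformation: split one raw line into (word, 1) records
        StringTokenizer tok = new StringTokenizer(line.toString());
        while (tok.hasMoreTokens()) {
            word.set(tok.nextToken());
            ctx.write(word, ONE);
        }
    }
}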

Shuffle
Takes the (key, value) pairs coming out of map and delivers partitions sorted by key, so that all values for one key end up at the same reducer.

Reduce
Takes all the values corresponding to the same key and runs a computation on them.
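
The matching word-count reducer, again as a sketch: by the time reduce runs, the shuffle has already grouped and sorted all counts for one word together.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();   // fold over all values that share this key
        }
        ctx.write(word, new IntWritable(sum));
    }
}

A Job driver then wires the two together (job.setMapperClass(...), job.setReducerClass(...)) and points them at input and output paths on HDFS.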