What is the Hadoop Ecosystem?

Hadoop is an on-premises cluster framework built on Google's original MapReduce paper (2004). Its philosophy echoes Unix: everything is a file, and the ecosystem's applications share state by reading and writing files in a common distributed file system.

Hadoop Ecosystem Components

HDFS (Hadoop Distributed File System)

HDFS is the storage backbone of Hadoop and acts as the data lake; in cloud deployments, object stores such as Amazon S3 or Google Cloud Storage play the equivalent role. HDFS is not POSIX-compliant and uses large block sizes (64 MB by default in Hadoop 1.x, 128 MB in Hadoop 2.x and later).
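
To see the block size in a live cluster (a minimal sketch; /data/events.log is a hypothetical path):

$ hdfs getconf -confKey dfs.blocksize           # configured default block size in bytes (134217728 = 128 MB)
$ hadoop fs -stat "%o %r %n" /data/events.log   # block size, replication factor, and name of one file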

Hadoop workloads can be lifted to the cloud by persisting data in GCS (Google Cloud Storage). BigQuery can then query the data in place through external tables (federated data sources); note that EXTERNAL_QUERY is the federation mechanism for Cloud SQL and Spanner, not for GCS.
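
As a minimal sketch of the external-table approach (dataset, table, and bucket names are hypothetical, and the files are assumed to be Parquet):

$ bq query --use_legacy_sql=false '
  CREATE EXTERNAL TABLE warehouse.hadoop_events
  OPTIONS (
    format = "PARQUET",
    uris = ["gs://my-bucket/events/*.parquet"]
  )'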

HDFS Limitations in Cloud:

hdfs:// paths can be replaced with gs:// paths in cloud-native environments, via the Cloud Storage connector for Hadoop (see the sketch below).

Note: Cloud Storage doesn't support directories the way HDFS does. Its namespace is flat; "directories" are just shared prefixes in object names, so a directory rename is a copy-and-delete rather than an atomic metadata operation.
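
A minimal sketch of the scheme swap, assuming the Cloud Storage connector is installed and a bucket named my-bucket exists; the same hadoop fs tooling works against either scheme:

$ hadoop fs -ls hdfs:///logs/                          # on-premises HDFS
$ hadoop distcp hdfs:///logs/ gs://my-bucket/logs/     # copy the data into Cloud Storage
$ hadoop fs -ls gs://my-bucket/logs/                   # same commands, new scheme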

Common Commands:

$ hadoop fs -lsr /                                       # recursively list HDFS (-lsr is the deprecated form of -ls -R)
# hadoop-fuse-dfs dfs://hadoop-hdfs /mnt                 # mount HDFS as a local file system via FUSE (root prompt)
$ ls /mnt/                                               # browse it with ordinary Unix tools
# mount -t nfs -o vers=3,proto=tcp,nolock host:/ /mnt    # or mount through the HDFS NFSv3 gateway (root prompt)
$ ls /mnt/

MapReduce

MapReduce is a batch-oriented processing framework designed for large-scale data. It brings the computation to the data, exploiting locality instead of shipping data across the network, and its model is highly parallelizable. Jobs are packaged and submitted to the cluster, as sketched below.
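
A typical submission, sketched with hypothetical JAR, class, and path names:

$ hadoop jar wordcount.jar com.example.WordCount /input /output   # run a packaged MapReduce job
$ hadoop fs -cat /output/part-r-00000 | head                      # inspect one reducer's output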

Logical Phases of MapReduce

1. Map

Processes each input record independently, filtering, projecting, and transforming it into intermediate key-value pairs.

2. Shuffle

Redistributes the map output so that all values for the same key land in the same partition, typically sorted by key along the way.

3. Reduce

Aggregates all values for a single key into a final result (illustrated by the pipeline sketch after this list).
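
The three phases map directly onto a Unix pipeline. A minimal sketch, assuming a web-server access log (access.log) whose seventh whitespace-separated field is the requested URL:

$ cat access.log |       # input: one HTTP request per line
    awk '{print $7}' |   # map: project out the requested URL
    sort |               # shuffle: bring identical keys together
    uniq -c |            # reduce: count the values for each key
    sort -rn | head -5   # report the five most-requested URLs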

All keys and values must be serializable for network transmission; in Hadoop's Java API they implement the Writable interface, and keys additionally implement WritableComparable so the shuffle can sort them.