
Dataproc


Cloud Dataproc runs Hadoop on GCP.
Cloud Dataproc is GCP's managed service for data processing on Apache Spark and Hadoop clusters.
Cloud Dataproc automatically configures the hardware and the software on the clusters for you.
You have multiple ways to manage your cluster, including the GCP Console, the Cloud SDK,
RESTful APIs, and direct SSH access.
Cloud Dataproc lets Spark programs separate compute and storage by reading and writing data in Cloud Storage instead of cluster-local HDFS.
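
For example, a minimal sketch of the Cloud SDK and SSH options (the cluster name, region, and zone are placeholders; Dataproc names the master node <cluster-name>-m):

  # Inspect the cluster's configuration and status with the Cloud SDK
  gcloud dataproc clusters describe example-cluster --region=us-central1

  # SSH directly into the master node
  gcloud compute ssh example-cluster-m --zone=us-central1-a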

Initialization actions let you customize your cluster by specifying executables or
scripts that Cloud Dataproc will run on all nodes in your cluster
immediately after the cluster is set up.
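
A hedged example of passing an initialization action at cluster-creation time (the cluster name, region, and script path in Cloud Storage are placeholders):

  # Run install-deps.sh on every node right after the cluster is set up
  gcloud dataproc clusters create example-cluster \
      --region=us-central1 \
      --initialization-actions=gs://my-bucket/scripts/install-deps.sh \
      --initialization-action-timeout=10m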

Using Cloud Dataproc


  1. Setup ==> create a cluster via the Console, a gcloud command, or a YAML file (see the gcloud sketch after this list)
  2. Configure ==> cluster options (region, zone, global or regional), master node options, worker nodes
  3. Optimize ==> preemptible VMs, custom images, minimum CPU platform
  4. Utilize ==> submit a job via the Console or a gcloud command
  5. Monitor ==> job driver output, logs, Stackdriver monitoring, cluster details graphs
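
As a rough illustration of steps 1 and 4 with gcloud (the cluster name, region, worker counts, and example jar are assumptions; newer SDK releases refer to preemptible workers as secondary workers):

  # 1. Setup/Configure/Optimize: create a cluster with standard and preemptible workers
  gcloud dataproc clusters create example-cluster \
      --region=us-central1 \
      --num-workers=2 \
      --num-preemptible-workers=2

  # 4. Utilize: submit a Spark job to the cluster
  gcloud dataproc jobs submit spark \
      --cluster=example-cluster \
      --region=us-central1 \
      --class=org.apache.spark.examples.SparkPi \
      --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
      -- 1000

When a job is submitted with gcloud, the job driver output from step 5 streams back to the terminal and is also visible in the Console and in the logs.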

Persistent clusters vs. ephemeral clusters

What is the ephemeral model (ephemeral clusters)?

The ephemeral model is the most cost-effective.
You store your data in Cloud Storage so it persists
across multiple temporary processing clusters.
Clusters are allocated as needed to process jobs,
then released and turned down as the jobs finish.

The ephemeral model requires storage to be decoupled from compute.

You get efficient utilization: you don't pay for resources that you don't use.
You can also set a timer so the cluster is automatically deleted a fixed amount of time after it enters an idle state.
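
A minimal sketch of that idle timer, assuming Dataproc's scheduled-deletion flags (the cluster name, region, and 30-minute window are placeholder choices):

  # Automatically delete the cluster 30 minutes after it becomes idle
  gcloud dataproc clusters create ephemeral-cluster \
      --region=us-central1 \
      --max-idle=30m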