What is Cloud Dataproc?

Cloud Dataproc is a fully managed service for running Apache Spark and Apache Hadoop clusters on Google Cloud. It simplifies cluster setup and management while supporting powerful data processing workloads.

Initialization actions let you customize your cluster by specifying executables or scripts that Dataproc runs on all nodes immediately after the cluster is set up.
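
For example, an initialization action can be supplied as a Cloud Storage path when the cluster is created. This is a minimal sketch; the cluster name, bucket, and script name are placeholders:

```bash
# Create a cluster that runs a custom setup script on every node after provisioning.
# gs://my-bucket/install-deps.sh is a hypothetical script you would supply.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --initialization-actions=gs://my-bucket/install-deps.sh \
    --initialization-action-timeout=10m
```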

Using Cloud Dataproc

  1. Setup → Create a cluster via the Console, the `gcloud` CLI, or a YAML configuration file (see the example commands after this list)
  2. Configure → Define cluster options (region, zone) and the master and worker nodes
  3. Optimize → Use preemptible VMs, custom images, or a minimum CPU platform to tune cost and performance
  4. Utilize → Submit jobs via the Console or the `gcloud` CLI
  5. Monitor → View driver output and logs, and track cluster metrics with Cloud Monitoring (formerly Stackdriver)
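
The sketch below walks through these steps with the `gcloud` CLI. The cluster name, machine types, and node counts are illustrative assumptions, not prescribed values:

```bash
# Steps 1-3: create and configure a cluster, adding secondary (preemptible)
# workers to reduce cost. All names and sizes here are placeholders.
gcloud dataproc clusters create example-cluster \
    --region=us-central1 \
    --zone=us-central1-a \
    --master-machine-type=n1-standard-4 \
    --num-workers=2 \
    --worker-machine-type=n1-standard-4 \
    --num-secondary-workers=2

# Step 4: submit a Spark job to the cluster. The SparkPi example jar ships
# with the Dataproc image; 1000 is the number of partitions passed to the job.
gcloud dataproc jobs submit spark \
    --cluster=example-cluster \
    --region=us-central1 \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000

# Step 5: stream the driver output for a submitted job (JOB_ID comes from the
# submit command's output).
gcloud dataproc jobs wait JOB_ID --region=us-central1
```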

Persistent vs. Ephemeral Clusters

Cloud Dataproc supports both persistent and ephemeral cluster models. The ephemeral model is often more cost-effective for temporary or ad-hoc workloads, because you pay only while the cluster exists.

What is an Ephemeral Cluster?

Ephemeral clusters are short-lived and created specifically for a job or a batch of jobs. Once the jobs are complete, the cluster can be deleted automatically, so you stop paying for idle resources.
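
One way to get this behavior is scheduled deletion: the `--max-idle` flag deletes the cluster after it has been idle for the given duration. A minimal sketch, assuming a placeholder bucket and job script:

```bash
# Create a cluster that is deleted automatically after 30 minutes of idleness.
gcloud dataproc clusters create ephemeral-cluster \
    --region=us-central1 \
    --max-idle=30m

# Submit a PySpark job; gs://my-bucket/job.py is a hypothetical script.
gcloud dataproc jobs submit pyspark gs://my-bucket/job.py \
    --cluster=ephemeral-cluster \
    --region=us-central1
```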