
Dataproc


Cloud Dataproc runs Hadoop on GCP.
Cloud Dataproc is GCP's managed service for data processing on Apache Spark and Hadoop clusters.
Cloud Dataproc automatically configures the hardware and the software on the clusters for you.
You have multiple ways to manage your cluster, including the GCP Console, the Cloud SDK,
RESTful APIs, and direct SSH access.
Cloud Dataproc lets Spark programs separate compute and storage by reading and writing data in Cloud Storage instead of cluster-local HDFS.
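
For example, a minimal sketch of the Cloud SDK and SSH options (the cluster name, region, and zone are placeholders; Dataproc names the master node <cluster-name>-m):

  # Inspect the cluster's configuration and status with the Cloud SDK
  gcloud dataproc clusters describe example-cluster --region=us-central1

  # SSH directly into the master node
  gcloud compute ssh example-cluster-m --zone=us-central1-a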

Initialization actions let you customize your cluster by specifying executables or
scripts that Cloud Dataproc will run on all nodes in your cluster
immediately after the cluster is set up.
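
A hedged example of passing an initialization action at cluster-creation time (the cluster name, region, and script path in Cloud Storage are placeholders):

  # Run install-deps.sh on every node right after the cluster is set up
  gcloud dataproc clusters create example-cluster \
      --region=us-central1 \
      --initialization-actions=gs://my-bucket/scripts/install-deps.sh \
      --initialization-action-timeout=10m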

Using Cloud Dataproc


  1. Setup ==> create a cluster via the Console, a gcloud command, or a YAML file (see the gcloud sketch after this list)
  2. Configure ==> cluster options (region, zone, global or regional), master node options, worker nodes
  3. Optimize ==> preemptible VMs, custom images, minimum CPU platform
  4. Utilize ==> submit a job via the Console or a gcloud command
  5. Monitor ==> job driver output, logs, Stackdriver monitoring, cluster details graphs
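
As a rough illustration of steps 1 and 4 with gcloud (the cluster name, region, worker counts, and example jar are assumptions; newer SDK releases refer to preemptible workers as secondary workers):

  # 1. Setup/Configure/Optimize: create a cluster with standard and preemptible workers
  gcloud dataproc clusters create example-cluster \
      --region=us-central1 \
      --num-workers=2 \
      --num-preemptible-workers=2

  # 4. Utilize: submit a Spark job to the cluster
  gcloud dataproc jobs submit spark \
      --cluster=example-cluster \
      --region=us-central1 \
      --class=org.apache.spark.examples.SparkPi \
      --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
      -- 1000

When a job is submitted with gcloud, the job driver output from step 5 streams back to the terminal and is also visible in the Console and in the logs.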

Persistent clusters vs. ephemeral clusters

What is the ephemeral model (ephemeral clusters)?

The ephemeral model is the most cost-effective.
You store your data in Cloud Storage so it persists
across multiple temporary processing clusters.
Clusters are allocated as needed to process jobs,
then released and turned down as the jobs finish.

The ephemeral model requires storage to be decoupled from compute.

You get efficient utilization: you don't pay for resources that you don't use.
You can also set a timer so the cluster is automatically deleted a fixed amount of time after it enters an idle state.
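
A minimal sketch of that idle timer, assuming Dataproc's scheduled-deletion flags (the cluster name, region, and 30-minute window are placeholder choices):

  # Automatically delete the cluster 30 minutes after it becomes idle
  gcloud dataproc clusters create ephemeral-cluster \
      --region=us-central1 \
      --max-idle=30m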