Cloud Dataproc is a fully managed service for running Apache Spark and Apache Hadoop clusters on Google Cloud. It simplifies cluster setup and management while supporting powerful data processing workloads.
- Runs Spark and Hadoop workloads on GCP clusters
- Automatically configures hardware and software
- Supports the GCP Console, Cloud SDK, REST APIs, and SSH for management
- Separates compute and storage by reading and writing directly from Cloud Storage
- Enables creation of job-specific clusters without relying on HDFS
- Reduces admin overhead for running big data pipelines
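The compute/storage separation is easiest to see when submitting a job: input and output live in Cloud Storage, so nothing has to persist on the cluster itself. A minimal sketch, where the bucket, script, cluster, and region names are all placeholders:

```bash
# Submit a PySpark job that reads input from and writes output to
# Cloud Storage, so no data needs to live on the cluster's HDFS.
# (Bucket, cluster, and region names below are hypothetical.)
gcloud dataproc jobs submit pyspark gs://my-bucket/jobs/wordcount.py \
    --cluster=my-cluster \
    --region=us-central1 \
    -- gs://my-bucket/input/ gs://my-bucket/output/
```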
Initialization actions let you customize your cluster by specifying executables or scripts to run on all nodes immediately after setup.
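For example, a cluster can be created with an initialization action that points at a script in Cloud Storage. The script path below is a placeholder:

```bash
# Initialization actions are executables or scripts stored in Cloud
# Storage that run on every node right after the cluster is set up.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --initialization-actions=gs://my-bucket/scripts/install-deps.sh \
    --initialization-action-timeout=10m
```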
## Using Cloud Dataproc
- Setup → Create a cluster via the Console, the `gcloud` CLI, or a YAML config, as sketched below
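A minimal `gcloud` cluster-create command; the cluster name, region, and machine types are placeholders. An equivalent configuration can be kept as YAML and applied with `gcloud dataproc clusters import`:

```bash
# Create a small cluster with one master and two workers.
gcloud dataproc clusters create my-cluster \
    --region=us-central1 \
    --master-machine-type=n1-standard-4 \
    --worker-machine-type=n1-standard-4 \
    --num-workers=2
```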
Cloud Dataproc supports both persistent and ephemeral cluster models. The ephemeral model is often more cost-effective for temporary or ad-hoc workloads.
## What is an Ephemeral Cluster?
Ephemeral clusters are short-lived, created specifically for a single job or batch of jobs. Once the work is complete, the cluster can be deleted automatically.
- Most cost-effective for temporary or ad-hoc workloads
- Data is stored in Cloud Storage, separate from compute
- Clusters are spun up as needed and deleted when finished
- Compute resources are only used (and billed) while jobs are running
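One way to sketch this pattern is to create a cluster with an idle-deletion timeout, run the job, and let scheduled deletion handle cleanup. The names below are placeholders, and the SparkPi example jar path is the one Dataproc images typically ship with:

```bash
# Create a cluster that deletes itself after 30 minutes of inactivity.
gcloud dataproc clusters create ephemeral-cluster \
    --region=us-central1 \
    --max-idle=30m

# Run a job; once the cluster goes idle, scheduled deletion removes it.
gcloud dataproc jobs submit spark \
    --cluster=ephemeral-cluster \
    --region=us-central1 \
    --class=org.apache.spark.examples.SparkPi \
    --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
    -- 1000
```

Dataproc workflow templates offer a managed version of the same idea, creating the cluster, running a set of jobs, and deleting the cluster automatically.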