Dataproc
- Managed service for Hadoop, Spark, Hive, and Pig
- Lift and shift existing Hadoop workloads to GCP
MapReduce
[Figure: map_reduce]
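As a rough illustration of the MapReduce model, here is a single-process word count in plain Python. It is only a sketch of the three phases (map, shuffle, reduce); the function names are made up for this example, and a real Hadoop or Spark job would distribute each phase across the cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values for each key into a single result."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    # Run the mapper over every record, then shuffle and reduce.
    pairs = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(pairs))


counts = word_count(["the quick brown fox", "the lazy dog", "The fox"])
print(counts["the"], counts["fox"])  # prints: 3 2
```

Because each mapper call sees only one record and each reducer call sees only one key, both phases can run in parallel on separate workers.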
Converting from HDFS to Google Cloud Storage
- Copy data to GCS
  - Install the Cloud Storage connector or copy manually
- Update the file prefix in scripts
  - From hdfs:// to gs://
- Run jobs on Dataproc, reading input from and writing output to GCS
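The prefix update can be sketched as a small text rewrite. This is a minimal example, assuming paths appear as literal hdfs:// URIs in the script text; the helper name `hdfs_to_gs` and the bucket name `my-bucket` are hypothetical, and paths built dynamically at runtime would still need a manual check.

```python
import re

# Matches "hdfs://" plus an optional namenode host:port, up to the first "/".
HDFS_PREFIX = re.compile(r"hdfs://[^/\s'\"]*/")


def hdfs_to_gs(script_text, bucket):
    """Rewrite literal hdfs:// URIs to gs://<bucket>/, dropping any host:port."""
    return HDFS_PREFIX.sub(lambda match: f"gs://{bucket}/", script_text)


script = (
    'lines = sc.textFile("hdfs://namenode:8020/logs/input/")\n'
    'lines.saveAsTextFile("hdfs:///results/output/")\n'
)
print(hdfs_to_gs(script, "my-bucket"))
```

Both the fully qualified form (hdfs://namenode:8020/...) and the short form (hdfs:///...) map to the same gs://my-bucket/... layout here, which assumes the data was copied into the bucket with its directory structure preserved.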
Dataproc performance optimization
- Keep your data close to your cluster
  - Place the Dataproc cluster in the same region as the storage bucket
- Larger persistent disks give better performance
  - Use SSD instead of HDD
- Allocate more VMs
  - Use preemptible VMs to save on costs
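The tips above could translate into cluster settings roughly as follows. This sketch only builds the argument list for a `gcloud dataproc clusters create` call and does not run it; the helper name, cluster name, region, and sizes are placeholders, and the flag spellings should be checked against the installed gcloud version before use.

```python
import shlex


def dataproc_create_command(name, region, num_workers, disk_size_gb, num_preemptible):
    """Build (but do not run) a gcloud command reflecting the tuning tips."""
    return [
        "gcloud", "dataproc", "clusters", "create", name,
        f"--region={region}",                          # same region as the bucket
        f"--num-workers={num_workers}",                # allocate more VMs
        "--worker-boot-disk-type=pd-ssd",              # SSD instead of HDD
        f"--worker-boot-disk-size={disk_size_gb}GB",   # larger persistent disk
        f"--num-secondary-workers={num_preemptible}",  # extra low-cost workers
        "--secondary-worker-type=preemptible",
    ]


cmd = dataproc_create_command("demo-cluster", "us-central1", 4, 500, 2)
print(shlex.join(cmd))
```

Preemptible workers can be reclaimed at any time, so they fit best as additional processing capacity alongside the regular workers rather than as a replacement for them.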