Runs a simple PySpark job on a Hadoop compute cluster managed by Cloud Dataproc, as described in:
https://github.com/GoogleCloudPlatform/training-data-analyst/tree/master/courses/unstructured
https://www.udemy.com/gcp-data-engineer-and-cloud-architect/learn/v4/t/lecture/7598404?start=0
https://codelabs.developers.google.com/codelabs/cloud-dataproc-starter/#4
upload.sh: uploads the initialization scripts, input files and PySpark code to a Cloud Storage bucket.
Usage: upload.sh bucket-name folder-name
Init scripts must have the 'init' prefix; input files must be .txt and have the 'input' prefix.
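A minimal sketch of what upload.sh might look like, assuming the files to upload sit in the same directory as the script (everything beyond the two arguments and the file-name prefixes above is an assumption):

    #!/bin/bash
    # upload.sh -- copy init scripts, input files and PySpark code to the bucket
    # usage: ./upload.sh bucket-name folder-name
    BUCKET=$1
    FOLDER=$2
    gsutil cp init* gs://${BUCKET}/${FOLDER}/       # init scripts ('init' prefix)
    gsutil cp input*.txt gs://${BUCKET}/${FOLDER}/  # inputs ('input' prefix, .txt)
    gsutil cp *.py gs://${BUCKET}/${FOLDER}/        # the PySpark code itself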
init_script.sh: installs the Python API client; just an illustration, not actually needed here.
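The install itself is a one-liner, shown only for completeness:

    # install the Google API client library for Python
    pip install --upgrade google-api-python-client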
create_cluster.sh: creates the compute cluster with a basic configuration (1 master, 2 worker nodes)
and runs two initialization scripts:
  init_script.sh (see above)
  datalab.sh from GCP
For more initialization-action templates see https://github.com/GoogleCloudPlatform/dataproc-initialization-actions
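A hedged sketch of create_cluster.sh; the cluster name and region are placeholders, and the init-action paths assume the bucket layout produced by upload.sh plus the public dataproc-initialization-actions bucket for datalab.sh:

    #!/bin/bash
    # create_cluster.sh -- basic Dataproc cluster: 1 master + 2 workers, two init actions
    # usage: ./create_cluster.sh bucket-name folder-name
    BUCKET=$1
    FOLDER=$2
    gcloud dataproc clusters create my-cluster \
        --region us-central1 \
        --num-workers 2 \
        --initialization-actions gs://${BUCKET}/${FOLDER}/init_script.sh,gs://dataproc-initialization-actions/datalab/datalab.sh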
To run: make the scripts executable (chmod u+x ...), then run upload.sh followed by create_cluster.sh.
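For example (the bucket and folder names are placeholders):

    chmod u+x upload.sh create_cluster.sh
    ./upload.sh my-bucket my-folder
    ./create_cluster.sh my-bucket my-folder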
Finally, submit the job to the cluster.
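A sketch of the submit step; the job file name (job.py) is a placeholder, and the cluster name and region follow the create_cluster.sh sketch above:

    # submit the uploaded PySpark job to the cluster
    gcloud dataproc jobs submit pyspark gs://my-bucket/my-folder/job.py \
        --cluster my-cluster \
        --region us-central1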
Note: the above does not take into account authentication for the bucket files when using gsutil in bash.
Either make the files in the bucket public, create IAM rules for the cluster,
or access them via OAuth2, e.g.
https://cloud.google.com/storage/docs/access-control/making-data-public#storage-make-object-public-python

    gsutil acl ch -u AllUsers:R gs://[BUCKET_NAME]/[OBJECT_NAME]
    gsutil iam ch allUsers:objectViewer gs://[BUCKET_NAME]
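Alternatively, instead of making the files public, grant read access to the service account the cluster VMs run as; by default Dataproc uses the Compute Engine default service account (the project number below is a placeholder):

    # grant the cluster's default service account read access to the bucket
    gsutil iam ch serviceAccount:[PROJECT_NUMBER]-compute@developer.gserviceaccount.com:objectViewer gs://[BUCKET_NAME]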