Spark is the engine that performs cluster (distributed) computing, while PySpark is the Python library used to work with Spark.
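To get a quick feel for the difference, here is a minimal sketch (assuming the `pyspark` package is already installed, e.g. via `pip install pyspark`) that starts a Spark session from Python and runs a tiny job:

```python
# Minimal sketch: PySpark is the Python entry point to the Spark engine.
from pyspark.sql import SparkSession

# Create (or reuse) a local Spark session
spark = SparkSession.builder.master("local[*]").appName("hello-spark").getOrCreate()

# Build a small DataFrame and let the Spark engine do the work
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.show()

spark.stop()
```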
- Jupyter Notebook
- PySpark package
- Apache Spark
We assume that you have already installed Anaconda and have some basic knowledge of Python and SQL to follow along with the tutorial.
- Go to the Apache Spark website.
- Choose a Spark release. In our case we used 2.4.3 (May 07 2019).
- Choose a package type. It will be selected by default.
- Click the Download Spark link.
You can follow the link to install Spark.
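Once Spark is set up, a quick way to confirm that Python can see it is to import `pyspark` and check the version (a sketch only; the exact version string depends on the release you downloaded):

```python
# Quick sanity check that PySpark is importable.
# If Spark was installed from the downloaded archive rather than pip,
# you may first need findspark (`pip install findspark`) to locate it:
#   import findspark; findspark.init()
import pyspark
print(pyspark.__version__)  # e.g. "2.4.3"
```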
Credits: Michael Galarnyk
- Karan
- Gunnika
- Ravinder
- To learn about terms such as SparkContext, RDDs, and Transformations/Actions, and about methods like show(), groupBy(), etc., we used the guru99 website (a short sketch illustrating these terms follows below).
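As a small illustration of those terms (a sketch with made-up data, not the tutorial's actual datasets): an RDD transformation such as map() is lazy and only runs when an action such as collect() is called, while DataFrame methods like groupBy() and show() cover similar ideas at a higher level.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("terms-demo").getOrCreate()
sc = spark.sparkContext  # the SparkContext behind the session

# RDD: map() is a transformation (lazy); collect() is an action (triggers execution)
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: x * 2).collect())  # [2, 4, 6, 8]

# DataFrame: groupBy() + count(), then show() to print the result
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
df.groupBy("key").count().show()

spark.stop()
```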
Please read the report to better understand the tutorial. Also, the datasets used are large, so please contact me on LinkedIn and I will give you access to the drive link.