Spark is the engine that performs cluster (distributed) computing, while PySpark is the Python library used to work with Spark.
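To get a quick feel for the difference, here is a minimal sketch (assuming the `pyspark` package is already installed, e.g. via `pip install pyspark`) that starts a Spark session from Python and runs a tiny job:

```python
# Minimal sketch: PySpark is the Python entry point to the Spark engine.
from pyspark.sql import SparkSession

# Create (or reuse) a local Spark session
spark = SparkSession.builder.master("local[*]").appName("hello-spark").getOrCreate()

# Build a small DataFrame and let the Spark engine do the work
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.show()

spark.stop()
```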
- Jupyter Notebook
- PySpark package
- Apache Spark
We assume that you have already installed Anaconda and have some basic knowledge of Python and SQL to follow along with the tutorial.
- Go to the Apache Spark website.
- Choose a Spark release. In our case we used 2.4.3 (May 07 2019).
- Choose a package type. It will be selected by default.
- Click the Download Spark link.
You can follow the link to install Spark.
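Once Spark is set up, a quick way to confirm that Python can see it is to import `pyspark` and check the version (a sketch only; the exact version string depends on the release you downloaded):

```python
# Quick sanity check that PySpark is importable.
# If Spark was installed from the downloaded archive rather than pip,
# you may first need findspark (`pip install findspark`) to locate it:
#   import findspark; findspark.init()
import pyspark
print(pyspark.__version__)  # e.g. "2.4.3"
```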
Credits: Michael Galarnyk
- Karan
- Gunnika
- Ravinder
- To learn about terms such as SparkContext, RDDs, and Transformations/Actions, and about methods like show(), groupBy(), etc., we used the guru99 website (a short sketch illustrating these terms follows below).
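As a small illustration of those terms (a sketch with made-up data, not the tutorial's actual datasets): an RDD transformation such as map() is lazy and only runs when an action such as collect() is called, while DataFrame methods like groupBy() and show() cover similar ideas at a higher level.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("terms-demo").getOrCreate()
sc = spark.sparkContext  # the SparkContext behind the session

# RDD: map() is a transformation (lazy); collect() is an action (triggers execution)
rdd = sc.parallelize([1, 2, 3, 4])
print(rdd.map(lambda x: x * 2).collect())  # [2, 4, 6, 8]

# DataFrame: groupBy() + count(), then show() to print the result
df = spark.createDataFrame([("a", 1), ("b", 2), ("a", 3)], ["key", "value"])
df.groupBy("key").count().show()

spark.stop()
```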
Please read the report to better understand the tutorial. Also, the datasets used are large, so please contact me on LinkedIn and I will give you access to the drive link.