Week 5 Homework

You may refer to homework.ipynb for the solution.

In this homework we'll put what we learned about Spark into practice.

For this homework we will be using the FHVHV 2021-06 data found here: FHVHV Data

Question 1:

Install Spark and PySpark

  • Install Spark
  • Run PySpark
  • Create a local spark session
  • Execute spark.version.

What's the output?

  • 3.3.2 ✅
  • 2.1.4
  • 1.2.3
  • 5.4
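
The steps in Question 1 can be sketched as follows. This is a minimal sketch, assuming the `pyspark` package and a working Spark installation are available; the helper name `create_local_session` and the application name `homework` are hypothetical choices, not part of the course material.

```python
def create_local_session(app_name="homework"):
    """Create (or reuse) a SparkSession that runs locally on all available cores."""
    # Imported inside the function so that this snippet can be loaded
    # even on a machine where pyspark is not installed yet.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master("local[*]")   # local mode, one worker thread per core
        .appName(app_name)    # hypothetical application name
        .getOrCreate()
    )
```

With Spark installed, `spark = create_local_session()` followed by `print(spark.version)` prints the version string of the installed Spark, which is the value the question asks for.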

Question 2:

FHVHV June 2021

Read it with Spark using the same schema as we did in the lessons.
We will use this dataset for all the remaining questions.
Repartition it to 12 partitions and save it to parquet.
What is the average size, in MB, of the Parquet files that were created (those ending with the .parquet extension)? Select the answer that most closely matches.

  • 2MB
  • 24MB ✅
  • 100MB
  • 250MB
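
One way to approach Question 2 is sketched below. It assumes `df` is the Spark DataFrame already read with the lesson schema; the function names and the output directory argument are hypothetical. The size helper is plain Python and only looks at files ending in `.parquet`, so Spark's `_SUCCESS` marker and `.crc` checksum files are ignored.

```python
from pathlib import Path


def repartition_to_parquet(df, output_dir, num_partitions=12):
    """Repartition a Spark DataFrame and write it out as Parquet files."""
    # `df` is assumed to be a pyspark.sql.DataFrame.
    df.repartition(num_partitions).write.mode("overwrite").parquet(str(output_dir))


def average_parquet_size_mb(output_dir):
    """Return the mean size, in MB (1 MB = 1024 * 1024 bytes), of the .parquet files."""
    files = list(Path(output_dir).glob("*.parquet"))
    if not files:
        raise ValueError(f"no .parquet files found in {output_dir}")
    total_bytes = sum(f.stat().st_size for f in files)
    return total_bytes / len(files) / (1024 * 1024)
```

Calling `repartition_to_parquet(df, "some/output/dir")` and then `average_parquet_size_mb("some/output/dir")` gives a number to compare against the answer options.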

Question 3:

Count records

How many taxi trips were there on June 15?

Consider only trips that started on June 15.

  • 308,164
  • 12,856
  • 452,470 ✅
  • 50,982
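
A possible way to count the trips for Question 3 with Spark SQL is sketched below. It assumes the trips DataFrame has been registered beforehand with `df.createOrReplaceTempView("fhvhv")` and that the pickup timestamp column is named `pickup_datetime`; both the view name and the column name are assumptions, as is the helper name.

```python
def count_trips_started_on(spark, day, view_name="fhvhv"):
    """Count trips whose pickup timestamp falls on `day` (a 'YYYY-MM-DD' string).

    `spark` is a SparkSession; `view_name` must already be a registered temp view.
    """
    query = f"""
        SELECT COUNT(*) AS trips
        FROM {view_name}
        WHERE to_date(pickup_datetime) = DATE '{day}'
    """
    # first() returns a single Row; the count is read by its alias.
    return spark.sql(query).first()["trips"]
```

For this homework the call would be `count_trips_started_on(spark, "2021-06-15")`. Filtering on the pickup date only matches the requirement to consider trips that started on June 15, regardless of when they ended.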

Question 4:

Longest trip for each day

Now calculate the duration for each trip.
How long was the longest trip, in hours?

  • 66.87 Hours ✅
  • 243.44 Hours
  • 7.68 Hours
  • 3.32 Hours
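
The duration calculation in Question 4 can be sketched in Spark SQL as follows. The sketch assumes a temp view named `fhvhv` with timestamp columns `pickup_datetime` and `dropoff_datetime`; those names and the helper name are assumptions. Each timestamp is converted to seconds since the epoch, the difference is divided by 3600 to get hours, and the maximum is taken over all trips.

```python
def longest_trip_hours(spark, view_name="fhvhv"):
    """Return the duration, in hours, of the longest trip in the given temp view."""
    query = f"""
        SELECT MAX(
            (unix_timestamp(dropoff_datetime) - unix_timestamp(pickup_datetime)) / 3600
        ) AS hours
        FROM {view_name}
    """
    # The difference is in seconds; dividing by 3600 converts it to hours.
    return spark.sql(query).first()["hours"]
```

Calling `longest_trip_hours(spark)` returns a single floating-point number that can be rounded and compared with the answer options.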

Question 5:

User Interface

On which local port does Spark's user interface, which shows the application's dashboard, run?

  • 80
  • 443
  • 4040 ✅
  • 8080
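
To check the port from Question 5 on your own machine, a small helper like the one below can help. It is a plain-Python sketch (the function name is hypothetical) that only tests whether something is listening on a given local port; it does not verify that the listener is Spark.

```python
import socket


def is_port_open(port, host="127.0.0.1", timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0
```

While a local Spark session is running, `is_port_open(4040)` should return `True`. If port 4040 is already taken, Spark tries the next ports (4041, 4042, and so on), and `spark.sparkContext.uiWebUrl` reports the address actually in use.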

Question 6:

Most frequent pickup location zone

Load the zone lookup data (Zone Data) into a temp view in Spark.

Using the zone lookup data and the FHVHV June 2021 data, what is the name of the most frequent pickup location zone?

  • East Chelsea
  • Astoria
  • Union Sq
  • Crown Heights North ✅
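
A sketch of the temp view and join for Question 6 follows. It assumes the zone lookup file is a CSV with a header row and columns `LocationID` and `Zone`, that the trips are registered as a temp view named `fhvhv`, and that the pickup location column is `PULocationID`. All of these names, and the helper names, are assumptions to adapt to your data.

```python
def load_zones_view(spark, csv_path, view_name="zones"):
    """Read the zone lookup CSV and register it as a temp view."""
    zones = spark.read.option("header", "true").csv(csv_path)
    zones.createOrReplaceTempView(view_name)
    return zones


def most_frequent_pickup_zone(spark, trips_view="fhvhv", zones_view="zones"):
    """Return the name of the zone with the most pickups."""
    query = f"""
        SELECT z.Zone AS zone, COUNT(*) AS trips
        FROM {trips_view} AS t
        JOIN {zones_view} AS z
          ON t.PULocationID = z.LocationID
        GROUP BY z.Zone
        ORDER BY trips DESC
        LIMIT 1
    """
    # Only the top row is kept, so first() yields the busiest zone.
    return spark.sql(query).first()["zone"]
```

After `load_zones_view(spark, "path/to/zone_lookup.csv")`, calling `most_frequent_pickup_zone(spark)` returns the zone name to compare with the answer options. The CSV path shown here is a placeholder.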