
Covid-19 Spark project


Design and implementation of several Spark applications, written in Python, that run different jobs to analyze a COVID-19 dataset created by Our World In Data.

Design and implementation in Spark of:

  • an application returning the ranking of continents in decreasing order of total_cases
  • an application returning, for each location, the average number of new_tests per day
  • an application returning the 5 days in which the total number of patients in intensive care units (ICUs) and in hospital (icu_patients + hosp_patients) was highest, in decreasing order of this total

All the Spark applications are executed with inputs of different sizes and in different configurations (local mode, standalone cluster, and YARN) to compare their execution times.
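As a rough, hypothetical sketch (the actual code lives in the job scripts shown later), switching between the three environments only changes the master URL given to the SparkSession builder; the standalone master host name below is a placeholder:

```python
from pyspark.sql import SparkSession

# Minimal sketch: the same application targets the three environments
# by changing the master URL (the host name below is a placeholder).
spark = (
    SparkSession.builder
    .appName("Covid19Job")
    .master("local[*]")                    # local mode: single JVM, all cores
    # .master("spark://master-host:7077")  # standalone cluster (hypothetical host)
    # .master("yarn")                      # resource management delegated to YARN
    .getOrCreate()
)
```

In practice the master URL is more often supplied through spark-submit's --master option than hard-coded, which lets the same script be timed in all three configurations without changes.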

🚀 About Me

I'm a computer science Master's degree student and this is one of my university projects. See my other projects here on GitHub!

portfolio linkedin

💻 The project

Dataset

The dataset is a collection of COVID-19 data maintained and updated daily by Our World In Data; it contains data on confirmed cases, deaths, hospitalizations, and other variables of potential interest.

The dataset, available in different formats, can be found here, while the data dictionary that explains the meaning of all the dataset's columns is available here.

Data dictionary
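As a minimal, illustrative sketch (assuming the CSV export of the dataset has been downloaded locally as owid-covid-data.csv), the file can be loaded and the columns used by the three jobs inspected as follows:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("OwidCovidLoad").getOrCreate()

# Read the OWID CSV export (the local file name is an assumption)
df = (
    spark.read
    .option("header", True)       # first row holds the column names
    .option("inferSchema", True)  # infer numeric and date types
    .csv("owid-covid-data.csv")
)

# Columns referenced by the three jobs, as documented in the data dictionary
df.select("continent", "location", "date",
          "total_cases", "new_tests",
          "icu_patients", "hosp_patients").printSchema()
```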

Job 1

Implementation of a Spark application that returns the ranking of continents in decreasing order of total_cases.

An illustrative diagram of the steps performed to solve the problem is shown below:

(image: Job 1 workflow diagram)

After importing the required functions from the different libraries, the first step needed to run a Spark application is to create a SparkSession with an appName that identifies it. After reading the CSV file, the filtering and aggregation operations required to reach our goal are computed:

(image: Job 1 code)
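A minimal DataFrame sketch of the same idea (not the repository's exact code, which is shown in the screenshot above): since total_cases is cumulative, the latest value per location is taken, summed per continent, and sorted in decreasing order.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("Job1_ContinentTotalCases").getOrCreate()

df = (spark.read.option("header", True).option("inferSchema", True)
      .csv("owid-covid-data.csv"))

ranking = (
    df.filter(F.col("continent").isNotNull())   # drop aggregate rows such as "World"
      .groupBy("continent", "location")
      .agg(F.max("total_cases").alias("location_total_cases"))  # latest cumulative value
      .groupBy("continent")
      .agg(F.sum("location_total_cases").alias("total_cases"))
      .orderBy(F.col("total_cases").desc())
)

ranking.show()
```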

Job 1 results:

This job has been executed in all three running configurations (local mode, standalone cluster, and YARN).

Three different input datasets have been used; the results change depending on the input:

Input: MARCH-APRIL data

(image: Job 1 results, March–April input)

Input: MARCH-AUGUST data

(image: Job 1 results, March–August input)

Input: MARCH-OCTOBER data

(image: Job 1 results, March–October input)

Job 2

Implementation of a Spark application that returns, for each location, the average number of new_tests per day.

An illustrative diagram of the steps performed to solve the problem is shown below:

(image: Job 2 workflow diagram)

After importing the required functions from the different libraries, the first step needed to run a Spark application is to create a SparkSession with an appName that identifies it. After reading the CSV file, the filtering and aggregation operations required to reach our goal are computed:

(image: Job 2 code)
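A minimal sketch of the same computation (assuming the dataset's new_tests column; the repository's actual code is in the screenshot above): the daily new_tests values are averaged per location over the days that report a value.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("Job2_AvgNewTests").getOrCreate()

df = (spark.read.option("header", True).option("inferSchema", True)
      .csv("owid-covid-data.csv"))

# For each location, average the reported daily new_tests values
avg_tests = (
    df.filter(F.col("new_tests").isNotNull())
      .groupBy("location")
      .agg(F.avg("new_tests").alias("avg_new_tests_per_day"))
      .orderBy("location")
)

avg_tests.show()
```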

Job 2 results:

This job has been executed in all three running configurations (local mode, standalone cluster, and YARN).

Three different input datasets have been used; the results change depending on the input:

Input: MARCH-APRIL data

(image: Job 2 results, March–April input)

Input: MARCH-AUGUST data

(image: Job 2 results, March–August input)

Input: MARCH-OCTOBER data

(image: Job 2 results, March–October input)

Job 3

Implementation of a Spark application that returns the 5 days in which the total number of patients in intensive care units (ICUs) and in hospital (icu_patients + hosp_patients) was highest, in decreasing order of this total.

An illustrative diagram of the steps performed to solve the problem is shown below:

(image: Job 3 workflow diagram)

After importing the required functions from the different libraries, the first step needed to run a Spark application is to create a SparkSession with an appName that identifies it. After reading the CSV file, the filtering and aggregation operations required to reach our goal are computed:

(image: Job 3 code)
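A minimal sketch of the same idea (again not the repository's exact code): missing ICU/hospital values are treated as zero, the two columns are summed per day across locations, and the 5 days with the highest total are kept.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("Job3_Top5HospitalDays").getOrCreate()

df = (spark.read.option("header", True).option("inferSchema", True)
      .csv("owid-covid-data.csv"))

top_days = (
    df.filter(F.col("continent").isNotNull())      # keep per-country rows only
      .withColumn("icu_plus_hosp",
                  F.coalesce(F.col("icu_patients"), F.lit(0)) +
                  F.coalesce(F.col("hosp_patients"), F.lit(0)))
      .groupBy("date")
      .agg(F.sum("icu_plus_hosp").alias("total_icu_hosp"))
      .orderBy(F.col("total_icu_hosp").desc())
      .limit(5)                                     # 5 days with the highest total
)

top_days.show()
```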

Job 3 results:

This job has been executed in all three running configurations (local mode, standalone cluster, and YARN).

Three different input datasets have been used; the results change depending on the input:

Input: MARCH-APRIL data

(image: Job 3 results, March–April input)

Input: MARCH-AUGUST data

(image: Job 3 results, March–August input)

Input: MARCH-OCTOBER data

(image: Job 3 results, March–October input)

Spark configurations time comparison

For every job, tabular and graphical comparisons of the execution times in the local, standalone-cluster, and YARN configurations have been produced. Naturally, these execution times are reported for each of the different input sizes used.

(images: tabular and graphical comparison of execution times)

Time comparison results discussion

  • As expected, the fastest configuration is local mode, because Spark runs in a single JVM and uses multiple threads for all the computations; this is also why the local configuration is mostly used for debugging and testing code.

  • On the other hand, standalone-cluster mode is used to simulate the behavior of distributed computation, but all the resource management is done internally by Spark, here across different VM instances on a single machine.

  • Execution times increase significantly when the job is run with YARN, because the time spent on resource management adds to the computation time.

Support

For support, error corrections, etc., please email me at domenico.elicio13@gmail.com.