Design and implementation of several Spark applications, written in Python, that perform different jobs for analyzing a COVID-19 dataset created by Our World In Data.
Design and implementation in Spark of:
- an application returning the ranking of continents in decreasing order of total_cases
- an application returning, for each location, the average number of new_tests per day
- an application returning the 5 days in which the total number of patients in intensive care units (ICUs) and in hospital (icu_patients + hosp_patients) was highest, in decreasing order of this total
All the Spark applications are executed with inputs of different sizes and with different configurations (local mode, standalone cluster, and YARN) to compare their execution times.
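As a rough illustration of the three run modes (not the project's exact submission setup), the sketch below shows how a SparkSession can be pointed at a local, standalone, or YARN master; the helper function, app name, and master host are placeholders.

```python
from pyspark.sql import SparkSession

# Hypothetical helper: pick the master URL for one of the three run modes.
# The standalone master host/port is a placeholder, and running on YARN also
# requires HADOOP_CONF_DIR to point at the cluster configuration.
def build_session(mode: str) -> SparkSession:
    masters = {
        "local": "local[*]",                       # single JVM, all available cores
        "standalone": "spark://master-host:7077",  # Spark standalone cluster manager
        "yarn": "yarn",                            # resources managed by YARN
    }
    return (SparkSession.builder
            .appName("covid19-analysis")
            .master(masters[mode])
            .getOrCreate())
```

In practice the master is usually passed to spark-submit with --master rather than hard-coded, so the same application code runs unchanged in all three configurations.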
I'm a computer science Master's degree student and this is one of my university projects. See my other projects here on GitHub!
The dataset is a collection of COVID-19 data maintained and updated daily by Our World In Data; it contains data on confirmed cases, deaths, hospitalizations, and other variables of potential interest.
The dataset, available in different formats, can be found here, while the data dictionary that explains the meaning of all the dataset's columns is available here.
Implementation of a Spark application that returns the ranking of continents in decreasing order of total_cases.
An illustrative diagram of the steps performed to solve the problem is shown below:
After importing the functions we need from the various libraries, the first thing required for a Spark application to run is to create a SparkSession with an appName that identifies it. After reading the CSV file, all the filtering and aggregation operations needed to reach our goal are performed:
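The snippet below is a minimal sketch of this job, assuming the cumulative total_cases column is reduced to its latest value per location and then summed per continent; the CSV path and app name are placeholders and may differ from the actual project code.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("continents-by-total-cases")
         .getOrCreate())

# Read the OWID CSV; the file name is a placeholder for the actual input path.
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv("owid-covid-data.csv"))

ranking = (df.filter(F.col("continent").isNotNull())   # drop aggregate rows such as "World"
             .groupBy("continent", "location")
             # total_cases is cumulative, so the latest value per location is its maximum
             .agg(F.max("total_cases").alias("location_total"))
             .groupBy("continent")
             .agg(F.sum("location_total").alias("total_cases"))
             .orderBy(F.desc("total_cases")))

ranking.show()
```

Rows with a null continent are filtered out because the OWID file also contains aggregate rows (e.g. "World") that would otherwise be counted twice.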
The execution of this job has been done in all three running configurations (local mode, standalone cluster, and with YARN).
Three input datasets of different sizes have been used; the results change with respect to the input:
Implementation of a Spark application that returns, for each location, the average number of new_tests per day
An illustrative diagram of the steps performed to solve the problem is shown below:
After importing the functions we need from the various libraries, the first thing required for a Spark application to run is to create a SparkSession with an appName that identifies it. After reading the CSV file, all the filtering and aggregation operations needed to reach our goal are performed:
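A minimal sketch of this job is shown below, assuming the average is taken over the days on which a location actually reported a new_tests value; the CSV path and app name are placeholders and may differ from the actual project code.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("avg-new-tests-per-location")
         .getOrCreate())

# Read the OWID CSV; the file name is a placeholder for the actual input path.
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv("owid-covid-data.csv"))

# Average the daily new_tests values reported by each location; rows with a
# missing new_tests value are excluded so they do not skew the average.
avg_tests = (df.filter(F.col("new_tests").isNotNull())
               .groupBy("location")
               .agg(F.avg("new_tests").alias("avg_new_tests_per_day"))
               .orderBy("location"))

avg_tests.show()
```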
The execution of this job has been done in all three running configurations (local mode, standalone cluster, and with YARN).
Three input datasets of different sizes have been used; the results change with respect to the input:
Implementation of a Spark application that returns the ranking of the 5 days in which the total number of patients in intensive care units (ICUs) and in hospital (icu_patients + hosp_patients) was highest, in decreasing order of this total
An illustrative diagram of the steps performed to solve the problem is shown below:
After importing the functions we need from the various libraries, the first thing required for a Spark application to run is to create a SparkSession with an appName that identifies it. After reading the CSV file, all the filtering and aggregation operations needed to reach our goal are performed:
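The sketch below illustrates one possible way to compute this job, assuming icu_patients and hosp_patients are summed across all reporting locations for each date (with missing values treated as 0); the CSV path and app name are placeholders and may differ from the actual project code.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("top5-icu-hosp-days")
         .getOrCreate())

# Read the OWID CSV; the file name is a placeholder for the actual input path.
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv("owid-covid-data.csv"))

# For each calendar day, add up ICU and hospital patients over all reporting
# locations (missing values counted as 0), then keep the 5 largest totals.
top5 = (df.filter(F.col("continent").isNotNull())      # drop aggregate rows such as "World"
          .withColumn("icu_plus_hosp",
                      F.coalesce(F.col("icu_patients"), F.lit(0)) +
                      F.coalesce(F.col("hosp_patients"), F.lit(0)))
          .groupBy("date")
          .agg(F.sum("icu_plus_hosp").alias("total_icu_hosp"))
          .orderBy(F.desc("total_icu_hosp"))
          .limit(5))

top5.show()
```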
The execution of this job has been done in all three running configurations (local mode, standalone cluster, and with YARN).
Three input datasets of different sizes have been used; the results change with respect to the input:
For every job, tabular and graphical comparisons of the execution times in the local, standalone-cluster, and YARN configurations have been produced. All these execution times also take the different input sizes into account.
- As expected, the fastest configuration is local mode: Spark runs in a single JVM and uses multiple threads for all the computations, which is why the local configuration is used mainly for debugging and testing the code.
- The standalone-cluster mode, on the other hand, is used to simulate the behavior of a distributed computation, but all the resource management is handled by Spark internally, in different JVM instances on a single machine.
- Execution times increase considerably when the job is run with YARN, because the resource-management overhead adds to the computation time.
For any support, error corrections, etc., please email me at domenico.elicio13@gmail.com