Design and implementation of several Spark applications, written in Python, that perform different jobs for analyzing a COVID-19 dataset created by Our World In Data.
Design and implementation in Spark of:
- an application returning the ranking of continents in decreasing order of total_cases
- an application returning, for each location, the average number of new_tests per day
- an application returning the 5 days in which the total number of patients in intensive care units (ICUs) and in hospital (icu_patients + hosp_patients) was highest, in decreasing order of this total
All the Spark applications are executed with inputs of different sizes and with different configurations (local mode, standalone cluster, and YARN) to compare their execution times.
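As a rough illustration of the three run modes (not the project's exact submission setup), the sketch below shows how a SparkSession can be pointed at a local, standalone, or YARN master; the helper function, app name, and master host are placeholders.

```python
from pyspark.sql import SparkSession

# Hypothetical helper: pick the master URL for one of the three run modes.
# The standalone master host/port is a placeholder, and running on YARN also
# requires HADOOP_CONF_DIR to point at the cluster configuration.
def build_session(mode: str) -> SparkSession:
    masters = {
        "local": "local[*]",                       # single JVM, all available cores
        "standalone": "spark://master-host:7077",  # Spark standalone cluster manager
        "yarn": "yarn",                            # resources managed by YARN
    }
    return (SparkSession.builder
            .appName("covid19-analysis")
            .master(masters[mode])
            .getOrCreate())
```

In practice the master is usually passed to spark-submit with --master rather than hard-coded, so the same application code runs unchanged in all three configurations.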
I'm a computer science Master's degree student and this is one of my university projects. See my other projects here on GitHub!
The dataset is a collection of COVID-19 data maintained and updated daily by Our World In Data; it contains data on confirmed cases, deaths, hospitalizations, and other variables of potential interest.
The dataset, available in different formats, can be found here, while the data dictionary that explains the meaning of all the dataset's columns is available here.
Implementation of a Spark application that returns the ranking of continents in decreasing order of total_cases.
An illustrative diagram of the steps performed to solve the problem is shown below:
After importing the functions we need from the various libraries, the first thing required for a Spark application to run is to create a SparkSession with an appName that identifies it. After reading the CSV file, all the filtering and aggregation operations needed to reach our goal are performed:
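The snippet below is a minimal sketch of this job, assuming the cumulative total_cases column is reduced to its latest value per location and then summed per continent; the CSV path and app name are placeholders and may differ from the actual project code.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("continents-by-total-cases")
         .getOrCreate())

# Read the OWID CSV; the file name is a placeholder for the actual input path.
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv("owid-covid-data.csv"))

ranking = (df.filter(F.col("continent").isNotNull())   # drop aggregate rows such as "World"
             .groupBy("continent", "location")
             # total_cases is cumulative, so the latest value per location is its maximum
             .agg(F.max("total_cases").alias("location_total"))
             .groupBy("continent")
             .agg(F.sum("location_total").alias("total_cases"))
             .orderBy(F.desc("total_cases")))

ranking.show()
```

Rows with a null continent are filtered out because the OWID file also contains aggregate rows (e.g. "World") that would otherwise be counted twice.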
The execution of this job has been done in all three running configurations (local mode, standalone cluster, and with YARN).
Three input datasets of different sizes have been used; the results change with respect to the input:
Implementation of a Spark application that returns, for each location, the average number of new_tests per day
An illustrative diagram of the steps performed to solve the problem is shown below:
After importing the functions we need from the various libraries, the first thing required for a Spark application to run is to create a SparkSession with an appName that identifies it. After reading the CSV file, all the filtering and aggregation operations needed to reach our goal are performed:
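A minimal sketch of this job is shown below, assuming the average is taken over the days on which a location actually reported a new_tests value; the CSV path and app name are placeholders and may differ from the actual project code.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("avg-new-tests-per-location")
         .getOrCreate())

# Read the OWID CSV; the file name is a placeholder for the actual input path.
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv("owid-covid-data.csv"))

# Average the daily new_tests values reported by each location; rows with a
# missing new_tests value are excluded so they do not skew the average.
avg_tests = (df.filter(F.col("new_tests").isNotNull())
               .groupBy("location")
               .agg(F.avg("new_tests").alias("avg_new_tests_per_day"))
               .orderBy("location"))

avg_tests.show()
```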
The execution of this job has been done in all three running configurations (local mode, standalone cluster, and with YARN).
Three input datasets of different sizes have been used; the results change with respect to the input:
Implementation of a Spark application that returns the ranking of the 5 days in which the total number of patients in intensive care units (ICUs) and in hospital (icu_patients + hosp_patients) was highest, in decreasing order of this total
An illustrative diagram of the steps performed to solve the problem is shown below:
After importing the functions we need from the various libraries, the first thing required for a Spark application to run is to create a SparkSession with an appName that identifies it. After reading the CSV file, all the filtering and aggregation operations needed to reach our goal are performed:
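The sketch below illustrates one possible way to compute this job, assuming icu_patients and hosp_patients are summed across all reporting locations for each date (with missing values treated as 0); the CSV path and app name are placeholders and may differ from the actual project code.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("top5-icu-hosp-days")
         .getOrCreate())

# Read the OWID CSV; the file name is a placeholder for the actual input path.
df = (spark.read
      .option("header", True)
      .option("inferSchema", True)
      .csv("owid-covid-data.csv"))

# For each calendar day, add up ICU and hospital patients over all reporting
# locations (missing values counted as 0), then keep the 5 largest totals.
top5 = (df.filter(F.col("continent").isNotNull())      # drop aggregate rows such as "World"
          .withColumn("icu_plus_hosp",
                      F.coalesce(F.col("icu_patients"), F.lit(0)) +
                      F.coalesce(F.col("hosp_patients"), F.lit(0)))
          .groupBy("date")
          .agg(F.sum("icu_plus_hosp").alias("total_icu_hosp"))
          .orderBy(F.desc("total_icu_hosp"))
          .limit(5))

top5.show()
```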
The execution of this job has been done in all three running configurations (local mode, standalone cluster, and with YARN).
Three input datasets of different sizes have been used; the results change with respect to the input:
For every job, tabular and graphical comparisons of the execution times in the local, standalone-cluster, and YARN configurations have been produced. All these execution times also take the different input sizes into account.
- As expected, the fastest configuration is local mode: Spark runs in a single JVM and uses multiple threads for all the computations, which is why the local configuration is used mainly for debugging and testing the code.
- The standalone-cluster mode, on the other hand, is used to simulate the behavior of a distributed computation, but all the resource management is handled by Spark internally, in different JVM instances on a single machine.
- Execution times increase considerably when the job is run with YARN, because the resource-management overhead adds to the computation time.
For any support, error corrections, etc., please email me at domenico.elicio13@gmail.com