CCT College Dublin Continuous Assessment
Learning Outcomes:
Please note this is not the assessment task. The task to be completed is detailed on the next page.
This CA will assess student attainment of the following minimum intended learning outcomes:
Programming for DA
- Critically appraise aggregation methods to process and manipulate data from multiple data structures. (Linked to PLO 4).
- Formulate and evaluate a testing and optimisation strategy for programmatic solutions. (Linked to PLO 5).
- Compare, contrast and select relevant libraries / techniques to process data from diverse sources. (Linked to PLO 5).
Statistics for Data Analytics
- Formulate and test hypotheses using appropriate statistical techniques and evaluate and communicate the result effectively. (Linked to PLO 2, PLO 3, PLO 6).
- Utilise current software and language to produce the results of your analysis from existing data. (Linked to PLO 1, PLO 4).
- Apply statistical analysis to appropriate datasets and critique the limitations of the model. (Linked to PLO 2, PLO 4).
Machine Learning for Data Analysis
- Modify and implement Machine Learning Algorithms to solve analytical problems. (Linked to PLO 1, PLO 2, PLO 5).
- Develop a machine learning strategy for a given domain and communicate effectively to team members, peers and project stakeholders the insight to be gained from the interpreted results. (Linked to PLO 1, PLO 4, PLO.
- Formulate and evaluate a test and optimisation strategy for programmatic solutions. (Linked to PLO 5).
Data Preparation & Visualisation
- Programmatically implement graphical methods to identify issues within a data set (missing, out of range, dirty data). (Linked to PLO 3, PLO 5).
- Propose, design, develop, and implement an interactive data visualisation solution, for a given data set and potential audience, detailing the rationale for approach and visualisation choices made during development for a given use case, data characteristics and multiple transmission media. (Linked to PLO 2, PLO 5).
- Perform a critical analysis of a data set to optimise the data for a given problem space. Document the rationale behind the decisions to peers and stakeholders. (Linked to PLO 5, PLO 6).
Attainment of the learning outcomes is the minimum requirement to achieve a Pass mark (40%). Higher marks are awarded where there is evidence of achievement beyond this, in accordance with QQI Assessment and Standards, Revised 2013, and summarised in the following table:
Percentage Range | CCT Performance Description | QQI Description of Attainment |
---|---|---|
Level 9 awards | ||
90% + | Exceptional | Achievement includes that required for a Pass and in most respects is significantly and consistently beyond this |
80 – 89% | Outstanding | |
70 – 79% | Excellent | |
60 – 69% | Very Good | Achievement includes that required for a Pass and in many respects is significantly beyond this |
50 – 59% | Good | Attains all the minimum intended programme learning outcomes |
40 – 49% | Acceptable | |
35 – 39% | Fail | Nearly (but not quite) attains the relevant minimum intended learning outcomes |
0 – 34% | Fail | Does not attain some or all of the minimum intended learning outcomes |
Please review the CCT Grade Descriptor available on the module Moodle page for a detailed description of the standard of work required for each grade band.
The grading system in CCT is the QQI percentage grading system and is in common use in higher education institutions in Ireland. The pass mark and thresholds for different grade bands may be different from what you have experienced in the higher education system in other countries. CCT grades must be considered in the context of the grading system in Irish higher education and not assumed to represent the same standard the percentage grade reflects when awarded in an international context.
Assessment Task
Students are advised to review and adhere to the submission requirements documented after the assessment task.
Scenario
Transport in Ireland
“Smartcard ticketing systems currently available in many cities’ public transport systems enable a vast amount of data.
These datasets reflect how people behave, which in turn means it is possible to assess their transport needs and provide an accurate picture of their habits, either as groups (based on fare types, such as student, elderly and so on) or at an individual level.
Thanks to big data analytics, we can adapt public transport to these needs, plan new services, minimise walking time and so on.”— https://www.engineersireland.ie/Brexit/big-data-enables-better-urban-transport-networks
You have been tasked with analysing Ireland's Transport data and comparing the Irish Transport sector with other countries (of your choosing) worldwide. This analysis should also include forecasting, sentiment analysis and evidence-based recommendations for the sector as well as a complete rationale of the entire process used to discover your findings. Your Research could include vehicle traffic, type of vehicles used, freight transport, air traffic, transport infrastructure etc. (or any other relevant topic EXCEPT Climate change) with Ireland as your base line.
Note:
- While topical, Construction impact on Climate Change SHOULD NOT be chosen as an area of research for this assessment.
You must source appropriate data sets from any available repository (e.g., but not restricted to, https://data.gov.ie/organization/transport-infrastructure-ireland) to inform your research. All datasets MUST be referenced and the relevant licence/permissions detailed.
Criteria of Analysis
Discuss the choice of project management framework you have deemed suitable for this project.
It is expected that you use some type of version control software eg: GitHub, Gitlab, BitBucket etc with regular commits of code and report versions. Please include the address of your version control repository in your report.
**Programming for DA Tasks** [0-100]
- Programming: The project must be explored programmatically: this means that you must implement suitable Python tools (code and/or libraries) to complete the analysis required. All of this is to be implemented in a Jupyter Notebook. [0-20]
- Data structures: You are required to gather and process data that has been stored in at least two distinct formats. For example, this can be data in a CSV file, from a MySQL database or from a web API in JSON format. [0-20]
- Documentation: The project documentation must include sound justifications and explanation of your code choices. Code quality standards should also be applied. [0-20]
- Testing & Optimisation: You are required to document and evaluate a testing and optimisation strategy for your analysis. As part of this, you may want to plan and document how you ensured your code is doing what it is meant to, as well as ensuring that the code is making good use of your resources (e.g. computing, time etc.). Note any trade-offs that you've made in these areas. [0-20]
- Data manipulation: For each of the different data sources, compare and contrast at least two relevant libraries and techniques for a) processing and b) aggregating the respective data, in order to justify your chosen libraries/techniques. [0-20]
Total Mark = 20+20+20+20+20=100
Statistics for Data Analytics Tasks
- Use descriptive statistics and appropriate visualisations in order to summarise the dataset(s) used, and to help justify the chosen models. [0-20]
- Analyse the variables in your dataset(s) and use appropriate inferential statistics to gain insights on possible population values (e.g., if you were working with public transport, you could find a confidence interval for the population proportion of users commuting to Dublin by train). [0-20]
- Undertake research to find similarities between some country(s) against Ireland and apply parametric and non-parametric inferential statistical techniques to compare them (e.g., t-test, analysis of variance, Wilcoxon test, chi-squared test, among others). You must justify your choices and verify the applicability of the tests. Hypotheses and conclusions must be clearly stated. You are expected to use at least 5 different inferential statistics tests. [0-40]
- Use the outcome of your analysis to deepen your research. Indicate the challenges you faced in the process. [0-20]
Note: All your calculations and reasoning behind your models must be documented in the report and/or the appendix.
Total Mark = 20+20+40+20=100
Machine Learning Tasks
Use of multiple models (at least two) to compare and contrast results and insights gained.
- Describe the rationale and justification for the choice of machine learning models for the above-mentioned scenario. Machine Learning models can be used for Prediction, Classification, Clustering, sentiment analysis, recommendation systems and Time series analysis. You should plan on trying multiple approaches (at least two) with proper selection of hyperparameters using the GridSearchCV method. You can choose appropriate features from the datasets and a target feature to answer the question asked in the scenario in the case of supervised learning. [0 - 30]
- Collect and develop a dataset based on the transport topic related to Ireland as well as other parts of the world. Perform a sentiment analysis for an appropriate transport topic (e.g., public transport, freight movement etc.) from the producers' and consumers' points of view in Ireland. [0 - 25]
- You should train and test for Supervised Learning and other appropriate metrics for unsupervised/semi-supervised machine learning models that you have chosen. Use cross validation to provide authenticity of the modelling outcomes. You can apply dimensionality reduction methods to prepare the dataset based on your machine learning modelling requirements. [0 - 30]
- A table or graphics should be provided to illustrate the similarities and contrast of the Machine Learning modelling outcomes based on the scoring metric used for the analysis of the above-mentioned scenario. Discuss and elaborate your understanding clearly. [0 - 15]
Total Mark = 30+25+30+15=100
Data Preparation & Visualisation Tasks
- Discuss in detail the process of acquiring your raw data, detailing the positive and/or negative aspects of your research and acquisition. This should include the relevance and implications of any and all licensing/permissions associated with the data. [0-15]
- Exploratory Data Analysis helps to identify patterns, inconsistencies, anomalies, missing data, and other attributes and issues in data sets so problems can be addressed. Evaluate your raw data and detail, in depth, the various attributes and issues that you find. Your evaluation should reference evidence to support your chosen methodology and use visualizations to illustrate your findings. [0-25]
- Taking into consideration the tasks required in the machine learning section, use appropriate data cleaning, engineering, extraction and/or other techniques to structure and enrich your data. Rationalize your decisions and implementation, including evidence of how your process has addressed the problems identified in the EDA (Exploratory Data Analysis) stage and how your structured data will assist in the analysis stage. This should include visualizations to illustrate your work and evidence to support your methodology. [0-30]
- Modern Transport planning has a great dependence on technology and relies upon visualizations to communicate information; this includes web based, mobile based and many other digital transmission formats. Develop an interactive dashboard tailored to modern Transport planning, using Tufte's principles, to showcase the information/evidence gathered following your Machine Learning Analysis. Detail the rationale for approach and visualisation choices made during development. Note you may not use Power BI, RapidMiner, Tableau or other such tools to accomplish this (at this stage). [0-30]
Total Mark = 15+25+30+30=100
Marks and feedback will be provided for each module separately based upon the learning outcomes for each of the modules.
All documentation, code, examples, and any other files MUST be evidenced in your Version Control repository. (Git or Similar) using your CCT email address ONLY. Your repository MUST show your continual development of the project throughout its lifecycle and include notes detailing progress at each commit.
You may be called to a Viva to defend your work.
Additional notes :
All:
- Your documentation should present your approach to the project, including elements of project planning (timelines).
- Ensure that your documentation follows a logical sequence through the planning / research / justification / implementation phases of the project.
- Ensure that your final upload contains a **maximum of 1 Jupyter notebook per module.**
- Please ensure that additional resources are placed and linked to a logical file structure, e.g. Scripts, Images, Report, Data etc.
- Ensure that you include your raw and structured datasets in your submission.
- 3000 (+/- 10%) words in report (not including code, code comments, titles, references, or citations).
- Your word count MUST be included.
(it is expected that research be carried out beyond class material)
**Submission Requirements**
All assessment submissions must meet the minimum requirements listed below. Failure to do so may have implications for the mark awarded.
Submissions that are suspected of plagiarism and/or inclusion of AI (CHATGPT, BARD etc…) Generated content will be referred to the college authorities.
All assessment submissions must:
- Consist of a Jupyter Notebook (.ipynb), a Word document ONLY, a dashboard, and the version control address.
- Be submitted by the deadline date specified or be subject to late submission penalties.
- Be submitted via Moodle upload.
- Use Harvard Referencing when citing third party material.
- Be the student’s own work.
- Include the CCT assessment cover page.
Additional Information
- Lecturers are not required to review draft assessment submissions. This may be offered at the lecturer’s discretion.
- In accordance with CCT policy, feedback to learners may be provided in written, audio or video format and can be provided as individual learner feedback, small group feedback or whole class feedback.
- Results and feedback will only be issued when assessments have been marked and moderated / reviewed by a second examiner.
- Additional feedback may be requested by contacting your lecturer. Additional feedback may be provided as individual, small group or whole class feedback. Lecturers are not obliged to respond to email requests for additional feedback where this is not the specified process or to respond to further requests for feedback following the additional feedback.
- Following receipt of feedback, where a student believes there has been an error in the marks or feedback received, they should avail of the recheck and review process and should not attempt to get a revised mark / feedback by directly approaching the lecturer. Lecturers are not authorised to amend published marks outside of the recheck and review process or the Board of Examiners process.
- Students are advised that disagreement with an academic judgement is not grounds for review.
- For additional support with academic writing and referencing students are advised to contact the CCT Library Service or access the CCT Learning Space.
- For additional support with subject matter content students are advised to contact the CCT Student Mentoring Academy.
- For additional support with IT subject content, students are advised to access the CCT Support Hub.
Analysis of the Irish Transport Sector: A Comparative Study with EU Countries
MSc in Data Analytics (SB+) - Sept 2023 - 2024 - YR1
Author: Jose Maria Rico Leal
e-mail: sba23021@student.cct.ie
Student ID: sba23021
GitHub Link: https://github.com/JoseRicoCct/CA2_Integrated_Assesment_MSc_Data_Analytics_CCT_Semester_1.git
This paper compares key areas of the Republic of Ireland's transportation sector with those of several European countries. The study employs statistical analysis, machine learning and data visualisation, supported by programming, testing, optimization, and sentiment analysis. The results highlight the sector's performance and the challenges faced, and offer insights for future research. Combining rigorous statistical techniques with machine learning deepens the understanding of Ireland's transport landscape and contributes perspectives of value to policy and industry stakeholders.
- Abstract
- 1. Introduction
- 2. Programming
- 3. Statistical Analysis
- 4. ML
- 5. Data Preparation and Visualization
- 6. Conclusion
- 7. References
- 8. Annex
- 9. License
In this assignment, we take a deep dive into selected aspects of the Irish transport sector. Using programming, statistical analysis, machine learning, and data visualization, we compare Ireland's transport landscape with that of selected EU nations. By combining these methods, we aim to clarify the sector's strengths, challenges, and opportunities for improvement within the broader European context.
I organized the project into four Jupyter Notebooks:
CA2_ML_Code.ipynb, CA2_Programming_Code.ipynb, CA2_Statistics_Code.ipynb and CA2_Data_Visualisation.ipynb. The project is programmatically explored using Python, emphasizing modularity and narrative-style programming.
For this project I gathered and processed data from:
CSV file:
Figure 1
Web API in CSV format:
Figure 2
Eurostat module:
Figure 3
Also, I explored gathering data from a:
Web API in JSON format:
Figure 4
The entire parsing process is documented in "CA2_Programming_Code_sba23021.ipynb" from cell 8 to 17 (Goel, 2020).
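The JSON route can be sketched with the standard library alone. The payload below is a hypothetical, hand-written stand-in for an API response (the field names `records`, `year`, `line` and `passengers` are assumptions, not the real schema); it only illustrates flattening nested JSON into typed rows that could then be loaded into a DataFrame.

```python
import json

# Hypothetical payload standing in for a web API response; the real schema may differ.
RAW = """
{
  "dataset": "example",
  "records": [
    {"year": 2021, "line": "Red", "passengers": "1200"},
    {"year": 2021, "line": "Green", "passengers": "1350"},
    {"year": 2022, "line": "Red", "passengers": null}
  ]
}
"""


def parse_records(raw_json):
    """Flatten the 'records' list into typed rows, skipping missing values."""
    payload = json.loads(raw_json)
    rows = []
    for rec in payload.get("records", []):
        if rec.get("passengers") is None:
            continue  # drop incomplete observations rather than guessing a value
        rows.append({
            "year": int(rec["year"]),
            "line": rec["line"],
            "passengers": int(rec["passengers"]),
        })
    return rows


rows = parse_records(RAW)
print(rows)
```

Converting the string counts to integers at parse time keeps type problems out of the later aggregation steps.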
MySQL database:
Figure 5
As I could not find an open MySQL database to connect to, I demonstrate how to extract data from a local MySQL database instead. I created a table named "tran_hv_psmod" within a schema called "eurostat" and imported the previously fetched file from Eurostat, "TRAN_HV_PSMOD.csv":
Figure 6
Once I had the records in the table, I was able to retrieve them using the Python package "sqlalchemy" (Oyama, 2022):
Figure 7
Complete instructions on creating the table, pushing, and fetching records are provided in the annex (8.1. MySQL Setup) (dev.mysql.com, n.d.).
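The store-then-fetch round trip can be sketched without a running server. This example deliberately swaps MySQL and "sqlalchemy" for Python's built-in `sqlite3` with an in-memory database so that it runs anywhere; the table name `tran_hv_psmod` comes from the text, while the column names and values are illustrative assumptions.

```python
import sqlite3

# In-memory SQLite stands in for the local MySQL schema "eurostat" (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tran_hv_psmod (geo TEXT, vehicle TEXT, year INTEGER, pct REAL)"
)

# Illustrative rows only; the real values come from TRAN_HV_PSMOD.csv.
sample_rows = [
    ("IE", "BUS", 2019, 15.0),
    ("IE", "CAR", 2019, 82.0),
    ("IE", "TRN", 2019, 3.0),
]
conn.executemany("INSERT INTO tran_hv_psmod VALUES (?, ?, ?, ?)", sample_rows)
conn.commit()

# Parameterised query: fetch every record for one country.
cursor = conn.execute(
    "SELECT vehicle, year, pct FROM tran_hv_psmod WHERE geo = ? ORDER BY vehicle",
    ("IE",),
)
fetched = cursor.fetchall()
conn.close()
print(fetched)
```

Using a parameterised query rather than string formatting avoids quoting mistakes when the filter value changes.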
In terms of documentation, I endeavoured to provide comments explaining the rationale behind each line of code. This includes details about the purpose of each line and the workflow for data manipulation and visualization.
I maintained code quality standards by adhering to the "PEP 8" styling guidelines (van Rossum, Warsaw and Coghlan, 2001). Given the project's nature, which primarily involves data manipulation and visualization, there was no need for the introduction of complex programming constructs or advanced "OOP" principles. My objective was to ensure clean and modularized code.
I have conducted extensive testing in the statistical section, examining each scenario to ensure the statistical model fits appropriately. Furthermore, I refined the testing process when accepting the null hypothesis. All statistical models in "CA2_Statistics_Code_sba23021.ipynb" are implemented after thorough testing.
Now, we are going to explore a testing example for a hypothesis test involving two populations.
This table will help us formulate the hypothesis:
Figure 8
H0: mu IE_BUS = mu EU_Country_BUS; There is no significant difference between the percentage average on passenger-kilometres for Vehicle BUS in Ireland and EU_Country.
H1: mu IE_BUS != mu EU_Country_BUS; There is a significant difference between the percentage average on passenger-kilometres for Vehicle BUS in Ireland and EU_Country.
Let us start the testing and choose Italy to determine whether we accept or reject the null hypothesis:
Figure 9
Figure 10
H0 is clearly rejected; therefore, there is a significant difference in the average percentage of BUS passenger-kilometres between Ireland and Italy.
Now we are going to select Slovenia as its mu is closer to the Irish one:
Figure 11
Figure 12
We fail to reject H0 and conclude that there is not enough evidence of a significant difference between the average percentage of passenger-kilometres for vehicle BUS in Ireland and Slovenia.
There is another testing scenario in annex 8.1.2.
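The two testing scenarios above (one rejection, one non-rejection) can be turned into a small repeatable check. This is a minimal sketch, assuming a two-sided Welch t-test via `scipy.stats.ttest_ind`; the notebook may use a different variant, and the sample values are made up rather than taken from Eurostat.

```python
from scipy import stats

ALPHA = 0.05


def compare_means(sample_a, sample_b, alpha=ALPHA):
    """Two-sided Welch t-test; returns (p_value, decision)."""
    result = stats.ttest_ind(sample_a, sample_b, equal_var=False)
    decision = "reject H0" if result.pvalue < alpha else "fail to reject H0"
    return result.pvalue, decision


# Made-up percentage samples, not the Eurostat figures.
ireland = [14.1, 14.8, 15.2, 14.5, 15.0, 14.7]
far_country = [11.0, 11.4, 10.8, 11.2, 11.1, 10.9]    # mean far from Ireland's
near_country = [14.3, 14.9, 15.1, 14.4, 15.2, 14.6]   # mean close to Ireland's

p_far, decision_far = compare_means(ireland, far_country)
p_near, decision_near = compare_means(ireland, near_country)
print(decision_far, decision_near)
```

Pairing one case that should reject with one that should not is a cheap way to confirm the decision logic is wired the right way round.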
Good examples of optimization can be found in "CA2_ML_Code_sba23021.ipynb", where I enriched the datasets to achieve better accuracy in machine learning models.
Before enrichment, I observed a low accuracy of 69.23% for SVC, with the best score at 90.88% (C=1000, Gamma=0.01). I identified that the model was not fitting properly at Gamma=0.0001.
After enriching the dataset, I achieved an accuracy of 87.01% for SVC, with the best score at 91.88% (C=1000, Gamma=0.001). The results indicate a less overfitted model, as the best score is now obtained at a lower Gamma. Additionally, the graph at Gamma=0.0001 shows that the train and test scores are closer together after enrichment.
Annex 8.1.2 contains two further optimization scenarios.
Figure 13
In analyzing the different data sources, I used various libraries and techniques for both processing and aggregating data (GeeksforGeeks, 2020). The comparisons and contrasts for each data source are summarized in Figure 14, and the detailed implementation and explanation of the code can be found in "CA2_Programming_Code_sba23021.ipynb".
Figure 14
Dataset used is "tii03-passenger-journeys-by-luas". We aim to estimate the range of potential values for the parameter "LUAS average passenger number".
Below are the descriptive statistics for the total LUAS passenger numbers (green and red lines) for the years 2019, 2020, 2021 and 2022:
Figure 15
We will be examining the mean values to establish the confidence intervals.
It is interesting to note that both lines are quite balanced in terms of usage.
Figure 16
In Section 3.2, a detailed analysis of confidence intervals will be conducted for this dataset.
In this section we compare Ireland with some European countries, formulating hypotheses to assess whether there are statistically significant differences in the "percentage average passenger-kilometres based on type of transport (Bus, Car and Train)".
Since we will be comparing the Irish mean against the means of other countries, this plot will assist us in constructing the hypotheses:
Figure 17
The dataset used is "Modal split of inland passenger transport" with the Eurostat code "TRAN_HV_PSMOD."
Using Irish airports as a reference, an ANOVA will be conducted to test whether there are any statistically significant differences in the means compared to other European airports. The dataset used is "Air passenger transport by main airports in each reporting country" with the Eurostat code "AVIA_PAOA."
In the first scenario, Dublin Airport is considered, and below is a graph showing the subset of airports selected for this case:
Figure 18
In the second scenario, Shannon Airport is considered, and below is a graph showing the subset of airports selected for this case:
Figure 19
This test will be performed to examine the association between the categorical variable "Motor_energy_type" for Ireland and Austria. The dataset used is "New passenger cars by type of motor energy" with the Eurostat code "road_eqr_carpda."
This model is highly sensitive to the frequency of the variables. I have presented two cases: one where we reject H0, and the second one where I manually changed values to accept H0. I will now illustrate how the categorical variable numbers will look for each case, with further analysis to follow in "3.3.1.3. Chi-squared test."
Scenario 1: Rejecting H0.
Figure 20
Scenario 2: Accepting H0.
Figure 21
For this test, I utilized the same dataset as for ANOVA. Some of the airports violated the assumptions of normality required for ANOVA. The advantage of using the Kruskal-Wallis test is that it does not require normality to perform the test. I will present two scenarios: one to accept H0 and another to reject H0.
Scenario 1: Accepting H0.
Figure 22
Scenario 2: Rejecting H0.
Figure 23
I used a new dataset for this test, "Passengers transported (Railway transport)", with the Eurostat code "rail_pa_total" because the data did not follow a normal distribution. This choice allowed me to demonstrate the test's capability to handle non-normally distributed data.
Scenario 1: Accepting H0.
Figure 24
Scenario 2: Rejecting H0.
Figure 25
The task is to determine the weekly LUAS average for the total number of passengers in the years 2019, 2020, 2021, and 2022. It is important to note that both LUAS lines (red and green) are in scope. The analysis will be conducted with a 90% confidence level, and here are the results:
Figure 26
For example, at a 90% confidence level, the average weekly number of LUAS passengers in 2019 lies between 453K and 472K. The remaining years are interpreted in the same way.
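A 90% interval of this kind can be sketched with the standard library. The example uses the normal approximation (mean ± z·s/√n), which is an assumption that is reasonable for roughly 52 weekly observations; the notebook may use Student's t instead, and the weekly counts below are invented.

```python
import math
import statistics


def mean_confidence_interval(values, confidence=0.90):
    """Normal-approximation CI for the mean: mean +/- z * s / sqrt(n)."""
    n = len(values)
    mean = statistics.fmean(values)
    std_err = statistics.stdev(values) / math.sqrt(n)
    # Two-sided interval: 90% confidence leaves 5% in each tail.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err


# Invented weekly passenger counts for one year (52 weeks).
weekly = [450_000 + 1_000 * (i % 7) for i in range(52)]
low, high = mean_confidence_interval(weekly)
print(f"90% CI for the weekly mean: {low:,.0f} to {high:,.0f}")
```

Raising the confidence level widens the interval, since a larger z multiplies the same standard error.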
Plotting the confidence intervals:
Figure 27
After performing confidence intervals, the next natural step is to verify if the weekly averages are the same for both the red and green LUAS lines. We will use a t-test (www.statisticslectures.com, n.d.).
Figure 28
Hypothesis:
H0: μ green line = μ red line.
H1: μ green line != μ red line.
Results:
Figure 29
Figure 30
At a 5% significance level, we fail to reject the null hypothesis; there is not enough evidence to conclude that the weekly mean values for the LUAS green line are different from those of the red line.
To perform this test, we compare Ireland with other EU countries across three scenarios, one per transport mode. The first scenario compares Ireland with Slovenia for buses, the second compares Ireland with Denmark for cars, and the third compares Ireland with Slovenia again, this time for trains.
Hypothesis:
H0: μ Ireland = μ EU Country.
H1: μ Ireland != μ EU Country.
Results:
Figure 31
Since the p-value is greater than alpha, we fail to reject the null hypothesis (H0). There is not enough evidence to conclude that there is a significant difference between the percentage average of passenger-kilometres for the BUS vehicle in Ireland and Slovenia.
Figure 32
As the p-value is less than alpha, we reject the null hypothesis (H0), providing sufficient evidence to conclude a significant difference in the percentage average of passenger-kilometres for the Car vehicle between Ireland and Denmark.
Figure 33
As the p-value is less than alpha, we reject the null hypothesis (H0), indicating sufficient evidence to conclude a significant difference in the percentage average of passenger-kilometres for the Train vehicle between Ireland and Slovenia.
Figure 34
In the first scenario, we examine Dublin, Zurich, and Copenhagen airports to verify whether the yearly average passenger numbers for the period from 2003 to 2022 are the same or not (Laerd Statistics, 2018).
To perform ANOVA, we need both Shapiro-Wilk and Levene tests to have p-values greater than 5% alpha.
Figure 35
Now we can perform ANOVA:
H0: μ IE_EIDW = μ CH_LSZH = μ DK_EKCH.
H1: at least two of the means differ from one another.
Result p-value = 0.889
Figure 36
There is no reason to reject the null hypothesis; at a 5% alpha, there is no evidence that the mean annual passengers carried (2003-2022) differ between Dublin, Zurich and Copenhagen airports.
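The check-then-test sequence described above (Shapiro-Wilk and Levene first, ANOVA second) can be sketched as follows. The data are synthetic draws from normal distributions, not the airport figures, and the helper name `anova_with_checks` is hypothetical.

```python
import numpy as np
from scipy import stats

ALPHA = 0.05


def anova_with_checks(groups, alpha=ALPHA):
    """Shapiro-Wilk per group, Levene across groups, then one-way ANOVA."""
    normal = all(stats.shapiro(g).pvalue > alpha for g in groups)
    equal_var = stats.levene(*groups).pvalue > alpha
    f_stat, p_value = stats.f_oneway(*groups)
    return {"normal": normal, "equal_var": equal_var, "F": f_stat, "p": p_value}


# Synthetic stand-in: three groups of 20 yearly values with the same mean.
rng = np.random.default_rng(0)
groups = [rng.normal(loc=25.0, scale=4.0, size=20) for _ in range(3)]
report = anova_with_checks(groups)

# Shifting two groups creates clearly different means for comparison.
shifted = [groups[0], groups[1] + 20, groups[2] + 40]
shifted_report = anova_with_checks(shifted)
print(report["p"], shifted_report["p"])
```

If either assumption flag comes back false, a rank-based alternative such as Kruskal-Wallis is the usual fallback.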
Second scenario: Shannon, Billund, and Treviso airports. We want to verify if the yearly average passenger numbers for the period 2003 to 2022 are the same or not.
Shapiro-Wilk and Levene tests:
Figure 37
ANOVA hypothesis:
H0: μ IE_EINN= μ IT_LIPH= μ DK_EKBI.
H1: at least two of the means differ from one another.
Result p-value = 0.04
Figure 38
We reject the null hypothesis; at a 5% alpha, the mean annual passengers carried (2003-2022) for Shannon, Treviso-Sant'Angelo and Billund airports are not all equal.
First Scenario:
Hypothesis:
H0: There is no significant difference between the observed and expected frequencies. Ireland and Austria are independent with no association or relationship.
Figure 39
H1: There is a significant difference between the observed and expected frequencies, indicating a non-independent relationship between Ireland and Austria.
Result p-value = 1.64e-14; we reject H0.
Figure 40
In the second scenario, I manually adjusted values for Austria to align frequencies more closely with those of Ireland; this model is highly sensitive to substantial differences between categorical variables.
Figure 41
After applying the chi-squared test, we obtained a p-value of 0.59, so we fail to reject H0.
Figure 42
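Both chi-squared scenarios can be reproduced in miniature with `scipy.stats.chi2_contingency`. The counts below are invented (rows are two countries, columns are three motor energy types); they only illustrate how strongly the result depends on how far the two frequency profiles diverge.

```python
from scipy.stats import chi2_contingency

# Invented counts: rows are countries, columns are motor energy types.
different_profiles = [
    [500, 300, 200],
    [200, 300, 500],
]
# Second row is exactly half of the first, so the profiles are proportional.
similar_profiles = [
    [500, 300, 200],
    [250, 150, 100],
]

chi2_diff, p_diff, dof_diff, expected_diff = chi2_contingency(different_profiles)
chi2_same, p_same, dof_same, expected_same = chi2_contingency(similar_profiles)

print(f"different profiles: chi2={chi2_diff:.2f}, p={p_diff:.3g}, dof={dof_diff}")
print(f"similar profiles:   chi2={chi2_same:.2f}, p={p_same:.3g}, dof={dof_same}")
```

With large counts, even modest differences in proportions produce a very small p-value, which matches the sensitivity noted in the text.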
Scenario 1:
We do not have normality for our samples (Xia, 2020):
Figure 43
Hypothesis:
H0: μ IE_EIKN= μ FR_LFBP= μ SE_ESGP.
H1: at least two of the means differ from one another.
Result, p-value = 0.23511.
Figure 44
We fail to reject H0; there is no evidence that the annual passengers carried (2003-2022) differ between Ireland West Knock, Pau Pyrenees, and Goteborg airports.
Scenario 2:
We do not have normality for our samples:
Figure 45
Hypothesis:
H0: μ IE_EIKY= μ DE_EDSB= μ PL_EPRZ.
H1: at least two of the means differ from one another.
Result, p-value = 8.48e-08
Figure 46
We reject H0; the annual passengers carried (2003-2022) for Kerry, Karlsruhe/Baden, and Rzeszow-Jasionka airports are not all the same.
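The two Kruskal-Wallis outcomes can be illustrated with `scipy.stats.kruskal`, which works on ranks and so does not need normal samples. The groups below are small invented sequences: interleaved values give similar rank sums, while separated values give very different ones.

```python
from scipy.stats import kruskal

# Interleaved values: each group spans the same range, so rank sums are close.
group_a = [1, 4, 7, 10, 13, 16, 19, 22]
group_b = [2, 5, 8, 11, 14, 17, 20, 23]
group_c = [3, 6, 9, 12, 15, 18, 21, 24]
similar = kruskal(group_a, group_b, group_c)

# Separated values: every value in one group outranks the previous group.
low_group = [1, 2, 3, 4, 5, 6, 7, 8]
mid_group = [21, 22, 23, 24, 25, 26, 27, 28]
high_group = [41, 42, 43, 44, 45, 46, 47, 48]
separated = kruskal(low_group, mid_group, high_group)

print(f"similar:   H={similar.statistic:.2f}, p={similar.pvalue:.3f}")
print(f"separated: H={separated.statistic:.2f}, p={separated.pvalue:.2e}")
```

Because only the ordering of values matters, the result is unchanged by outliers that would distort an ANOVA on the raw numbers.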
Scenario 1:
Checking normality:
Figure 47
Hypothesis:
H0: μ Ireland = μ Croatia.
H1: μ Ireland != μ Croatia.
Result p-value = 0.3068
Figure 48
As the p-value is greater than alpha, we fail to reject H0; there is no evidence of a significant difference between the average number of train passengers in Ireland and Croatia.
Scenario 2:
Checking normality:
Figure 49
Hypothesis:
H0: μ Ireland = μ Slovakia.
H1: μ Ireland != μ Slovakia.
Result p-value = 3.45e-06
Figure 50
We reject H0 as the p-value is lower than alpha; there is a significant difference between the average number of train passengers in Ireland and Slovakia.
The tests I have conducted above reveal interesting findings:
"Number of public transport journeys at highest level since the beginning of the pandemic" (BreakingNews.ie, 2022). Based on the confidence intervals, we can confirm that the number of passengers is recovering:
Figure 51
"Paris Charles De Gaulle recorded the highest number of air passengers" (ec.europa.eu, n.d.). "Dublin Airport Was EU's 11th Largest Airport in 2018" (DublinAirport, n.d.):
Figure 52
Analysing the means in the ANOVA section, we can see that these headlines are consistent with the values for "FR_LFPG" and "IE_EIDW".
The challenges faced included gathering the data and establishing the scenarios to perform the tests.
The choice of the dataset "TRAN_HV_PSMOD" and the selection of supervised ML models (Decision Tree, Random Forest, K-Nearest Neighbours, and Support Vector Machine) are purely matters of modelling. After numerous attempts, this combination has proven effective (Thanh Noi and Kappas, 2017).
We will be modelling the dataset to see how models react having these classes:
Figure 54
I paired Ireland with Hungary because this pairing gave the best performance for the selected ML models. Here are the results:
Figure 55
Overall the models perform well; however, for Gamma = 0.0001 they seem to be overfitted (Kilic, 2023), as the score is lower than for Gamma = 0.01 and 0.001:
Figure 56
To address the issue of overfitting (Ghosh, 2023), I enriched the dataset by quarterly weighting yearly values:
Figure 57
Figure 58
The results are as follows:
Figure 59
Figure 60
Now we have a better fit, with higher scores observed at Gamma = 0.0001.
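The Gamma comparison above comes from a grid search, which can be sketched with scikit-learn's `GridSearchCV`. The C and Gamma values mirror those mentioned in the text; everything else is an assumption, including the synthetic data from `make_classification`, the added `StandardScaler` step, and the 5-fold split.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the transport dataset (assumption).
X, y = make_classification(
    n_samples=200, n_features=6, n_informative=4, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Scaling inside the pipeline keeps each CV fold free of test-fold statistics.
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
param_grid = {
    "svc__C": [1, 10, 100, 1000],
    "svc__gamma": [0.01, 0.001, 0.0001],
}

search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

best_params = search.best_params_
test_accuracy = search.score(X_test, y_test)
print(best_params, round(search.best_score_, 4), round(test_accuracy, 4))
```

Comparing `search.best_score_` (cross-validated) with the held-out `test_accuracy` gives a quick indication of whether the selected parameters generalise.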
Ireland will be our target variable for each transportation method. In the first attempt with yearly data, we obtained the following results:
Figure 61
After quarterly enrichment, increasing the number of rows from 32 to 128, we obtained the following results:
Figure 62
Linear regression estimation has performed much better with enriched data.
The dataset is split by mode of transportation, with Ireland as the reference.
In the first attempt, with quarterly data, we get the following results:
Figure 63
Figure 64
To improve the results, we are going to enhance the dataset by breaking down each year into monthly values. The results are as follows:
Figure 65
Figure 66
KMeans and PCA have performed better with more data (L, 2020).
We are going to conduct sentiment analysis using Ryanair reviews and a set of tweets related to USA airlines. Results:
Figure 67
The tweets dataset is larger than the Ryanair one, which explains the greater number of correctly classified inputs:
Figure 68
Figure 69
Model accuracy is 73% for Ryanair and 75% for USA Airlines. The ROC results are as follows:
Figure 70
USA Airlines seem to be classifying bad inputs better, while the other two are closer to each other. Let us test the classifier by adding reviews:
Figure 71
Result:
Figure 72
Ryanair's sentiment analysis performs well even though it has fewer inputs than USA Airlines. This difference could be attributed to the collection of reviews from Tripadvisor, where I ensured capturing opinions across good, bad, and neutral categories.
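To illustrate the idea of training on labelled reviews and then testing the classifier with new ones, here is a minimal sketch. It assumes a TF-IDF representation with logistic regression, which is not necessarily the classifier used in this report, and the reviews are invented examples rather than Tripadvisor data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented training reviews (NOT the report's data), three per category.
reviews = [
    "Great flight, friendly crew and on time",
    "Excellent value and a smooth boarding process",
    "Very comfortable seats and helpful staff",
    "Terrible delay and rude staff at the gate",
    "Lost my luggage and nobody helped, awful",
    "Worst experience, flight cancelled without notice",
    "The flight was okay, nothing special",
    "Average service, arrived roughly on schedule",
    "Standard low cost flight, as expected",
]
labels = ["good"] * 3 + ["bad"] * 3 + ["neutral"] * 3

# Text is vectorised with TF-IDF and classified with logistic regression.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(reviews, labels)

# New, unseen reviews added to test the classifier.
new_reviews = [
    "The crew was friendly and the flight was on time",
    "Rude staff and a terrible delay",
]
predictions = model.predict(new_reviews)
for text, label in zip(new_reviews, predictions):
    print(f"{label:8s} <- {text}")
```

With so few training examples the predictions are only indicative; balanced coverage of the good, bad and neutral categories matters more than raw volume, as noted above.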
Let us compare each model after enriching the dataset.
Decision Trees CM:
Figure 73
Random Forest:
Figure 74
KNN:
Figure 75
GridSearchCV CM:
Figure 76
Gamma Accuracy:
Figure 77
Accuracy Table:
Figure 78
By adding more values, models have performed better, with the exception of RF. However, we solved overfitting in the hyperparameter tuning phase, achieving higher accuracy at a lower level of gamma.
Following the same approach as before, we increased the dataset row count by breaking years down into quarters. Here are the results:
Figure 79
Summary table:
Figure 80
Feature selection is closer to optimal and accuracy has improved after enriching the dataset. Undoubtedly, this method requires a larger dataset to perform well.
PCA Variance:
Figure 81
PCA Scatter Plot:
Figure 82
PCA Heatmap:
Figure 83
Elbow Method:
Figure 84
Silhouette Score:
Figure 85
Tables PCA and Silhouette Score:
Figure 86
Figure 87
After enriching the data, the first principal component explains almost 97% of the variability for the CAR and BUS datasets, while the TRN dataset exhibits 76.86% variability on the first component. This demonstrates a clear improvement.
The silhouette score has improved, indicating that the clusters are now closer to being well-defined (closer to +1). Additionally, all plots demonstrate a clear improvement after enlarging the dataset.
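The two metrics discussed above, the variance explained by each principal component and the silhouette score, can be computed as in the sketch below. It runs on synthetic blobs from scikit-learn rather than the CAR, BUS and TRN datasets, and the range of cluster counts is an assumption.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), NOT the report's datasets.
X, _ = make_blobs(n_samples=384, centers=3, n_features=5, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# Share of total variance captured by each of the first two components.
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
explained = pca.explained_variance_ratio_
print("explained variance ratio:", explained.round(4))

# Silhouette score for several cluster counts; values closer to +1
# indicate better separated, more compact clusters.
scores = {}
for k in range(2, 7):
    cluster_labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_pca)
    scores[k] = silhouette_score(X_pca, cluster_labels)

best_k = max(scores, key=scores.get)
print("silhouette by k:", {k: round(s, 3) for k, s in scores.items()})
print("best k:", best_k)
```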
Undoubtedly, this was the most challenging part of the assignment. However, I discovered the Eurostat website, which provides extensive content on transportation data. Additionally, the CSO offers interesting datasets related to the Irish transportation sector. Other platforms utilized included GitHub, Kaggle, and TripAdvisor.
A positive aspect of this research is that I discovered platforms such as Eurostat, which even has a Python package to download its datasets. On the negative side, the length of transportation datasets I found was a challenge for the machine learning part. There was not enough data to create accurate and consistent models. However, I found a solution to overcome this issue.
In terms of licenses, we are mostly covered by Creative Commons, allowing us to use these datasets. Only for Tripadvisor Ryanair reviews, I do not have licenses. After thorough research, I believe I am not infringing any law, as I am not collecting any personal data. I am compliant with GDPR (Your Europe, 2019), and Tripadvisor does not state that you cannot collect their reviews for research purposes.
Figure 88
One of the issues was that Eurostat uses its own nomenclature, and each dataset needed to be cross-checked to obtain accurate numbers. However, the main challenge was adapting datasets for the statistical models. EDA for this section can be found in "CA2_Data_Visualisation_Code_sba23021.ipynb" and some explanation in the annex (8.3.1). Nevertheless, EDA is also prevalent throughout the entire assignment. Here is a summary of the EDA performed:
Figure 89
For ML I utilized "TRAN_HV_PSMOD" due to the versatility this dataset provides for modeling both supervised and unsupervised learning, as well as for cross-validation and feature reduction. Additionally, the dataset enabled a comparison of Ireland's modal split transport with that of other European countries.
The challenge lay in improving accuracy. The solution involved breaking down years into quarters for both supervised and cross-validation/feature reduction. For unsupervised learning, I subdivided years into months to leverage the increased data availability. Undoubtedly, the models performed better with the additional data.
All EDA for ML can be seen in "CA2_ML_Code_sba23021.ipynb". I also created a more concise version "CA2_Data_Visualisation_Code_sba23021.ipynb", which emphasizes the relevant aspects of the EDA. Additional information is available in the annex (8.3.2).
The rationale and visualization choices can be seen in the table below, as follows:
As mentioned above, I developed a dashboard in "CA2_Data_Visualisation_Code_sba23021.ipynb" outlining the relevant results for the ML models. To comply with the point "Modern Transport planning has a great dependence... web based", I adapted the dashboard to a Streamlit (Parker, 2023) web dashboard. You can visit it at: https://datavisualisationmoderntransportdashstudentsba23021.streamlit.app/.
"CA2_Data_Visualisation_Code_sba23021.ipynb" dashboard:
Streamlit dashboard:
Complete dashboards can also be viewed in the annex (8.3.3).
In conclusion, the assignment's most formidable phase involved the collection and adaptation of data for diverse methodologies. While my findings may not offer a panacea for the challenges faced by the Irish Transportation sector, I am confident that the statistical tests and Machine Learning models employed are presented in an accessible and pragmatic manner. This journey underscored the intricacies of data integration and analysis, highlighting the importance of methodological clarity. As I reflect on the complexities encountered, I recognize that this study contributes to a nuanced understanding of the sector. While not providing all-encompassing solutions, it lays the groundwork for informed discussions and potential pathways toward addressing the intricacies of the Irish Transportation landscape. I believe that the transparent modeling and interpretability fostered in this study pave the way for future endeavors in unraveling the challenges inherent in the field of transportation.
Word Count: 3,297 words
Programming:
Goel, A. (2020). How to parse JSON data with Python Pandas? [online] Medium. Available at: https://towardsdatascience.com/how-to-parse-json-data-with-python-pandas-f84fbd0b1025 [Accessed 23 Dec. 2023].
Oyama, F.K. (2022). How to Connect MySQL Database in Jupyter Notebook. [online] Faith Kinkema Oyama. Available at: https://kema.hashnode.dev/how-to-connect-mysql-database-in-jupyter-notebook [Accessed 7 Jan. 2024].
dev.mysql.com. (n.d.). MySQL :: Getting Started with MySQL. [online] Available at: https://dev.mysql.com/doc/mysql-getting-started/en/.
van Rossum, G., Warsaw, B. and Coghlan, N. (2001). PEP 8 -- Style Guide for Python Code | peps.python.org. [online] peps.python.org. Available at: https://peps.python.org/pep-0008/.
GeeksforGeeks. (2020). Data Manipulation in Python using Pandas. [online] Available at: https://www.geeksforgeeks.org/data-manipulattion-in-python-using-pandas/.
www.statisticslectures.com. (n.d.). Confidence Intervals for Independent Samples t-Test. [online] Available at: http://www.statisticslectures.com/topics/ciindependentsamplest/.
Laerd Statistics (2018). One-way ANOVA - An introduction to when you should run this test and the test hypothesis. [online] Laerd.com. Available at: https://statistics.laerd.com/statistical-guides/one-way-anova-statistical-guide.php.
Xia, Y. (2020). Kruskal Wallis Test - an overview | ScienceDirect Topics. [online] www.sciencedirect.com. Available at: https://www.sciencedirect.com/topics/medicine-and-dentistry/kruskal-wallis-test.
BreakingNews.ie. (2022). Number of public transport journeys at highest level since the beginning of the pandemic. [online] Available at: https://www.breakingnews.ie/ireland/number-of-public-transport-journeys-at-highest-level-since-the-beginning-of-the-pandemic-1304260.html [Accessed 11 Dec. 2023].
ec.europa.eu. (n.d.). Air transport statistics. [online] Available at: https://ec.europa.eu/eurostat/statistics-explained/index.php?title=Air_transport_statistics#:~:text=In%202022%2C%20820%20million%20people%20in%20the%20EU%20travelled%20by%20air.&text=In%202022%2C%20Paris%20Charles%20De [Accessed 11 Dec. 2023].
DublinAirport. (n.d.). Dublin Airport Was EU's 11th Largest Airport in 2018. [online] Available at: https://www.dublinairport.com/latest-news/2019/05/31/dublin-airport-was-eu-s-11th-largest-airport-in-2018.
docs.aws.amazon.com. (n.d.). Machine learning lifecycle - Machine Learning Best Practices in Healthcare and Life Sciences. [online] Available at: https://docs.aws.amazon.com/whitepapers/latest/ml-best-practices-healthcare-life-sciences/machine-learning-lifecycle.html [Accessed 7 Jan. 2024].
Thanh Noi, P. and Kappas, M. (2017). Comparison of Random Forest, k-Nearest Neighbor, and Support Vector Machine Classifiers for Land Cover Classification Using Sentinel-2 Imagery. Sensors (Basel, Switzerland), [online] 18(1), p.18. doi: https://doi.org/10.3390/s18010018.
Kilic, D.K. (2023). Tuning Hyperparameters in Machine Learning Models 101. [online] Medium. Available at: https://medium.com/@deniz.kenan.kilic/changing-hyperparameters-in-machine-learning-models-101-d969f51fe414 [Accessed 7 Jan. 2024].
Ghosh, B. (2023). What Matters More - Data Size or Model Size. [online] Medium. Available at: https://medium.com/@bijit211987/what-matters-more-data-size-or-model-size-31cb004d7209.
L, P. (2020). Intro to PCA, K-means Clustering | Unsupervised Learning Algorithms. [online] Medium. Available at: https://towardsdatascience.com/into-to-pca-k-means-clustering-unsupervised-learning-algorithms-5cc5acea274d.
Your Europe (2019). Data protection. [online] Your Europe - Business. Available at: https://europa.eu/youreurope/business/dealing-with-customers/data-protection/data-protection-gdpr/index_en.htm.
Parker, K. (2023). Creating a Streamlit-Powered Data Dashboard: A Step-by-Step Guide. [online] Medium. Available at: https://medium.com/@parktwin2/creating-a-streamlit-powered-data-dashboard-a-step-by-step-guide-e24fa016eadd [Accessed 7 Jan. 2024].
I used MySQL Workbench to simulate the process of gathering data from a MySQL database.
Steps:
-
For schema creation, click on the database icon, name it "eurostat" in this case, and then click on Apply:
Figure 99
-
Click Apply:
Figure 100
-
Click finish:
Figure 101
-
Our schema eurostat will appear:
Figure 102
-
To create the table within the schema for inserting "TRAN_HV_PSMOD.csv", load the "TRAN_HV_PSMOD_table_creation.sql" file.
Click on File and select Open SQL Script:
Figure 103
-
Execute the code from "TRAN_HV_PSMOD_table_creation.sql":
Figure 104
-
After successful execution, the table will appear under the "eurostat" schema. Please refresh to view the table:
Figure 105
-
Hover over the table, right-click, and select Table Data Import Wizard:
Figure 106
-
Browse and locate "TRAN_HV_PSMOD.csv", then click on Open and Next:
Figure 107
-
Next:
Figure 108
-
Next:
Figure 109
-
Next:
Figure 110
-
Next:
Figure 111
-
Finish:
Figure 112
-
Checking the results, we should have the same number of columns and the same row count as the CSV file:
Figure 113
Figure 114
-
Connecting to MySQL and fetching the table "tran_hv_psmod":
Figure 115
This is how we can connect to a MySQL database and pull data from there.
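The connect, query and fetch pattern in the steps above can be sketched with Python's DB-API. To keep the example runnable without a server, it swaps in an in-memory SQLite database as a stand-in for MySQL; with MySQL the connection object would come from a MySQL driver instead and the placeholder style would be `%s` rather than `?`. The column names and rows are hypothetical, not the real "TRAN_HV_PSMOD" schema.

```python
import sqlite3

# Stand-in for the MySQL "eurostat" schema: an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical, simplified table layout (assumption).
cur.execute(
    "CREATE TABLE tran_hv_psmod (geo TEXT, vehicle TEXT, year INTEGER, value REAL)"
)
rows = [
    ("IE", "CAR", 2019, 82.1),
    ("IE", "BUS", 2019, 14.9),
    ("IE", "TRN", 2019, 3.0),
]
cur.executemany("INSERT INTO tran_hv_psmod VALUES (?, ?, ?, ?)", rows)
conn.commit()

# Fetch the whole table, as in the final step above.
cur.execute("SELECT * FROM tran_hv_psmod")
fetched = cur.fetchall()
column_names = [description[0] for description in cur.description]

# Sanity check mirroring the import check: same column and row counts.
print(len(column_names), "columns,", len(fetched), "rows")
conn.close()
```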
Testing:
The choice of Hungary, along with Ireland, for the Supervised Learning models was not a random decision. I tested the models with each possible combination of Ireland and another country, and the pair of Ireland and Hungary offered the best outcome. In "CA2_Programming_Code_sba23021.ipynb" I implemented the models for Ireland and the Netherlands. Here are the results:
Results for Ireland and the Netherlands:
Figure 116
Results for Ireland and Hungary:
Figure 117
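The pair-by-pair testing described above can be sketched as a loop that scores each candidate partner country with cross-validation. Everything here is illustrative: the data are randomly generated, the function `score_pair` is hypothetical, a single Decision Tree stands in for the four models, and using the transport mode as the class label is an assumption about the setup.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic modal-split style data (assumption), NOT the report's dataset.
modes = np.repeat(["BUS", "CAR", "TRN"], 40)
base_share = {"BUS": 12.0, "CAR": 80.0, "TRN": 8.0}
level = np.array([base_share[m] for m in modes])

data = pd.DataFrame({"vehicle": modes, "IE": level + rng.normal(0, 2, modes.size)})
for partner, noise in {"HU": 1.0, "NL": 3.0, "DK": 5.0}.items():
    data[partner] = level + rng.normal(0, noise, modes.size)


def score_pair(df: pd.DataFrame, partner: str) -> float:
    """Mean 5-fold accuracy using Ireland and one partner country as features."""
    model = DecisionTreeClassifier(random_state=0)
    return cross_val_score(model, df[["IE", partner]], df["vehicle"], cv=5).mean()


pair_scores = {partner: score_pair(data, partner) for partner in ["HU", "NL", "DK"]}
best_partner = max(pair_scores, key=pair_scores.get)
print(pair_scores, "best:", best_partner)
```

Scoring every pair with the same cross-validation setup keeps the comparison fair, since each pair sees identical folds and model settings.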
Optimization:
For cross validation, feature reduction and unsupervised ML models I enlarged the datasets to optimize accuracy scores.
Initial data frame for cross validation and feature reduction:
Figure 118
After data enrichment:
Figure 119
We went from 32 rows to 128, which helped ML performance.
Initial data frame for unsupervised ML models:
Figure 120
After data enrichment:
Figure 121
We went from 32 rows to 384, which helped ML performance.
The full implementation of this code can be found in "CA2_ML_Code_sba23021.ipynb", and a concise implementation of the quarterly and monthly functions for data enrichment can be found in "CA2_Programming_Code_sba23021.ipynb".
A summary of all statistical models and their results is shown below:
Figure 122
The Python calculations for each of the statistical tests can be seen as follows:
Year 2019:
Figure 123
Year 2020:
Figure 124
Year 2021:
Figure 125
Year 2022:
Figure 126
Summary:
Figure 127
Figure 128
Year 2019:
Figure 129
Year 2020:
Figure 130
Year 2021:
Figure 131
Year 2022:
Figure 132
Summary:
Figure 133
Figure 134
HT Test Ireland & Slovenia BUS % average on passenger-kilometres:
Figure 135
HT Test Ireland & Denmark CAR % average on passenger-kilometres:
Figure 136
HT Test Ireland & Slovenia TRN % average on passenger-kilometres:
Figure 137
Summary:
Figure 138
First Scenario Dublin Airport ANOVA test:
Summary before ANOVA:
![Summary before ANOVA, Dublin Airport](img/image142.png)
Figure 142
We fail to reject H0.
Second Scenario Shannon Airport ANOVA test:
Summary before ANOVA:
![Summary before ANOVA, Shannon Airport](img/image147.png)
Figure 147
We reject H0.
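For reference, a one-way ANOVA of this kind can be run with SciPy as sketched below. The three groups are hypothetical passenger figures, not the Dublin or Shannon Airport data, and alpha = 0.05 is an assumption.

```python
from scipy import stats

ALPHA = 0.05  # assumed significance level

# Hypothetical monthly passengers (millions) for three years, NOT the report's data.
group_a = [2.4, 2.6, 2.5, 2.7, 2.5, 2.6]
group_b = [0.6, 0.5, 0.7, 0.6, 0.8, 0.5]
group_c = [1.4, 1.5, 1.3, 1.6, 1.5, 1.4]

# H0: all group means are equal. H1: at least one mean differs.
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)

decision = "reject H0" if p_value < ALPHA else "fail to reject H0"
print(f"F = {f_stat:.2f}, p = {p_value:.3g}: {decision}")
```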
Scenario 1:
Figure 149
We reject H0.
Scenario 2:
Figure 150
We fail to reject H0.
Scenario 1:
Step 1, Shapiro-Wilk test:
Figure 151
Summary:
Figure 152
Step 2, Kruskal-Wallis test:
Figure 153
We fail to reject H0.
Scenario 2:
Step 1, Shapiro-Wilk test:
Figure 154
Summary:
Figure 155
Step 2, Kruskal-Wallis test:
Figure 156
We reject H0.
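The two-step procedure above (check normality first, then choose the test) can be sketched as a small helper. The function `compare_groups` is hypothetical and the samples are invented. The fallback to one-way ANOVA when every group looks normal is an assumption added for illustration, since the scenarios above go straight to Kruskal-Wallis.

```python
from scipy import stats

ALPHA = 0.05  # assumed significance level


def compare_groups(groups, alpha=ALPHA):
    """Step 1: Shapiro-Wilk on each group. Step 2: ANOVA or Kruskal-Wallis."""
    normality_p = [stats.shapiro(group)[1] for group in groups]
    if all(p > alpha for p in normality_p):
        # No evidence against normality: the parametric test is applicable.
        test_name = "anova"
        _, p_value = stats.f_oneway(*groups)
    else:
        # At least one group departs from normality: use the rank-based test.
        test_name = "kruskal"
        _, p_value = stats.kruskal(*groups)
    return test_name, p_value


# Invented samples (NOT the report's data); the first one is strongly skewed.
samples = [
    [1, 1, 1, 2, 2, 3, 9, 40],
    [5, 6, 7, 6, 5, 8, 7, 6],
    [12, 15, 11, 14, 13, 16, 12, 15],
]

test_name, p_value = compare_groups(samples)
decision = "reject H0" if p_value < ALPHA else "fail to reject H0"
print(test_name, f"p = {p_value:.3g}", decision)
```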
Scenario 1:
Step 1, checking normality:
Figure 157
Summary:
Figure 158
Step 2, Mann-Whitney U test:
Figure 159
We fail to reject H0.
Scenario 2:
Step 1, checking normality:
Figure 160
Summary:
Figure 161
Step 2, Mann-Whitney U test:
Figure 162
We reject H0.
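A Mann-Whitney U test for two independent samples can be run as in this sketch. The samples are invented and fully separated so the outcome is easy to follow; a two-sided alternative and alpha = 0.05 are assumptions.

```python
from scipy import stats

ALPHA = 0.05  # assumed significance level

# Invented samples (NOT the report's data); every value in sample_x
# is smaller than every value in sample_y.
sample_x = [1, 2, 3, 4, 5, 6, 7, 8]
sample_y = [11, 12, 13, 14, 15, 16, 17, 18]

# H0: the two distributions are the same. Two-sided alternative.
u_stat, p_value = stats.mannwhitneyu(sample_x, sample_y, alternative="two-sided")

decision = "reject H0" if p_value < ALPHA else "fail to reject H0"
print(f"U = {u_stat}, p = {p_value:.3g}: {decision}")
```

Because the test is rank-based, it does not require the normality that the Step 1 check is looking for.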
EDA Confidence Intervals.
We are dropping the "All Luas lines" aggregate and keeping only the red and green lines for modeling purposes:
Figure 163
We are also dropping the year 2023 due to missing data; modeling under such conditions is not feasible.
Figure 164
We have many unnecessary columns in this dataset for modeling purposes:
Figure 165
Organizing numbers for each modal split transport to fit the scenarios:
Figure 166
This is quite a large dataset, which required dropping a significant number of columns:
Figure 167
We had to drop some columns for this dataset:
Figure 168
There are some columns in this dataset that are not needed:
Figure 169
Quarterly enrichment Supervised Learning:
Figure 170
Figure 171
After enrichment, the second confusion matrix shows a significant improvement in classification:
Figure 172
After enlarging the dataset, gamma = 0.0001 fits better. The best hyperparameters changed from C: 1000, gamma: 0.01 (90.88%) to C: 1000, gamma: 0.001 (91.04%):
Figure 173
Quarterly enrichment Cross Validation and Feature Reduction:
Figure 174
Figure 175
After enrichment, the second confusion matrix shows a significant improvement in classification:
Figure 176
Monthly enrichment Unsupervised Learning:
Figure 177
Figure 178
Visually, we can observe improved results after breaking down the data monthly:
PCA Variance:
Figure 179
PCA Scatter Plot:
Figure 180
PCA Heatmap:
Figure 181
Elbow Method:
Figure 182
Silhouette Score:
Figure 183
First, I created a dashboard in the Jupyter notebook "CA2_Data_Visualisation_Code_sba23021.ipynb". Let us explore the different graphs it contains. You can choose among various options from the dropdown menu, as follows:
Figure 184
Supervised Learning:
Figure 185
Cross Validation and Feature Reduction:
Figure 186
Unsupervised Learning Silhouette Score:
Figure 187
Unsupervised Learning PCA Variance:
Figure 188
Sentiment Analysis:
Figure 189
We also have a Streamlit version that can be shared online, where we can select options from the dropdown menu:
Figure 190
Supervised Learning:
Figure 191
Cross Validation and Feature Reduction:
Figure 192
Unsupervised Learning Silhouette Score:
Figure 193
Unsupervised Learning PCA Variance:
Figure 194
Sentiment Analysis:
Figure 195
The above dashboard can be visited at: https://datavisualisationmoderntransportdashstudentsba23021.streamlit.app/
Copyright (c) 2024 josericodata. This project is made available under the MIT License - see the LICENSE file for more details.