This repository contains a data analysis project focused on examining air quality data from various geographical regions. By analyzing this data, we aim to identify areas of improvement in air quality, track air quality trends, and cluster regions with similar air quality patterns.
The primary goal of this project is to analyze air quality data from sensor readings across different regions, calculate the Air Quality Index (AQI), and identify trends in air quality improvements. Using clustering, we categorize geographical regions based on their air quality data and visually represent the findings through histograms and geographical mappings.
Datasets:
- Averaged data from the last 24 hours for each sensor: Visit Here
- Averaged data from the last 5 minutes for each sensor (for testing): Visit Here
Reports:
- Report.pdf: A comprehensive report detailing the analysis, visualizations, and insights.
Code:
- code.py: This script performs the entire analysis, from data ingestion and preprocessing to visualization and reporting.
Shapefiles:
- ne_110m_admin_0_countries.shp: The geometric data for the countries.
- ne_110m_admin_0_countries.shx: The index of the geometric data.
- ne_110m_admin_0_countries.dbf: Attribute data related to the countries (such as names and codes).
- ne_110m_admin_0_countries.prj: The coordinate reference system for the shapefile.
- ne_110m_admin_0_countries.cpg: Character encoding information.
- Data Acquisition: Fetches 24-hour air quality data and averaged data from the last 5 minutes for each sensor.
- Data Cleaning and Transformation: Prepares data using PySpark, ensuring correct formats and handling of missing values.
- Air Quality Index (AQI) Calculation: Computes AQI based on sensor data and classifies regions accordingly.
- Trend Analysis: Calculates daily AQI and compares trends to highlight improvements in air quality.
- K-Means Clustering: Groups regions into clusters based on geographical coordinates.
- Visualizations: Presents findings through histograms and geographical mappings.
- Top Countries and Regions: Highlights the countries and regions with the strongest air quality improvements.
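The AQI calculation step above can be sketched as follows. This is a minimal, hypothetical example assuming the standard US EPA piecewise-linear breakpoint formula for 24-hour PM2.5; the actual pollutants, breakpoints, and classification rules used in code.py may differ.

```python
# (C_lo, C_hi, I_lo, I_hi) breakpoints for 24-hour PM2.5 in ug/m3
# (US EPA scale -- an assumption, not necessarily what code.py uses).
PM25_BREAKPOINTS = [
    (0.0, 12.0, 0, 50),
    (12.1, 35.4, 51, 100),
    (35.5, 55.4, 101, 150),
    (55.5, 150.4, 151, 200),
    (150.5, 250.4, 201, 300),
    (250.5, 350.4, 301, 400),
    (350.5, 500.4, 401, 500),
]

def pm25_to_aqi(conc: float) -> int:
    """Linearly interpolate the AQI within the matching breakpoint band.

    Note: real implementations first truncate the concentration to
    0.1 ug/m3 so it always lands inside a band.
    """
    for c_lo, c_hi, i_lo, i_hi in PM25_BREAKPOINTS:
        if c_lo <= conc <= c_hi:
            return round((i_hi - i_lo) / (c_hi - c_lo) * (conc - c_lo) + i_lo)
    raise ValueError(f"PM2.5 concentration out of range: {conc}")
```

Regions can then be classified (Good, Moderate, Unhealthy, ...) by bucketing the resulting index.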
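The daily trend comparison can likewise be sketched in a few lines. This example uses pandas and invented sample readings purely for illustration; the repository's pipeline performs the equivalent grouping in PySpark.

```python
# Hypothetical sketch: average AQI per region per day, then flag regions
# whose latest daily AQI dropped versus the previous day (improvement).
import pandas as pd

readings = pd.DataFrame({
    "region": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "timestamp": pd.to_datetime([
        "2024-05-01 06:00", "2024-05-01 18:00",
        "2024-05-02 06:00", "2024-05-02 18:00",
    ] * 2),
    "aqi": [80, 90, 60, 70, 40, 50, 55, 65],
})

daily = (readings
         .assign(day=readings["timestamp"].dt.date)
         .groupby(["region", "day"])["aqi"].mean()
         .reset_index())

# Day-over-day change within each region; negative means improving air.
daily["change"] = daily.groupby("region")["aqi"].diff()
improving = daily.dropna(subset=["change"]).query("change < 0")["region"].unique()
```

Here region A improves (daily mean falls from 85 to 65) while region B worsens, so only A is flagged.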
- Python: Core programming language for data processing and analysis.
- PySpark: For large-scale data processing and analysis.
- Geopandas: For handling geographical data.
- Matplotlib & Seaborn: For visualizations.
- Requests: For fetching data from APIs.
- KMeans Clustering: For geographical clustering of regions.
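The geographical clustering step can be sketched as below. The coordinates here are invented sample points, and scikit-learn's KMeans is an assumption; code.py may instead use PySpark MLlib's implementation.

```python
# Hypothetical sketch: group sensor locations into clusters by
# (latitude, longitude) with k-means.
import numpy as np
from sklearn.cluster import KMeans

coords = np.array([
    [48.85, 2.35],    # Paris (sample point)
    [52.52, 13.40],   # Berlin
    [40.71, -74.01],  # New York
    [34.05, -118.24], # Los Angeles
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(coords)
labels = km.labels_  # cluster id per sensor location
```

With two clusters, the two European points end up together and the two US points end up together. Note that k-means on raw latitude/longitude treats degrees as planar distances, which is a common simplification for coarse regional grouping.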
- Whenever you run this project, please ensure that the shapefiles are kept in the same directory as the code.py file for proper execution of the geographical visualizations.
- Tashfeen Abbasi
- Laiba Mazhar
For any questions or suggestions, feel free to reach out at abbasitashfeen7@gmail.com.