This project analyzes air quality data across regions to identify improvement areas, track trends, and classify similar regions using clustering. Leveraging PySpark, it processes sensor data, calculates Air Quality Index (AQI), and visualizes results with histograms and geographic maps to highlight areas with good air quality.

tashi-2004/Geospatial-Air-Quality-Analysis-with-Apache-Spark

Air-Quality-Improvement-Data-Analysis

This repository contains a data analysis project that examines air quality data from multiple geographical regions. The analysis aims to identify where air quality is improving, track air quality trends over time, and cluster regions with similar air quality patterns.

Project Overview

The primary goal of this project is to analyze air quality data from sensor readings across different regions, calculate Air Quality Index (AQI), and identify trends in air quality improvements. Using clustering, we categorize geographical regions based on air quality data, and visually represent findings through histograms and geographical mappings.

Files Included

  1. Datasets:

    • Averaged Data from last 24 hours for each sensor: Visit Here
    • Averaged Data from last 5 minutes for each sensor (for testing): Visit Here
  2. Reports:

    • Report.pdf: A comprehensive report detailing the analysis, visualizations, and insights.
  3. Code:

    • code.py: This script performs the entire analysis, from data ingestion and preprocessing to visualization and reporting.
  4. Shapefiles:

    • ne_110m_admin_0_countries.shp: The geometric data for the countries.
    • ne_110m_admin_0_countries.shx: The index of the geometric data.
    • ne_110m_admin_0_countries.dbf: Attribute data related to the countries (such as names and codes).
    • ne_110m_admin_0_countries.prj: The coordinate reference system for the shapefile.
    • ne_110m_admin_0_countries.cpg: Character encoding information.

Features

  1. Data Acquisition: Fetches per-sensor air quality readings averaged over the last 24 hours, plus readings averaged over the last 5 minutes for testing.
  2. Data Cleaning and Transformation: Prepares data using PySpark, ensuring correct formats and handling of missing values.
  3. Air Quality Index (AQI) Calculation: Computes AQI based on sensor data and classifies regions accordingly.
  4. Trend Analysis: Calculates daily AQI and compares trends to highlight improvements in air quality.
  5. K-Means Clustering: Groups regions into clusters based on geographical coordinates.
  6. Visualizations:
    • Histogram of longest streaks of good air quality. 3
    • Geographical map displaying air quality data points. 4
  7. Top Countries and Regions:
    • Lists top 10 countries with the best air quality. 1
    • Lists top 50 regions with the best air quality. 2a 2b
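
The AQI calculation and the "longest streak of good air quality" metric listed above can be sketched in plain Python. The README does not say which pollutant or breakpoint table code.py uses, so this sketch makes two assumptions: the input is a PM2.5 concentration in µg/m³, and the AQI follows the US EPA PM2.5 breakpoints that applied before the 2024 revision. The function names `pm25_to_aqi` and `longest_good_streak`, and the "good means AQI of 50 or lower" threshold, are illustrative and not taken from the project.

```python
import math

# ASSUMPTION: US EPA PM2.5 breakpoints (24-hour average, µg/m³), pre-2024 table.
# Each row is (conc_low, conc_high, aqi_low, aqi_high, category).
PM25_BREAKPOINTS = [
    (0.0, 12.0, 0, 50, "Good"),
    (12.1, 35.4, 51, 100, "Moderate"),
    (35.5, 55.4, 101, 150, "Unhealthy for Sensitive Groups"),
    (55.5, 150.4, 151, 200, "Unhealthy"),
    (150.5, 250.4, 201, 300, "Very Unhealthy"),
    (250.5, 350.4, 301, 400, "Hazardous"),
    (350.5, 500.4, 401, 500, "Hazardous"),
]


def pm25_to_aqi(concentration):
    """Return (aqi, category) for a PM2.5 concentration, or None above the table."""
    if concentration < 0:
        raise ValueError("concentration must be non-negative")
    # Truncate to one decimal place; the inner round() guards against
    # binary floating-point error such as 12.1 * 10 == 120.99999999999999.
    c = math.floor(round(concentration * 10, 6)) / 10
    for c_lo, c_hi, i_lo, i_hi, category in PM25_BREAKPOINTS:
        if c_lo <= c <= c_hi:
            # Linear interpolation inside the matching breakpoint interval.
            aqi = (i_hi - i_lo) / (c_hi - c_lo) * (c - c_lo) + i_lo
            return round(aqi), category
    return None  # concentration is beyond the last breakpoint


def longest_good_streak(daily_aqi, threshold=50):
    """Length of the longest run of consecutive days with AQI <= threshold.

    ASSUMPTION: "good" air quality means AQI <= 50; missing days (None)
    break a streak.
    """
    longest = current = 0
    for value in daily_aqi:
        if value is not None and value <= threshold:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


if __name__ == "__main__":
    print(pm25_to_aqi(25.0))
    print(longest_good_streak([40, 45, 60, 30, 20, 10, 80]))
```

In a PySpark pipeline the same interpolation would typically be applied per row, for example through a user-defined function or a chain of column expressions, and the streak lengths per region are the values that would feed the histogram.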

Technologies Used

  • Python: Core programming language for data processing and analysis.
  • PySpark: For large-scale data processing and analysis.
  • Geopandas: For handling geographical data.
  • Matplotlib & Seaborn: For visualizations.
  • Requests: For fetching data from APIs.
  • KMeans Clustering: For geographical clustering of regions.
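
To make the geographical clustering step concrete, here is a minimal plain-Python sketch of k-means (Lloyd's algorithm) on latitude and longitude pairs. It is an illustration only: the README does not show how code.py runs K-Means, and a PySpark project would more likely rely on a library implementation. The function name `kmeans`, its parameters, and the sample coordinates are all assumptions made for this example.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Cluster (latitude, longitude) tuples with Lloyd's algorithm.

    Uses plain Euclidean distance on degrees, which ignores the Earth's
    curvature and the longitude wrap-around at +/-180.
    Returns (labels, centroids).
    """
    rng = random.Random(seed)
    # Initialise centroids from k distinct input points.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = (
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                )
    return labels, centroids


if __name__ == "__main__":
    # Hypothetical sensor locations: three in Europe, three in Australia.
    sensors = [
        (52.5, 13.4), (52.4, 13.1), (52.6, 13.3),
        (-33.9, 151.2), (-33.8, 151.0), (-34.0, 151.1),
    ]
    labels, centroids = kmeans(sensors, k=2)
    print(labels)
    print(centroids)
```

Because the distance is computed on raw degrees, clusters that span the antimeridian or sit near the poles can be distorted; projecting the coordinates first is one way to reduce that effect.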

Note

  • Keep the shapefiles in the same directory as code.py when you run the project; the geographical visualizations will not work without them.
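
A quick pre-flight check can catch a missing shapefile component before the mapping step fails. The sketch below is not part of code.py; the helper name `missing_shapefile_parts` is hypothetical, and it only verifies that the five files listed under Files Included are present in a given directory.

```python
from pathlib import Path

# The five components listed in this README for the countries shapefile.
SHAPEFILE_SUFFIXES = (".shp", ".shx", ".dbf", ".prj", ".cpg")


def missing_shapefile_parts(directory, stem="ne_110m_admin_0_countries"):
    """Return the names of shapefile components not found in `directory`."""
    directory = Path(directory)
    return [
        stem + suffix
        for suffix in SHAPEFILE_SUFFIXES
        if not (directory / (stem + suffix)).is_file()
    ]


if __name__ == "__main__":
    # Inside code.py one might pass Path(__file__).resolve().parent instead
    # of "." so the check does not depend on the current working directory.
    missing = missing_shapefile_parts(".")
    if missing:
        print("Missing shapefile components:", ", ".join(missing))
    else:
        print("All shapefile components found.")
```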

Contact

For questions or suggestions, contact abbasitashfeen7@gmail.com.
