A list of datasets aiming to enable Artificial Intelligence applications that use Copernicus data.

Agri-Hub/Callisto-Dataset-Collection

AI for Copernicus - a data repository by CALLISTO

A list of datasets aiming to enable Artificial Intelligence applications that use Earth Observation, satellite and other data.

It will be continuously enhanced with more datasets, and we are also aiming to trigger innovation by matching each one with papers and implementations that we consider relevant and could be used together in future work!

We strongly encourage the community to provide contributions through pull requests!

Callisto Generated Datasets

Note that for some of the Callisto-generated datasets, AI models have been utilized to clean the data and/or generate labels. This is explicitly mentioned wherever it applies.

  • Annotated Street Level Images from Mapillary (published in MMM22)
    Crop type labels from the freely available Land Parcel Identification System (LPIS) of the Netherlands are matched with all available Mapillary street-level images for the year 2017.
    Mapillary Annotated - Dataset sample

    Data Source | Type | Area | Task | Paper | Code | Relevant implementations
    Street level images | Parcel | Netherlands | Crop Classification | (2022) | GitHub | Street2Sat, DenseASPP, Crop Phenology, Scene Segmentation

    FAIRness evaluation — (link to framework definition)

    Findable | Accessible | Interoperable | Reusable | FAIRness score | Score details
    22/25 | 25/25 | 17/25 | 21/25 | 85% | (link to score details)
  • Space2Ground dataset for Agriculture Monitoring (published in IVMSP-2022)
    Space2Ground is a multi-level, multi-sensor, multi-modal dataset, annotated with grassland/non-grassland labels for agriculture monitoring. We combine Sentinel-1 SAR data, Sentinel-2 multispectral data and street-level images for the purpose of grassland detection. In particular, the dataset consists of i) the Space component (Sentinel-1 monthly mean time-series and Sentinel-2 time-series), ii) the Ground component (street-level image patches obtained from Mapillary after appropriate processing), and iii) Labels/Annotations (parcel IDs and grassland/non-grassland labels according to farmers' declarations, in the parcel_annotations.csv file). Note that AI models have been utilized for cleaning the street-level images in particular. More details can be found in (the corresponding publication).
    Space2Ground Street-level image patches - example 1, example 2

    Data Source | Type | Area | Task | Paper | Code
    Sentinel-1, Sentinel-2 and crowdsourced street-level images | Parcel | Netherlands | Crop Classification (Grassland Detection mainly) | (2022) | GitHub

    FAIRness evaluation — (link to framework definition)

    Findable | Accessible | Interoperable | Reusable | FAIRness score | Score details
    23/25 | 25/25 | 16/25 | 22/25 | 86% | (link to score details)
  • Paddy Rice Maps South Korea (2017~2021)
    This dataset includes paddy rice maps of South Korea from 2017 to 2021 at 10 m resolution. The paddy rice maps are a product of deep learning model predictions and DO NOT represent ground truth information. The predictions were made by analyzing time series of Sentinel-1 images with a deep learning architecture that integrates U-Net and RNN layers, designed by the eGIS/RS lab, Korea University. The deep learning model has been trained on 7,762 patches and validated on 5,180 patches, each consisting of 256 x 256 pixels, and can be found in h5 format here. The labels were acquired from the farm map produced by the Korean Ministry of Agriculture, Food and Rural Affairs (MAFRA). Moreover, the authors have made a pre-trained model public. The validation accuracy and Cohen's kappa are 96.50% and 0.7857, respectively, calculated on 40% of the farm map. For more information, please contact the KU-eGIS/RS lab.
    Paddy Rice mapping (binary) with DL

    Data Source | Type | Area | Task | Paper | Code
    Sentinel 2 | GeoTIFF | South Korea | Paddy Rice Mapping | - | GitHub

    FAIRness evaluation — (link to framework definition)

    Findable | Accessible | Interoperable | Reusable | FAIRness score | Score details
    24/25 | 25/25 | 16/25 | 18/25 | 83% | (link to score details)
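
For readers comparing the validation accuracy and Cohen's kappa reported for the Paddy Rice Maps, the sketch below shows how both figures are commonly derived from a binary confusion matrix. The pixel counts are hypothetical and are not taken from the dataset; the function name is illustrative only.

```python
def accuracy_and_kappa(tp, fp, fn, tn):
    """Overall accuracy and Cohen's kappa for a binary (rice / non-rice) map."""
    total = tp + fp + fn + tn
    observed = (tp + tn) / total  # observed agreement (overall accuracy)
    # Chance agreement from the predicted and reference class totals
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (observed - expected) / (1 - expected)
    return observed, kappa

# Hypothetical counts, for illustration only
acc, kappa = accuracy_and_kappa(tp=40, fp=10, fn=5, tn=45)
print(f"accuracy={acc:.2f}, kappa={kappa:.2f}")
```

Kappa can be much lower than accuracy when one class dominates, which is why both figures are worth reporting for a binary crop map.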
  • Paddy Rice Labeling Sites in South Korea (2018)
    The paddy rice was visually interpreted at 30 sites in South Korea. The sites were selected in each province by a proportional stratified sampling method according to the paddy rice area statistics (Statistics Korea), so the dataset can be used to validate model generalization over the entire country. The paddy rice areas were visually interpreted using Google Earth Pro and street view services (https://map.naver.com, https://map.kakao.com) and updated to the state of 2018.
    Paddy Rice Labelling Sites (Visual Interpretation)

    Data Source | Type | Area | Task | Paper | Code
    Sentinel 2 | GeoTIFF | South Korea | Paddy Rice Validation | - | -

    FAIRness evaluation — (link to framework definition)

    Findable | Accessible | Interoperable | Reusable | FAIRness score | Score details
    22/25 | 23/25 | 17/25 | 14/25 | 76% | (link to score details)
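
The proportional stratified sampling used to place the 30 labeling sites can be sketched as follows. This is a minimal illustration, not the authors' procedure: the province names and areas are hypothetical, and largest-remainder rounding is an assumed way of turning fractional quotas into whole sites.

```python
def allocate_sites(area_by_stratum, n_sites):
    """Split n_sites across strata in proportion to area (largest-remainder rounding)."""
    total = sum(area_by_stratum.values())
    quotas = {k: n_sites * a / total for k, a in area_by_stratum.items()}
    alloc = {k: int(q) for k, q in quotas.items()}  # floor of each quota
    leftover = n_sites - sum(alloc.values())
    # Hand the remaining sites to the strata with the largest fractional parts
    by_remainder = sorted(quotas, key=lambda k: quotas[k] - alloc[k], reverse=True)
    for k in by_remainder[:leftover]:
        alloc[k] += 1
    return alloc

# Hypothetical paddy rice areas (arbitrary units), not real statistics
areas = {"province_a": 500, "province_b": 300, "province_c": 160, "province_d": 40}
sites = allocate_sites(areas, n_sites=30)
print(sites)
```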
  • Water quality in a basin for drinking water in the North-West of Italy (2022-2023)
    Information about water quality in La Loggia basin was collected in different seasons in 2022 and 2023, both on the basin surface and at different depths.
    Samples were collected and analysed in the lab for the following parameters: Total chlorophyll, Blue-green algae, Diatoms, Green algae, Planktothrix, Transparency, Temperature, Dissolved oxygen, pH, Conductivity, Turbidity, Bromide, Bromate, Chloride, Chlorite, Chlorate, Fluoride, Nitrite, Nitrate, Orthophosphate, Sulfates.

    Data Source | Type | Area | Task | Paper | Code
    Water quality data from periodic sampling | csv | Italy - La Loggia lagoon basin | Drinking water quality estimation | - | -

    FAIRness evaluation — (link to framework definition)

    Findable | Accessible | Interoperable | Reusable | FAIRness score | Score details
    20/25 | 25/25 | 16/25 | 15/25 | 76% | (link to score details)
  • Geotagged tweets in German about air quality (published in IVMSP2022)
    This dataset consists of 2,948 georeferenced tweets in the German language, which concern the topic of air quality and have been retrieved with the Twitter Standard Streaming API. The tweets were posted from September 6, 2021 to February 16, 2022 (nearly six months) and contain air-quality-related keywords in their text, e.g. Luftqualität, Städtische Luftverschmutzung, Luftschadstoff, etc. The provided geoinformation has been extracted from the tweets' text with a state-of-the-art NER implementation based on the XLM-RoBERTa (XLM-R) language model, while the OpenStreetMap API has been used to retrieve the coordinates of each detected location.

    Data Source | Type | Area | Task | Paper | Code
    Twitter data | json | Germany | Air quality estimation | (2022) | -

    FAIRness evaluation — (link to framework definition)

    Findable | Accessible | Interoperable | Reusable | FAIRness score | Score details
    23/25 | 25/25 | 18/25 | 23/25 | 89% | (link to score details)
  • Ontology and Geospatial Knowledge Graph in RDF format (2023)
    This dataset provides the geospatial semantic representation of the CALLISTO project and the domain knowledge of the pilot use cases, in the form of a knowledge graph. This semantic representation contains a wide range of data categories, related to the transformation and integration of the PUCs' datasets into the ontology, including agricultural data, water quality indexes, air quality information, and tweets, along with geo-relationship information. Around 9 million triples were generated from the CALLISTO pilot use cases [PUC1, PUC2, PUC3] in RDF format.

    Data Source | Type | Area | Task | Paper | Code
    Ontology and Geospatial Knowledge Graph in RDF format | RDF triples | Netherlands, Germany, Italy | Semantic integration of CALLISTO datasets to represent their relationships | - | -

    FAIRness evaluation — (link to framework definition)

    Findable | Accessible | Interoperable | Reusable | FAIRness score | Score details
    25/25 | 25/25 | 23/25 | 22/25 | 95% | (link to score details)
  • Syntactic Geospatial data generated in RDF format (2022)
    This dataset represents synthetically generated data derived from CALLISTO data in RDF format. It contains the equivalent of 2 billion triples in Terse RDF Triple Language (Turtle) format. Samples were generated synthetically from CALLISTO PUC1 data in RDF format with a Generative Adversarial Network (GAN). Each entity contains: crop category ("Grasland" or "Bouwland"), geo information (a Multipolygon in Well Known Text (WKT) format), geometry area, geometry length, object id, parcel, and rdf:type owl:NamedIndividual.

    Data Source | Type | Area | Task | Paper | Code
    Syntactic Geospatial data generated in RDF format | TTL | Netherlands | Semantic complex querying, inference, and analytics from heterogeneous data sources | - | -

    FAIRness evaluation — (link to framework definition)

    Findable | Accessible | Interoperable | Reusable | FAIRness score | Score details
    18/25 | 25/25 | 20/25 | 15/25 | 78% | (link to score details)
  • Air quality trends for Berlin and Hamburg (2021-2022)
    In the context of CALLISTO and its pilot use case "Sensor Journalism", air pollution is studied from a journalistic point of view through an integrated solution that comprises multiple data visualisation tools. One of these tools is CALLISTO's Geospatial Business Intelligence (GeoBI) tool, which provides various visualisations of primarily air quality data; its purpose is to enable journalists to identify air quality events and trends on which to build their stories. The data provided here include trends of concentrations of specific air pollutants for the areas of Berlin and Hamburg during 2021 and 2022 (graphs and csv format). They come from the official air quality monitoring stations DEBE065 and DEHH008, in Berlin and Hamburg respectively, and are taken from OpenAQ.
    Air Quality - Berlin - 2021 - NO2

    Data Source | Type | Area | Task | Paper | Code
    Air Quality monitoring stations | csv | Berlin & Hamburg | Air Quality Monitoring, Sensor Journalism | - | -

    FAIRness evaluation — (link to framework definition)

    Findable | Accessible | Interoperable | Reusable | FAIRness score | Score details
    24/25 | 25/25 | 18/25 | 21/25 | 88% | (link to score details)
  • HYPSTAR water reflectance and derived water quality products at the Blankaart surface water reservoir (BE)
    This dataset consists of 2988 hyperspectral water leaving reflectance spectra measured between 2021-02-03 and 2022-08-03 at the Blankaart Surface Water Reservoir (Belgium, 50.98857N, 2.835213E) with the HYPSTAR®. From this dataset Chlorophyll-a concentration and suspended particulate matter were derived for monitoring water quality over the surface water reservoir.
    HYPSTAR water reflectance data

    Data Source | Type | Area | Task | Paper | Code
    Sensor data from HYPSTAR | csv | Belgium | Water quality estimation | (2022) | GitHub

    FAIRness evaluation — (link to framework definition)

    Findable | Accessible | Interoperable | Reusable | FAIRness score | Score details
    24/25 | 25/25 | 21/25 | 21/25 | 91% | (link to score details)
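
The FAIRness percentages listed for the Callisto datasets are consistent with adding up the four sub-scores, each out of 25. The sketch below assumes that this is how the overall score is aggregated; the linked framework definition remains the authoritative source, and the function name is illustrative.

```python
def fairness_score(findable, accessible, interoperable, reusable, max_per_principle=25):
    """Overall FAIRness percentage from the four per-principle scores."""
    total = findable + accessible + interoperable + reusable
    return round(100 * total / (4 * max_per_principle))

# Sub-scores copied from two entries in this list
mapillary = fairness_score(22, 25, 17, 21)
hypstar = fairness_score(24, 25, 21, 21)
print(mapillary, hypstar)
```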

Existing Datasets

Agriculture

Analysis Ready Remote Sensing Data with labels

  • CropHarvest: a global satellite dataset for crop type classification
    The CropHarvest dataset is a crop dataset of geo-referenced labels with satellite data inputs, each consisting of latitude, longitude, the associated agricultural label, and a satellite pixel time series. It contains 90,480 datapoints from 20 datasets; some datasets come from existing public sources while some (e.g., Rwanda) are being made public with this publication. The datasets include 3 different types of labels: i) binary labels (crop/non-crop), ii) FAO's indicative crop classification labels, which resulted in 9 crop type groupings (cereals, vegetables and melons, fruits and nuts, oilseed crops, root/tuber crops, beverage and spice crops, leguminous crops, sugar crops, and other crops), and iii) crop-type labels, if available.
    These labels are also accompanied by remote sensing data. More specifically, for each point/polygon in the dataset there is also a 12-timestep signature of:

    • Sentinel-2 monthly aggregated values (all bands except B1 and B10 + NDVI)

    • Sentinel-1 monthly aggregated values (VV and VH)

    • Meteorological monthly aggregated data (total precipitation and ground temperature at 2 m height from the ERA5 dataset, with a spatial resolution of 31 km/px)

    • Topographic data from the Shuttle Radar Topography Mission (SRTM) Digital Elevation Model (DEM), with a spatial resolution of 30 m/px.

      Data Source | Type | Area | Task | Paper | Code
      Sentinel 1-2/ERA5/DEM | Pixel | Global | Crop Classification | (2021) | GitHub
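
The CropHarvest Sentinel-2 signature includes NDVI alongside the raw bands. As a reminder of what that index is, the sketch below computes NDVI = (NIR - Red) / (NIR + Red) for a 12-timestep series, using Sentinel-2 B8 as NIR and B4 as red. The reflectance values are hypothetical, and the zero-denominator handling is an assumption rather than CropHarvest's own convention.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index; returns 0.0 when both bands are 0."""
    denom = nir + red
    return (nir - red) / denom if denom else 0.0

# Hypothetical monthly reflectances for one point (12 timesteps)
b8 = [0.20, 0.22, 0.30, 0.40, 0.50, 0.55, 0.50, 0.40, 0.30, 0.25, 0.22, 0.20]
b4 = [0.10, 0.10, 0.09, 0.08, 0.05, 0.05, 0.06, 0.08, 0.10, 0.10, 0.10, 0.10]
series = [ndvi(n, r) for n, r in zip(b8, b4)]
print([round(v, 2) for v in series])
```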
  • EuroCrops: A pan-European dataset for time series crop type classification (Demo)
    EUROCROPS is a dataset based on self-declared field annotations for training and evaluating methods for crop type classification and mapping, together with its process of acquisition and harmonisation. The aim of EUROCROPS is to enrich the research efforts and discussion for data-driven land cover classification via Earth observation and remote sensing. The dataset is published in different formats for researchers in remote sensing, computer vision and machine learning fields. EUROCROPS Demo Dataset contains harmonised agricultural parcel information data from 3 regions, namely Austria, Denmark and Slovenia which allows for a better representation of Europe’s agricultural diversity. Specifically, it contains 396,600 parcels for Austria in 2020, 310,236 parcels for Slovenia in 2020 and 98,565 parcels for Denmark in 2019. The dataset has been split into training and test sets as earth observation data is influenced by spatial auto-correlation, implying that using adjacent parcels for machine learning or remote sensing should be refrained from. Here you can find more details about the dataset.

    Data Source | Type | Area | Task | Paper | Code
    Sentinel 2 | Pixel | Europe (DK,SL,AT) | Crop Classification | (2021) | GitHub
  • BigEarthNet dataset

    • BigEarthNet is a benchmark archive, consisting of 590,326 pairs of Sentinel-1 and Sentinel-2 image patches.

    • To construct BigEarthNet with Sentinel-2 image patches (now called BigEarthNet-S2, previously BigEarthNet), 125 Sentinel-2 tiles acquired between June 2017 and May 2018 over 10 European countries (Austria, Belgium, Finland, Ireland, Kosovo, Lithuania, Luxembourg, Portugal, Serbia, Switzerland) were initially selected. All the tiles were atmospherically corrected by the Sentinel-2 Level 2A product generation and formatting tool (sen2cor). Then, they were divided into 590,326 non-overlapping image patches. Each image patch was annotated with multiple land-cover classes (i.e., multi-labels) provided by the CORINE Land Cover database of the year 2018 (CLC 2018). The labels in BigEarthNet belong to the initial release of the labels in 2018.

    • To construct BigEarthNet with Sentinel-1 image patches (called BigEarthNet-S1), 321 Sentinel-1 scenes acquired between June 2017 and May 2018 that jointly cover the area of all original 125 Sentinel-2 tiles with close temporal proximity were selected and processed. BigEarthNet-S1 consists of 590,326 preprocessed Sentinel-1 image patches, one for each Sentinel-2 patch. A more detailed explanation of the processing is given in its dataset description document.

      Data Source | Type | Area | Task | Paper | Code | Relevant Datasets
      Sentinel 1/2 | Patch | Europe | Land Cover Classification | (2019) (2021) | GitHub | Belgium LPIS/GSAA, Luxembourg LPIS
  • EuroSAT dataset
    27,000 labeled and geo-referenced Sentinel-2 satellite image patches (64 x 64 pixels each). Although the classification scheme is made up of 10 different classes, including land covers with peculiar temporal patterns (i.e., annual crops, permanent crops), the dataset is based on single-time images.

    Data Source | Type | Area | Task | Paper | Code | Relevant implementations
    Sentinel 2 | Patch | Europe | Land Cover Classification | (2018) (2019) | GitHub | EfficientNet, EfficientNetV2, Vision Transformers
  • The Canadian Cropland Dataset
    The Canadian Cropland Dataset is a temporal patch-based dataset of Canadian croplands, enriched with labels retrieved from the Canadian Annual Crop Inventory. The dataset contains 78,536 manually verified and curated high-resolution (10 m/pixel, 640 x 640 m) geo-referenced images from 10 crop classes (barley, canola, corn, mixedwood, oat, orchard crops, pasture, potatoes, soybeans and spring wheat) collected over four crop production years (2017-2020) and five months (June-October). Each instance contains 12 spectral bands, an RGB image, and additional bands corresponding to commonly used vegetation indices (NDVI, NDVI45, GNDVI, PSRI and OSAVI). Individually, each category contains at least 4,800 images.

    Data Source | Type | Area | Task | Paper | Code
    Sentinel 2 | Patch | Canada | Crop Classification | (2022) | GitHub
  • Sen12MS
    The SEN12MS dataset contains 180,662 patch triplets of corresponding Sentinel-1 dual-pol SAR data, Sentinel-2 multi-spectral images, and MODIS-derived land cover maps. The patches are distributed across the land masses of the Earth and spread over all four meteorological seasons. This is reflected by the dataset structure. All patches are provided in the form of 16-bit GeoTiffs containing the following specific information:

    • Sentinel-1 SAR: 2 channels corresponding to sigma nought backscatter values in dB scale for VV and VH polarization.

    • Sentinel-2 Multi-Spectral: 13 channels corresponding to the 13 spectral bands (B1, B2, B3, B4, B5, B6, B7, B8, B8a, B9, B10, B11, B12).

    • MODIS Land Cover: 4 channels corresponding to IGBP, LCCS Land Cover, LCCS Land Use, and LCCS Surface Hydrology layers.

      Data Source | Type | Area | Task | Paper | Code | Relevant Implementations
      Sentinel 1/2 | Patch | Global | Land Cover Classification | (2019) (2021) | GitHub | Image Classification: EfficientNet Transformer Vision Transformers; Semantic Segmentation: U-Net DeepLab Transformer
  • SAT-4 and SAT-6
    SAT-4: Originally, images were extracted from the National Agriculture Imagery Program (NAIP) dataset. The SAT-4 contains 500,000 RGB images. Each sample image is 28x28 pixels (1m spatial resolution) and consists of 4 bands - red, green, blue and near infrared. Each image is annotated with one of the four classes that represent four broad land covers which include barren land, trees, grassland and a class that consists of all land cover classes other than the above three.
    SAT-6: Originally, images were extracted from the National Agriculture Imagery Program (NAIP) dataset. The SAT-6 contains 405,000 RGB images. Each sample image is 28x28 pixels (1m spatial resolution) and consists of 4 bands - red, green, blue and near infrared. Each image is annotated with one of the six classes that represent six broad land covers which include barren land, trees, grassland, roads, buildings and water bodies.

    This dataset could potentially be used for super-resolution tasks, for example by matching it with corresponding Sentinel-2 images. In the table below, we indicatively propose a list of implementations for this task on the PROBA-V dataset, available on the Papers with Code website.

    Data Source | Type | Area | Task | Paper | Code | Relevant Implementations
    Aerial (R,G,B,NIR) | Patch | California | Land Cover Classification | (2015) | - | Super-Resolution
  • ZueriCrop
    The ZueriCrop dataset contains ground truth labels of 116,000 field instances. Each field instance consists of a polygon representing the borders of the field, and its dominant crop label in 2019. The ground truth labels of all 48 crop classes are provided by the Swiss Federal Office for Agriculture (FOAG) and correspond to the primary crop grown per field during the year. The input data is a time series of 71 multi-spectral Sentinel-2 Level-2A bottom-of-atmosphere reflectance images with a ground sampling distance (GSD) of 10 meters. All input images are atmospherically corrected using the Sen2Cor v2.8 software package. The dataset is collected over a 50 km × 48 km area in the Swiss Cantons of Zurich and Thurgau between January 2019 and December 2019. The entire scene is subdivided into smaller patches of 24 px×24 px. Patches without any ground-truth information are discarded. In the remaining patches the fraction of pixels without reference label is ≈48%. Only those four spectral channels available at the highest, 10 m resolution (Red, Green, Blue, and Near-Infrared) are used.

    Data Source | Type | Area | Task | Paper | Code | Relevant Implementations
    Sentinel 2 | Patch | Zurich (Switzerland) | Crop Classification | (2021) | GitHub | U-TAE
  • PASTIS
    PASTIS is a benchmark dataset for panoptic and semantic segmentation of agricultural parcels from satellite time series. It contains 2,433 patches within the French metropolitan territory with panoptic annotations (instance index + semantic label for each pixel). Each patch is a Sentinel-2 multispectral image time series of variable length.
    PASTIS dataset has been extended from the initial publication with aligned radar Sentinel-1 observations for all 2,433 patches in addition to the Sentinel-2 images. For each patch, approximately 70 observations of Sentinel-1 have been added in ascending orbit, and 70 observations in descending orbit. PASTIS-R can be used to evaluate optical-radar fusion methods for parcel-based classification, semantic segmentation, and panoptic segmentation.

    Data Source | Type | Area | Task | Paper | Code
    Sentinel 2 | Pixel | France | Semantic and Panoptic Crop Segmentation | (2021) (2022) | GitHub
  • CV4A Kenya
    This dataset was produced as part of the Crop Type Detection competition at the Computer Vision for Agriculture (CV4A) Workshop at the ICLR 2020 conference. The ground reference data were collected by the PlantVillage team, and Radiant Earth Foundation curated the training dataset after inspecting and selecting more than 4,000 fields from the original ground reference data. The dataset has been split into training and test sets (3,286 fields in the training set and 1,402 in the test set). The dataset is cataloged in four tiles. These tiles are smaller than the original Sentinel-2 tile, which has been clipped and chipped to the geographical area where labels have been collected. Each tile has: a) 13 multi-band observations throughout the growing season. Each observation includes 12 bands from the Sentinel-2 L2A product and a cloud probability layer. The twelve bands are [B01, B02, B03, B04, B05, B06, B07, B08, B8A, B09, B11, B12]. The cloud probability layer is a product of the Sentinel-2 atmospheric correction algorithm (Sen2Cor) and provides an estimated cloud probability (0-100%) per pixel. All of the bands are mapped to a common 10 m spatial resolution grid; b) a raster layer indicating the crop ID for the fields in the training set; and c) a raster layer indicating field IDs for the fields (both training and test sets). Fields with a crop ID of 0 are the test fields.

    Data Source | Type | Area | Task | Paper | Code
    Sentinel 2 | Sentinel tiles (Images) | Kenya | Crop Classification | (2020) | GitHub
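
The CV4A Kenya convention that fields with a crop ID of 0 are test fields can be turned into a train/test split as sketched below. The rasters are tiny hypothetical examples held as nested lists, and treating a field ID of 0 as "outside any field" is an assumption that should be checked against the dataset documentation.

```python
def split_field_ids(field_id_raster, crop_id_raster):
    """Return (train_ids, test_ids) from two aligned 2-D rasters."""
    train, test = set(), set()
    for field_row, crop_row in zip(field_id_raster, crop_id_raster):
        for field_id, crop_id in zip(field_row, crop_row):
            if field_id == 0:  # assumption: 0 marks pixels outside any field
                continue
            (test if crop_id == 0 else train).add(field_id)
    return train, test

# Tiny hypothetical rasters (real tiles are far larger)
field_ids = [[0, 1, 1], [2, 2, 0], [3, 3, 3]]
crop_ids = [[0, 4, 4], [0, 0, 0], [7, 7, 7]]
train_ids, test_ids = split_field_ids(field_ids, crop_ids)
print(sorted(train_ids), sorted(test_ids))
```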
  • TimeSen2Crop
    A pixel-based dataset made up of more than 1 million samples of Sentinel-2 Time Series (TSs) associated with 16 crop types. This dataset includes atmospherically corrected images and reports the snow, shadow and cloud information per labeled unit. The provided TSs represent an agronomic year ranging from September 2017 to August 2018, using the publicly available Austrian crop type map based on farmers' declarations. TimeSen2Crop also includes a TS of Sentinel-2 images acquired in the following agronomic year (i.e., from September 2018 to August 2019).

    Data Source | Type | Area | Task | Paper | Code
    Sentinel 2 | Pixel | Austria | Crop Classification | (2020) | -
  • Sen4AgriNet
    The Sen4AgriNet dataset is built from Sentinel-2 images acquired at different timestamps and includes all spectral bands, which have different spatial resolutions. On top of the dataset, a series of functions, such as spatio-temporal aggregations, has been developed to transform the original dataset according to different AI problems.

    • 5-year multitemporal Sentinel-2 patches

    • Sentinel-1/2 data

    • The initial version of Sen4AgriNet consists of approximately 225,000 patches, co-registered with open LPIS data for regions in Spain and France, with a total size of 10 TB

      Data Source | Type | Area | Task | Paper | Code
      Sentinel 2 | Patch | Europe | Crop Classification | (2021) | GitHub
  • BreizhCrops
    BreizhCrops is a novel benchmark dataset for the supervised classification of field crops from satellite time series. It contains aggregated label data and Sentinel-2 top-of-atmosphere as well as bottom-of-atmosphere time series in the region of Brittany (Breizh in the local language), north-west France.

    Data Source | Type | Area | Task | Paper | Code
    Sentinel 2 | Object | Brittany (France) | Crop Classification | (2020) | GitHub
  • Crop Type Mapping - Semantic Segmentation Datasets in Ghana and South Sudan
    The datasets include time series of satellite imagery from Sentinel-1, Sentinel-2, and PlanetScope satellites throughout 2016 and 2017. For each tile/chip in the dataset, there are time series of imagery from each of the satellites, as well as a corresponding label that defines the crop type at each pixel. The label has only one value at each pixel location, and assumes that the crop type remains the same across the full time span of the satellite image time series. In many cases where ground truth was not available, pixels have no label and are set to a value of 0.

    Data Source | Type | Area | Task | Paper | Code
    Sentinel 1/2 & PlanetScope | GeoTIFF | Ghana & South Sudan | Crop Classification | (2019) | GitHub
  • CaneSat dataset
    This dataset contains 1,627 multispectral high-resolution image patches of size 10 x 10 pixels, with a pixel size of 10 m x 10 m. These patches are generated from Sentinel-2 (A/B) satellite images acquired during the period of October 2018 to May 2019. It covers one life cycle (12 months) of the sugarcane crop in the region of Karnataka, India. Along with sugarcane crop field areas, other land covers are also included for classification purposes. The dataset provides two formats: jpg and tif. The former includes images with RGB channels and the latter includes six bands, namely Red, Green, Blue, Near Infrared, Red Edge and Short-wave infrared. The dataset also separately provides .tif images of 3 vegetation indices: enhanced vegetation index (EVI), normalized difference vegetation index (NDVI) and green normalized difference vegetation index (GNDVI). All tif image patches are georeferenced and labeled. The focus of this dataset is to support further research in sugarcane crop classification, especially in India.

    Data Source | Type | Area | Task | Paper | Code
    Sentinel 1/2 | GeoTIFF, JPG | Karnataka, India | Sugarcane Classification | (2020) | -
  • Spot the Crop Challenge
    The dataset contains a time-series of satellite imagery and labels for crop type that have been collected through aerial and ground survey. Labels are derived from the survey conducted by the Western Cape Department of Agriculture, for the period of 04-01-2017 to 11-31-2017 and the area of Western Cape, South Africa. Satellite data including multispectral Sentinel-2 are then matched with corresponding labels. The S2 time-series is provided every 5 days. Sentinel-1 data include VV and VH backscatter with a time window of 12 days. The label chips contain the mapping of pixel to crop type label. The following pixel values correspond to the following crop types.

    • 0 - No Data
    • 1 - Lucerne/Medics
    • 2 - Planted pastures (perennial)
    • 3 - Fallow
    • 4 - Wine grapes
    • 5 - Weeds
    • 6 - Small grain grazing
    • 7 - Wheat
    • 8 - Canola
    • 9 - Rooibos

      Data Source | Type | Area | Task | Paper | Code
      Sentinel 1/2 | GeoTIFF | South Africa | Crop Classification | - | GitHub
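
The Spot the Crop pixel-value legend can be used directly to summarise a label chip. The sketch below copies the mapping from the list in that entry and counts labelled pixels per crop; the chip itself is a hypothetical nested list, not real data, and the function name is illustrative.

```python
from collections import Counter

# Pixel value -> crop type, as listed in the Spot the Crop entry
CROP_LABELS = {
    0: "No Data", 1: "Lucerne/Medics", 2: "Planted pastures (perennial)",
    3: "Fallow", 4: "Wine grapes", 5: "Weeds", 6: "Small grain grazing",
    7: "Wheat", 8: "Canola", 9: "Rooibos",
}

def class_pixel_counts(label_chip):
    """Count labelled pixels per crop name, skipping 0 (No Data)."""
    counts = Counter(v for row in label_chip for v in row if v != 0)
    return {CROP_LABELS[v]: n for v, n in counts.items()}

# Hypothetical 3x4 label chip
chip = [[0, 7, 7, 8], [0, 7, 8, 8], [4, 4, 0, 9]]
counts = class_pixel_counts(chip)
print(counts)
```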
  • DENETHOR dataset (password: dailycrops)
    DENETHOR: The DynamicEarthNET dataset for Harmonized, inter-Operabel, analysis-Ready, daily crop monitoring from space. Our dataset contains daily, analysis-ready Planet Fusion data together with Sentinel-1 radar and Sentinel-2 optical time-series for crop type classification in Northern Germany. The dataset includes: i) The Planet Fusion Monitoring product, which consists of clean (i.e. free from clouds and shadows), daily gap-filled, high resolution (3m), temporally consistent, radiometrically robust, harmonized and sensor agnostic surface reflectance time series, featuring and synergizing inputs from both public and private sensor sources and directly interoperable with HLS (harmonized Landsat Sentinel) surface reflectance products; ii) Sentinel-1 (S1) imagery, which contains 3 channels in total: [VV, VH, ANGLE], where V and H stand for vertical and horizontal orientations, respectively, and ANGLE stores the angle of observation to the earth surface as described here. The data is collected in Interferometric Wide (IW) swath mode and includes both ascending and descending orbit directions; and iii) Sentinel-2 (S2) imagery, which includes all L2A bands in the following order: [B01, B02, B03, B04, B05, B06, B07, B08, B8A, B09, B11, B12]. The bands that have an original spatial resolution of 20m and 60m are interpolated with a nearest-neighbour method to a 10m resolution.

    Data Source | Type | Area | Task | Paper | Code
    Sentinel 1/2 & Planet Fusion | Patch | Northern Germany | Crop Classification | (2021) | GitHub
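
The nearest-neighbour interpolation that DENETHOR applies to the 20 m and 60 m Sentinel-2 bands amounts to repeating each coarse pixel over the finer grid. The sketch below illustrates the idea on a nested list; it assumes an integer scale factor (2 for 20 m, 6 for 60 m) and perfectly aligned grids, which may not match the exact resampling used by the dataset authors.

```python
def upsample_nearest(band, factor):
    """Nearest-neighbour upsampling: repeat every pixel factor x factor times."""
    out = []
    for row in band:
        wide = [v for v in row for _ in range(factor)]  # repeat along columns
        out.extend(list(wide) for _ in range(factor))   # repeat along rows
    return out

# Hypothetical 2x2 band at 20 m, resampled to 10 m (factor 2)
band_20m = [[1, 2], [3, 4]]
band_10m = upsample_nearest(band_20m, 2)
for row in band_10m:
    print(row)
```

Nearest-neighbour resampling adds no new information, so a 20 m band keeps its coarser effective resolution even on the 10 m grid.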
  • Agriculture-Vision: Challenges & Opportunities for Computer Vision in Agriculture
    The dataset contains 21,061 aerial farmland images captured throughout 2019 across the US. Each image consists of four 512x512 color channels, which are RGB and Near Infra-red (NIR). Each image also has a boundary map and a mask. The boundary map indicates the region of the farmland, and the mask indicates valid pixels in the image. Regions outside of either the boundary map or the mask are not evaluated. This dataset contains six types of annotations: Cloud shadow, Double plant, Planter skip, Standing Water, Waterway and Weed cluster. These types of field anomalies have great impacts on the potential yield of farmlands, therefore it is extremely important to accurately locate them. In the Agriculture-Vision dataset, these six patterns are stored separately as binary masks due to potential overlaps between patterns. Users are free to decide how to use these annotations.

    Data Source | Type | Area | Task | Paper | Code
    Aerial | Images (RGB + NIR) | USA | Scene Classification | (2020) | GitHub
  • UAV-based Multispectral & Thermal dataset for exploring the diurnal variability, radiometric & geometric accuracy for precision agriculture
    To explore the diurnal variations and the radiometric and geometric accuracy of UAV-based data for precision agriculture, a comprehensive dataset was created in a one-day field campaign (21 June 2017). The multi-sensor dataset covers wheat, barley and potato experimental fields located at the Wageningen University and Research (WUR) farm maintained by Unifarm. UAV-based images were collected with several sensors over the experimental area, starting at 7:25 and ending at 20:00 local solar time. The dataset consists of images collected by 9 flights with senseFly MSP4C, 9 with Parrot Sequoia, 2 with Slant Range P3, 5 with DJI Zenmuse X3 NIR, 4 with the senseFly Thermo-map and 1 with the RGB Sony WX-220. Additionally, validation measurements at radiometric calibration plates and plant sample locations were taken with a Cropscan handheld spectrometer and a tec5 Handyspec spectrometer. The dataset consists of the validation measurements, the raw images and the processed orthomosaics (both with and without geometric correction).

    Data Source Type Area Task Paper Code
    UAV Images (Green, Blue, Red, Red Edge, NIR, Thermal Infrared) Wageningen, Netherlands Crop Classification (2020) -

Analysis Ready Remote Sensing Data without labels

In-situ & Ground-level datasets

  • PlantVillage Dataset - Healthy and Unhealthy leaf images
    This dataset provides 39 different classes of plant leaf and background images and contains 61,486 images. The authors used six augmentation techniques to increase the dataset size: image flipping, gamma correction, noise injection, PCA color augmentation, rotation, and scaling.

    Data Source Type Area Task Paper Code
    Crowdsource Grayscale/RGB Images USA Image Classification (healthy/unhealthy leaves) (2015) GitHub
  • iCrop Dataset - Street-level Imagery for Crop Classification
    It is the first large, public, multiclass road view crop photo dataset, for the development of crop type detection with deep learning.

    Data Source Type Area Task Paper Code
    Street-level RGB Images China Crop Classification (2021) -
  • A Crop/Weed Field Image Dataset (CWFID)
    This dataset comprises field images, vegetation segmentation masks and crop/weed plant type annotations. The paper provides details, e.g. on the field setting, acquisition conditions, image and ground truth data format.

    Data Source Type Area Task Paper Code
    Field robot RGB Images Northern Germany Crop / Weed Discrimination (2015) GitHub

Geo-referenced labels

  • Hand Labelled Crop/No-Crop dataset
    This dataset provides the hand-labelled crop / non-crop points used for training, which were created by labelling high-resolution satellite imagery in QGIS and Google Earth Pro. Data is available for Ethiopia, Sudan, Togo and Kenya.

    Data Source Type Area Task Paper Code
    Photo-interpretation Shapefiles Africa Crop Discrimination (2021) GitHub
  • LEM+ dataset
    The dataset, in ESRI shapefile format (spatial reference system: WGS 84, EPSG: 4326), provides monthly land use information about 1854 fields from October 2019 to September 2020 from Luís Eduardo Magalhães (LEM) and other municipalities in the west of Bahia state, Brazil. The majority of the 16 land use classes are related to crops.

    Data Source Type Area Task Paper Code
    Field visits Shapefiles Brazil Crop Monitoring (2020) -
  • Land Cover Map (Korean Ministry of Environment)
    The Korean Ministry of Environment provides three types of land cover map (level-1, level-2, level-3), depending on scale. The level-3 land cover map, the most detailed product, provides approximately 1m resolution by interpreting aerial photos (0.25m) and Kompsat-2 (1m) and Kompsat-3 (0.7m) satellite images. It classifies 7 major land covers (Used area, Agricultural Area, Forest, Grassland, Wet land, Bareland, Water) and subdivides them into 41 classes. The level-3 product was produced province by province at intervals of several years until 2018, and the most recent product, released in 2019, covers the entire nation using 2018 imagery. The data is available only to registered domestic researchers, so using it for research requires cooperation with a Korean researcher.

    • Level-1 product: 30m resolution, raster format
    • Level-2 product: 5m resolution, shape format
    • Level-3 product: 1m resolution, shape format
    Data Source Type Area Task Paper Code
    Korean Ministry of Environment Shapefiles South Korea Crop Monitoring - -
  • Open Labelled Data (France)
    The graphic parcel register (RPG) is a geographical database used as a reference for the instruction of the aids of the common agricultural policy (CAP). Here you can find LPIS data for France together with the crop type declaration of the farmers. These data have been produced by the Services and Payment Agency (ASP) and contain data from 2010 onwards. The anonymised RPG data are published in annual vintages and contain the plots declared for campaign N in their known situation as approved by the administration, generally on 1 January of year N+1. These data cover the entire French territory, including Mayotte and Saint-Martin, but excluding Saint-Barthélemy. More information about the crop type labels and the files' format can be found here.
    Note: the site and all files are written in French and the files are located on an FTP server.

    Data Source Type Area Task Paper Code
    National Institute for Geographic and Forestry Information (IGN) and France Paying Agency (ASP) Shapefile France Crop Monitoring (2016) -
  • Open Labelled Data (Catalonia)
    The Department of Climate Action, Agriculture and Rural Agenda (DACC) of Catalonia makes available to the public the data from the crop map of Catalonia. This map allows you to locate the crops declared in the Agrarian Declaration - DUN submitted to the DACC. The DUN is the tool for making the declarations of agricultural holdings in Catalonia. It is also used to apply for aid and to carry out certain procedures with the DACC in an integrated way. The geographical basis of declaration is the SIGPAC area. Owners of agricultural holdings that have productive agricultural surface (excluding those for own consumption) are required to declare annually. Data from the DUN and the SIGPAC have been used to draw up this crop map. As the data declared are georeferenced, they can be located on the ground and this makes it possible to know, among other things, the identification of crops on each plot, the irrigation system and, depending on the case, the second crop that is grown in the plot. This information makes it possible to make an economic assessment of the impact that hailstorms have on crops, the effects of pests, fires, etc. and also lets you know the historical evolution of crops in the territory. On the site you can find data from 2015 until now. Finally, in the document Origin of the crop map data, you can consult the details of the data that have been used to make the map.
    Note: all files and the site are written in Catalan.

    Data Source Type Area Task Paper Code
    Department of Climate Action, Agriculture and Rural Agenda (DACC) of Catalonia Shapefile Catalonia Crop Monitoring - -
  • Open Labelled Data (Sweden)

    LPIS can be found under the Agricultural block section. Agricultural block is a dataset that contains information on maximum eligible agricultural land according to EU definitions. The agricultural blocks are used by the Swedish Agency for Agriculture to administer support to farmers, for example to check the area data in the farmers' applications and to inform the farmers about current data. The dataset does not contain all agricultural land in Sweden, but only the parts for which a farmer has applied for support at some point. A block is a polygon/surface that delimits an area of agricultural land. A block is delimited by fixed boundaries. Examples of fixed boundaries are roads, stone walls, forests and buildings. A block can also be delimited by regional boundaries (parish boundaries from 2000). A block must, with few exceptions, be at least 0.1 hectares. On a block, only one farmer can have agricultural land (exception for pasture that is cultivated together). The dataset Agricultural Blocks contains approximately 1,143,000 blocks. Of these, approximately 891,000 are arable land blocks and approximately 252,000 pasture land blocks. The total area is 3.2 million hectares, of which 2.7 million hectares are arable land and 510,000 hectares are pasture land. The average area for the arable land blocks is 3.03 ha and for the pasture blocks the corresponding figure is 2.03 ha.

    The corresponding file with the LPIS and the crop type labels can be found under the Agricultural shifts section. Info:
    A parcel is a contiguous area of land within a block where a farmer grows a crop or otherwise manages the land. To receive compensation for agricultural support (EU support), farmers apply for support from the Swedish Agency for Agriculture via a SAM application. Each parcel has the attribute "Blockid", which shows which agricultural block it belongs to, and "Shift designation", which names the specific shift. In addition, the attributes "EFA", "Grodkod" and "land name" show what is grown or what the land is used for. The dataset contains parcels for which the applied area and the decided area are the same. The data applies to the previous year (2021). The dataset is limited to Gårdstöd. The crop code list can be accessed here.
    Note: all files and the site are written in Swedish.

    Data Source Type Area Task Paper Code
    The Swedish Agency for Agriculture .gml file Sweden Crop Monitoring - -
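As a quick sanity check, the block counts and average block areas quoted above for the Agricultural Blocks dataset reproduce the stated totals to within rounding. The snippet below only restates that arithmetic, and its variable names are illustrative.

```python
# Figures quoted in the Agricultural Blocks description (all approximate).
arable_blocks, pasture_blocks = 891_000, 252_000
arable_mean_ha, pasture_mean_ha = 3.03, 2.03

total_blocks = arable_blocks + pasture_blocks   # stated: ~1,143,000 blocks
arable_ha = arable_blocks * arable_mean_ha      # stated: ~2.7 million ha
pasture_ha = pasture_blocks * pasture_mean_ha   # stated: ~510,000 ha
total_ha = arable_ha + pasture_ha               # stated: ~3.2 million ha

print(f"{total_blocks:,} blocks, {total_ha / 1e6:.1f} million ha in total")
```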
  • Open Labelled Data (The Netherlands)
    The National Georegister focuses primarily on the professional user. This can be a Geo-ICT specialist looking for datasets, services or other geo-information elements, but also a policy officer who wants to consult a map, or a web developer or student who is building a website or application and is looking for geo-information for it.

    Data Source Type Area Task Paper Code
    National GeoRegistry of The Netherlands GeoDatabase The Netherlands Crop Monitoring - -
  • Open Labelled Data (Flanders, Belgium)
    Overview of the parcels in agricultural use on the final date of submission of the single application that year. The inventory also includes pools, wooded areas and agricultural production facilities (yards with stables and buildings).

    Data Source Type Area Task Paper Code
    Agency for Agriculture and Fisheries of Belgium Shapefile, Gml (2.1.2) Flanders, Belgium Crop Monitoring - -
  • Open Labelled Data (Denmark)
    This data collection contains a plethora of map data that the Danish Agriculture Authority has made openly available. Specifically, under the Markblokke section you can find the Land Parcel Identification System (LPIS) data collection and under the Marker section you can find the Geo-spatial Aid Application (GSAA) data collection, which contains parcel geometries accompanied by their crop type, from 2018 to today. More information about the GSAA files is available, and a description of the crop names can be found in CropDescription.

    Data Source Type Area Task Paper Code
    Danish Agriculture Authority Shapefile Denmark Crop Monitoring - -
  • Land parcel Identification System (LPIS) - Luxembourg
    This dataset contains agricultural and wine-growing parcels used as a basis for declarations within the framework of the common agricultural policy.

    Data Source Type Area Task Paper Code
    Administration of agricultural technical services of Luxembourg GML Luxembourg Vineyard Mapping - -
  • DWD_RECENT
    DWD Climate Data Center (CDC): Phenological observations of crops from sowing to harvest, in Germany. The temporal coverage is rolling, with a window of 500 days (ending always yesterday), and the crops of interest are: meadows, winter wheat, winter rye, winter barley, winter oilseed rape, summer wheat, spring barley, oat, sunflower, maize, beet, sugar beet, fodder beet. For more information click here.

    Data Source Type Area Task Paper Code
    Field Observations CSV files Germany Crop Phenology - -
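The rolling temporal coverage described above (a 500-day window that always ends yesterday) can be computed as in the sketch below. It assumes the window includes both its first and last day, which the description does not state explicitly, and `rolling_window` is a hypothetical helper rather than part of any DWD tooling.

```python
from datetime import date, timedelta


def rolling_window(today, days=500):
    """Return (start, end) of a window of `days` dates that ends yesterday."""
    end = today - timedelta(days=1)          # the window always ends yesterday
    start = end - timedelta(days=days - 1)   # inclusive start (assumption)
    return start, end


start, end = rolling_window(date.today())
print(f"DWD_RECENT would cover {start} to {end}")
```

Anything older than `start` would then have to come from the DWD_ARCHIVE collection, subject to its own coverage dates.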
  • DWD_ARCHIVE
    DWD Climate Data Center (CDC): Historical phenological observations of crops from sowing to harvest, in Germany. It contains data from 1951-01-01 until 2017-12-31 for dozens of crops (meadows, winter wheat, winter rye, winter barley, winter oilseed rape, summer wheat, spring barley, oat, sunflower, maize, potato, early potato (pregerminated), early potato (non pregerminated), late potato, green bean, green pea, tomato, white cabbage, alfalfa, red clover, beet, sugar beet, fodder beet). For more information click here.

    Data Source Type Area Task Paper Code
    Field Observations CSV files Germany Crop Phenology - -

Land change

Analysis Ready Remote Sensing Data with labels

  • RapidAI4EO: A Corpus of Dense Time Series Satellite Imagery
    The RapidAI4EO corpus is a dataset of dense time series satellite imagery sampled at 500,000 locations across Europe. Sample locations are non-overlapping with a footprint of 600×600 metres. At each location the corpus contains datacubes of two cloud-free, regular-cadence image products and corresponding land cover labels:

    • Planet Fusion three-metre, five-day cadence radiometrically harmonized and gap-filled imagery for 2018–2019
    • Sentinel-2 L2A monthly image mosaics at 10-metre resolution for 2018
    • CORINE Land Cover multiclass labels for 2018

    Originally designed to train deep learning models for land use and land cover (LULC) classification and change detection, the corpus is being released as open data to support research in these domains as well as others that could benefit from dense time series satellite imagery. The corpus was created under the RapidAI4EO project.
    Data Source Type Area Task Paper Relevant Implementations
    PlanetFusion, Sentinel-2, CORINE Satellite Image Time Series Europe LULC Classification, Change Detection, and more (2021) (Tutorial)
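The figures in the RapidAI4EO description imply nominal datacube sizes per location, worked out below. These are back-of-the-envelope values derived only from the footprint, resolutions and cadences quoted above. The shapes of the distributed files may differ (for example because of padding or reprojection), and the helper name is hypothetical.

```python
from datetime import date

FOOTPRINT_M = 600  # side length of each sample location, from the description


def pixels_per_side(footprint_m, resolution_m):
    """Nominal number of pixels along one side of a square footprint."""
    return footprint_m // resolution_m


planet_px = pixels_per_side(FOOTPRINT_M, 3)    # Planet Fusion at 3 m
s2_px = pixels_per_side(FOOTPRINT_M, 10)       # Sentinel-2 mosaics at 10 m

days = (date(2020, 1, 1) - date(2018, 1, 1)).days  # length of 2018-2019
planet_steps = days // 5                           # five-day cadence
s2_steps = 12                                      # monthly mosaics for 2018

print(f"Planet Fusion: {planet_steps} steps of {planet_px}x{planet_px} px")
print(f"Sentinel-2:    {s2_steps} steps of {s2_px}x{s2_px} px")
```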
  • Onera Dataset
    The Onera Satellite Change Detection dataset addresses the issue of detecting changes between satellite images from different dates. It comprises 24 pairs of multispectral images taken from the Sentinel-2 satellites between 2015 and 2018. Locations are picked all over the world, in Brazil, USA, Europe, Middle-East and Asia. For each location, registered pairs of 13-band multispectral satellite images obtained by the Sentinel-2 satellites are provided. Images vary in spatial resolution between 10m, 20m and 60m. Pixel-level change ground truth is provided for 14 of the image pairs. The annotated changes focus on urban changes, such as new buildings or new roads. These data can be used for training and setting parameters of change detection algorithms.

    Data Source Type Area Task Paper Relevant Implementations
    Sentinel-2 RGB Images with tif and png labels Worldwide (Asia, Brazil, Europe, Middle East, USA) Change Detection (2018) (Fully Convolutional Change Detection), (Patch-based Change Detection)
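The 13 Sentinel-2 bands mentioned above come at three native resolutions, so they usually have to be brought onto a common grid before pixel-wise change detection. The sketch below lists the standard band resolutions and upsamples a coarse band by pixel repetition. Nearest-neighbour repetition is an assumed, simplest-possible choice and not necessarily what the dataset authors or the linked implementations use.

```python
# Native ground resolution (metres) of the 13 Sentinel-2 MSI bands.
S2_BAND_RESOLUTION_M = {
    "B01": 60, "B02": 10, "B03": 10, "B04": 10, "B05": 20, "B06": 20,
    "B07": 20, "B08": 10, "B8A": 20, "B09": 60, "B10": 60, "B11": 20,
    "B12": 20,
}


def upsample_nearest(band, factor):
    """Repeat every pixel `factor` times along both axes."""
    out = []
    for row in band:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


# A made-up 2x2 patch of a 20 m band, brought to the 10 m grid.
coarse = [[1, 2],
          [3, 4]]
factor = S2_BAND_RESOLUTION_M["B05"] // 10
fine = upsample_nearest(coarse, factor)
```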
  • Urban Building Classification Dataset (UBC)
    UBC is a dataset aimed for the downstream task of building detection and classification from very high-resolution satellite imagery. The focus is on object-level interpretation of individual buildings. It is meant to provide not only a flexible test platform for object detection algorithms but also a solid basis for the comparison of city morphologies and the investigation of urban planning. As is stated in the linked paper, "UBC represents individual buildings using in-depth object-level descriptions concerning geometry as well as functionality. Buildings are treated as objects with individual ID and boundary. Adjacent building blocks are also separated according to house numbers making a subsequent high-level classification of individual buildings possible. The buildings are classified into predefined roof types, such as flat, gable and hipped roof as well as functional purposes, i.e., residential, commercial, industrial, public, and their sub-classes, e.g., single-family house, office building and school".

    Data Source Type Area Task Paper Relevant Implementations
    SuperView (a.k.a. GaoJing) and Gaofen-2 TIF Images, XML metadata and json annotations Beijing and Munich Building Detection, Change Detection (2022) -
  • xBD Dataset
    xBD is a large-scale dataset for the advancement of change detection and building damage assessment for humanitarian assistance and disaster recovery research. It contains more than 850,000 satellite images (Maxar satellites) of buildings before and after a variety of natural disasters, along with corresponding annotations of damage level and relevant metadata. Furthermore, xBD contains bounding boxes and labels for environmental factors such as fire, water, and smoke.

    Data Source Type Area Task Paper Relevant Implementations
    Maxar satellites RGB Images (png) and json annotations and metadata Various areas worldwide (a total of around 45500 km^2) Change Detection (2019) (xView Baseline)
  • SZTAKI AirChange Benchmark set
    A ground truth collection for change detection in optical aerial images taken several years apart. It contains 13 aerial image pairs of size 952x640 and resolution 1.5m/pixel and binary change masks (drawn by hand), which were used for evaluation of the relevant papers (check table below). Each record contains a pair of preliminarily registered input images and a mask of the 'relevant' changes. The input images were taken 5, 7 and 23 years apart, respectively. During the generation of the change mask, the creators have considered the following differences as relevant changes: (a) new built-up regions (b) building operations (c) planting of large group of trees (d) fresh plough-land (e) groundwork before building over. Note that the ground truth does NOT contain change classification, only a binary change/no-change decision for each pixel.

    Data Source Type Area Task Relevant Papers Relevant Presentation
    Aerial Photos Both images and ground truth in bmp format Szada and Tiszadob (Hungary) Change Detection (2009/1), (2009/2) (2008)

Analysis Ready Remote Sensing Data without labels

  • EarthNet2021 dataset

  • Sentinel-2 Multitemporal Cities Pairs (S2MTCP)

    • This dataset contains N=1520 Sentinel-2 level 1C image pairs focused on urban areas around the world. Bands with a spatial resolution coarser than 10 m are resampled to 10 m and images are cropped to approximately 600x600 pixels. Some images are smaller than 600x600 pixels because their coordinates were located close to the edge of a Sentinel tile, in which case the images were cropped to the tile border. Geometric or radiometric corrections are not performed. The dataset has been tested with multiple self-supervised learning methods for pre-training models for change detection.

      Data Source Type Area Task Paper Code Relevant Implementations
      Sentinel-2 image patches Worldwide Self Supervised Learning (2020) - -
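The cropping behaviour described above (roughly 600x600 pixels, smaller when the location sits near a tile edge) can be sketched as a window clipped to the tile bounds. `crop_window` is a hypothetical helper, not the authors' code. The default tile size of 10980 pixels is the standard Sentinel-2 tile width at 10 m, and centring the window on the coordinate is an assumption.

```python
def crop_window(center_row, center_col, size=600, tile_size=10980):
    """Return (row0, row1, col0, col1), clipped to the tile borders."""
    half = size // 2
    row0 = max(center_row - half, 0)
    col0 = max(center_col - half, 0)
    row1 = min(center_row + half, tile_size)
    col1 = min(center_col + half, tile_size)
    return row0, row1, col0, col1


# Well inside the tile: a full 600x600 window.
inner = crop_window(5000, 5000)

# Near the top-right corner: the window is cut at the tile border.
edge = crop_window(100, 10900)
edge_shape = (edge[1] - edge[0], edge[3] - edge[2])
```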

In-situ & Ground-level datasets

Geo-referenced labels

Water quality

Analysis Ready Remote Sensing Data with labels

  • AquaSat
    AquaSat contains more than 600,000 matchups, covering 1984–2019, of ground-based total suspended sediment, dissolved organic carbon, chlorophyll-a, and Secchi disk depth (SDD) measurements paired with spectral reflectance from Landsat 5, 7, and 8 collected within ±1 day of each other. To build AquaSat, the authors developed open source tools in R and Python and applied them to existing public data sets covering the contiguous United States, including the Water Quality Portal, LAGOS-NE, and the Landsat archive.

    Data Source Type Area Task Paper Code
    Landsat 5,7,8 and in-situ (WQP and LAGOS-NE) csv Water bodies across USA (1984-2019) Water Quality estimation (2019) (Code used for dataset generation)
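The matchup rule described above (an in-situ sample is paired with a satellite observation taken within ±1 day) can be illustrated with a small date join. This is a simplified sketch on made-up records: it matches on a hypothetical `site` key instead of the spatial join a real pipeline would need, and it is not the authors' code.

```python
from datetime import date

# Made-up in-situ samples and satellite overpasses.
in_situ = [
    {"site": "A", "date": date(2015, 6, 10), "chl_a": 4.2},
    {"site": "A", "date": date(2015, 6, 20), "chl_a": 5.1},
    {"site": "B", "date": date(2015, 6, 10), "chl_a": 1.3},
]
overpasses = [
    {"site": "A", "date": date(2015, 6, 11), "scene": "scene-1"},
    {"site": "A", "date": date(2015, 6, 22), "scene": "scene-2"},
    {"site": "B", "date": date(2015, 6, 10), "scene": "scene-3"},
]


def matchups(samples, scenes, max_days=1):
    """Pair each sample with the scenes at the same site within max_days."""
    pairs = []
    for sample in samples:
        for scene in scenes:
            same_site = sample["site"] == scene["site"]
            gap = abs((sample["date"] - scene["date"]).days)
            if same_site and gap <= max_days:
                pairs.append((sample, scene))
    return pairs


pairs = matchups(in_situ, overpasses)
```

The second sample at site A is dropped because its nearest overpass is two days away.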
  • A dataset of remote-sensed Forel-Ule Index for global inland waters during 2000–2018
    This dataset provides significant information on spatial and temporal changes of water colour for global large lakes from 2000–2018 based on MODIS observations. It will be valuable to studies in search of the drivers of global and regional lake colour change, and the interaction mechanisms between water colour, hydrological factors, climate change, and anthropogenic activities.

    Data Source Type Area Task Paper Code
    MODIS csv Global (Large lakes) Water quality estimation & Water colour variability (2021) (IDL Code used to calculate FUI from MOD09A1 data)

Analysis Ready Remote Sensing Data without labels

In-situ & Ground-level datasets

Geo-referenced labels

Air quality

Analysis Ready Remote Sensing Data with labels

Analysis Ready Remote Sensing Data without labels

In-situ & Ground-level datasets

  • Air Quality e-Reporting (AQ e-Reporting)
    European air quality information reported by EEA member countries, including all EU Member States, as well as EEA cooperating and other reporting countries. The EEA’s air quality database consists of a multi-annual time series of air quality measurement data and calculated statistics for a number of air pollutants. It also contains meta-information on the monitoring networks involved, their stations and measurements, air quality modelling techniques, as well as air quality zones, assessment regimes, compliance attainments and air quality plans and programmes reported by the EU Member States and European Economic Area countries.
    Data Source Type Area Task Paper Code
    Air quality monitoring stations csv European Union Member States Air Quality modelling - -

Geo-referenced labels

  • NO2 Air Pollution Data
    With support from NASA, the Holloway Group at SAGE has developed a set of user-friendly datasets to support wider utilization of remote sensing data for air quality and health. This growing inventory of data includes:

    • Shapefiles of NO2 air pollution from satellite for use in GIS platforms, including the EPA’s EJSCREEN platform for environmental justice
    • 12 km x 12 km daily gridded data of NO2 air pollution from satellite for comparison with photochemical grid model output or other data sources

    Moreover, this dataset contains daily gridded DOMINO NO2 data, zipped into monthly files. These data were generated from Level-2 satellite data (on swaths) and gridded to a 12 km x 12 km horizontal resolution over the continental United States using the Wisconsin Horizontal Interpolation Program for Satellites (WHIPS) for ease of comparison with photochemical grid model output.

    Data Source Type Area Task Paper Code
    Satellite nc (NetCDF) - Can be opened through Python, Excel, etc. USA (some states) Air Quality and Health (Paper) -
  • CAMS reanalysis data
    The CAMS reanalysis is the latest global reanalysis data set of atmospheric composition (AC) produced by the Copernicus Atmosphere Monitoring Service (CAMS), consisting of 3-dimensional time-consistent AC fields, including aerosols, chemical species and greenhouse gases (GHGs) through the separate CAMS global greenhouse gas reanalysis (EGG4). The CAMS global reanalysis (EAC4) currently covers the period 2003-June 2021 and CAMS global greenhouse gas reanalysis (EGG4) currently covers the period 2003-2020.

    Data Source Type Area Task Paper Code
    Model data with observations GRIB or NetCDF files Globally Air Quality (2019) -

Other

Analysis Ready Remote Sensing Data with labels

  • The WorldStrat Dataset
    Nearly 10,000 km² of free high-resolution and matched low-resolution satellite imagery of unique locations which ensure stratified representation of all types of land-use across the world: from agriculture to ice caps, from forests to multiple urbanization densities. Those locations are also enriched with typically under-represented locations in ML datasets: sites of humanitarian interest, illegal mining sites, and settlements of persons at risk. Each high-resolution image (1.5 m/pixel) comes with multiple temporally-matched low-resolution images from the freely accessible lower-resolution Sentinel-2 satellites (10 m/pixel).

    Data Source Type Area Task Paper Code
    SPOT 6/7 and Sentinel-2 Image patches Worldwide Super-Resolution (2022) (GitHub)
  • Hephaestus
    Hephaestus is the first manually annotated dataset that consists of 19,919 individual Sentinel-1 interferograms acquired over 44 different volcanoes globally, which are split into 216,106 InSAR patches. The annotated dataset is designed to address different computer vision problems, including volcano state classification, semantic segmentation of ground deformation, detection and classification of atmospheric signals in InSAR imagery, interferogram captioning, text to InSAR generation, and InSAR image quality assessment.

    Data Source Type Area Task Paper Code
    Sentinel-1 Image patches Worldwide Computer Vision (e.g. Volcanic deformation classification) (2022) (GitHub - same as for dataset access)

Analysis Ready Remote Sensing Data with labels

  • Sen1Floods11
    A surface water dataset including raw Sentinel-1 imagery and classified permanent water and flood water. This dataset consists of 4,831 512x512 chips covering 120,406 km² and spans all 14 biomes, 357 ecoregions, and 6 continents of the world across 11 flood events.

    Data Source Type Area Task Paper Code
    Sentinel-1 GeoTIFF Worldwide Flood water analysis (2020) (GitHub - same as for dataset access)
  • Labeled SAR imagery dataset of ten geophysical phenomena from Sentinel-1 wave mode (TenGeoP-SARwv)
    The TenGeoP-SARwv dataset is established based on the acquisitions of Sentinel-1A wave mode (WV) in VV polarization. This dataset consists of more than 37,000 SAR vignettes divided into ten defined geophysical categories, including both oceanic and meteorologic features. These images cover the entire open ocean and are manually selected from Sentinel-1A WV acquisitions in 2016. For each image, only one prevalent geophysical phenomenon with its prescribed signature and texture is selected for labeling. The SAR images are processed into a quick-look image provided in the formats of PNG and GeoTIFF as well as the associated labels. They are convenient for both visual inspection and machine-learning-based methods exploitation.

    Data Source Type Area Task Paper Code
    Sentinel-1 PNG and GeoTIFF Globally (open ocean) Modeling of Oceanographic and Atmospheric phenomena (Paper) -
  • VisDrone dataset
    From the description of the dataset repository: Drones, or general UAVs, equipped with cameras have been fast deployed to a wide range of applications, including agricultural, aerial photography, fast delivery, and surveillance. Consequently, automatic understanding of visual data collected from these platforms become highly demanding, which brings computer vision to drones more and more closely. We are excited to present a large-scale benchmark with carefully annotated ground-truth for various important computer vision tasks, named VisDrone, to make vision meet drones. The VisDrone2019 dataset is collected by the AISKYEYE team at Lab of Machine Learning and Data Mining, Tianjin University, China. The benchmark dataset consists of 288 video clips formed by 261,908 frames and 10,209 static images, captured by various drone-mounted cameras, covering a wide range of aspects including location (taken from 14 different cities separated by thousands of kilometers in China), environment (urban and country), objects (pedestrian, vehicles, bicycles, etc.), and density (sparse and crowded scenes). Note that, the dataset was collected using various drone platforms (i.e., drones with different models), in different scenarios, and under various weather and lighting conditions. These frames are manually annotated with more than 2.6 million bounding boxes of targets of frequent interests, such as pedestrians, cars, bicycles, and tricycles. Some important attributes including scene visibility, object class and occlusion, are also provided for better data utilization.

    Data Source Type Area Task Paper Code Relevant Implementations
    Drones/UAVs JPG images and txt annotations 14 cities in China Object detection and tracking in images and videos, Crowd counting (Paper) (GitHub - Dataset access), (GitHub - Documentation) (GitHub)
  • AU-AIR Dataset
    AU-AIR dataset is the first multi-modal UAV dataset for object detection. It meets vision and robotics for UAVs having the multi-modal data from different on-board sensors, and pushes forward the development of computer vision and robotic algorithms targeted at autonomous aerial surveillance. AU-AIR has several features:

    • Object detection in aerial images
    • more than 2 hours raw videos
    • 32,823 labelled frames
    • 132,034 object instances
    • 8 object categories related to traffic surveillance
    • Frames are also labelled with time, GPS, IMU, altitude, linear velocities of the UAV
    Data Source Type Area Task Paper Code
    Drones/UAVs JPG (images) and json (annotations) Aarhus, Denmark Object detection (Paper) (GitHub - Dataset), (GitHub - Tools/API)
  • LandCover.ai: Dataset for Automatic Mapping of Buildings, Woodlands, Water and Roads from Aerial Imagery
    Semantic segmentation dataset for land cover classification based on aerial RGB images. Contains four manually annotated land cover classes: buildings, woodlands, water, roads. It covers 216 km² over Poland, with 25 cm / 50 cm resolution. [Paper]

    Data Source Type Area Task Paper Code
    Aerial Photos (RGB) GeoTiffs Poland Land Cover Mapping (2021) -

Analysis Ready Remote Sensing Data without labels

In-situ & Ground-level datasets

Geo-referenced labels

Web Application / Websites with labelled data

  • Mapillary Street Level Images
    A web platform/application where crowdsourced map data and street level imagery are available to everyone. Computer vision is used to combine those images and create immersive street-level views. Among many other features, Mapillary offers:

    • A quite extended coverage for Europe
    • Integration with OpenStreetMap, ArcGIS tools, and HERE Map Creator
    • The ability to request imagery for areas that either don’t have images yet or need a more recent version of them
    • Navigation in a Google Street View style for easy visual interpretation
    • Filter imagery by capture time
    • Filter imagery by the types of objects that appear in the images (not an extended list of agriculture-specific objects yet though - mainly focused on city infrastructure and traffic lights/signs for now)
      Data Source Type Area Task Relevant Papers Relevant Implementations
      Mobile phones, action cameras etc. on the street-level JPG and json annotations (through web or API) Worldwide (crowdsourced) Computer Vision (2016), (2016), (2017), and many more - find full list here (GitHub - Space2Ground), (GitHub - Agricultural annotations)
  • Eden Library
    Eden Library is a collection of high value plant datasets embedding agricultural domain knowledge produced in an academic environment. Eden Library includes a wide range of agrifood datasets such as:

    • Plant pests
    • Plant diseases
    • Weeds
    • Healthy plants
      That were acquired using:
    • Various styles (Proximal, UAV upon request)
    • Various sensors (RGB, thermal, multispectral & hyperspectral upon request)
    Data Source Type Area Task Paper Code
    Mobile phones, cameras (in-situ and UAV-mounted) Images Greece Precision agriculture tasks (Paper) (GitHub - Notebooks)
  • senseFly
    Explore how senseFly drone solutions are employed around the globe — from topographic mapping and site surveys to stockpile monitoring, crop scouting, earthworks, climate change research and much more. The main domains that are included in this dataset are:

    • Tactical Mapping
    • Surveying & Mapping
    • Mining, Quarries & Aggregates
    • Engineering & Construction
    • Agriculture
    • Environmental Monitoring
    • Humanitarian
    Data Source Type Area Task Paper Code
    Drones/UAVs Images Worldwide Various (topographic mapping, crop scouting, climate change, etc.) - -

European projects

  • Global Earth Monitor (GEM)

    • Most of the datasets that were either used to produce GEM results, or are results of the GEM project, are made publicly available.
    • To encourage uptake of the GEM framework, they share example code showing how to access the data, in notebooks available at https://github.com/sentinel-hub/eo-learn-examples/.
    • The introductory notebook gives an overview of the data used and produced within the GEM framework. The examples, presented in Jupyter Notebooks, are structured according to data type:
      • EO data: Earth Observation data (e.g., Sentinel and Landsat missions)
      • EO derived data: data, derived from EO data (e.g., Global Land Cover)
      • EO commercial data: commercial EO data (e.g., Maxar imagery)
      • weather/climate data: weather data, accessible through meteoblue services
      • GEM ML-ready data cubes: analysis/machine-learning-ready data cubes, created within the GEM project
      • GEM datasets: these give an easier-to-navigate, clearer overview of the data produced in the various GEM use cases, and are further structured into several notebooks. The use cases are:
        • Built-up areas use-case
        • Map making use-case
        • Land Cover - Continuous Monitoring Service (LC-CMS)
  • DeepCube

    • mesogeos: A Daily Datacube for the Modeling and Analysis of Wildfires in the Mediterranean

      • mesogeos is meant to be used to develop models for next-day fire hazard forecasting in the Mediterranean. The dataset contains satellite data from MODIS, weather variables from ERA5-Land, soil moisture index from JRC European Drought Observatory, population count & distance to roads from worldpop.org, land cover from Copernicus Climate Change Service, elevation, aspect, slope and curvature from Copernicus EU-DEM, and burned areas and ignition points from EFFIS.
      • Available at: https://doi.org/10.5281/zenodo.7473332
      • More information and a download link for the dataset can be found at https://github.com/Orion-AI-Lab/mesogeos
    • Hephaestus: A large scale multitask dataset towards InSAR understanding

      • Hephaestus is a first-of-its-kind, manually annotated dataset consisting of 19,919 individual Sentinel-1 interferograms acquired over 44 different volcanoes globally, which are split into 216,106 InSAR patches. The annotated dataset is designed to address different computer vision problems, including volcano state classification, semantic segmentation of ground deformation, detection and classification of atmospheric signals in InSAR imagery, interferogram captioning, text-to-InSAR generation, and InSAR image quality assessment.
      • Available at: https://github.com/Orion-AI-Lab/Hephaestus
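As a rough illustration of how a single interferogram can be split into fixed-size patches, the sketch below enumerates non-overlapping patch windows. The patch size of 224 pixels, the image dimensions and the function name `patch_windows` are assumptions made for illustration; they are not taken from the Hephaestus code base.

```python
def patch_windows(height, width, patch=224):
    """Return (row, col) top-left corners of non-overlapping patches.

    Incomplete patches at the right and bottom edges are dropped.
    The default patch size is an assumption, not a documented value.
    """
    return [
        (row, col)
        for row in range(0, height - patch + 1, patch)
        for col in range(0, width - patch + 1, patch)
    ]


# Hypothetical interferogram of 700 x 1000 pixels.
windows = patch_windows(700, 1000)
print(len(windows), windows[:3])
```

From the figures quoted above, the dataset averages roughly 10.8 patches per interferogram (216,106 / 19,919).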
    • Annotated InSAR datasets for volcanic unrest detection

    • Africa minicubes dataset

      • The DeepCube Africa Minicubes dataset has been designed for prototyping models that forecast drought impacts in Africa. It is an open-access dataset consisting of 50,000 spatio-temporal minicubes (13 TB, 2017-2022). It pairs high-resolution remotely sensed spectral bands with weather observations in sparsely sampled minicubes of 3.84 × 3.84 km. The data are shaped into minicubes to facilitate the training of deep learning spatio-temporal models that exploit both spatial and temporal dependencies (convolutions and recurrence, e.g., video prediction models).
      • Access the dataset on Zenodo
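To get a feel for these numbers before downloading, the following back-of-the-envelope sketch derives the average size of one minicube and the number of pixels per side for a few candidate pixel sizes. It assumes the quoted total size means 13 decimal terabytes; the pixel sizes are hypothetical, since the description above does not state the spatial resolution.

```python
N_MINICUBES = 50_000        # from the dataset description
TOTAL_BYTES = 13 * 10**12   # assumption: 13 decimal terabytes
SIDE_M = 3_840              # 3.84 km minicube side, in metres


def average_minicube_bytes(total_bytes=TOTAL_BYTES, n=N_MINICUBES):
    """Average storage footprint of one minicube, in bytes."""
    return total_bytes // n


def side_in_pixels(pixel_size_m, side_m=SIDE_M):
    """Pixels per minicube side for a given (assumed) pixel size in metres."""
    if side_m % pixel_size_m:
        raise ValueError("pixel size does not divide the minicube side evenly")
    return side_m // pixel_size_m


print(average_minicube_bytes() / 10**6, "MB per minicube on average")
for pixel_size_m in (10, 20, 30):  # hypothetical pixel sizes
    print(pixel_size_m, "m ->", side_in_pixels(pixel_size_m), "pixels per side")
```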
    • Somalia EO data cube for drought displacement

      • EO data cube consisting of climate variables between 2010 and 2022 at 0.1 degree resolution (roughly 1 GB) over Somalia. The data cube includes soil moisture at 4 different levels, total precipitation, 2 m air temperature and potential evaporation from ERA5-Land, as well as precipitation obtained from CHIRPS. This data cube serves as the basic input resource for understanding the effect of climate on drought displacement at various spatial and temporal aggregation levels.
      • A tabular dataset is also available, containing all the variables (climate and vegetation variables, fatality variables, socio-economic variables, and Internally Displaced Persons (IDPs)) for analyzing climate-induced migration in Africa.
      • Can be accessed via Zenodo
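Temporal aggregation of the kind mentioned above can be sketched with the standard library alone. The example assumes daily values for a single grid cell; the numbers are made up, and `monthly_totals` is an illustrative helper rather than part of any published tooling for this data cube.

```python
from collections import defaultdict
from datetime import date


def monthly_totals(daily):
    """Sum (date, value) pairs into a {(year, month): total} mapping."""
    totals = defaultdict(float)
    for day, value in daily:
        totals[(day.year, day.month)] += value
    return dict(totals)


# Made-up daily precipitation values (mm) for one 0.1 degree cell.
daily_precip = [
    (date(2021, 3, 30), 1.5),
    (date(2021, 3, 31), 0.5),
    (date(2021, 4, 1), 4.0),
    (date(2021, 4, 2), 2.0),
]
print(monthly_totals(daily_precip))
```

The same grouping pattern applies to other aggregation levels (for example seasons or years) by changing the key that is built from each date.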
    • Data cube for the wildfire research community

      • Meant to be used to develop models for next-day fire hazard forecasting in Greece. The dataset includes dynamic variables, such as previous-day Leaf Area Index, evapotranspiration, Land Surface Temperature, meteorological data, fire variables and Fire Weather Index, resampled at daily temporal resolution and 1 km spatial resolution. It also includes static variables, such as road density, population density and topography layers.
      • To download and directly access the data cube please visit http://doi.org/10.5281/zenodo.4943354
      • To import and analyse the data within the data cube, this Jupyter Notebook can be used.
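Next-day forecasting means that the predictors observed on day t are paired with the fire label of day t+1. The sketch below shows that pairing for a single grid cell. The variable names (`lst`, `fwi`) and all values are hypothetical, and the helper is an illustration of the idea, not the data cube's own loading code.

```python
from datetime import date, timedelta


def next_day_pairs(features_by_day, fire_by_day):
    """Pair each day's features with the fire label of the following day."""
    pairs = []
    for day in sorted(features_by_day):
        target_day = day + timedelta(days=1)
        if target_day in fire_by_day:  # skip days with no next-day label
            pairs.append((features_by_day[day], fire_by_day[target_day]))
    return pairs


# Made-up values for a single 1 km cell; variable names are hypothetical.
features = {
    date(2020, 7, 1): {"lst": 305.2, "fwi": 21.0},
    date(2020, 7, 2): {"lst": 309.8, "fwi": 34.5},
    date(2020, 7, 3): {"lst": 311.1, "fwi": 40.2},
}
fire = {date(2020, 7, 2): 0, date(2020, 7, 3): 1, date(2020, 7, 4): 0}

for x, y in next_day_pairs(features, fire):
    print(x, "->", y)
```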
    • Data cube to calculate the environmental impact of tourism in Brazil

      • The aim of the dataset is to use models to isolate the impact of a travel package offered by a tourism stakeholder, so that the virtual extra cost of a tourist on the local environment can be calculated. The datacube includes dynamic variables such as the land surface temperature, the soil moisture condition, the Normalized Difference Vegetation Index, the atmospheric composition through different variables (carbon monoxide, nitrogen dioxide, ozone, sulphur dioxide, etc.), the thermal comfort index of the Tourism Sustainable Development Index, as well as static variables such as topographic layers or land cover maps.
      • To access and download the datacube visit https://doi.org/10.5281/zenodo.5076076.
  • NextGEOSS Data Catalog (DaaS)

    • Archived data: YES
    • Real Time data or NEAR REAL TIME: YES
    • Data harvesting policy: Initially driven by the needs of the pilots, now open to all European projects.
    • Data are accessible via a standard OpenSearch API with the OGC OpenSearch Geo and Time Extensions (http://www.opengeospatial.org/standards/opensearchgeo)
    • The Catalog GUI can generate requests that can then be reused in a machine-to-machine (M2M) dialogue. Compliance has been validated with NASA validation tools.
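An OpenSearch service publishes URL templates whose placeholders, such as `{geo:box}`, `{time:start}` and `{time:end}`, come from the Geo and Time extensions. The sketch below fills such a template to build a machine-to-machine request. The endpoint and the query parameter names (`q`, `bbox`, `start`, `end`) are hypothetical; the real ones should be read from the catalogue's OpenSearch description document.

```python
import re
from urllib.parse import quote

# Hypothetical URL template of the kind found in an OpenSearch
# description document; the host and parameter names are placeholders.
TEMPLATE = (
    "https://catalogue.example.org/opensearch/search"
    "?q={searchTerms?}&bbox={geo:box?}&start={time:start?}&end={time:end?}"
)


def fill_template(template, values):
    """Substitute template parameters; optional ones ('?') default to empty."""
    def substitute(match):
        name, optional = match.group(1), match.group(2)
        if name in values:
            return quote(str(values[name]), safe=",:")
        if optional:
            return ""
        raise KeyError(f"missing required parameter: {name}")

    return re.sub(r"\{([^{}?]+)(\?)?\}", substitute, template)


def bbox(west, south, east, north):
    """Format a bounding box in geo:box order: west, south, east, north."""
    return f"{west},{south},{east},{north}"


url = fill_template(TEMPLATE, {
    "searchTerms": "Sentinel-2",
    "geo:box": bbox(19.0, 34.5, 29.5, 42.0),
    "time:start": "2021-06-01T00:00:00Z",
    "time:end": "2021-06-30T23:59:59Z",
})
print(url)
```

Filling the published template, rather than hard-coding parameter names, keeps the client valid if the catalogue changes how it names its query parameters.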
  • Geocradle Data Catalog (DaaS)

    • GEO-CRADLE PILOT 1 datasets: Adaptation to Climate Change (ACC): ACC-DUST
    • GEO-CRADLE PILOT 2 datasets: Improved Food Security – Water Extremes Management (IFS-WEM): Regional Soil Spectral Library
    • GEO-CRADLE PILOT 3 datasets: Access to Raw Materials (ARM)
    • The Regional Data Hub provides access to millions of regional datasets, and thus fosters further data sharing and EO service development for the benefit of the relevant science and geo-information sector. It includes 26,623,346 datasets and keeps growing.
  • Copernicus

Other Useful Data Collections

  • Radiant MLHub: Radiant MLHub hosts open ML training datasets and models generated by Radiant Earth Foundation, partners, and community. Radiant MLHub allows anyone to access, store, register, and share open training datasets and models for high-quality Earth observations, and it’s designed to encourage widespread collaboration and development of trustworthy applications.
  • Satellite Image Deep Learning: This page lists resources for performing deep learning on satellite imagery. To a lesser extent, classical machine learning (e.g. random forests) is also discussed, as are classical image processing techniques.
  • Awesome Remote Sensing Change Detection: A list of datasets, codes, and contests related to remote sensing change detection.
  • Satellite Image Time Series Datasets: A list of satellite imagery datasets with a temporal dimension, mainly satellite image time series (SITS) and satellite videos, for various computer vision and deep learning tasks. It covers multi-temporal datasets with more than two acquisitions but not bi-temporal datasets. By corentin-dfg.
  • IEEE GRSS Earth Observation Database: This webpage provides an interactive and searchable catalog of public benchmark datasets for remote sensing and earth observation with the aim to support researchers in the fields of geoscience, remote sensing, and machine learning.
  • Awesome-Remote-Sensing-Dataset: This GitHub repository contains a plethora of remote sensing datasets, categorized by downstream task (Image Classification, Object Detection, Semantic Segmentation, Building Detection, Road Detection, Ship Detection, Change Detection, Super Resolution, Stereo Matching, Lidar and Other data).
  • AiTLAS: Benchmark Arena: AiTLAS: Benchmark Arena is an open-source benchmark framework for evaluating state-of-the-art deep learning approaches for image classification in Earth Observation (EO). It offers a comprehensive comparative analysis of more than 400 models derived from nine different state-of-the-art architectures, evaluated on a variety of multi-class and multi-label classification tasks from 22 datasets with different sizes and properties. More details can also be found in the corresponding paper.

Contact

Acknowledgements

This work has been supported by the CALLISTO project, which has been funded by the EU's Horizon 2020 research and innovation programme under grant agreement No. 101004152.

Curated by the Beyond Center of EO Research and Satellite Remote Sensing, IAASARS, National Observatory of Athens