- @Saverio Mancino - PhD Student (Dept. Geo-environmental science - University of Bari).
- @Anna Sblano - Research Fellow (Dept. Geo-environmental science - University of Bari).
- @Francesco Paolo Lovergine - Researcher (Institute for the Electromagnetic Survey of the Atmosphere - National Research Council of Italy).
- @Giuseppe Amatulli - PhD Researcher (School of the Environment - Yale University).
- Domenico Capolongo - PhD Professor (Dept. Geo-environmental science - University of Bari).
The main purpose of this project is to build a single global-scale landslide dataset that collects and standardizes all open-source global, national, and sub-national landslide datasets. Each record carries details on its spatial and temporal accuracy, along with general information about the event.
The whole code is published under the MIT License.
The catalogue, along with its associated analyses, methodologies, and results, is planned for publication by Spring 2025.
This release will include comprehensive documentation and datasets, making the information fully accessible to the
scientific community and the public.
Stay tuned for updates as we approach the release date.
The catalogue data are available in these Google Drive repositories, as a full catalogue in .csv format (with '|' as separator) and as tiles in .gpkg format (see tiling scheme). They are distributed under the Creative Commons Attribution 4.0 International licence (CC BY 4.0).
CATALOGUE TYPE | NUMBER OF RECORDS | FILES REPOSITORY |
---|---|---|
UGLC Point Catalogue files | 1061450 points | FULL CATALOGUE download ⬇️ / TILED CATALOGUE download ⬇️ |
UGLC Polygonal Catalogue files | 984126 polygons | POLYGONAL CATALOGUE |
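As a minimal sketch of how the full catalogue could be read, the snippet below parses '|'-separated text with Python's standard `csv` module and extracts the coordinates from a WKT point. The two sample rows, the column subset, and the helper names `read_catalogue` and `parse_wkt_point` are illustrative assumptions and not part of the UGLC code base; the sketch also assumes point geometries written as `POINT (lon lat)`.

```python
import csv
import io
import re

# Hypothetical excerpt: real files contain every catalogue attribute.
SAMPLE = (
    "WKT_GEOM|NEW DATASET|ID|COUNTRY\n"
    "POINT (16.87 41.12)|UGLC|1|Italy\n"
    "POINT (-122.68 45.52)|UGLC|2|United States\n"
)

# Assumed geometry layout: "POINT (x y)" with x = longitude, y = latitude.
_POINT_RE = re.compile(r"^\s*POINT\s*\(\s*(\S+)\s+(\S+)\s*\)\s*$", re.IGNORECASE)


def read_catalogue(handle):
    """Read '|'-separated catalogue text into a list of dicts (values stay strings)."""
    return list(csv.DictReader(handle, delimiter="|"))


def parse_wkt_point(wkt):
    """Return (longitude, latitude) as floats from a WKT point string."""
    match = _POINT_RE.match(wkt)
    if match is None:
        raise ValueError(f"not a WKT point: {wkt!r}")
    return float(match.group(1)), float(match.group(2))


records = read_catalogue(io.StringIO(SAMPLE))
lon, lat = parse_wkt_point(records[0]["WKT_GEOM"])
print(records[0]["COUNTRY"], lon, lat)
```

For a downloaded file, the `io.StringIO(SAMPLE)` handle would be replaced by `open(path, newline="", encoding="utf-8")`, where `path` is wherever the .csv was saved and the encoding is an assumption to check against the actual file.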
ATTRIBUTE | TYPE |
---|---|
WKT_GEOM | Well known text |
NEW DATASET | String |
ID | Int |
OLD DATASET | String |
OLD ID | String |
VERSION | String |
COUNTRY | String |
ACCURACY | Int |
START DATE | Date |
END DATE | Date |
TYPE | String |
PHYSICAL FACTORS | String |
RELIABILITY | Int |
RECORD TYPE | String |
FATALITIES | Int |
INJURIES | Int |
NOTES | String |
LINK | String |
-
WKT_GEOM: the geometry of each record as Well-Known Text (WKT), georeferenced in the WGS84 reference system.
-
NEW DATASET: the identifying abbreviation of the new dataframe: "UGLC".
-
ID: a unique ID for each landslide event included in the UGLC dataset.
-
OLD DATASET: the name of the native dataset used to create the UGLC:
POINT DATASET | REFERENCE NAME | N° POINTS | LICENSE | DOWNLOAD | IMPLEMENTED |
---|---|---|---|---|---|
01_COOLR | Cooperative Open Online Landslide Repository (NASA) Event + Report points (with no duplicates) | 49718 | LICENSE | free | ✔️ |
02_GFLD | Global fatal landslide occurrence from 2004 to 2016 | 5490 | LICENSE | free | ✔️ |
03_ITALICA | ITAlian rainfall-induced LandslIdes CAtalogue (CNR - IRPI) | 6312 | LICENSE | free | ✔️ |
04_UAP | Landslide Inventories across the United States version2 (USGS) | 176427 | LICENSE | free | ✔️ |
05_ALC | Australia Landslide Catalogue | 1653 | LICENSE | free | ✔️ |
06_PCLD | Preliminary Canadian Landslide Database | 10134 | LICENSE | free | ✔️ |
07_RBR | Shallow Landslide Inventory for 2000-2019 (eastern DRC, Rwanda, Burundi) | 7945 | LICENSE | free | ✔️ |
08_NZK | Map of co-seismic Landslides for the 7.8 Kaikoura earthquake, New Zealand | 7355 | LICENSE | free | ✔️ |
09_CA | Mass Movements Information System (SIMMA) of the Colombian Geological Service | 1065 | LICENSE | free | ✔️ |
10_BGS | National Landslide Database - Index data (BGS) | 15050 | LICENSE | on demand | ✔️ |
11_NTMI | Landslide Events Data (GSI) | 2811 | LICENSE | free | ✔️ |
12_VLS | Vermont Geological Survey's preliminary landslide inventory | 3049 | LICENSE | free | ✔️ |
13_SLIDO | Statewide Landslide Information Database for Oregon (DOGAMI) | 15378 | LICENSE | free | ✔️ |
14_1N | 1N (2015-2027): French Landslide Observatory – OMIV (Temporary data) | 194 | LICENSE | free | ✔️ |
15_CAFLAG | The CAmpi Flegrei LAndslide Geodatabase | 2302 | LICENSE | free | ✔️ |
16_ETGFI | ETGFI - Earthquake-Triggered Ground-Failure Inventories (POINTS) - USGS | 115402 | LICENSE | free | ✔️ |
17_IFFI | IFFI - Inventario fenomeni franosi in Italia (ISPRA) | 622447 | LICENSE | free | ✔️ |

-
OLD ID: the contents of this field represent the identifying id assigned to this row in the source dataset (if any)
-
VERSION: the contents of this field represent the latest updated version of the original dataset used (if specified)
-
COUNTRY: the content of this field represents the country where the record was located (where missing it was derived using its coordinates)
-
ACCURACY: the spatial accuracy of the record, expressed in meters as the possible deviation of the georeferenced point from the actual landslide location. Where the source does not state it explicitly, it is inferred from the information in the record. Records with no accuracy information at all are treated as no-data and take the value '-99999', which marks them as spatially uncertain.
-
START DATE: the date of the record (format: YYYY/MM/DD, in ISO 8601 year-month-day order). If the source dataset gives an exact date, this field coincides with the END DATE field. If the record date is missing or not explicit, this field holds the start date of the dataset's acquisition time range, so it differs from the END DATE field and signals the temporal uncertainty of the record. If no start date can be derived at all, or if the record start date is before '1677/12/31', this field is set to '1678/01/01' because of the lower limit of the pandas timestamp range.
-
END DATE: the date of the record (format: YYYY/MM/DD, in ISO 8601 year-month-day order). If the source dataset gives an exact date, this field coincides with the START DATE field. If the record date is missing or not explicit, this field holds the end date of the dataset's acquisition time range, so it differs from the START DATE field and signals the temporal uncertainty of the record.
-
TYPE: Contains information about the geological and kinematic type of the landslide record, standardized using the extended classification of Varnes including also other common gravitational surface instability phenomena (Hungr et al., 2014). These type categories are standardized using this reference table:
LANDSLIDE CATEGORY (description) |
---|
complex |
soil creep |
debris flow |
earth flow |
lahar |
earth slide |
mudslide |
riverbank collapse |
rock slide |
rock fall |
rotational sliding |
translational sliding |
earth spreading |
rock spreading |
mud flow |
sinkhole |
ND |

-
PHYSICAL FACTORS: This field covers the physical factors that actively contribute to landslide activation, categorized into predisposing (PR), preparatory (P) and triggering (T) factors. Predisposing factors include invariant characteristics such as geology, topography, and land use; preparatory factors are monitorable cyclical changes such as seasonal variations in saturation, weathering, or fire-induced alterations; triggering factors are impulsive events such as earthquakes, intense rainfall, or volcanic activity. Predisposing factors (PR) were not considered in our classification because they were absent from the native data, so only Preparatory (P) and Triggering (T) factors are used in this catalogue. These categories are standardized using this reference table:
PHYSICAL FACTORS (description) | IDENTIFYING ABBREVIATION (value) |
---|---|
Rainfall activity | rainfall (T) |
Seismic activity | seismic (T) |
Volcanic activity | volcanic (T) |
Human-induced factors | anthropic (T,P) |
Climatic factors | climate (T,P) |
Post-fire conditions | postfire (P) |
Post-deforestation processes conditions | deforestation (P) |
Erosional and biological factors | natural (T,P) |
Not defined | ND |

-
RELIABILITY: the content of this field represents the reliability of the data based on a decision table that takes into account spatial accuracy (ACCURACY) and temporal accuracy (START DATE, END DATE):
SPATIAL RELIABILITY (meters) | TEMPORAL RELIABILITY (START DATE = END DATE) | RELIABILITY DESCRIPTION (description) | CLASS (value) |
---|---|---|---|
<100 m | TRUE | Exact point | 1 |
<100 m | FALSE | Almost exact point | 2 |
>100 m and <250 m | TRUE | Very high reliability point | 3 |
>100 m and <250 m | FALSE | High reliability point | 4 |
>250 m and <500 m | TRUE | Medium reliability point | 5 |
>250 m and <500 m | FALSE | Low reliability point | 6 |
>500 m and <1000 m | TRUE | Very low reliability point | 7 |
>500 m and <1000 m | FALSE | Poor reliability point | 8 |
>1000 m | TRUE and FALSE | Point with uncertain reliability | 9 |
-99999 | TRUE and FALSE | Unreliable point | 10 |

-
RECORD TYPE: The contents of this field contain information regarding the record type: report, event.
-
Report catalogs are usually landslide reports that typically collect a lot of detailed technical information about individual landslide events.
-
Event catalogs, on the other hand, generally focus on summarizing landslide events triggered by episodic events (such as heavy rains, earthquakes, eruptions, etc.) with less technical information and more statistical details, without delving into the specifics of each event.
-
-
FATALITIES: the content of this field contains the number of fatalities related to the record (if explicit), where the NaN values are represented by the value -99999
-
INJURIES: the content of this field contains the number of injuries related to the record (if explicit), where the NaN values are represented by the value -99999
-
NOTES: the content of this field contains the notes and information related to the record (if explicit)
-
LINK: the content of this field contains the link to the source of the record report or study (if explicit)
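The RELIABILITY decision table can be sketched as a small function. This is only an illustration of the table, not the project's implementation: the function name `reliability_class` and the `BANDS` constant are hypothetical, and because the table does not say which class a value exactly on a boundary (for example 100 m) belongs to, the sketch assumes it falls into the next, less accurate band.

```python
NO_DATA = -99999  # sentinel the catalogue uses for missing accuracy

# (upper bound in meters, class when START DATE == END DATE).
# The class for differing dates is the next integer in each band.
BANDS = [(100, 1), (250, 3), (500, 5), (1000, 7)]


def reliability_class(accuracy_m, start_date, end_date):
    """Return the reliability class (1-10) for one record."""
    if accuracy_m == NO_DATA:
        return 10  # unreliable point, whatever the dates are
    same_date = start_date == end_date
    for upper, exact_class in BANDS:
        # Assumption: a value exactly on a boundary goes to the next,
        # less accurate band; the table leaves this case open.
        if accuracy_m < upper:
            return exact_class if same_date else exact_class + 1
    return 9  # 1000 m or more: uncertain reliability, whatever the dates are


print(reliability_class(50, "2016/11/14", "2016/11/14"))
print(reliability_class(300, "2000/01/01", "2019/12/31"))
```

The dates are compared as plain strings here, which is enough to test whether START DATE and END DATE coincide when both use the same YYYY/MM/DD format.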
Folder Structure Scheme
The entire UGLC structure is allocated in 2 main repositories:
- GitHub Scripts Repository (GSR)
- Drive Files Repository (DFR)
The GSR contains 5 main folders:
-
/input
This folder contains the "native_datasets" subfolder with the standardizer scripts ("N_DATAFRAME_standardizer.py"). These read the files downloaded into the DFR 'input/download' subfolder (native datasets in .csv/.shp/.gpkg etc., downloaded from the source sites of entities, government agencies, universities, and various repositories), create a standardized .csv ready to be converted into the UGLC format, and save it (as "N_DATAFRAME_native.csv") into the DFR 'input/native_dataset' subfolder.
-
/csv
This folder contains one subfolder for each native dataset ("N_DATAFRAME"), holding the converter script ("N_DATAFRAME_converter.py") and the lookup tables ("NN_DATAFRAME_lookuptables.json"). The converter reads the native dataset from the DFR 'input/native_dataset' subfolder, filters and converts it into the UGLC standard format using the functions from the 'lib' folder, and saves the result (as "N_DATAFRAME_converted.csv") into the DFR 'output/converted_csv' subfolder.
-
/output
This folder contains the unifier script ("unifier.py"), which reads all the converted datasets from the DFR 'output/converted_csv' subfolder, then merges and filters them to generate the final UGLC dataframe ("UGLC_point_full.csv") and the tiled version ("UGLC_point_tile_i_j.gpkg"), saving everything into the DFR 'output' folder.
-
/lib
This folder contains the functions script ("function_collection.py"), whose functions are called by the converter scripts in the GSR for various data conversions.
-
/files
This folder contains all the files used by this readme file, like pictures and the license file.
The UGLC catalogue is also available in GeoPackage format, divided into 105 tiles that cover the entire Earth's surface. Each tile includes a Tile_ID attribute (_i_j) that identifies it uniquely within the grid:
i (longitude step) = [0-15]
j (latitude step) = [0-7]
Empty tiles are automatically excluded from storage, ensuring optimized file management and performance.
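To illustrate how a coordinate could be mapped to a tile suffix, the sketch below derives `_i_j` from a longitude and latitude. It rests on several assumptions that the text above does not confirm and that should be checked against the tiling scheme: a regular grid of 16 x 8 cells (taken from the index ranges, which gives 22.5-degree steps), `i` counted eastward from -180 and `j` counted northward from -90. The function name `tile_id` is hypothetical.

```python
N_LON, N_LAT = 16, 8        # assumed from the index ranges i = [0-15], j = [0-7]
LON_STEP = 360 / N_LON      # 22.5 degrees per longitude step
LAT_STEP = 180 / N_LAT      # 22.5 degrees per latitude step


def tile_id(lon, lat):
    """Return a '_i_j' tile suffix for a WGS84 coordinate.

    Assumption: i counts eastward from -180 and j counts northward from -90.
    """
    if not (-180 <= lon <= 180 and -90 <= lat <= 90):
        raise ValueError("coordinate outside the WGS84 range")
    # min() keeps the upper edges (lon = 180, lat = 90) inside the last tile.
    i = min(int((lon + 180) // LON_STEP), N_LON - 1)
    j = min(int((lat + 90) // LAT_STEP), N_LAT - 1)
    return f"_{i}_{j}"


print("UGLC_point_tile" + tile_id(16.87, 41.12) + ".gpkg")
```

If the real grid uses a different origin or direction, only the two index formulas need to change.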
To better understand the information content of both catalogues (point and polygonal), several statistical analyses
were conducted to explore key aspects of the data. This information is essential to ensure appropriate and targeted
use of the catalogues, and it highlights their potential for future scientific developments.
The analysis shows a pronounced disparity in the geographical distribution of landslide records across continents.
Within the point catalogue, Europe has the highest representation (61.55%), followed by North America (19.63%) and Asia
(10.17%). Africa, South America and Oceania collectively constitute a very small share (below 3.97%).
The polygonal catalogue shows a different pattern, with Asia and Europe leading (45.09% and 43.40%),
followed by North America (8.73%). Here too, Africa, South America and Oceania collectively constitute a negligible
portion (1.43%).
This imbalance becomes more apparent in a country-by-country analysis, which shows that the native
datasets are unevenly distributed in both record density and geographic coverage.
The country-level density data show that some relatively small countries, such as Italy, the UK, and
New Zealand, lead landslide data collection alongside large countries such as the USA and China
(sometimes far surpassing them, as in the case of Italy, which alone contributes more than 57% of the whole catalogue).
This reflects the different attention paid to landslide phenomena in the more affected countries, as well as the
different socioeconomic resources devoted to the study and analysis of landslides. However, despite
the high density of studies available for these regions, much of the data (particularly from European and Asian areas)
is not openly accessible.
Consequently, the analysis was affected by the restrictions applied to datasets that are not publicly available.
The temporal consistency analysis highlighted how heterogeneous the time information is across datasets, which required a
significant effort to standardize and interpret temporal data while addressing discrepancies in formatting and granularity.
Native datasets varied widely in their time precision, ranging from exact event dates to broader temporal ranges
(e.g., decades or centuries). For records with incomplete or poorly formatted temporal data, standardization
involved assigning representative time ranges based on the available context, ensuring logical alignment with the
recorded phenomena. This approach not only improved temporal consistency but also enhanced the utility of the catalogue
by preserving valuable, albeit imprecise, historical data. It aims to mitigate the risk of data misinterpretation
resulting from inconsistent native data formats, providing a temporally reliable catalogue.
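The START DATE and END DATE rules described in the attribute list can be sketched as follows. This is an illustrative reading of those rules rather than the project's converter code: the function name `normalise_dates` and its arguments are hypothetical, the acquisition range end is assumed to be always known, and END DATE is left untouched when START DATE is clamped because the text does not specify otherwise.

```python
from datetime import date, datetime

DATE_FORMAT = "%Y/%m/%d"
CUTOFF = date(1677, 12, 31)  # start dates before this are replaced
FLOOR = date(1678, 1, 1)     # replacement value given in the START DATE description


def _parse(text):
    """Parse a 'YYYY/MM/DD' string, passing None or '' through as None."""
    return datetime.strptime(text, DATE_FORMAT).date() if text else None


def normalise_dates(record_date, range_start, range_end):
    """Return (START DATE, END DATE) strings for one record.

    record_date: exact event date, or None when the source has none.
    range_start, range_end: acquisition time range of the source dataset;
    range_start may be None, range_end is assumed to be known.
    """
    exact = _parse(record_date)
    start = exact or _parse(range_start)
    end = exact or _parse(range_end)
    if start is None or start < CUTOFF:
        start = FLOOR
    return start.strftime(DATE_FORMAT), end.strftime(DATE_FORMAT)


print(normalise_dates("2016/11/14", None, None))
print(normalise_dates(None, "2000/01/01", "2019/12/31"))
```

A record with an exact date gets identical START DATE and END DATE values, while a record that falls back on the acquisition range gets two different dates, which is what marks it as temporally uncertain.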
Along with temporal accuracy, spatial accuracy is a critical factor in cataloguing these phenomena, as it determines the
geospatial reliability of each record.
Native datasets often presented difficulties, including poorly formatted coordinates, varying levels of precision,
and inconsistencies in georeferencing methods. To address these issues, a standardized spatial accuracy parameter was
established, allowing the spatial reliability of each record to be represented consistently in meters.
Accuracy was converted when the native data used other scales, while for records with incomplete or ambiguous
location data, expert interpretation was used to estimate the probable accuracy. This process involved
cross-referencing auxiliary information, such as nearby landmarks or descriptive metadata, to determine coordinates
that closely approximated the event's actual location.
This methodology ensured that even imprecise data could be meaningfully integrated, significantly reducing the
proportion of records categorized as no-data in spatial accuracy.
The resulting accuracy distribution ranges from highly precise values (<10 meters) to broader approximations
(>10 kilometers), reflecting the inherent variability in quality and reporting practices of the source datasets.
The reliability attribute introduced in this catalogue, calculated from spatial and temporal
accuracy, therefore reflects the general robustness of each individual record after the standardization processes.
Both catalogues (point and polygonal) show extremely high record reliability (classes 1 and 2); lower reliability
classes appear only in the point catalogue, where together they do not exceed 15% of the records.
Overall, the spatial and temporal standardization process establishes a reliable framework, summarized by the
reliability class parameter, which makes the catalogue suitable for future precise applications such as spatial modelling,
risk assessment, and policy development.
Further analysis highlighted the distribution of the standardized landslide types within the unified catalogue,
together with the variance of each type with respect to the information in the native records.
A major difficulty in creating this large standardized catalogue was condensing heterogeneous data into
consistent information, especially in a field such as geology, where the wide variation in the nomenclature
of different types is often a stumbling block for data exchange.
The observed variance reflects how many different native interpretations contributed to each
standardized landslide type.
This diversity comes from the consolidation of extremely heterogeneous datasets, in which different terminologies,
classification schemes, and levels of granularity were harmonized into standardized categories, while also recovering
records affected by the many typos and data entry errors. Types with greater variance, such as “complex” or “earth flow,”
therefore contain a higher share of interpreted data than types that are natively unambiguous and
consistent, and so easy to interpret, such as “rock fall” or “sinkhole”.
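The harmonization of native labels into standardized types can be pictured as a dictionary lookup applied after light text clean-up. The sketch below is only an illustration: the entries in `TYPE_LOOKUP` are invented examples rather than the contents of the project's lookup tables, and the function names are hypothetical. Counting the distinct native spellings behind each standardized type is one possible way to read the variance discussed above.

```python
from collections import defaultdict

# Hypothetical lookup entries; the project's real tables live in the
# per-dataset lookup table .json files and are not reproduced here.
TYPE_LOOKUP = {
    "rockfall": "rock fall",
    "rock fall": "rock fall",
    "earthflow": "earth flow",
    "earth flow": "earth flow",
    "mudflow": "mud flow",
    "debris flow": "debris flow",
}


def standardize_type(native_value):
    """Map a native type label to a standardized category, or 'ND' if unknown."""
    if not native_value:
        return "ND"
    # Normalise case, hyphens and repeated whitespace before the lookup.
    key = " ".join(native_value.lower().replace("-", " ").split())
    return TYPE_LOOKUP.get(key, "ND")


def native_variants(native_values):
    """Group the distinct native spellings found under each standardized type."""
    variants = defaultdict(set)
    for value in native_values:
        variants[standardize_type(value)].add(value)
    return dict(variants)


labels = ["Rockfall", "rock-fall", "ROCK FALL", "Earthflow", "unknown slide"]
print(native_variants(labels))
```

Labels that match no entry fall back to 'ND', mirroring the undefined category of the catalogue.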
It was also possible to analyse the distribution of the physical factors associated with each catalogued landslide record.
The graph reveals that records with missing information about physical factors are the most common, followed by Triggering factors (T)
and Preparatory factors (P), with no representation of Predisposing factors (PR).
The dominance of common Triggering factors such as rainfall and seismic activity highlights the statistical prevalence of
these phenomena in native catalogues. However, this distribution is also clearly influenced by the uneven geographical
coverage of the data, where landslides tend to occur more frequently in regions where these triggering factors are more
prominent, underscoring the need to address spatial heterogeneity in future data collection to enhance global
representativeness.
The overall distribution of the standardized landslide types in the catalogue shows that the frequency
of each type varies widely.
The graph reveals that the undefined category ('ND') is the largest, showing that native datasets often lack information on
the kinematics of each landslide record.
Among the defined categories, the types 'complex', 'earth slide', 'rock fall' and 'soil creep' are the most prevalent,
while types such as 'lahar' and 'earth spreading' are minimally represented.
The spatial heterogeneity of the dataset is evident, with dense clusters in widely studied regions that are more climatically
and geologically active, such as South Asia and Central America, and under-representation in areas such as Africa and
Russia, due to data gaps that probably stem from mapping difficulties or restrictions on data availability.
This further highlights the impact of data availability and uneven data resolution on the global representation
of landslides.