Comparing Machine Learning algorithms for assessing individual tree mortality: code and data

aitorvv/ML_individual_tree_mortality


📂 Repository DOI: DOI


✨ Highlights

  • 6 different Machine Learning algorithms were compared in predicting individual tree mortality.
  • Effects of dataset size, variable set, thinning, inventory length, and cross-validation were studied.
  • Random Forest reached a higher performance level than the other algorithms in all the proposed case studies except cross-validation.
  • Logistic binomial Regression appears to be the more robust algorithm under cross-validation.

📖 Abstract

Tree mortality is a crucial process in forest dynamics and a key component of forest growth models and simulators. Factors such as competition, drought and pathogens drive tree mortality, but the underlying mechanisms are difficult to model. Current environmental changes further complicate modelling approaches, as they alter all the factors involved in mortality. However, innovative classification algorithms can dig deep into data to find patterns able to model or even explain these relationships. We use Logistic binomial Regression as the reference algorithm for predicting individual tree mortality, but Machine Learning alternatives already applied to other forest modelling topics can also be used for this purpose. Here, we compare the performance of five Machine Learning algorithms with Logistic binomial Regression in individual tree mortality classification: Decision Trees, Random Forest, Naive Bayes, K-Nearest Neighbour, and Support Vector Machine. Case studies involving dataset size, number of variables, thinning degrees and inventory record length were conducted to assess the performance of each algorithm. Additionally, a cross-validation among thinning degrees was performed. Our results demonstrate the potential of Machine Learning in assessing tree mortality, although the limited interpretability of its results can reduce its utility.
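The cross-validation among thinning degrees described in the abstract can be illustrated with a minimal sketch in base R. It is not the repository's code: the data are synthetic, the column names (`dbh`, `competition`, `thinning`, `dead`) and the 0.5 classification threshold are assumptions made only for illustration, and only Logistic binomial Regression is shown.

```r
# Synthetic stand-in data; all column names and coefficients are hypothetical.
set.seed(42)
n <- 600
trees <- data.frame(
  dbh         = runif(n, 5, 50),                        # diameter at breast height (cm)
  competition = runif(n, 0, 1),                         # generic competition index
  thinning    = sample(c("A", "B"), n, replace = TRUE)  # thinning grade
)

# Simulate mortality: small, strongly suppressed trees die more often
p_dead <- plogis(-1 - 0.08 * trees$dbh + 3 * trees$competition)
trees$dead <- rbinom(n, 1, p_dead)

# Train on one thinning grade, test on the other
train <- trees[trees$thinning == "A", ]
test  <- trees[trees$thinning == "B", ]

# Logistic binomial regression fitted on the training grade only
fit <- glm(dead ~ dbh + competition, data = train, family = binomial)

# Predicted mortality probabilities, classified with an assumed 0.5 threshold
prob <- predict(fit, newdata = test, type = "response")
pred <- as.integer(prob > 0.5)

# Confusion matrix and overall accuracy on the unseen thinning grade
confusion <- table(
  observed  = factor(test$dead, levels = 0:1),
  predicted = factor(pred, levels = 0:1)
)
accuracy <- sum(diag(confusion)) / sum(confusion)
print(confusion)
print(accuracy)
```

Swapping the roles of the two grades gives the second direction of the comparison. Because dead trees are usually the minority class, accuracy alone can be misleading, so other metrics derived from the confusion matrix are worth inspecting as well.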


📁 Repository Contents

  • 💾 1_data:

    • ☀️ climate data obtained from WorldClim (not provided here)

    • 🌳 tree and plot data (test data from DEN experimental plots; original data available only under serious request)

      • 0_raw contains the original data (DEN experimental plots) regarding coordinates, tree inventory, plot age and thinning grades
      • tmp_DEN contains the checkpoints saved after running each R code file, so that the evolution of the data can be explored. It works with the test information up to the analysis (section 5)
      • 1_original_df contains the results of the original analysis (full dataset) with all the metrics tested in that study (but without the original tree and plot data). It can be used from section 6 onwards
  • 💻 2_code: R scripts; their purpose, inputs and outputs are summarized below.

⚠️ The code was prepared for the full dataset; code from section 5 to the end was not adapted to the test dataset, but from section 6 onwards it can be used with the checkpoints of the original dataset (1_original_df) ⚠️

| Script name | Purpose | Input | Output |
|---|---|---|---|
| 0_data_curation.r | Manage initial data: structure, IDs, input missing data | 1_raw/final/VF_daten.xlsx; 1_raw/final/Fi-Daten__age.xlsx | 3_final/0_initial_df_clean/initial_df_clean.csv |
| 1.0_neighborhood_main.r; 1.1_neighborhood_functions.r | Calculate the variables needed for the analysis using a subplot of radius 0.33*h around each tree | 1_data/3_final/0_initial_df_clean/initial_df_clean.csv | 1_data/3_final/1_neighborhood/*; trees_r33.csv; subplot_stats_r33.csv; neighborhood_stats_r33.csv |
| 2_climate_data.r | Calculate climate variables by plot location. ⚠️ It requires downloading climate data from WorldClim beforehand; write to the authors for more details ⚠️ | 1_raw/final/Koordinaten.xlsx | 1_data/3_final/2_clima/df_complete_r33.csv |
| 3_feature_visualization.r | Code to make graphs and manually study variable relationships | 1_data/3_final/2_clima/df_complete_r33.csv | None |
| 4.0_split_dataset.r; 4.1_split_variables.r; 4.2_functions_var_combis.r | Code to split datasets (size and thinning) and variables to develop the case studies, excluding the first plot measurement when needed. ⚠️ It doesn't work as desired with the test dataset, as most of the information is missing ⚠️ | 1_data/3_final/2_clima/df_complete_r33.csv | 1_data/3_final/4_datasets/* |
| 5.0_run_analysis.r; 5.1_LR_analysis.r; 5.2_DT_analysis.r; 5.3_RF_analysis.r; 5.4_NB_analysis.r; 5.5_KNN_analysis.r; 5.6_SVM_analysis.r | Code to run all the analyses (except ANN) in R for the different case studies | 1_data/3_final/4_datasets/* | 1_data/3_final/5_analysis/**case_study**/*; metrics.RData; models.RData |
| 6_HPC | Code used to run all the simulations on the iuFOR HPC, split by case study | 1_data/3_final/4_datasets/* | 1_data/3_final/5_analysis/**case_study**/*; metrics.RData; models.RData |
| 7_metrics_compilation.r | Get metrics from the R analysis, calculate ANN metrics, and create a checkpoint | 1_data/3_final/5_analysis/*; **case_study**/metrics.RData; ann/preds/**case_study**/*; ann/timer/**case_study**/* | 1_data/3_final/6_final_results/**case_study**/final_metrics.RData |
| 8.0_performance_graphs.r; 8.1_functions_performance_graphs.r; 8.2_classifiers_comparison.r; 8.3_graph_functions.r; 8.4_application_thinning.r; 8.5_application_thinning_comparison.r; 8.6_time_and_performance_graphs.r | Code to compare the metrics for each analysis and case study using different graphs. The graphs in the original paper were produced with scripts 8.5 and 8.6 | 1_data/3_final/6_final_results/**case_study**/final_metrics.RData | 2_scripts/4_figures/* |
| 9.0_location_map.r; 9.1_neighbour_graphs.r; 9.2_mortality_graphs.r; 9.3_df_mortality_rates; 9.4_paper_tables.r | Code used to produce the location and neighborhood graphs and to gather the information for the tables in the original paper | 1_raw/final/Koordinaten.xlsx; 1_data/3_final/2_clima/df_complete_r33.csv; 1_data/3_final/1_neighborhood/trees_r33.csv | 2_scripts/4_figures/* |
  • 📊 3_figures: graphs and figures used in the article and additional information:

    • final_figures contains the figures used in the original paper
    • tmp_figures contains different graphs used to understand the results of each analysis and metric
  • 📚 4_bibliography: compilation of all the references used in the article
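The neighborhood scripts (group 1) work with a subplot of radius 0.33*h around each tree. The sketch below shows one possible way to find the neighbours inside such a subplot in base R. It is an illustration only, not the code in 1.1_neighborhood_functions.r: the column names (`id`, `x`, `y`, `h`) and the coordinates are hypothetical, and `h` is assumed here to be the height of the subject tree itself.

```r
# Hypothetical tree list: x/y coordinates (m) and tree height h (m)
trees <- data.frame(
  id = 1:5,
  x  = c(0, 2, 5, 9, 20),
  y  = c(0, 0, 0, 0, 0),
  h  = c(30, 12, 15, 24, 18)
)

# Pairwise Euclidean distances between all trees
d <- as.matrix(dist(trees[, c("x", "y")]))

# Subplot radius of each subject tree: 0.33 times its own height (assumption)
radius <- 0.33 * trees$h

# For each subject tree, collect the ids of the other trees inside its subplot
neighbours <- lapply(seq_len(nrow(trees)), function(i) {
  inside <- d[i, ] <= radius[i]
  inside[i] <- FALSE  # a tree is not its own neighbour
  trees$id[inside]
})
names(neighbours) <- trees$id

# Number of neighbours per subject tree
n_neighbours <- lengths(neighbours)
print(n_neighbours)
```

Because each tree uses its own radius, the relation is not symmetric: in this toy data tree 1 includes tree 3 in its subplot, but tree 3 does not include tree 1. A full distance matrix is convenient for small plots but grows quadratically with the number of trees.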


🤔 How to Use

In order to reproduce the analysis, users must:

  • 💾 Data:

    • According to the conditions of the original paper ("Due to the sensitive nature of the data, raw data would remain available only under serious requests."), the data must be requested from the authors
    • The WorldClim data needed to run the simulations must be downloaded from its original website
  • 💻 Prerequisites: R must be installed, together with the libraries loaded in each script (RStudio was also used to develop the code). Some analyses, especially training RF models, require high computational power and can cause out-of-memory errors on a standard computer. Access to high-performance computing services is highly recommended in those cases.

  • 📜 Usage: The Repository Contents section above details the purpose, input and output of each script.
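Before running the scripts from section 6 onwards on the provided checkpoints, it can help to check what an .RData file actually contains. The sketch below is an assumption-laden illustration: it first writes a stand-in checkpoint to a temporary folder so that it runs on its own, and the object name `dummy_metrics` and its columns are hypothetical, not the contents of the repository's files.

```r
# Create a stand-in checkpoint so the example is self-contained;
# object name and columns are hypothetical placeholders.
checkpoint <- file.path(tempdir(), "final_metrics.RData")
dummy_metrics <- data.frame(model = c("LR", "RF"), metric = NA_real_)
save(dummy_metrics, file = checkpoint)

# Load into a separate environment so nothing in the workspace is overwritten
ckpt <- new.env()
loaded <- load(checkpoint, envir = ckpt)

# load() returns the names of the restored objects
print(loaded)

# Inspect the structure of every restored object
for (nm in loaded) {
  str(get(nm, envir = ckpt))
}
```

To inspect a real checkpoint, replace `checkpoint` with the path to the file of interest and skip the `save()` step.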


📚 Additional Information

A flowchart detailing the training and testing process (scripts from groups 5 and 6) is shown here:

[Figure: flowchart of the training and testing process]


ℹ️ License

The content of this repository is under the MIT license.


🔗 About the authors

Aitor Vázquez Veloso: ORCID \ UVa

Astor Toraño Caicoya: ORCID \ TUM

Felipe Bravo Oviedo: ORCID \ UVa

Peter Biber: ORCID \ TUM

Enno Uhl: ORCID \ TUM

Hans Pretzsch: ORCID \ TUM

