From 64610b39d095cd31fd550deffb86cd90f4d02b13 Mon Sep 17 00:00:00 2001 From: ifeoluwaale Date: Thu, 21 Mar 2024 19:25:20 -0700 Subject: [PATCH] Updates to user and developer documentation --- README.md | 46 ++++- dev-instructions.md | 315 ++++++++++++++++++++++++++++- tests/integs/test_integ_autoreg.py | 0 3 files changed, 349 insertions(+), 12 deletions(-) create mode 100644 tests/integs/test_integ_autoreg.py diff --git a/README.md b/README.md index ab3de7b..945e651 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,5 @@

- + # dplPy -the Dendrochronology Program Library in Python The Dendrochronology Program Library (DPL) in Python has its roots in both the [original FORTRAN program](https://www.ltrr.arizona.edu/software.html) created by the [legendary Richard Holmes](https://arizona.aws.openrepository.com/handle/10150/262569?show=full) and the subsequent R Project package by Andy Bunn, [dplR](https://github.com/OpenDendro/dplR). Our aim is to provide researchers working with tree-ring data the necessary tools in open-source environments, promoting open science practices, enhancing rigor and transparency in dendrochronology, and eventually allowing reproducible research entirely in a single programming language. @@ -24,15 +24,17 @@ The Dendrochronology Program Library (DPL) in Python has its roots in both the [ - [Windows](#windows) - [Functionalities and Usage](#functionalities-and-usage) - [Loading data using `readers`](#loading-data-using--readers) + - [Loading data from online sources using `readers_url`](#loading-data-from-online-sources-using-readers_url) - [Data Summary from `summary`](#data-summary-from-summary) - [Data Stastics from `stats`](#data-stastics-from-stats) - [Data Report from `report`](#data-report-from-report) - - [Plotting](#plotting) + - [Plotting raw data with `plot`](#plotting-raw-data-with-plot) - [Detrending using `detrend`](#detrending-using-detrend) - [Autoregressive (AR) modeling](#autoregressive-ar-modeling) - [Build a chronology with `chron`](#build-a-chronology-with-chron) - [Build a variance stabilized chronology with `chron_stabilized`](#build-a-variance-stabilized-chronology-with-chron_stabilized) - [Crossdate with `xdate`](#crossdate-with-xdate) + - [Output data to files using `writers`](#output-data-to-files-using-writers) --- @@ -168,6 +170,18 @@ This will load the package and its functions, allowing them to be accessed with >>> data = dpl.readers("/path/to/file.rwl", header=True) ``` +### Loading data from online sources using `readers_url` 
+**Note: This function is still in development and has only been tested so far with `rwl` raw data files from the [NCEI website](https://www.ncei.noaa.gov/pub/data/paleo/treering/measurements/)** + +- Description: reads `rwl` formatted data directly from online sources. +- Options: + - `header`: rwl input files often have a header present; Default is `False`, use `True` if input has a header. +- Usage examples: + ``` + >>> data = dpl.readers_url("http://link/to/file.rwl") + >>> data = dpl.readers_url("http://link/to/file.rwl", header=True) + ``` + ### Data Summary from `summary` - Description: generates a summary of each series recorded in `rwl` and `csv` format files @@ -198,7 +212,7 @@ This will load the package and its functions, allowing them to be accessed with >>> dpl.report(data) ``` -### Plotting +### Plotting raw data with `plot` - Description: generates plots of tree ring with data from dataframes. Currently capable of generating `line`, `spag` (spaghetti) and `seg` (segment, default) plots. - Options: @@ -309,5 +323,29 @@ This will load the package and its functions, allowing them to be accessed with - `show_flags`: default `True`, determines whether to show flags in the function output to the console. - Usage examples: ``` - + >>> ca533_rwi = dpl.detrend(ca533, plot=False) + + # Crossdating of detrended data with default args + >>> dpl.xdate(ca533_rwi) + + # Crossdating with Pearson correlation and show flags + # (other options set to defaults when not specified). + >>> dpl.xdate(ca533_rwi, corr="Pearson", show_flags=True) + ``` + +### Output data to files using `writers` + +- Description: writes data from a dataframe to supported file types (`csv`, `rwl`, `crn`, `txt`). +- Required parameters: + - `data`: dataframe with ring widths (presumably one read from `readers` or `readers_url`) + - `label`: name (can include file path) to give to the created file. **Should not include the file extension.** + - `format`: extension for file to be created. 
Can be `'csv'`, `'rwl'`, `'crn'` or `'txt'`. + +- Usage examples: ``` + # Write data to file_name.csv in current working directory. + >>> dpl.writers(data, "file_name", "csv") + + # Write data to file_name.csv in ./path/to/ directory. + >>> dpl.writers(data, "./path/to/file_name", "csv") + ``` \ No newline at end of file diff --git a/dev-instructions.md b/dev-instructions.md index e158ba9..ead5120 100644 --- a/dev-instructions.md +++ b/dev-instructions.md @@ -2,8 +2,13 @@ Welcome to the dplPy developer manual. We welcome all code contributions, bug reports, bug fixes, documentation improvements, and suggestions. -## Environment setup +## Index +- [Environment setup](#environment-setup) +- [Making changes and submitting PRs](#making-changes-and-submitting-a-pull-request) +- [API Reference](#api-reference) + +## Environment setup ### 1. GitHub setup #### 1.1 Create dplPy fork in github @@ -39,7 +44,7 @@ git checkout -b {feature_name} ### 2. Conda environment -The packages required to run dplPy are all specified in environment.yml. +The packages required to run dplPy are all specified in `environment.yml`, which can be used to install them in Conda ([Anaconda](https://docs.anaconda.com/anaconda/install/index.html) or [Miniconda](https://docs.conda.io/projects/continuumio-conda/en/latest/user-guide/install/index.html)) or [Mamba](https://mamba.readthedocs.io/en/latest/installation.html) environments. #### 2.1\. Create your environment with the required packages installed. @@ -58,7 +63,7 @@ $ mamba env create -f environment.yml If prompted for permission to install requred packages, select y. #### 2.2\. Activate your environment. -You will need to have the conda environment activated anytime you want to test code from the package. +You will need to have the conda environment activated anytime you want to run or test code from the package. ``` conda activate dplpy @@ -118,10 +123,7 @@ Go to the testing tab (on the left side of the VSCode display). 
With your enviro If `.vscode/settings.json` has not been created, create it and add the lines shown above. -Go back to the testing tab and verify that the dplpy unit tests are showing. They should look like this: - -![Screenshot of tests tab in VSCode](image.png) - +Go back to the testing tab and verify that the dplpy unit tests are showing. Run the tests by clicking the play button to the right of `tests`. @@ -179,4 +181,301 @@ Pull requests allow you to view a side-by-side diff comparison of all changed fi If you are satisfied with your changes, give the PR a descriptive title, and specify in the description what changes were made and what (if any) issues were addressed. Then, submit the pull request. -Your request will be reviewed by the repository maintainers. \ No newline at end of file +Your request will be reviewed by the repository maintainers. + +## API Reference + +Here is a list of functions (in alphabetical order) with descriptions: + +| Function | Description | +| --- | --- | +| [`ar_func`](#ar_funcdata-max_lag5-source) | Fits series or dataframe to autoregressive (AR) models and performs other operations on data with best model fit. | +| [`autoreg`](#autoregdata-max_lag5-source) | Fits series to autoregressive (AR) models and returns parameters of best model fit. | +| [`chron`](#chronrwi_data-biweighttrue-prewhitenfalse-plottrue-source) | Creates a mean value chronology for a dataset, typically the ring width indices of a detrended series. | +| [`detrend`](#detrend) | Detrends a given series or data frame, first by fitting data to curve(s), with spline(s) as the default, and then by calculating residuals or differences compared to the original data. | +| [`help`](#help) | Displays help (alpha). | +| [`plot`](#plot) | Generates line, spaghetti or segment plots. | +| [`rbar`](#rbar) | Finds the best interval of overlapping series over a period of years, and calculates the rbar constant for a dataset over that period of overlap. 
| +| [`readers`](#readers) | Reads data from supported file types (*.CSV and *.RWL) and stores them in dataframe. | +| [`readme`](#readme) | Goes to this website. | +| [`report`](#report) | Generates a report about absent rings in the data set. | +| [`series_corr`](#series_corr) | Crossdating function that focuses on the comparison of one series to the master chronology. | +| [`stats`](#stats) | Generates summary statistics for RWL and CSV format files. | +| [`summary`](#summary) | Generates a summary for RWL and CSV format files. | +| [`xdate`](#xdate) | Crossdating function for dplPy loaded datasets. | + + +### `ar_func(data, max_lag=5)` [[source]](https://github.com/OpenDendro/dplPy/blob/480973dc5f09f748271fb62a5ebd8ff5c88ac2dd/dplpy/autoreg.py#L36) + +Fits a given data to an the best-fit autoregressive model, then returns the residuals of AR fit relative to the original data + the mean of the original data. +- **Required Parameters**: + - **data** : ***pandas.DataFrame or pandas.Series***, a pandas dataframe imported from dpl.readers() or a series extracted from such a dataframe. +- **Optional Parameters**: + - **lag : _int_ default 5**, max lag to consider when selecting the best-fit AR model. +- **Returns:** + - **pandas.DataFrame or pandas.Series**, dataframe or series of AR-modeled data, depending on which was given as input. +- **Usage Examples:** + ``` + # ar_func with series + >>> dpl.ar_func(ca533["CAM191"], 10) + Year + 1190 0.711307 + 1191 -0.232047 + 1192 0.521210 + 1193 0.575975 + 1194 0.901084 + ... + 1966 0.296554 + 1967 0.384609 + 1968 0.397742 + 1969 0.427618 + 1970 0.383847 + Name: CAM191, Length: 781, dtype: float64 + + # ar_func with dataframe + >>> dpl.ar_func(ca533, 10) + CAM011 CAM021 CAM031 CAM032 CAM041 CAM042 CAM051 CAM061 CAM062 ... CAM152 CAM161 CAM162 CAM171 CAM172 CAM181 CAM191 CAM201 CAM211 + Year ... + 626 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN + 627 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 
NaN NaN NaN NaN NaN NaN NaN NaN NaN + 628 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN + 629 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN + 630 NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN + ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... + 1979 0.424423 0.404787 0.142900 0.378733 0.640022 0.369773 0.369770 0.347996 0.535881 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN + 1980 0.486215 0.614051 0.658424 0.408298 0.898555 0.568861 0.440974 0.693782 0.661847 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN + 1981 0.498586 0.505126 0.436690 0.260786 0.419491 0.438934 0.345517 0.544592 0.382856 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN + 1982 0.455773 0.414212 0.485516 0.448526 0.792929 0.443559 0.261443 0.560291 0.510274 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN + 1983 0.666578 0.520679 0.223995 0.277267 0.755711 0.456165 0.252873 0.583766 0.320921 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN + + [1358 rows x 34 columns] + ``` + + +### `autoreg(data, max_lag=5)` [[source]](https://github.com/OpenDendro/dplPy/blob/480973dc5f09f748271fb62a5ebd8ff5c88ac2dd/dplpy/autoreg.py#L103) + +Selects the best AR model with a specified maximum order for the given data, and returns the parameters for the model. The best model is selected based on the AIC value. +- **Required Parameters**: + - **data** : ***pandas.Series***, a pandas series extracted from a pandas dataframe containing tree ring widths (presumably imported from `readers`). +- **Optional Parameters**: + - **max_lag : _int_, default 5**, max lag to consider when selecting the best-fit AR model. +- **Returns:** + - **array** containing the parameters of the best-fit AR model, in order. 
+- **Usage Examples:** + ``` + >>> dpl.autoreg(ca533["CAM191"], 10) + const 0.022210 + CAM191.L1 0.503373 + CAM191.L2 0.087230 + CAM191.L3 0.143716 + CAM191.L4 0.020119 + CAM191.L5 -0.027769 + CAM191.L6 -0.010029 + CAM191.L7 0.001373 + CAM191.L8 0.025588 + CAM191.L9 0.042340 + CAM191.L10 0.136916 + dtype: float64 + ``` + +### `chron(rwi_data, biweight=True, prewhiten=False, plot=True)` [[source]](https://github.com/OpenDendro/dplPy/blob/480973dc5f09f748271fb62a5ebd8ff5c88ac2dd/dplpy/chron.py#L44) + +Creates a mean value chronology for a dataset, typically the ring width indices of **a detrended series**. + +- **Required Parameters**: + - **rwi_data** : ***pandas.DataFrame***, a pandas dataframe containing tree ring widths (expected to be already detrended). +- **Optional Parameters**: + - **biweight : _bool_, default True**; when `True`, means will be calculated using Tukey's biweight robust mean. + - **prewhiten : _bool_, default False**; when `True`, data is prewhitened by fitting to an AR model. + - **plot : _bool_, default True**; when `True`, results are plotted. +- **Returns:** + - **pandas.DataFrame** of years, mean RWIs and sample depths for each year. +- **Usage Examples:** + ``` + >>> dpl.chron(ca533, plot=False) + Mean RWI Sample depth + Year + 626 0.170000 1 + 627 0.130000 1 + 628 0.140000 1 + 629 0.190000 1 + 630 0.220000 1 + ... ... ... + 1979 0.510581 21 + 1980 0.722784 21 + 1981 0.568495 21 + 1982 0.674211 21 + 1983 0.638166 21 + + [1358 rows x 2 columns] + ``` + + + + + + +### Loading data using `readers` + +- Description: reads data from supported file types (`csv` and `rwl`) and stores them in a dataframe. +- Options: + - `header`: rwl input files often have a header present; default is `False`, use `True` if the input has a header. 
+- Usage examples: + ``` + >>> data = dpl.readers("/path/to/file.csv") + # or + >>> data = dpl.readers("/path/to/file.rwl", header=True) + ``` + +### Loading data from online sources using `readers_url` +**Note: This function is still in development and has only been tested so far with `rwl` raw data files from the [NCEI website](https://www.ncei.noaa.gov/pub/data/paleo/treering/measurements/)** + +- Description: reads `rwl` formatted data directly from online sources. +- Options: + - `header`: rwl input files often have a header present; default is `False`, use `True` if the input has a header. +- Usage examples: + ``` + >>> data = dpl.readers_url("http://link/to/file.rwl") + >>> data = dpl.readers_url("http://link/to/file.rwl", header=True) + ``` + +### Data Summary from `summary` + +- Description: generates a summary of each series recorded in `rwl` and `csv` format files +- Usage examples: + ``` + >>> dpl.summary("/path/to/file.rwl") + # or + >>> dpl.summary(data) + ``` + +### Data Statistics from `stats` + +- Description: generates summary statistics for `rwl` and `csv` format files +- Usage Example: + ``` + >>> dpl.stats("/path/to/file.rwl") + # or + >>> dpl.stats(data) + ``` + +### Data Report from `report` + +- Description: generates a report about ring measurements and absent rings in the data set +- Usage Example: + ``` + >>> dpl.report("/path/to/file.rwl") + # or + >>> dpl.report(data) + ``` + +### Plotting raw data with `plot` + +- Description: generates plots of tree-ring data from dataframes. Currently capable of generating `line`, `spag` (spaghetti) and `seg` (segment, default) plots. +- Options: + - `type="line"`: creates a line plot (default) + - `type="spag"`: creates a spaghetti plot + - `type="seg"`: creates a segment plot +- Usage Example: + ``` + >>> dpl.plot("/path/to/file.rwl") + # or + >>> dpl.plot(data) + + # The user can select specific series of interest. 
+ # In the example below, the user selects SERIES_1, SERIES_2, SERIES_3 + # from the "data" dataset and generates a spaghetti plot + >>> dpl.plot(data[["SERIES_1", "SERIES_2", "SERIES_3"]], type="spag") + ``` + +### Detrending using `detrend` + +- Description: Detrends a given series or data frame, first by fitting data to curve(s), and then by calculating residuals or differences compared to the original data. +- Options: + - `fit="spline"`: default detrending method. + - `fit="ModNegEx"`: detrending using the modified negative exponential method. + - `fit="Hugershoff"`: detrending using the Hugershoff method. + - `fit="linear"`: detrending using the linear method. + - `fit="horizontal"`: detrending using the horizontal method. + - `method="residual"`: calculates residuals vs original data (default). + - `method="difference"`: calculates differences vs original data. + - `plot=True|False`: whether or not to plot results, default is `True`. +- Usage Example: + ``` + # detrend with default options + >>> dpl.detrend(data) + + # specify fit to Hugershoff curve and detrend with difference + >>> dpl.detrend(data, fit="Hugershoff", method="difference") + + # detrend only SERIES_1, SERIES_2 and SERIES_3 + >>> dpl.detrend(data[["SERIES_1", "SERIES_2", "SERIES_3"]], fit="Hugershoff", method="difference") + ``` + + + + + + +### Build a variance stabilized chronology with `chron_stabilized` + +- Description: Builds a variance stabilized mean-value chronology for a dataset of **detrended** ring width indices, by multiplying the chronology by the square root of the effective independent sample size, $N_{eff}$. + + Note: where n(t) is the number of series at time t, and rbar(t) is the running interseries correlation, + + $$ N_{eff} = \frac{n(t)}{1 + (n(t) - 1)\,rbar(t)} $$ + +- Options: + - `win_length`: an integer for specifying the window lengths where interseries correlations will be calculated (default `50`). 
It should not be greater than the number of years in the dataset; a value between 30% and 50% of the number of years is recommended. + - `min_seg_ratio`: the minimum ratio of non-NA values to the window length for a series to be considered in an Neff calculation (default `0.33`). + - `biweight`: boolean indicating whether or not to use Tukey's bi-weight robust mean when calculating the mean-value chronology; default `True`. + - `running_rbar`: boolean indicating whether or not to return the running interseries correlations as part of chronology output; default `False`. +- Usage Example: + ``` + # Detrend data first! + >>> rwi_data = dpl.detrend(data) + + # Perform chronology with default args + >>> dpl.chron_stabilized(rwi_data) + + # Specify win_length, min_seg_ratio and running_rbar + >>> dpl.chron_stabilized(rwi_data, win_length=60, min_seg_ratio=0.5, running_rbar=True) + ``` + +### Crossdate with `xdate` +- Description: This function calculates correlation serially between each tree-ring series and a master chronology built from all the other series in the dataset (leave-one-out principle). +- Options: + - `prewhiten`: default `True`, determines whether or not to prewhiten series using AR modeling. + - `corr`: default `'Spearman'`, the type of correlation to use. Can be `'Pearson'` or `'Spearman'`. + - `slide_period`: default `50`, the number of years to compare to the master chronology at a time. + - `bin_floor`: default `100`, determines the minimum bin year. The minimum bin year is calculated as $\lceil min\_yr / bin\_floor \rceil \times bin\_floor$ where `min_yr` is the first year in the dataset. + - `p_val`: default `0.05`, determines the critical value below which interseries correlations are flagged. + - `show_flags`: default `True`, determines whether to show flags in the function output to the console. 
+- Usage examples: + ``` + >>> ca533_rwi = dpl.detrend(ca533, plot=False) + + # Crossdating of detrended data with default args + >>> dpl.xdate(ca533_rwi) + + # Crossdating with Pearson correlation and show flags + # (other options set to defaults when not specified). + >>> dpl.xdate(ca533_rwi, corr="Pearson", show_flags=True) + ``` + +### Output data to files using `writers` + +- Description: writes data from a dataframe to supported file types (`csv`, `rwl`, `crn`, `txt`). +- Required parameters: + - `data`: dataframe with ring widths (presumably one read from `readers` or `readers_url`) + - `label`: name (can include file path) to give to the created file. **Should not include the file extension.** + - `format`: extension for file to be created. Can be `'csv'`, `'rwl'`, `'crn'` or `'txt'`. + +- Usage examples: + ``` + # Write data to file_name.csv in current working directory. + >>> dpl.writers(data, "file_name", "csv") + + # Write data to file_name.csv in ./path/to/ directory. + >>> dpl.writers(data, "./path/to/file_name", "csv") + ``` \ No newline at end of file diff --git a/tests/integs/test_integ_autoreg.py b/tests/integs/test_integ_autoreg.py new file mode 100644 index 0000000..e69de29
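
Note on the documented formulas: the effective sample size described for `chron_stabilized` and the minimum bin year described for `xdate` can be checked by hand with a small sketch like the one below. The helper names `effective_sample_size` and `min_bin_year` are hypothetical and exist only for illustration; they are not part of dplPy, and the sketch assumes scalar inputs for a single year rather than whole series.

```python
import math


def effective_sample_size(n_series: int, rbar: float) -> float:
    """Hypothetical helper: Neff = n(t) / (1 + (n(t) - 1) * rbar(t)) for one year t."""
    return n_series / (1 + (n_series - 1) * rbar)


def min_bin_year(min_yr: int, bin_floor: int = 100) -> int:
    """Hypothetical helper: round the first year up to the next multiple of bin_floor."""
    return math.ceil(min_yr / bin_floor) * bin_floor


if __name__ == "__main__":
    # With a single series the effective sample size is 1, whatever rbar is.
    print(effective_sample_size(1, 0.5))
    # 21 series with a running interseries correlation of 0.4.
    print(effective_sample_size(21, 0.4))
    # A dataset starting in year 626 with the default bin_floor of 100.
    print(min_bin_year(626))  # 700
```

A higher interseries correlation shrinks the effective sample size toward 1, which is why the stabilized chronology depends on the running rbar as well as on the sample depth.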