\documentclass[letterpaper, parskip=half]{scrartcl}
% [Frew] look for comments that begin like this one...
% [Frew] general comment: the red boxes around the hyperlinks are REALLY distracting. Can you change them to something subtler, like blue text?
\input{context/preamble.tex}
\input{context/acronyms.tex}
\usepackage[
backend=biber,
bibstyle=reading,
citestyle=authoryear,
abstract=false,
file=false,
entryhead=full,
entrykey=false,
annotation=true,
library=true,
loadfiles=true,
natbib=true,
hyperref=true,
backref=true,
useauthor=true
]{biblatex}
\renewbibmacro*{entryhead:full}{%
\printnames[biblabel]{labelname},
\printfield{year}
}
\usepackage{tipa}
\addbibresource{context/snow.bib}
\addbibresource{context/phd.bib}
\addbibresource{context/scidb.bib}
\addbibresource{context/HTM.bib}
\addbibresource{context/data_citation.bib}
\addbibresource{context/CS274.bib}
\addbibresource{context/CS270.bib}
\title{PhD dissertation proposal}
\subtitle{Towards the twilight of file-centricity}
\author{Niklas Griessbaum}
\date{\today}
\begin{document}
\maketitle
\newpage
\tableofcontents
\newpage
\printglossaries
\newpage
\section{Motivation}
Environmental informatics is the application of information technology to environmental sciences \citep{Frew2012}.
As such, it addresses the information infrastructure that environmental scientists leverage
to obtain knowledge from environmental data.
The appearance of cloud computing, characterized by scalability on one side,
and by abstraction and stereotyping of interactions through services and \glspl{API} on the other side \citep{Foster2017},
provides for opportunities in environmental informatics.
It opens up questions about what the workflow of environmental scientists in the 21st century should look like to fully exploit these opportunities.
My thesis is that the inherent heterogeneity of the flow from data to knowledge in the environmental sciences causes bottlenecks that can be unblocked through infrastructures leveraging cloud computing.
In my dissertation, I want to address the ``twilight of file-centricity''\footnote{German: Dateidämmerung \textipa{[\textsubring{d}a'ta\textsubcircum{i}'dEm@\;RUN]}} and technologies required to transition from file-centricity to data-centricity.
Files package (i.e. chunk or aggregate) data into logical units and provide intelligible identities for their contents: their filenames\footnote{For example, the \gls{MODIS} filename conventions were exploited as data identities by MODster \citep{Frew2005, Frew2002} to manage distributed data}.
Files further conceal the structure of the data they hold, which allows repositories to preserve and distribute data structure-agnostically.
While this generalizes and simplifies the task for the repositories, it pushes the responsibility of acknowledging the data's structure to the users, who have to \gls{ETL} data prior to extracting knowledge \citep{Rilee2016, Szalay2009}.
The problem in this approach materializes in two bottlenecks: data movement and data alignment.
\paragraph{Data Movement:}
A system that is agnostic of the structure of the data it is holding is incapable of performing computations on the data.
It can merely act as a point of preservation and distribution. Data therefore has to be moved (more accurately: copied) to the point of computation (Gray's 3rd Law, \cite{Szalay2009}).
Data movement, however, is undesirable since it results in uncoordinated and unstructured duplication of data and therefore in storage waste.
Further, we are facing an increasing disparity between network speeds and compute power\footnote{User bandwidth speeds have been following Nielsen's Law \citep{Nielsen1998} and grew annually by \SI{50}{\percent} over the last 36 years while compute power has been following Moore's law \citep{Moore1975} and grew annually by \SI{60}{\percent} for the last 40 years. Source:\url{https://ourworldindata.org/grapher/transistors-per-microprocessor}}, making it less and less attractive and ultimately infeasible to move data to the point of computation \citep{Hey2009} as data volumes grow. While researchers may choose to copy gigabytes worth of data for their analysis, copying petabytes will not be an option in the near future \citep{Szalay2006}.
The pre-defined package size of files makes matters worse: If the package size does not exactly equal the area of interest for an analysis\footnote{A file might for example contain bands, areas, or time periods not needed for a given analysis.}, a transfer overhead is incurred \citep{Gray2002}.
\paragraph{Data Alignment:}
Files allow repositories to ignore data structures; therefore, repositories cannot be responsible for data alignment.
It becomes the task of the data user to align data during the \gls{ETL} process if various datasets are to be integrated.
Data alignment provides a common method to express (e.g., spatiotemporal) coincidence throughout all datasets.
It can be achieved through (re-)projection, aggregation and gridding, and/or indexing.
Because methods for alignment are not provided from a central point (i.e. the repository), every user has to manually align data, resulting in redundant effort.
In order to break the stated bottlenecks, the file-centric approach has to be replaced by a system that:
\begin{itemize}
\item Is capable of providing identity to data independently of packaging.
\item Does not require manual data alignment.
\item Voids the necessity to transfer data by being aware of the data structure and therefore being capable of performing computations at the place of storage.
\end{itemize}
\newpage
\section{Work Plan}
I want to address the following challenges to the twilight of file-centricity.
\paragraph{Data Identity and Data Citations:}
By abandoning files as the package of data, a natural identity of data is lost. Identity, however, is needed to refer to and reason about data.
Regardless of previous subsetting and processing, data needs to be identifiable and citable. In my dissertation, I want to develop services providing identity and simplifying the creation of citations to data stored in, and served through, online repositories.
\paragraph{Data co-location and co-alignment:}
To avoid movement, data has to be readily co-located at the place of computation. To further make \gls{ETL} processes unnecessary, the co-located, and potentially heterogeneous, data also has to be co-aligned. That is, a common concept for addressing spatiotemporal coincidence needs to be established throughout all the data. In practice, this means storing data in a database providing indexes, contexts, and schemas.
In my dissertation, I want to develop a scalable \gls{GIS} that leverages global indexes and array databases to enable \gls{ETL}-free workflow.
\paragraph{Use Case (Multi-sensor snow mapping):}
One of the five laws postulated by Jim Gray is that a database designed for a given discipline has to be capable of answering the 20 key questions a scientist may have \citep{Hey2009, Szalay2009}.
Following this spirit, I want to evaluate the previously proposed system with two remote sensing use cases: time series of night lights, and multi-sensor snow mapping.
\newpage
\subsection{Data Identity and Data Citations}
Citations help to make data \acrfull{FAIR} \citep{Wilkinson2016}. In more abstract terms, data citations provide identity to data, which allows referencing and de-referencing.
Provision of identity is a key challenge in the twilight of file-centric workflows, since a natural addressable identity of data is lost as soon as files as a package of data are abandoned.
Data citations differ from citations of printed material in that the cited content (i.e. the data) may evolve over time and in that meta-information such as authorship or the provenance may vary within a continuous dataset \citep{Buneman2016}. Further, since generally speaking infinite ways to subset data are possible, data citations cannot be statically generated. They have to be machine-actionable, both in terms of dynamic creation (as a function of time and subsetting parameters) and in terms of resolving data citations back to the cited material.
% [Frew] WMS just retrieves pictures, not data
Technologies such as the \gls{WCS} and the \gls{OPeNDAP} \citep{Gallagher2005} play a key role in the twilight of file-centricity. Their ability to seamlessly provide access to data rather than to files provides an ideal starting point for theoretical and practical excursions on how to address identity and citations in a data-centric workflow.
With the development of the web service
\gls{OCCUR}\footnote{\url{https://github.com/NiklasPhabian/OCCUR},
\url{http://occur.duckdns.org}}, I am exploring an approach for assigning identity and citations to dynamic data.
\Gls{OCCUR} is a web service that allows users to assign and store identities for data retrieved from an \gls{OPeNDAP} query.
OCCUR creates identifiers for these identities which can later be resolved through \gls{OCCUR}; upon resolution, \gls{OCCUR} verifies that the data has not changed since the identity was assigned. OCCUR further brokers identities by ensuring that identical data shares the same identity.
\gls{OCCUR} additionally allows users to generate formatted citation snippets for both OCCUR identities and for any \gls{OPeNDAP} query. It connects \gls{OPeNDAP} to \url{www.crosscite.org}, which allows the creation of citation snippets from \glspl{DOI}.
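The core mechanism can be illustrated with a short sketch: an identity is derived by hashing the bytes returned by an \gls{OPeNDAP} query, so the same query yielding the same data always maps to the same identity, and a changed result is detected by a changed digest. The sketch below is illustrative only (the endpoint is a public \gls{OPeNDAP} test server and the choice of SHA-256 is an assumption), not \gls{OCCUR}'s actual implementation.
\begin{verbatim}
import hashlib
import requests

def opendap_identity(query_url):
    """Derive a content-based identity for an OPeNDAP query result:
    the SHA-256 digest of the returned bytes."""
    response = requests.get(query_url, timeout=60)
    response.raise_for_status()
    return hashlib.sha256(response.content).hexdigest()

# Illustrative OPeNDAP subsetting query against a public test dataset.
url = ("http://test.opendap.org/dap/data/nc/coads_climatology.nc.ascii"
       "?SST[0:1:0][0:1:10][0:1:10]")
print(opendap_identity(url))
\end{verbatim}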
The development of \gls{OCCUR} was supported by the \gls{ESIP} federation as a 2018 \gls{ESIP} lab project. The work on \gls{OCCUR} has been presented at the 2017 Bren PhD Symposium and the 2018 \gls{ESIP} summer meeting.
I plan to submit a paper on the findings of the development of OCCUR in 2019 to the CODATA Data Science Journal.
%
\newpage
\subsection{Data co-location and co-alignment (EarthDB 2.0)}
To avoid moving data prior to knowledge generation, all datasets required for a given analysis have to be co-located at the place of computation. The co-located datasets further need to be co-aligned to allow interoperability \citep{Kuo2017, Rilee2016} and make \gls{ETL} processes unnecessary. In practical terms, this means storing the data in a database.
Co-alignment means that a common concept to address coincidence throughout all datasets exists.
It can be achieved through a common indexing schema. A side-effect of co-alignment is that it can be exploited to improve physical data placement: The index can be used to store coinciding data in physical proximity. This will, for a lot of use-cases, allow processing data in parallel in a shared-nothing architecture \citep{Kuo2017}.
The prevalent dimensions for which data has to be aligned in environmental science are time and space.
The \gls{STARE}\footnote{STARE is a global indexing schema that is built upon the quadtree \gls{HTM}. A review of quadtrees in geospatial indexing is provided in section \ref{lit_index}.} \citep{Kuo2017} provides a common concept to address spatiotemporal coincidence for environmental data.
I am developing EarthDB 2.0, a database-based \gls{GIS} capable of handling diverse data.
EarthDB 2.0 is implemented in SciDB\footnote{\url{https://www.paradigm4.com/}} and therefore anticipated to scale both vertically and horizontally.
At first glance, EarthDB 2.0 tries to achieve similar functionality to Google Earth Engine \citep{Gorelick2017}: both are scalable \gls{GIS} systems that shield users from \gls{ETL} and \gls{HPC}. However, while Earth Engine is a tile database and thus bound to gridded data, EarthDB 2.0 uses \gls{STARE} to index individual pixels and therefore does not require data to be located on a regular grid.
I have presented the findings of an early stage of EarthDB 2.0 at the 2017 \gls{AGU} Fall meeting. I want to continue the development of EarthDB 2.0 to create a system capable of solving common problems in environmental science. In particular:
\paragraph{Schema recommendations}
I want to identify appropriate SciDB array schemas for heterogeneous STARE-indexed remote-sensing data products as well as other commonly used geometry information and metadata.
%Of crucial importance is hereby the distinguishing between STARE as an index and STARE as a representation.
%\footnote{Pixels can be indexed through tessellation or through indexing of the centroids. However, how can we quantify the loss of accuracy for the different methods, and how can we quantify the storage and compute overhead for pixels indexed through tessellation?}
\paragraph{Data Loading}
Loading data into SciDB is considered its Achilles' heel. Several tools for loading remote sensing data into SciDB have been developed\footnote{scidb4gdal, modbase, scidb-hdf5, sls, modis2scidb, and arraybridge, to name a few}.
I will continue to work on a graceful method to import and STARE-index commonly used remote sensing products in SciDB. A potential speedup exists with SciDB's \textit{accelerated-io} plugin, which allows data sharding prior to loading.
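As a minimal sketch of what such a load might look like from Python (assuming a local SciDB installation with the \textit{accelerated-io} plugin and the \textit{scidb-py} client; the file path, attribute count, and array names are placeholders):
\begin{verbatim}
from scidbpy import connect

# Assumes a SciDB "shim" endpoint on localhost.
db = connect('http://localhost:8080')

# Bulk-load a pre-flattened TSV of (stare, time, band, value) tuples with the
# accelerated-io plugin's aio_input operator; redimensioning the raw load
# array into a STARE-indexed schema would follow as a second step.
db.iquery(
    "store(aio_input('/data/mod09_flat.tsv', 'num_attributes=4'), mod09_raw)"
)
\end{verbatim}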
%scidb4gdal\footnote{\url{https://github.com/appelmar/scidb4gdal}},
%modbase\footnote{\url{https://forum.paradigm4.com/t/modbase-using-scidb-for-modis-geospatial-data/188}},
%scidb-hdf5 \footnote{\url{https://github.com/wangd/SciDB-HDF5}}
%sls\footnote{\url{http://dbs.snu.ac.kr/papers/scalable17.pdf}}
%modis2scidb\footnote{\url{https://github.com/albhasan/modis2scidb}}
%(a e-sensing/SciETL\footnote{\url{http://esensing.org/}} undertaking), and
%arraybridge\footnote{\url{https://code.osu.edu/arraybridge/scidb}, \url{https://github.com/hbsnmyj/arraybridge}} \citep{Xing2017}.
\paragraph{Spatiotemporal serverside functions}
I want to extend SciDB to use STARE to perform spatiotemporal operations, such as spatial joins, aggregation, and subsetting.
\paragraph{STARE in Python (pystare)}
Though the existence of STARE may be completely opaque to a user of EarthDB 2.0, exposing STARE functionality to a high-level programming language will be required for educational and debugging purposes. I therefore want to enable common Python geometry representations such as \textit{shapely} and \textit{geopandas} to be convertible to and from STARE representations.
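A sketch of the intended interface follows; the function names reflect what such bindings might expose and are assumptions, not an existing \textit{pystare} \gls{API}.
\begin{verbatim}
import numpy as np
from shapely.geometry import Polygon
import pystare  # proposed bindings; the calls below are assumed names

# A small polygon in lon/lat degrees.
poly = Polygon([(-119.6, 37.7), (-119.0, 37.7),
                (-119.0, 38.2), (-119.6, 38.2)])
lon, lat = np.asarray(poly.exterior.coords).T

# Assumed API: STARE indices for point locations, and a trixel cover of the
# polygon's hull, both at quadtree level 10.
points = pystare.from_latlon(lat, lon, 10)
cover = pystare.cover_from_hull(lat, lon, 10)
\end{verbatim}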
\paragraph{Proxy objects}
Commonly, a user interacts with SciDB through the cumbersome \textit{iquery} shell in \gls{AFL} and \gls{AQL}.
Apart from that, the Python library \textit{scidb-py} exposes SciDB arrays as Python proxy objects.
I want to extend these proxy objects to allow invocation of common spatiotemporal operations\footnote{Similarly to the anticipated capabilities of STARS (\url{https://github.com/r-spatial/stars}) and openEO (\url{http://openeo.org/})}.
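The sketch below illustrates the intended shape of such a proxy. The \texttt{stare\_in} function it emits is one of the hypothetical server-side \gls{STARE} operators proposed above, not existing SciDB functionality.
\begin{verbatim}
class StareArray:
    """Hypothetical proxy wrapping a named SciDB array with spatiotemporal
    verbs. Only the query construction is sketched; execution would go
    through scidb-py's iquery()."""

    def __init__(self, name):
        self.name = name

    def subset_afl(self, stare_cover, t_start, t_end):
        # Build a server-side filter over a STARE cover and a time range.
        # stare_in() is a hypothetical server-side STARE function.
        ids = ','.join(str(i) for i in stare_cover)
        return (f"filter({self.name}, stare_in(stare, '{ids}') "
                f"and t >= {t_start} and t <= {t_end})")

mod09 = StareArray('mod09')
print(mod09.subset_afl([123456, 123457], 0, 10))
\end{verbatim}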
\paragraph{Additional Server-Side Functions}
% [Frew] You can do ND[SV]I with the functions that are already in SciDB; maybe want some more complex example(s) here.
For more complex analysis, users may want to add additional server-side functions (e.g., to calculate \gls{NDVI} or \gls{NDSI}). I want to describe a pathway for users to add such custom server-side functions, while defining their limitations.
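For instance, \gls{NDSI} itself can already be expressed with SciDB's built-in \texttt{apply} operator; custom plugins only become necessary for operations beyond such per-pixel arithmetic. A sketch (the array and attribute names are placeholders for a loaded MOD09 array):
\begin{verbatim}
from scidbpy import connect

db = connect('http://localhost:8080')

# NDSI = (green - SWIR) / (green + SWIR); for MODIS these are bands 4 and 6.
db.iquery(
    "store(apply(mod09, ndsi, (band4 - band6) / (band4 + band6)), mod09_ndsi)"
)
\end{verbatim}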
\paragraph{Database setup}
Due to its small userbase, no common best practices for setting up a SciDB cluster exist; let alone for its use in environmental sciences. I therefore want at the very least to describe the lessons learned from chunking/sharding strategies, RAID layouts, SciDB instance configuration, etc.
%\gls{STARE}-SciDB in queue-driven \gls{HPC} environments\footnote{Traditionally, SciDB is run on dedicated shared-nothing hardware. However, in \gls{HPC} environments such as UCSB's Knot cluster, hardware are shared among a wide community of users through queue managers that do not allow for resources to be reserved by a DBMS. I want to explore possibilities to execute (\gls{STARE})-SciDB queries inside such an environment.}.
\newpage
\subsection{Science use case}
In order to verify the usability of the previously proposed EarthDB 2.0 in environmental science, I am going to use EarthDB 2.0 to solve environmental science domain use cases. Besides solving a domain problem itself, the work on these use cases will enable me to exhibit typical interactions between environmental scientists and EarthDB 2.0.
I intend to carry out two distinct use cases: the first to verify general and simple spatiotemporal functionality, and the second to demonstrate that complex environmental use cases can be addressed with EarthDB 2.0.
\subsubsection{Night lights}
The \gls{VIIRS} onboard the Suomi \gls{NPP} has a \gls{DNB} that is sensitive to visible and near-infrared wavelengths, which enables it to observe nighttime lights on Earth at a significantly higher spatial and temporal resolution than its predecessors such as the \gls{DMSP}-\gls{OLS}.
Prof. Mark Buntaine proposed to use timeseries of averaged nighttime light intensity during/after natural disasters over a given administrative area as a predictor of the area's resilience to disasters.
As the \gls{VIIRS} \gls{DNB} product is only available as swath data (and hence the pixels are located on an irregular grid), a conventional \gls{GIS}, such as PostGIS, would require billions of point-in-polygon tests in order to associate individual DNB pixels with an administrative area. Even with R-tree indexing of the administrative areas, an on-the-fly workflow is impossible even for a relatively small spatial extent (e.g., Puerto Rico).
In order to demonstrate EarthDB 2.0's capability, I want to import \textit{all} \gls{VIIRS} \gls{DNB} data for the continent of Africa and enable users to extract timeseries of averaged nighttime light intensities for arbitrary polygons (e.g., each level 2 administrative area) on-the-fly.
I am intending to demonstrate my first findings of this approach during the 2020 ESIP winter meeting and refined findings during the 2020 ACM SigSpatial meeting.
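A minimal sketch of the intended on-the-fly extraction follows; the STARE cover call and the pixel query are stand-ins for the envisioned EarthDB 2.0 functionality, and coordinates and values are purely illustrative.
\begin{verbatim}
import pandas as pd
from shapely.geometry import Polygon
import pystare  # proposed bindings; cover_from_hull is an assumed name

# Stand-in for one level-2 administrative polygon (lon/lat degrees).
area = Polygon([(-66.3, 18.0), (-65.6, 18.0), (-65.6, 18.5), (-66.3, 18.5)])
lon, lat = area.exterior.coords.xy
cover = pystare.cover_from_hull(list(lat), list(lon), 12)

def query_dnb_pixels(cover):
    """Hypothetical stand-in for the server-side STARE join against the
    DNB pixel array; returns one row per coincident pixel."""
    return pd.DataFrame({'date': pd.to_datetime(['2017-09-19', '2017-09-21']),
                         'radiance': [8.7, 0.9]})  # illustrative values

# Averaged nighttime light intensity per overpass.
timeseries = query_dnb_pixels(cover).groupby('date')['radiance'].mean()
\end{verbatim}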
\subsubsection{Multi sensor snow mapping}
I will implement a snow mapping system that uses low-level/swath data from multiple sensors, for two reasons. Firstly, it allows for cross verification of results. This is a requirement for the development and testing of any novel algorithm. Secondly, the use of multiple sensors increases the aggregated revisit rates, therefore increasing the temporal and/or spatial resolution of snowmaps\footnote{in the sense of \glspl{HRPP}, which are designed to provide the best precipitation estimate at any given time using data from multiple satellites \citep{Lettenmaier2015}}.
The use of lower level products will avoid artifacts caused by resampling and further increase the spatial and temporal resolution by circumventing the pre-defined gridding of higher level products\footnote{High level products use a referencing matrix that provides information on coordinates of an image corner, pixel spacing, and rotation and thereby allows calculation of the coordinates of any point in the image. However, translating imagery acquired from satellites into such a coordinate system requires resampling of the measured radiances, thereby risking introduction of artifacts into the data.}. Additionally, it will allow for the integration of products that are only available at lower levels, such as the thermal bands from the MODIS calibrated radiance product (MOD02KM / MYD02KM).
Multi-sensor sub-pixel snowmapping is an interesting use case to test EarthDB 2.0 for two reasons.
Firstly, \gls{STARE} is intended to facilitate the integration of inhomogeneous data from various sensors at different spatial and temporal resolutions.
Secondly, \gls{STARE} is intended to facilitate the integration of lower level swath products, which are otherwise cumbersome to work with. Both of these assumptions can be tested in this use case.
\paragraph{Use of swath data:}
\gls{MODSCAG}, as described by \citep{Painter2009}, is an algorithm based on linear endmember unmixing to retrieve sub-pixel snow cover data as well as grain size and albedo estimates from MOD09GA (level \gls{L2G}\footnote{L2G is the ``Level 2G'' format, which is geolocated, and gridded into a map projection.}) data.
Using gridded data allows ignoring issues connected to the viewing geometry of MODIS, which causes a) variations in pixel size in the scan direction and b) anisotropic reflectances at shallow viewing angles. I propose to relax these simplifications by porting MODSCAG to use MODIS swath data (MOD09) within EarthDB 2.0.
\paragraph{Inclusion of other sensors:}
MODSCAG has already been adapted to VIIRS and Landsat TM. I propose to implement MODSCAG for these and other multispectral sensors, such as Landsat 8, \gls{GOES}-16/17, FengYun-4A, or Himawari-8/9, in EarthDB 2.0. I hereby expect EarthDB 2.0 to provide the technology facilitating access to, and comparison of, spatiotemporally coincident results. If possible, I will import in-situ measurements from snow courses or snow pillows, as well as airborne measurements from \gls{AVIRIS} and the \gls{ASO}, into EarthDB 2.0 for cross-verification.
\paragraph{From \gls{MODSCAG} to \gls{SCAGD}:}
% [Frew]: probably don't want to call it "God Damn"... is that really what Dozier calls it??
I propose to contribute to the development of \gls{SCAGD}, an advancement of \gls{MODSCAG}, which is currently under development by Jeff Dozier.
MODSCAG simultaneously estimates the fractional snow cover and the snow grain size \citep{Painter2009} from MODIS surface reflectance data. It does so by assuming that the signal that MODIS receives is a linear spectral mixture of endmembers within a pixel. The endmembers are snow, different types of rock, soil, vegetation, and lake ice.
\begin{equation}
R_\lambda = \epsilon_{\lambda} + \sum_{k=1}^N f_k \, R_{\lambda, k}
\end{equation}
MODSCAG then minimizes the least-squares error of this linear combination of the endmember spectra to determine the snow endmember.
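A minimal sketch of this unmixing step is given below, with made-up endmember spectra; MODSCAG itself additionally searches over a library of snow endmembers for different grain sizes and applies further constraints on the fractions.
\begin{verbatim}
import numpy as np
from scipy.optimize import nnls

# Endmember reflectances (columns: snow, rock, vegetation), one row per band.
# The numbers are illustrative, not a real spectral library.
E = np.array([[0.95, 0.20, 0.05],
              [0.90, 0.22, 0.08],
              [0.80, 0.25, 0.45],
              [0.60, 0.25, 0.40],
              [0.20, 0.30, 0.30]])

# Observed pixel reflectance: a 60/40 snow/rock mixture plus noise.
r = E @ np.array([0.6, 0.4, 0.0]) + 0.01 * np.random.randn(5)

# Non-negative least-squares estimate of the endmember fractions f_k.
f, residual = nnls(E, r)
print(f, residual)
\end{verbatim}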
% [Frew]: this doesn't make sense. If snow is *non*-Lambertian then the BRDF is *significant*
% [Frew]: "absolute reflectance" doesn't make sense: reflectance is by definition relative.
MODSCAG acknowledges that snow is not a Lambertian surface, but assumes that the \textit{shape} of the snow reflectance spectrum is insensitive to the illumination and viewing geometry.
This in turn allows solving for the shape of the spectrum rather than for the reflectance magnitude. This is advantageous since errors in the co-registration of image and DEM (which would be required for the determination of the local solar angle) are circumvented.
This major advantage causes a problem at the same time. As we know from \cite{Warren1982},
\begin{itemize}
\item Reflectance of snow in the NIR decreases with increasing grain size
\item Reflectance of snow in the VIS decreases with increasing impurities
\end{itemize}
This means that snow with large grain sizes and a high impurity concentration will have a spectral shape similar to that of fine-grained, clean snow.
Consequently, \gls{MODSCAG} has difficulty distinguishing between clean fine-grained snow (highest albedo) and dirty coarse-grained snow (lowest albedo) and will arbitrarily pick one of these extremes in, e.g., the snow-line area.
In order to avoid the confusion between fine-clean and coarse-dirty snow, the reflectance magnitudes have to be considered.
This in turn makes it necessary to also control for the illumination angle $\phi$.
% [Frew]: what is "it"?
In order to increase the degree of overdetermination, we can avoid solving for the non-snow endmembers by instead measuring them during the summer through a continuum approach\footnote{Creating a continuum approach should be greatly simplified in EarthDB 2.0}.
The reflectances at the MODIS sensor consequently are a function of
\begin{equation}
R_{\lambda} + \epsilon_{\lambda} = F(f_{snow}, \lambda, r_{snow}, h, \phi, c_{dust}, r_{dust}, c_{soot}, r_{soot}, R_{NS})
\end{equation}
where $f_{snow}$ is the fractional snow cover,
$r_{snow}$ the snow grain size,
$h$ the elevation,
$\phi$ the illumination angle,
$c_{dust}$ and $c_{soot}$ the dust and soot concentrations,
$r_{dust}$ and $r_{soot}$ the dust and soot grain sizes,
and $R_{NS}$ the background/non-snow reflectance.
With an appropriate reflectance and radiative transfer model, the above equation can be used to minimize the root mean square error:
\begin{equation}
\begin{split}
min.\; RMSE \\
RMSE = \sqrt{\frac{1}{n_{\lambda}} \sum_{\lambda=1}^{n_\lambda} \epsilon_{\lambda}^2}
\end{split}
\end{equation}
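A toy sketch of this minimization is given below; the two-parameter forward model merely stands in for the reflectance/radiative-transfer model $F$, and a real \gls{SCAGD} fit would solve for the full parameter set above.
\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

wavelengths = np.linspace(0.45, 2.3, 7)  # band centers in micrometers
r_obs = np.array([0.62, 0.60, 0.55, 0.40, 0.25, 0.12, 0.08])  # illustrative

def forward_model(params, wl):
    """Toy stand-in for F(f_snow, r_snow, ...): an exponential decay whose
    rate mimics grain-size darkening."""
    f_snow, r_snow = params
    return f_snow * np.exp(-r_snow * wl)

def rmse(params):
    eps = r_obs - forward_model(params, wavelengths)
    return np.sqrt(np.mean(eps ** 2))

result = minimize(rmse, x0=[0.5, 1.0], bounds=[(0, 1), (0.01, 5)])
print(result.x, result.fun)
\end{verbatim}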
As of now, the algorithm is implemented and tested on hyperspectral \gls{AVIRIS} and \gls{AVIRIS-NG} data, which is only sparsely available. I intend to implement \gls{SCAGD} in EarthDB 2.0 to work with multispectral data from \gls{MODIS}, \gls{VIIRS}, \gls{GOES} and/or Himawari 8.
I will carry out the above mentioned undertakings as part of NASA's \gls{ACCESS}\footnote{Funding Opportunity Number: NNH17ZDA001N-ACCESS} program project ``STARE: SpatioTemporal Adaptive-Resolution Encoding to Unify Diverse Earth Science Data for Integrative Analysis''\footnote{Proposal Number: '17-ACCESS17-0039'}. Experiments will be carried out on an appropriate region of interest (e.g., the Tuolumne, Merced, or San Joaquin basins in the Sierra Nevada, and/or Himachal Pradesh in India).
\newpage
\section{Timeline}
The major milestones for the upcoming two academic years are:
\begin{table}[ht]
\centering
\begin{tabular}{l l l l}
\toprule
Date & Venue & Type & Subject \\ \midrule
2019-12-31 & CODATA & Submission & OCCUR paper \\
2020-01-06 & ESIP & Demo & Nightlights \\
2020-01-06 & IGARSS & Tutorial & EarthDB 2.0 / PyStare \\
2020-11-01 & SigSpatial & Oral & Nightlights \\
2020-12-31 & \textit{TBD} & Submission & EarthDB 2.0 \\
2021-05-01 & \textit{TBD} & Submission & SCAGD \\
2021-06-01 & Defense & Defense & - \\
\bottomrule
\end{tabular}
\end{table}
\section{Funding}
During the academic year 2019/2020, I anticipate being financially supported through a GSR position funded by Rilee Systems Technologies (STARE, Project Code FJN04).
During the academic year 2020/2021, I anticipate being financially supported through a Bren Fellowship and teaching assistantships.
\newpage
\section{Literature Review}
\subsection{Data citation}
In the following, I intend to explore three questions about data citations:
\begin{enumerate}
\item Why do we need data citations?
\item What are data citations?
\item How have data citation systems been implemented?
\end{enumerate}
\subsubsection{Why do we need data citations?}
\cite{Hey2009} coin the term \textit{4th paradigm} for ``using computers to gain understanding from data created and stored in our electronic data stores [..]''.
The 4th paradigm arises from an environment in which large amounts of data are collected 24/7 and made publicly accessible. Research is no longer driven solely by empirical, theoretical, and computational approaches, but also by the exploration of vast amounts of data collected from instruments and simulations. In this context, data collection and assembly itself is a significant research activity \citep{Frew2012}.
%We are just at the beginning of the era of the 4th paradigm and therefore a lot of challenges are still to be solved. One grave uncertainty is how we can maintain long-term data provenance and enable reproducibility throughout the end of time \citep{Hey2009}.
A crucial step into the 4th paradigm is to acknowledge data as first class research products. As such, they have to be persistently available, documented, citable, reusable, and possibly peer-reviewed \citep{Callaghan2012, Kratz2014}. Consequently, the research community has to move from data sharing to data publishing \citep{Costello2009, Kratz2014} or, in other words, has to make data \gls{FAIR} \citep{Wilkinson2016}.
Data citations are one of the required building blocks to achieve this goal, and a common adoption of data citations and uniform data access is expected to benefit the progress of science \citep{CODATA2013}. However, there does not seem to be a consensus on what data publication means \citep{Kratz2014}, nor on how data citation mechanisms that would sufficiently motivate thorough data publishing are to be implemented \citep{Costello2009}.
The lack of data citation standards was already criticized more than a decade ago by \cite{AltKin07}. The authors proposed a set of standards to address the issues. However, years later, \cite{Altman2015} as well as \cite{Tenopir2011} find that even though required by publishers, researchers still too often neither make data publicly available nor consistently cite the data. The reasons for this are both cultural and technical:
\cite{Lawrence2011} find that traditionally only conclusions are judged; little attention is given to the fitness of the underlying data for re-interpretation. This in turn results in low appreciation for data production and publishing. \cite{Tenopir2011} additionally point out that organizations do not sufficiently support their researchers in data management. Even more so, \cite{Tenopir2011} stress that researchers may be motivated to purposefully withhold data in order to retain their own ability to publish findings.
On the technical side, robustness, openness, and uniformity in data publication are lacking \citep{Starr2015, Koltay2016}. Cost, not so much for the storage, but for curation efforts is another reason preventing data publication \citep{Gray2002}.
\cite{Tenopir2011} state that a major reason for data withholding is the effort required to publish data.
The ability to receive credit for cited data may increase the motivation for researchers to publish their data \citep{Crosas2011, AltKin07}. However, there is no common agreement on the implementation of data citations, especially if subsets are to be cited or data is dynamic \citep{Kratz2014, Assante2016}. \cite{Belter2014} finds that even when used, data citation practices are inconsistent. \cite{Assante2016} illustrates the range of practices which span from exporting a formatted citation string or a generic format such as RIS or BibTex, to embedded links to the dataset, or sharing to social media.
\cite{Silvello2017} provides an exhaustive review of the current state of data citations, both in terms of the reasons for their necessity and of current examples of data citation implementations. Based on a meta-study, the author identified six main motivations for data citations: Attribution, Connection, Discovery, Sharing, Impact, and Reproducibility.
Arguably, these motivations can be condensed to Identity, Attribution, and Access:
\paragraph{Identity:}
Citations provide an identity, enabling us to reference and reason about data \citep{Bandrowski2016}, even if it no longer exists or is behind a paywall.
Distinguishing and uniquely identifying data also allows evaluating its usage and hence its relevance and impact \citep{Honor2016}.
\paragraph{Attribution:}
Citations attribute data to authors and therefore allow credit to be given.
The possibility of receiving credit in turn provides an incentive for sharing \citep{Niemeyer2016, Callaghan2012, Kratz2014}.
%Initiatives such as ``Making Data Count''\citep{Kratz2015} recognize this and supply the community with principles and implementations to obtain data usage metrics.
\paragraph{Access:}
A citation provides information on how to retrieve the cited material (e.g., the journal, year, and pages). Persistent access to data is crucial since it is the foundation of reusability and reproducibility \citep{Starr2015}.
Data citations may provide an additional side-effect improving data accessibility: there is evidence suggesting that well curated data will be cited favorably \citep{Belter2014}. Citations therefore may be a motivating driver for researchers to sustainably curate their data and thus make it more accessible.
\subsubsection{What is a data citation?}
Data citations establish a link between published research results and data \citep{CODATA2013}. Minimally, a data citation consists of the following elements \citep{Cook2016}:
\begin{enumerate}
\item Unique identity.
\item Information about the creator.
\item Information about how to access the data.
\end{enumerate}
Unique identities of data allow referencing and connecting related work and therefore enhance discoverability. Further, they enable evaluating the relevance and impact of the dataset. The information about the creator allows the author to claim attribution of the impact, which in turn provides an incentive for the author to share the dataset. The information about how to access the data facilitates sharing and allows reproducibility.
Arguably, data citations should also include a fingerprint\footnote{i.e. a hash. \citep{Crosas2011} refers to them as \gls{UNF}} to enable verification of data integrity \citep{Crosas2011} and of whether a given dataset equals the referenced data.
Data citations further should be machine-actionable \citep{Assante2016, Altman2015, Buneman2016}, both in terms of creation and resolving and in terms of identification.
They have to accommodate the structured nature of data, the fact that data may change over time and that users may need to refer to subsets of data \citep{Buneman2010}.
Even though data citations often rely on \glspl{DOI} \citep{Castelli2013}, \cite{Buneman2010} stress that a citation is more than a \gls{DOI}.
Literature frequently appears to intermix identity with access (locator vs identifier) \citep{ESIP2012a}: While actionable identifiers such as \glspl{DOI} may provide a unique identity and an access mechanism at the same time, identity and access remain two distinct facets of a citation. There is utility in data identity regardless of whether or not the data can be accessed or even still exists.
\Cite{Parsons2013a} criticizes another aspect of \gls{DOI} use in data citations: \Glspl{DOI} are misunderstood to provide imprimaturs and persistence. However, \gls{DOI} cannot provide persistence and should solely be understood as a locator and identifier, which are required long before an imprimatur can be issued.
\newpage
\subsubsection{Solutions and implementations}
I want to distinguish data citation implementations along five different aspects:
\begin{enumerate}
\item How are datasets and their subsets identified?
\item How is fixity assured?
\item How are revisable datasets handled?
\item How do citations facilitate access to data?
\item How are human-readable citation snippets/strings generated?
\end{enumerate}
The way these aspects are solved depends widely on the domain and on the particular dataset characteristics in terms of data complexity (tables/arrays vs. graphs), data volume (\si{\kilo\byte} vs. \si{\peta\byte}), and update frequency, as well as on the characteristics of the repository in terms of subsetting capabilities and typical use.
%%%%%%%%%%%%%%%%%%%%%
\paragraph{Identity:}
A unique identity could be provided by any arbitrary string, given that it is unique \citep{CODATA2013}. In some contexts, filenames \citep{Buneman2016}, or already established identifiers such as accession numbers \citep{Bandrowski2016} may serve this purpose.
In practical terms, \cite{AltKin07} suggest that the identity should double as a handle to the data by associating it with a naming resolution service, e.g. through the use of \gls{DOI}, \gls{LSID}, \gls{URN}, \gls{HNR}\footnote{\url{http://www.handle.net/}}, or \gls{URL}.
In contrast to traditional publications, queries may produce an infinite number of subsets from a single source \citep{Davidson2017, CODATA2013}. It is therefore necessary to reference not only a dataset, but every possible subset of a dataset. \cite{AltKin07} use the term ``deep citation'' to describe the ability to reference subsets of data.
Further, data may evolve over time, which opens the discussion of how to identify the varying states of a dataset \citep{Huber2015}.
The question arises at which granularity a unique identity, and hence a \glspl{PID} should be minted. \cite{Buneman2010} therefore introduce the concept of a ``citable unit''. A citable unit is an object of interest which could be e.g. a fact stated in a scientific paper, or a subset within a dataset. The authors argue against creating identifiers for every object of interest, but rather to wisely choose the granularity of the citable unit. A citation of an object can then be created by appending the identity of the citable unit with information about the location of the object within the citable unit. In relational databases, citable units could be defined through views \citep{Buneman2016}.
The \gls{RDA} \gls{WGDC} \citep{Rauber2015a, Rauber2015, Proll2013} suggests identifying subsets by storing (normalized) queries and associating them with a \gls{PID}. However, since methods and syntax for subsetting depend on the individual repository technology, \cite{AltKin07} suggest citing the entire dataset and appending a textual description of how the dataset was subsetted. The authors further suggest that if significant pre-processing is done on the dataset, it should be stored and cited individually with its own identity.
%%%%%%%%%%%%%%%%%%%%
\paragraph{Fixity:}
Datasets may change unintentionally, through data rot or malicious manipulation, or intentionally, through \gls{CUD} operations.
Data fixity is the property of data to remain unchanged.
Fixity checking allows verifying that data has not changed.
If included in a data citation system, fixity checking can be used to verify that a dataset contains the same data as at the time the citation was created.
\cite{AltKin07, Rauber2015} suggest including \glspl{UNF} in data citations to allow fixity checking. Though mentioned in most papers addressing data citation systems (e.g. \citep{Buneman2016, Davidson2017}), only a few data citation system implementations have addressed fixity so far. In this context, \cite{Klump2016} report that the STD-DOI metadata schema rejected the inclusion of hashes as they would be dependent on the form of representation (e.g. file formats or character encoding).
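The basic mechanism is simple; below is a sketch of an order-independent result-set fingerprint (illustrative only; actual \glspl{UNF} additionally normalize values, e.g. by rounding and fixing the encoding, before hashing).
\begin{verbatim}
import hashlib

def fixity_hash(rows):
    """Order-independent fingerprint of a result set: hash each row,
    then hash the sorted row digests."""
    row_digests = sorted(hashlib.sha256(repr(row).encode('utf-8')).hexdigest()
                         for row in rows)
    return hashlib.sha256(''.join(row_digests).encode('utf-8')).hexdigest()

print(fixity_hash([(1, 'a'), (2, 'b')]))
\end{verbatim}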
%%%%%%%%%%%%%%%%%%%%%%%%%
\paragraph{Revisable data:}
Datasets (especially in the earth/environmental sciences) may evolve over time through updates, appends, or deletions \citep{Klump2016}. In the following, I will refer to these datasets as revisable data and distinguish the term from datasets changing over time due to malicious manipulation or data rot.
Literature frequently intermixes the term ``fixity'' with the ability to cite revisable datasets. A revisable dataset is anticipated and intended to change its \textit{state} over time. A citation system consequently has to be able to distinguish between the states of a revisable dataset \citep{Rauber2015, Klump2016}. However, to achieve this, merely the abstract state a citation is referencing has to remain fixed (i.e. there cannot be ambiguity about the referenced state).
This is true independently of the ability to de-reference a citation to the referenced state (i.e. the actual state being fixed). Identifying and referencing an ephemeral state of a dataset is a necessary requirement for data citation. The ability to persistently retrieve this state of the data is a data publication, not data citation challenge. It hereby is up to the publisher and repository to choose an apt level of zeal:
\begin{description}
\item[Pessimistic] Data is assumed to be ephemeral and consequently citations cannot ever be de-referenced.
\item[Optimistic] Data is assumed to be fixed. Citations always de-reference to the current state of the data.
\item[Opportunistic] Data is assumed to remain fixed for some time. Citations can be de-referenced only until the data changes.
\item[Pedantic] Every state of the data is saved. Consequently citations can always be de-referenced to the referenced state.
\end{description}
\cite{Klump2016} makes a hard distinction between growing and updated datasets. Identifying a state of a growing dataset can simply be implemented through time ranges, given that records are timestamped.
Versioning can be used to identify states of revisable data. Note that ``version'' is sometimes used synonymously with ``state'' (of a dataset or a single record). However, in the following, I will use the term ``version'' as a policy-prescribed reference to a state.
Versioning is intuitive and can trivially be implemented e.g. by appending version number suffixes to dataset names\footnote{\url{https://library.stanford.edu/research/data-management-services/data-best-practices/data-versioning}}.
However, since versioning is merely a \textit{policy}, there is no guarantee for enforcement. Further, ``There is currently no agreed standard or recommendation among data communities as to why, how and when data should be versioned''\footnote{\url{https://www.ands.org.au/working-with-data/data-management/data-versioning}}.
%\Cite{Barkstrom2003} defines that within a dataset version, algorithms and input parameters used to produce the dataset are constant.
The W3C\footnote{\url{https://www.w3.org/TR/dwbp/\#dataVersioning}} provides some guidance on how revisable data should be handled on the web; however, the \gls{RDA}\footnote{\url{https://rd-alliance.org/data-versioning-rda-8th-plenary-bof-meeting}} considers these standards too simple and not scalable.
A major concern in data versioning is how to reach ``consensus about when changes to a dataset should cause it to be considered a different dataset altogether rather than a new version''\footnote{\url{https://www.w3.org/TR/dwbp/\#dataVersioning}}.
From a pure identity perspective, this question is moot since every relevant state needs to be identifiable. The distinction between version and state is therefore mainly connected to the nature of \glspl{PID} (and the costs associated with minting them) \citep{Klump2016}, as well as to the notion of hierarchical association and provenance.
\cite{Barkstrom2003} attempts to structure the semantics of dataset versions by suggesting a 4-level tiering: Data products (Level 1) are collections of datasets (Level 2). Datasets within a data product represent homogeneous observations but may stem from different measurement sources (e.g. instruments). A dataset may have different dataset versions (Level 3). Within a dataset version, algorithms and input (e.g. calibration) parameters are fixed. Finally, a dataset version may have different dataset version variants (Level 4), within which all production configurations (OS, compiler, source code) are held constant.
For the WDC-RSAT data, \glspl{DOI} are assigned at the product level. The datasets in the product may be appended without issuing a new \gls{DOI}. Only when data is reprocessed (i.e., a new algorithm is used) are new \glspl{DOI} assigned to the dataset \citep{Huber2015}.
\cite{AltKin07} suggest treating different versions of datasets as separate datasets and citing them independently. This suggestion is implemented by the \gls{SEDAC} \citep{Downs2013}: every change in the dataset triggers a version increase, which is associated with a new \gls{DOI}. The landing pages of newer versions reference older versions.
The data publishing services Zenodo, Dryad, and Figshare offer the option to mint a \gls{DOI} representing all dataset versions (``base-DOI'') and additional \glspl{DOI} pointing to specific versions\footnote{\url{http://help.zenodo.org/}} (``version-DOIs''). Figshare will automatically trigger a new version when changes are made to the data. Figshare's version-\glspl{DOI} are semantic and are created by appending $v<x>$ to the base-DOI, where $x$ is the version number. Zenodo, on the other hand, refers to semantic \glspl{DOI} as bad practice, mints semantically independent \glspl{DOI} for different versions, and recommends semantically linking connected \glspl{DOI}.
The \gls{BCO-DMO}\footnote{\url{https://rd-alliance.org/sites/default/files/attachment/RDA_DataCitation_BCO-DMO_Chandler_Plenary.pdf}} employs more flexible rules: Changes to the data that result in different conclusions trigger a major version increment.
Such changes include every record update and deletion as well as changes in the data schema.
A major version increment is treated as an independent entity with its own \gls{DOI} and landing page. Inserts will trigger metadata updates. A metadata update will cause a minor version increment. Minor version increments are not treated as separate entities and share the same \gls{DOI}.
As mentioned above, it is open to debate whether or not reproducibility is a hard requirement for data citations of a revisable dataset. If not, resolving of citations could be deprecated altogether (pessimistic), or allowed only until the data has changed (opportunistic). The opportunistic approach could, e.g., be implemented by timestamping modifications (last modification date) or through fixity checking. The \gls{RDA} \gls{WGDC} \citep{Rauber2015} elaborates on this approach: a dataset (or a subset) should be given a new identity when the data has changed since the last time the dataset (or subset) was requested. This is recommended to be implemented through the use of a normalized query store and checksums \citep{Ball2015}.
\cite{Gray2002}, however, advocate for all non-regeneratable data to remain available forever, and \cite{Buneman2010} suggest that a dataset remain accessible once it has been cited. In their implementation, a citation can include a version number, allowing the system to de-reference the citation to the corresponding state. A similar approach is implemented in the Dataverse \citep{Crosas2011}, where citations can optionally contain a dataset version number.
A simple way of keeping previous states available is snapshotting. However, versioning and snapshotting at fixed points in time is not well suited for users of (near) real-time data \citep{Huber2015}. Further, depending on the size and revision frequency, storing all revisions as separate exports may not be feasible \citep{Rauber2015}. To circumvent these issues, \cite{AltKin07} as well as the \gls{RDA} \gls{WGDC} \citep{Rauber2015a, Rauber2015, Proll2013} recommend storing state changes on a record level rather than on a dataset level: every state of every record remains persistently stored and is associated with the \gls{CUD} operation that touched it. All \gls{CUD} operations are timestamped, allowing the validity period of every state to be identified.
This approach is implemented by \cite{Alawini2017}, who created a citation service for the eagle-iV database. Since this database itself does not version its data (only the most recent version is visible), the authors implemented an external service that versions eagle-iV data in order to provide fixity to the users. The service tracks and stores every change in the original dataset. The authors note that this approach is viable for eagle-iV since the dataset changes very slowly.
The concept of versioning at record-level challenges the common understanding of a dataset-wide versioning and opens a debate about whether two subsets of two different dataset-level versions containing identical data should be identified as the same subset or not.
%%%%%%%%%%%%%%%%%%%%%%%
\paragraph{Access:}
There is common agreement that data citations should make use of actionable \glspl{PID} as an access mechanism. For example, \cite{AltKin07} suggest that a citation contain an identifier that can be resolved to a landing page (not to the data itself), a requirement also specified by the \gls{JDDCP} \citep{Fenner2016}.
The landing page in turn should contain a link to the data resource. The advantage is that the identifier can be resolved regardless of whether the data is behind a paywall, or does not exist anymore.
\glspl{DOI} appear to be the most commonly used actionable \glspl{PID} in data citations, which can be explained by their maturity and the availability of a global resolving service \citep{Honor2016}. Alternatives are \gls{ARK}, \gls{PURL}, or \gls{URL}/permalinks \citep{Klump2016, Starr2015}.
The question of data access is connected to reproducibility and to the granularity at which changing states of a revisable dataset should be stored.
% a) PID comparison
% b) Machine actionablility
There further seems to be agreement that identifiers should be resolvable to a machine-actionable representation of a landing page, implemented e.g. through Content Negotiation.
%%%%%%%%%%%%%%%%%%%%%%%
\paragraph{Format}
Data citation systems should make use of metadata standards \citep{CODATA2013} and should be capable of generating human-readable citation strings in order to facilitate the use of data citations \citep{Buneman2016, Rauber2015}.
The metadata required to construct human-readable strings varies between implementations, but a common intersection can be found:
\begin{table}
\caption{Comparison of Data Citation Standards. DataCite: \url{https://schema.datacite.org}; Dublin Core \url{http://dublincore.org}; Mendeley \url{https://data.mendeley.com}; Figshare \url{https://figshare.com/}; Zenodo \url{https://zenodo.org}; WGDC \citep{Rauber2015}}
\begin{tabularx}{\columnwidth}{lll lll ll}
\toprule
& PID & Creator & Title & Subj. & Vers. & Descr. & UNF \\ \midrule
DataCite & DOI & Yes & Yes & Yes & Yes & Yes & No \\
Dublin Core & Yes & Yes & Yes & Yes & Yes & Yes & No \\
Mendeley & DOI & Yes & Yes & No & Yes & Yes & No \\
Figshare & DOI & Yes & Yes & Yes & Yes & Yes & No \\
Dryad & DOI & Yes & Yes & Yes & Yes & No & No \\
Dataverse & DOI & Yes & Yes & No & Yes & No & Yes \\
Zenodo & DOI & Yes & Yes & No & Yes & No & No \\
WGDC & Yes & Yes & Yes & No & Yes & No & No \\
\bottomrule
\end{tabularx}
\end{table}
Additional fields include:
publish date, distributor, subset definition, institution, related links, file type, type of data, licenses, project name, keywords, repository name, and location (if no resolvable identity is used).
The crosscite DOI Citation formatter\footnote{\url{https://citation.crosscite.org/}} allows generating citation strings from metadata retrieved from landing pages through content negotiation.
The citation strings are formatted subject to styles defined through the \gls{CSL}\footnote{\url{https://citationstyles.org/}}.
\subsubsection{Recommendations}
A variety of data citation recommendations have been published within the last couple of years.
Following a list of the most notable ones:
\paragraph{\acrlong{WGDC}:}
\citep{Rauber2015} are the \gls{WGDC} recommendations for data citations.
\paragraph{\acrlong{DCC}:}
\citep{Ball2015} are the data citation recommendations of the Digital Curation Centre, mainly based on the \gls{WGDC} recommendations.
\paragraph{\acrlong{ESIP}:}
\citep{ESIP2012a} are the data citation recommendations by \gls{ESIP}.
\paragraph{\acrlong{COS}:}
\citep{COS2015} are the \gls{COS} data citation recommendations.
\paragraph{\acrlong{JDDCP}:}
\citep{Altman2015, Rauber2015, Fenner2016, Starr2015} are the FORCE11 \gls{JDDCP}.
\paragraph{\acrlong{CODATA}:}
\citep{CODATA2013} are the \gls{CODATA} recommendations.
%%%%%%%%%%%%%%%%%%%
\subsection{Related work}
Several services for automated data citation creation have been presented in the past.
MatDB\footnote{\url{http://doi.org/10.17616/R3J917}} implements a data publication and citation service for engineering materials. Datasets are made citable by enforcing minimal discipline-specific metadata and minting DataCite DOIs. Fixity is assured by snapshotting the dataset at the time of DOI minting. Revisability of data is made possible through policy-enforced versioning \citep{Austin2016}.
\cite{Alawini2017} created a citation service for the \gls{RDF} eagle-i database. Since this database itself does not version its data (only the most recent version is available), the authors implemented an external service that versions eagle-i data in order to provide access to revised data to the users. The service tracks and stores every change in the original dataset. The authors note that this approach is viable for eagle-i since the dataset changes very slowly.
In the Dataverse Network software \citep{Crosas2011}, data is aggregated in ``studies''. Studies may contain several datasets. Each study shares a common persistent identifier. A citation to a dataset (and to subsets) is implemented as the combination of the study's \gls{PID} appended with the \gls{UNF} of the cited data.
\cite{Cook2016} presents the data product citations at the \gls{ORNL} \gls{DAAC}. The \gls{DAAC} assigns a \gls{DOI} per dataset, which may contain anywhere between one and tens of thousands of files. A single file within a dataset can be identified by appending the file's UUID to the DOI (using the ``urlappend'' functionality of the DOI resolver). The DAAC also provides a MODIS subsetting service. The user can request a citation to the subset, which is comprised of the dataset's citation appended with a textual description of the temporal and/or spatial subsetting. The citation can be requested both as a plain string and in BibTex format.
A very similar approach is implemented for the \gls{ARM} Data Archive \citep{Prakash2016}. Upon data order fulfillment, the user is provided with both the data and a citation that includes a textual description of the temporal and/or spatial subsetting. The \gls{ARM} Data Archive additionally hosts a citation creator: a \gls{GUI} that allows the creation of a data citation subject to a fully qualified dataset stream name and, optionally, manually specified subsetting parameters. The user can hereby choose between the custom ARM citation style, APA, MLA, and Chicago.
\cite{Honor2016} describes a reference implementation for data citations of a database holding neuroimaging data (the specific use case is the \gls{NITRC}). The citable units for their use case are individual images, which are aggregated in studies/projects. Upon data upload, both each image and each study is hierarchically assigned a \gls{DOI}. The authors recognize that this implementation may result in the generation of many \glspl{DOI}, but evaluate the solution as feasible for their use case.
\cite{Proll2013} present a reference implementation for data citations to data managed by a \gls{RDBMS}.
The system is based on the premise that a timestamped SELECT query can correctly identify data. To allow revisability, the system timestamps every \gls{CUD} operation and tracks validity time ranges of records rather than modifying records in place. When a user wants to create a citable subset, the system stores the corresponding timestamped SELECT query, calculates a hash of the result set, and assigns a \gls{PID} to the query.
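The essence of this approach can be sketched in a few lines. The following Python/SQLite example is purely illustrative and not taken from \cite{Proll2013}; the table layout, the column names, and the use of a UUID in place of a real PID service are invented for the sketch. Records carry validity time ranges, and a citation stores the query text, its execution timestamp, and a hash of the sorted result set.
\begin{verbatim}
import hashlib, sqlite3, uuid
from datetime import datetime, timezone

db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE obs (
    id INTEGER, value REAL,
    valid_from TEXT, valid_to TEXT)""")   # valid_to stays NULL while current

citations = {}                            # stands in for a PID registry

def cite(select_sql):
    # Freeze the query at the current time, hash the result set, mint an id.
    timestamp = datetime.now(timezone.utc).isoformat()
    rows = sorted(db.execute(select_sql).fetchall())
    digest = hashlib.sha256(repr(rows).encode()).hexdigest()
    pid = str(uuid.uuid4())               # placeholder for a real PID service
    citations[pid] = {"query": select_sql,
                      "timestamp": timestamp,
                      "hash": digest}
    return pid

# A citable subset: all records that are valid at citation time
pid = cite("SELECT id, value FROM obs WHERE valid_to IS NULL")
print(pid, citations[pid])
\end{verbatim}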
\cite{Schubert2019} recognize that a manual identification of data is impractical and imprecise. The authors developed a dynamic data citation solution for the data service \gls{CCCA}, which closely follows the \gls{RDA} \gls{WGDC} recommendations \citep{Rauber2015}. CCCA is a web service that allows subsetting of data through RESTful queries and is based on \gls{TDS}.
The metadata of a subset is constructed as the concatenation of the dataset metadata, the query arguments, the name of the subset creator, and checksums.
Handle is used as the \gls{PID} service.
\cite{Buneman2010} describe a rule-based citation approach and apply it to the IUPHAR database and EAD.
\cite{Buneman2016} describe a rule-based citation approach and apply it to MODIS and IUPHAR.
\begin{table}
\caption{Comparison of automated data citation systems}
\begin{tabular}{l l l l l}
\toprule
Repo & PID & Revisability & Fixity & Deep Citation \\\midrule
MatDB & DOIs & Versioning & Snapshot & No \\
eagle-i & eagle id & CUD TS & UNF & Query\\
Dataverse & Handle & Versioning & UNF & PID+UNF \\
ORNL & DOIs+UID & Versioning & None & Textual \\
ARM & DOIs/Name & Versioning & None & Textual \\
NITRC & DOIs & N/A & None & DOIs \\
RDBMS & PID & CUD TS & UNFs & Query \\
CCCA & Handle & Versioning & None & Query \\
\bottomrule
\end{tabular}
\label{tab_citcomp}
\end{table}
\newpage
\subsection{Global indexing schemes}
%\item How does a system that leverages \gls{STARE} within SciDB compare to google earth engine and the \gls{DGGS} \citep{OpenGeospatialConsortium2017}?
\label{lit_index}
Queries are often concerned only with a portion of the whole stored data volume and appropriate indexing schemes minimize the data that has to be scanned \citep{Kunszt2000}.
The idea of using hierarchical data structures to represent or index geospatial data has been discussed for decades \citep{Dutton1996, Samet1988}.
Hierarchical data structures are based on the recursive decomposition of an initial planar region or solid and can, for example, be implemented as quadtrees \citep{Samet1988}.
The initial applications of quadtrees in the geospatial domain have mainly focused on the representation of two-dimensional data in terms of visualization and image processing and were typically based on the tessellation of squares \citep{Lugo1995}.
An early example of the use of quadtrees to represent the globe three-dimensionally is \cite{Dutton1984}, who proposed the establishment of a Geodesic Elevation Model in which locations of elevation measurements are encoded/indexed in a quadtree.
\cite{Dutton1989} suggests the use of this quadtree for general indexing of planetary data.
The quadtree is created by recursively tessellating the facets of a regular solid, in this case, an octahedron.
The triangle faces of the octahedron are subdivided with the triacon breakdown. The triacon breakdown triples the number of facets on each iteration and results in two alternating hierarchies. Every level, though, is fully contained within the level two steps above; two triacon breakdown steps therefore tessellate a triangle into nine smaller triangles. The address/index/code (``gemcode'') of each triangle is the concatenation of the facet numbers (one through nine) iterated through to arrive at the triangle. The author envisages a replacement of coordinates in geospatial data with geocodes, given that the community could agree on a common method to generate geocodes.
In parallel efforts, \cite{Fekete1990, Fekete1990a} and \cite{Goodchild1992} (and later also \cite{Lugo1995}) implemented the \gls{QTM} initially suggested by \cite{Dutton1984}. While \cite{Goodchild1992} used an octahedron as the initial regular solid, \cite{Fekete1990, Fekete1990a} use an icosahedron.
The resulting structures allow every feature object on the planet to be geospatially indexed.
In contrast to \cite{Dutton1984}, \cite{Fekete1990, Fekete1990a, Goodchild1992, Lugo1995} iteratively tessellate each triangle into four triangles, allowing them to encode each tessellation step with two bits.
\cite{Goodchild1992} point out that the length of a trixel address (i.e., the index), which corresponds to the level/depth in the hierarchy, simultaneously indexes the size (or spatial uncertainty) of the indexed object.
\cite{Dutton1996} explored the tradeoffs between the choices of the initial solid (tetrahedron, octahedron, icosahedron) and hereby emphasizes the practical advantages of an octahedron: its vertices occupy cardinal points (e.g. the poles) and lie in ocean areas, and its edges align with cardinal lines (the equator and meridians).
The idea of indexing spherical data with a quadtree was picked up again by \cite{Barret1995} and further adapted by \cite{Kunszt2000, Kunszt2001, Szalay2005}, who developed an indexing scheme for the \gls{SDSS}, which would later on be implemented in the SkyServer\footnote{The SkyServer was built by Tom Barclay, Dr. Jim Gray and Dr. Alex Szalay from the TerraServer \citep{Barclay1998, Slutz1999} source code. The latter was a project to demonstrate the real-world scalability of MS SQL Server and Windows NT Server.} \citep{Szalay2002, Thakar2003}.
The authors coined the term \gls{HTM}.
All nodes of the \gls{HTM} quad-tree are spherical triangles. The quad-tree is created by recursively dividing each triangle into four new triangles, using the parent triangle's corners and the midpoints of its sides as corners of the new triangles.
The name of a new node (triangle) is the concatenation of the name of the parent triangle and an index 0 through 3. Thus, node names increase in length by two bits for every level. The authors distinguish between \gls{HTM} names and \gls{HTM} IDs, the latter being the 64-bit integer encodings of the \gls{HTM} names.
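The recursive subdivision step can be sketched as follows (Python/NumPy); the particular ordering of the four children is an illustrative choice and may deviate from the reference implementation:
\begin{verbatim}
import numpy as np

def subdivide(v0, v1, v2):
    # Split a spherical triangle (corners given as unit vectors) into four
    # children using the midpoints of its sides, as in the HTM construction.
    def midpoint(a, b):
        m = a + b
        return m / np.linalg.norm(m)      # re-project onto the unit sphere
    w0, w1, w2 = midpoint(v1, v2), midpoint(v0, v2), midpoint(v0, v1)
    # Three corner children keep one parent corner each; the fourth child is
    # the central triangle spanned by the three midpoints.
    return [(v0, w2, w1), (v1, w0, w2), (v2, w1, w0), (w0, w1, w2)]
\end{verbatim}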
\cite{Kondor2014} use and extend the \gls{HTM} implementation to tessellate complex regions on the earth's surface.
\cite{Planthaber2012, Planthaber2012b, Krcal2015, Hausen2016, Doan2016} experimented with storing earth-observing satellite data in the array database SciDB. In their attempts, the data is indexed through integerized latitude-longitudes. \cite{Doan2016} emphasize the importance of indexes for database performance, as they govern data placement alignment, and suggest \gls{HTM} as a promising approach.
\cite{Rilee2016} advanced the \gls{HTM} implementation from right-justified mapping to left-justified mapping:
In a right-justified mapping, trixels that are in proximity but at different levels are mapped to separate locations on the number line\footnote{Trixel \textbf{S0123} has the binary HTM code 1000011011 and thus HTM ID 539, while trixel S01230 has the binary HTM code 100001101100 and thus HTM ID 2156. The IDs are far apart from each other although both trixels share the same name prefix (and thus the latter is contained in the former).}.
Left-justified mapping respects geometric containment by right-padding HTM binary codes with zeros.
The level (which in a right-justified mapping is given implicitly) is specified by the last 5 bits of the 64-bit integer. Therefore, regardless of the resolution, indexes that are co-located fall into similar index ranges.
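The difference between the two mappings can be illustrated with the following sketch (Python). The helper reproduces the IDs from the footnote above; the exact bit layout of the left-justified ID, e.g.\ the placement of the padding and the level bits, follows the verbal description and may differ in detail from the implementation of \cite{Rilee2016}:
\begin{verbatim}
def htm_bits(name):
    # 'N' -> 11, 'S' -> 10, then every subdivision digit 0-3 as two bits
    bits = '11' if name[0] == 'N' else '10'
    return bits + ''.join(format(int(d), '02b') for d in name[1:])

def right_justified_id(name):
    # Classic HTM ID: the binary code interpreted directly as an integer
    return int(htm_bits(name), 2)

def left_justified_id(name, width=64, level_bits=5):
    # Right-pad the code with zeros so that containment is preserved on the
    # number line, and encode the level in the last five bits.
    level = len(name) - 2                      # S0 ... N3 are level 0
    payload = htm_bits(name).ljust(width - level_bits, '0')
    return int(payload + format(level, '0' + str(level_bits) + 'b'), 2)

assert right_justified_id('S0123') == 539      # example from the footnote
assert right_justified_id('S01230') == 2156
\end{verbatim}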
\cite{Kuo2017} extend the implementation with a temporal component and name the resulting universal geoscience data representation \gls{STARE}.
\newpage
\subsection{Use Case (Multi-sensor snow mapping)}
\subsubsection{Motivation for remote sensing of snow}
Considering its high reflectance and large areal extent, snow is an important forcing on Earth's radiation balance and hence the climate \citep{Durand2017}.
Further, significant portions of Earth's population rely on water originating from snowmelt \citep{Barnett2005, Durand2017}, and the snowpack itself buffers the runoff \citep{Lettenmaier2015}.
Understanding snowmelt processes is therefore crucial for managing water resources, especially considering the drastic changes in snowmelt anticipated under globally changing climatic conditions.
In order to estimate and predict the spatial and temporal extent of snow cover, the snow's energy and mass balance are simulated. Besides meteorological conditions, this requires spatially resolved data on the snowpack in terms of extent/cover, depth, presence of water, temperature, \gls{SWE}, and albedo \citep{Dozier2004}.
\subsubsection{Remotely sensed snow cover}
Measuring snow can be subdivided into identifying the existence/extent of snow and measuring its properties such as depth, density, water content, albedo, and temperature profile. The former three properties can be collapsed into the \gls{SWE}, while the latter two are necessary to model the forcings on the snowpack.
Traditional ways of measuring the snowpack are snow pillows, snow courses, and meteorological surveys. These types of measurements are sparse and infrequent, and further subject to inhomogeneous conditions. In contrast, remote sensing can provide temporally and spatially continuous data \citep{Dozier2004, Nolin2010}.
For hydrology, \gls{SWE} is an essential snowpack parameter. However, the snow community lacks an approach for routinely mapping its global distribution \citep{Lettenmaier2015} at a sufficient space-time resolution. The SnowEX project \citep{Durand2017} is therefore seeking to help define a mission proposal.
Measuring the global extent of snow, on the other hand, is currently generally possible, with challenges arising from a) cloud cover, b) vegetation, and c) complex terrain. Snow extent can globally and routinely be calculated either from a combination of visible and \gls{SWIR} surface reflectance data, or from (passive and active) microwave data \citep{Frei2012}. Global active microwave data is available from \gls{QuickSCAT}, however only for the time period between 1999 and 2009. Global passive microwave data has been continuously available since the 1970s. However, due to the low signal, passive microwave data is only available at coarse resolution.
Specifically for mountainous regions, which are characterized by high topographic heterogeneity, spatial resolutions need to be fine enough to sufficiently capture the temporal and spatial variability of the snowpack. \cite{Lettenmaier2015} suggest spatial resolutions of snow extent not coarser than $\approx \SI{100}{\meter}$ and temporal resolutions of not more than one week. Hence, passive microwave data is not suitable for snow extent measurements in mountainous areas. The required resolutions also exceed the spatial and/or temporal resolution of visible and \gls{SWIR} reflectance data from spaceborne remote sensing instruments. It is therefore necessary to map snow cover at sub-pixel accuracy \citep{Dozier2004}.
Several different algorithms exist to binarily classify pixels as `snow' or `non-snow' as well as to estimate fractional snow cover from visible and \gls{SWIR} reflectance data \citep{Nolin2010}.
Further, algorithms to estimate the snow's albedo via measurements of snow grain size and contamination by dust or soot have been developed \citep{Nolin2010, Dozier2004}. These algorithms are based on the spectral signature of a pixel without consideration of the pixel's neighbors. The algorithms may or may not include filtering based on, e.g., surface temperatures, vegetation masks, and cloud masks.
\paragraph{\gls{NDSI}:}
Both snow and clouds are highly reflective in the visible part of the spectrum. However, in contrast to clouds, snow is highly absorptive in the \gls{SWIR} part of the spectrum, which allows snow to be distinguished from clouds using the ratio of visible and \gls{SWIR} reflectances \citep{Hall2011}. This first became feasible with the launch of the Landsat \gls{TM}, which included sensors for the shortwave infrared \citep{Lettenmaier2015}. \cite{Dozier1989} introduced the use of normalized differences of visible and \gls{SWIR} reflectances (later termed \gls{NDSI} by \cite{Hall1995}). Combined with threshold values of this difference, Landsat TM pixels can be categorized into snow-covered and snow-free.
A challenge in mapping snow-covered area is forest cover: the canopy obscures the snow beneath it while contributing to the pixel reflectance. \cite{Klein1998} therefore introduced a combination of \gls{NDVI} and \gls{NDSI} to reduce the error in snow cover detection in dense vegetation \citep{Nolin2010}.
The approach was adapted by \cite{Hall2002, Hall2001} to introduce level-3 snow products for \gls{MODIS} at \SI{500}{\meter} resolution (MOD10A1) \citep{Hall2016}. The approach includes heuristics based on \gls{NDSI} to improve quality in forested regions and on thermal masks to identify ``spurious snow'' (a pixel is not determined to be snow if its temperature is greater than \SI{277}{\kelvin}). The algorithm produces binary pixel maps (``SNOWMAP'') of snow cover, based on the \gls{NDSI} computed from \gls{MODIS} bands 4 and 6:
\begin{equation}
NDSI6 = \frac{band \, 4 -band \, 6}{band \, 4 + band \, 6}
\end{equation}
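To make the thresholding step concrete, a single-pixel classification might look as follows. The \gls{NDSI} threshold of 0.4 is an assumed, illustrative value and not necessarily the one used operationally in MOD10A1; the \SI{277}{\kelvin} thermal screen is the one described above.
\begin{verbatim}
def snowmap_binary(band4, band6, surface_temp_k, ndsi_threshold=0.4):
    # Binary snow / no-snow decision in the spirit of SNOWMAP: an NDSI above
    # an (illustrative) threshold combined with the thermal screen, under
    # which a pixel warmer than 277 K is not classified as snow.
    ndsi6 = (band4 - band6) / (band4 + band6)
    return ndsi6 > ndsi_threshold and surface_temp_k <= 277.0

# Bright in the visible, dark in the SWIR, cold surface -> snow
print(snowmap_binary(band4=0.8, band6=0.1, surface_temp_k=265.0))   # True
\end{verbatim}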
\cite{Salomonson2006, Salomonson2004} added a fractional snow cover product starting from version 005 of MOD10A1. The new algorithm also addresses the fact that detectors of band 6 on \gls{MODIS} Aqua are nonfunctional, requiring the definition of a new \gls{NDSI}:
\begin{equation}
NDSI7 = \frac{band \, 4 -band \, 7}{band \, 4 + band \, 7}
\end{equation}
Fractional snow cover is calculated by assuming a linear relationship between the \gls{NDSI} and the snow cover fraction. The model was fitted with data from Landsat \gls{ETM} and resulted in the following relationships:
\begin{equation}
FRA6T = -0.01 + 1.45 \cdot NDSI6
\end{equation}
\begin{equation}
FRA7T = -0.64 + 1.91 \cdot NDSI7
\end{equation}
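The mapping from reflectances to a fractional snow cover estimate then reduces to a few lines; in the following sketch, clipping the result to the physical range $[0, 1]$ is an added assumption and not part of the published relationships:
\begin{verbatim}
def fractional_snow_cover(band4, band_swir, aqua=False):
    # NDSI from MODIS band 4 and a SWIR band (band 6 on Terra, band 7 on
    # Aqua), converted to a snow cover fraction with the linear
    # relationships given above.
    ndsi = (band4 - band_swir) / (band4 + band_swir)
    frac = -0.64 + 1.91 * ndsi if aqua else -0.01 + 1.45 * ndsi
    return min(max(frac, 0.0), 1.0)       # clip to the physical range [0, 1]

print(fractional_snow_cover(band4=0.8, band_swir=0.3))   # Terra relationship
\end{verbatim}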
\paragraph{Spectral Unmixing:}
The \gls{NDSI} approach generally struggles to provide accurate results during the transitional periods of accumulation and melt. In contrast, methods based on spectral unmixing provide more consistent results throughout the seasons and maintain their accuracy across a large range of surface properties \citep{Rittger2013}.
Spectral mixing is the assumption that measured radiances are a combination of the radiances reflected by different constituent surfaces \citep{Dozier2004}.
In remote sensing, this results from the fact that pixels of remote sensing sensors are often too large to represent a pure constituent material and rather represent a mixture of a number of constituent materials. Spectral unmixing is an approach in which the spectrum of a mixed pixel is decomposed into the spectra of the constituent materials in order to determine the proportionate contribution of each constituent material to the mixed pixel. Hence, spectral unmixing provides a method to retrieve sub-pixel detail \citep{Keshava2003}.
Spectral unmixing approaches are based on the inversion of mixing models, which represent the physics of hyperspectral mixing. Mixing models may either be linear or nonlinear. Linear mixing models can be employed if the reflectance area is contiguously divided so that the incident radiation is reflected only once. Homogeneously mixed reflectance areas may cause incident radiation to be reflected by multiple substances at once; the mixing then has to be modeled nonlinearly \citep{Keshava2003}.
In order to solve for the fractional surface types, the error of the endmember combination is minimized.
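For the linear case, this minimization is a (possibly constrained) least-squares problem. The following minimal sketch assumes a given matrix of endmember spectra and uses non-negative least squares; the sum-to-one constraint and the search over candidate endmember combinations are omitted for brevity:
\begin{verbatim}
import numpy as np
from scipy.optimize import nnls

def unmix_linear(endmembers, pixel_spectrum):
    # endmembers: (n_bands, n_endmembers) matrix of pure-material spectra
    # pixel_spectrum: (n_bands,) observed mixed-pixel spectrum
    # Solve min ||endmembers @ f - pixel_spectrum|| subject to f >= 0; the
    # residual norm is the model error to be minimized over candidate
    # endmember combinations.
    fractions, residual = nnls(endmembers, pixel_spectrum)
    return fractions, residual

# Toy example: three bands, two endmembers (snow-like and soil-like)
E = np.array([[0.9, 0.2],
              [0.8, 0.3],
              [0.1, 0.4]])
y = 0.6 * E[:, 0] + 0.4 * E[:, 1]         # a perfectly mixed pixel
print(unmix_linear(E, y))                  # fractions close to (0.6, 0.4)
\end{verbatim}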
In the context of snow, the assumption of linear mixing is valid as long as only minimal interactions between different surfaces can be assumed, such as on planar areas and/or for snow cover above the tree line \citep{Painter2009, Dozier2004}. If multiple scattering is present (e.g. reflections from vegetation onto snow), the mixing has to be modelled nonlinearly \citep{Roberts1993}.
\cite{Vikhamar2003} present an approach for fractional snow cover identification (SnowFrac) based on constrained linear spectral unmixing. The algorithm particularly tackles the challenge of identifying fractional snow cover in tree-covered areas. It uses land-cover data to determine a priori the non-snow endmembers observable in a scene.
\cite{Painter2003} describe \gls{MEMSCAG}, a method derived from \gls{MESMA} \citep{Roberts1998}, to obtain the subpixel snow cover, grain size, and albedo for \gls{AVIRIS} pixels. The underlying method is linear spectral unmixing. Snow grain size and fractional snow cover are estimated simultaneously. Endmembers of pure snow for varying grain sizes were modeled with Mie theory. Additionally, 60 endmembers for rock, soil, vegetation, and ice are used.
\Cite{Painter2009} describes \gls{MODSCAG}, a progression of \gls{MEMSCAG}, enabling subpixel snow cover, grain size, and albedo estimation for \gls{MODIS} pixels. \gls{MODSCAG} uses the shape of the spectra rather than absolute reflectances, which makes it suitable for estimating snow parameters in rugged mountainous regions, where terrain and hence illumination angles are not precisely determinable. \gls{MODSCAG} accounts for nonlinear mixing but avoids nonlinear solving through the use of pre-calculated canopy-level endmembers that are linearly combined.
A general challenge that \gls{MODSCAG} faces in working with \gls{MODIS} pixels is the wide viewing angle of \gls{MODIS} \citep{Dozier2009, Dozier2008, Liu2008}, which results in the following problems:
\begin{itemize}
\item Pixels far off nadir cover more than 10 times the area of pixels close to nadir, causing them to overlap and therefore blur the image.
\item Since topography and vegetation are not taken into account, the viewing zenith angle/geometry relative to the surface is unknown; for example, trees increasingly block the view to the ground as viewing angles increase.
\end{itemize}
To overcome these and other issues, \cite{Dozier2008} developed a space-time interpolation to recover each day's best estimates by avoiding measurements far from nadir.
%\subsubsection{Remote sensing of snow temperature}
%\cite{Dozier1981} presents a method to measure surface radiant temperatures at sub-pixel resolution. It exemplifies a method for TIROS-N satellites. The fundamental principle is that a sub-pixel area with higher temperature will contribute proportinally more to the signal in the shorter wavelengths. The method assumes that only two temperature fields exist within the pixel.
%\cite{Lundquist2018} presents a method to separate snow from forest temperatures and to determine fractional snow cover from MODIS data.
% No intext Citation
% \subsection{Environment Informatics}
\nocite{Frew2004}
% \subsection{Indexing}
\nocite{Samet1990}
\nocite{Goodchild2002}
\nocite{OpenGeospatialConsortium2017}
%\subsection{Databases (274) and Operating Systems (270)}
\nocite{Adya2000}
\nocite{Agrawal1993}
\nocite{Baker2011}
\nocite{Berenson1995}
\nocite{Bernstein1987}
\nocite{Chang2008}
\nocite{Cooper2008}
\nocite{Das2010}
\nocite{Das2011}
\nocite{Das2013}
\nocite{Decandia2007}
\nocite{DivyakantAgrawal2012}
\nocite{Elhardt1984}
\nocite{Elmore2011}
\nocite{Gray2006}
\nocite{Hellerstein2007}
\nocite{Kung1981}
\nocite{Larson2011}
\nocite{Weikum2002}
%\subsection OS
\nocite{McKusick1984}
\nocite{Dijkstra2001}
\nocite{Hansen1970}
\nocite{McKusick1984}
\nocite{Meyer1988}
\nocite{Rosenblum1992}
\nocite{Sandberg1985}
\nocite{Silberschatz2005}
\newpage
\appendix
\printbibliography[keyword={snow},title={Remote sensing of Snow}]
\newpage
\printbibliography[keyword={global index},title={Global Indexing}]
\newpage
\printbibliography[keyword={data citation},title={Data Citation}]
\newpage
\printbibliography[keyword={environmental informatics},title={Environmental Informatics}]
\newpage
\printbibliography[keyword={database},title={Database Systems}]
\newpage
\printbibliography[keyword={operating system},title={Operating Systems}]
\end{document}