\documentclass[conference]{IEEEtran}
\IEEEoverridecommandlockouts
% The preceding line is only needed to identify funding in the first footnote. If that is unneeded, please comment it
% out.
\usepackage[numbers]{natbib}
\bibliographystyle{plainnat}
\usepackage{hyperref}
\usepackage{amsmath,amssymb,amsfonts}
\usepackage{algorithmic}
\usepackage{graphicx}
\graphicspath{ {./assets/} }
\usepackage{textcomp}
\usepackage{xcolor}
\usepackage{lipsum}
\usepackage{color}
\def\BibTeX{{\rm B\kern-.05em{\sc i\kern-.025em b}\kern-.08em
T\kern-.1667em\lower.7ex\hbox{E}\kern-.125emX}}
\begin{document}
\title{Automatic detection of floating aquatic vegetation from remote sensing data}
\author{\IEEEauthorblockN{Esteve Soria Fabián}
%\IEEEauthorblockA{\textit{dept. name of organization (of Aff.)} \\
\textit{Universidad Internacional Menéndez Pelayo}\\
esofabian@gmail.com}
%\and
%\IEEEauthorblockN{2\textsuperscript{nd} Given Name Surname}
%\IEEEauthorblockA{\textit{dept. name of organization (of Aff.)} \\
%\textit{name of organization (of Aff.)}\\
%City, Country \\
%email address or ORCID}
\maketitle
\begin{abstract}
The Copernicus programme is the Earth observation component of the European Union Space Programme.
One of the missions within the programme is Sentinel-2, with two satellites orbiting the Earth.
The two satellites, Sentinel-2A and Sentinel-2B, provide images in 13 different bands with a resolution of
up to 10 m per pixel.
Bands range from the visible spectrum through the near infrared to the short-wave infrared.
The programme's free-access policy allows scientists and researchers to obtain data for several
research fields such as cropland, glacier or lake monitoring.
In this work we focus our attention on water quality monitoring.
During the study of several lakes in Spanish territory, patches of an invasive species were found floating in the water.
Finding out when, where and why these plants appear is of great interest to researchers.
Their detection is currently a manual process based on common remote-sensing indices such as NDVI and on water segmentation.
The problem with this approach is that its parameters are not robust across different locations and settings.
This work proposes an automatic search system for these forms of life that requires only a small amount of labelled training data.
By applying self-supervised learning we generate a model which can be fine-tuned with little data, achieving accuracies similar
to other works in the same field.
Without fine-tuning, the model provides image retrieval capabilities without the need for manual selection of visual features.
\end{abstract}
\newline
Code: \href{https://github.com/sorny92/satellite-surface-algae-detection}{github.com/sorny92/satellite-surface-algae-detection}.
\newline
\begin{IEEEkeywords}
remote sensing, computer vision, deep learning, image retrieval, Sentinel-2, self-supervised learning, image classification
\end{IEEEkeywords}
\section{Introduction}
Thanks to the globally connected world we live in, it is possible to communicate with people from distant countries or
enjoy food made from produce that is not native to our area.
Yet this global economy also has some disadvantages.
Due to the global transport network and the great distances we can travel, humans have become an important vector for the movement of species
between different ecosystems~\cite{invasive_species}.
Non-native species can become threats to biological ecosystems and disrupt all sorts of environments,
from lakes or forests to whole countries~\cite{bhlitem21490}.
For many years researchers have been tracking and studying different species of invasive vegetation~\cite{huang2009applications, aguir2013, donyana1, donyana2}.
In this project we seek to automatically detect the location of aquatic vegetation floating in the water.
The motivation is to provide a tool that not only helps with the detection of this vegetation, but can also detect
other types of species or even locate other kinds of images, thanks to the high-level information provided by the model generated in this work.
In this work we use self-supervised techniques to train a model that generates embeddings from a patch of a tile.
Embeddings are a numerical representation of some information, in our case an image.
Embeddings compress the information contained in an image into a much more manageable and representative list of values.
We use the Barlow Twins~\cite{barlowtwins} framework to train an embedding-generating architecture for remote sensing.
We chose Barlow Twins for its simple architecture, which does not require a momentum encoder (unlike MoCo~\cite{he2020momentum, grill2020bootstrap}).
It also needs no negative sampling, unlike SimCLR~\cite{chen2020simple}, which is helpful with satellite imagery as it allows images
from all over the world to be used indiscriminately without having to create contrastive triplets.
Lastly, the simple architecture allows pretraining on low-budget systems and requires only a minimal amount of labelled data for fine-tuning on downstream tasks.
To use the system, fine-tuning a classification head becomes trivial with only hundreds of positive and negative
samples of the target classes.
Moreover, because the model can generate embeddings from any Sentinel-2 image, it can also be used as a tool for image retrieval,
where the researcher provides example images of what they are looking for and the system returns the closest images that
have been indexed.
For this project, images have been sourced from the Sentinel-2~\cite{sentinel-2} pair of satellites, which belong to the Copernicus programme~\cite{whatiscopernicus}.
These satellites have been operating since 2015, providing remote sensing capabilities in the range of latitudes between 56º S and 84º N\@.
The pair can pass above the same point every 5 days, allowing researchers and companies to track specific locations
and study how they change over time.
The satellites capture 13 different bands or wavelengths with a resolution of up to 10 m per pixel.
Bands range from the visible spectrum through the near infrared to the short-wave infrared.
Indices combining several bands end up being very useful for some tasks, for example monitoring vegetation~\cite{TUCKER1979127}.
Another use is the tracking of water levels of different lakes or the effect of flooding and natural catastrophes.
Remote sensing has its own set of tools, approaches and problems.
Some research is approached as a traditional computer vision problem, where the researchers
design their own heuristics to track the metric they are interested in.
An example of this is the NDVI~\cite{NDVIsource} index, which uses the near-infrared and red bands to
measure leaf water content and chlorophyll content~\cite{TUCKER1979127}.
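For reference, NDVI is computed from the near-infrared and red reflectances (bands B8 and B4 on Sentinel-2):
\begin{equation}
\mathrm{NDVI} = \frac{\mathrm{NIR} - \mathrm{Red}}{\mathrm{NIR} + \mathrm{Red}},
\end{equation}
yielding values in $[-1, 1]$, with dense healthy vegetation close to 1.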
One of the problems of the heuristic approach is its own simplicity: it is excellent at measuring
vegetation, but it also comes with its own problems.
NDVI, for example, is quite sensitive to the amount of atmospheric refraction between the satellite and the ground,
and its behaviour is not constant between different capture systems~\cite{Huang2021}.
Machine learning techniques can be useful to deal with the problems of heuristic based methods as they can fit the best model
based on the data from the problem.
Thanks to the increasing amount of satellites and aerial programs, there are many different providers with remote sensing capabilities.
In section~\ref{sec:dataset} we will delve deeper into some of the data available today but nevertheless here are some details of the Sentinel-2 platform
which is used in this project:
\begin{itemize}
\item 4 bands at 10 meters resolution per pixel.
\item 6 bands at 20 meters resolution per pixel.
\item 3 bands at 60 meters resolution per pixel.
\item Range of latitudes between 56º S and 84º N.
\item Revisits the same location at the same viewing angle every 10 days, or every 5 days combining the two satellites.
\item A swath width of 290 km.
\item Tiles of 10980x10980 pixels (roughly 110 km per side at 10 m resolution).
\end{itemize}
The amount of data available is vast; the problem is gathering insights from it, as there is not much labelled data
available.
For this project a small dataset had to be gathered so we could benchmark how effective this framework is at detecting the
floating vegetation.
In this project we study an improved pipeline that eases the detection of this vegetation in lakes, but we also provide a tool
to retrieve similar images in a tile covering a region of interest.
Currently, some lakes are tracked manually, with researchers applying several filters to detect the vegetation.
This research makes use of data from the Cedillo dam in Spain, spanning from the town of Cedillo to Alcántara.
In figure~\ref{fig:satellite-image-airbus} an example of the vegetation mat can be seen.
Other researchers have worked on similar projects in Doñana National Park (Spain)~\cite{donyana1, donyana2} or the state of California (USA)~\cite{rs14133013}.
These approaches cannot scale to more lakes without more people looking for them, as the tools currently used are targeted
at specific use cases and heuristics need to be developed for each individual case.
Searching at a global scale can therefore not even be considered.
\begin{figure}[h]
\centering
\includegraphics[width=9cm]{figure_aquatic_plants}
\caption{Aerial image of a floating aquatic vegetation mat in the Cedillo lake (Spanish-Portuguese border). It can easily be identified by its distinctive
intense green colour in comparison with the water. Image from Sentinel-2.}
\label{fig:satellite-image-airbus}
\end{figure}
Our findings and contributions can be summarized as follows:
\begin{enumerate}
\item Exploration of self-supervised methods for remote-sensing applications.
\item Generation of a feature extractor model for visual representation in remote-sensing.
\item Accuracies on the task of floating aquatic vegetation detection similar to previous works, using less labelled data.
\item Demonstration of the system's image retrieval capabilities using only images as input.
\end{enumerate}
\section{Related work}
A big concern is the damage floating vegetation causes to the biodiversity of aquatic areas, as these mats can block sunlight from species that live underwater.
Several authors have worked on the detection and tracking of these mats~\cite{donyana1, donyana2,rs14133013, srilanka_veg, 10.3389/fmars.2022.1004012} in different areas of the world.
There are several approaches that can be followed:
\subsection*{Heuristic based methods}
Authors like~\citet{srilanka_veg, 10.3389/fmars.2022.1004012} use pixel-level classification models with manually annotated data from satellite images as well as unmanned aerial vehicles (UAVs).
Depending on the authors, different classes are used as targets for the classification system.
For example, \citet{rs14133013} uses 8 classes: 4 different surface-level species, plus submerged vegetation, non-photosynthetic vegetation, soil and water.
To label the data, the authors generate polygons of areas in the water based on GPS data from in-situ measurements and use GIS technologies to match them with the images from the satellite.
\citet{10.3389/fmars.2022.1004012} follow a similar approach, taking measurements of the reflectance of the different classes in situ and then matching these values to the aerial images.
They then train a model using an SVM-based classification technique, owing to the good performance SVMs offer with small amounts of data~\cite{Cortes1995}.
Other works, such as \citet{rs12244021}, achieve good performance with a country-scale detection system.
In their research they use a multistage system: water is first detected using the MNDWI index, a second stage then detects vegetation in the water using the NDVI index, and finally a classifier
identifies different species of vegetation in the water.
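For reference, MNDWI is the water counterpart of NDVI, computed from the green and short-wave infrared reflectances (bands B3 and B11 on Sentinel-2):
\begin{equation}
\mathrm{MNDWI} = \frac{\mathrm{Green} - \mathrm{SWIR}}{\mathrm{Green} + \mathrm{SWIR}},
\end{equation}
where open water yields positive values and vegetation or soil negative ones.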
This is a complete end-to-end approach that accomplishes what we are looking for in this project, but there are a number of parameters and heuristics that need to be adjusted to make the system work.
Another disadvantage is that it focuses on a single species, so the parameters and thresholds of the multistage system have to be adapted to each new problem or situation.
In our work we focus on the ability of the system to adapt to any image and retrieve similar ones based only on a small subset of images of a region of interest.
\subsection*{Data driven methods}
Since the appearance of AlexNet~\cite{NIPS2012_c399862d}, deep neural networks have become the state of the art for visual recognition.
The problem with deep learning models is the amount of data required to train them effectively.
This is a big problem for the detection of these plants, as much of the research conducted in the past required having researchers in situ to capture the data that
would then be used to create a system for tracking and detection.
There are also several challenges in image scene classification for remote sensing~\cite{9127795}:
\begin{itemize}
\item Large intraclass diversity;
\item High interclass similarity (also known as low between-class separability);
\item Large variance of object/scene scales;
\item Coexistence of multiple ground objects.
\end{itemize}
\subsubsection{Supervised}
Different tasks can be solved as supervised problems.
In remote sensing we can find pixel-level classification, where each pixel will have a label assigned to it.
This is comparable to image segmentation yet the approach to it can differ.
Works like~\citet{rs12244021} use in-situ data to generate polygons with the expected label for those pixels in the polygon.
Then a random forest classifier is trained per pixel so the entire image can have each pixel classified.
\citet{rs14133013} labels the data in the same way as before but uses it to find which heuristic works for aquatic vegetation detection.
Modern techniques, such as convolutional neural networks, allow an end-to-end approach where a model does not need to classify
pixel by pixel but instead takes the whole image as input, also called image-to-image~\cite{rs12244140}.
Pixel classification or image segmentation can only give information per individual pixel, but for some applications it is more interesting
to know which objects are present in the image rather than the area they cover, as in object detection.
In our project we focus on scene classification~\cite{9127795}, as it can be used to locate regions of interest.
This can be understood as a classification task where the input is the whole image and it is classified as a whole rather than on a per-pixel
basis.
Supervised methods normally yield better performance than their unsupervised or self-supervised alternatives,
but this comes at the cost of gathering labelled data.
The number of training samples needed to train from scratch is too high for small research teams or very specific tasks.
As we will explore later in section~\ref{sec:dataset}, there are quite a few datasets for remote-sensing applications.
The problem arises when these datasets do not fit the task the researchers want to explore.
\subsubsection{Self-supervised}
Self-supervised learning (SSL) has been shown to close the gap with supervised learning~\cite{gui2023survey}.
At the cost of longer training time, the model can learn features without labels that can later be exploited as a pretrained model.
There are several approaches to self-supervised learning: mainly generative, predictive and contrastive.
In past years the generative approach was the most common one, with the use of generative adversarial networks (GANs)~\cite{goodfellow2014generative, radford2016unsupervised}.
Nowadays, interest has shifted to contrastive methods~\cite{chen2020simple, Jung2021SelfsupervisedLW, caron2021unsupervised}, as they are showing strong results in comparison with supervised methods.
Some of these methods do not require negative labels to generate useful visual features~\cite{DINO, barlowtwins, grill2020bootstrap}.
These methods allow the exploitation of incredible amounts of data, even more than the ImageNet-based pretrained models in common use today,
because they are not limited to finding negative pairs for each anchor as SimCLR~\cite{chen2020simple} is.
SSL has also been applied for remote-sensing applications.
Works like Tile2Vec~\cite{jean2019tile2vec} exploit an architecture similar to that of SimCLR~\cite{chen2020simple} to generate a pretrained model that
can then be fine-tuned for other downstream tasks.
Some projects~\cite{inproceedings, 9460820, Li_2022, akiva2020h2onet} use SSL for semantic segmentation.
This is slightly different from works that target image classification, as these projects also focus on creating strong pretrained models covering
both the encoder and the decoder instead of just the encoder.
Even though SSL allows for the generation of good pretrained models, some sort of labelling needs to be done for the downstream task.
Researchers need to gather and label data, either for testing purposes or for fine-tuning on the task.
\subsection*{Decrease of data labelling time}
On-site data gathering is expensive, but it can be reduced with labelling tools that assist the researchers.
For example, some researchers used in-situ measurements to draw polygons delimiting the pixels that covered the area of interest.
Nowadays, with tools such as Segment Anything (SAM)~\cite{kirillov2023segment}, which can segment any type of image based on priors, the labelling time would be much shorter.
These priors can be points or bounding boxes provided as input instead of whole polygons, which take much longer to draw.
In our case the dataset contains few images.
As we only have dozens of bounding boxes, SAM could be used to find the perimeter of the vegetation mats, which could then be used as input to train a segmentation model; a sketch follows below.
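A minimal sketch of this idea, assuming the published segment-anything API and its official ViT-H checkpoint; the RGB patch and box coordinates are hypothetical placeholders:
\begin{verbatim}
# Turn a labelled bounding box into a segmentation mask
# with SAM.
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](
    checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(rgb_patch)  # HxWx3 uint8 RGB tile crop
masks, scores, _ = predictor.predict(
    box=np.array([x0, y0, x1, y1]),  # labelled rectangle
    multimask_output=False,          # one mask per box prompt
)
\end{verbatim}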
Images of the whole Earth are being generated all the time by satellites.
If we had enough images with ground truths, as in figure~\ref{fig:tile-segmented}, we could train a system to segment
these bodies in each tile, so that as the satellite moves around the Earth, an automated system could infer the segmentation masks just after the tiles are uploaded.
\begin{figure}[h]
\centering
\includegraphics[width=8cm]{segmented_tile}
\caption{Example of segmentation of floating vegetation in a lake in South Africa. Blue is the water segmentation; green, black and red are different indices used for vegetation segmentation.
The left column is the top-of-atmosphere perspective and the right column the bottom-of-atmosphere one. Source:~\citet{rs12244021}}
\label{fig:tile-segmented}
\end{figure}
\section{Datasets}\label{sec:dataset}
For remote sensing tasks there are several sources of data, mostly aerial or satellite-based.
Depending on the resolution and spectral requirements, different datasets can be used.
For example, for semantic segmentation, SEN12MS~\cite{SEN12MS_dataset} is available with multi-spectral imaging (MSI)
as well as synthetic aperture radar (SAR), providing 13 different wavelength bands and labels for land cover and land use.
Other semantic segmentation datasets, such as Potsdam~\cite{postdam_dataset}, provide high-resolution images (5 cm per pixel) of
city areas.
The biggest datasets available for classification tasks are BigEarthNet~\cite{bigearthnet}, EuroSAT~\cite{helber2019eurosat},
PatternNet~\cite{patternet} and Million-AID~\cite{millionaid}.
Examples such as BigEarthNet or EuroSAT include multispectral data (MSI) as they come from Sentinel-2, but others like PatternNet or Million-AID are restricted to RGB data.
Due to the richer information that multispectral datasets provide, for some research it is worth using MSI rather than RGB\@.
In our work we choose EuroSAT for pretraining because it provides multispectral information.
We believe this is more helpful for the system because previous works used NDVI, which relies on spectral bands, such as near infrared (NIR), that are not available in RGB images.
We could have used BigEarthNet, but due to computational limitations it is not feasible to train on the whole dataset in a reasonable time.
As previously stated, EuroSAT provides images from Sentinel-2, the same source as the data
we have manually labelled for the classification of floating aquatic vegetation.
To test the system's ability to detect floating vegetation mats, a small dataset with 140 labels was gathered.
The labels are rectangles covering areas that may at some point contain vegetation floating in the water.
The labelled data consists of a table where each region of interest (ROI) is defined by its corner coordinates in WKT format, together with
the label and the date of the image in which it was located.
An example of two images can be seen in figure~\ref{fig:vegetation_example}.
This dataset has been labelled manually for this work.
With that information, the images can be extracted from the Sentinel-2 Open Access Hub.
Sentinel-2 data can be loaded with EOReader~\cite{eoreader_paper} and Python, which makes for an easy integration with the deep learning ecosystem.
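A minimal loading sketch, assuming EOReader's \texttt{Reader} API and band constants; the product filename is a hypothetical placeholder:
\begin{verbatim}
# Open a downloaded Sentinel-2 product and load the bands
# of interest as georeferenced xarray DataArrays.
from eoreader.reader import Reader
from eoreader.bands import RED, GREEN, BLUE, NIR

reader = Reader()
prod = reader.open("S2B_MSIL2A_20210517T110619_....zip")
bands = prod.load([RED, GREEN, BLUE, NIR])
nir = bands[NIR]
\end{verbatim}
The WKT rectangles of the labelled ROIs can then be matched against the tile's georeferencing to crop out the labelled patches.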
This dataset has three labels.
The label `no vegetation' indicates images with the same coordinates as those marked as `vegetation' but from a different date, so that there is only water in the ROI\@.
Figure~\ref{fig:vegetation_example} shows an example of what they look like.
The label `unknown' marks random patches of the same size as the labelled images, taken from the same tile.
The assumption is similar to the one made in other contrastive learning works such as SimCLR and Tile2Vec~\cite{chen2020simple, jean2019tile2vec}:
data points that are far away from the anchor image are unlikely to belong to the same class.
In this case, these points are labelled with a probability of 0.1 of being aquatic vegetation, so they can be considered weak labels in the dataset.
\begin{figure}[h]
\centering
\subfloat{{\includegraphics[width=4cm]{figure_no_algae} }}%
\qquad
\subfloat{{\includegraphics[width=4cm]{figure_with_algae} }}%
\caption{On the left a 64x64 crop of an image without vegetation. On the right a 64x64 crop of an image with vegetation.
In red, the rectangle with the coordinates where the label has been assigned.
Both images are shown as RGB, yet they contain 13 spectral bands.}
\label{fig:vegetation_example}
\end{figure}
\section{Approach}
In this work we use self-supervised learning (SSL), as it has been proven to generate strong baseline models.
Following an SSL regime we can generate a model that outputs high-quality embeddings which, while not
as good as those from supervised methods, allow the use of huge amounts of unlabelled data before fine-tuning in the domain of interest.
In this paper we separate the work into two steps: model pretraining and a downstream task, the training of the classification head.
For the pretrained model we use the Barlow Twins~\cite{barlowtwins} approach, as it is a simple but powerful system for
SSL.
\subsection{Model pretraining}
Barlow Twins differentiates itself from other methods by being more resistant to embedding collapse, where all inputs map to a single embedding.
It also uses only one neural network, which reduces memory usage during training.
We mostly follow the same setup as the original publication~\cite{barlowtwins}, with some differences to adapt it to our use case.
The backbone architecture is a ResNet50~\cite{he2015deep}, shown in figure~\ref{fig:resnet50}, with 2048 outputs connected to a projector network as in the original work.
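For reference, Barlow Twins minimizes the redundancy-reduction objective of the original paper,
\begin{equation}
\mathcal{L}_{BT} = \sum_i \left(1 - \mathcal{C}_{ii}\right)^2 + \lambda \sum_i \sum_{j \neq i} \mathcal{C}_{ij}^2,
\end{equation}
where $\mathcal{C}$ is the cross-correlation matrix, computed along the batch dimension, between the projector outputs for two augmented views of the same images, and $\lambda$ weights the off-diagonal redundancy term.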
\begin{figure}
\centering
\includegraphics[width=9cm]{Resnet50}
\caption{Architecture of Resnet50}
\label{fig:resnet50}
\end{figure}
\subsubsection{Data augmentation}
In the original work a series of image augmentations are used that we cannot apply in our system.
Some of the image augmentations, such as solarization, colour jitter and conversion to grayscale, do not make sense for multi-spectral images.
Solarization is normally implemented as a clipping value in the luminance channel; it is not possible to convert multi-spectral data to
a colorspace that separates luminance, as such colorspaces are based on human perception.
Therefore, we decided not to use solarization in the image augmentation pipeline.
Colour jitter is usually implemented as a conversion to the LAB colorspace, where the A and B channels are randomly increased or decreased around their mean value.
In our case, we decided to do the same for every band of the image, as this is assumed to be the most similar behaviour.
In addition, the conversion to grayscale does not make sense, so it is not used.
Finally, as we are using EuroSAT for pretraining, we are limited in input size: the images are 64x64 pixels, and we do not want to change their scale.
We therefore chose to apply affine transformations to the images and then crop them back to 64x64 pixels.
The goal of this augmentation is to change the shape of the objects in the image to create more variety and avoid overfitting.
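A minimal sketch of this pipeline, assuming PyTorch and torchvision; the transform parameters are illustrative, not the values used in training:
\begin{verbatim}
# Augmentations for (13, 64, 64) multi-spectral tensors.
import torch
import torchvision.transforms as T

def band_jitter(x, max_shift=0.1):
    # Per-band analogue of colour jitter: shift each band
    # around its mean by a random amount.
    shift = (torch.rand(x.shape[0], 1, 1) * 2 - 1) * max_shift
    return x + shift * x.mean(dim=(1, 2), keepdim=True)

augment = T.Compose([
    T.RandomHorizontalFlip(p=0.5),
    T.Pad(8, padding_mode="reflect"),
    T.RandomAffine(degrees=15, translate=(0.1, 0.1), shear=10),
    T.CenterCrop(64),   # crop back to the EuroSAT patch size
    T.Lambda(band_jitter),
])
\end{verbatim}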
\subsubsection{Optimization}
We follow the same protocol as the original work~\cite{barlowtwins}, which in turn follows BYOL~\cite{grill2020bootstrap}.
We use the LARS optimizer and train the system for 1000 epochs.
We keep all hyperparameters the same except the batch size, where we use 256 instead of 2048, as that is the maximum we could fit in memory.
As shown in~\citet{grill2020bootstrap, chen2020simple, barlowtwins}, the larger the batch size, the better the performance during training,
so using 256 instead of 2048 will potentially affect the accuracy of the system.
\begin{figure}[h]
\centering
\includegraphics[width=9cm]{train_loss}
\caption{Training loss during pretraining.}
\label{fig:training_loss_graph}
\end{figure}
\subsection{Classification model}
Once the backbone was trained, we froze the weights of all the ResNet layers and appended a series of fully connected layers to the embedding output.
In the SSL literature it is common to append a linear layer with as many outputs as there are classes in the classification problem.
In our case we test two classification problems:
first, classification on the EuroSAT test set over the 10 available classes;
and second, detection of the presence of floating aquatic vegetation in the water.
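A minimal sketch of this setup, assuming PyTorch; adapting the first convolution to 13 input channels is our assumption for handling the multi-spectral input:
\begin{verbatim}
# Frozen ResNet50 backbone with a linear classification head.
import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50()
backbone.conv1 = nn.Conv2d(13, 64, kernel_size=7, stride=2,
                           padding=3, bias=False)  # 13 bands
backbone.fc = nn.Identity()   # expose the 2048-d embedding
# ... load the Barlow Twins pretrained weights here ...
for p in backbone.parameters():
    p.requires_grad = False   # freeze the backbone

head = nn.Linear(2048, 10)    # 10 EuroSAT classes
model = nn.Sequential(backbone, head)
\end{verbatim}
For the vegetation task the same head is used with a single output, matching the binary cross entropy loss described below.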
\subsubsection{Data augmentation}
For fine-tuning we use the same data augmentation protocol as in the pretraining.
\subsubsection{Optimization}
For EuroSAT classification, a batch size of 256 is used with a learning rate of 0.1 and the Adam optimizer for 50 epochs of training.
A CosineAnnealingLR schedule~\cite{loshchilov2017sgdr} is used during training.
For fine-tuning the aquatic vegetation mat classifier, a batch size of 32 is used, as the whole dataset is only approximately 140 images.
Binary cross entropy is used as the loss function, with the Adam optimizer and a learning rate of $1\cdot10^{-3}$.
The model is then trained for 20 epochs or until the loss stabilizes.
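In PyTorch terms, a sketch of the two settings above (variable names are illustrative):
\begin{verbatim}
import torch

# EuroSAT: Adam at lr 0.1, cosine annealing over 50 epochs.
opt = torch.optim.Adam(head.parameters(), lr=0.1)
sched = torch.optim.lr_scheduler.CosineAnnealingLR(
    opt, T_max=50)

# Vegetation classifier: binary cross entropy at lr 1e-3.
opt_veg = torch.optim.Adam(head.parameters(), lr=1e-3)
loss_fn = torch.nn.BCEWithLogitsLoss()
\end{verbatim}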
\begin{figure}[t]
\centering
\includegraphics[width=9cm]{tsne_eurosat}
\caption{t-SNE visualization of the EuroSAT test set.}
\label{fig:tsne_eurosat}
\end{figure}
\section{Main results}
\subsection{Eurosat embeddings visualization}
In figure~\ref{fig:training_loss_graph} the training curve shows no large jumps in the loss, which means the training was stable and converging.
A sudden change in the loss function could have indicated a collapse of the embeddings, but this does not seem to happen.
To check that the model generates embeddings able to represent the classes of the EuroSAT data, we use t-SNE~\cite{JMLR:v9:vandermaaten08a}.
This technique allows for easy visualization of high-dimensional embeddings, as it clusters together points with similar values while pushing apart
dissimilar ones.
In figure~\ref{fig:tsne_eurosat} we can visualize the representation.
We can see a clear cluster for the SeaLake class.
We can also observe that HerbaceousVegetation, PermanentCrop, AnnualCrop, Forest and Pasture appear more or less mixed with each other.
The Residential and Industrial classes are mostly found together, which makes sense as they have a similar appearance in satellite images.
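A minimal sketch of this visualization, assuming scikit-learn and matplotlib; \texttt{embeddings} and \texttt{labels} hold the backbone outputs and class ids of the test set:
\begin{verbatim}
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Project the 2048-d embeddings to 2-D and colour by class.
tsne = TSNE(n_components=2, init="pca")
coords = tsne.fit_transform(embeddings)
plt.scatter(coords[:, 0], coords[:, 1], c=labels,
            s=4, cmap="tab10")
plt.show()
\end{verbatim}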
\subsection{Eurosat fine-tuning}
As can be seen in table~\ref{table:eurosat_results}, our work does not improve over the supervised case.
However, this is not a problem, as it is not the goal of the project:
for the dataset that we later use for floating vegetation detection, we do not have enough data to train a whole neural network.
The results show that the embeddings generated by the model are good enough to almost achieve supervised-level results
without access to labelled data.
\begin{table}[h!]
\centering
\begin{tabular}{ |p{3cm}||p{2cm}|p{2cm}|}
\hline
Model & Architecture & Accuracy Top-1 \\
\hline
\hline
Supervised & Resnet50 & 98.5\% \\
SSL4EO-S12(MoCo)\cite{wang2023ssl4eos12} & Resnet50 & 98.0\% \\
SSL4EO-S12(DINO)\cite{wang2023ssl4eos12} & Resnet50 & 97.2\% \\
SSL+Linear (Ours) & Resnet50 & 96.5\% \\
\hline
\end{tabular}
\caption{Results for the EuroSAT classification model.}
\label{table:eurosat_results}
\end{table}
\subsection{Custom vegetation dataset embeddings visualization}
In figure~\ref{fig:tsne_vegetation} we can visualize the representation of the two classes we have labelled plus a set of images
with an unknown label.
\begin{figure}[t]
\centering
\includegraphics[width=9cm]{tsne_vegetation}
\caption{t-SNE visualization of the custom dataset.}
\label{fig:tsne_vegetation}
\end{figure}
In a similar fashion to figure~\ref{fig:tsne_eurosat}, we can see a separation between the `vegetation' and `no vegetation' labels,
indicating that the model trained on the EuroSAT dataset is able to produce significantly different embeddings for different classes.
Therefore, a boundary can be drawn between the two.
In addition, we see that the unknown data, which comes from the same tile, is distributed across the representation space rather than forming a single cluster,
showing again that the model has not collapsed.
\subsection{Custom vegetation dataset fine-tuning}
After training the model in the same way as the EuroSAT classifier with our custom dataset, we obtained an accuracy of 86\%.
This value is hard to compare to other works as the datasets are not the same or from the same area, so the data is only comparable as they look for the same
class but in different environments.
The values provided in table~\ref{table:vegetation_results} account for the same species of plant or at least the most similar situation,
which is the floating aquatic vegetation in a lake.
\begin{table}[h!]
\centering
\begin{tabular}{ |p{2.2cm}||p{1.5cm}|p{2.2cm}|p{1cm}|}
\hline
Model & Architecture & Accuracy Top-1 & F1 score \\
\hline
\hline
\citet{rs12244021} & RandomForest & 98\%* 93\%** & 87\%** \\
\citet{rs14133013} & Index-tuning & 79-91\%*** & - \\
SSL+Linear (Ours) & Resnet50 & 89.7\% & 88.9\% \\
\hline
\end{tabular}
\caption{
*This score is the accuracy over the detection of vegetation. \\
**This score is the accuracy for each class. \\
*** This score is for Water Hyacinth and Water Primrose, as they are the classes most similar to the one in our dataset}
\label{table:vegetation_results}
\end{table}
This work also approaches the problem as a scene classification problem, so we classify whether an image contains floating aquatic vegetation or not.
This is different from the other works cited in table~\ref{table:vegetation_results}, as they provide a label per pixel.
We should also consider that our model is an embedding generator, so it can be used with less data to potentially classify any Sentinel-2 image.
Our system has been trained with approximately 140 images, whereas \citet{rs12244021} used 462 images and~\citet{rs14133013} used approximately 2400 images.
We therefore use roughly 3 times and 17 times less data, respectively.
\begin{table}[h!]
\centering
\begin{tabular}{ |p{3cm}||p{1.3cm}|p{1.9cm}|p{1cm}|}
\hline
Model & Architecture & Accuracy Top-1 & F1 score \\
\hline
\hline
SSL4EO-S12(MoCo)\cite{wang2023ssl4eos12} & Resnet50 & 89.7\% & 85.6\% \\
SSL4EO-S12(DINO)\cite{wang2023ssl4eos12} & Resnet50 & 89.7\% & 88.9\% \\
SSL+Linear (Ours) & Resnet50 & 89.7\% & 88.9\% \\
\hline
\end{tabular}
\caption{Comparison with other pretrained models from SSL4EO-S12~\cite{wang2023ssl4eos12}}
\label{tab:vegetation_results_ssl}
\end{table}
We also compare our model to a couple of pretrained models from~\citet{wang2023ssl4eos12}.
That project trains models with worldwide Sentinel-2 data using SSL architectures.
We train a linear layer on top of each of the two models, as seen in table~\ref{tab:vegetation_results_ssl}.
We train the DINO~\cite{DINO} model with AdamW~\cite{loshchilov2019decoupled} and a learning rate of $1\cdot10^{-3}$,
and the MoCo-based model~\cite{chen2020mocov2} with AdamW~\cite{loshchilov2019decoupled} and a learning rate of $5\cdot10^{-2}$.
As seen in table~\ref{tab:vegetation_results_ssl}, even though our pretraining dataset is smaller than SSL4EO-S12's, we obtain
similar accuracy on this scene classification problem.
The limiting factor in accuracy for the problem is the data for fine-tuning.
By looking into the dataset and which images are failing, we can observe that some of the images that appear as `no vegetation' actually have some
vegetation.
\begin{figure}[t]
\centering
\includegraphics[width=9cm]{example_vegetation_retrieval}
\caption{Examples of image retrieval through similarity. The first image is the query and the following 9 images are the closest to it in similarity.
In this case the queried image contains floating aquatic vegetation.}
\label{fig:example_vegetation_retrieval}
\end{figure}
\subsection{Image retrieval}
The embedding generator can also be used to retrieve similar images using k-nearest neighbours (KNN).
Figures~\ref{fig:example_vegetation_retrieval}, \ref{fig:example_no_vegetation_retrieval} and~\ref{fig:example_other_retrieval} show how the system can also
be used to search for areas of research interest through similarity instead of analysing images via index parameter tuning.
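A minimal retrieval sketch, assuming scikit-learn; \texttt{embeddings} holds the indexed patch embeddings and \texttt{query} the embedding of the query image:
\begin{verbatim}
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Index the (N, 2048) embeddings and fetch the 9 nearest
# patches to the query by cosine distance.
index = NearestNeighbors(n_neighbors=9, metric="cosine")
index.fit(embeddings)
dists, idxs = index.kneighbors(query.reshape(1, -1))
# idxs[0] are the indices of the retrieved patches.
\end{verbatim}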
\begin{figure}[t]
\centering
\includegraphics[width=9cm]{example_no_vegetation_retrieval}
\caption{Same as figure~\ref{fig:example_vegetation_retrieval} but the first image has no aquatic vegetation.}
\label{fig:example_no_vegetation_retrieval}
\end{figure}
\section{Future work}
The dataset used in this work is quite small and only covers data from Europe.
There are bigger datasets, such as BigEarthNet~\cite{bigearthnet}, which cover more areas of the world, bringing more variety to the data.
Along the same lines, the authors of SSL4EO-S12~\cite{wang2023ssl4eos12} have created a dataset that aims for an equal distribution of locations, seasons and land usage.
Sadly, another problem needs to be considered: catastrophic forgetting, studied by~\citet{kirkpatrick2017overcoming, de2021continual, 10135093, purushwalkam2022challenges}.
More testing and careful benchmarking needs to be done for different seasons, latitudes and weather conditions, as it is not known
how far this similarity metric can go when these variables are not considered.
This system also uses image crops of 64x64 pixels, which might be too big or too small for a generic system.
For example, some queried images appear to be negative even though a small part of the image contains vegetation; this is a common failure of classification systems
when the target class does not occupy enough of the image.
More data at different scales could help find areas of interest whose scale does not match our system.
Another interesting line of work would be the use of other architectures, as most of the literature focuses on ResNet.
A lot of work has been done with transformers, and some works~\cite{wang2023ssl4eos12, li2022efficient} have tested ViT~\cite{dosovitskiy2021image} with good results in remote sensing.
\section{Conclusions}
We have explored the use of self-supervised learning for remote-sensing applications, specifically for the detection of floating aquatic vegetation.
Previous works have mainly focused on classical machine learning techniques, where multi-stage heuristics need to be tuned or a lot of data needs
to be gathered to train a deep learning model.
In this work we used SSL to obtain a generic model able to generate embeddings from 64x64 images, which can be used in two modes:
as a backbone for fine-tuning with a low amount of data, or as an image retrieval system.
In our application we achieved 89\% accuracy, similar to CNN-based works on the same problem, while using 17 times less labelled data.
Used for image retrieval in remote sensing, the system is comparable to an index-based search performed by researchers, but instead of selecting indices based
on prior knowledge of the ROI, the researcher can simply look for images similar to the ones they are interested in.
\section*{Acknowledgment}
I would like to thank all the professors at Universidad Internacional Menéndez Pelayo for all the content they have created for this master's degree.
Special thanks to Juan Miguel Soria for the idea for this project and for the help and guidance on satellite imaging, remote sensing and research practice.
In addition, I would like to thank Óscar Luaces and Pablo Pérez for their corrections and guidance.
Finally, thank you to my comrades in the open source community, who provide all the tools that keep research active and allow it to advance without corporate gatekeeping.
\begin{figure}[t]
\centering
\includegraphics[width=9cm]{example_other_retrieval}
\caption{Same as figure~\ref{fig:example_vegetation_retrieval}, but the queried image belongs to a different class.}
\label{fig:example_other_retrieval}
\end{figure}
\bibliography{refs}
\end{document}