Replies: 3 comments 3 replies
-
These benchmarks are great, but how do they compare to other state-of-the-art models? For example, your ViT reaches 98% validation accuracy on EuroSAT, but Google reported 99.2% accuracy on EuroSAT with a simple ResNet-50 model five years ago. I would recommend against using saturated benchmarks like EuroSAT, on which even plain ImageNet weights can easily reach 98%+ accuracy. There are far better, larger, and newer benchmark suites out there (see Table 2a of this paper).
-
@srmsoumya is it documented what resources were used for fine-tuning, similar to the Training Card for training the Clay model from scratch? My assumption is that this was done on a single node with one p5.48x instance, but I'm not sure whether the memory footprint of fine-tuning is smaller because the encoder is frozen, and whether smaller instances can be used. https://clay-foundation.github.io/model/release-notes/specification.html#training-card
-
@rbavery For fine-tuning, we don't need large VM instances like that; we can use smaller instances by adjusting the batch size accordingly.
-
Experiment Overview
We have conducted experiments with the Clay model on various downstream tasks, specifically focusing on classification and segmentation. Note that in all cases, the Clay encoder remains frozen, and only the additional layers are trained.
Initial Observations
The Clay model shows strong learning capabilities, with most tasks being learned effectively within the first epoch, after which performance plateaus.
Classification Task
For the classification task, we added a fully connected (FC) block on top of the Clay encoder. After 5 minutes of training, the model achieved a training accuracy of 0.985 and a validation accuracy of 0.98. The loss curves indicate that most of the learning occurs within the first epoch.
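As a rough illustration of this setup, the sketch below shows a frozen encoder with a trainable FC block on top. This is a minimal PyTorch sketch, not the actual training code: the real Clay encoder, its embedding size, and the head architecture are all assumptions, and a plain linear layer stands in for the encoder.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """A fully connected block trained on top of a frozen encoder."""
    def __init__(self, encoder: nn.Module, embed_dim: int, num_classes: int):
        super().__init__()
        self.encoder = encoder
        for p in self.encoder.parameters():   # freeze the pretrained encoder
            p.requires_grad = False
        self.head = nn.Sequential(            # only these layers are trained
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_classes),
        )

    def forward(self, x):
        with torch.no_grad():                 # no gradients through the encoder
            feats = self.encoder(x)
        return self.head(feats)

# Stand-in encoder (hypothetical): a linear map from flattened input to embeddings.
encoder = nn.Linear(64, 128)
model = ClassificationHead(encoder, embed_dim=128, num_classes=10)
logits = model(torch.randn(4, 64))           # batch of 4 dummy samples
```

Because the encoder's parameters have `requires_grad=False`, the optimizer only updates the head, which also keeps the memory footprint of fine-tuning small.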
Training statistics
Validation statistics
Dataset Used: EuroSAT
Dataset Citation:
Segmentation Task
For the segmentation task, we tested the model on the Chesapeake Bay CVPR dataset. We attached a decoder similar to Segformer, which extracts features from intermediate layers and fuses them to predict segmentation masks.
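A minimal sketch of that fusion idea is below: project each intermediate feature map to a common dimension, upsample everything to the largest resolution, concatenate, and fuse into per-pixel class logits. The channel sizes, fusion dimension, and use of 1x1 convolutions are assumptions for illustration, not the exact decoder used.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionDecoder(nn.Module):
    """Segformer-style decoder: project multi-scale features to a common
    dimension, upsample to the largest resolution, concatenate, and fuse."""
    def __init__(self, in_dims, fuse_dim, num_classes):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(d, fuse_dim, 1) for d in in_dims)
        self.fuse = nn.Conv2d(fuse_dim * len(in_dims), fuse_dim, 1)
        self.classify = nn.Conv2d(fuse_dim, num_classes, 1)

    def forward(self, feats):                 # feats: list of (B, C_i, H_i, W_i)
        target = feats[0].shape[-2:]          # largest spatial resolution
        up = [F.interpolate(p(f), size=target, mode="bilinear", align_corners=False)
              for p, f in zip(self.proj, feats)]
        return self.classify(self.fuse(torch.cat(up, dim=1)))

# Dummy intermediate features at three scales (shapes are assumptions).
feats = [torch.randn(2, 64, 32, 32),
         torch.randn(2, 128, 16, 16),
         torch.randn(2, 256, 8, 8)]
decoder = FusionDecoder([64, 128, 256], fuse_dim=96, num_classes=7)
masks = decoder(feats)                        # per-pixel class logits
```

The output has the spatial size of the largest feature map; a final upsample to the input resolution would follow in practice.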
We used a subset of the dataset consisting of 2000 random samples for training and validation. After 10 minutes of training (10 epochs), the model showed a similar learning pattern, with most of the learning occurring in the first epoch. The validation scores were a weighted IoU of 0.875 and an F1 score of 0.93.
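For reference, scores like these can be computed from a per-class confusion matrix. The sketch below is a minimal NumPy implementation; the averaging scheme (weights proportional to each class's pixel frequency) is an assumption about what "weighted" means here.

```python
import numpy as np

def weighted_iou_f1(pred, target, num_classes):
    """Per-class IoU and F1 from a confusion matrix, averaged with weights
    proportional to each class's pixel frequency (weighting is an assumption)."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(target.ravel(), pred.ravel()):
        cm[t, p] += 1                          # rows: true class, cols: predicted
    tp = np.diag(cm).astype(float)
    fp = cm.sum(axis=0) - tp                   # predicted as class but wrong
    fn = cm.sum(axis=1) - tp                   # missed pixels of the class
    iou = tp / np.maximum(tp + fp + fn, 1)
    f1 = 2 * tp / np.maximum(2 * tp + fp + fn, 1)
    weights = cm.sum(axis=1) / cm.sum()        # class pixel frequency
    return float(weights @ iou), float(weights @ f1)

# Tiny 2x2 example with two classes.
pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
miou, mf1 = weighted_iou_f1(pred, target, num_classes=2)  # → 0.625, 0.7667
```

Note that F1 (Dice) is always at least as large as IoU for the same predictions, consistent with the 0.93 vs. 0.875 scores above.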
Training statistics
Validation statistics
Prediction on sample masks: Image / Ground-Truth Mask / Predicted Mask
Dataset Used: Chesapeake Bay CVPR
Dataset Citation:
Decoder Reference:
Next Steps