ETCI2021 high test acc up to 0.9 and then Nan with loss #745

liecn · 2022-08-30T02:47:44Z

Hi,
I am trying to implement ETCI2021 training in a federated learning setup. However, the test accuracy reaches 0.9 after one round of training, and in the next round the loss becomes NaN and the accuracy drops to 0.

I used the scripts to download the dataset (train and val) and adopted the training strategy from the repo, including the loss, optimizer, model, and hyperparameters shown below. I suspect the learning rate is too large, so I am working on adjusting it. Thanks for your help.

lr = 0.001
self.model = smp.Unet(
    encoder_name='resnet50',
    encoder_weights=None,
    in_channels=6,
    classes=2,
)

criterion = nn.CrossEntropyLoss(ignore_index=0)

tmp_optimizer = optim.Adam(model.parameters(), lr=lr)

test_accuracy = Accuracy(num_classes=2, ignore_index=0, mdmc_average="global")

# code for testing ACC
for data, target in dataloader:
    if torch.cuda.is_available():
        data, target = data.cuda(), target.cuda()
    output = model(data)
    loss = criterion(output, target)

    y_pred = output.argmax(dim=1)
    correct = self.test_accuracy(y_pred, target)

accuracy = self.test_accuracy.compute()
self.test_accuracy.reset()

Round-0, global model test accuracy = 0.3805171847343445, loss = 0.582478940486908
Round-1, global model test accuracy = 0.9026516675949097, loss = 0.10766775161027908
Round-2, global model test accuracy = 0.0, loss = nan

isaaccorley · 2022-08-30T13:16:55Z

Maintainer

Your snippet does not appear to call loss.backward(), optimizer.step(), or optimizer.zero_grad(), all of which are needed in a PyTorch training loop. Could this be causing your issue?
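For reference, here is a minimal sketch of the order in which those calls usually appear in one training step. It assumes a toy one-layer model and a synthetic batch (both hypothetical stand-ins), not the Unet or the ETCI2021 loader from the question.

```python
import torch
from torch import nn, optim

torch.manual_seed(0)

# Hypothetical stand-in for the segmentation model: 6 input channels, 2 classes.
model = nn.Conv2d(in_channels=6, out_channels=2, kernel_size=1)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=1e-3)

# Synthetic batch: 4 images of 8x8 pixels with per-pixel labels in {0, 1}.
data = torch.randn(4, 6, 8, 8)
target = torch.randint(0, 2, (4, 8, 8))

weights_before = model.weight.detach().clone()

model.train()
optimizer.zero_grad()             # clear gradients left over from a previous step
output = model(data)              # forward pass, logits of shape (4, 2, 8, 8)
loss = criterion(output, target)  # scalar loss averaged over all pixels
loss.backward()                   # compute gradients of the loss
optimizer.step()                  # update the parameters from those gradients

weights_changed = not torch.equal(weights_before, model.weight.detach())
print(f"loss={loss.item():.4f}, weights changed: {weights_changed}")
```

If the weights do not change after a step, or the loss is already non-finite on the first batch, the problem most likely lies before the optimizer (data, labels, or loss configuration) rather than in the loop itself.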

4 replies

liecn Aug 30, 2022
Author

Thanks for your reply. It seems not. Please find my training code snippet below. Again, thanks so much for your time.

def local_train(self, args, idxs, dataset):
        model = copy.deepcopy(self.model)
        client_id = self.client_id

        if torch.cuda.is_available():
            model.to(self.device)
            self.test_accuracy.cuda()
        dataloader = DataLoader(DatasetSplit(dataset, idxs), batch_size=args.local_bs, shuffle=True, num_workers=args.num_workers,prefetch_factor=args.prefetch_factor,pin_memory=True)
        
        criterion = nn.CrossEntropyLoss(ignore_index=0)

        model.train()
        model.zero_grad()
        loss_sum = 0
        correct = 0

        prev_model = copy.deepcopy(model)

        if args.tl_mode == -1:
            tmp_optimizer = optim.Adam(model.parameters(), lr=args.lr)
        else:
            frozen_model_(args, model)
            tmp_optimizer = optim.Adam(filter(lambda p: p.requires_grad, model.parameters()), lr=args.lr) 
       
        for jj in range(args.local_ep):
            for data, target in dataloader:
                if torch.cuda.is_available():
                    data, target = data.to(self.device), target.to(self.device)
                
                output = model(data)
                loss = criterion(output, target)
                loss.backward()
                loss_sum += loss.item()

                y_pred = output.argmax(dim=1)
                self.test_accuracy.update(y_pred,target)
              
                tmp_optimizer.step()
                
                tmp_optimizer.zero_grad()

        accuracy=self.test_accuracy.compute()
        self.test_accuracy.reset()
        
        loss_out = loss_sum / len(dataloader) / args.local_ep

        return sub_weights(model.state_dict(), prev_model.state_dict()),len(idxs),loss_utility, accuracy

isaaccorley Aug 30, 2022
Maintainer

You seem to be computing test accuracy on the training set. Is this intended? Additionally, DatasetSplit is not part of torchgeo, so it is difficult to debug for you. I would recommend using torchgeo.datamodules.ETCI2021DataModule and getting the train/val/test dataloaders as shown below, since this includes additional image/mask preprocessing. Note that the test set labels were never provided, so you can't compute metrics on the test set; you can use the val set instead. Maybe this will solve your issue.

from torchgeo.datamodules import ETCI2021DataModule

dm = ETCI2021DataModule()
dm.setup()
train_dataloader = dm.train_dataloader()
val_dataloader = dm.val_dataloader()
test_dataloader = dm.test_dataloader()
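To keep evaluation separate from training, a held-out loader can be scored in its own loop with the model in eval mode and gradients disabled. The sketch below assumes a toy model and a synthetic validation set that yields (image, mask) tuples. Batches from the datamodule may be structured differently (for example as dictionaries), in which case the unpacking line would need adapting.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Hypothetical stand-ins: a tiny model and a synthetic validation set.
model = nn.Conv2d(in_channels=6, out_channels=2, kernel_size=1)
val_images = torch.randn(8, 6, 16, 16)
val_masks = torch.randint(0, 2, (8, 16, 16))
val_loader = DataLoader(TensorDataset(val_images, val_masks), batch_size=4)


def evaluate(model, loader):
    """Return pixel accuracy over a held-out loader without updating the model."""
    model.eval()  # switch layers such as batch norm and dropout to inference mode
    correct, total = 0, 0
    with torch.no_grad():  # no gradients are needed for evaluation
        for data, target in loader:
            y_pred = model(data).argmax(dim=1)
            correct += (y_pred == target).sum().item()
            total += target.numel()
    return correct / total


val_accuracy = evaluate(model, val_loader)
print(f"val accuracy: {val_accuracy:.3f}")
```

Calling model.train() again before the next training round restores training behavior.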

liecn Aug 30, 2022
Author

Hi,
Thanks. I also suspect it looks like test performance measured on the training set. But I do use the val split (downloaded using the scripts) as the test dataset. DatasetSplit is a wrapper class that samples the data by a given list of indices.

class DatasetSplit(Dataset):
    def __init__(self, dataset, idxs):
        self.dataset = dataset
        self.idxs = list(idxs)

    def __len__(self):
        return len(self.idxs)

    def __getitem__(self, item):
        image, label = self.dataset[self.idxs[item]]
        return image, label

Currently, I use criterion = JaccardLoss(mode="multiclass", classes=2) and JaccardIndex(num_classes=2, ignore_index=0). I am quite confused by the performance.

Could you please show me what a normal accuracy-per-round plot looks like? That would be really helpful. Thanks.

isaaccorley Aug 30, 2022
Maintainer

In your code above, the line self.test_accuracy.update(y_pred,target) computes the metric from y_pred and target, which appear to come from the same data you are training on.
The labels of ETCI2021 are background=0, flood=255, which is why I recommend using our ETCI2021DataModule: it performs additional preprocessing of images and masks to bring them into the range [0, 1]. Since you aren't using the ETCI2021DataModule, I would recommend mapping values from 255 -> 1 in the target masks.
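A minimal sketch of that 255 -> 1 mapping, assuming the mask is already a tensor that uses the raw label convention described above (the example mask is synthetic):

```python
import torch

# Synthetic mask with the raw labels: background=0, flood=255.
raw_mask = torch.tensor([[0, 255, 255],
                         [0, 0, 255]], dtype=torch.uint8)

# Map 255 -> 1 and convert to int64, the dtype nn.CrossEntropyLoss
# expects for class-index targets.
mask = (raw_mask == 255).long()

print(mask.tolist())  # [[0, 1, 1], [0, 0, 1]]
```

In a wrapper such as DatasetSplit, this could be applied to the label in __getitem__ before it is returned, so that the loss and the metric both see class indices 0 and 1.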

isaaccorley · 2022-08-30T18:02:32Z

Maintainer

@liecn Please see the following gist for ETCI2021 train/val script which uses raw PyTorch. I didn't experience any NaN loss or 0 train/val accuracy. Let me know if you have any further issues.

2 replies

liecn Aug 30, 2022
Author

@isaaccorley Hi, I just ran your script. The val accuracy reaches 1.0 after only one training round, which seems odd. The paper reports 45.77%.

isaaccorley Aug 30, 2022
Maintainer

45.77% is mIoU, not accuracy. This is explained in the table caption in the paper.
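To illustrate why the two numbers are not comparable, the sketch below computes both metrics on a synthetic, heavily imbalanced mask. The helper functions are illustrative definitions written for this example, not the implementation used in the paper or in torchmetrics.

```python
def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted class matches the label."""
    return sum(p == t for p, t in zip(pred, target)) / len(target)


def mean_iou(pred, target, num_classes=2):
    """Average per-class intersection-over-union; classes absent from both are skipped."""
    ious = []
    for c in range(num_classes):
        intersection = sum(p == c and t == c for p, t in zip(pred, target))
        union = sum(p == c or t == c for p, t in zip(pred, target))
        if union > 0:
            ious.append(intersection / union)
    return sum(ious) / len(ious)


# Synthetic flattened mask: 95 background pixels (0) and 5 flood pixels (1).
target = [0] * 95 + [1] * 5
# A model that predicts background everywhere.
pred = [0] * 100

print(pixel_accuracy(pred, target))  # 0.95
print(mean_iou(pred, target))        # 0.475
```

A model that never predicts flood still scores 95% accuracy here, while its mIoU is below 50% because the flood class contributes an IoU of 0.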
