CoLo-CAM: Class Activation Mapping for Object Co-Localization in Weakly-Labeled Unconstrained Videos (Pattern Recognition 2025)
by Soufiane Belharbi1, Shakeeb Murtaza1, Marco Pedersoli1, Ismail Ben Ayed1, Luke McCaffrey2, Eric Granger1
1 LIVIA, Dept. of Systems Engineering, ÉTS, Montreal, Canada
2 Goodman Cancer Research Centre, Dept. of Oncology, McGill University, Montreal, Canada
Leveraging spatiotemporal information in videos is critical for weakly supervised video object localization (WSVOL) tasks. However, state-of-the-art methods rely only on visual and motion cues, while discarding discriminative information, making them susceptible to inaccurate localizations. Recently, discriminative models have been explored for WSVOL tasks using a temporal class activation mapping (CAM) method. Although their results are promising, objects are assumed to have limited movement from frame to frame, leading to degradation in performance for relatively long-term dependencies. This paper proposes a novel CAM method for WSVOL that exploits spatiotemporal information in activation maps during training without constraining an object's position. Its training relies on co-localization, hence the name CoLo-CAM. Given a sequence of frames, localization is jointly learned based on color cues extracted across the corresponding maps, by assuming that an object has similar color in consecutive frames. CAM activations are constrained to respond similarly over pixels with similar colors, achieving co-localization. This improves localization performance because the joint learning creates direct communication among pixels across all image locations and over all frames, allowing for transfer, aggregation, and correction of localizations. Co-localization is integrated into training by minimizing the color term of a conditional random field (CRF) loss over a sequence of frames/CAMs. Extensive experiments on two challenging YouTube-Objects datasets of unconstrained videos show the merits of our CoLo-CAM method and its robustness to long-term dependencies, leading to new state-of-the-art performance for the WSVOL task.
Code: PyTorch 1.12.1
@article{belharbi2025colocam,
title={CoLo-CAM: Class Activation Mapping for Object Co-Localization in Weakly-Labeled Unconstrained Videos},
author={Belharbi, S. and Murtaza, S. and Pedersoli, M. and Ben Ayed, I. and
McCaffrey, L. and Granger, E.},
journal={Pattern Recognition},
volume={in preparation},
pages={xxx-xxx},
issue={x},
year={2025}
}
For questions or problems, please open a GitHub issue.
More demos:
002-bike.mp4
002-car.mp4
005-cat.mp4
012-car.mp4
016.mp4
016-bike.mp4
018-bike.mp4
024.mp4
025.mp4
027-car.mp4
033.mp4
036.mp4
041-dog.mp4
043-plane.mp4
shot-000002.mp4
shot-000034.mp4
shot-000045.mp4
shot-000129.mp4
shot-000178.mp4
shot-000373.mp4
See full requirements at ./dependencies/requirements.txt
- Python 3.10
- PyTorch 1.12.1
- torchvision 0.13.1
- Full dependencies: ./dependencies/requirements.txt
- Build and install the CRF wrappers:
- Install SWIG
- Build the two CRF modules (commands below; an optional import check follows the block):
cdir=$(pwd)
# Build and install the bilateral-filter CRF wrapper.
cd dlib/crf/crfwrapper/bilateralfilter
swig -python -c++ bilateralfilter.i
python setup.py install
cd $cdir
# Build and install the color bilateral-filter CRF wrapper.
cd dlib/crf/crfwrapper/colorbilateralfilter
swig -python -c++ colorbilateralfilter.i
python setup.py install
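Optionally, verify the build. This check assumes the two SWIG setups above register Python modules named bilateralfilter and colorbilateralfilter (matching the .i interface files); adjust the names if your build differs:
# Assumes the SWIG builds above installed modules named `bilateralfilter` and
# `colorbilateralfilter`; adjust the names if your build differs.
import importlib

for mod in ("bilateralfilter", "colorbilateralfilter"):
    importlib.import_module(mod)
    print(f"{mod}: import OK")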
You can use these scripts to download the datasets: cmds. Use the script _video_ds_ytov2_2.py to reformat YTOv2.2.
Once you have downloaded the datasets, adjust the paths in get_root_wsol_dataset(); a hypothetical sketch is shown below.
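For illustration only, a hypothetical version of get_root_wsol_dataset() is sketched here; the actual function is defined in the repository, and the returned path is machine-specific:
# Hypothetical sketch only: the actual get_root_wsol_dataset() lives in the repository.
# It should return the parent folder that holds the downloaded dataset directories
# (e.g. the YouTube-Objects folders).
def get_root_wsol_dataset() -> str:
    return "/absolute/path/to/datasets"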
Run code:
Examples of how to run the code.
- WSOL baselines: LayerCAM over YouTube-Objects-v1.0 using ResNet50:
cudaid=0
export CUDA_VISIBLE_DEVICES=$cudaid
getfreeport() {
freeport=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
}
export OMP_NUM_THREADS=50
export NCCL_BLOCKING_WAIT=1
getfreeport
torchrun --nnodes=1 --node_rank=0 --nproc_per_node=1 --master_port=$freeport main.py --local_world_size=1 \
--task STD_CL \
--encoder_name resnet50 \
--arch STDClassifier \
--opt__name_optimizer sgd \
--dist_backend gloo \
--batch_size 32 \
--max_epochs 100 \
--checkpoint_save 100 \
--keep_last_n_checkpoints 10 \
--freeze_cl False \
--freeze_encoder False \
--support_background True \
--method LayerCAM \
--spatial_pooling WGAP \
--dataset YouTube-Objects-v1.0 \
--box_v2_metric False \
--cudaid $cudaid \
--debug_subfolder DEBUG \
--amp True \
--plot_tr_cam_progress False \
--opt__lr 0.001 \
--opt__step_size 15 \
--opt__gamma 0.9 \
--opt__weight_decay 0.0001 \
--sample_fr_limit 0.6 \
--std_label_smooth False \
--exp_id 03_14_2023_19_49_04_857184__2897019
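For intuition, the map used by this baseline can be sketched as follows. This is a generic sketch of LayerCAM (Jiang et al., 2021), not the repository's implementation behind --method LayerCAM: each activation element is weighted by its positive gradient, channels are summed, and the result is ReLU-ed and normalized.
# Generic LayerCAM sketch, for intuition only (not the repository's implementation).
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50(weights=None).eval()
feats = {}

def keep_output(module, inputs, output):
    output.retain_grad()          # keep the gradient of this intermediate feature map
    feats["a"] = output

model.layer4.register_forward_hook(keep_output)

x = torch.randn(1, 3, 224, 224)
logits = model(x)
logits[0, logits[0].argmax()].backward()   # gradient of the predicted class score

a = feats["a"]                             # activations, (1, C, h, w)
g = a.grad                                 # gradients of the class score w.r.t. a
cam = F.relu((F.relu(g) * a.detach()).sum(dim=1, keepdim=True))   # LayerCAM map
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)          # normalize to [0, 1]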
Train until convergence, then store the CAMs of the train set to be used later. From the experiment folder, copy both folders YouTube-Objects-v1.0-resnet50-LayerCAM-WGAP-cp_best_localization-boxv2_False and YouTube-Objects-v1.0-resnet50-LayerCAM-WGAP-cp_best_classification-boxv2_False to the folder pretrained (a small helper sketch is shown below). They contain the best weights, which will be loaded by the CoLo-CAM model.
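If you prefer, a small helper can do the copy; the experiment folder path below is an assumption and must be adjusted to your own run:
# Copies the two best-checkpoint folders into ./pretrained.
# The experiment path is an assumption; point it at your run's output folder.
import shutil
from pathlib import Path

exp = Path("/path/to/your/experiment/folder")
dst = Path("pretrained")
dst.mkdir(exist_ok=True)
for name in (
    "YouTube-Objects-v1.0-resnet50-LayerCAM-WGAP-cp_best_localization-boxv2_False",
    "YouTube-Objects-v1.0-resnet50-LayerCAM-WGAP-cp_best_classification-boxv2_False",
):
    shutil.copytree(exp / name, dst / name, dirs_exist_ok=True)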
- CoLo-CAM: Run:
cudaid=0
export CUDA_VISIBLE_DEVICES=$cudaid
getfreeport() {
freeport=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
}
export OMP_NUM_THREADS=50
export NCCL_BLOCKING_WAIT=1
getfreeport
torchrun --nnodes=1 --node_rank=0 --nproc_per_node=1 --master_port=$freeport main.py --local_world_size=1 \
--task CoLo-CAM \
--encoder_name resnet50 \
--arch UnetCoLoCAM \
--opt__name_optimizer sgd \
--dist_backend gloo \
--batch_size 32 \
--max_epochs 10 \
--checkpoint_save 100 \
--keep_last_n_checkpoints 10 \
--freeze_cl True \
--support_background True \
--method LayerCAM \
--spatial_pooling WGAP \
--dataset YouTube-Objects-v1.0 \
--box_v2_metric False \
--cudaid $cudaid \
--debug_subfolder DEBUG \
--amp True \
--plot_tr_cam_progress False \
--opt__lr 0.01 \
--opt__step_size 5 \
--opt__gamma 0.9 \
--opt__weight_decay 0.0001 \
--sample_fr_limit 0.6 \
--elb_init_t 1.0 \
--elb_max_t 10.0 \
--elb_mulcoef 1.01 \
--sample_n_from_seq 2 \
--min_tr_batch_sz -1 \
--drop_small_tr_batch False \
--sample_n_from_seq_style before \
--sample_n_from_seq_dist uniform \
--sl_clc True \
--sl_clc_knn_t 0.0 \
--sl_clc_seed_epoch_switch_uniform -1 \
--sl_clc_epoch_switch_to_sl -1 \
--sl_clc_min_t 0.0 \
--sl_clc_lambda 1.0 \
--sl_clc_min 1000 \
--sl_clc_max 1000 \
--sl_clc_ksz 3 \
--sl_clc_max_p 0.7 \
--sl_clc_min_p 0.1 \
--sl_clc_seed_tech seed_weighted \
--sl_clc_use_roi True \
--sl_clc_roi_method largest \
--sl_clc_roi_min_size 0.05 \
--crf_clc True \
--crf_clc_lambda 2e-09 \
--crf_clc_sigma_rgb 15.0 \
--crf_clc_sigma_xy 100.0 \
--rgb_jcrf_clc True \
--rgb_jcrf_clc_lambda 9.0 \
--rgb_jcrf_clc_lambda_style adaptive \
--rgb_jcrf_clc_sigma_rgb 15.0 \
--rgb_jcrf_clc_input_data image \
--rgb_jcrf_clc_input_re_dim -1 \
--rgb_jcrf_clc_start_ep 0 \
--max_sizepos_clc True \
--max_sizepos_clc_lambda 0.01 \
--exp_id 03_14_2023_19_16_58_282581__5931773
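For intuition, below is a minimal sketch of the joint color term controlled by the --rgb_jcrf_clc_* flags: CAM activations at pixels with similar colors, pooled over all frames of the sequence, are pushed to agree, which is how localization cues are transferred and corrected across frames. This is an illustrative approximation with randomly subsampled pixel pairs (the function name joint_color_term is ours), not the repository's SWIG bilateral-filter implementation; sigma_rgb plays the role of --rgb_jcrf_clc_sigma_rgb.
# Illustrative sketch of the joint color (co-localization) term; assumed naming,
# not the repository's SWIG bilateral-filter code.
import torch

def joint_color_term(images, cams, sigma_rgb=15.0, n_pairs=4096):
    """images: (T, 3, H, W) RGB floats; cams: (T, H, W) activations in [0, 1]."""
    colors = images.permute(0, 2, 3, 1).reshape(-1, 3)   # all pixels of all frames
    acts = cams.reshape(-1)
    idx_a = torch.randint(0, colors.shape[0], (n_pairs,))
    idx_b = torch.randint(0, colors.shape[0], (n_pairs,))
    # Gaussian color affinity between the two pixels of each sampled pair.
    aff = torch.exp(-((colors[idx_a] - colors[idx_b]) ** 2).sum(1) / (2 * sigma_rgb ** 2))
    # Color-similar pixels, possibly from different frames, should respond similarly.
    return (aff * (acts[idx_a] - acts[idx_b]) ** 2).mean()

# Example: loss = joint_color_term(frames, cams)  # frames: (T, 3, H, W), cams: (T, H, W)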