CoLo-CAM: Class Activation Mapping for Object Co-Localization in Weakly-Labeled Unconstrained Videos (Pattern Recognition 2025)
by Soufiane Belharbi1, Shakeeb Murtaza1, Marco Pedersoli1, Ismail Ben Ayed1, Luke McCaffrey2, Eric Granger1
1 LIVIA, Dept. of Systems Engineering, ÉTS, Montreal, Canada
2 Goodman Cancer Research Centre, Dept. of Oncology, McGill University, Montreal, Canada
Leveraging spatiotemporal information in videos is critical for weakly supervised video object localization (WSVOL) tasks. However, state-of-the-art methods rely only on visual and motion cues, while discarding discriminative information, making them susceptible to inaccurate localizations. Recently, discriminative models have been explored for WSVOL tasks using a temporal class activation mapping (CAM) method. Although their results are promising, objects are assumed to have limited movement from frame to frame, leading to degradation in performance for relatively long-term dependencies. This paper proposes a novel CAM method for WSVOL that exploits spatiotemporal information in activation maps during training without constraining an object's position. Its training relies on co-localization, hence the name CoLo-CAM. Given a sequence of frames, localization is jointly learned based on color cues extracted across the corresponding maps, by assuming that an object has similar color in consecutive frames. CAM activations are constrained to respond similarly over pixels with similar colors, achieving co-localization. This improves localization performance because the joint learning creates direct communication among pixels across all image locations and over all frames, allowing for transfer, aggregation, and correction of localizations. Co-localization is integrated into training by minimizing the color term of a conditional random field (CRF) loss over a sequence of frames/CAMs. Extensive experiments on two challenging YouTube-Objects datasets of unconstrained videos show the merits of our CoLo-CAM method and its robustness to long-term dependencies, leading to new state-of-the-art performance for the WSVOL task.
Code: PyTorch 1.12.1
@article{belharbi2025colocam,
title={CoLo-CAM: Class Activation Mapping for Object Co-Localization in Weakly-Labeled Unconstrained Videos},
author={Belharbi, S. and Murtaza, S. and Pedersoli, M. and Ben Ayed, I. and
McCaffrey, L. and Granger, E.},
journal={Pattern Recognition},
volume={in preparation},
pages={xxx-xxx},
issue={x},
year={2025}
}
For questions or problems, please open a GitHub issue.
More demos:
002-bike.mp4
002-car.mp4
005-cat.mp4
012-car.mp4
016.mp4
016-bike.mp4
018-bike.mp4
024.mp4
025.mp4
027-car.mp4
033.mp4
036.mp4
041-dog.mp4
043-plane.mp4
shot-000002.mp4
shot-000034.mp4
shot-000045.mp4
shot-000129.mp4
shot-000178.mp4
shot-000373.mp4
See full requirements at ./dependencies/requirements.txt
- Python 3.10
- PyTorch 1.12.1
- torchvision 0.13.1
- Full dependencies: ./dependencies/requirements.txt
- Build and install the CRF wrappers:
- Install SWIG
- Build the two CRF modules (commands below; an optional import check follows the block):
cdir=$(pwd)
# Build and install the bilateral-filter CRF wrapper.
cd dlib/crf/crfwrapper/bilateralfilter
swig -python -c++ bilateralfilter.i
python setup.py install
cd $cdir
# Build and install the color bilateral-filter CRF wrapper.
cd dlib/crf/crfwrapper/colorbilateralfilter
swig -python -c++ colorbilateralfilter.i
python setup.py install
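Optionally, verify the build. This check assumes the two SWIG setups above register Python modules named bilateralfilter and colorbilateralfilter (matching the .i interface files); adjust the names if your build differs:
# Assumes the SWIG builds above installed modules named `bilateralfilter` and
# `colorbilateralfilter`; adjust the names if your build differs.
import importlib

for mod in ("bilateralfilter", "colorbilateralfilter"):
    importlib.import_module(mod)
    print(f"{mod}: import OK")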
You can use these scripts to download the datasets: cmds. Use the script _video_ds_ytov2_2.py to reformat YTOv2.2.
Once you have downloaded the datasets, adjust the paths in get_root_wsol_dataset(); a hypothetical sketch is shown below.
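For illustration only, a hypothetical version of get_root_wsol_dataset() is sketched here; the actual function is defined in the repository, and the returned path is machine-specific:
# Hypothetical sketch only: the actual get_root_wsol_dataset() lives in the repository.
# It should return the parent folder that holds the downloaded dataset directories
# (e.g. the YouTube-Objects folders).
def get_root_wsol_dataset() -> str:
    return "/absolute/path/to/datasets"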
Run code:
Examples of how to run the code.
- WSOL baselines: LayerCAM over YouTube-Objects-v1.0 using ResNet50:
cudaid=0
export CUDA_VISIBLE_DEVICES=$cudaid
getfreeport() {
freeport=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
}
export OMP_NUM_THREADS=50
export NCCL_BLOCKING_WAIT=1
getfreeport
torchrun --nnodes=1 --node_rank=0 --nproc_per_node=1 --master_port=$freeport main.py --local_world_size=1 \
--task STD_CL \
--encoder_name resnet50 \
--arch STDClassifier \
--opt__name_optimizer sgd \
--dist_backend gloo \
--batch_size 32 \
--max_epochs 100 \
--checkpoint_save 100 \
--keep_last_n_checkpoints 10 \
--freeze_cl False \
--freeze_encoder False \
--support_background True \
--method LayerCAM \
--spatial_pooling WGAP \
--dataset YouTube-Objects-v1.0 \
--box_v2_metric False \
--cudaid $cudaid \
--debug_subfolder DEBUG \
--amp True \
--plot_tr_cam_progress False \
--opt__lr 0.001 \
--opt__step_size 15 \
--opt__gamma 0.9 \
--opt__weight_decay 0.0001 \
--sample_fr_limit 0.6 \
--std_label_smooth False \
--exp_id 03_14_2023_19_49_04_857184__2897019
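For intuition, the map used by this baseline can be sketched as follows. This is a generic sketch of LayerCAM (Jiang et al., 2021), not the repository's implementation behind --method LayerCAM: each activation element is weighted by its positive gradient, channels are summed, and the result is ReLU-ed and normalized.
# Generic LayerCAM sketch, for intuition only (not the repository's implementation).
import torch
import torch.nn.functional as F
from torchvision.models import resnet50

model = resnet50(weights=None).eval()
feats = {}

def keep_output(module, inputs, output):
    output.retain_grad()          # keep the gradient of this intermediate feature map
    feats["a"] = output

model.layer4.register_forward_hook(keep_output)

x = torch.randn(1, 3, 224, 224)
logits = model(x)
logits[0, logits[0].argmax()].backward()   # gradient of the predicted class score

a = feats["a"]                             # activations, (1, C, h, w)
g = a.grad                                 # gradients of the class score w.r.t. a
cam = F.relu((F.relu(g) * a.detach()).sum(dim=1, keepdim=True))   # LayerCAM map
cam = F.interpolate(cam, size=x.shape[-2:], mode="bilinear", align_corners=False)
cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)          # normalize to [0, 1]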
Train until convergence, then store the CAMs of the train set to be used later. From the experiment folder, copy both folders YouTube-Objects-v1.0-resnet50-LayerCAM-WGAP-cp_best_localization-boxv2_False and YouTube-Objects-v1.0-resnet50-LayerCAM-WGAP-cp_best_classification-boxv2_False to the folder pretrained (a small helper sketch is shown below). They contain the best weights, which will be loaded by the CoLo-CAM model.
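If you prefer, a small helper can do the copy; the experiment folder path below is an assumption and must be adjusted to your own run:
# Copies the two best-checkpoint folders into ./pretrained.
# The experiment path is an assumption; point it at your run's output folder.
import shutil
from pathlib import Path

exp = Path("/path/to/your/experiment/folder")
dst = Path("pretrained")
dst.mkdir(exist_ok=True)
for name in (
    "YouTube-Objects-v1.0-resnet50-LayerCAM-WGAP-cp_best_localization-boxv2_False",
    "YouTube-Objects-v1.0-resnet50-LayerCAM-WGAP-cp_best_classification-boxv2_False",
):
    shutil.copytree(exp / name, dst / name, dirs_exist_ok=True)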
- CoLo-CAM: Run:
cudaid=0
export CUDA_VISIBLE_DEVICES=$cudaid
getfreeport() {
freeport=$(python -c 'import socket; s=socket.socket(); s.bind(("", 0)); print(s.getsockname()[1]); s.close()')
}
export OMP_NUM_THREADS=50
export NCCL_BLOCKING_WAIT=1
getfreeport
torchrun --nnodes=1 --node_rank=0 --nproc_per_node=1 --master_port=$freeport main.py --local_world_size=1 \
--task CoLo-CAM \
--encoder_name resnet50 \
--arch UnetCoLoCAM \
--opt__name_optimizer sgd \
--dist_backend gloo \
--batch_size 32 \
--max_epochs 10 \
--checkpoint_save 100 \
--keep_last_n_checkpoints 10 \
--freeze_cl True \
--support_background True \
--method LayerCAM \
--spatial_pooling WGAP \
--dataset YouTube-Objects-v1.0 \
--box_v2_metric False \
--cudaid $cudaid \
--debug_subfolder DEBUG \
--amp True \
--plot_tr_cam_progress False \
--opt__lr 0.01 \
--opt__step_size 5 \
--opt__gamma 0.9 \
--opt__weight_decay 0.0001 \
--sample_fr_limit 0.6 \
--elb_init_t 1.0 \
--elb_max_t 10.0 \
--elb_mulcoef 1.01 \
--sample_n_from_seq 2 \
--min_tr_batch_sz -1 \
--drop_small_tr_batch False \
--sample_n_from_seq_style before \
--sample_n_from_seq_dist uniform \
--sl_clc True \
--sl_clc_knn_t 0.0 \
--sl_clc_seed_epoch_switch_uniform -1 \
--sl_clc_epoch_switch_to_sl -1 \
--sl_clc_min_t 0.0 \
--sl_clc_lambda 1.0 \
--sl_clc_min 1000 \
--sl_clc_max 1000 \
--sl_clc_ksz 3 \
--sl_clc_max_p 0.7 \
--sl_clc_min_p 0.1 \
--sl_clc_seed_tech seed_weighted \
--sl_clc_use_roi True \
--sl_clc_roi_method largest \
--sl_clc_roi_min_size 0.05 \
--crf_clc True \
--crf_clc_lambda 2e-09 \
--crf_clc_sigma_rgb 15.0 \
--crf_clc_sigma_xy 100.0 \
--rgb_jcrf_clc True \
--rgb_jcrf_clc_lambda 9.0 \
--rgb_jcrf_clc_lambda_style adaptive \
--rgb_jcrf_clc_sigma_rgb 15.0 \
--rgb_jcrf_clc_input_data image \
--rgb_jcrf_clc_input_re_dim -1 \
--rgb_jcrf_clc_start_ep 0 \
--max_sizepos_clc True \
--max_sizepos_clc_lambda 0.01 \
--exp_id 03_14_2023_19_16_58_282581__5931773
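For intuition, below is a minimal sketch of the joint color term controlled by the --rgb_jcrf_clc_* flags: CAM activations at pixels with similar colors, pooled over all frames of the sequence, are pushed to agree, which is how localization cues are transferred and corrected across frames. This is an illustrative approximation with randomly subsampled pixel pairs (the function name joint_color_term is ours), not the repository's SWIG bilateral-filter implementation; sigma_rgb plays the role of --rgb_jcrf_clc_sigma_rgb.
# Illustrative sketch of the joint color (co-localization) term; assumed naming,
# not the repository's SWIG bilateral-filter code.
import torch

def joint_color_term(images, cams, sigma_rgb=15.0, n_pairs=4096):
    """images: (T, 3, H, W) RGB floats; cams: (T, H, W) activations in [0, 1]."""
    colors = images.permute(0, 2, 3, 1).reshape(-1, 3)   # all pixels of all frames
    acts = cams.reshape(-1)
    idx_a = torch.randint(0, colors.shape[0], (n_pairs,))
    idx_b = torch.randint(0, colors.shape[0], (n_pairs,))
    # Gaussian color affinity between the two pixels of each sampled pair.
    aff = torch.exp(-((colors[idx_a] - colors[idx_b]) ** 2).sum(1) / (2 * sigma_rgb ** 2))
    # Color-similar pixels, possibly from different frames, should respond similarly.
    return (aff * (acts[idx_a] - acts[idx_b]) ** 2).mean()

# Example: loss = joint_color_term(frames, cams)  # frames: (T, 3, H, W), cams: (T, H, W)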