Main repository for the paper *Unbiased Learning to Rank Meets Reality: Lessons from Baidu's Large-Scale Search Dataset*, published at the SIGIR 2024 reproducibility track.
| Repository | Description |
|---|---|
| Baidu-ULTR reproducibility | This repository, containing the code to tune, train, and evaluate all reranking methods, including reference implementations of ULTR methods in Jax and Rax. |
| Baidu-ULTR MonoBERT models | Repository containing the code to train Flax-based MonoBERT models from scratch (optionally with ULTR). |
| Reranking datasets | Code to preprocess and publish the two reranking datasets to Hugging Face (see below). |
| ULTR bias toolkit | Reference implementation of intervention harvesting methods. The RegressionEM variant used in our work was implemented in this repository. |
| Dataset | Description |
|---|---|
| Language modeling dataset | Subset of the original Baidu-ULTR used in our work to train and evaluate MonoBERT cross-encoders. |
| Reranking dataset (Baidu BERT) | The first four partitions of Baidu-ULTR with query-document embeddings produced by the official MonoBERT cross-encoder released by Baidu, plus additional LTR features computed by us. |
| Reranking dataset (our BERT) | The first four partitions of Baidu-ULTR with query-document embeddings produced by our naive MonoBERT cross-encoder, plus additional LTR features. |
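The reranking datasets can be loaded directly with the Hugging Face `datasets` library. The sketch below is illustrative only; the dataset identifier is an assumption, so substitute the exact name from the dataset cards linked above:

```python
# Minimal loading sketch. The identifier below is a placeholder assumption;
# use the exact name (and config, if the dataset defines several) from the
# Hugging Face dataset card linked in the table above.
from datasets import load_dataset

dataset = load_dataset(
    "philipphager/baidu-ultr_baidu-mlm-ctr",  # hypothetical identifier
    split="train",
)
print(dataset[0].keys())  # e.g., query-document embeddings, clicks, LTR features
```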
We list all hyperparameters used to train our models here. We train small feed-forward networks with ReLU activations on fixed query-document embeddings and LTR feature vectors to compare ULTR objectives. We tune the model architecture per dataset, and the learning rate and dropout regularization per method/dataset combination. All final hyperparameters of the reranking models are listed under `config/hyperparameters/`; a sketch of such a scorer follows below.
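For intuition, a minimal Flax sketch of such a feed-forward scorer over fixed query-document embeddings might look as follows. The layer widths, dropout rate, and the 768-dimensional embedding size are placeholder assumptions, not the tuned values from `config/hyperparameters/`:

```python
import jax
import jax.numpy as jnp
import flax.linen as nn


class FeedForwardScorer(nn.Module):
    """Small feed-forward ranker over fixed query-document embeddings.

    Layer widths and dropout rate are placeholders; the tuned values per
    dataset live under config/hyperparameters/.
    """
    hidden_dims: tuple = (128, 64)
    dropout_rate: float = 0.1

    @nn.compact
    def __call__(self, embeddings: jnp.ndarray, training: bool = False) -> jnp.ndarray:
        x = embeddings
        for dim in self.hidden_dims:
            x = nn.relu(nn.Dense(dim)(x))
            x = nn.Dropout(self.dropout_rate, deterministic=not training)(x)
        # One relevance score per query-document pair.
        return nn.Dense(1)(x).squeeze(-1)


# Forward pass on a dummy batch: 16 documents with 768-dim embeddings (assumed).
model = FeedForwardScorer()
params = model.init(jax.random.PRNGKey(0), jnp.ones((16, 768)))
scores = model.apply(params, jnp.ones((16, 768)))
```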
Figure: Position bias as estimated with the ULTR Bias Toolkit on partitions 1-3 of Baidu-ULTR.
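For intuition only, a naive way to eyeball position bias from click logs is to compute the click-through rate per rank and normalize by rank 1. This is a deliberate simplification with made-up placeholder data; the toolkit's intervention harvesting estimators additionally correct for the confounding between rank and relevance that this estimate ignores:

```python
import numpy as np

# Toy click log: one row per impression with (rank, clicked).
# Baidu-ULTR logs ranks 1..10; these arrays are made-up placeholders.
ranks = np.array([1, 1, 2, 2, 3, 3, 1, 2])
clicks = np.array([1, 0, 1, 0, 0, 1, 1, 0])

max_rank = ranks.max()
ctr_per_rank = np.array([
    clicks[ranks == r].mean() for r in range(1, max_rank + 1)
])

# Normalizing by rank 1 yields a (confounded) examination curve; intervention
# harvesting removes the rank/relevance confounding this naive estimate keeps.
naive_propensities = ctr_per_rank / ctr_per_rank[0]
print(naive_propensities)
```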
- If Poetry is available, you can install all dependencies by running:

  ```bash
  poetry install
  ```

- If Mamba is available, you can use:

  ```bash
  mamba env create --file environment.yaml
  ```

  which supports CUDA 11.8.
Select a dataset (`baidu`, `uva`, or `ltr`) and a model/loss combination (e.g., `naive-pointwise`, `regression-em`, `dla`, or `pairwise-debias`), and run:

```bash
python main.py data=baidu model=naive-pointwise
```
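The same pattern applies to any other dataset/method combination from the lists above, for example:

```bash
python main.py data=uva model=dla
```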
```bibtex
@inproceedings{Hager2024BaiduULTR,
  author = {Philipp Hager and Romain Deffayet and Jean-Michel Renders and Onno Zoeter and Maarten de Rijke},
  title = {Unbiased Learning to Rank Meets Reality: Lessons from Baidu's Large-Scale Search Dataset},
  booktitle = {Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '24)},
  organization = {ACM},
  year = {2024},
}
```
This repository is licensed under the MIT License.