Ze Zhong Wu, Ryan Clancy, and Jimmy Lin
This is the Docker image for the Anserini toolkit (v0.5.1), with Elasticsearch indexing, conforming to the OSIRRC jig for the Open-Source IR Replicability Challenge (OSIRRC) at SIGIR 2019.
- Supported test collections: `robust04`, `core17`, `core18` (newswire); `gov2`, `cw09b`, `cw12b` (web)
- Supported hooks: `init`, `index`, `interact`
The search results are the same as those of `anserini-docker`, so we report those results here.
The following `jig` command can be used to index TREC disks 4/5 for `robust04`:
python run.py prepare --repo osirrc2019/elastirini --tag <tag> --collections robust04=/path/to/disk45=trectext
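As a rough illustration of how the `--collections` argument is put together (`name=path=format`), the sketch below assembles the same command as an argument list. The helper name `build_prepare_command` is hypothetical and not part of the jig; only the flags that appear in the command above are used, and `<tag>` is left as a placeholder.

```python
import shlex


def build_prepare_command(repo, tag, name, path, fmt):
    """Assemble the argv list for the jig `prepare` command shown above.

    The collection is rendered as name=path=format, e.g.
    robust04=/path/to/disk45=trectext. This helper is hypothetical.
    """
    collection_spec = "=".join([name, path, fmt])
    return [
        "python", "run.py", "prepare",
        "--repo", repo,
        "--tag", tag,
        "--collections", collection_spec,
    ]


cmd = build_prepare_command(
    "osirrc2019/elastirini", "<tag>", "robust04", "/path/to/disk45", "trectext"
)
# shlex.join quotes arguments as needed, so the printed line is shell-safe.
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` from the jig directory would be one way to launch it once `<tag>` is replaced with a real tag.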
The following `jig` command starts the image in interactive mode so that retrieval can be performed against the indexed `robust04` test collection:
python run.py interact --repo osirrc2019/elastirini --tag <tag>
Here, `<tag>` is a valid tag.
After entering the above command, run `docker port [container id]` to list the port mappings. These show which host ports to use to access Elasticsearch and Kibana.
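To make the port lookup concrete, here is a minimal sketch that parses `docker port` output and derives the host URLs. It assumes the usual `<container port>/<proto> -> <host ip>:<host port>` line format and that Elasticsearch and Kibana listen on the exposed container ports 9200 and 5601. The function name and the host port numbers in `sample` are made up for illustration.

```python
import re

# Matches lines such as "9200/tcp -> 0.0.0.0:32768" as printed by `docker port`.
_PORT_LINE = re.compile(r"^(\d+)/(tcp|udp)\s*->\s*(.+):(\d+)$")


def parse_port_mappings(output):
    """Map each container port to the list of host ports it is published on."""
    mappings = {}
    for line in output.splitlines():
        match = _PORT_LINE.match(line.strip())
        if not match:
            continue  # ignore anything that is not a port mapping line
        container_port = int(match.group(1))
        host_port = int(match.group(4))
        ports = mappings.setdefault(container_port, [])
        if host_port not in ports:  # IPv4 and IPv6 lines often repeat the port
            ports.append(host_port)
    return mappings


# Illustrative output only; real host ports are assigned by Docker.
sample = "5601/tcp -> 0.0.0.0:32769\n9200/tcp -> 0.0.0.0:32768"
ports = parse_port_mappings(sample)
print("Elasticsearch:", f"http://localhost:{ports[9200][0]}")
print("Kibana:", f"http://localhost:{ports[5601][0]}")
```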
The Anserini image supports the following retrieval methods:
- BM25: k1=0.9, b=0.4 (Robertson et al., 1995)
- QL (query likelihood with Dirichlet smoothing): mu=1000 (Zhai and Lafferty, 2001)
- +RM3 (RM3 variant of relevance models, applied on top of initial BM25 or QL results): exact parameter settings (Abdul-Jaleel et al., 2004)
- +Ax (Semantic term matching in the axiomatic framework, applied on top of initial BM25 or QL results): exact parameter settings (Fang and Zhai, 2006)
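To show what the BM25 and QL parameters above control, the following sketch scores a single query term with the textbook forms of both models. It is an illustration only and not the Lucene code that Anserini runs, which may differ in details such as document-length encoding. The function names and the collection statistics passed to them are hypothetical.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, doc_freq, num_docs, k1=0.9, b=0.4):
    """Textbook BM25 contribution of one query term to one document.

    k1 controls term-frequency saturation; b controls length normalization.
    """
    if tf == 0:
        return 0.0
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return idf * tf * (k1 + 1.0) / (tf + length_norm)


def ql_dirichlet_term_score(tf, doc_len, collection_prob, mu=1000.0):
    """Log-probability of one query term under a Dirichlet-smoothed document model.

    collection_prob is the term's probability in the whole collection;
    mu is the smoothing mass added to every document.
    """
    return math.log((tf + mu * collection_prob) / (doc_len + mu))


# Hypothetical statistics for one term in one document.
print(bm25_term_score(tf=3, doc_len=500, avg_doc_len=450, doc_freq=1200, num_docs=500000))
print(ql_dirichlet_term_score(tf=3, doc_len=500, collection_prob=0.0004))
```

A document's score for a query is the sum of these per-term values over the query terms.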
The following numbers should be reproducible using the scripts provided in the `bin` directory.
MAP | BM25 | BM25+RM3 | BM25+Ax | QL | QL+RM3 | QL+Ax |
---|---|---|---|---|---|---|
TREC 2004 Robust Track Topics | 0.2531 | 0.2903 | 0.2895 | 0.2467 | 0.2747 | 0.2774 |
MAP | BM25 | BM25+RM3 | BM25+Ax | QL | QL+RM3 | QL+Ax |
---|---|---|---|---|---|---|
TREC 2017 Common Core Track Topics | 0.2087 | 0.2823 | 0.2787 | 0.2032 | 0.2606 | 0.2613 |
MAP | BM25 | BM25+RM3 | BM25+Ax | QL | QL+RM3 | QL+Ax |
---|---|---|---|---|---|---|
TREC 2018 Common Core Track Topics | 0.2495 | 0.3136 | 0.2920 | 0.2526 | 0.3073 | 0.2966 |
MAP | BM25 | BM25+RM3 | BM25+Ax | QL | QL+RM3 | QL+Ax |
---|---|---|---|---|---|---|
TREC 2004 Terabyte Track: Topics 701-750 | 0.2689 | 0.2844 | 0.2665 | 0.2681 | 0.2708 | 0.2666 |
TREC 2005 Terabyte Track: Topics 751-800 | 0.3390 | 0.3820 | 0.3664 | 0.3303 | 0.3559 | 0.3646 |
TREC 2006 Terabyte Track: Topics 801-850 | 0.3080 | 0.3377 | 0.3069 | 0.2996 | 0.3154 | 0.3084 |
MAP | BM25 | BM25+RM3 | BM25+Ax | QL | QL+RM3 | QL+Ax |
---|---|---|---|---|---|---|
TREC 2010 Web Track: Topics 51-100 | 0.1126 | 0.0933 | 0.0928 | 0.1060 | 0.1019 | 0.1086 |
TREC 2011 Web Track: Topics 101-150 | 0.1094 | 0.1081 | 0.0974 | 0.0958 | 0.0837 | 0.0879 |
TREC 2012 Web Track: Topics 151-200 | 0.1106 | 0.1107 | 0.1315 | 0.1069 | 0.1059 | 0.1212 |
MAP | BM25 | BM25+RM3 | BM25+Ax | QL | QL+RM3 | QL+Ax |
---|---|---|---|---|---|---|
TREC 2013 Web Track: Topics 201-250 | 0.0468 | 0.0412 | 0.0435 | 0.0397 | 0.0322 | 0.0359 |
TREC 2014 Web Track: Topics 251-300 | 0.0224 | 0.0210 | 0.0180 | 0.0235 | 0.0203 | 0.0186 |
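For readers unfamiliar with the metric in the tables above, the sketch below shows how mean average precision (MAP) is commonly defined. It is a simplified illustration with hypothetical function names and toy data. It does not recompute the reported numbers and is not a substitute for a standard evaluation tool such as `trec_eval`. It assumes each ranked list contains no duplicate document IDs.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average of the precision values at each rank where a relevant doc appears."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs, qrels):
    """Mean of per-topic average precision over all topics that have judgments."""
    topics = list(qrels)
    if not topics:
        return 0.0
    total = sum(average_precision(runs.get(t, []), qrels[t]) for t in topics)
    return total / len(topics)


# Toy data: two topics with made-up document IDs.
runs = {"t1": ["d1", "d2", "d3", "d4"], "t2": ["d9", "d7"]}
qrels = {"t1": {"d1", "d3"}, "t2": {"d7"}}
print(mean_average_precision(runs, qrels))
```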
The following is a quick breakdown of what happens in each of the scripts in this repo.
The `Dockerfile` installs dependencies, sets the Java home path, copies scripts to the root dir, exposes ports 9200 and 5601, and sets the working dir to `/work`.
The `init` script clones Anserini, installs the ELK stack, and configures it.
The `index` script indexes the collection with Elasticsearch.
- Stephen E. Robertson, Steve Walker, Micheline Hancock-Beaulieu, Mike Gatford, and A. Payne. (1995) Okapi at TREC-4. TREC.
- ChengXiang Zhai and John Lafferty. (2001) A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval. SIGIR.
- Nasreen Abdul-Jaleel, James Allan, W. Bruce Croft, Fernando Diaz, Leah Larkey, Xiaoyan Li, Donald Metzler, Mark D. Smucker, Trevor Strohman, Howard Turtle, and Courtney Wade. (2004) UMass at TREC 2004: Novelty and HARD. TREC.
- Hui Fang and ChengXiang Zhai. (2006) Semantic Term Matching in Axiomatic Approaches to Information Retrieval. SIGIR.
Documentation reviewed at commit `b44ccd9` (2019-06-24) by Ryan Clancy.