These are detailed instructions for generating the ACL Anthology website, as seen at https://aclweb.org/anthology/.
The Anthology website is generated using the Hugo static
site generator. However, before we can actually invoke Hugo, we need to prepare
the contents of the website. The following steps describe what happens
behind the scenes. All the steps have a corresponding make
target as well.
The data sources for the Anthology currently reside in the data/
directory. XML files contain the authoritative paper metadata, and additional
YAML files document information about venues and special interest groups (SIGs).
Before the Anthology website can be generated, all this information needs to be
converted and preprocessed for the static site generator.
This is achieved by calling:
$ python3 bin/create_hugo_yaml.py
This process should not take longer than a few minutes and can be sped up considerably by installing PyYAML with C bindings.
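To see whether the faster C-backed loader is available before running the script, a quick check along the following lines may help. This is a minimal sketch: the helper name `yaml_loader_name` is made up for illustration, and it relies only on PyYAML's `__with_libyaml__` flag.

```python
def yaml_loader_name():
    """Report which PyYAML loader is available, or None if PyYAML is missing.

    Hypothetical helper for illustration; it is not part of the Anthology scripts.
    """
    try:
        import yaml
    except ImportError:
        return None
    # PyYAML sets __with_libyaml__ to True when its LibYAML (C) bindings were built.
    if getattr(yaml, "__with_libyaml__", False):
        return "CSafeLoader"
    return "SafeLoader"


print(yaml_loader_name())
```

If this prints `SafeLoader`, PyYAML is installed without the C bindings and the conversion will probably run more slowly.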
The YAML files created in Step 1 are used by Hugo to pull in information about venues/papers/etc., but they cannot be used to define what actual pages the website should have. Therefore, another script takes the YAML files generated in Step 1 and produces stubs of pages for each individual paper, venue, etc.
This is achieved by calling:
$ python3 bin/create_hugo_pages.py
This script will produce a lot of files in the hugo/content/ subdirectory
(most prominently, one for each paper in the Anthology).
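To illustrate what a page stub is, the sketch below writes a content file that contains nothing but front matter. The function name `write_stub`, the `papers` section, the `anthology_id` key, and the ID `P18-1001` are all assumptions made for this example; the real script may lay out its files and fields differently.

```python
import json
import tempfile
from pathlib import Path


def write_stub(content_dir, section, page_id, front_matter):
    """Write a content stub holding only front matter and return its path.

    Hypothetical sketch; names and layout are not taken from the real script.
    """
    path = Path(content_dir) / section / f"{page_id}.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    lines = ["---"]
    for key, value in front_matter.items():
        # JSON scalars are valid YAML, so json.dumps gives safe quoting.
        lines.append(f"{key}: {json.dumps(value)}")
    lines.append("---")
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return path


# Demo in a temporary directory rather than the real hugo/content/.
with tempfile.TemporaryDirectory() as tmp:
    stub = write_stub(tmp, "papers", "P18-1001", {"anthology_id": "P18-1001"})
    stub_text = stub.read_text(encoding="utf-8")
    print(stub_text)
```

The stub carries no body text: in this sketch the page content is expected to come from the data files, with the stub only telling Hugo that the page exists.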
In this step, we create .bib
files for each paper and proceedings volume in
the Anthology. This is achieved by calling:
$ python3 bin/create_bibtex.py
The exported files will be written to the hugo/data-export/
subdirectory.
For other export formats, we rely on the bibutils suite by
first converting the generated .bib files to MODS XML:
$ find hugo/data-export -name '*.bib' -exec bin/bib2xml_wrapper {} \; >/dev/null
This creates a corresponding .xml
file in MODS format for every .bib
file
generated previously.
After all necessary files have been created, the website can be built by simply
invoking Hugo from the hugo/ subdirectory. Optionally, the --minify flag
can be used to create minified HTML output:
$ hugo --minify
Generating the website is quite a resource-hungry process, but should not take
longer than a few minutes. Due to the high memory usage (approx. 18 GB
according to the output of hugo --stepAnalysis), it is possible that it will
cause swapping and consequently slow down your system for a while.
The fully generated website will be in hugo/public/
afterwards.
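The build steps described so far can be chained in a small driver script. The following is a sketch rather than a replacement for the make targets, whose names are not given here. The commands are the ones listed in the steps above, while the function names `mods_commands` and `build` and the `dry_run` flag are assumptions of this example.

```python
import subprocess
from pathlib import Path


def mods_commands(export_dir="hugo/data-export"):
    """Yield one bin/bib2xml_wrapper call per .bib file, like the find -exec step."""
    root = Path(export_dir)
    if not root.is_dir():
        return
    for bib in sorted(root.rglob("*.bib")):
        yield ["bin/bib2xml_wrapper", str(bib)]


def build(dry_run=True):
    """Run the build steps in order and return the list of planned commands."""
    planned = []

    def step(argv, cwd=".", quiet=False):
        planned.append(argv)
        if not dry_run:
            # quiet=True mirrors the '>/dev/null' redirect of the MODS step.
            out = subprocess.DEVNULL if quiet else None
            subprocess.run(argv, cwd=cwd, stdout=out, check=True)

    step(["python3", "bin/create_hugo_yaml.py"])
    step(["python3", "bin/create_hugo_pages.py"])
    step(["python3", "bin/create_bibtex.py"])
    # The generator is consumed only now, i.e. after the .bib files were written.
    for argv in mods_commands():
        step(argv, quiet=True)
    step(["hugo", "--minify"], cwd="hugo")
    return planned


# Dry run: only list the commands, without executing anything.
for argv in build(dry_run=True):
    print(" ".join(argv))
```

Calling `build(dry_run=False)` from the repository root would execute the steps and stop at the first failing command, because `check=True` raises on a non-zero exit status.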
The static site tries to follow a strict separation of content and presentation. If you need to make changes to the Anthology, the first step is to figure out where to make these changes.
Changes in content (paper metadata, information about SIGs, etc.) should
always be made in the data files under data/
or in the scripts that
interpret them; changes that only affect the presentation on the website can
be made within the Hugo templates.
The data sources of the Anthology are currently stored under data/. They
comprise:
- The authoritative XML files (in xml/); these contain all paper metadata. Their format is defined in a RelaxNG schema (schema.rnc) in the same directory.
- YAML files for SIGs (in yaml/sigs/); these contain names, URLs, and associated venues for all special interest groups.
- YAML files that define venues. These are currently:
  - venues.yaml: Maps venue acronyms to full names
  - venues_letters.yaml: Maps top-level letters to venue acronyms (e.g. P ⟶ ACL)
  - venues_joint_map.yaml: Maps proceedings volumes to additional venues (Note: volumes will always be associated with the venue corresponding to their first letter; this file defines additional ones in case of joint events etc.)
- A name variant list (name_variants.yaml) that defines which author names should be treated as identical for purposes of generating "author" pages.
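The interplay of the venue files can be sketched as a simple lookup. The dictionaries below are hypothetical in-memory stand-ins, not the actual file contents or formats: only the mapping P ⟶ ACL comes from the description above, while the volume ID `P00-1` and the venue `EXAMPLE` are made up.

```python
# Hypothetical stand-ins for venues_letters.yaml and venues_joint_map.yaml.
VENUES_LETTERS = {"P": "ACL"}
VENUES_JOINT_MAP = {"P00-1": ["EXAMPLE"]}  # made-up entry for illustration


def venues_for_volume(volume_id):
    """Return venue acronyms for a volume: first-letter venue, then joint venues."""
    venues = []
    letter_venue = VENUES_LETTERS.get(volume_id[0])
    if letter_venue is not None:
        venues.append(letter_venue)
    for extra in VENUES_JOINT_MAP.get(volume_id, []):
        if extra not in venues:
            venues.append(extra)
    return venues


print(venues_for_volume("P18-1"))
print(venues_for_volume("P00-1"))
```

The first-letter venue always comes first, so the joint map can only add venues and never replace the default association.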
The "anthology" module under bin/anthology/ is responsible
for parsing and interpreting all these data files. Some information that is not
explicitly stored in any of these files is derived automatically by this
module during Step 1 of building the website. (For example, if a publication
year is not explicitly given in the XML, it is derived from the volume ID in
Paper._infer_year().)
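As an illustration of this kind of derivation, a year could be inferred from an ID roughly as follows. This sketch is not the code of Paper._infer_year(): it assumes IDs that start with a letter followed by a two-digit year (such as the made-up `P18-1001`), and the pivot of 60 for choosing the century is an assumption of this example.

```python
def infer_year(volume_id, pivot=60):
    """Infer a four-digit year from an ID like 'P18-1001'.

    Assumes characters 2-3 are a two-digit year; 'pivot' (an assumption)
    decides whether that year belongs to the 1900s or the 2000s.
    """
    digits = int(volume_id[1:3])
    century = 1900 if digits >= pivot else 2000
    return century + digits


print(infer_year("P18-1001"))
print(infer_year("J79-1001"))
```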
HTML templates for the website are found under hugo/layouts/.
- The main skeleton for all HTML files is _default/baseof.html.
- The front page is index.html.
- Most other pages are defined as **/single.html (e.g., papers/single.html defines the paper pages).
- The appearance of paper entries in lists (on proceedings pages, author pages, etc.) is defined in papers/list-entry.html.
CSS styling for the website is based on Bootstrap 4.3. The final CSS is
compiled from hugo/assets/css/main.scss, which defines
- which Bootstrap components to include,
- which Bootstrap variables to customize (e.g. colors), and
- additional style definitions on top of that.
If a new year is added to the Anthology, make sure the front page
template is updated to include it. Adjust the variable $all_years
(and $border_years, if needed), and don't forget to update the table
headers as well: their colspan attributes need to match the number of
years subsumed under each header.
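The colspan rule can be checked mechanically. The sketch below is independent of the actual template: the header labels, their year groups, and the colspan values are hypothetical inputs you would fill in by hand.

```python
def colspan_mismatches(header_years, colspans):
    """Return labels of headers whose colspan differs from their number of years."""
    return [
        label
        for label, years in header_years.items()
        if colspans.get(label) != len(years)
    ]


# Hypothetical header groups and colspan values, for illustration only.
header_years = {"2010s": list(range(2010, 2020)), "2020s": [2020, 2021]}
colspans = {"2010s": 10, "2020s": 1}  # "2020s" was not updated after adding 2021
print(colspan_mismatches(header_years, colspans))
```

An empty list means every header spans exactly as many columns as it has years.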