diff --git a/data/xml/2024.blackboxnlp.xml b/data/xml/2024.blackboxnlp.xml
new file mode 100644
index 0000000000..33b3c88408
--- /dev/null
+++ b/data/xml/2024.blackboxnlp.xml
@@ -0,0 +1,403 @@
+
+
+
+
+ Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
+ Yonatan Belinkov
+ Najoung Kim
+ Jaap Jumelet
+ Hosein Mohebbi
+ Aaron Mueller
+ Hanjie Chen
+ Association for Computational Linguistics
+ Miami, Florida, US
+ November
+ 2024
+ 2024.blackboxnlp-1
+ blackboxnlp
+
+
+ 2024.blackboxnlp-1.0
+ blackboxnlp-2024-1
+
+
+ Optimal and efficient text counterfactuals using Graph Neural Networks
+ Dimitris Lymperopoulos, National Technical University of Athens
+ Maria Lymperaiou
+ Giorgos Filandrianos, National Technical University of Athens
+ Giorgos Stamou, National Technical University of Athens
+ 1-14
+ As NLP models become increasingly integral to decision-making processes, the need for explainability and interpretability has become paramount. In this work, we propose a framework that achieves both by generating semantically edited inputs, known as counterfactual interventions, which change the model prediction, thus providing a form of counterfactual explanations for the model. We frame the search for optimal counterfactual interventions as a graph assignment problem and employ a GNN to solve it, thus achieving high efficiency. We test our framework on two NLP tasks - binary sentiment classification and topic classification - and show that the generated edits are contrastive, fluent and minimal, while the whole process remains significantly faster than other state-of-the-art counterfactual editors.
+ 2024.blackboxnlp-1.1
+ lymperopoulos-etal-2024-optimal
+
+
+ Routing in Sparsely-gated Language Models responds to Context
+ Stefan Arnold
+ Marian Fietta, Friedrich-Alexander Universität Erlangen-Nürnberg
+ Dilara Yesilbas
+ 15-22
+ Language Models (LMs) have recently incorporated mixture-of-experts layers, consisting of a router and a collection of experts, to scale up their parameter count given a fixed computational budget. Building on previous efforts indicating that token-expert assignments are predominantly influenced by token identities and positions, we trace routing decisions of similarity-annotated text pairs to evaluate the context sensitivity of learned token-expert assignments. We observe that routing in encoder layers mainly depends on (semantic) associations, but contextual cues provide an additional layer of refinement. Conversely, routing in decoder layers is more variable and markedly less sensitive to context.
+ 2024.blackboxnlp-1.2
+ arnold-etal-2024-routing
+
+
+ Are there identifiable structural parts in the sentence embedding whole?
+ Vivi Nastase, University of Geneva
+ Paola Merlo, Idiap Research Institute and University of Geneva, Switzerland
+ 23-42
+ Sentence embeddings from transformer models encode much linguistic information in a fixed-length vector. We investigate whether structural information – specifically, information about chunks and their structural and semantic properties – can be detected in these representations. We use a dataset consisting of sentences with known chunk structure, and two linguistic intelligence datasets, whose solution relies on detecting chunks and their grammatical number, and respectively, their semantic roles. Through an approach involving indirect supervision, and through analyses of the performance on the tasks and of the internal representations built during learning, we show that information about chunks and their properties can be obtained from sentence embeddings.
+ 2024.blackboxnlp-1.3
+ nastase-merlo-2024-identifiable
+
+
+ Learning, Forgetting, Remembering: Insights From Tracking LLM Memorization During Training
+ Danny Leybzon
+ Corentin Kervadec, Universitat Pompeu Fabra
+ 43-57
+ Large language models memorize portions of their training data verbatim. Our findings indicate that models exhibit higher memorization rates both early on and at the very end of their training, with the lowest rates occurring midway through the process. This phenomenon can be attributed to the models retaining most of the examples memorized early on, while forgetting many more examples as training progresses. Interestingly, these forgotten examples are sometimes re-memorized later on, often undergoing cycles of forgetting and re-memorization. Notably, examples memorized early in training are more likely to remain consistently retained, suggesting that they become more firmly ‘crystallized’ in the model’s representation. Based on these insights, we tentatively recommend placing data that is more likely to be sensitive in the middle stages of the training process.
+ 2024.blackboxnlp-1.4
+ leybzon-kervadec-2024-learning
+
+
+ Language Models Linearly Represent Sentiment
+ Oskar Hollinsworth, FAR AI
+ Curt Tigges, EleutherAI Institute
+ Atticus Geiger, Pr(Ai)²R Group
+ Neel Nanda, Google DeepMind
+ 58-87
+ Sentiment is a pervasive feature in natural language text, yet it is an open question how sentiment is represented within Large Language Models (LLMs). In this study, we reveal that across a range of models, sentiment is represented linearly: a single direction in activation space mostly captures the feature across a range of tasks with one extreme for positive and the other for negative. In a causal analysis, we isolate this direction using interventions and show it is causal in both toy tasks and real world datasets such as Stanford Sentiment Treebank. We analyze the mechanisms that involve this direction and discover a phenomenon which we term the summarization motif: sentiment is not just represented on valenced words, but is also summarized at intermediate positions without inherent sentiment, such as punctuation and names. We show that in SST classification, ablating the sentiment direction across all tokens results in a drop in accuracy from 100% to 62% (vs. 50% random baseline), while ablating the summarized sentiment direction at comma positions alone produces close to half this result (reducing accuracy to 82%).
+ 2024.blackboxnlp-1.5
+ hollinsworth-etal-2024-language
+
+
+ LLM Internal States Reveal Hallucination Risk Faced With a Query
+ Ziwei Ji, Hong Kong University of Science and Technology
+ Delong Chen, Hong Kong University of Science and Technology
+ Etsuko Ishii, Amazon
+ Samuel Cahyawijaya
+ Yejin Bang
+ Bryan Wilie
+ Pascale Fung, HKUST
+ 88-104
+ The hallucination problem of Large Language Models (LLMs) significantly limits their reliability and trustworthiness. Humans have a self-awareness process that allows us to recognize what we don’t know when faced with queries. Inspired by this, our paper investigates whether LLMs can estimate their own hallucination risk before response generation. We analyze the internal mechanisms of LLMs broadly both in terms of training data sources and across 15 diverse Natural Language Generation (NLG) tasks, spanning over 700 datasets. Our empirical analysis reveals two key insights: (1) LLM internal states indicate whether they have seen the query in training data or not; and (2) LLM internal states indicate whether they are likely to hallucinate regarding the query. Our study explores particular neurons, activation layers, and tokens that play a crucial role in the LLM perception of uncertainty and hallucination risk. Using a probing estimator, we leverage LLM self-assessment, achieving an average hallucination estimation accuracy of 84.32% at run time.
+ 2024.blackboxnlp-1.6
+ ji-etal-2024-llm
+
+
+ Enhancing adversarial robustness in Natural Language Inference using explanations
+ Alexandros Koulakos
+ Maria Lymperaiou
+ Giorgos Filandrianos, National Technical University of Athens
+ Giorgos Stamou, National Technical University of Athens
+ 105-117
+ The surge of state-of-the-art transformer-based models has undoubtedly pushed the limits of NLP model performance, excelling in a variety of tasks. We cast the spotlight on the underexplored task of Natural Language Inference (NLI), since models trained on popular well-suited datasets are susceptible to adversarial attacks, allowing subtle input interventions to mislead the model. In this work, we validate the use of natural language explanations as a model-agnostic defence strategy through extensive experimentation: merely fine-tuning a classifier on the explanation rather than on premise-hypothesis inputs achieves robustness under various adversarial attacks, in comparison to explanation-free baselines. Moreover, since there is no standard strategy for testing the semantic validity of the generated explanations, we research the correlation of widely used language generation metrics with human perception, in order for them to serve as a proxy towards robust NLI models. Our approach is resource-efficient and reproducible without significant computational limitations.
+ 2024.blackboxnlp-1.7
+ koulakos-etal-2024-enhancing
+
+
+ MultiContrievers: Analysis of Dense Retrieval Representations
+ Seraphina Goldfarb-Tarrant
+ Pedro Rodriguez, Meta FAIR
+ Jane Dwivedi-Yu, Meta AI
+ Patrick Lewis
+ 118-139
+ Dense retrievers compress source documents into (possibly lossy) vector representations, yet there is little analysis of what information is lost versus preserved, and how it affects downstream tasks. We conduct the first analysis of the information captured by dense retrievers compared to the language models they are based on (e.g., BERT versus Contriever). We use 25 MultiBERT checkpoints as randomized initialisations to train MultiContrievers, a set of 25 Contriever models. We test whether specific pieces of information—such as gender and occupation—can be extracted from Contriever vectors of Wikipedia-like documents. We measure this extractability via information-theoretic probing. We then examine the relationship of extractability to performance and gender bias, as well as the sensitivity of these results to many random initialisations and data shuffles. We find that (1) Contriever models have significantly increased extractability, but extractability usually correlates poorly with benchmark performance; (2) gender bias is present, but is not caused by the Contriever representations; (3) there is high sensitivity to both random initialisation and to data shuffle, suggesting that future retrieval research should test across a wider spread of both.
+ 2024.blackboxnlp-1.8
+ goldfarb-tarrant-etal-2024-multicontrievers
+
+
+ Can We Statically Locate Knowledge in Large Language Models? Financial Domain and Toxicity Reduction Case Studies
+ Jordi Armengol-Estapé, University of Edinburgh
+ Lingyu Li, Bloomberg
+ Sebastian Gehrmann, Bloomberg
+ Achintya Gopal
+ David Rosenberg, Bloomberg
+ Gideon Mann
+ Mark Dredze, Department of Computer Science, Whiting School of Engineering
+ 140-176
+ Current large language model (LLM) evaluations rely on benchmarks to assess model capabilities and their encoded knowledge. However, these evaluations cannot reveal where a model encodes its knowledge, and thus little is known about which weights contain specific information. We propose a method to statically (without forward or backward passes) locate topical knowledge in the weight space of an LLM, building on a prior insight that parameters can be decoded into interpretable tokens. If parameters can be mapped into the embedding space, it should be possible to directly search for knowledge via embedding similarity. We study the validity of this assumption across several LLMs for a variety of concepts in the financial domain and a toxicity detection setup. Our analysis yields an improved understanding of the promises and limitations of static knowledge location in real-world scenarios.
+ 2024.blackboxnlp-1.9
+ armengol-estape-etal-2024-statically
+
+
+ Attend First, Consolidate Later: On the Importance of Attention in Different LLM Layers
+ Amit Artzy
+ Roy Schwartz, Hebrew University of Jerusalem
+ 177-184
+ In decoder-based LLMs, the representation of a given layer serves two purposes: as input to the next layer during the computation of the current token; and as input to the attention mechanism of future tokens. In this work, we show that the importance of the latter role might be overestimated. To show that, we start by manipulating the representations of previous tokens, e.g., by replacing the hidden states at some layer k with random vectors. Our experiments with four LLMs and four tasks show that this operation often leads to a small to negligible drop in performance. Importantly, this happens if the manipulation occurs in the top part of the model, i.e., k is in the final 30–50% of the layers. In contrast, doing the same manipulation in earlier layers might lead to chance-level performance. We continue by switching the hidden state of certain tokens with hidden states of other tokens from another prompt, e.g., replacing the word “Italy” with “France” in “What is the capital of Italy?”. We find that when applying this switch in the top 1/3 of the model, the model ignores it (answering “Rome”). However, if we apply it earlier, the model conforms to the switch (“Paris”). Our results hint at a two-stage process in transformer-based LLMs: the first part gathers input from previous tokens, while the second mainly processes that information internally.
+ 2024.blackboxnlp-1.10
+ artzy-schwartz-2024-attend
+
+
+ Enhancing Question Answering on Charts Through Effective Pre-training Tasks
+ Ashim Gupta
+ Vivek Gupta, Arizona State University
+ Shuo Zhang
+ Yujie He, Bloomberg L.P.
+ Ning Zhang, Bloomberg
+ Shalin Shah, Bloomberg
+ 185-192
+ To completely understand a document, the use of textual information is not enough. Understanding visual cues, such as layouts and charts, is also required. While the current state-of-the-art approaches for document understanding (both OCR-based and OCR-free) work well, a thorough analysis of their capabilities and limitations has not yet been performed. Therefore, in this work, we address the limitations of current VisualQA models when applied to charts and plots. To investigate shortcomings of the state-of-the-art models, we conduct a comprehensive behavioral analysis, using ChartQA as a case study. Our findings indicate that existing models particularly underperform in answering questions related to the chart’s structural and visual context, as well as numerical information. To address these issues, we propose three simple pre-training tasks that strengthen the existing model in terms of both structural-visual knowledge and its understanding of numerical questions. We evaluate our pre-trained model (called MatCha-v2) on three chart datasets - both extractive and abstractive question datasets - and observe that it achieves an average improvement of 1.7% over the baseline model.
+ 2024.blackboxnlp-1.11
+ gupta-etal-2024-enhancing
+
+
+ Faithfulness and the Notion of Adversarial Sensitivity in NLP Explanations
+ Supriya Manna
+ Niladri Sett, SRM University
+ 193-206
+ Faithfulness is arguably the most critical metric to assess the reliability of explainable AI. In NLP, current methods for faithfulness evaluation are fraught with discrepancies and biases, often failing to capture the true reasoning of models. We introduce Adversarial Sensitivity as a novel approach to faithfulness evaluation, focusing on the explainer’s response when the model is under adversarial attack. Our method accounts for the faithfulness of explainers by capturing sensitivity to adversarial input changes. This work addresses significant limitations in existing evaluation techniques, and furthermore, quantifies faithfulness from a crucial yet underexplored paradigm.
+ 2024.blackboxnlp-1.12
+ manna-sett-2024-faithfulness
+
+
+ Transformers Learn Transition Dynamics when Trained to Predict Markov Decision Processes
+ Yuxi Chen
+ Suwei Ma
+ Tony Dear, Columbia University
+ Xu Chen
+ 207-216
+ Language models have displayed a wide array of capabilities, but the reason for their performance remains a topic of heated debate and investigation. Do these models simply recite the observed training data, or are they able to abstract away surface statistics and learn the underlying processes from which the data was generated? To investigate this question, we explore the capabilities of a GPT model in the context of Markov Decision Processes (MDPs), where the underlying transition dynamics and policies are not directly observed. The model is trained to predict the next state or action without any initial knowledge of the MDPs or the players’ policies. Despite this, we present evidence that the model develops emergent representations of the underlying parameters governing the MDPs.
+ 2024.blackboxnlp-1.13
+ chen-etal-2024-transformers
+
+
+ On the alignment of LM language generation and human language comprehension
+ Lena Bolliger, University of Zurich
+ Patrick Haller, University of Zurich
+ Lena Jäger, University of Zurich and Universität Potsdam
+ 217-231
+ Previous research on the predictive power (PP) of surprisal and entropy has focused on determining which language models (LMs) generate estimates with the highest PP on reading times, and examining for which populations the PP is strongest. In this study, we leverage eye movement data on texts that were generated using a range of decoding strategies with different LMs. We then extract the transition scores that reflect the models’ production rather than comprehension effort. This allows us to investigate the alignment of LM language production and human language comprehension. Our findings reveal that there are differences in the strength of the alignment between reading behavior and certain LM decoding strategies and that this alignment further reflects different stages of language understanding (early, late, or global processes). Although we find lower PP of transition-based measures compared to surprisal and entropy for most decoding strategies, our results provide valuable insights into which decoding strategies impose less processing effort for readers. Our code is available via https://github.com/DiLi-Lab/LM-human-alignment.
+ 2024.blackboxnlp-1.14
+ bolliger-etal-2024-alignment
+
+
+ An Adversarial Example for Direct Logit Attribution: Memory Management in GELU-4L
+ Jett Janiak, AI Safety Camp
+ Can Rager
+ James Dao
+ Yeu-Tong Lau
+ 232-237
+ Prior work suggests that language models manage the limited bandwidth of the residual stream through a “memory management” mechanism, where certain attention heads and MLP layers clear residual stream directions set by earlier layers. Our study provides concrete evidence for this erasure phenomenon in a 4-layer transformer, identifying heads that consistently remove the output of earlier heads. We further demonstrate that direct logit attribution (DLA), a common technique for interpreting the output of intermediate transformer layers, can show misleading results by not accounting for erasure.
+ 2024.blackboxnlp-1.15
+ janiak-etal-2024-adversarial
+
+
+ Uncovering Syllable Constituents in the Self-Attention-Based Speech Representations of Whisper
+ Erfan A Shams, University College Dublin
+ Iona Gessinger
+ Julie Carson-Berndsen, University College Dublin
+ 238-247
+ As intuitive units of speech, syllables have been widely studied in linguistics. A syllable can be defined as a three-constituent unit with a vocalic centre surrounded by two (in some languages optional) consonant clusters. Syllables are also used to design automatic speech recognition (ASR) models. The significance of knowledge-driven syllable-based tokenisation in ASR over data-driven byte-pair encoding has often been debated. However, the emergence of transformer-based ASR models employing self-attention (SA) overshadowed this debate. These models learn the nuances of speech from large corpora without prior knowledge of the domain; yet, they are not interpretable by design. Consequently, it is not clear if the recent performance improvements are related to the extraction of human-interpretable knowledge. We probe such models for syllable constituents and use an SA head pruning method to assess the relevance of the SA weights. We also investigate the role of vowel identification in syllable constituent probing. Our findings show that the general features of syllable constituents are extracted in the earlier layers of the model and the syllable-related features mostly depend on the temporal knowledge incorporated in specific SA heads rather than on vowel identification.
+ 2024.blackboxnlp-1.16
+ a-shams-etal-2024-uncovering
+
+
+ Recurrent Neural Networks Learn to Store and Generate Sequences using Non-Linear Representations
+ Róbert Csordás, Stanford University
+ Christopher Potts, Stanford University
+ Christopher Manning, Computer Science Department, Stanford University
+ Atticus Geiger, Pr(Ai)²R Group
+ 248-262
+ The Linear Representation Hypothesis (LRH) states that neural networks learn to encode concepts as directions in activation space, and a strong version of the LRH states that models learn only such encodings. In this paper, we present a counterexample to this strong LRH: when trained to repeat an input token sequence, gated recurrent neural networks (RNNs) learn to represent the token at each position with a particular order of magnitude, rather than a direction. These representations have layered features that are impossible to locate in distinct linear subspaces. To show this, we train interventions to predict and manipulate tokens by learning the scaling factor corresponding to each sequence position. These interventions indicate that the smallest RNNs find only this magnitude-based solution, while larger RNNs have linear representations. These findings strongly indicate that interpretability research should not be confined by the LRH.
+ 2024.blackboxnlp-1.17
+ csordas-etal-2024-recurrent
+
+
+ Log Probabilities Are a Reliable Estimate of Semantic Plausibility in Base and Instruction-Tuned Language Models
+ Carina Kauf
+ Emmanuele Chersoni, The Hong Kong Polytechnic University
+ Alessandro Lenci, University of Pisa
+ Evelina Fedorenko, Massachusetts Institute of Technology
+ Anna Ivanova, Georgia Institute of Technology
+ 263-277
+ Semantic plausibility (e.g. knowing that “the actor won the award” is more likely than “the actor won the battle”) serves as an effective proxy for general world knowledge. Language models (LMs) capture vast amounts of world knowledge by learning distributional patterns in text, accessible via log probabilities (LogProbs) they assign to plausible vs. implausible outputs. The new generation of instruction-tuned LMs can now also provide explicit estimates of plausibility via prompting. Here, we evaluate the effectiveness of LogProbs and basic prompting to measure semantic plausibility, both in single-sentence minimal pairs (Experiment 1) and short context-dependent scenarios (Experiment 2). We find that (i) in both base and instruction-tuned LMs, LogProbs offers a more reliable measure of semantic plausibility than direct zero-shot prompting, which yields inconsistent and often poor results; (ii) instruction-tuning generally does not alter the sensitivity of LogProbs to semantic plausibility (although sometimes decreases it); (iii) across models, context mostly modulates LogProbs in expected ways, as measured by three novel metrics of context-sensitive plausibility and their match to explicit human plausibility judgments. We conclude that, even in the era of prompt-based evaluations, LogProbs constitute a useful metric of semantic plausibility, both in base and instruction-tuned LMs.
+ 2024.blackboxnlp-1.18
+ kauf-etal-2024-log
+
+
+ Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2
+ Tom Lieberum, Google
+ Senthooran Rajamanoharan, Google DeepMind
+ Arthur Conmy, Google DeepMind
+ Lewis Smith, Google
+ Nicolas Sonnerat, DeepMind
+ Vikrant Varma, Google DeepMind
+ Janos Kramar, DeepMind
+ Anca Dragan, University of California Berkeley
+ Rohin Shah, DeepMind
+ Neel Nanda, Google DeepMind
+ 278-300
+ Sparse autoencoders (SAEs) are an unsupervised method for learning a sparse decomposition of a neural network’s latent representations into seemingly interpretable features. Despite recent excitement about their potential, research applications outside of industry are limited by the high cost of training a comprehensive suite of SAEs. In this work, we introduce Gemma Scope, an open suite of JumpReLU SAEs trained on all layers and sub-layers of Gemma 2 2B and 9B and select layers of Gemma 2 27B base models. We primarily train SAEs on the Gemma 2 pre-trained models, but additionally release SAEs trained on instruction-tuned Gemma 2 9B for comparison. We evaluate the quality of each SAE on standard metrics and release these results. We hope that by releasing these SAE weights, we can help make more ambitious safety and interpretability research easier for the community. Weights and a tutorial can be found at https://huggingface.co/google/gemma-scope and an interactive demo can be found at https://neuronpedia.org/gemma-scope.
+ 2024.blackboxnlp-1.19
+ lieberum-etal-2024-gemma
+
+
+ Self-Assessment Tests are Unreliable Measures of LLM Personality
+ Akshat Gupta, University of California, Berkeley
+ Xiaoyang Song, University of Michigan - Ann Arbor
+ Gopala Anumanchipalli, University of California, Berkeley
+ 301-314
+ As large language models (LLM) evolve in their capabilities, various recent studies have tried to quantify their behavior using psychological tools created to study human behavior. One such example is the measurement of “personality” of LLMs using self-assessment personality tests developed to measure human personality. Yet almost none of these works verify the applicability of these tests on LLMs. In this paper, we analyze the reliability of LLM personality scores obtained from self-assessment personality tests using two simple experiments. We first introduce the property of prompt sensitivity, where three semantically equivalent prompts representing three intuitive ways of administering self-assessment tests on LLMs are used to measure the personality of the same LLM. We find that all three prompts lead to very different personality scores, a difference that is statistically significant for all traits in a large majority of scenarios. We then introduce the property of option-order symmetry for personality measurement of LLMs. Since most of the self-assessment tests exist in the form of multiple-choice questions (MCQs), we argue that the scores should also be robust to not just the prompt template but also the order in which the options are presented. This test unsurprisingly reveals that the self-assessment test scores are not robust to the order of the options. These simple tests, done on ChatGPT and three Llama2 models of different sizes, show that self-assessment personality tests created for humans are unreliable measures of personality in LLMs.
+ 2024.blackboxnlp-1.20
+ gupta-etal-2024-self
+
+
+ How Language Models Prioritize Contextual Grammatical Cues?
+ Hamidreza Amirzadeh
+ Afra Alishahi, Tilburg University
+ Hosein Mohebbi
+ 315-336
+ Transformer-based language models have shown an excellent ability to effectively capture and utilize contextual information. Although various analysis techniques have been used to quantify and trace the contribution of single contextual cues to a target task such as subject-verb agreement or coreference resolution, scenarios in which multiple relevant cues are available in the context remain underexplored. In this paper, we investigate how language models handle gender agreement when multiple gender cue words are present, each capable of independently disambiguating a target gender pronoun. We analyze two widely used Transformer-based models: BERT, an encoder-based model, and GPT-2, a decoder-based model. Our analysis employs two complementary approaches: context mixing analysis, which tracks information flow within the model, and a variant of activation patching, which measures the impact of cues on the model’s prediction. We find that BERT tends to prioritize the first cue in the context to form both the target word representations and the model’s prediction, while GPT-2 relies more on the final cue. Our findings reveal striking differences in how encoder-based and decoder-based models prioritize and use contextual information for their predictions.
+ 2024.blackboxnlp-1.21
+ amirzadeh-etal-2024-language
+
+
+ Copy Suppression: Comprehensively Understanding a Motif in Language Model Attention Heads
+ Callum McDougall
+ Arthur Conmy, Google DeepMind
+ Cody Rushing, University of Texas at Austin
+ Thomas McGrath, Google
+ Neel Nanda, Google DeepMind
+ 337-363
+ We present the copy suppression motif: an algorithm implemented by attention heads in large language models that reduces loss. If i) language model components in earlier layers predict a certain token, ii) this token appears earlier in the context and iii) later attention heads in the model suppress prediction of the token, then this is copy suppression. To show the importance of copy suppression, we focus on reverse-engineering attention head 10.7 (L10H7) in GPT-2 Small. This head suppresses naive copying behavior which improves overall model calibration, which explains why multiple prior works studying certain narrow tasks found negative heads that systematically favored the wrong answer. We uncover the mechanism that the negative heads use for copy suppression with weights-based evidence and are able to explain 76.9% of the impact of L10H7 in GPT-2 Small, by this motif alone. To the best of our knowledge, this is the most comprehensive description of the complete role of a component in a language model to date. One major effect of copy suppression is its role in self-repair. Self-repair refers to how ablating crucial model components results in downstream neural network parts compensating for this ablation. Copy suppression leads to self-repair: if an initial overconfident copier is ablated, then there is nothing to suppress. We show that self-repair is implemented by several mechanisms, one of which is copy suppression, which explains 39% of the behavior in a narrow task. Interactive visualizations of the copy suppression phenomena may be seen at our web app https://copy-suppression.streamlit.app/.
+ 2024.blackboxnlp-1.22
+ mcdougall-etal-2024-copy
+
+
+ WellDunn: On the Robustness and Explainability of Language Models and Large Language Models in Identifying Wellness Dimensions
+ Seyedali Mohammadi, University of Maryland, Baltimore County
+ Edward Raff, University of Maryland, Baltimore County and Booz Allen Hamilton
+ Jinendra Malekar
+ Vedant Palit, Indian Institute of Technology, Kharagpur
+ Francis Ferraro, University of Maryland, Baltimore County
+ Manas Gaur, University of Maryland Baltimore County
+ 364-388
+ Language Models (LMs) are being proposed for mental health applications where the heightened risk of adverse outcomes means predictive performance may not be a sufficient litmus test of a model’s utility in clinical practice. A model that can be trusted for practice should have a correspondence between explanation and clinical determination, yet no prior research has examined the attention fidelity of these models and their effect on ground truth explanations. We introduce an evaluation design that focuses on the robustness and explainability of LMs in identifying Wellness Dimensions (WDs). We focus on two existing mental health and well-being datasets: (a) Multi-label Classification-based MultiWD, and (b) WellXplain for evaluating attention mechanism veracity against expert-labeled explanations. The labels are based on Halbert Dunn’s theory of wellness, which gives grounding to our evaluation. We reveal four surprising results about LMs/LLMs: (1) Despite their human-like capabilities, GPT-3.5/4 lag behind RoBERTa, and MedAlpaca, an LLM fine-tuned on WellXplain, fails to deliver any remarkable improvements in performance or explanations. (2) Re-examining LMs’ predictions based on a confidence-oriented loss function reveals a significant performance drop. (3) Across all LMs/LLMs, the alignment between attention and explanations remains low, with LLMs scoring a dismal 0.0. (4) Most mental health-specific LMs/LLMs overlook domain-specific knowledge and undervalue explanations, causing these discrepancies. This study highlights the need for further research into their consistency and explanations in mental health and well-being.
+ 2024.blackboxnlp-1.23
+ mohammadi-etal-2024-welldunn
+
+
+ Do Metadata and Appearance of the Retrieved Webpages Affect LLM’s Reasoning in Retrieval-Augmented Generation?
+ Cheng-Han Chiang
+ Hung-yi Lee, National Taiwan University
+ 389-406
+ Large language models (LLMs) answering questions with retrieval-augmented generation (RAG) can face conflicting evidence in the retrieved documents. While prior works study how textual features like perplexity and readability influence the persuasiveness of evidence, humans consider more than textual content when evaluating conflicting information on the web. In this paper, we focus on the following question: When two webpages contain conflicting information to answer a question, does non-textual information affect the LLM’s reasoning and answer? We consider three types of non-textual information: (1) the webpage’s publication time, (2) the source where the webpage is from, and (3) the appearance of the webpage. We give the LLM a Yes/No question and two conflicting webpages that support yes and no, respectively. We exchange the non-textual information in the two webpages to see if the LLMs tend to use the information from a newer, more reliable, and more visually appealing webpage. We find that changing the publication time of the webpage can change the answer for most LLMs, but changing the webpage’s source barely affects the LLM’s answer. We also reveal that the webpage’s appearance has a strong causal effect on Claude-3’s answers. The code and datasets used in the paper are available at https://github.com/d223302/rag-metadata.
+ 2024.blackboxnlp-1.24
+ chiang-lee-2024-metadata
+
+
+ Attribution Patching Outperforms Automated Circuit Discovery
+ Aaquib Syed
+ Can Rager
+ Arthur Conmy, Google DeepMind
+ 407-416
+ Automated interpretability research has recently attracted attention as a potential research direction that could scale explanations of neural network behavior to large models. Existing automated circuit discovery work applies activation patching to identify subnetworks responsible for solving specific tasks (circuits). In this work, we show that a simple method based on attribution patching outperforms all existing methods while requiring just two forward passes and a backward pass. We apply a linear approximation to activation patching to estimate the importance of each edge in the computational subgraph. Using this approximation, we prune the least important edges of the network. We survey the performance and limitations of this method, finding that, averaged over all tasks, our method achieves greater AUC on circuit recovery than other methods.
+ 2024.blackboxnlp-1.25
+ syed-etal-2024-attribution
+
+
+ Pruning for Protection: Increasing Jailbreak Resistance in Aligned LLMs Without Fine-Tuning
+ Adib Hasan, Massachusetts Institute of Technology
+ Ileana Rugina
+ Alex Wang
+ 417-430
+ This paper investigates the impact of model compression on the way Large Language Models (LLMs) process prompts, particularly concerning jailbreak resistance. We show that moderate WANDA pruning can enhance resistance to jailbreaking attacks without fine-tuning, while maintaining performance on standard benchmarks. To systematically evaluate this safety enhancement, we introduce a dataset of 225 harmful tasks across five categories. Our analysis of LLaMA-2 Chat, Vicuna 1.3, and Mistral Instruct v0.2 reveals that pruning benefits correlate with initial model safety levels. We interpret these results by examining changes in attention patterns and perplexity shifts, demonstrating that pruned models exhibit sharper attention and increased sensitivity to artificial jailbreak constructs. We extend our evaluation to the AdvBench harmful behavior tasks and the GCG attack method. We find that LLaMA-2 is much safer on AdvBench prompts than on our dataset when evaluated with manual jailbreak attempts, and that pruning is effective against both automated attacks and manual jailbreaking on AdvBench.
+ 2024.blackboxnlp-1.26
+ hasan-etal-2024-pruning
+
+
+ IvRA: A Framework to Enhance Attention-Based Explanations for Language Models with Interpretability-Driven Training
+ Sean Xie
+ Soroush Vosoughi, Dartmouth College
+ Saeed Hassanpour, Dartmouth College
+ 431-451
+ Attention has long served as a foundational technique for generating explanations. With the recent developments made in Explainable AI (XAI), the multi-faceted nature of interpretability has become more apparent. Can attention, as an explanation method, be adapted to meet the diverse needs that our expanded understanding of interpretability demands? In this work, we aim to address this question by introducing IvRA, a framework designed to directly train a language model’s attention distribution through regularization to produce attribution explanations that align with interpretability criteria such as simulatability, faithfulness, and consistency. Our extensive experimental analysis demonstrates that IvRA outperforms existing methods in guiding language models to generate explanations that are simulatable, faithful, and consistent, in tandem with their predictions. Furthermore, we perform ablation studies to verify the robustness of IvRA across various experimental settings and to shed light on the interactions among different interpretability criteria.
+ 2024.blackboxnlp-1.27
+ xie-etal-2024-ivra
+
+
+ Counterfactuals As a Means for Evaluating Faithfulness of Attribution Methods in Autoregressive Language Models
+ Sepehr Kamahi
+ Yadollah Yaghoobzadeh, University of Tehran
+ 452-468
+ Despite the widespread adoption of autoregressive language models, explainability evaluation research has predominantly focused on span infilling and masked language models. Evaluating the faithfulness of an explanation method—how accurately it explains the inner workings and decision-making of the model—is challenging because it is difficult to separate the model from its explanation. Most faithfulness evaluation techniques corrupt or remove input tokens deemed important by a particular attribution (feature importance) method and observe the resulting change in the model’s output. However, for autoregressive language models, this approach creates out-of-distribution inputs due to their next-token prediction training objective. In this study, we propose a technique that leverages counterfactual generation to evaluate the faithfulness of attribution methods for autoregressive language models. Our technique generates fluent, in-distribution counterfactuals, making the evaluation protocol more reliable.
+ 2024.blackboxnlp-1.28
+ kamahi-yaghoobzadeh-2024-counterfactuals
+
+
+ Investigating Layer Importance in Large Language Models
+ Yang Zhang
+ Yanfei Dong, PayPal Inc. and National University of Singapore
+ Kenji Kawaguchi, National University of Singapore
+ 469-479
+ Large language models (LLMs) have gained increasing attention due to their prominent ability to understand and process texts. Nevertheless, LLMs largely remain opaque. The lack of understanding of LLMs has obstructed their deployment in safety-critical scenarios and hindered the development of better models. In this study, we advance the understanding of LLMs by investigating the significance of individual layers. We propose an efficient sampling method to faithfully evaluate the importance of layers using Shapley values, a widely used explanation framework in feature attribution and data valuation. In addition, we conduct layer ablation experiments to assess the performance degradation resulting from the exclusion of specific layers. Our findings reveal the existence of cornerstone layers, wherein certain early layers can exhibit a dominant contribution over others. Removing one cornerstone layer leads to a drastic collapse of the model performance, often reducing it to random guessing. Conversely, removing non-cornerstone layers results in only marginal performance changes. This study identifies cornerstone layers in LLMs and underscores their critical role for future research.
+ 2024.blackboxnlp-1.29
+ zhang-etal-2024-investigating
+
+
+ Mechanistic?
+ Naomi Saphra, Harvard University
+ Sarah Wiegreffe, Allen Institute for Artificial Intelligence and University of Washington
+ 480-498
+ The rise of the term “mechanistic interpretability” has accompanied increasing interest in understanding neural models—particularly language models. However, this jargon has also led to a fair amount of confusion. So, what does it mean to be mechanistic? We describe four uses of the term in interpretability research. The most narrow technical definition requires a claim of causality, while a broader technical definition allows for any exploration of a model’s internals. However, the term also has a narrow cultural definition describing a cultural movement. To understand this semantic drift, we present a history of the NLP interpretability community and the formation of the separate, parallel mechanistic interpretability community. Finally, we discuss the broad cultural definition—encompassing the entire field of interpretability—and why the traditional NLP interpretability community has come to embrace it. We argue that the polysemy of “mechanistic” is the product of a critical divide within the interpretability community.
+ 2024.blackboxnlp-1.30
+ saphra-wiegreffe-2024-mechanistic
+
+
+ Toward the Evaluation of Large Language Models Considering Score Variance across Instruction Templates
+ Yusuke Sakai, Nara Institute of Science and Technology, Japan
+ Adam Nohejl, Nara Institute of Science and Technology, Japan
+ Jiangnan Hang
+ Hidetaka Kamigaito, Nara Institute of Science and Technology
+ Taro Watanabe, Nara Institute of Science and Technology, Japan
+ 499-529
+ The natural language understanding (NLU) performance of large language models (LLMs) has been evaluated across various tasks and datasets. The existing evaluation methods, however, do not take into account the variance in scores due to differences in prompts, which leads to unfair evaluation and comparison of NLU performance. Moreover, evaluation designed for specific prompts is inappropriate for instruction tuning, which aims to perform well with any prompt. It is therefore necessary to find a way to measure NLU performance in a fair manner, considering score variance between different instruction templates. In this study, we provide English and Japanese cross-lingual datasets for evaluating the NLU performance of LLMs, which include multiple instruction templates for fair evaluation of each task, along with regular expressions to constrain the output format. Furthermore, we propose the Sharpe score as an evaluation metric that takes into account the variance in scores between templates. Comprehensive analysis of English and Japanese LLMs reveals that the high variance among templates has a significant impact on the fair evaluation of LLMs.
+ 2024.blackboxnlp-1.31
+ sakai-etal-2024-toward
+
+
+ Accelerating Sparse Autoencoder Training via Layer-Wise Transfer Learning in Large Language Models
+ Davide Ghilardi
+ Federico Belotti
+ Marco Molinari, LSE.AI
+ Jaehyuk Lim
+ 530-550
+ Sparse AutoEncoders (SAEs) have gained popularity as a tool for enhancing the interpretability of Large Language Models (LLMs). However, training SAEs can be computationally intensive, especially as model complexity grows. In this study, we explore the potential of transfer learning to accelerate SAE training by capitalizing on the shared representations found across adjacent layers of LLMs. Our experimental results demonstrate that fine-tuning SAEs using pre-trained models from nearby layers not only maintains but often improves the quality of learned representations, while significantly accelerating convergence. These findings indicate that the strategic reuse of pretrained SAEs is a promising approach, particularly in settings where computational resources are constrained.
+ 2024.blackboxnlp-1.32
+ ghilardi-etal-2024-accelerating
+
+
+ Wrapper Boxes for Faithful Attribution of Model Predictions to Training Data
+ Yiheng Su
+ Junyi Jessy Li, University of Texas, Austin
+ Matthew Lease, Amazon and University of Texas at Austin
+ 551-576
+ Can we preserve the accuracy of neural models while also providing faithful explanations of model decisions to training data? We propose a “wrapper box” pipeline: training a neural model as usual and then using its learned feature representation in classic, interpretable models to perform prediction. Across seven language models of varying sizes, including four large language models (LLMs), two datasets at different scales, three classic models, and four evaluation metrics, we first show that the predictive performance of wrapper classic models is largely comparable to the original neural models. Because classic models are transparent, each model decision is determined by a known set of training examples that can be directly shown to users. Our pipeline thus preserves the predictive performance of neural language models while faithfully attributing classic model decisions to training data. Among other use cases, such attribution enables model decisions to be contested based on responsible training instances. Compared to prior work, our approach achieves higher coverage and correctness in identifying which training data to remove to change a model decision. To reproduce findings, our source code is online at: https://github.com/SamSoup/WrapperBox.
+ 2024.blackboxnlp-1.33
+ su-etal-2024-wrapper
+
+
+ Multi-property Steering of Large Language Models with Dynamic Activation Composition
+ Daniel Scalena, University of Milan - Bicocca and University of Groningen
+ Gabriele Sarti, University of Groningen
+ Malvina Nissim, University of Groningen
+ 577-603
+ Activation steering methods were shown to be effective in conditioning language model generation by additively intervening over models’ intermediate representations. However, the evaluation of these techniques has so far been limited to single conditioning properties and synthetic settings. In this work, we conduct a comprehensive evaluation of various activation steering strategies, highlighting the property-dependent nature of optimal parameters to ensure a robust effect throughout generation. To address this issue, we propose Dynamic Activation Composition, an information-theoretic approach to modulate the steering intensity of one or more properties throughout generation. Our experiments on multi-property steering show that our method successfully maintains high conditioning while minimizing the impact of conditioning on generation fluency.
+ 2024.blackboxnlp-1.34
+ scalena-etal-2024-multi
+
+
+ Probing Language Models on Their Knowledge Source
+ Zineddine Tighidet
+ Jiali Mei, BNP Paribas
+ Benjamin Piwowarski, CNRS / ISIR, Sorbonne Université
+ Patrick Gallinari, Criteo AI Lab and Sorbonne Université
+ 604-614
+ Large Language Models (LLMs) often encounter conflicts between their learned, internal (parametric knowledge, PK) and external knowledge provided during inference (contextual knowledge, CK). Understanding how LLMs prioritize one knowledge source over the other remains a challenge. In this paper, we propose a novel probing framework to explore the mechanisms governing the selection between PK and CK in LLMs. Using controlled prompts designed to contradict the model’s PK, we demonstrate that specific model activations are indicative of the knowledge source employed. We evaluate this framework on various LLMs of different sizes and demonstrate that mid-layer activations, particularly those related to relations in the input, are crucial in predicting knowledge source selection, paving the way for more reliable models capable of handling knowledge conflicts effectively.
+ 2024.blackboxnlp-1.35
+ tighidet-etal-2024-probing
+
+
+
diff --git a/data/xml/2024.conll.xml b/data/xml/2024.conll.xml
new file mode 100644
index 0000000000..f0e2db33dd
--- /dev/null
+++ b/data/xml/2024.conll.xml
@@ -0,0 +1,462 @@
+
+
+
+
+ Proceedings of the 28th Conference on Computational Natural Language Learning
+ Libby Barak
+ Malihe Alikhani
+ Association for Computational Linguistics
+ Miami, FL, USA
+ November
+ 2024
+ 2024.conll-1
+ conll
+
+
+ 2024.conll-1.0
+ conll-2024-1
+
+
+ Words That Stick: Using Keyword Cohesion to Improve Text Segmentation
+ Amit Maraj
+ Miguel Vargas Martin, Ontario Tech University
+ Masoud Makrehchi, Ontario Tech University
+ 1-9
+ Text Segmentation (TS) is the idea of segmenting bodies of text into coherent blocks, mostly defined by the topics each segment contains. Historically, techniques in this area have been unsupervised, with more success recently coming from supervised methods instead. Although these approaches see better performance, they require training data and upfront training time. We propose a new method called Coherence, where we use strong sentence embeddings to pull representational keywords as the main constructor of sentences when comparing them to one another. Additionally, we include a storage of previously found keywords for the purposes of creating a more accurate segment representation instead of just the immediate sentence in question. With our system, we show improved results over current state-of-the-art unsupervised techniques when analyzed using Pk and WindowDiff scores. Because it is unsupervised, Coherence requires no fine-tuning.
+ 2024.conll-1.1
+ maraj-etal-2024-words
+
+
+ Investigating large language models for their competence in extracting grammatically sound sentences from transcribed noisy utterances
+ Alina Wróblewska
+ 10-23
+ Selectively processing noisy utterances while effectively disregarding speech-specific elements poses no considerable challenge for humans, as they exhibit remarkable cognitive abilities to separate semantically significant content from speech-specific noise (i.e. filled pauses, disfluencies, and restarts). These abilities may be driven by mechanisms based on acquired grammatical rules that compose abstract syntactic-semantic structures within utterances. Segments without syntactic and semantic significance are consistently disregarded in these structures. The structures, in tandem with lexis, likely underpin language comprehension and thus facilitate effective communication. In our study, grounded in linguistically motivated experiments, we investigate whether large language models (LLMs) can effectively perform analogical speech comprehension tasks. In particular, we examine the ability of LLMs to extract well-structured utterances from transcriptions of noisy dialogues. We conduct two evaluation experiments in the Polish language scenario, using a dataset presumably unfamiliar to LLMs to mitigate the risk of data contamination. Our results show that not all extracted utterances are correctly structured, indicating that either LLMs do not fully acquire syntactic-semantic rules or they acquire them but cannot apply them effectively. We conclude that the ability of LLMs to comprehend noisy utterances is still relatively superficial compared to human proficiency in processing them.
+ 2024.conll-1.2
+ wroblewska-2024-investigating
+
+
+ Multi-Cultural Norm Base: Frame-based Norm Discovery in Multi-Cultural Settings
+ Viet Pham, Monash University
+ Shilin Qu
+ Farhad Moghimifar, Monash University
+ Suraj Sharma
+ Yuan-Fang Li, Monash University and Oracle
+ Weiqing Wang
+ Reza Haf, Monash University
+ 24-35
+ Sociocultural norms serve as guiding principles for personal conduct in social interactions within a particular society or culture. The study of norm discovery has seen significant development over the last few years, with various interesting approaches. However, it is difficult to adopt these approaches to discover norms in a new culture, as they rely either on human annotations or real-world dialogue contents. This paper presents a robust automatic norm discovery pipeline, which utilizes the cultural knowledge of GPT-3.5 Turbo (ChatGPT) along with several social factors. By using these social factors and ChatGPT, our pipeline avoids the use of human dialogues that tend to be limited to specific scenarios, as well as the use of human annotations that make it difficult and costly to enlarge the dataset. The resulting database - Multi-cultural Norm Base (MNB) - covers 6 distinct cultures, with over 150k sociocultural norm statements in total. A state-of-the-art Large Language Model (LLM), Llama 3, fine-tuned with our proposed dataset, shows remarkable results on various downstream tasks, outperforming models fine-tuned on other datasets significantly.
+ 2024.conll-1.3
+ pham-etal-2024-multi
+
+
+ Lossy Context Surprisal Predicts Task-Dependent Patterns in Relative Clause Processing
+ Kate McCurdy, Universität des Saarlandes and University of Edinburgh
+ Michael Hahn
+ 36-45
+ English relative clauses are a critical test case for theories of syntactic processing. Expectation- and memory-based accounts make opposing predictions, and behavioral experiments have found mixed results. We present a technical extension of Lossy Context Surprisal (LCS) and use it to model relative clause processing in three behavioral experiments. LCS predicts key results at distinct retention rates, showing that task-dependent memory demands can account for discrepant behavioral patterns in the literature.
+ 2024.conll-1.4
+ mccurdy-hahn-2024-lossy
+
+
+ Global-Pruner: A Stable and Efficient Pruner for Retraining-Free Pruning of Encoder-Based Language Models
+ Guangzhen Yao
+ Yuehan Wang
+ Hui Xu
+ Long Zhang
+ Miao QI
+ 46-55
+ Large language models (LLMs) have achieved significant success in complex tasks across various domains, but they come with high computational costs and inference latency issues. Pruning, as an effective method, can significantly reduce inference costs. However, current pruning algorithms for encoder-based language models often focus on locally optimal solutions, neglecting a comprehensive exploration of the global solution space. This oversight can lead to instability in the solution process, thereby affecting the overall performance of the model. To address these challenges, we propose a structured pruning algorithm named G-Pruner (Global Pruner), comprising two integral components: PPOM (Proximal Policy Optimization Mask) and CG²MT (Conjugate Gradient Squared Mask Tuning), utilizing a global optimization strategy. This strategy not only eliminates the need for retraining but also ensures the algorithm’s stability and adaptability to environmental changes, effectively addressing the issue of focusing solely on immediate optima while neglecting long-term effects. This method is evaluated on the GLUE and SQuAD benchmarks using BERTBASE and DistilBERT models. The experimental results indicate that without any retraining, G-Pruner achieves significant accuracy improvements on the SQuAD 2.0 task with a FLOPs constraint of 60%, demonstrating a 6.02% increase in F1 score compared with baseline algorithms.
+ 2024.conll-1.5
+ yao-etal-2024-global
+
+
+ Transformer verbatim in-context retrieval across time and scale
+ Kristijan Armeni, Johns Hopkins University
+ Marko Pranjić, Jozef Stefan Institute and Jozef Stefan International Postgraduate School
+ Senja Pollak
+ 56-68
+ To predict upcoming text, language models must in some cases retrieve in-context information verbatim. In this report, we investigated how the ability of language models to retrieve arbitrary in-context nouns developed during training (across time) and as language models trained on the same dataset increase in size (across scale). We then asked whether learning of in-context retrieval correlates with learning of more challenging zero-shot benchmarks. Furthermore, inspired by semantic effects in human short-term memory, we evaluated the retrieval with respect to a major semantic component of target nouns, namely whether they denote a concrete or abstract entity, as rated by humans. We show that verbatim in-context retrieval developed in a sudden transition early in the training process, after about 1% of the training tokens. This was observed across model sizes (from 14M up to 12B parameters), and the transition occurred slightly later for the two smallest models. We further found that the development of verbatim in-context retrieval is positively correlated with the learning of zero-shot benchmarks. Around the transition point, all models showed the advantage of retrieving concrete nouns as opposed to abstract nouns. In all but the two smallest models, the advantage dissipated toward the end of training.
+ 2024.conll-1.6
+ armeni-etal-2024-transformer
+
+
+ EditEval: An Instruction-Based Benchmark for Text Improvements
+ JaneDwivedi-YuMeta AI
+ TimoSchickFacebook
+ ZhengbaoJiangSchool of Computer Science, Carnegie Mellon University
+ MariaLomeliMeta
+ PatrickLewis
+ GautierIzacard
+ EdouardGraveFacebook
+ SebastianRiedelGoogle and University College London
+ FabioPetroni
+ 69-83
+ Evaluation of text generation to date has primarily focused on content created sequentially, rather than improvements on a piece of text. Writing, however, is naturally an iterative and incremental process that requires expertise in different modular skills such as fixing outdated information or making the writing style more consistent. Even so, comprehensive evaluation of a model’s capacity to perform these skills and the ability to edit remains sparse. This work introduces EditEval, an instruction-based benchmark and evaluation suite that leverages high-quality existing and new datasets in English for the automatic evaluation of editing capabilities, such as making text more cohesive and paraphrasing. Our evaluation of several pre-trained models shows that InstructGPT and PEER on average perform the best, but that most baselines fall below the supervised state-of-the-art, particularly when neutralizing and updating information. Our analysis also shows that commonly used metrics for editing tasks do not always correlate well, and that prompts leading to the strongest performance do not necessarily elicit strong performance across different models. Through the release of this benchmark (code and data available at https://github.com/facebookresearch/EditEval) and a publicly available leaderboard challenge, we hope to unlock future work on developing models more capable of controllable and iterative editing.
+ 2024.conll-1.7
+ dwivedi-yu-etal-2024-editeval
+
+
+ An Empirical Comparison of Vocabulary Expansion and Initialization Approaches For Language Models
+ NandiniMundraDepartment of Computer Science, Indian Institute of Technology, Madras, Indian Institute of Technology, Madras and AI4Bharat
+ AdityaKhandavally
+ RajDabreNational Institute of Information and Communications Technology (NICT), National Institute of Advanced Industrial Science and Technology
+ RatishPuduppullyIT University of Copenhagen
+ AnoopKunchukuttanMicrosoft and Indian Institute of Technology, Madras
+ MiteshKhapraIndian Institute of Technology, Madras, Dhirubhai Ambani Institute Of Information and Communication Technology
+ 84-104
+ Language Models (LMs) excel in natural language processing tasks for English but show reduced performance in most other languages. This problem is commonly tackled by continually pre-training and fine-tuning these models for said languages. A significant issue in this process is the limited vocabulary coverage in the original model’s tokenizer, leading to inadequate representation of new languages and necessitating an expansion of the tokenizer. The initialization of the embeddings corresponding to new vocabulary items presents a further challenge. Current strategies require cross-lingual embeddings and lack a solid theoretical foundation as well as comparisons with strong baselines. In this paper, we first establish theoretically that initializing within the convex hull of existing embeddings is a good initialization, followed by a novel but simple approach, Constrained Word2Vec (CW2V), which does not require cross-lingual embeddings. Our study evaluates different initialization methods for expanding RoBERTa and LLaMA 2 across four languages and five tasks. The results show that CW2V performs equally well or even better than more advanced techniques. Additionally, simpler approaches like multivariate initialization perform on par with these advanced methods, indicating that efficient large-scale multilingual continued pretraining can be achieved even with simpler initialization methods.
+ 2024.conll-1.8
+ mundra-etal-2024-empirical
+
+
+ Critical Questions Generation: Motivation and Challenges
+ BlancaCalvo FiguerasUniversidad del País Vasco
+ RodrigoAgerriUniversity of the Basque Country
+ 105-116
+ The development of Large Language Models (LLMs) has brought impressive performances on mitigation strategies against misinformation, such as counterargument generation. However, LLMs are still seriously hindered by outdated knowledge and by their tendency to generate hallucinated content. In order to circumvent these issues, we propose a new task, namely, Critical Questions Generation, consisting of processing an argumentative text to generate the critical questions (CQs) raised by it. In argumentation theory, CQs are tools designed to lay bare the blind spots of an argument by pointing at the information it could be missing. Thus, instead of trying to deploy LLMs to produce knowledgeable and relevant counterarguments, we use them to question arguments, without requiring any external knowledge. Research on CQ generation using LLMs requires a reference dataset for large-scale experimentation. Thus, in this work we investigate two complementary methods to create such a resource: (i) instantiating CQ templates as defined by Walton’s argumentation theory and (ii) using LLMs as CQ generators. By doing so, we contribute a procedure to establish what constitutes a valid CQ and conclude that, while LLMs are reasonable CQ generators, they still have a wide margin for improvement in this task.
+ 2024.conll-1.9
+ calvo-figueras-agerri-2024-critical
+
+
+ Information Association for Language Model Updating by Mitigating LM-Logical Discrepancy
+ PengfeiYuBoson AI and University of Illinois at Urbana-Champaign
+ HengJiUniversity of Illinois, Urbana-Champaign
+ 117-129
+ Large Language Models (LLMs) struggle with providing current information due to their outdated pre-training data. Existing methods for updating LLMs, such as knowledge editing and continual fine-tuning, have significant drawbacks in the generalizability of new information and in their requirement for a structured updating corpus. We identify the core challenge behind these drawbacks: the LM-logical discrepancy, featuring the difference between language modeling probabilities and logical probabilities. To evaluate and address this core challenge, we propose a new formulation of the information updating task that only requires the provision of an unstructured updating corpus and evaluates information updating on its generalizability to question-answer pairs pertaining to the updated information. We further propose a novel and effective pipeline approach for the task, highlighting a self-prompting-based question-answer generation process and an associative distillation method to bridge the LM-logical discrepancy. We develop two datasets for evaluation, one sourced from news articles published in March and April 2023, and the other from the Natural Questions benchmark. Experimental results demonstrate the superiority of our approach, significantly increasing the factual consistency score (on a scale from 0 to 1) by up to 0.16. Furthermore, our method effectively mitigates forgetting using a compact replay buffer with only 2.3% of the training tokens.
+ 2024.conll-1.10
+ yu-ji-2024-information
+
+
+ Causal ATE Mitigates Unintended Bias in Controlled Text Generation
+ RahulMadhavanIndian Institute of Management, Ahmedabad, Indian Institute of Technology, Madras, Dhirubhai Ambani Institute Of Information and Communication Technology and Indian Institute of Science, Bangalore
+ KahiniWadhawan
+ 130-142
+ We study attribute control in language models through the method of Causal Average Treatment Effect (Causal ATE). Existing methods for the attribute control task in Language Models (LMs) check for the co-occurrence of words in a sentence with the attribute of interest, and control for them. However, spurious correlation of the words with the attribute in the training dataset can cause models to hallucinate the presence of the attribute when presented with the spurious correlate during inference. We show that the simple perturbation-based method of Causal ATE removes this unintended effect. Specifically, we ground it in the problem of toxicity mitigation, where a significant challenge lies in the inadvertent bias that often emerges towards protected groups after detoxification. We show that this unintended bias can be mitigated by the use of the Causal ATE metric. We provide experimental validation for our claims and release our code (anonymously) here: [github.com/causalate-mitigates-bias](https://github.com/causalate-mitigates-bias/causal-ate-mitigates-bias).
+ 2024.conll-1.11
+ madhavan-wadhawan-2024-causal
+
+
+ On Functional Competence of LLMs for Linguistic Disambiguation
+ RaihanKibria
+ SheikhDipta
+ MuhammadAdnanBangladesh University of Engineering and Technology
+ 143-160
+ We study several Large Language Models to explore their deficiencies in resolving sense ambiguities. In this connection, we evaluate their performance on well-known word sense disambiguation datasets. Word Sense Disambiguation (WSD) has been a long-standing NLP problem, which has given rise to many evaluation datasets and models over the decades. Recently, the emergence of Large Language Models (LLMs) has raised much hope of improving accuracy. In this work, we evaluate the word sense disambiguation capabilities of four LLMs: OpenAI’s ChatGPT-3.5, Mistral’s 7b parameter model, Meta’s Llama 70b, and Google’s Gemini Pro. We evaluate these models on many well-established datasets containing a variety of texts and senses. After observing their performance on these datasets, we selectively study some failure cases and identify the reasons for the failures. We explore human judgments that would correct these failures. Our findings suggest that many failure cases are related to a lack of world knowledge, and of the reasoning needed to amalgamate this knowledge, rather than a lack of linguistic knowledge. We categorize the judgments so that the next generation of LLMs can improve by incorporating deeper world knowledge and reasoning. We conclude that word sense disambiguation could serve as a guide for probing the reasoning power of LLMs and measuring their functional competency. We also report accuracy on these datasets. We find that on many occasions accuracy drops below 70%, which is much lower than that of well-performing existing models.
+ 2024.conll-1.12
+ kibria-etal-2024-functional
+
+
+ AIStorySimilarity: Quantifying Story Similarity Using Narrative for Search, IP Infringement, and Guided Creativity
+ JonChunKenyon College
+ 161-177
+ Stories are central to interpreting experiences, communicating, and influencing each other via film, medical, media, and other narratives. Quantifying the similarity between stories has numerous applications, including detecting IP infringement, detecting hallucinations, search/recommendation engines, and guiding human-AI collaborations. Despite this, traditional NLP text similarity metrics are limited to short-text distance metrics like n-gram overlaps and embeddings. Larger texts require preprocessing with significant information loss through paraphrasing or multi-step decomposition. This paper introduces AIStorySimilarity, a novel benchmark to measure the semantic distance between long-text stories based on core structural elements drawn from narrative theory and script writing. Based on four narrative elements (characters, plot, setting, and themes) as well as 31 sub-features within these, we use a SOTA LLM (gpt-3.5-turbo) to extract and evaluate the semantic similarity of a diverse set of major Hollywood movies. In addition, we compare human evaluation with story similarity scores computed three ways: extracting elements from film scripts before evaluation (Elements), directly evaluating entire scripts (Scripts), and extracting narrative elements from the parametric memory of SOTA LLMs without any provided scripts (GenAI). To the best of our knowledge, AIStorySimilarity is the first benchmark to measure long-text story similarity using a comprehensive approach to narrative theory. Code and data are available at https://github.com/jon-chun/AIStorySimiliarity.
+ 2024.conll-1.13
+ chun-2024-aistorysimilarity
+
+
+ SPAWNing Structural Priming Predictions from a Cognitively Motivated Parser
+ GrushaPrasadColgate University
+ TalLinzenNew York University and Google
+ 178-197
+ Structural priming is a widely used psycholinguistic paradigm to study human sentence representations. In this work we introduce SPAWN, a cognitively motivated parser that can generate quantitative priming predictions from contemporary theories in syntax which assume a lexicalized grammar. By generating and testing priming predictions from competing theoretical accounts, we can infer which assumptions from syntactic theory are useful for characterizing the representations humans build when processing sentences. As a case study, we use SPAWN to generate priming predictions from two theories (Whiz-Deletion and Participial-Phase) which make different assumptions about the structure of English relative clauses. By modulating the reanalysis mechanism that the parser uses and the strength of the parser’s prior knowledge, we generated nine sets of predictions from each of the two theories. Then, we tested these predictions using a novel web-based comprehension-to-production priming paradigm. We found that while some of the predictions from the Participial-Phase theory aligned with human behavior, none of the predictions from the Whiz-Deletion theory did, suggesting that the Participial-Phase theory might better characterize human relative clause representations.
+ 2024.conll-1.14
+ prasad-linzen-2024-spawning
+
+
+ Global Learning with Triplet Relations in Abstractive Summarization
+ FengyuLu
+ JiaxinDuan
+ JunfeiLiu
+ 198-208
+ Abstractive summarization models learned with token-level maximum likelihood estimation suffer from exposure bias: the conditioning context for predicting the next token differs between training and inference. Existing solutions bridge this gap by learning to estimate semantic or lexical qualities of a candidate summary from the global view, namely global learning (GL), yet ignore maintaining rational triplet relations among the document, reference summary, and candidate summaries, e.g., the candidate and reference summaries should have a similar degree of faithfulness judged against the source document. In this paper, we propose an iterative autoregressive summarization paradigm, IARSum, which fuses the learning of triplet relations into a GL framework and further enhances summarization performance. Specifically, IARSum develops a dual-encoder network to enable the simultaneous input of a document and its candidate (or reference) summary. On this basis, it learns to 1) model the relative semantics defined over the tuples (candidate, document) and (reference, document) respectively and balance them; and 2) reduce lexical differences between candidate and reference summaries. Furthermore, IARSum iteratively reprocesses a generated candidate at inference time to attain higher quality. We conduct extensive experiments on two widely used datasets to test our method, and IARSum achieves new or matching state-of-the-art results on diverse metrics.
+ 2024.conll-1.15
+ lu-etal-2024-global
+
+
+ TpT-ADE: Transformer Based Two-Phase ADE Extraction
+ SuryamukhiKuchibhotlaIndian Institute of Technology, Hyderabad, Dhirubhai Ambani Institute Of Information and Communication Technology
+ ManishSingh
+ 209-218
+ Extracting adverse reactions to medications or treatments is a crucial activity in the biomedical domain. The task involves identifying mentions of drugs and their adverse effects/events in raw text, which is challenging due to the unstructured nature of clinical narratives. In this paper, we propose TpT-ADE, a novel joint two-phase transformer model combined with natural language processing (NLP) techniques, to identify adverse events (AEs) caused by drugs. In the first phase of TpT-ADE, entities are extracted and grounded to their standard terms using the Unified Medical Language System (UMLS) knowledge base. In the second phase, entity and relation classification is performed to determine the presence of a relationship between drug and AE pairs. TpT-ADE also identifies the intensity of AE entities by constructing a part-of-speech (POS) embedding model. Unlike previous approaches that use complex classifiers, TpT-ADE employs a shallow neural network and yet outperforms state-of-the-art methods on the standard ADE corpus.
+ 2024.conll-1.16
+ kuchibhotla-singh-2024-tpt
+
+
+ The Effect of Surprisal on Reading Times in Information Seeking and Repeated Reading
+ KerenKleinTechnion - Israel Institute of Technology, Technion
+ YoavMeiri
+ OmerShubi
+ YevgeniBerzakTechnion - Israel Institute of Technology, Technion
+ 219-230
+ The effect of surprisal on processing difficulty has been a central topic of investigation in psycholinguistics. Here, we use eye-tracking data to examine three language processing regimes that are common in daily life but have not been addressed with respect to this question: information seeking, repeated processing, and the combination of the two. Using standard, regime-agnostic surprisal estimates, we find that surprisal theory’s prediction of a linear effect of surprisal on processing times extends to these regimes. However, when using surprisal estimates from regime-specific contexts that match the contexts and tasks given to humans, we find that in information seeking, such estimates do not improve predictive power for processing times compared to standard surprisals. Further, regime-specific contexts yield near-zero surprisal estimates with no predictive power for processing times in repeated reading. These findings point to misalignments of task and memory representations between humans and current language models, and question the extent to which such models can be used for estimating cognitively relevant quantities. We further discuss the theoretical challenges posed by these results.
+ 2024.conll-1.17
+ klein-etal-2024-effect
+
+
+ Revisiting Hierarchical Text Classification: Inference and Metrics
+ RomanPlaud
+ MatthieuLabeauTélécom ParisTech
+ AntoineSaillenfestonepoint
+ ThomasBonaldTélécom ParisTech
+ 231-242
+ Hierarchical text classification (HTC) is the task of assigning labels to a text within a structured space organized as a hierarchy. Recent works treat HTC as a conventional multilabel classification problem, therefore evaluating it as such. We instead propose to evaluate models based on specifically designed hierarchical metrics, and we demonstrate the intricacy of metric choice and prediction inference method. We introduce a new challenging dataset and fairly evaluate recent sophisticated models, comparing them with a range of simple but strong baselines, including a new theoretically motivated loss. Finally, we show that those baselines are very often competitive with the latest models. This highlights the importance of carefully considering the evaluation methodology when proposing new methods for HTC.
+ 2024.conll-1.18
+ plaud-etal-2024-revisiting
+
+
+ NeLLCom-X: A Comprehensive Neural-Agent Framework to Simulate Language Learning and Group Communication
+ YuchenLianLeiden University and Xi’an Jiaotong University
+ TessaVerhoefLeiden University, Leiden University
+ AriannaBisazzaUniversity of Groningen
+ 243-258
+ Recent advances in computational linguistics include simulating the emergence of human-like languages with interacting neural network agents, starting from sets of random symbols. The recently introduced NeLLCom framework (Lian et al., 2023) allows agents to first learn an artificial language and then use it to communicate, with the aim of studying the emergence of specific linguistic properties. We extend this framework (NeLLCom-X) by introducing more realistic role-alternating agents and group communication in order to investigate the interplay between language learnability, communication pressures, and group size effects. We validate NeLLCom-X by replicating key findings from prior research simulating the emergence of a word-order/case-marking trade-off. Next, we investigate how interaction affects linguistic convergence and the emergence of the trade-off. The novel framework facilitates future simulations of diverse linguistic aspects, emphasizing the importance of interaction and group dynamics in language evolution.
+ 2024.conll-1.19
+ lian-etal-2024-nellcom
+
+
+ A Novel Instruction Tuning Method for Vietnamese Mathematical Reasoning using Trainable Open-Source Large Language Models
+ NguyenVinh
+ Thanh-DoNguyen
+ VinhNguyenVietnam National University Hanoi
+ NamBuiViettel Group
+ 259-268
+ This study introduces Simple Reasoning with Code (SiRC), a novel instruction fine-tuning method for solving mathematical reasoning problems, particularly effective for Vietnamese, which is considered a low-resource language. Specifically, solving mathematical problems requires strategic and logical reasoning, which remains challenging in this research area. This paper presents a simple yet effective instruction fine-tuning method for mathematical reasoning. Unlike previous approaches, our proposed method effectively combines chain-of-thought reasoning with code transfer methods without requiring a sophisticated inference procedure. Furthermore, we focus on exploiting small open-source large language models (LLMs) for the Vietnamese language. In this regard, we first introduce a trainable Vietnamese math reasoning dataset, named ViMath-InstructCode. The proposed dataset is then used for fine-tuning open-source LLMs (e.g., those with fewer than 10 billion parameters). Experiments conducted on our custom ViMath-Bench dataset, the largest benchmarking dataset focusing on Vietnamese mathematical problems, demonstrate the promise of our proposed method. Our source code and dataset are available for further use.
+ 2024.conll-1.20
+ vinh-etal-2024-novel
+
+
+ Generalizations across filler-gap dependencies in neural language models
+ KatherineHowittUniversity of Maryland, College Park
+ SathvikNair
+ AllisonDodsUniversity of Maryland, College Park
+ RobertHopkins
+ 269-279
+ Humans develop their grammars by making structural generalizations from finite input. We ask how filler-gap dependencies (FGDs), which share a structural generalization despite diverse surface forms, might arise from the input. We explicitly control the input to a neural language model (NLM) to uncover whether the model posits a shared representation for FGDs. We show that while NLMs do have success differentiating grammatical from ungrammatical FGDs, they rely on superficial properties of the input, rather than on a shared generalization. Our work highlights the need for specific linguistic inductive biases to model language acquisition.
+ 2024.conll-1.21
+ howitt-etal-2024-generalizations
+
+
+ Of Models and Men: Probing Neural Networks for Agreement Attraction with Psycholinguistic Data
+ MaximBazhukovHigher School of Economics
+ EkaterinaVoloshinaGöteborg University and Chalmers University of Technology
+ SergeyPletenev
+ ArsenyAnisimov
+ OlegSerikovKing Abdullah University of Science and Technology
+ SvetlanaToldovaHigher School of Economics
+ 280-290
+ Interpretability studies have played an important role in the field of NLP. They focus on problems such as how models encode information or, for instance, whether linguistic capabilities allow them to prefer grammatical sentences to ungrammatical ones. Recently, several studies have examined whether models demonstrate patterns similar to humans and whether they are sensitive to interference phenomena in the way humans’ grammaticality judgements are, including the phenomenon of agreement attraction. In this paper, we probe BERT and GPT models on the syntactic phenomenon of agreement attraction in Russian using psycholinguistic data with syncretism. Working on a language with syncretism between some plural and singular forms allows us to differentiate between the effects of the surface form and of the underlying grammatical feature. Thus we can further investigate models’ sensitivity to this phenomenon and examine whether the patterns of their behaviour are similar to human patterns. Moreover, we suggest a new way of comparing models’ and humans’ responses via statistical testing. We show that there are some similarities between models’ and humans’ results, while GPT is somewhat more aligned with human responses than BERT. Finally, preliminary results suggest that surface form syncretism influences attraction, perhaps more so than grammatical form syncretism.
+ 2024.conll-1.22
+ bazhukov-etal-2024-models
+
+
+ Is Structure Dependence Shaped for Efficient Communication?: A Case Study on Coordination
+ KoheiKajikawaThe University of Tokyo
+ YusukeKubota
+ YoheiOsekiUniversity of Tokyo
+ 291-302
+ Natural language exhibits various universal properties. But why do these universals exist? One explanation is that they arise from functional pressures to achieve efficient communication, a view which attributes cross-linguistic properties to domain-general cognitive abilities. This hypothesis has successfully addressed some syntactic universal properties such as compositionality and Greenbergian word order universals. However, more abstract syntactic universals have not been explored from the perspective of efficient communication. Among such universals, the most notable one is structure dependence, that is, grammar-internal operations crucially depend on hierarchical representations. This property has traditionally been taken to be central to natural language and to involve domain-specific knowledge irreducible to communicative efficiency. In this paper, we challenge the conventional view by investigating whether structure dependence realizes efficient communication, focusing on coordinate structures. We design three types of artificial languages: (i) one with a structure-dependent reduction operation, which is similar to natural language, (ii) one without any reduction operations, and (iii) one with a linear (rather than structure-dependent) reduction operation. We quantify the communicative efficiency of these languages. The results demonstrate that the language with the structure-dependent reduction operation is significantly more communicatively efficient than the counterfactual languages. This suggests that the existence of structure-dependent properties can be explained from the perspective of efficient communication.
+ 2024.conll-1.23
+ kajikawa-etal-2024-structure
+
+
+ Large Language Model Recall Uncertainty is Modulated by the Fan Effect
+ JesseRobertsTennessee Technological University
+ KyleMooreVanderbilt University
+ DouglasFisherVanderbilt University and Vanderbilt University
+ OseremhenEwaleifoh
+ ThaoPham
+ 303-313
+ This paper evaluates whether large language models (LLMs) exhibit cognitive fan effects, similar to those discovered by Anderson in humans, after being pre-trained on human textual data. We conduct two sets of in-context recall experiments designed to elicit fan effects. Consistent with human results, we find that LLM recall uncertainty, measured via token probability, is influenced by the fan effect. Our results show that removing uncertainty disrupts the observed effect. The experiments suggest the fan effect is consistent whether the fan value is induced in-context or in the pre-training data. Finally, these findings provide in-silico evidence that fan effects and typicality are expressions of the same phenomena.
+ 2024.conll-1.24
+ roberts-etal-2024-large
+
+
+ Continuous Attentive Multimodal Prompt Tuning for Few-Shot Multimodal Sarcasm Detection
+ SoumyadeepJana
+ AnimeshDey
+ RanbirSanasamIndian Institute of Technology, Guwahati, Dhirubhai Ambani Institute Of Information and Communication Technology
+ 314-326
+ With the steep rise in multimodal content on social media, multimodal sarcasm detection has gained widespread attention from research communities. Existing studies depend on large-scale data, which is challenging to obtain and expensive to annotate. Thus, investigating this problem in a few-shot scenario is required. Overly complex multimodal models are prone to overfitting on in-domain data, which hampers their performance on out-of-distribution (OOD) data. To address these issues, we propose the Continuous Attentive Multimodal Prompt Tuning model (CAMP), which leverages the prompt tuning paradigm to handle few-shot multimodal sarcasm detection. To overcome the siloed learning process of continuous prompt tokens, we design a novel continuous multimodal attentive prompt where the continuous tokens intricately engage with both image and text tokens, enabling the assimilation of knowledge from different input modalities. Experimental results indicate that our method outperforms other multimodal baseline methods in the few-shot setting and in OOD scenarios.
+ 2024.conll-1.25
+ jana-etal-2024-continuous
+
+
+ Aligning Alignments: Do Colexification and Distributional Similarity Align as Measures of cross-lingual Lexical Alignment?
+ TaelinKaridi
+ EitanGrossman
+ OmriAbendHebrew University of Jerusalem
+ 327-341
+ The data-driven investigation of the extent to which lexicons of different languages align has mostly fallen into one of two categories: colexification-based and distributional. The two approaches are grounded in distinct methodologies, operate on different assumptions, and are used in diverse ways. This raises two important questions: (a) are there settings in which the predictions of the two approaches can be directly compared? and if so, (b) what is the extent of the similarity and what are its determinants? We offer novel operationalizations for the two approaches in a manner that allows for their direct comparison, and conduct a comprehensive analysis on a diverse set of 16 languages. Our analysis is carried out at different levels of granularity. At the word level, the two methods present different results across the board. However, intriguingly, at the level of semantic domains (e.g., kinship, quantity), the two methods show considerable convergence in their predictions. A detailed comparison of the metrics against a carefully validated dataset of kinship terms shows that the distributional methods likely capture a more fine-grained alignment than their counterpart colexification-based methods, and may thus be more suited for settings where fewer languages are evaluated.
+ 2024.conll-1.26
+ karidi-etal-2024-aligning
+
+
+ Text2Afford: Probing Object Affordance Prediction abilities of Language Models solely from Text
+ SayantanAdak
+ DaivikAgrawal
+ AnimeshMukherjeeIndian Institute of Technology Kharagpur
+ SomakAdityaIndian Institute of Technology Kharagpur
+ 342-364
+ We investigate the knowledge of object affordances in pre-trained language models (LMs) and pre-trained vision-language models (VLMs). A growing body of literature shows that PTLMs fail inconsistently and non-intuitively, demonstrating a lack of reasoning and grounding. To take a first step toward quantifying the effect of grounding (or lack thereof), we curate a novel and comprehensive dataset of object affordances, Text2Afford, characterized by 15 affordance classes. Unlike affordance datasets collected in the vision and language domains, we annotate in-the-wild sentences with objects and affordances. Experimental results reveal that PTLMs exhibit limited reasoning abilities when it comes to uncommon object affordances. We also observe that pre-trained VLMs do not necessarily capture object affordances effectively. Through few-shot fine-tuning, we demonstrate improvement in affordance knowledge in PTLMs and VLMs. Our research contributes a novel dataset for language grounding tasks and presents insights into LM capabilities, advancing the understanding of object affordances.
+ 2024.conll-1.27
+ adak-etal-2024-text2afford
+
+
+ How Are Metaphors Processed by Language Models? The Case of Analogies
+ JoanneBoisson
+ AsahiUshio
+ HsuvasBorkakotyCardiff University
+ KiamehrRezaee
+ DimosthenisAntypas
+ ZaraSiddique
+ NinaWhite
+ JoseCamacho-ColladosCardiff University
+ 365-387
+ The ability to compare by analogy, metaphorically or not, lies at the core of how humans understand the world and communicate. In this paper, we study the likelihood of metaphoric outputs and the capability of a wide range of pretrained transformer-based language models to distinguish metaphors from other types of analogies, including anomalous ones. In particular, we are interested in discovering whether language models recognise metaphorical analogies as well as they do other types of analogies, and whether model size has an impact on this ability. The results show that there are relevant differences using perplexity as a proxy, with the larger models reducing the gap when it comes to analogical processing and to distinguishing metaphors from incorrect analogies. This behaviour does not translate into increased difficulty for larger generative models in identifying metaphors, as compared to other types of analogies, among anomalous sentences in a zero-shot generation setting when the perplexity values of metaphoric and non-metaphoric analogies are similar.
+ 2024.conll-1.28
+ boisson-etal-2024-metaphors
+
+
+ Further Compressing Distilled Language Models via Frequency-aware Partial Sparse Coding of Embeddings
+ KohkiTamuraThe University of Tokyo
+ NaokiYoshinagaInstitute of Industrial Science, the University of Tokyo
+ MasatoNeishi
+ 388-399
+ Although pre-trained language models (PLMs) are effective for natural language understanding (NLU) tasks, they demand huge computational resources, thus preventing us from deploying them on edge devices. Researchers have therefore applied compression techniques for neural networks, such as pruning, quantization, and knowledge distillation, to the PLMs. Although these generic techniques can reduce the number of internal parameters of hidden layers in the PLMs, the embedding layers tied to the tokenizer are hard to compress, occupying a non-negligible portion of the compressed model. In this study, aiming to further compress PLMs reduced by the generic techniques, we exploit frequency-aware sparse coding to compress the embedding layers of PLMs fine-tuned to downstream tasks. To minimize the impact of the compression on accuracy, we retain the embeddings of common tokens as they are and use them to reconstruct the embeddings of rare tokens by locally linear mapping. Experimental results on the GLUE and JGLUE benchmarks for language understanding in English and Japanese confirm that our method can further compress the fine-tuned DistilBERT models while maintaining accuracy.
+ 2024.conll-1.29
+ tamura-etal-2024-compressing
+
+
+ Translating Across Cultures: LLMs for Intralingual Cultural Adaptation
+ PushpdeepSinghTata Consultancy Services Limited, India
+ MayurPatidarTata Consultancy Services Limited, India
+ LovekeshVig
+ 400-418
+ LLMs are increasingly being deployed for multilingual applications and have demonstrated impressive translation capabilities between several low- and high-resource languages. An aspect of translation that often gets overlooked is cultural adaptation, or modifying source culture references to suit the target culture. While specialized translation models still outperform LLMs on the machine translation task when viewed from the lens of correctness, they are not sensitive to cultural differences, which often require manual correction. LLMs, on the other hand, have a rich reservoir of cultural knowledge embedded within their parameters that can potentially be exploited for such applications. In this paper, we define the task of cultural adaptation and create an evaluation framework to assess the performance of modern LLMs on cultural adaptation, and we analyze their cross-cultural knowledge by connecting related concepts across different cultures. We also analyze possible issues with automatic adaptation. We hope that this task will offer more insight into the cultural understanding of LLMs and their creativity in cross-cultural scenarios.
+ 2024.conll-1.30
+ singh-etal-2024-translating
+
+
+ Explaining the Hardest Errors of Contextual Embedding Based Classifiers
+ ClaudioAndradeUniversidade Federal de Minas Gerais, Universidade Federal de Minas Gerais
+ WashingtonCunhaUniversidade Federal de Minas Gerais and Universidade Federal de Minas Gerais
+ GuilhermeFonseca
+ AnaPagano
+ LuanaSantos
+ AdrianaPaganoUniversidade Federal de Minas Gerais, Universidade Federal de Minas Gerais
+ LeonardoRochaUniversidade Federal de São João del-Rei
+ MarcosGonçalvesUniversidade Federal de Minas Gerais, Universidade Federal de Minas Gerais
+ 419-434
+ We seek to explain the causes of the misclassification of the most challenging documents, namely those that no classifier using state-of-the-art, highly semantically separable contextual embedding representations managed to predict accurately. To do so, we propose a taxonomy of incorrect predictions, which we used to perform qualitative human evaluation. We posed two research questions, considering three sentiment datasets in two different domains: movie and product reviews. Evaluators with two different backgrounds evaluated documents by comparing the predominant sentiment assigned by the model to the label in the gold dataset in order to decide on a likely misclassification reason. Based on a high inter-evaluator agreement (81.7%), we observed significant differences between the product and movie review domains, such as the prevalence of ambivalence in product reviews and sarcasm in movie reviews. Our analysis also revealed an unexpectedly high rate of incorrect labeling in the gold dataset (up to 33%) and a significant amount of incorrect prediction by the model due to a series of linguistic phenomena (including amplified words, contrastive markers, comparative sentences, and references to world knowledge). Overall, our taxonomy and methodology allow us to explain between 80% and 85% of the errors with high confidence (agreement), enabling us to point out where future efforts to improve models should be concentrated.
+ 2024.conll-1.31
+ andrade-etal-2024-explaining
+
+
+ A Multimodal Large Language Model “Foresees” Objects Based on Verb Information but Not Gender
+ ShuqiWangThe Chinese University of Hong Kong
+ XufengDuan
+ ZhenguangCai
+ 435-441
+ This study employs a classical psycholinguistic paradigm, the visual world eye-tracking paradigm (VWP), to explore the predictive capabilities of LLAVA, a multimodal large language model (MLLM), and compare them with human anticipatory gaze behaviors. Specifically, we examine the attention weight distributions of LLAVA when presented with visual displays and English sentences containing verb and gender cues. Our findings reveal that LLAVA, like humans, can predictively attend to objects relevant to verbs, but fails to demonstrate gender-based anticipatory attention. Layer-wise analysis indicates that the middle layers of the model are more related to predictive attention than the early or late layers. This study is pioneering in applying psycholinguistic paradigms to compare the multimodal predictive attention of humans and MLLMs, revealing both similarities and differences between them.
+ 2024.conll-1.32
+ wang-etal-2024-multimodal
+
+
+ PRACT: Optimizing Principled Reasoning and Acting of LLM Agent
+ ZhiweiLiuSalesforce AI Research
+ WeiranYaoSalesForce.com
+ JianguoZhangSalesForce AI Research
+ ZuxinLiuSalesforce AI Research
+ LiangweiYang
+ RitheshR NSalesForce.com
+ TianLanSalesForce
+ MingZhuSalesForce.com
+ JuntaoTanRutgers University
+ ShirleyKokaneSalesForce.com
+ ThaiHoangSalesforce Research
+ Juan CarlosNieblesSalesforce Research and Stanford University
+ ShelbyHeineckeSalesforce Research
+ HuanWang
+ SilvioSavareseSalesforce and Stanford University
+ CaimingXiongSalesforce Research
+ 442-446
+ We introduce the Principled Reasoning and Acting (PRAct) framework, a novel method for learning and enforcing action principles from trajectory data. Central to our approach is the use of text gradients from a reflection and optimization engine to derive these action principles. To adapt action principles to specific task requirements, we propose a new optimization framework, Reflective Principle Optimization (RPO). After execution, RPO employs a reflector to critique current action principles and an optimizer to update them accordingly. We investigate the RPO framework under two scenarios: Reward-RPO, which uses environmental rewards for reflection, and Self-RPO, which conducts self-reflection without external rewards. Additionally, we develop two RPO methods, RPO-Traj and RPO-Batch, to adapt to different settings. Experimental results across four environments demonstrate that the PRAct agent, leveraging the RPO framework, can effectively learn and apply action principles to enhance performance.
+ 2024.conll-1.33
+ liu-etal-2024-pract
+
+
+ Image-conditioned human language comprehension and psychometric benchmarking of visual language models
+ Subha NawerPushpita
+ RogerLevyMassachusetts Institute of Technology
+ 447-457
+ Large language models’ (LLMs’) next-word predictions have shown impressive performance in capturing human expectations during real-time language comprehension. This finding has enabled a line of research on psychometric benchmarking of LLMs against human language-comprehension data in order to reverse-engineer humans’ subjective linguistic probability distributions and representations. However, to date, this work has exclusively involved unimodal (language-only) comprehension data, whereas much human language use takes place in rich multimodal contexts. Here we extend psychometric benchmarking to visual language models (VLMs). We develop a novel experimental paradigm, Image-Conditioned Maze Reading, in which participants first view an image and then read a text describing the image within the Maze paradigm, yielding word-by-word reaction-time measures with a high signal-to-noise ratio and good localization of expectation-driven language processing effects. We find a large facilitatory effect of correct image context on language comprehension, not only for words such as concrete nouns that are directly grounded in the image but even for ungrounded words in the image descriptions. Furthermore, we find that VLM surprisal captures most, if not all, of this effect. We use these findings to benchmark a range of VLMs, showing that models with lower perplexity generally have better psychometric performance, but that among the best VLMs tested, perplexity and psychometric performance dissociate. Overall, our work offers new possibilities for connecting psycholinguistics with multimodal LLMs for both scientific and engineering goals.
+ 2024.conll-1.34
+ pushpita-levy-2024-image
+
+
+ Self-supervised speech representations display some human-like cross-linguistic perceptual abilities
+ JoselynRodriguezUniversity of Maryland, College Park
+ KamalaSreepada
+ RuolanFamularo
+ SharonGoldwaterUniversity of Edinburgh
+ NaomiFeldmanUniversity of Maryland, College Park
+ 458-463
+ State-of-the-art models in automatic speech recognition have shown remarkable improvements due to modern self-supervised (SSL) transformer-based architectures such as wav2vec 2.0 (Baevski et al., 2020). However, how these models encode phonetic information is still not well understood. We explore whether SSL speech models display a linguistic property that characterizes human speech perception: language specificity. We show that while wav2vec 2.0 displays an overall language specificity effect when tested on Hindi vs. English, it does not resemble human speech perception when tested on finer-grained differences in Hindi speech contrasts.
+ 2024.conll-1.35
+ rodriguez-etal-2024-self
+
+
+ One-Vs-Rest Neural Network English Grapheme Segmentation: A Linguistic Perspective
+ SamuelRose
+ NinaDethlefsUniversity of Hull
+ C.Kambhampati
+ 464-469
+ Grapheme-to-Phoneme (G2P) correspondences form foundational frameworks of tasks such as text-to-speech (TTS) synthesis or automatic speech recognition. The G2P process involves taking words in their written form and generating their pronunciation. In this paper, we critique the status quo definition of a grapheme, currently a forced alignment process relating a single character to either a phoneme or a blank unit, that underlies the majority of modern approaches. We develop a linguistically-motivated redefinition from simple concepts such as vowel and consonant count and word length and offer a proof-of-concept implementation based on a multi-binary neural classification task. Our model achieves state-of-the-art results with a 31.86% Word Error Rate on a standard benchmark, while generating linguistically meaningful grapheme segmentations.
+ 2024.conll-1.36
+ rose-etal-2024-one
+
+
+ CrowdCounter: A benchmark type-specific multi-target counterspeech dataset
+ PunyajoySaha
+ AbhilashDatta
+ AbhikJanaIIT Bhubaneswar
+ AnimeshMukherjeeIndian Institute of Technology Kharagpur
+ 470-488
+ Counterspeech presents a viable alternative to banning or suspending users for hate speech while upholding freedom of expression. However, writing effective counterspeech is challenging for moderators/users. Hence, developing suggestion tools for writing counterspeech is the need of the hour. One critical challenge in developing such tools is the lack of quality and diversity in the responses in existing datasets. Hence, we introduce a new dataset, CrowdCounter, containing 3,425 hate speech-counterspeech pairs spanning six different counterspeech types (empathy, humor, questioning, warning, shaming, contradiction), which is the first of its kind. The design of our annotation platform itself encourages annotators to write type-specific, non-redundant and high-quality counterspeech. We evaluate two frameworks for generating counterspeech responses, vanilla and type-controlled prompts, across four large language models. In terms of metrics, we evaluate the responses using relevance, diversity and quality. We observe that Flan-T5 is the best model in the vanilla framework across different models. Type-specific prompts enhance the relevance of the responses, although they might reduce the language quality. DialoGPT proves to be the best at following the instructions and generating type-specific counterspeech accurately.
+ 2024.conll-1.37
+ saha-etal-2024-crowdcounter
+
+
+ Solving the Challenge Set without Solving the Task: On Winograd Schemas as a Test of Pronominal Coreference Resolution
+ IanPoradaMcGill University
+ JackieCheungMcGill University, Mila Research Institute and Microsoft
+ 489-506
+ Challenge sets such as the Winograd Schema Challenge (WSC) are used to benchmark systems’ ability to resolve ambiguities in natural language. If one assumes as in existing work that solving a given challenge set is at least as difficult as solving some more general task, then high performance on the challenge set should indicate high performance on the general task overall. However, we show empirically that this assumption of difficulty does not always hold. In particular, we demonstrate that despite the strong performance of prompted language models (LMs) on the WSC and its variants, these same modeling techniques perform relatively poorly at resolving certain pronominal ambiguities attested in OntoNotes and related datasets that are perceived to be easier. Motivated by these findings, we propose a method for ensembling a prompted LM with a supervised, task-specific system that is overall more accurate at resolving pronominal coreference across datasets. Finally, we emphasize that datasets involving the same linguistic phenomenon draw on distinct, but overlapping, capabilities, and evaluating on any one dataset alone does not provide a complete picture of a system’s overall capability.
+ 2024.conll-1.38
+ porada-cheung-2024-solving
+
+
+ Advancing Arabic Sentiment Analysis: ArSen Benchmark and the Improved Fuzzy Deep Hybrid Network
+ YangFangHuaibei Normal University
+ ChengXu
+ ShuhaoGuan
+ NanYanGeorgia Institute of Technology
+ YukeMeiWuhu Institute of Technology
+ 507-516
+ Sentiment analysis is pivotal in Natural Language Processing for understanding opinions and emotions in text. While advancements in sentiment analysis for English are notable, Arabic Sentiment Analysis (ASA) lags behind, despite the growing Arabic online user base. Existing ASA benchmarks are often outdated and lack comprehensive evaluation capabilities for state-of-the-art models. To bridge this gap, we introduce ArSen, a meticulously annotated COVID-19-themed Arabic dataset, and the IFDHN, a novel model incorporating fuzzy logic for enhanced sentiment classification. ArSen provides a contemporary, robust benchmark, and IFDHN achieves state-of-the-art performance on ASA tasks. Comprehensive evaluations demonstrate the efficacy of IFDHN using the ArSen dataset, highlighting future research directions in ASA.
+ 2024.conll-1.39
+ fang-etal-2024-advancing
+
+
+ Leveraging a Cognitive Model to Measure Subjective Similarity of Human and GPT-4 Written Content
+ TylerMalloyCarnegie Mellon University
+ MariaFerreiraCarnegie Mellon University
+ FeiFangCarnegie Mellon University
+ CleotildeGonzalez
+ 517-527
+ Cosine similarity between two documents can be computed using token embeddings formed by Large Language Models (LLMs) such as GPT-4, and used to categorize those documents across a range of uses. However, these similarities are ultimately dependent on the corpora used to train these LLMs, and may not reflect subjective similarity of individuals or how their biases and constraints impact similarity metrics. This lack of cognitively-aware personalization of similarity metrics can be particularly problematic in educational and recommendation settings where there is a limited number of individual judgements of category or preference, and biases can be particularly relevant. To address this, we rely on an integration of an Instance-Based Learning (IBL) cognitive model with LLM embeddings to develop the Instance-Based Individualized Similarity (IBIS) metric. This similarity metric is beneficial in that it takes into account individual biases and constraints in a manner that is grounded in the cognitive mechanisms of decision making. To evaluate the IBIS metric, we also introduce a dataset of human categorizations of emails as being either dangerous (phishing) or safe (ham). This dataset is used to demonstrate the benefits of leveraging a cognitive model to measure the subjective similarity of human participants in an educational setting.
+ 2024.conll-1.40
+ malloy-etal-2024-leveraging
+
+
+
diff --git a/data/xml/2024.crac.xml b/data/xml/2024.crac.xml
new file mode 100644
index 0000000000..bb9e9214ef
--- /dev/null
+++ b/data/xml/2024.crac.xml
@@ -0,0 +1,129 @@
+
+
+
+
+ Proceedings of The Seventh Workshop on Computational Models of Reference, Anaphora and Coreference (CRAC 2024)
+ MaciejOgrodniczuk
+ AnnaNedoluzhko
+ MassimoPoesio
+ SameerPradhan
+ VincentNg
+ Association for Computational Linguistics
+ Miami
+ November
+ 2024
+ 2024.crac-1
+ crac
+
+
+ 2024.crac-1.0
+ crac-2024-1
+
+
+ Major Entity Identification: A Generalizable Alternative to Coreference Resolution
+ Kawshik S.Manikantan
+ ShubhamToshniwal
+ MakarandTapaswi
+ VineetGandhi
+ 1–17
+ 2024.crac-1.1
+ manikantan-etal-2024-major
+
+
+ Enriching Conceptual Knowledge in Language Models through Metaphorical Reference Explanation
+ ZixuanZhang
+ HengJi
+ 18–22
+ 2024.crac-1.2
+ zhang-ji-2024-enriching
+
+
+ Polish Coreference Corpus as an LLM Testbed: Evaluating Coreference Resolution within Instruction-Following Language Models by Instruction–Answer Alignment
+ KarolSaputa
+ AngelikaPeljak-Łapińska
+ MaciejOgrodniczuk
+ 23–32
+ 2024.crac-1.3
+ saputa-etal-2024-polish
+
+
+ MSCAW-coref: Multilingual, Singleton and Conjunction-Aware Word-Level Coreference Resolution
+ HoujunLiu
+ JohnBauer
+ KarelD’Oosterlinck
+ ChristopherPotts
+ Christopher D.Manning
+ 33–40
+ 2024.crac-1.4
+ liu-etal-2024-mscaw
+
+
+ Unifying the Scope of Bridging Anaphora Types in English: Bridging Annotations in ARRAU and GUM
+ LaurenLevine
+ AmirZeldes
+ 41–51
+ 2024.crac-1.5
+ levine-zeldes-2024-unifying
+
+
+ WinoPron: Revisiting English Winogender Schemas for Consistency, Coverage, and Grammatical Case
+ VagrantGautam
+ JuliusSteuer
+ EileenBingert
+ RayJohns
+ AnneLauscher
+ DietrichKlakow
+ 52–66
+ 2024.crac-1.6
+ gautam-etal-2024-winopron
+
+
+ DeepHCoref: A Deep Neural Coreference Resolution for Hindi Text
+ KusumLata
+ PardeepSingh
+ KamleshDutta
+ AbhishekKanwar
+ 67–77
+ 2024.crac-1.7
+ lata-etal-2024-deephcoref
+
+
+ Findings of the Third Shared Task on Multilingual Coreference Resolution
+ MichalNovák
+ BarboraDohnalová
+ MiloslavKonopik
+ AnnaNedoluzhko
+ MartinPopel
+ OndrejPrazak
+ JakubSido
+ MilanStraka
+ ZdeněkŽabokrtský
+ DanielZeman
+ 78–96
+ 2024.crac-1.8
+ novak-etal-2024-findings
+
+
+ CorPipe at CRAC 2024: Predicting Zero Mentions from Raw Text
+ MilanStraka
+ 97–106
+ 2024.crac-1.9
+ straka-2024-corpipe
+
+
+ End-to-end Multilingual Coreference Resolution with Headword Mention Representation
+ OndrejPrazak
+ MiloslavKonopík
+ 107–113
+ 2024.crac-1.10
+ prazak-konopik-2024-end
+
+
+ Multilingual coreference resolution as text generation
+ NataliaSkachkova
+ 114–122
+ 2024.crac-1.11
+ skachkova-2024-multilingual
+
+
+
diff --git a/data/xml/2024.customnlp4u.xml b/data/xml/2024.customnlp4u.xml
new file mode 100644
index 0000000000..f0ac135c2c
--- /dev/null
+++ b/data/xml/2024.customnlp4u.xml
@@ -0,0 +1,301 @@
+
+
+
+
+ Proceedings of the 1st Workshop on Customizable NLP: Progress and Challenges in Customizing NLP for a Domain, Application, Group, or Individual (CustomNLP4U)
+ SachinKumar
+ VidhishaBalachandran
+ Chan YoungPark
+ WeijiaShi
+ Shirley AnugrahHayati
+ YuliaTsvetkov
+ NoahSmith
+ HannanehHajishirzi
+ DongyeopKang
+ DavidJurgens
+ Association for Computational Linguistics
+ Miami, Florida, USA
+ November
+ 2024
+ 2024.customnlp4u-1
+ customnlp4u
+
+
+ 2024.customnlp4u-1.0
+ customnlp4u-2024-1
+
+
+ Navigate Complex Physical Worlds via Geometrically Constrained LLM
+ YongqiangHuang
+ WentaoYe
+ LiyaoLiZhejiang University
+ JunboZhaoZhejiang University
+ 1-11
+ This study investigates the potential of Large Language Models (LLMs) for reconstructing and understanding the physical world based solely on textual knowledge. It explores the impact of model performance on spatial understanding abilities by introducing a set of geometric conventions and developing a workflow based on multi-layer graphs and multi-agent systems. The study examines how LLMs achieve multi-step and multi-objective geometric inference in a spatial environment, using unified geometric conventions and a graph-driven framework. A genetic algorithm, inspired by large-scale model knowledge, is employed to solve geometric constraint problems, enhancing the spatial reasoning capabilities of LLMs. This work innovatively explores the feasibility of using text-based LLMs as builders of the physical world and designs a workflow to enhance their spatial comprehension and construction capabilities.
+ 2024.customnlp4u-1.1
+ huang-etal-2024-navigate
+
+
+ Empowering AAC Users: A Systematic Integration of Personal Narratives with Conversational AI
+ SayantanPalState University of New York at Buffalo
+ SouvikDasState University of New York at Buffalo
+ RohiniSrihariState University of New York at Buffalo
+ JeffHigginborhamState University of New York at Buffalo
+ JennaBizoviState University of New York at Buffalo
+ 12-25
+ Communication barriers have long posed challenges for users of Augmentative and Alternative Communication (AAC). In AAC, effective conversational aids are not solely about harnessing Artificial Intelligence (AI) capabilities but more about ensuring these technologies resonate deeply with AAC users’ unique communication challenges. We aim to bridge the gap between generic outputs and genuine human interactions by integrating advanced conversational AI with personal narratives. While existing solutions offer generic responses, a considerable gap in tailoring outputs to reflect an AAC user’s intent must be addressed. Thus, we propose to create a custom conversational dataset centered on the experiences and words of a primary AAC user to fine-tune advanced language models. Additionally, we employ a Retrieval-Augmented Generation (RAG) method, drawing context from a summarized version of content authored by the AAC user. This combination ensures that responses are contextually relevant and deeply personal. Preliminary evaluations underscore its transformative potential, with automated metrics and human assessments showcasing significantly enhanced response quality.
+ 2024.customnlp4u-1.2
+ pal-etal-2024-empowering
+
+
+ LLM-Based Robust Product Classification in Commerce and Compliance
+ SinaGholamian
+ GianfrancoRomaniThomson Reuters
+ BartoszRudnikowiczThomson Reuters
+ StavroulaSkylaki
+ 26-36
+ Product classification is a crucial task in international trade, as compliance regulations are verified and taxes and duties are applied based on product categories. Manual classification of products is time-consuming and error-prone, and the sheer volume of products imported and exported renders the manual process infeasible. Consequently, e-commerce platforms and enterprises involved in international trade have turned to automatic product classification using machine learning. However, current approaches do not consider the real-world challenges associated with product classification, such as very abbreviated and incomplete product descriptions. In addition, recent advancements in generative Large Language Models (LLMs) and their reasoning capabilities are mainly untapped in product classification and e-commerce. In this research, we explore the real-life challenges of industrial classification and propose data perturbations that allow for realistic data simulation. Furthermore, we employ LLM-based product classification to improve the robustness of the prediction in the presence of incomplete data. Our research shows that LLMs with in-context learning outperform the supervised approaches in the clean-data scenario. Additionally, we illustrate that LLMs are significantly more robust than the supervised approaches when data attacks are present.
+ 2024.customnlp4u-1.3
+ gholamian-etal-2024-llm
+
+
+ Less is Fed More: Sparsity Reduces Feature Distortion in Federated Learning
+ AbhinavRao
+ AashiqMuhamed
+ HarshitaDiddeeCarnegie Mellon University
+ 37-46
+ Our work studies Multilingual Federated Learning (FL), a decentralized paradigm that, although promising, grapples with issues such as client drift and suboptimal generalization in diverse, multilingual settings. We highlight limitations in existing approaches to generalize across both actively participating and inactive client language pairs. To mitigate these challenges, we introduce FedSparseNet, which incorporates sparse-network training, and LoRA, based on Low-Rank Adaptation. These approaches maintain the model’s fidelity to its pretraining distribution, thereby ensuring robust performance on both seen and unseen language pairs, while simultaneously enhancing communication efficiency by selectively transmitting trainable parameters. Our empirical evaluations demonstrate that FedSparseNet outperforms conventional FL models on both seen and unseen clients, while LoRA shows remarkable improvements in unseen client performance. Additionally, we propose the Continuous Relative Robustness Metric, a novel metric to uniformly assess a model’s performance across diverse language pairs. We open-source our code for reproducibility on GitHub.
+ 2024.customnlp4u-1.4
+ rao-etal-2024-less
+
+
+ Understanding Players as if They Are Talking to the Game in a Customized Language: A Pilot Study
+ TianzeWang
+ MaryamHonarijahromiMicrosoft Xbox (king)
+ StylianiKatsarou
+ OlgaMikheevaKTH Royal Institute of Technology, Stockholm, Sweden
+ TheodorosPanagiotakopoulosKing
+ OlegSmirnovMicrosoft Gaming
+ LeleCaoMicrosoft (ABK)
+ SaharAsadi
+ 47-52
+ This pilot study explores the application of language models (LMs) to model game event sequences, treating them as a customized natural language. We investigate a popular mobile game, transforming raw event data into textual sequences and pretraining a Longformer model on this data. Our approach captures the rich and nuanced interactions within game sessions, effectively identifying meaningful player segments. The results demonstrate the potential of self-supervised LMs in enhancing game design and personalization without relying on ground-truth labels.
+ 2024.customnlp4u-1.5
+ wang-etal-2024-understanding
+
+
+ L3Masking: Multi-task Fine-tuning for Language Models by Leveraging Lessons Learned from Vanilla Models
+ YusukeKimuraDoshisha University
+ TakahiroKomamizuNagoya University
+ KenjiHatanoDoshisha University
+ 53-62
+ When distributional differences exist between pre-training and fine-tuning data, language models (LMs) may perform poorly on downstream tasks. Recent studies have reported that multi-task learning of a downstream task and a masked language modeling (MLM) task during the fine-tuning phase improves the performance of the downstream task. Typical MLM tasks (e.g., random token masking (RTM)) tend to ignore tokens corresponding to knowledge already acquired during the pre-training phase, so LMs may miss important clues or fail to acquire the linguistic knowledge of the task or domain effectively. To overcome this limitation, we propose a new masking strategy for the MLM task, called L3Masking, that leverages lessons (specifically, token-wise likelihood in a context) learned from the vanilla language model to be fine-tuned. L3Masking actively masks tokens with low likelihood under the vanilla model. Experimental evaluations on text classification tasks in different domains confirm that a multi-task text classification method with L3Masking performs task adaptation more effectively than one with RTM. These results suggest the usefulness of assigning a preference to the tokens to be learned for task or domain adaptation.
+ 2024.customnlp4u-1.6
+ kimura-etal-2024-l3masking
+
+
+ Grounded Language Agent for Product Search via Intelligent Web Interactions
+ MoghisFereidouni
+ AdibMosharrofUniversity of Kentucky
+ A.b.SiddiqueUniversity of Kentucky
+ 63-75
+ Recent research has focused on developing agents powered by large language models (LLMs) to accomplish complex high-level user intents. However, employing LLMs with billions of parameters (e.g., GPT-4) may incur substantial costs on top of handcrafting extensive prompts. To address this, we introduce a Grounded Language Agent for Intelligent Web Interactions, named GLAINTEL. GLAINTEL employs Flan-T5 as its backbone and is flexible in training in various settings: unsupervised learning, supervised learning, and unsupervised domain adaptation. Specifically, we tackle both the challenge of learning without human demonstrations and the opportunity to leverage human demonstrations effectively when those are available. Additionally, we explore unsupervised domain adaptation for cases where demonstrations are limited to a specific domain. Experimental evaluations across diverse setups demonstrate the effectiveness of GLAINTEL in unsupervised settings, outperforming in-context learning-based approaches that employ larger models with up to 540 billion parameters. Surprisingly, behavioral cloning-based methods that straightforwardly use human demonstrations do not outperform unsupervised variants of GLAINTEL. Additionally, we show that combining human demonstrations with reinforcement learning-based training yields results comparable to methods utilizing GPT-4. The code is available at: https://github.com/MultifacetedNLP/Web-Agents-Unsupervised
+ 2024.customnlp4u-1.7
+ fereidouni-etal-2024-grounded
+
+
+ AdaptEval: Evaluating Large Language Models on Domain Adaptation for Text Summarization
+ AnumAfzalTechnische Universität München
+ RibinChalumattuETHZ - ETH Zurich
+ FlorianMatthesTechnische Universität München
+ LauraMascarell
+ 76-85
+ Despite the advances in the abstractive summarization task using Large Language Models (LLMs), there is a lack of research that assesses their ability to adapt easily to different domains. We evaluate the domain adaptation abilities of a wide range of LLMs on the summarization task across various domains in both fine-tuning and in-context learning settings. We also present AdaptEval, the first domain adaptation evaluation suite. AdaptEval includes a domain benchmark and a set of metrics to facilitate the analysis of domain adaptation. Our results demonstrate that LLMs exhibit comparable performance in the in-context learning setting, regardless of their parameter scale.
+ 2024.customnlp4u-1.8
+ afzal-etal-2024-adapteval
+
+
+ CPS-TaskForge: Generating Collaborative Problem Solving Environments for Diverse Communication Tasks
+ NikitaHaduongDepartment of Computer Science, University of Washington
+ IreneWangNA
+ Bo-RuLuUniversity of Washington
+ PrithvirajAmmanabroluUniversity of California, San Diego
+ NoahSmithUniversity of Washington and Allen Institute for Artificial Intelligence
+ 86-112
+ Teams can outperform individuals; could adding AI teammates further bolster performance of teams solving problems collaboratively? Collaborative problem solving (CPS) research commonly studies teams with two agents (human-human or human-AI), but team research literature finds that, for complex tasks, larger teams are more effective. Progress in studying collaboration with more than two agents, through textual records of team interactions, is hindered by a major data challenge: available CPS corpora are predominantly dyadic, and adapting pre-existing CPS tasks to more agents is non-trivial. We address this data challenge by developing a CPS task generator, CPS-TaskForge, that can produce environments for studying CPS under a wide array of conditions, and releasing a CPS task design checklist grounded in the theoretical PISA 2015 CPS framework to help facilitate the development of CPS corpora with more agents. CPS-TaskForge takes the form of a resource management (tower defense) game, and different CPS tasks can be studied by manipulating game design parameters. We conduct a case study with groups of 3–4 humans to validate production of diverse natural language CPS communication in a game instance produced by CPS-TaskForge. We discuss opportunities for advancing research in CPS (both with human-only and human-AI teams) using different task configurations. We release all data and code.
+ 2024.customnlp4u-1.9
+ haduong-etal-2024-cps
+
+
+ Active Learning for Robust and Representative LLM Generation in Safety-Critical Scenarios
+ SabitHassanUniversity of Pittsburgh
+ AnthonySiciliaNortheastern University
+ MaliheAlikhaniNortheastern University
+ 113-123
+ Ensuring robust safety measures across a wide range of scenarios is crucial for user-facing systems. While Large Language Models (LLMs) can generate valuable data for safety measures, they often exhibit distributional biases, focusing on common scenarios and neglecting rare but critical cases. This can undermine the effectiveness of safety protocols developed using such data. To address this, we propose a novel framework that integrates active learning with clustering to guide LLM generation, enhancing their representativeness and robustness in safety scenarios. We demonstrate the effectiveness of our approach by constructing a dataset of 5.4K potential safety violations through an iterative process involving LLM generation and an active learner model’s feedback. Our results show that the proposed framework produces a more representative set of safety scenarios without requiring prior knowledge of the underlying data distribution. Additionally, data acquired through our method improves the accuracy and F1 score of both the active learner model as well as models outside the scope of the active learning process, highlighting its broad applicability.
+ 2024.customnlp4u-1.10
+ hassan-etal-2024-active
+
+
+ Exploring the Readiness of Prominent Small Language Models for the Democratization of Financial Literacy
+ TagoreKosireddy
+ JeffreyWallMichigan Technological University
+ EvanLucasMichigan Technological University
+ 124-149
+ The use of small language models (SLMs), herein defined as models with fewer than three billion parameters, is increasing across various domains and applications. Due to their ability to run on more accessible hardware and preserve user privacy, SLMs possess the potential to democratize access to language models for individuals of different socioeconomic status and with different privacy preferences. This study assesses several state-of-the-art SLMs (e.g., Apple’s OpenELM, Microsoft’s Phi, Google’s Gemma, and the Tinyllama project) for use in the financial domain to support the development of financial literacy LMs. Democratizing access to quality financial information for those who are financially undereducated is greatly needed in society, particularly as new financial markets and products emerge and participation in financial markets increases due to ease of access. We are the first to examine the use of open-source SLMs to democratize access to financial question answering capabilities for individuals and students. To this end, we provide an analysis of the memory usage, inference time, similarity comparisons to ground-truth answers, and output readability of prominent SLMs to determine which models are most accessible and capable of supporting access to financial information. We analyze zero-shot and few-shot learning variants of the models. The results suggest that some off-the-shelf SLMs merit further exploration and fine-tuning to prepare them for individual use, while others may have limits to their democratization. Code to replicate our experiments is shared.
+ 2024.customnlp4u-1.11
+ kosireddy-etal-2024-exploring
+
+
+ Customized Style Transfer using Discrete Sampling
+ AnugunjNaman
+ 150-155
+ Customizing text style or content typically involves extensive fine-tuning of large models, demanding significant data and training. Traditional unsupervised approaches using sampling often yield low diversity and creativity. We present a novel discrete Langevin proposal that samples directly from the categorical token distribution, overcoming these limitations. By adapting the continuous Langevin algorithm for discrete spaces, our approach enables efficient gradient-based sampling. Evaluations on style transfer tasks demonstrate superior performance over state-of-the-art methods in accuracy, BLEU, BERTScore, and diversity. Our proposed approach paves the way for advanced customized text generation with desired styles and allows future scope for prompt generation for model safeguarding and jail-breaking.
+ 2024.customnlp4u-1.12
+ naman-2024-customized
+
+
+ Trustful LLMs: Customizing and Grounding Text Generation with knowledge bases and Dual Decoders
+ XiaofengZhu
+ Jaya KrishnaMandivarapuMicrosoft
+ 156-166
+ Although people are impressed by the content generation skills of large language models, the use of LLMs, such as ChatGPT, is limited by the domain grounding of the content. The correctness and groundedness of the generated content need to be based on a verified context, such as results from Retrieval-Augmented Generation (RAG). One important issue when adapting LLMs to a customized domain is that the generated responses are often incomplete, or the additions are not verified and may even be hallucinated. Prior studies on hallucination detection have focused on evaluation metrics, which are not easily adaptable to dynamic domains and can be vulnerable to attacks like jail-breaking. In this work, we propose 1) a post-processing algorithm that leverages knowledge triplets in the RAG context to correct hallucinations and 2) a dual-decoder model that fuses the RAG context to guide the generation process.
+ 2024.customnlp4u-1.13
+ zhu-mandivarapu-2024-trustful
+
+
+ Constructing Domain-Specific Evaluation Sets for LLM-as-a-judge
+ RaviRajuSambanova Systems
+ SwayambhooJainSambanova Systems
+ BoLi
+ JonathanLi
+ UrmishThakkerSambaNova Systems
+ 167-181
+ Large Language Models (LLMs) have revolutionized the landscape of machine learning, yet current benchmarks often fall short in capturing the diverse behavior of these models in real-world applications. A benchmark’s usefulness is determined by its ability to clearly differentiate between models of varying capabilities (separability) and closely align with human preferences. Existing frameworks like Alpaca-Eval 2.0 LC (CITATION) and Arena-Hard v0.1 (CITATION) are limited by their focus on general-purpose queries and lack of diversity across domains such as law, medicine, and multilingual contexts. In this paper, we address these limitations by introducing a novel data pipeline that curates diverse, domain-specific evaluation sets tailored for LLM-as-a-Judge frameworks. Our approach leverages a combination of manual curation, semi-supervised learning to generate clusters, and stratified sampling to ensure balanced representation across a wide range of domains and languages. The resulting evaluation set, which includes 1573 samples across 14 categories, demonstrates high separability (84%) across ten top-ranked models, 84% agreement with Chatbot Arena, and a 0.915 Spearman correlation. The agreement values are 9% better than Arena-Hard and 20% better than AlpacaEval 2.0 LC, while the Spearman coefficient is 0.7 higher than that of the next best benchmark, showcasing a significant improvement in the usefulness of the benchmark. We further provide an open-source evaluation tool that enables fine-grained analysis of model performance across user-defined categories, offering valuable insights for practitioners. This work contributes to the ongoing effort to enhance the transparency, diversity, and effectiveness of LLM evaluation methodologies.
+ 2024.customnlp4u-1.14
+ raju-etal-2024-constructing
+
+
+ Learning to Adapt Large Language Models to One-Shot In-Context Intent Classification on Unseen Domains
+ JoongboShinLG AI Research
+ YoubinAhnLG Corporation
+ SeungpilWonLG Corporation
+ Stanley JungkyuChoiLanguage Lab, LG AI Research
+ 182-197
+ In this paper, we explore one-shot in-context intent classification using large language models (LLMs) with the goal of minimizing the effort required to adapt models to unseen domains. To enhance the one-shot in-context learning capabilities of LLMs, we employ in-context tuning, leveraging its cross-domain transferability to unseen domains. To this end, we introduce the IC-collection, a compilation of open-source intent classification datasets from diverse domains, which are meticulously divided into held-in and held-out datasets. Our experiments demonstrate the effectiveness of the proposed method, showing that our model, with only 7B parameters, not only outperforms GPT-4 on intent classification but also achieves state-of-the-art performance in unseen domains with only one-shot demonstrations. Both our benchmark and model will be made publicly available to advance research in chatbot systems.
+ 2024.customnlp4u-1.15
+ shin-etal-2024-learning
+
+
+ Pearl: Personalizing Large Language Model Writing Assistants with Generation-Calibrated Retrievers
+ ShesheraMysore
+ ZhuoranLu
+ MengtingWanMicrosoft
+ LongqiYangMicrosoft
+ BaharehSarrafzadehMicrosoft
+ SteveMenezesMicrosoft
+ TinaBaghaeeMicrosoft
+ EmmanuelGonzalezCentro de Enseñanza Técnica Industrial
+ JenniferNevillePurdue University
+ TaraSafaviMicrosoft Research
+ 198-219
+ Powerful large language models have facilitated the development of writing assistants that promise to significantly improve the quality and efficiency of composition and communication. However, a barrier to effective assistance is the lack of personalization in LLM outputs to the author’s communication style, specialized knowledge, and values. In this paper, we address this challenge by proposing Pearl, an LLM writing assistant personalized with a retriever that is trained to be generation-calibrated for personalization. Generation calibration ensures that our retriever selects historic user-authored documents to augment an LLM prompt such that they are likely to help an LLM generation better adhere to a user’s preferences. We propose two key novelties for training such a retriever: (1) a training data selection method that identifies user requests likely to benefit from personalization and documents that provide that benefit; and (2) a scale-calibrating KL-divergence objective that ensures that our retriever scores remain proportional to the downstream generation quality from using the document for personalized generation. In a series of holistic evaluations, we demonstrate the effectiveness of Pearl in generating long-form texts on multiple social media datasets. Finally, we demonstrate how a generation-calibrated retriever can double as a performance predictor, detecting low-quality retrieval and improving potentially under-performing outputs via revision with LLMs.
+ 2024.customnlp4u-1.16
+ mysore-etal-2024-pearl
+
+
+ Evaluating and Training Long-Context Large Language Models for Question Answering on Scientific Papers
+ LukasHilgertKarlsruher Institut für Technologie
+ DanniLiuKarlsruher Institut für Technologie
+ JanNiehues
+ 220-236
+ With the number of scientific papers published every year growing and current large language models (LLMs) showing state-of-the-art performance on natural language processing (NLP) tasks, we ask whether LLMs could be utilized to answer questions on scientific papers. We investigate how well state-of-the-art LLMs can answer questions on scientific papers by experimenting with long-context versions of the LLaMA 2 model and evaluating and training on the Qasper dataset. We analyze how well the LLMs handle longer papers and questions that can only be answered by accessing information from distant paragraphs. During our experiments, we see that the performance of these LLMs drops with growing length and position of relevant information. We employ different measures, from simple prompts to chain-of-thought prompts and zero-shot usage to fine-tuning with QLoRA. While we still observe a performance loss with increased context length, our measures reduce the effects of this flaw, and we can achieve F1 scores similar to those of bigger models like GPT-4.
+ 2024.customnlp4u-1.17
+ hilgert-etal-2024-evaluating
+
+
+ HyPA-RAG: A Hybrid Parameter Adaptive Retrieval-Augmented Generation System for AI Legal and Policy Applications
+ RishiKalra
+ ZekunWu
+ AyeshaGulleyNA
+ AirlieHilliard
+ XinGuan
+ AdrianoKoshiyama
+ PhilipTreleavenUniversity College London, University of London
+ 237-256
+ While Large Language Models (LLMs) excel in text generation and question-answering, their effectiveness in AI legal and policy applications is limited by outdated knowledge, hallucinations, and inadequate reasoning in complex contexts. Retrieval-Augmented Generation (RAG) systems improve response accuracy by integrating external knowledge but struggle with retrieval errors, poor context integration, and high costs, particularly in interpreting AI legal texts. This paper introduces a Hybrid Parameter-Adaptive RAG (HyPA-RAG) system tailored for AI legal and policy, exemplified by NYC Local Law 144 (LL144). HyPA-RAG uses a query complexity classifier for adaptive parameter tuning, a hybrid retrieval strategy combining dense, sparse, and knowledge graph methods, and an evaluation framework with specific question types and metrics. By dynamically adjusting parameters, HyPA-RAG significantly improves retrieval accuracy and response fidelity. Testing on LL144 shows enhanced correctness, faithfulness, and contextual precision, addressing the need for adaptable NLP systems in complex, high-stakes AI legal and policy applications.
+ 2024.customnlp4u-1.18
+ kalra-etal-2024-hypa
+
+
+ What Kind of Sourcery is This? Evaluating GPT-4’s Performance on Linking Scientific Fact to Citations
+ AutumnToneyGeorgetown University
+ 257-268
+ From document summarization to code generation, chatbots have disrupted various aspects of scientific research and writing. While chatbots are useful research resources for ideation, information retrieval, and editing, their generative pre-trained transformer (GPT) models’ underlying knowledge infrastructure is opaque. This has raised questions about the reliability of generative chatbot responses, as GPT models are known to respond with misleading information that appears to be accurate. Prior research has investigated the utility of OpenAI’s public chatbot, ChatGPT, to generate reliable bibliographic information with a focus on small-scale medical-related scientific facts. We present an expanded study that analyzes GPT-4’s ability to accurately identify 1,326 scientific facts and link them to academic sources. Using both the API and UI service, we experimented with open-ended and close-ended prompts to establish an understanding of GPT-4’s general ability at this domain-specific task, as well as study the real-world scenario of an average user interacting with ChatGPT using its UI. GPT-4 accurately identified 96% of the scientific facts and generated relevant and existent academic citations with 78% accuracy. Using the claims that GPT-4 mislabeled and provided incorrect sources via the API, we prompt two public GPTs customized for academic writing to evaluate if they correctly label the scientific claims and provide accurate sources. We find that these GPTs are able to accurately label 38% of the mislabeled claims, with 95% of the corresponding citations being accurate and relevant.
+ 2024.customnlp4u-1.19
+ toney-2024-kind
+
+
+ “Let’s Argue Both Sides”: Argument Generation Can Force Small Models to Utilize Previously Inaccessible Reasoning Capabilities
+ KavehEskandari Miandoab
+ VasanthSarathyTufts University
+ 269-283
+ Large Language Models (LLMs), despite achieving state-of-the-art results in a number of evaluation tasks, struggle to maintain their performance when logical reasoning is strictly required to correctly infer a prediction. In this work, we propose Argument Generation as a method of forcing models to utilize their reasoning capabilities when other approaches such as chain-of-thought reasoning prove insufficient. Our method involves the generation of arguments for each possible inference result, and asking the end model to rank the generated arguments. We show that Argument Generation can serve as an appropriate substitute for zero-shot prompting techniques without the requirement to add layers of complexity. Furthermore, we argue that knowledge-probing techniques such as chain-of-thought reasoning and Argument Generation are only useful when further reasoning is required to infer a prediction, making them auxiliary to more common zero-shot approaches. Finally, we demonstrate that our approach forces larger gains in smaller language models, showcasing a complex relationship between model size and prompting methods in foundation models.
+ 2024.customnlp4u-1.20
+ eskandari-miandoab-sarathy-2024-lets
+
+
+ LLM-as-a-tutor in EFL Writing Education: Focusing on Evaluation of Student-LLM Interaction
+ JieunHanKorea Advanced Institute of Science & Technology
+ HaneulYooKAIST
+ JunhoMyungKorea Advanced Institute of Science and Technology
+ MinsunKim
+ HyunseungLimKorea Advanced Institute of Science & Technology
+ YoonsuKimKorea Advanced Institute of Science & Technology
+ Tak YeonLeeKorea Advanced Institute of Science & Technology
+ HwajungHong
+ JuhoKimKorea Advanced Institute of Science and Technology
+ So-YeonAhnKorea Advanced Institute of Science & Technology
+ AliceOhKorea Advanced Institute of Science and Technology
+ 284-293
+ In the context of English as a Foreign Language (EFL) writing education, LLM-as-a-tutor can assist students by providing real-time feedback on their essays. However, challenges arise in assessing LLM-as-a-tutor due to differing standards between educational and general use cases. To bridge this gap, we integrate pedagogical principles to assess student-LLM interaction. First, we explore how LLMs can function as English tutors, providing effective essay feedback tailored to students. Second, we propose three criteria to evaluate LLM-as-a-tutor specifically designed for EFL writing education, emphasizing pedagogical aspects. In this process, EFL experts evaluate the feedback from LLM-as-a-tutor regarding (1) quality and (2) characteristics. On the other hand, EFL learners assess their (3) learning outcomes from interaction with LLM-as-a-tutor. This approach lays the groundwork for developing LLMs-as-a-tutor tailored to the needs of EFL learners, advancing the effectiveness of writing education in this context.
+ 2024.customnlp4u-1.21
+ han-etal-2024-llm
+
+
+ E-Commerce Product Categorization with LLM-based Dual-Expert Classification Paradigm
+ ZhuChengAmazon
+ WenZhangAmazon
+ Chih-ChiChouNA
+ You-YiJauNA
+ ArchitaPathakNA
+ PengGaoNA
+ UmitBaturNA
+ 294-304
+ Accurate product categorization in e-commerce is critical for delivering a satisfactory online shopping experience to customers. With the vast number of available products and the numerous potential categories, it becomes crucial to develop a classification system capable of assigning products to their correct categories with high accuracy. We present a dual-expert classification system that utilizes the power of large language models (LLMs). This framework integrates domain-specific knowledge and pre-trained LLM’s general knowledge through effective model fine-tuning and prompting techniques. First, the fine-tuned domain-specific expert recommends top K candidate categories for a given input product. Then, the more general LLM-based expert, through prompting techniques, analyzes the nuanced differences between candidate categories and selects the most suitable target category. We introduce a new in-context learning approach that utilizes LLM self-generated summarization to provide clearer instructions and enhance its performance. Experiments on e-commerce datasets demonstrate the effectiveness of our LLM-based Dual-Expert classification system.
+ 2024.customnlp4u-1.22
+ cheng-etal-2024-e
+
+
+ Adapting LLM Predictions in In-Context Learning with Data Priors
+ JavierChiyah-Garcia
+ PrasoonGoyalAmazon
+ MichaelJohnstonAmazon
+ RezaGhanadanUniversity of Maryland, College Park
+ 305-316
+ In-Context Learning (ICL) has enabled Large Language Models (LLMs) to excel as general-purpose models in zero- and few-shot task settings. However, since LLMs are often not trained on the downstream tasks, they lack crucial contextual knowledge from the data distributions, which limits their task adaptability. This paper explores using data priors to automatically customize prompts in ICL. We extract these priors in a dataset-agnostic way based on historical information, enabling LLMs to personalize their output towards users or tasks at inference time. We find that they improve LLMs’ output by injecting latent dataset-specific information for the task of rating prediction. Throughout a series of experiments, we show replicable results across LLMs and datasets on what information and methods are most effective for adapting ICL outputs with priors. Our findings offer a systematic approach to customizing prompts with additional information in a privacy-friendly manner, requiring only aggregated data that is computationally efficient.
+ 2024.customnlp4u-1.23
+ chiyah-garcia-etal-2024-adapting
+
+
+ V-GlórIA - Customizing Large Vision and Language Models to European Portuguese
+ AfonsoSimplício
+ DavidSemedoUniversidade NOVA de Lisboa
+ JoaoMagalhaesUniversidade Nova de Lisboa
+ 317-326
+ Generative Vision and Language models have recently obtained remarkable results, thanks to the use of robust pre-trained visual encoders and Large Language Models (LLMs), together with efficient model adaptation training strategies that require minimal architectural modifications while preserving LLMs’ original capabilities. With these advances focusing mainly on the English language, there is a gap in customization methodologies for other languages. In this paper, we propose a customization methodology that adapts existing state-of-the-art vision and language architectures to European Portuguese (PT-PT). As a result of applying this methodology, we introduce V-GlórIA, the first Large Vision and Language generative model specifically customized for European Portuguese. V-GlórIA supports multimodal tasks such as image captioning, retrieval, and dialogue. To deliver V-GlórIA, we leverage state-of-the-art V&L architectures and contribute PT-PT machine-translated pre-training (CC3M PT-PT) and benchmark (MSCOCO PT-PT and VisDial PT-PT) datasets. Our experiments show that V-GlórIA delivers promising performance in text-image retrieval and downstream tasks in a zero-shot setting, such as image captioning and visual dialogue tasks, highlighting the effectiveness of our customization approach.
+ 2024.customnlp4u-1.24
+ simplicio-etal-2024-v
+
+
+
diff --git a/data/xml/2024.emnlp.xml b/data/xml/2024.emnlp.xml
index 496a4b3796..54ddb8b2c1 100644
--- a/data/xml/2024.emnlp.xml
+++ b/data/xml/2024.emnlp.xml
@@ -18780,6 +18780,25 @@
2024.findings-emnlp
+ 2024.blackboxnlp-1
+ 2024.conll-1
+ 2024.crac-1
+ 2024.customnlp4u-1
+ 2024.fever-1
+ 2024.futured-1
+ 2024.genbench-1
+ 2024.mrl-1
+ 2024.nllp-1
+ 2024.nlp4dh-1
+ 2024.nlp4pi-1
+ 2024.nlp4science-1
+ 2024.sicon-1
+ 2024.tsar-1
+ 2024.wat-1
+ 2024.wikinlp-1
+ 2024.winlp-1
+ 2024.wmt-1
+ 2024.wnu-1
diff --git a/data/xml/2024.fever.xml b/data/xml/2024.fever.xml
new file mode 100644
index 0000000000..3ad0a72a5b
--- /dev/null
+++ b/data/xml/2024.fever.xml
@@ -0,0 +1,336 @@
+
+
+
+
+ Proceedings of the Seventh Fact Extraction and VERification Workshop (FEVER)
+ MichaelSchlichtkrull
+ YulongChen
+ ChenxiWhitehouse
+ ZhenyunDeng
+ MubasharaAkhtar
+ RamiAly
+ ZhijiangGuo
+ ChristosChristodoulopoulos
+ OanaCocarascu
+ ArpitMittal
+ JamesThorne
+ AndreasVlachos
+ Association for Computational Linguistics
+ Miami, Florida, USA
+ November
+ 2024
+ 2024.fever-1
+ fever
+
+
+ 2024.fever-1.0
+ fever-2024-1
+
+
+ The Automated Verification of Textual Claims (AVeriTeC) Shared Task
+ MichaelSchlichtkrullQueen Mary University of London
+ YulongChenUniversity of Cambridge
+ ChenxiWhitehouseUniversity of Cambridge
+ ZhenyunDengUniversity of Cambridge
+ MubasharaAkhtarKing’s College London
+ RamiAlyUniversity of Cambridge
+ ZhijiangGuoHuawei
+ ChristosChristodoulopoulosAmazon
+ OanaCocarascuKing’s College London
+ ArpitMittalMeta
+ JamesThorneKAIST
+ AndreasVlachosUniversity of Cambridge
+ 1-26
+ The Automated Verification of Textual Claims (AVeriTeC) shared task asks participants to retrieve evidence and predict veracity for real-world claims checked by fact-checkers. Evidence can be found either via a search engine, or via a knowledge store provided by the organisers. Submissions are evaluated using the AVeriTeC score, which considers a claim to be accurately verified if and only if both the verdict is correct and retrieved evidence is considered to meet a certain quality threshold. The shared task received 21 submissions, 18 of which surpassed our baseline. The winning team was TUDA_MAI with an AVeriTeC score of 63%. In this paper we describe the shared task, present the full results, and highlight key takeaways from the shared task.
+ 2024.fever-1.1
+ schlichtkrull-etal-2024-automated
+
+
+ Multi-hop Evidence Pursuit Meets the Web: Team Papelo at FEVER 2024
+ ChristopherMalonNEC Laboratories America
+ 27-36
+ Separating disinformation from fact on the web has long challenged both the search and the reasoning powers of humans. We show that the reasoning power of large language models (LLMs) and the retrieval power of modern search engines can be combined to automate this process and explainably verify claims. We integrate LLMs and search under a multi-hop evidence pursuit strategy. This strategy generates an initial question based on an input claim using a sequence-to-sequence model, searches for and formulates an answer to the question, and iteratively generates follow-up questions to pursue the missing evidence using an LLM. We demonstrate our system on the FEVER 2024 (AVeriTeC) shared task. Compared to a strategy of generating all the questions at once, our method obtains .045 higher label accuracy and .155 higher AVeriTeC score (evaluating the adequacy of the evidence). Through ablations, we show the importance of various design choices, such as the question generation method, medium-sized context, reasoning with one document at a time, adding metadata, paraphrasing, reducing the problem to two classes, and reconsidering the final verdict. Our submitted system achieves .510 AVeriTeC score on the dev set and .477 AVeriTeC score on the test set.
+ 2024.fever-1.2
+ malon-2024-multi
+
+
+ Retrieving Semantics for Fact-Checking: A Comparative Approach using CQ (Claim to Question) & AQ (Answer to Question)
+ NicolòUrbaniUniversity of Milan - Bicocca
+ SandipModhaUniversity of Milan - Bicocca
+ GabriellaPasiUniversity of Milan - Bicocca
+ 37-46
+ Fact-checking using evidence is the preferred way to tackle the issue of misinformation in society. The democratization of information through social media has accelerated the spread of information, allowing misinformation to reach and influence a vast audience. The significant impact of these falsehoods on society and public opinion underscores the need for automated approaches to identify and combat this phenomenon. This paper describes the participation of team IKR3-UNIMIB in the AVeriTeC (Automated Verification of Textual Claims) 2024 shared task. We propose a method to retrieve evidence in question-and-answer format and predict the veracity of a claim. As part of the AVeriTeC shared task, our method combines a similarity-based ColBERT re-ranker with traditional keyword search using BM25. Additionally, a recent promising approach, Chain of RAG (CoRAG), is introduced to generate question-and-answer pairs (QAs) to evaluate performance on this specific dataset. We explore whether generating questions from claims or answers produces more effective QA pairs for veracity prediction. Additionally, we try to generate questions from the claim rather than from evidence (opposite to the AVeriTeC dataset paper) to generate effective QA pairs for veracity prediction. Our method achieved an AVeriTeC score of 0.18 (above the baseline) on the test dataset, demonstrating its potential in automated fact-checking.
+ 2024.fever-1.3
+ urbani-etal-2024-retrieving
+
+
+ RAG-Fusion Based Information Retrieval for Fact-Checking
+ YukiMomii
+ TetsuyaTakiguchiKobe University
+ YasuoArikiKobe University
+ 47-54
+ Fact-checking involves searching for relevant evidence and determining whether the given claim contains any misinformation. In this paper, we propose a fact verification system based on RAG-Fusion. We use GPT-4o to generate questions from the claim, which helps improve the accuracy of evidence retrieval. Additionally, we adopt GPT-4o for the final judgment module and refine the prompts to enhance the detection accuracy, particularly when the claim contains misinformation. Experiments showed that our system achieved an AVeriTeC score of 0.3865 on the AVeriTeC test data, significantly surpassing the baseline score of 0.11.
+ 2024.fever-1.4
+ momii-etal-2024-rag
+
+
+ UHH at AVeriTeC: RAG for Fact-Checking with Real-World Claims
+ ÖzgeSevgili
+ IrinaNikishina
+ SeidYimamUniversität Hamburg
+ MartinSemmannUniversität Hamburg
+ ChrisBiemannU Hamburg
+ 55-63
+ This paper presents UHH’s approach developed for the AVeriTeC shared task. The goal of the challenge is to verify given real-world claims with evidence from the Web. In this shared task, we investigate a Retrieval-Augmented Generation (RAG) model, which mainly contains retrieval, generation, and augmentation components. We start by selecting the top 10k evidence candidates via BM25 scores, and continue with two approaches to retrieve the most similar evidence: (1) retrieve the top 10 pieces of evidence through vector similarity, generate questions for them, and rerank them; or (2) generate questions for the claim and retrieve the most similar evidence, again through vector similarity. After retrieving the top evidence, a Large Language Model (LLM) is prompted with the claim along with either all the evidence or individual pieces of evidence to predict the label. Our system submission, UHH, using the first approach and individual evidence prompts, ranks 6th out of 23 systems.
+ 2024.fever-1.5
+ sevgili-etal-2024-uhh
+
+
+ Improving Evidence Retrieval on Claim Verification Pipeline through Question Enrichment
+ SvetlanaChurina
+ AnabBarik
+ SaisamarthPhaye
+ 64-70
+ The AVeriTeC shared task introduces a new real-world claim verification dataset, where a system is tasked with verifying a real-world claim based on evidence found on the internet. In this paper, we propose a claim verification pipeline called QueenVer, which consists of two modules, Evidence Retrieval and Claim Verification. Our pipeline collects <Question, Answer> pairs as evidence. Recognizing the pivotal role of question quality in evidence efficacy, we propose question enrichment to enhance the retrieved evidence. Specifically, we adopt three different Question Generation (QG) techniques: multi-hop, single-hop, and fact-checker style. For the claim verification module, we integrate an ensemble of multiple state-of-the-art LLMs to enhance its robustness. Experiments show that QueenVer achieves 0.41, 0.29, and 0.42 on the Q, Q+A, and AVeriTeC scores, respectively.
+ 2024.fever-1.6
+ churina-etal-2024-improving
+
+
+ Dunamu-ml’s Submissions on AVERITEC Shared Task
+ HeesooParkDunamu
+ DongjunLeeDunamu
+ JaehyukKimDunamu
+ ChoongWonPark
+ ChanghwaParkLG Energy Solution
+ 71-76
+ This paper presents Dunamu-ml’s submission to the AVERITEC shared task of the 7th Fact Extraction and VERification (FEVER) workshop. The task focused on discriminating whether each claim is a fact or not. Our method is powered by the combination of an LLM and a non-parametric lexicon-based method (i.e., BM25). Essentially, we augmented the list of evidences containing the query and the corresponding answers using a powerful LLM, then retrieved the relevant documents using the generated evidence. As such, our method made a great improvement over the baseline, achieving a 0.33 performance gain in AVeriTeC score.
+ 2024.fever-1.7
+ park-etal-2024-dunamu
+
+
+ FZI-WIM at AVeriTeC Shared Task: Real-World Fact-Checking with Question Answering
+ JinLiu
+ SteffenThoma
+ AchimRettingerFZI Forschungszentrum Informatik and Trier University
+ 77-85
+ This paper describes the FZI-WIM system at the AVeriTeC shared task, which aims to assess evidence-based automated fact-checking systems for real-world claims with evidence retrieved from the web. The FZI-WIM system utilizes open-source models to build a reliable fact-checking pipeline via question answering. With different experimental setups, we show that more questions lead to higher scores in the shared task. In both the question generation and question-answering stages, sampling can be a way to improve the performance of our system. We further analyze the limitations of current open-source models for real-world claim verification. Our code is publicly available at https://github.com/jens5588/FZI-WIM-AVERITEC.
+ 2024.fever-1.8
+ liu-etal-2024-fzi
+
+
+ Zero-Shot Learning and Key Points Are All You Need for Automated Fact-Checking
+ MohammadMohammadkhani
+ AliMohammadkhaniShahid Soltani 4 High School
+ HamidBeigy
+ 86-90
+ Automated fact-checking is an important task because determining the accurate status of a proposed claim within the vast amount of information available online is a critical challenge. This challenge requires robust evaluation to prevent the spread of false information. Modern large language models (LLMs) have demonstrated high capability in performing a diverse range of Natural Language Processing (NLP) tasks. By utilizing proper prompting strategies, their versatility, due to their understanding of large context sizes and zero-shot learning ability, enables them to simulate human problem-solving intuition and move towards being an alternative to humans for solving problems. In this work, we introduce a straightforward framework based on Zero-Shot Learning and Key Points (ZSL-KeP) for automated fact-checking, which, despite its simplicity, performs well on the AVeriTeC shared task dataset, robustly improving over the baseline and achieving 10th place.
+ 2024.fever-1.9
+ mohammadkhani-etal-2024-zero
+
+
+ Evidence-backed Fact Checking using RAG and Few-Shot In-Context Learning with LLMs
+ RonitSingalIIT Kharagpur, India
+ PranshPatwaAditya English Medium School, India
+ ParthPatwaAmazon and University of California, Los Angeles
+ AmanChadhaAmazon
+ AmitavaDasUniversity of South Carolina
+ 91-98
+ Given the widespread dissemination of misinformation on social media, implementing fact-checking mechanisms for online claims is essential. Manually verifying every claim is very challenging, underscoring the need for an automated fact-checking system. This paper presents our system designed to address this issue. We utilize the AVeriTeC dataset (Schlichtkrull et al., 2023) to assess the performance of our fact-checking system. In addition to veracity prediction, our system provides supporting evidence, which is extracted from the dataset. We develop a Retrieve and Generate (RAG) pipeline to extract relevant evidence sentences from a knowledge base, which are then input along with the claim into a large language model (LLM) for classification. We also evaluate the few-shot In-Context Learning (ICL) capabilities of multiple LLMs. Our system achieves an AVeriTeC score of 0.33, which is a 22% absolute improvement over the baseline. Our code is publicly available at https://github.com/ronit-singhal/evidence-backed-fact-checking-using-rag-and-few-shot-in-context-learning-with-llms.
+ 2024.fever-1.10
+ singal-etal-2024-evidence
+
+
+ SK_DU Team: Cross-Encoder based Evidence Retrieval and Question Generation with Improved Prompt for the AVeriTeC Shared Task
+ ShrikantMalviya
+ StamosKatsigiannisDurham University
+ 99-107
+ As part of the AVeriTeC shared task, we developed a pipelined system comprising robust and finely tuned models. Our system integrates advanced techniques for evidence retrieval and question generation, leveraging cross-encoders and large language models (LLMs) for optimal performance. With multi-stage processing, the pipeline demonstrates improvements over baseline models, particularly in handling complex claims that require nuanced reasoning, through improved evidence extraction, question generation, and veracity prediction. Through detailed experiments and ablation studies, we provide insights into the strengths and weaknesses of our approach, highlighting the critical role of evidence sufficiency and context dependency in automated fact-checking systems. Our system secured a competitive rank in the shared task, 7th on the development data and 12th on the test data, underscoring the effectiveness of our methods in addressing the challenges of real-world claim verification.
+ 2024.fever-1.11
+ malviya-katsigiannis-2024-sk
+
+
+ InFact: A Strong Baseline for Automated Fact-Checking
+ MarkRothermelTechnische Universität Darmstadt
+ TobiasBraun
+ MarcusRohrbachTechnische Universität Darmstadt
+ AnnaRohrbachTechnische Universität Darmstadt
+ 108-112
+ The spread of disinformation poses a global threat to democratic societies, necessitating robust and scalable Automated Fact-Checking (AFC) systems. The AVeriTeC Shared Task Challenge 2024 offers a realistic benchmark for text-based fact-checking methods. This paper presents Information-Retrieving Fact-Checker (InFact), an LLM-based approach that breaks down the task of claim verification into a 6-stage process, including evidence retrieval. When using GPT-4o as the backbone, InFact achieves an AVeriTeC score of 63% on the test set, outperforming all other 20 teams competing in the challenge, and establishing a new strong baseline for future text-only AFC systems. Qualitative analysis of mislabeled instances reveals that InFact often yields a more accurate conclusion than AVeriTeC’s human-annotated ground truth.
+ 2024.fever-1.12
+ rothermel-etal-2024-infact
+
+
+ Exploring Retrieval Augmented Generation For Real-world Claim Verification
+ AdjaliOmarCEA
+ 113-117
+ Automated Fact-Checking (AFC) has recently gained considerable attention to address the increasing misinformation spreading on the web and social media. The recently introduced AVeriTeC dataset alleviates some limitations of existing AFC benchmarks. In this paper, we propose to explore Retrieval-Augmented Generation (RAG) and describe the system (UPS participant) we implemented to solve the AVeriTeC shared task. Our end-to-end system integrates retrieval and generation in a joint training setup to enhance evidence retrieval and question generation. Our system operates as follows: First, we conduct dense retrieval of evidence by encoding candidate evidence sentences from the provided knowledge store documents. Next, we perform a secondary retrieval of question-answer pairs from the training set, encoding these into dense vectors to support question generation with relevant in-context examples. During training, the question generator is optimized to generate questions based on retrieved or gold evidence. In preliminary automatic evaluation, our system achieved AVeriTeC scores of 0.198 and 0.210 on the dev and test sets, respectively.
+ 2024.fever-1.13
+ omar-2024-exploring
+
+
+ GProofT: A Multi-dimension Multi-round Fact Checking Framework Based on Claim Fact Extraction
+ JiayuLiu
+ JunhaoTang
+ HanwenWangThe Hong Kong University of Science and Technology
+ BaixuanXuHong Kong University of Science and Technology
+ HaochenShi
+ WeiqiWangJohns Hopkins University and The Hong Kong University of Science and Technology
+ YangqiuSongThe Hong Kong University of Science and Technology
+ 118-129
+ In the information era, the vast proliferation of online content poses significant challenges, particularly concerning the trustworthiness of these digital statements, which can have profound societal implications. Although it is possible to manually annotate and verify the authenticity of such content, the sheer volume and rapid pace of information generation render this approach impractical, both in terms of time and cost. Therefore, it is imperative to develop automated systems capable of validating online claims, ensuring that users can use the wealth of information available on the Internet effectively and reliably. Using primarily ChatGPT and the Google search API, the GProofT fact-checking framework generates question-answer pairs to systematically extract and verify the facts within claims. Based on the outcomes of these QA pairs, claims are subsequently labeled as Supported, Conflicted Evidence/Cherry-Picking, or Refuted. As shown by extensive experiments, GProofT Retrieval generally performs effectively in fact-checking and makes a substantial contribution to the task. Our code is released at https://github.com/HKUST-KnowComp/GProofT.
+ 2024.fever-1.14
+ liu-etal-2024-gprooft
+
+
+ HerO at AVeriTeC: The Herd of Open Large Language Models for Verifying Real-World Claims
+ YejunYoonSoongsil University
+ JaeyoonJung
+ SeunghyunYoonAdobe Research
+ KunwooParkSoongsil University
+ 130-136
+ To tackle the AVeriTeC shared task hosted by FEVER-24, we introduce a system that employs only publicly available large language models (LLMs) for each step of automated fact-checking, dubbed the Herd of Open LLMs for verifying real-world claims (HerO). For evidence retrieval, a language model is used to enhance a query by generating hypothetical documents that check the veracity of a claim. We fine-tune LLMs for question generation and veracity prediction by crafting prompts with retrieved in-context samples. HerO achieved 2nd place on the leaderboard with an AVeriTeC score of 0.57, suggesting the potential of open LLMs for verifying real-world claims. For future research, we make our code publicly available at https://github.com/ssu-humane/HerO.
+ 2024.fever-1.15
+ yoon-etal-2024-hero
+
+
+ AIC CTU system at AVeriTeC: Re-framing automated fact-checking as a simple RAG task
+ HerbertUllrich
+ TomášMlynářCzech Technical University in Prague
+ JanDrchalCzech Technical University in Prague
+ 137-150
+ This paper describes our 3rd-place submission in the AVeriTeC shared task, in which we attempted to address the challenge of fact-checking with evidence retrieved in the wild using a simple scheme of Retrieval-Augmented Generation (RAG) designed for the task, leveraging the predictive power of Large Language Models. We release our codebase and explain its two modules, the Retriever and the Evidence & Label generator, in detail, justifying their features such as MMR-reranking and Likert-scale confidence estimation. We evaluate our solution on the AVeriTeC dev and test sets and interpret the results, picking GPT-4o as the most appropriate model for our pipeline at the time of our publication, with Llama 3.1 70B being a promising open-source alternative. We perform an empirical error analysis to see that faults in our predictions often coincide with noise in the data or ambiguous fact-checks, provoking further research and data augmentation.
+ 2024.fever-1.16
+ ullrich-etal-2024-aic
+
+
+ Enhancing Fact Verification with Causal Knowledge Graphs and Transformer-Based Retrieval for Deductive Reasoning
+ FionaTan
+ JayDesaiAmazon
+ SrinivasanSengameduAmazon
+ 151-169
+ The ability to extract and verify factual information from free-form text is critical in an era where vast amounts of unstructured data are available, yet unreliable sources abound. This paper focuses on enhancing causal deductive reasoning, a key component of factual verification, through the lens of accident investigation, where determining the probable causes of events is paramount. Deductive reasoning refers to the task of drawing conclusions based on a premise. While some deductive reasoning benchmarks exist, none focus on causal deductive reasoning and are from real-world applications. Recently, large language models (LLMs) used with prompt engineering techniques like retrieval-augmented generation (RAG) have demonstrated remarkable performance across various natural language processing benchmarks. However, adapting these techniques to handle scenarios with no knowledge bases and to different data structures, such as graphs, remains an ongoing challenge. In our study, we introduce a novel framework leveraging LLMs’ decent ability to detect and infer causal relations to construct a causal Knowledge Graph (KG) which represents knowledge that the LLM recognizes. Additionally, we propose a RoBERTa-based Transformer Graph Neural Network (RoTG) specifically designed to select relevant nodes within this KG. Integrating RoTG-retrieved causal chains into prompts effectively enhances LLM performance, demonstrating the usefulness of our approach in advancing LLMs’ causal deductive reasoning capabilities.
+ 2024.fever-1.20
+ tan-etal-2024-enhancing-fact
+
+
+ Numerical Claim Detection in Finance: A New Financial Dataset, Weak-Supervision Model, and Market Analysis
+ AgamShah
+ ArnavHiray
+ PratviShah
+ ArkaprabhaBanerjee
+ AnushkaSingh
+ DheerajEidnaniGeorgia Institute of Technology
+ SahasraChava
+ BhaskarChaudhury
+ SudheerChavaGeorgia Institute of Technology
+ 170-185
+ In this paper, we investigate the influence of claims in analyst reports and earnings calls on financial market returns, considering them as significant quarterly events for publicly traded companies. To facilitate a comprehensive analysis, we construct a new financial dataset for the claim detection task in the financial domain. We benchmark various language models on this dataset and propose a novel weak-supervision model that incorporates the knowledge of subject matter experts (SMEs) in the aggregation function, outperforming existing approaches. We also demonstrate the practical utility of our proposed model by constructing a novel measure of *optimism*. Here, we observe the dependence of earnings surprise and return on our optimism measure. Our dataset, models, and code are publicly (under CC BY 4.0 license) available on GitHub.
+ 2024.fever-1.21
+ shah-etal-2024-numerical
+
+
+ Streamlining Conformal Information Retrieval via Score Refinement
+ YotamIntrator
+ RegevCohenGoogle
+ OriKelner
+ RomanGoldenberg
+ EhudRivlinTechnion, Technion
+ DanielFreedmanVerily
+ 186-191
+ Information retrieval (IR) methods, like retrieval augmented generation, are fundamental to modern applications but often lack statistical guarantees. Conformal prediction addresses this by retrieving sets guaranteed to include relevant information, yet existing approaches produce large-sized sets, incurring high computational costs and slow response times. In this work, we introduce a score refinement method that applies a simple monotone transformation to retrieval scores, leading to significantly smaller conformal sets while maintaining their statistical guarantees. Experiments on various BEIR benchmarks validate the effectiveness of our approach in producing compact sets containing relevant information.
+ 2024.fever-1.22
+ intrator-etal-2024-streamlining
+
+
+ Improving Explainable Fact-Checking via Sentence-Level Factual Reasoning
+ FrancielleVargas
+ IsadoraSalles
+ DiegoAlvesUniversität des Saarlandes
+ AmeetaAgrawalPortland State University
+ ThiagoPardoUniversidade de São Paulo
+ FabrícioBenevenuto
+ 192-204
+ Most existing fact-checking systems are unable to explain their decisions by providing relevant rationales (justifications) for their predictions. This lack of transparency poses significant risks, such as the prevalence of unexpected biases, which may increase political polarization due to limitations in impartiality. To address this critical gap, we introduce SEntence-Level FActual Reasoning (SELFAR), aimed at improving explainable fact-checking. SELFAR relies on fact extraction and verification by predicting the news source reliability and factuality (veracity) of news articles or claims at the sentence level, generating post-hoc explanations using SHAP/LIME and zero-shot prompts. Our experiments show that unreliable news stories predominantly consist of subjective statements, in contrast to reliable ones. Consequently, predicting unreliable news articles at the sentence level by analyzing impartiality and subjectivity is a promising approach for fact extraction and for improving explainable fact-checking. Furthermore, LIME outperforms SHAP in explaining predictions on reliability. Additionally, while zero-shot prompts provide highly readable explanations and achieve an accuracy of 0.71 in predicting factuality, their tendency to hallucinate remains a challenge. Lastly, this paper presents the first study on explainable fact-checking in the Portuguese language.
+ 2024.fever-1.23
+ vargas-etal-2024-improving
+
+
+ Fast Evidence Extraction for Grounded Language Model Outputs
+ PranavManiAbridge AI
+ DavisLiangAbridge
+ ZacharyLiptonCarnegie Mellon University
+ 205-218
+ Summarizing documents with Large Language Models (LLMs) warrants a rigorous inspection of the resulting outputs by humans. However, unaided verification of generated outputs is time-intensive and intractable at scale. For high-stakes applications like healthcare where verification is necessary, expediting this step can unlock massive gains in productivity. In this paper, we focus on the task of evidence extraction for abstractive summarization: for each summary line, extract the corresponding evidence spans from a source document. Viewing this evidence extraction problem through the lens of extractive question answering, we train a set of fast and scalable hierarchical architectures: EarlyFusion, MidFusion, and LateFusion. Our experiments show that (i) our method outperforms the state-of-the-art by 1.4% relative F1-Score; (ii) our model architecture reduces latency by 4x over a RoBERTa-Large baseline; and (iii) pretraining on an extractive QA corpus confers positive transfer to evidence extraction, especially in low-resource regimes.
+ 2024.fever-1.24
+ mani-etal-2024-fast
+
+
+ Question-Based Retrieval using Atomic Units for Enterprise RAG
+ VatsalRaina
+ MarkGalesUniversity of Cambridge
+ 219-233
+ Enterprise retrieval augmented generation (RAG) offers a highly flexible framework for combining powerful large language models (LLMs) with internal, possibly temporally changing, documents. In RAG, documents are first chunked. Relevant chunks are then retrieved for a user query, which are passed as context to a synthesizer LLM to generate the query response. However, the retrieval step can limit performance, as incorrect chunks can lead the synthesizer LLM to generate a false response. This work applies a zero-shot adaptation of standard dense retrieval steps for more accurate chunk recall. Specifically, a chunk is first decomposed into atomic statements. A set of synthetic questions are then generated on these atoms (with the chunk as the context). Dense retrieval involves finding the closest set of synthetic questions, and associated chunks, to the user query. It is found that retrieval with the atoms leads to higher recall than retrieval with chunks. Further performance gain is observed with retrieval using the synthetic questions generated over the atoms. Higher recall at the retrieval step enables higher performance of the enterprise LLM using the RAG pipeline.
+ 2024.fever-1.25
+ raina-gales-2024-question
+
+
+ AMREx: AMR for Explainable Fact Verification
+ ChathuriJayaweeraUniversity of Florida
+ SangpilYoumUniversity of Florida
+ BonnieDorrUniversity of Florida
+ 234-244
+ With the advent of social media networks and the vast amount of information circulating through them, automatic fact verification is an essential component to prevent the spread of misinformation. It is even more useful to have fact verification systems that provide explanations along with their classifications to ensure accurate predictions. To address both of these requirements, we implement AMREx, an Abstract Meaning Representation (AMR)-based veracity prediction and explanation system for fact verification. AMREx uses Smatch, an AMR evaluation metric, to measure meaning containment and textual similarity, and we demonstrate its effectiveness in producing partially explainable justifications on two community-standard fact verification datasets, FEVER and AVeriTeC. AMREx surpasses the AVeriTeC baseline accuracy, showing the effectiveness of our approach for real-world claim verification. It follows an interpretable pipeline and returns an explainable AMR node mapping to clarify the system’s veracity predictions when applicable. We further demonstrate that AMREx output can be used to prompt LLMs to generate natural-language explanations using the AMR mappings as a guide to lessen the probability of hallucinations.
+ 2024.fever-1.26
+ jayaweera-etal-2024-amrex
+
+
+ Claim Check-Worthiness Detection: How Well do LLMs Grasp Annotation Guidelines?
+ LauraMajer
+ JanŠnajderUniZg-FER, University of Zagreb
+ 245-263
+ The rising threat of disinformation underscores the need to fully or partially automate the fact-checking process. Identifying text segments requiring fact-checking is known as claim detection (CD) and claim check-worthiness detection (CW), the latter incorporating complex domain-specific criteria of worthiness and often framed as a ranking task. Zero- and few-shot LLM prompting is an attractive option for both tasks, as it bypasses the need for labeled datasets and allows verbalized claim and worthiness criteria to be directly used for prompting. We evaluate the LLMs’ predictive accuracy on five CD/CW datasets from diverse domains, using corresponding annotation guidelines in prompts. We examine two key aspects: (1) how to best distill factuality and worthiness criteria into a prompt, and (2) how much context to provide for each claim. To this end, we experiment with different levels of prompt verbosity and varying amounts of contextual information given to the model. We additionally evaluate the top-performing models with ranking metrics, resembling prioritization done by fact-checkers. Our results show that optimal prompt verbosity varies, that metadata alone adds a greater performance boost than co-text, and that confidence scores can be directly used to produce reliable check-worthiness rankings.
+ 2024.fever-1.27
+ majer-snajder-2024-claim
+
+
+ Contrastive Learning to Improve Retrieval for Real-World Fact Checking
+ AniruddhSriram
+ FangyuanXuUniversity of Texas at Austin
+ EunsolChoiNew York University
+ GregDurrettUniversity of Texas, Austin
+ 264-279
+ Recent work on fact-checking addresses a realistic setting where models incorporate evidence retrieved from the web to decide the veracity of claims. A bottleneck in this pipeline is in retrieving relevant evidence: traditional methods may surface documents directly related to a claim, but fact-checking complex claims requires more inferences. For instance, a document about how a vaccine was developed is relevant to addressing claims about what it might contain, even if it does not address them directly. We present Contrastive Fact-Checking Reranker (CFR), an improved retriever for this setting. By leveraging the AVeriTeC dataset, which annotates subquestions for claims with human written answers from evidence documents, we fine-tune Contriever with a contrastive objective based on multiple training signals, including distillation from GPT-4, evaluating subquestion answers, and gold labels in the dataset. We evaluate our model on both retrieval and end-to-end veracity judgments about claims. On the AVeriTeC dataset, we find a 6% improvement in veracity classification accuracy. We also show our gains can be transferred to FEVER, ClaimDecomp, HotpotQA, and a synthetic dataset requiring retrievers to make inferences.
+ 2024.fever-1.28
+ sriram-etal-2024-contrastive
+
+
+ RAGAR, Your Falsehood Radar: RAG-Augmented Reasoning for Political Fact-Checking using Multimodal Large Language Models
+ MohammedKhaliq
+ PaulChangappliedAI Initiative GmbH
+ MingyangMa
+ BernhardPflugfelderappliedAI Initiative GmbH
+ FilipMiletićUniversity of Stuttgart
+ 280-296
+ The escalating challenge of misinformation, particularly in political discourse, requires advanced fact-checking solutions; this is even clearer in the more complex scenario of multimodal claims. We tackle this issue using a multimodal large language model in conjunction with retrieval-augmented generation (RAG), and introduce two novel reasoning techniques: Chain of RAG (CoRAG) and Tree of RAG (ToRAG). They fact-check multimodal claims by extracting both textual and image content, retrieving external information, and reasoning over subsequent questions to be answered based on prior evidence. We achieve a weighted F1-score of 0.85, surpassing a baseline reasoning technique by 0.14 points. Human evaluation confirms that the vast majority of our generated fact-check explanations contain all information from gold standard data.
+ 2024.fever-1.29
+ khaliq-etal-2024-ragar
+
+
+ FactGenius: Combining Zero-Shot Prompting and Fuzzy Relation Mining to Improve Fact Verification with Knowledge Graphs
+ SushantGautam
+ RoxanaPop
+ 297-306
+ Fact-checking is a crucial natural language processing (NLP) task that verifies the truthfulness of claims by considering reliable evidence. Traditional methods are labour-intensive, and most automatic approaches focus on using documents as evidence. In this paper, we focus on the relatively understudied fact-checking with Knowledge Graph data as evidence and experiment on the recently introduced FactKG benchmark. We present FactGenius, a novel method that enhances fact-checking by combining zero-shot prompting of large language models (LLMs) with fuzzy text matching on knowledge graphs (KGs). Our method employs LLMs for filtering relevant connections from the graph and validates these connections via distance-based matching. The evaluation of FactGenius on an existing benchmark demonstrates its effectiveness, as we show it significantly outperforms state-of-the-art methods. The code and materials are available at https://github.com/SushantGautam/FactGenius.
+ 2024.fever-1.30
+ gautam-pop-2024-factgenius
+
+
+ Fact or Fiction? Improving Fact Verification with Knowledge Graphs through Simplified Subgraph Retrievals
+ TobiasOpsahl
+ 307-316
+ Despite recent success in natural language processing (NLP), fact verification remains a difficult task. Because misinformation spreads increasingly fast, attention has been directed towards automatically verifying the correctness of claims. In the domain of NLP, this is usually done by training supervised machine learning models to verify claims by utilizing evidence from trustworthy corpora. We present efficient methods for verifying claims on a dataset where the evidence is in the form of structured knowledge graphs. We use the FactKG dataset, which is constructed from the DBpedia knowledge graph extracted from Wikipedia. By simplifying the evidence retrieval process, from fine-tuned language models to simple logical retrievals, we are able to construct models that both require less computational resources and achieve better test-set accuracy.
+ 2024.fever-1.32
+ opsahl-2024-fact
+
+
+
diff --git a/data/xml/2024.futured.xml b/data/xml/2024.futured.xml
new file mode 100644
index 0000000000..5a001131fe
--- /dev/null
+++ b/data/xml/2024.futured.xml
@@ -0,0 +1,100 @@
+
+
+
+
+ Proceedings of the Workshop on the Future of Event Detection (FuturED)
+ JoelTetreault
+ ThienNguyen
+ HemankLamba
+ AmandaHughes
+ Association for Computational Linguistics
+ Miami, Florida, USA
+ November
+ 2024
+ 2024.futured-1
+ futured
+
+
+ 2024.futured-1.0
+ futured-2024-1
+
+
+ BERTrend: Neural Topic Modeling for Emerging Trends Detection
+ AllaaBoutalebSorbonne University / RTE France
+ JeromePicaultRTE
+ GuillaumeGrosjeanRéseau de Transport d’Électricité (RTE)
+ 1-17
+ 2024.futured-1.1
+ boutaleb-etal-2024-bertrend
+
+
+ An Incremental Clustering Baseline for Event Detection on Twitter
+ MarjolaineRayLattice
+ QiWangLattice
+ FrédériqueMélanie-BecquetLattice
+ ThierryPoibeauLATTICE (CNRS & ENS/PSL)
+ BéatriceMazoyerSciencesPo
+ 18-24
+ 2024.futured-1.2
+ ray-etal-2024-incremental
+
+
+ DEGREE^2: Efficient Extraction of Multiple Events Using Language Models
+ PhilipBlairBlair Software
+ KfirBarReichman University
+ 25-31
+ 2024.futured-1.3
+ blair-bar-2024-degree
+
+
+ MUMOSA, Interactive Dashboard for MUlti-MOdal Situation Awareness
+ Stephanie M.LukinU.S. Army Research Laboratory
+ ShawnBowserU.S. Army Research Laboratory
+ ReeceSuchockiUniversity of Colorado Boulder
+ DouglasSummers-StayU.S. Army Research Laboratory
+ FrancisFerraroUniversity of Maryland, Baltimore County
+ CynthiaMatuszekUMBC
+ ClareVossArmy Research Laboratory
+ 32-47
+ 2024.futured-1.4
+ lukin-etal-2024-mumosa
+
+
+ Reasoning and Tools for Human-Level Forecasting
+ ElvisHsiehUC Berkeley
+ PrestonFuUC Berkeley
+ JonathanChenUC Berkeley
+ 48-57
+ 2024.futured-1.5
+ hsieh-etal-2024-reasoning
+
+
+ A Comprehensive Survey on Document-Level Information Extraction
+ HanwenZhengVirginia Tech
+ SijiaWangVirginia Tech
+ LifuHuangVirginia Tech
+ 58-72
+ 2024.futured-1.6
+ zheng-etal-2024-comprehensive
+
+
+ Generative Approaches to Event Extraction: Survey and Outlook
+ ÉtienneSimonUniversity of Oslo
+ HeleneOlsenUniversity of Oslo
+ HuilingYouUniversity of Oslo
+ SamiaTouilebUniversity of Bergen
+ LiljaØvrelidDept of Informatics, University of Oslo
+ ErikVelldalUniversity of Oslo
+ 73-86
+ 2024.futured-1.7
+ simon-etal-2024-generative
+
+
+
diff --git a/data/xml/2024.genbench.xml b/data/xml/2024.genbench.xml
new file mode 100644
index 0000000000..6a97e72505
--- /dev/null
+++ b/data/xml/2024.genbench.xml
@@ -0,0 +1,171 @@
+
+
+
+
+ Proceedings of the 2nd GenBench Workshop on Generalisation (Benchmarking) in NLP
+ DieuwkeHupkes
+ VernaDankers
+ KhuyagbaatarBatsuren
+ AmirhosseinKazemnejad
+ ChristosChristodoulopoulos
+ MarioGiulianelli
+ RyanCotterell
+ Association for Computational Linguistics
+ Miami, Florida, USA
+ November
+ 2024
+ 2024.genbench-1
+ genbench
+
+
+ 2024.genbench-1.0
+ genbench-2024-1
+
+
+ Evaluating the fairness of task-adaptive pretraining on unlabeled test data before few-shot text classification
+ KushDubeyIndependent
+ 1-26
+ Few-shot learning benchmarks are critical for evaluating modern NLP techniques. It is possible, however, that benchmarks favor methods which easily make use of unlabeled text, because researchers can use unlabeled text from the test set to pretrain their models. Given the dearth of research on this potential problem, we run experiments to quantify the bias caused by pretraining on unlabeled test set text instead of on unlabeled, independently drawn text. Controlled few-shot and zero-shot experiments on 25 classification tasks and 3 language models—BERT, GPT-2, and Mistral 7B—do not find evidence of overoptimism. Furthermore, we demonstrate the importance of repeated subsampling when studying few-shot text classification, and recommend that few-shot learning benchmarks include multiple training folds. Code and data are available here: https://github.com (currently omitted for anonymity).
+ 2024.genbench-1.1
+ dubey-2024-evaluating
+
+
+ From Language to Pixels: Task Recognition and Task Learning in LLMs
+ JanekFalkenstein
+ CarolinSchuster
+ AlexanderBergerTechnische Universität München
+ GeorgGrohTechnical University Munich
+ 27-41
+ LLMs can perform unseen tasks by learning from a few in-context examples. How in-context learning works is still uncertain. We investigate the mechanisms of in-context learning on a challenging non-language task. The task requires the LLM to generate pixel matrices representing images of basic shapes. We introduce a framework to analyze if this task is solved by recognizing similar formats from the training data (task recognition) or by understanding the instructions and learning the skill de novo during inference (task learning). Our experiments demonstrate that LLMs generate meaningful pixel matrices with task recognition and fail to learn such tasks when encountering unfamiliar formats. Our findings offer insights into LLMs’ learning mechanisms and their generalization ability to guide future research on their seemingly human-like behavior.
+ 2024.genbench-1.2
+ falkenstein-etal-2024-language
+
+
+ The SlayQA benchmark of social reasoning: testing gender-inclusive generalization with neopronouns
+ BastianBunzeckUniversität Bielefeld
+ SinaZarrießBielefeld University
+ 42-53
+ We introduce SlayQA, a novel benchmark data set designed to evaluate language models’ ability to handle gender-inclusive language, specifically the use of neopronouns, in a question-answering setting. Derived from the Social IQa data set, SlayQA modifies context-question-answer triples to include gender-neutral pronouns, creating a significant linguistic distribution shift in comparison to common pre-training corpora like C4 or Dolma. Our results show that state-of-the-art language models struggle with the challenge, exhibiting small but noticeable performance drops when answering questions containing neopronouns compared to those without.
+ 2024.genbench-1.3
+ bunzeck-zarriess-2024-slayqa
+
+
+ Automated test generation to evaluate tool-augmented LLMs as conversational AI agents
+ SamuelArcadinhoZendesk
+ DavidAparicio
+ MarianaAlmeida
+ 54-68
+ Tool-augmented LLMs are a promising approach to create AI agents that can have realistic conversations, follow procedures, and call appropriate functions. However, evaluating them is challenging due to the diversity of possible conversations, and existing datasets focus only on single interactions and function-calling. We present a test generation pipeline to evaluate LLMs as conversational AI agents. Our framework uses LLMs to generate diverse tests grounded on user-defined procedures. For that, we use intermediate graphs to limit the LLM test generator’s tendency to hallucinate content that is not grounded on input procedures and to enforce high coverage of the possible conversations. Additionally, we put forward ALMITA, a manually curated dataset for evaluating AI agents in customer support, and use it to evaluate existing LLMs. Our results show that while tool-augmented LLMs perform well in single interactions, they often struggle to handle complete conversations. While our focus is on customer support, our test generation pipeline is general enough to evaluate different AI agents.
+ 2024.genbench-1.4
+ arcadinho-etal-2024-automated
+
+
+ MMLU-SR: A Benchmark for Stress-Testing Reasoning Capability of Large Language Models
+ WentianWang
+ SarthakJain
+ PaulKantorUniversity of Wisconsin - Madison, Rutgers University, New Brunswick and Paul B Kantor, Consultant
+ JacobFeldmanRutgers University
+ LazarosGallosRutgers University
+ HaoWangRutgers University
+ 69-85
+ We propose MMLU-SR, a novel dataset designed to measure the true comprehension abilities of Large Language Models (LLMs) by challenging their performance in question-answering tasks with modified terms. We reasoned that an agent that “truly” understands a concept can still evaluate it when key terms are replaced by suitably defined alternate terms, and sought to differentiate such comprehension from mere text replacement. In our study, we modified standardized test questions by replacing a key term with a dummy word along with its definition. The key term could be in the context of questions, answers, or both questions and answers. Notwithstanding the high scores achieved by recent popular LLMs on the MMLU leaderboard, we found a substantial reduction in model performance after such replacement, suggesting poor comprehension. MMLU-SR thus provides a rigorous benchmark for testing true model comprehension and poses a challenge to the broader scientific community.
+ 2024.genbench-1.5
+ wang-etal-2024-mmlu
+
+
+ MLissard: Multilingual Long and Simple Sequential Reasoning Benchmarks
+ MirelleBueno
+ RobertoLotufoUniversity of Campinas, Universidade Estadual de Campinas
+ RodrigoFrassetto NogueiraUniversidade Estadual de Campinas
+ 86-95
+ Language models are now capable of solving tasks that require dealing with long sequences consisting of hundreds of thousands of tokens. However, they often fail on tasks that require repetitive use of simple rules, even on sequences that are much shorter than those seen during training. For example, state-of-the-art LLMs can find common items in two lists with up to 20 items but fail when lists have 80 items. In this paper, we introduce MLissard, a multilingual benchmark designed to evaluate models’ abilities to process and generate texts of varied lengths, offering a mechanism for controlling sequence complexity. Our evaluation of open-source and proprietary models shows a consistent decline in performance across all models and languages as the complexity of the sequence increases. Surprisingly, the use of in-context examples in languages other than English helps increase extrapolation performance significantly.
+ 2024.genbench-1.6
+ bueno-etal-2024-mlissard
+
+
+ MultiPragEval: Multilingual Pragmatic Evaluation of Large Language Models
+ DojunParkSeoul National University
+ JiwooLee
+ SeohyunPark
+ HyeyunJeong
+ YoungeunKoo
+ SoonhaHwangYonsei University
+ SeonwooPark
+ SungeunLee
+ 96-119
+ As the capabilities of Large Language Models (LLMs) expand, it becomes increasingly important to evaluate them beyond basic knowledge assessment, focusing on higher-level language understanding. This study introduces MultiPragEval, the first multilingual pragmatic evaluation of LLMs, designed for English, German, Korean, and Chinese. Comprising 1200 question units categorized according to Grice’s Cooperative Principle and its four conversational maxims, MultiPragEval enables an in-depth assessment of LLMs’ contextual awareness and their ability to infer implied meanings. Our findings demonstrate that Claude3-Opus significantly outperforms other models in all tested languages, establishing a state-of-the-art in the field. Among open-source models, Solar-10.7B and Qwen1.5-14B emerge as strong competitors. By analyzing pragmatic inference, we provide valuable insights into the capabilities essential for advanced language comprehension in AI systems.
+ 2024.genbench-1.7
+ park-etal-2024-multiprageval
+
+
+ Beyond the Numbers: Transparency in Relation Extraction Benchmark Creation and Leaderboards
+ VarvaraArzt
+ AllanHanburyComplexity Science Hub and Technische Universität Wien
+ 120-130
+ This paper investigates the transparency in the creation of benchmarks and the use of leaderboards for measuring progress in NLP, with a focus on the relation extraction (RE) task. Existing RE benchmarks often suffer from insufficient documentation, lacking crucial details such as data sources, inter-annotator agreement, the algorithms used for the selection of instances for datasets, and information on potential biases like dataset imbalance. Progress in RE is frequently measured by leaderboards that rank systems based on evaluation methods, typically limited to aggregate metrics like F1-score. However, the absence of detailed performance analysis beyond these metrics can obscure the true generalisation capabilities of models. Our analysis reveals that widely used RE benchmarks, such as TACRED and NYT, tend to be highly imbalanced and contain noisy labels. Moreover, the lack of class-based performance metrics fails to accurately reflect model performance across datasets with a large number of relation types. These limitations should be carefully considered when reporting progress in RE. While our discussion centers on the transparency of RE benchmarks and leaderboards, the observations we discuss are broadly applicable to other NLP tasks as well. Rather than undermining the significance and value of existing RE benchmarks and the development of new models, this paper advocates for improved documentation and more rigorous evaluation to advance the field.
+ 2024.genbench-1.8
+ arzt-hanbury-2024-beyond
+
+
+ Is artificial intelligence still intelligence? LLMs generalize to novel adjective-noun pairs, but don’t mimic the full human distribution
+ HayleyRossHarvard University
+ KathrynDavidsonHarvard University
+ NajoungKimBoston University and Google
+ 131-153
+ Inferences from adjective-noun combinations like “Is artificial intelligence still intelligence?” provide a good test bed for LLMs’ understanding of meaning and compositional generalization capability, since there are many combinations which are novel to both humans and LLMs but nevertheless elicit convergent human judgments. We study a range of LLMs and find that the largest models we tested are able to draw human-like inferences when the inference is determined by context and can generalize to unseen adjective-noun combinations. We also propose three methods to evaluate LLMs on these inferences out of context, where there is a distribution of human-like answers rather than a single correct answer. We find that LLMs show a human-like distribution on at most 75% of our dataset, which is promising but still leaves room for improvement.
+ 2024.genbench-1.9
+ ross-etal-2024-artificial
+
+
+ CHIE: Generative MRC Evaluation for in-context QA with Correctness, Helpfulness, Irrelevancy, and Extraneousness Aspects
+ WannaphongPhatthiyaphaibunVidyasirimedhi Institute of Science and Technology
+ SuraponNonesungSCB 10X
+ PeeratLimkonchotiwatAI Singapore
+ CanUdomcharoenchaikitVidyasirimedhi Institute of Science and Technology (VISTEC)
+ JitkapatSawatphol
+ EkapolChuangsuwanichChulalongkorn University
+ SaranaNutanong
+ 154-164
+ The evaluation of generative models in Machine Reading Comprehension (MRC) presents distinct difficulties, as traditional metrics like BLEU, ROUGE, METEOR, Exact Match, and F1 score often struggle to capture nuanced and diverse responses. While embedding-based metrics such as BERTScore and BARTScore focus on semantic similarity, they still fail to fully address aspects such as recognizing additional helpful information and rewarding contextual faithfulness. Recent advances in large language model (LLM) based metrics offer more fine-grained evaluations, but challenges such as score clustering remain. This paper introduces a multi-aspect evaluation framework, CHIE, incorporating aspects of Correctness, Helpfulness, Irrelevance, and Extraneousness. Our approach, which uses binary categorical values rather than continuous rating scales, aligns well with human judgments, indicating its potential as a comprehensive and effective evaluation method.
+ 2024.genbench-1.10
+ phatthiyaphaibun-etal-2024-chie
+
+
+ Investigating the Generalizability of Pretrained Language Models across Multiple Dimensions: A Case Study of NLI and MRC
+ RitamDuttCarnegie Mellon University
+ SagnikChoudhuryNational Board of Medical Examiners
+ VarunRao
+ CarolynRoseSchool of Computer Science, Carnegie Mellon University
+ V.G.VinodVydiswaranUniversity of Michigan - Ann Arbor
+ 165-182
+ Generalization refers to the ability of machine learning models to perform well on dataset distributions different from the ones they were trained on. While several pre-existing works have characterized the generalizability of NLP models across different dimensions, such as domain shift, adversarial perturbations, or compositional variations, most studies were carried out in a stand-alone setting, emphasizing a single dimension of interest. We bridge this gap by systematically investigating the generalizability of pre-trained language models across different architectures, sizes, and training strategies, over multiple dimensions, for the tasks of natural language inference and question answering. Our results indicate that model instances typically exhibit consistent generalization trends, i.e., they generalize equally well (or poorly) across most scenarios, and this ability is correlated with model architecture, base dataset performance, size, and training mechanism. We hope this research motivates further work in a) developing a multi-dimensional generalization benchmark for systematic evaluation and b) examining the reasons behind models’ generalization abilities. The code and data are available at https://github.com/sagnik/md-gen-nlp, and the trained models are released at https://huggingface.co/varun-v-rao.
+ 2024.genbench-1.11
+ dutt-etal-2024-investigating
+
+
+ OmniDialog: A Multimodal Benchmark for Generalization Across Text, Visual, and Audio Modalities
+ AntonRazzhigaev
+ MaximKurkinSkolkovo Institute of Science and Technology and Artificial Intelligence Research Institute
+ ElizavetaGoncharovaHigher School of Economics
+ IrinaAbdullaeva
+ AnastasiaLysenko
+ AlexanderPanchenkoSkoltech
+ AndreyKuznetsovAIRI, Sber and Samara National Research University
+ DenisDimitrovAIRI and Sber
+ 183-195
+ We introduce OmniDialog, the first trimodal comprehensive benchmark grounded in a knowledge graph (Wikidata) to evaluate the generalization of Large Multimodal Models (LMMs) across three modalities. Our benchmark consists of more than 4,000 dialogues, each averaging 10 turns, all annotated and cross-validated by human experts. The dialogues in our dataset are designed to prevent shortcut learning by incorporating various formats and misleading or irrelevant multimodal cues. We also evaluate both multimodal and unimodal models to gain insights into how they process modality inputs introduced in the conversation.
+ 2024.genbench-1.12
+ razzhigaev-etal-2024-omnidialog
+
+
+ Towards a new Benchmark for Emotion Detection in NLP: A Unifying Framework of Recent Corpora
+ AnnaKoufakouFlorida Gulf Coast University
+ ElijahNieves
+ JohnPeller
+ 196-206
+ Emotion recognition in text is a complex and evolving field that has garnered considerable interest. This paper addresses the pressing need to explore and experiment with new corpora annotated with emotions. We identified several corpora presented since 2018. We restricted this study to English single-labeled data. Nevertheless, the datasets vary in source, domain, topic, emotion types, and distributions. As a basis for benchmarking, we conducted emotion detection experiments by fine-tuning a pretrained model and compared our outcomes with results from the original publications. More importantly, in our efforts to combine existing resources, we created a unified corpus from these diverse datasets and evaluated the impact of training on that corpus versus on the training set for each corpus. Our approach aims to streamline research by offering a unified platform for emotion detection to aid comparisons and benchmarking, addressing a significant gap in the current landscape. Additionally, we present a discussion of related practices and challenges. Our code and dataset information are available at https://github.com/a-koufakou/EmoDetect-Unify. We hope this will enable the NLP community to leverage this unified framework towards a new benchmark in emotion detection.
+ 2024.genbench-1.13
+ koufakou-etal-2024-towards
+
+
+
diff --git a/data/xml/2024.mrl.xml b/data/xml/2024.mrl.xml
new file mode 100644
index 0000000000..3130a35455
--- /dev/null
+++ b/data/xml/2024.mrl.xml
@@ -0,0 +1,365 @@
+
+
+
+
+ Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024)
+ JonneSälevä
+ AbrahamOwodunni
+ Association for Computational Linguistics
+ Miami, Florida, USA
+ November
+ 2024
+ 2024.mrl-1
+ mrl
+
+
+ 2024.mrl-1.0
+ mrl-2024-1
+
+
+ SambaLingo: Teaching Large Language Models New Languages
+ ZoltanCsakiSambanova Systems
+ BoLi
+ JonathanLi
+ QiantongXuSambanova Systems
+ PianPawakapan
+ LeonZhangSambanova Systems
+ YunDuSambanova Systems
+ HengyuZhaoSambanova Systems
+ ChangranHuSambanova Systems, Inc
+ UrmishThakkerSambaNova Systems
+ 1-21
+ Despite the widespread availability of LLMs, there remains a substantial gap in their capabilities and availability across diverse languages. One approach to address these issues has been to take an existing pre-trained LLM and continue to train it on new languages. While prior works have experimented with language adaptation, many questions around best practices and methodology have not been covered. In this paper, we present a comprehensive investigation into the adaptation of LLMs to new languages. Our study covers the key components in this process, including vocabulary extension, direct preference optimization and the data scarcity problem for human alignment in low resource languages. We scale these experiments across 9 languages and 2 parameter scales (7B and 70B). We compare our models against Llama 2, Aya-101, XGLM, BLOOM and existing language experts, outperforming all prior published baselines. Additionally, all evaluation code and checkpoints are made public to facilitate future research.
+ 2024.mrl-1.1
+ csaki-etal-2024-sambalingo
+
+
+ What an Elegant Bridge: Multilingual LLMs are Biased Similarly in Different Languages
+ ViktorMihaylov
+ AleksandarShtedritski
+ 22-29
+ This paper investigates biases of Large Language Models (LLMs) through the lens of grammatical gender. Drawing inspiration from seminal works in psycholinguistics, particularly the study of gender’s influence on language perception, we leverage multilingual LLMs to revisit and expand upon the foundational experiments of Boroditsky (2003). Employing LLMs as a novel method for examining psycholinguistic biases related to grammatical gender, we prompt a model to describe nouns with adjectives in various languages, focusing specifically on languages with grammatical gender. In particular, we look at adjective co-occurrences across gender and languages, and train a binary classifier to predict grammatical gender given adjectives an LLM uses to describe a noun. Surprisingly, we find that a simple classifier can not only predict noun gender above chance but also exhibit cross-language transferability. We show that while LLMs may describe words differently in different languages, they are biased similarly.
+ 2024.mrl-1.2
+ mihaylov-shtedritski-2024-elegant-bridge
+
+
+ Adapting Open-Source Generative Large Language Models for Low-Resource Languages: A Case Study for Turkish
+ CagriToramanMETU, Middle East Technical University
+ 30-44
+ Despite advancements in English-dominant generative large language models, further development is needed for low-resource languages to enhance global accessibility. The primary methods for representing these languages are monolingual and multilingual pretraining. Monolingual pretraining is expensive due to hardware requirements, and multilingual models often have uneven performance across languages. This study explores an alternative solution by adapting large language models, primarily trained on English, to low-resource languages. We assess various strategies, including continual training, instruction fine-tuning, task-specific fine-tuning, and vocabulary extension. The results show that continual training improves language comprehension, as reflected in perplexity scores, and task-specific tuning generally enhances performance of downstream tasks. However, extending the vocabulary shows no substantial benefits. Additionally, while larger models improve task performance with few-shot tuning, multilingual models perform worse than their monolingual counterparts when adapted.
+ 2024.mrl-1.3
+ toraman-2024-adapting
+
+
+ An Efficient Approach for Studying Cross-Lingual Transfer in Multilingual Language Models
+ FahimFaisalGeorge Mason University
+ AntoniosAnastasopoulosAthena Research Center and George Mason University
+ 45-92
+ The capacity and effectiveness of pre-trained multilingual models (MLMs) for zero-shot cross-lingual transfer are well established. However, phenomena of positive or negative transfer, and the effect of language choice, still need to be fully understood, especially in the complex setting of massively multilingual LMs. We propose an efficient method to study transfer language influence in zero-shot performance on another target language. Unlike previous work, our approach disentangles downstream tasks from language, using dedicated adapter units. Our findings suggest that some languages do not largely affect others, while some languages, especially ones unseen during pre-training, can be extremely beneficial or detrimental for different target languages. We find that no transfer language is beneficial for all target languages. We do, curiously, observe that languages previously unseen by MLMs consistently benefit from transfer from almost any language. We additionally use our modular approach to quantify negative interference efficiently and categorize languages accordingly. Furthermore, we provide a list of promising transfer-target language configurations that consistently lead to target language performance improvements.
+ 2024.mrl-1.4
+ faisal-anastasopoulos-2024-efficient
+
+
+ Are You Sure? Rank Them Again: Repeated Ranking For Better Preference Datasets
+ PeterDevine
+ 93-105
+ Training Large Language Models (LLMs) with Reinforcement Learning from AI Feedback (RLAIF) aligns model outputs more closely with human preferences. This involves an evaluator model ranking multiple candidate responses to user prompts. However, the rankings from popular evaluator models such as GPT-4 can be inconsistent. We propose the Repeat Ranking method, in which we evaluate the same responses multiple times and train only on those responses which are consistently ranked. Using 2,714 training prompts in 62 languages, we generated responses from 7 top multilingual LLMs and had GPT-4 rank them five times each. Evaluating on MT-Bench chat benchmarks in six languages, our method outperformed the standard practice of training on all available prompts. Our work highlights the quality versus quantity trade-off in RLAIF dataset generation and offers a stackable strategy for enhancing dataset and thus model quality.
+ 2024.mrl-1.5
+ devine-2024-sure
+
+
+ Tagengo: A Multilingual Chat Dataset
+ PeterDevine
+ 106-113
+ Open source large language models (LLMs) have shown great improvements in recent times. However, many of these models are focused solely on popular spoken languages. We present a high-quality dataset of more than 70k prompt-response pairs in 74 languages, consisting of human-generated prompts and synthetic responses. We use this dataset to train a state-of-the-art open source English LLM to chat multilingually. We evaluate our model on MT-Bench chat benchmarks in 6 languages, finding that our multilingual model outperforms previous state-of-the-art open source LLMs across each language. We further find that training on more multilingual data is beneficial to the performance in a chosen target language (Japanese) compared to simply training on only data in that language. These results indicate the necessity of training on large amounts of high-quality multilingual data to make a more accessible LLM.
+ 2024.mrl-1.6
+ devine-2024-tagengo
+
+
+ Language and Task Arithmetic with Parameter-Efficient Layers for Zero-Shot Summarization
+ AlexandraChronopoulouGoogle
+ JonasPfeifferGoogle DeepMind
+ JoshuaMaynezGoogle
+ XinyiWangGoogle
+ SebastianRuderCohere and Google
+ PriyankaAgrawalGoogle Deepmind
+ 114-126
+ Parameter-efficient fine-tuning (PEFT) using labeled task data can significantly improve the performance of large language models (LLMs) on the downstream task. However, there are around 7,000 languages in the world, and many of these languages lack labeled data for real-world language generation tasks. In this paper, we propose to improve zero-shot cross-lingual transfer by composing expert modules trained separately on language or task data. Our method composes language and task PEFT adapters via element-wise arithmetic operations to leverage unlabeled data and English labeled data. We extend our approach to cases where labeled data from more languages is available and propose to arithmetically compose PEFT adapters trained on languages related to the target. Empirical results on summarization demonstrate that our method obtains consistent gains using minimal training of PEFT parameters.
+ 2024.mrl-1.7
+ chronopoulou-etal-2024-language
+
+
+ Modeling Bilingual Sentence Processing: Evaluating RNN and Transformer Architectures for Cross-Language Structural Priming
+ DemiZhang
+ BushiXiao
+ ChaoGao
+ SangpilYoumUniversity of Florida
+ BonnieDorrUniversity of Florida
+ 127-136
+ This study evaluates the performance of Recurrent Neural Network (RNN) and Transformer models in replicating cross-language structural priming, a key indicator of abstract grammatical representations in human language processing. Focusing on Chinese-English priming, which involves two typologically distinct languages, we examine how these models handle the robust phenomenon of structural priming, where exposure to a particular sentence structure increases the likelihood of selecting a similar structure subsequently. Our findings indicate that transformers outperform RNNs in generating primed sentence structures, with accuracy rates that exceed 25.84% to 33.33%. This challenges the conventional belief that human sentence processing primarily involves recurrent and immediate processing and suggests a role for cue-based retrieval mechanisms. This work contributes to our understanding of how computational models may reflect human cognitive processes across diverse language families.
+ 2024.mrl-1.8
+ zhang-etal-2024-modeling-bilingual
+
+
+ Recipe for Zero-shot POS Tagging: Is It Useful in Realistic Scenarios?
+ ZenoVandenbulckeKU Leuven
+ LukasVermeire
+ MiryamDe LhoneuxKU Leuven
+ 137-147
+ POS tagging plays a fundamental role in numerous applications. While POS taggers are highly accurate in well-resourced settings, they lag behind in cases of limited or missing training data. This paper focuses on POS tagging for languages with limited data. We seek to identify favourable characteristics of datasets for training POS tagging models using related languages without specific training on the target language. This is a zero-shot approach. We investigate both mono- and multilingual models trained on related languages and compare their accuracies. Additionally, we compare these results with models trained directly on the target language itself. We do this for three target low-resource languages, for each of which we select several support languages. Our research highlights the importance of accurate dataset selection for developing effective zero-shot POS tagging models. Particularly, a strong linguistic relationship and high-quality datasets ensure optimal results. For extremely low-resource languages, zero-shot training proves to be a viable option.
+ 2024.mrl-1.9
+ vandenbulcke-etal-2024-recipe
+
+
+ Gender-specific Machine Translation with Large Language Models
+ EduardoSánchezUniversity College London, University of London and Meta
+ PierreAndrews
+ PontusStenetorpUniversity College London
+ MikelArtetxeReka AI
+ MartaCosta-jussàMeta
+ 148-158
+ While machine translation (MT) systems have seen significant improvements, it is still common for translations to reflect societal biases, such as gender bias. Decoder-only language models (LLMs) have demonstrated potential in MT, albeit with performance slightly lagging behind traditional encoder-decoder neural machine translation (NMT) systems. However, LLMs offer a unique advantage: the ability to control the properties of the output through prompting. In this study, we leverage this flexibility to explore Llama’s capability to produce gender-specific translations. Our results indicate that Llama can generate gender-specific translations with translation quality and gender bias comparable to NLLB, a state-of-the-art multilingual NMT system.
+ 2024.mrl-1.10
+ sanchez-etal-2024-gender
+
+
+ Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever
+ HanXiaoJina AI
+ BoWang
+ RohanJha
+ 159-166
+ Multi-vector dense models, such as ColBERT, have proven highly effective in information retrieval. ColBERT’s late interaction scoring approximates the joint query-document attention seen in cross-encoders while maintaining inference efficiency closer to traditional dense retrieval models, thanks to its bi-encoder architecture and recent optimizations in indexing and search. In this paper, we introduce a novel architecture and a training framework to support long context windows and multilingual retrieval. Leveraging Matryoshka Representation Loss, we further demonstrate that reducing the embedding dimensionality from 128 to 64 has an insignificant impact on the model’s retrieval performance and cuts storage requirements by up to 50%. Our new model, Jina-ColBERT-v2, demonstrates strong performance across a range of English and multilingual retrieval tasks.
+ 2024.mrl-1.11
+ xiao-etal-2024-jina
+
+
+ Cross-Lingual Named Entity Recognition for Low-Resource Languages: A Hindi-Nepali Case Study Using Multilingual BERT Models
+ DipendraYadav
+ SumaiyaSuravee
+ TobiasStraußUniversität Rostock
+ KristinaYordanovaErnst-Moritz-Arndt Universität Greifswald
+ 167-174
+ This study investigates the potential of cross-lingual transfer learning for Named Entity Recognition (NER) between Hindi and Nepali, two languages that, despite their linguistic similarities, face significant disparities in available resources. By leveraging multilingual BERT models, including RemBERT, BERT Multilingual, MuRIL, and DistilBERT Multilingual, the research examines whether pre-training them on a resource-rich language like Hindi can enhance NER performance in a resource-constrained language like Nepali and vice versa. The study conducts experiments in both monolingual and cross-lingual settings to evaluate the models’ effectiveness in transferring linguistic knowledge between the two languages. The findings reveal that while RemBERT and MuRIL perform well in monolingual contexts—RemBERT excelling in Hindi and MuRIL in Nepali—BERT Multilingual performs comparatively best in cross-lingual scenarios, in generalizing features across the languages. Although DistilBERT Multilingual demonstrates slightly lower performance in cross-lingual tasks, it balances efficiency with competitive results. The study underscores the importance of model selection based on linguistic and resource-specific contexts, highlighting that general-purpose models like BERT Multilingual are particularly well-suited for cross-lingual applications.
+ 2024.mrl-1.12
+ yadav-etal-2024-cross
+
+
+ Parameter-efficient Adaptation of Multilingual Multimodal Models for Low-resource ASR
+ AbhishekGupta
+ AmrutaParulekar
+ SameepChattopadhyay
+ PreethiJyothiIndian Institute of Technology Bombay
+ 175-185
+ Automatic speech recognition (ASR) for low-resource languages remains a challenge due to the scarcity of labeled training data. Parameter-efficient fine-tuning and text-only adaptation are two popular methods that have been used to address such low-resource settings. In this work, we investigate how these techniques can be effectively combined using a multilingual multimodal model like SeamlessM4T. Multimodal models are able to leverage unlabeled text via text-only adaptation with further parameter-efficient ASR fine-tuning, thus boosting ASR performance. We also show cross-lingual transfer from a high-resource language, achieving up to a relative 17% WER reduction over baseline in an extremely low-resource setting without any labeled speech.
+ 2024.mrl-1.13
+ gupta-etal-2024-parameter
+
+
+ Towards Cross-Linguistic Semantic Grounding using Dictionary Graph Analysis
+ EthanEschrichUniversity of Florida
+ ZoeyLiuUniversity of Florida
+ 186-188
+ Previous work has explored the structure of dictionaries as directed graphs, with arcs between words when one word is used in the definition of another. We analyze the efficacy of these methodologies and explore the cross-linguistic patterns of the strongly connected components of multiple monolingual dictionaries. We find that the number of sources in the condensation graph of a directed dictionary graph is roughly stable across multiple different languages, and present future research directions.
+ 2024.mrl-1.14
+ eschrich-liu-2024-towards
+
+
+ Vikhr: Constructing a State-of-the-art Bilingual Open-Source Instruction-Following Large Language Model for Russian
+ AleksandrNikolich
+ KonstantinKorolev
+ SergeiBratchikovMisis
+ IgorKiselevUniversity of Waterloo
+ ArtemShelmanovMohamed bin Zayed University of Artificial Intelligence
+ 189-199
+ There has been a surge in the development of various Large Language Models (LLMs). However, text generation for languages other than English often faces significant challenges, including poor generation quality and reduced computational performance due to the disproportionate representation of tokens in the model’s vocabulary. In this work, we address these issues by developing a pipeline for adaptation of English-oriented pre-trained models to other languages and constructing efficient bilingual LLMs. Using this pipeline, we construct Vikhr, a state-of-the-art bilingual open-source instruction-following LLM designed specifically for the Russian language. “Vikhr” refers to the name of the Mistral LLM series and means a “strong gust of wind.” Unlike previous Russian-language models that typically rely on LoRA adapters on top of English-oriented models, sacrificing performance for lower training costs, Vikhr features an adapted tokenizer vocabulary and undergoes the continued pre-training and instruction tuning of all weights. This not only enhances the model’s performance but also significantly improves its computational and contextual efficiency. The remarkable performance of Vikhr across various Russian-language benchmarks can also be attributed to our efforts in expanding instruction datasets and corpora for continued pre-training. Vikhr not only sets the new state of the art among open-source LLMs for Russian but even outperforms some proprietary closed-source models on certain benchmarks. The model weights, instruction sets, and code are publicly available.
+ 2024.mrl-1.15
+ nikolich-etal-2024-vikhr
+
+
+ Mitigating the Linguistic Gap with Phonemic Representations for Robust Cross-lingual Transfer
+ HaejiJung
+ ChangdaeOhDepartment of Computer Science, University of Wisconsin - Madison
+ JooeonKangSogang University
+ JiminSohnNAVER
+ KyungwooSongYonsei University
+ JinkyuKimKorea University
+ DavidMortensenCarnegie Mellon University
+ 200-211
+ Approaches to improving multilingual language understanding often struggle with significant performance gaps between high-resource and low-resource languages. While there are efforts to align the languages in a single latent space to mitigate such gaps, how different input-level representations influence such gaps has not been investigated, particularly with phonemic inputs. We hypothesize that the performance gaps are affected by representation discrepancies between those languages, and revisit the use of phonemic representations as a means to mitigate these discrepancies. To demonstrate the effectiveness of phonemic representations, we present experiments on three representative cross-lingual tasks covering 12 languages in total. The results show that phonemic representations exhibit higher similarities between languages compared to orthographic representations, and consistently outperform a grapheme-based baseline model on languages that are relatively low-resourced. We present quantitative evidence from three cross-lingual tasks demonstrating the effectiveness of phonemic representations, further supported by a theoretical analysis of the cross-lingual performance gap.
+ 2024.mrl-1.16
+ jung-etal-2024-mitigating
+
+
+ Leveraging Adapters for Improved Cross-lingual Transfer for Low-Resource Creole MT
+ MarcellFekete
+ ErnestsLavrinovics
+ NathanielRobinsonDepartment of Computer Science, Whiting School of Engineering
+ HeatherLentAalborg University
+ RajDabreNational Institute of Information and Communications Technology (NICT), National Institute of Advanced Industrial Science and Technology
+ JohannesBjervaAalborg University
+ 212-215
+ EXTENDED ABSTRACT INTRODUCTION: Creole languages are low-resource languages, often genetically related to languages like English, French, and Portuguese, due to their linguistic histories with colonialism (DeGraff, 2003). As such, Creoles stand to benefit greatly from both data-efficient methods and transfer learning from high-resource languages. At the same time, it has been observed by Lent et al. (2022b) that machine translation (MT) is a highly desired language technology by speakers of many Creoles. To this end, recent works have contributed new datasets, allowing for the development and evaluation of MT systems for Creoles (Robinson et al., 2024; Lent et al., 2024). In this work, we explore the use of the limited monolingual and parallel data available for Creoles using parameter-efficient adaptation methods. Specifically, we compare the performance of different adapter architectures over the set of available benchmarks. We find adapters a promising approach for Creoles because they are parameter-efficient and have been shown to leverage transfer learning between related languages (Faisal and Anastasopoulos, 2022). While we perform experiments across multiple Creoles, we present results only on Haitian Creole in this extended abstract. For future work, we aim to explore the potential of leveraging other high-resource languages for parameter-efficient transfer learning.
+ 2024.mrl-1.17
+ fekete-etal-2024-leveraging
+
+
+ Evaluating Multilingual Long-Context Models for Retrieval and Reasoning
+ AmeetaAgrawalPortland State University
+ AndyDangPortland State University
+ SinaBagheri NezhadPortland State University
+ RhitabratPokharel
+ RussellScheinberg
+ 216-231
+ Recent large language models (LLMs) demonstrate impressive capabilities in handling long contexts, some exhibiting near-perfect recall on synthetic retrieval tasks. However, these evaluations have mainly focused on English text and involved a single target sentence within lengthy contexts. Our work investigates how LLM performance generalizes to multilingual settings with multiple hidden target sentences. We create a new dataset – mLongRR – to comprehensively evaluate several multilingual long-context LLMs on retrieval and reasoning tasks across five languages: English, Vietnamese, Indonesian, Swahili, and Somali. These languages share the Latin script but belong to distinct language families and resource levels. Our analysis reveals a significant performance gap between languages. The best-performing models, such as Gemini-1.5 and GPT-4o, achieve around 96% accuracy in English but only around 36% in Somali with a single target sentence. However, this accuracy drops to 40% in English and 0% in Somali when dealing with three target sentences. Our findings highlight the challenges long-context LLMs face when processing longer contexts, an increase in the number of target sentences, or languages of lower resource levels.
+ 2024.mrl-1.18
+ agrawal-etal-2024-evaluating
+
+
+ Community OSCAR: A Community Effort for Multilingual Web Data
+ ManuelBrackGerman Research Center for AI and Technische Universität Darmstadt
+ MalteOstendorffDeutsche Telekom
+ PedroOrtiz SuarezCommon Crawl Foundation
+ JoséSaizBarcelona Supercomputing Center
+ IñakiCastillaBarcelona Supercomputing Center
+ JorgePalomar-GinerBarcelona Supercomputing Center
+ AlexanderShvetsBarcelona Supercomputing Center
+ PatrickSchramowskiGerman Research Center for AI
+ GeorgRehmHumboldt Universität Berlin and Deutsches Forschungszentrum für Künstliche Intelligenz
+ MartaVillegasBarcelona Supercomputing Center, Universitat Pompeu Fabra and Universitat Autònoma de Barcelona
+ KristianKerstingGerman Research Center for AI, The Hessian Center for AI and TU Darmstadt
+ 232-235
+ The development of large language models (LLMs) relies heavily on extensive, high-quality datasets. Publicly available datasets focus predominantly on English, leaving other language communities behind. To address this issue, we introduce Community OSCAR, a multilingual dataset initiative designed to address the gap between English and non-English data availability. Through a collective effort, Community OSCAR covers over 150 languages with 45 billion documents, totaling over 345 TiB of data. Initial results indicate that Community OSCAR provides valuable raw data for training LLMs and enhancing the performance of multilingual models. This work aims to contribute to the ongoing advancements in multilingual NLP and to support a more inclusive AI ecosystem by making high-quality, multilingual data more accessible to those working with low-resource languages.
+ 2024.mrl-1.19
+ brack-etal-2024-community
+
+
+ Leveraging LLMs for Translating and Classifying Mental Health Data
+ KonstantinosSkianisUniversity of Ioannina
+ A.DoğruözGhent University
+ JohnPavlopoulosAthens University of Economics and Business
+ 236-241
+ Large language models (LLMs) are increasingly used in medical fields. In mental health support, the early identification of linguistic markers associated with mental health conditions can provide valuable support to mental health professionals and reduce long waiting times for patients. Despite the benefits of LLMs for mental health support, there is limited research on their application in mental health systems for languages other than English. Our study addresses this gap by focusing on the detection of depression severity in Greek through user-generated posts which are automatically translated from English. Our results show that GPT3.5-turbo is not very successful in identifying the severity of depression in English, and its performance in Greek varies as well. Our study underscores the necessity for further research, especially for languages with fewer resources. Careful implementation is also necessary to ensure that LLMs are used effectively in mental health platforms, and human supervision remains crucial to avoid misdiagnosis.
+ 2024.mrl-1.20
+ skianis-etal-2024-leveraging
+
+
+ Bridging the Bosphorus: Advancing Turkish Large Language Models through Strategies for Low-Resource Language Adaptation and Benchmarking
+ Emre CanAcikgoz
+ MeteErdogan
+ DenizYuretKoc University
+ 242-268
+ Large Language Models (LLMs) are becoming crucial across various fields, emphasizing the urgency for high-quality models in underrepresented languages. This study explores the unique challenges faced by low-resource languages, such as data scarcity, model selection, evaluation, and computational limitations, with a special focus on Turkish. We conduct an in-depth analysis to evaluate the impact of training strategies, model choices, and data availability on the performance of LLMs designed for underrepresented languages. Our approach includes two methodologies: (i) adapting existing LLMs originally pretrained in English to understand Turkish, and (ii) developing a model from the ground up using Turkish pretraining data, both supplemented with supervised fine-tuning on a novel Turkish instruction-tuning dataset aimed at enhancing reasoning capabilities. The relative performance of these methods is evaluated through the creation of a new leaderboard for Turkish LLMs, featuring benchmarks that assess different reasoning and knowledge skills. Furthermore, we conducted experiments on data and model scaling, both during pretraining and fine-tuning, simultaneously emphasizing the capacity for knowledge transfer across languages and addressing the challenges of catastrophic forgetting encountered during fine-tuning on a different language. Our goal is to offer a detailed guide for advancing the LLM framework in low-resource linguistic contexts, thereby making natural language processing (NLP) benefits more globally accessible.
+ 2024.mrl-1.21
+ acikgoz-etal-2024-bridging
+
+
+ Unsupervised Text Representation Learning via Instruction-Tuning for Zero-Shot Dense Retrieval
+ QiuhaiZeng
+ ZimengQiuAmazon
+ Dae YonHwangAmazon AGI
+ XinHeAmazon
+ WilliamCampbell
+ 269-279
+ Dense retrieval systems are commonly used for information retrieval (IR). They rely on learning text representations through an encoder and usually require supervised modeling via labelled data, which can be costly to obtain or simply unavailable. In this study, we introduce a novel unsupervised text representation learning technique that instruction-tunes a pre-trained encoder-decoder large language model (LLM) under the dual-encoder retrieval framework. We demonstrate on multiple languages that the corpus representation can be augmented by the representations of relevant synthetic queries generated by the instruction-tuned LLM, founded on the Rao-Blackwell theorem. Furthermore, we effectively align the query and corpus text representations with self-instruct tuning. We evaluate our proposed method under low-resource settings on three English, two German, and one Portuguese retrieval datasets, measuring NDCG@10, MRR@100, and Recall@100. We significantly improve the average zero-shot retrieval performance on all metrics, improving out-of-the-box FLAN-T5 model variants by 4.73% to 6.15% in absolute NDCG@10 and exceeding four supervised dense retrievers.
+ 2024.mrl-1.22
+ zeng-etal-2024-unsupervised
+
+
+ Language Bias in Multilingual Information Retrieval: The Nature of the Beast and Mitigation Methods
+ JinruiYangThe University of Melbourne
+ FanJiang
+ TimothyBaldwinMohamed bin Zayed University of Artificial Intelligence and The University of Melbourne
+ 280-292
+ Language fairness in multilingual information retrieval (MLIR) systems is crucial for ensuring equitable access to information across diverse languages. This paper sheds light on the issue, based on the assumption that queries in different languages, but with identical semantics, should yield equivalent ranking lists when retrieving over the same multilingual document collection. We evaluate the degree of fairness using both traditional retrieval methods and a DPR neural ranker based on mBERT and XLM-R. Additionally, we introduce ‘LaKDA’, a novel loss designed to mitigate language biases in neural MLIR approaches. Our analysis exposes intrinsic language biases in current MLIR technologies, with notable disparities across the retrieval methods, and demonstrates the effectiveness of LaKDA in enhancing language fairness.
+ 2024.mrl-1.23
+ yang-etal-2024-language-bias
+
+
+ Representational Isomorphism and Alignment of Multilingual Large Language Models
+ DiWuUniversity of Amsterdam
+ YibinLeiUniversity of Amsterdam
+ AndrewYatesUniversity of Amsterdam
+ ChristofMonzUniversity of Amsterdam, University of Amsterdam
+ 293-297
+ In this extended abstract, we investigate the capability of Large Language Models (LLMs) to represent texts in multilingual contexts. Our findings reveal that sentence representations derived from LLMs exhibit a high degree of isomorphism across languages. This existing isomorphism facilitates representational alignments in few-shot settings. Specifically, by applying a contrastive objective at the representation level with only a small number (e.g., 100) of translation pairs, we significantly improve models’ performance on Semantic Textual Similarity (STS) tasks across languages.
+ 2024.mrl-1.24
+ wu-etal-2024-representational-isomorphism
+
+
+ Generalization Measures for Zero-Shot Cross-Lingual Transfer
+ SakshamBassiNew York University
+ DuyguAtamanNew York University
+ KyunghyunChoGenentech and New York University
+ 298-309
+ Building robust and reliable machine learning systems requires models with the capacity to generalize their knowledge to interpret unseen inputs with different characteristics. Traditional language model evaluation tasks lack informative metrics about model generalization, and their applicability in new settings is often measured using task and language-specific downstream performance, which is lacking in many languages and tasks. To address this gap, we explore a set of efficient and reliable measures that could aid in computing more information related to the generalization capability of language models, particularly in cross-lingual zero-shot settings. Our central hypothesis is that the sharpness of a model’s loss landscape, i.e., the representation of loss values over its weight space, can indicate its generalization potential, with a flatter landscape suggesting better generalization. We propose a novel and stable algorithm to reliably compute the sharpness of a model optimum, and demonstrate its correlation with successful cross-lingual transfer.
+ 2024.mrl-1.25
+ bassi-etal-2024-generalization
+
+
+ Detecting and Translating Language Ambiguity with Multilingual LLMs
+ BehrangMehrparvar
+ SandroPezzelleUniversity of Amsterdam
+ 310-323
+ Most languages can be ambiguous: the same text or speech may result in different actions by different readers or listeners. In this project, we propose a method to detect the ambiguity of a sentence using translation by multilingual LLMs. In particular, we hypothesize that a good machine translator should preserve the ambiguity of sentences in all target languages. Therefore, we investigate whether ambiguity is encoded in the hidden representation of a translation model or, instead, only a single meaning is encoded. In our experiments, we are able to predict the ambiguity of sentences with high accuracy using machine translation, without direct use of semantics and based only on the reconstruction error of a function that maps the forward and backward translation hidden representations to each other. The potential applications of the proposed approach span i) detecting ambiguous sentences, ii) fine-tuning existing multilingual LLMs to preserve ambiguous information, and iii) developing AI systems that can generate ambiguity-free language when needed.
+ 2024.mrl-1.26
+ mehrparvar-pezzelle-2024-detecting
+
+
+ MLT-DR: Multi-Lingual/Task Demonstration Retrieval – An Attempt towards Generalized Retriever for In-Context Learning
+ KazumaHashimotoGoogle Research
+ ArjunAkulaGoogle Research
+ KarthikRamanGoogle
+ MichaelBenderskyGoogle
+ 324-345
+ This paper presents Multi-Lingual/Task Demonstration Retrieval (MLT-DR) for in-context learning with Large Language Models (LLMs). Our goal is to investigate how dense demonstration retrieval models generalize across languages and tasks. We first convert 81 tasks into a common format, covering various languages, task types, and domains. For 8 English-based tasks among them, we use machine translation to create synthetic multi/cross-lingual tasks, translating the examples into non-English languages to explicitly cover more than 130 languages. We then use an instruction-tuned LLM to estimate the utility of demonstrations for all the tasks, and use these estimates to train the demonstration retrieval models. In our experiments, we report an interesting, counterintuitive observation: when computing embeddings of demonstrations, using both the input and the ground-truth output hurts the generalization ability of the retriever on unseen tasks whose output space is quite different from those in the seen task set. We also verify that our retriever works robustly even with LLMs that were not used during the development of the models. The retrieval models’ checkpoints are publicly available at URL-available-upon-publication.
+ 2024.mrl-1.27
+ hashimoto-etal-2024-mlt
+
+
+ McGill NLP Group Submission to the MRL 2024 Shared Task: Ensembling Enhances Effectiveness of Multilingual Small LMs
+ SenyuLi
+ HaoYu
+ JessicaOjoLelapa AI
+ DavidAdelani
+ 346-356
+ We present our systems for the three tasks and five languages included in the MRL 2024 Shared Task on Multilingual Multi-task Information Retrieval: (1) Named Entity Recognition, (2) Free-form Question Answering, and (3) Multiple-choice Question Answering. For each task, we explored the impact of selecting different multilingual language models for fine-tuning across various target languages, and implemented an ensemble system that generates final outputs based on predictions from multiple fine-tuned models. All models are large language models fine-tuned on task-specific data. Our experimental results show that a more balanced dataset would yield better results. However, when training data for certain languages are scarce, fine-tuning on a large amount of English data supplemented by a small amount of “triggering data” in the target language can produce decent results.
+ 2024.mrl-1.28
+ li-etal-2024-mcgill
+
+
+ CUNI and LMU Submission to the MRL 2024 Shared Task on Multi-lingual Multi-task Information Retrieval
+ KatharinaHämmerl
+ Andrei-AlexandruManea
+ GianlucaVicoCharles University Prague
+ JindřichHelclCharles University
+ JindřichLibovickýCharles University Prague
+ 357-364
+ We present the joint CUNI and LMU submission to the MRL 2024 Shared Task on Multi-lingual Multi-task Information Retrieval. The shared task objective was to explore how modern NLP methods can be deployed in multi-lingual low-resource settings, tested on two sub-tasks: named-entity recognition and question answering. Our solutions to the subtasks are based on data acquisition and model adaptation. We compare the performance of our submitted systems with the translate-test approach, which proved to be the most useful in the previous edition of the shared task. Our results show that using more data as well as fine-tuning recent multilingual pre-trained models leads to considerable improvements over the translate-test baseline. Our code is available at https://github.com/ufal/mrl2024-multilingual-ir-shared-task.
+ 2024.mrl-1.29
+ hammerl-etal-2024-cuni
+
+
+ Findings of the 2nd Shared Task on Multi-lingual Multi-task Information Retrieval at MRL 2024
+ FrancescoTinnerUniversity of Amsterdam
+ RaghavMantriNew York University
+ MammadHajiliMicrosoft
+ ChiamakaChukwunekeLancaster University, UK
+ DylanMasseyUniversity of Zurich
+ BenjaminAjibadeUniversity of Alabama
+ BilgeKocakVillanova University
+ AboladeDawudMasakhane
+ JonathanAtalaAnglia Ruskin University
+ HaleSirinJohns Hopkins University
+ KayodeOlaleyeUniversity of Pretoria
+ AnarRzayevKAIST
+ DavidAdelaniMcGill University
+ DuyguAtamanNew York University
+ 365-376
+ Large language models (LLMs) demonstrate exceptional proficiency in both the comprehension and generation of textual data, particularly in English, a language for which extensive public benchmarks have been established across a wide range of natural language processing (NLP) tasks. Nonetheless, their performance in multilingual contexts and specialized domains remains less rigorously validated, raising questions about their reliability and generalizability across linguistically diverse and domain-specific settings. The second edition of the Shared Task on Multilingual Multitask Information Retrieval aims to provide a comprehensive and inclusive multilingual evaluation benchmark that aids in assessing the ability of multilingual LLMs to capture logical, factual, or causal relationships within lengthy text contexts and to generate language under sparse settings, particularly in scenarios with under-resourced languages. The shared task consists of two subtasks crucial to information retrieval: named entity recognition (NER) and reading comprehension (RC), in 7 data-scarce languages, including Azerbaijani, Swiss German, and Turkish, which previously lacked annotated resources in information retrieval tasks. This year's edition specifically focuses on the multiple-choice question answering evaluation setting, which provides a more objective basis for comparing different methods across languages.
+ 2024.mrl-1.30
+ tinner-etal-2024-findings
+
+
+
diff --git a/data/xml/2024.nllp.xml b/data/xml/2024.nllp.xml
new file mode 100644
index 0000000000..babbf8b0e3
--- /dev/null
+++ b/data/xml/2024.nllp.xml
@@ -0,0 +1,383 @@
+
+
+
+
+ Proceedings of the Natural Legal Language Processing Workshop 2024
+ NikolaosAletras
+ IliasChalkidis
+ LeslieBarrett
+ CătălinaGoanță
+ DanielPreoțiuc-Pietro
+ GerasimosSpanakis
+ Association for Computational Linguistics
+ Miami, FL, USA
+ November
+ 2024
+ 2024.nllp-1
+ nllp
+
+
+ 2024.nllp-1.0
+ nllp-2024-1
+
+
+ LeGen: Complex Information Extraction from Legal sentences using Generative Models
+ ChaitraC RBITS Pilani, Hyderabad Campus
+ SankalpKulkarniBITS Hyderabad
+ Sai Rama Akash VarmaSagiBirla Institute of Technology and Science, Pilani - Hyderabad Campus
+ ShashankPandeyBITS Pilani, Hyderabad
+ RohitYalavarthyBITS Pilani, Hyderabad Campus
+ DipanjanChakrabortyBITS Pilani, Hyderabad
+ Prajna DeviUpadhyayBITS Pilani Hyderabad
+ 1-17
+ Constructing legal knowledge graphs from unstructured legal texts is a complex challenge due to the intricate nature of legal language. While open information extraction (OIE) techniques can convert text into triples of the form (subject, relation, object), they often fall short of capturing the nuanced relationships within lengthy legal sentences, necessitating more sophisticated approaches known as complex information extraction. This paper proposes LeGen – an end-to-end approach leveraging pre-trained large language models (GPT-4o, T5, BART) to perform complex information extraction from legal sentences. LeGen learns and represents the discourse structure of legal sentences, capturing both their complexity and semantics. It minimizes error propagation typical in multi-step pipelines and achieves up to a 32.2% gain on the Indian Legal benchmark. Additionally, it demonstrates competitive performance on open information extraction benchmarks. A promising application of the resulting legal knowledge graphs is in developing question-answering systems for government schemes, tailored to the Next Billion Users who struggle with the complexity of legal language. Our code and data are available at https://github.com/prajnaupadhyay/LegalIE
+ 2024.nllp-1.1
+ c-r-etal-2024-legen
+
+
+ Summarizing Long Regulatory Documents with a Multi-Step Pipeline
+ MikaSieUtrecht University
+ RubyBeekPower2X
+ MichielBotsPower2X
+ SjaakBrinkkemperUtrecht University
+ AlbertGattUtrecht University
+ 18-32
+ Due to their length and complexity, long regulatory texts are challenging to summarize. To address this, a multi-step extractive-abstractive architecture is proposed to handle lengthy regulatory documents more effectively. In this paper, we show that the effectiveness of a two-step architecture for summarizing long regulatory texts varies significantly depending on the model used. Specifically, the two-step architecture improves the performance of decoder-only models. For abstractive encoder-decoder models with short context lengths, the effectiveness of an extractive step varies, whereas for long-context encoder-decoder models, the extractive step worsens their performance. This research also highlights the challenges of evaluating generated texts, as evidenced by the differing results from human and automated evaluations. Most notably, human evaluations favoured language models pretrained on legal text, while automated metrics rank general-purpose language models higher. The results underscore the importance of selecting the appropriate summarization strategy based on model architecture and context length.
+ 2024.nllp-1.2
+ sie-etal-2024-summarizing
+
+
+ Enhancing Legal Expertise in Large Language Models through Composite Model Integration: The Development and Evaluation of Law-Neo
+ ZhihaoLiuShandong University of Finance and Economics
+ YanzhenZhuShandong University of Finance and Economics
+ MengyuanLuShandong University of Finance and Economics
+ 33-41
+ Although large language models (LLMs) like ChatGPT have demonstrated considerable capabilities in general domains, they often lack proficiency in specialized fields. Enhancing a model’s performance in a specific domain, such as law, while maintaining low costs has been a significant challenge. Existing methods, such as fine-tuning or building mixture-of-experts (MoE) models, often struggle to balance model parameters, training costs, and domain-specific performance. Inspired by composition to augment language models, we have developed Law-Neo, a novel model designed to enhance legal LLMs. This model significantly improves legal domain expertise at minimal training cost, while retaining the logical capabilities of a large-scale anchor model. Our Law-Neo model outperformed other models in comprehensive experiments on multiple legal task benchmarks, demonstrating the effectiveness of this approach.
+ 2024.nllp-1.3
+ liu-etal-2024-enhancing-legal
+
+
+ uOttawa at LegalLens-2024: Transformer-based Classification Experiments
+ NimaMeghdadiUniversity of Ottawa
+ DianaInkpenUniversity of Ottawa
+ 42-47
+ This paper presents the methods used for LegalLens-2024, which focused on detecting legal violations within unstructured textual data and associating these violations with potentially affected individuals. The shared task included two subtasks: A) Legal Named Entity Recognition (L-NER) and B) Legal Natural Language Inference (L-NLI). For subtask A, we utilized the spaCy library, while for subtask B, we employed a combined model incorporating RoBERTa and a CNN. Our results were 86.3% in the L-NER subtask and 88.25% in the L-NLI subtask. Overall, our paper demonstrates the effectiveness of transformer models in addressing complex tasks in the legal domain.
+ 2024.nllp-1.4
+ meghdadi-inkpen-2024-uottawa
+
+
+ Quebec Automobile Insurance Question-Answering With Retrieval-Augmented Generation
+ DavidBeaucheminUniversite Laval
+ RichardKhouryUniversité Laval
+ ZacharyGagnonUniversité Laval
+ 48-60
+ Large Language Models (LLMs) perform outstandingly in various downstream tasks, and the use of the Retrieval-Augmented Generation (RAG) architecture has been shown to improve performance for legal question answering (Nuruzzaman and Hussain, 2020; Louis et al., 2024). However, there are limited applications in insurance question answering, which involves a specific type of legal document. This paper introduces two corpora: the Quebec Automobile Insurance Expertise Reference Corpus and a set of 82 Expert Answers to Layperson Automobile Insurance Questions. Our study leverages both corpora to automatically and manually assess GPT-4o, a state-of-the-art (SOTA) LLM, on answering Quebec automobile insurance questions. Our results demonstrate that, on average, using our expertise reference corpus generates better responses on both automatic and manual evaluation metrics. However, they also highlight that LLM QA is not yet reliable enough for mass utilization in critical areas. Indeed, our results show that between 5% and 13% of answered questions include a false statement that could lead to customer misunderstanding.
+ 2024.nllp-1.5
+ beauchemin-etal-2024-quebec
+
+
+ Rethinking Legal Judgement Prediction in a Realistic Scenario in the Era of Large Language Models
+ Shubham KumarNigamIndian Institute of Technology
+ AniketDeroyIIT Kharagpur
+ SubhankarMaityIIT Kharagpur
+ ArnabBhattacharyaDept. of Computer Science and Engineering, IIT Kanpur
+ 61-80
+ This study investigates judgment prediction in a realistic scenario within the context of Indian judgments, utilizing a range of transformer-based models, including InLegalBERT, BERT, and XLNet, alongside LLMs such as Llama-2 and GPT-3.5 Turbo. In this realistic scenario, we simulate how judgments are predicted at the point when a case is presented for a decision in court, using only the information available at that time, such as the facts of the case, statutes, precedents, and arguments. This approach mimics real-world conditions, where decisions must be made without the benefit of hindsight, unlike retrospective analyses often found in previous studies. For transformer models, we experiment with hierarchical transformers and the summarization of judgment facts to optimize input for these models. Our experiments with LLMs reveal that GPT-3.5 Turbo excels in realistic scenarios, demonstrating robust performance in judgment prediction. Furthermore, incorporating additional legal information, such as statutes and precedents, significantly improves the outcome of the prediction task. The LLMs also provide explanations for their predictions. To evaluate the quality of these predictions and explanations, we introduce two human evaluation metrics: Clarity and Linking. Our findings from both automatic and human evaluations indicate that, despite advancements in LLMs, they are yet to achieve expert-level performance in judgment prediction and explanation tasks.
+ 2024.nllp-1.6
+ nigam-etal-2024-rethinking
+
+
+ The CLC-UKET Dataset: Benchmarking Case Outcome Prediction for the UK Employment Tribunal
+ HuiyuanXieUniversity of Cambridge
+ FelixSteffekUniversity of Cambridge
+ JoanaDe FariaUniversity of Cambridge
+ ChristineCarterUniversity of Cambridge
+ JonathanRutherfordUniversity of Cambridge
+ 81-96
+ This paper explores the intersection of technological innovation and access to justice by developing a benchmark for predicting case outcomes in the UK Employment Tribunal (UKET). To address the challenge of extensive manual annotation, the study employs a large language model (LLM) for automatic annotation, resulting in the creation of the CLC-UKET dataset. The dataset consists of approximately 19,000 UKET cases and their metadata. Comprehensive legal annotations cover facts, claims, precedent references, statutory references, case outcomes, reasons and jurisdiction codes. Facilitated by the CLC-UKET data, we examine a multi-class case outcome prediction task in the UKET. Human predictions are collected to establish a performance reference for model comparison. Empirical results from baseline models indicate that finetuned transformer models outperform zero-shot and few-shot LLMs on the UKET prediction task. The performance of zero-shot LLMs can be enhanced by integrating task-related information into few-shot examples. We hope that the CLC-UKET dataset, along with human annotations and empirical findings, can serve as a valuable benchmark for employment-related dispute resolution.
+ 2024.nllp-1.7
+ xie-etal-2024-clc
+
+
+ Information Extraction for Planning Court Cases
+ DrishMaliUniversity of Edinburgh
+ RubashMaliHimalaya College Of Engineering
+ ClaireBaraleSchool of Informatics, University of Edinburgh
+ 97-114
+ Legal documents are often long and unstructured, making them challenging and time-consuming to comprehend. An automatic system that can identify relevant entities and labels within legal documents would significantly reduce legal research time. We developed a system to streamline legal case analysis from planning courts by extracting key information from XML files using Named Entity Recognition (NER) and multi-label classification models to convert them into structured form. This research contributes three novel datasets for Planning Court cases: a NER dataset, a multi-label dataset fully annotated by humans, and a newly re-annotated multi-label dataset partially annotated using LLMs. We experimented with various general-purpose and legal domain-specific models with different maximum sequence lengths. We note that incorporating paragraph position information improved the performance of models on the multi-label classification task. Our research highlights the importance of domain-specific models, with LegalRoBERTa and LexLM demonstrating the best performance.
+ 2024.nllp-1.8
+ mali-etal-2024-information
+
+
+ Automated Anonymization of Parole Hearing Transcripts
+ AbedItaniUniversity of Passau
+ WassilikiSiskouUniversity of Konstanz
+ AnnetteHautli-JaniszUniversity of Passau
+ 115-128
+ Responsible natural language processing is increasingly concerned with preventing the violation of personal rights that language technology can entail (CITATION). In this paper we illustrate the case of parole hearings in California, the verbatim transcripts of which are made available to the general public upon a request sent to the California Board of Parole Hearings. The parole hearing setting is highly sensitive: inmates face a board of legal representatives who discuss highly personal matters not only about the inmates themselves but also about victims and their relatives, such as spouses and children. Participants have no choice in contributing to the data collection process, since the disclosure of the transcripts is mandated by law. As researchers who are interested in understanding and modeling the communication in these hierarchy-driven settings, we face an ethical dilemma: publishing raw data as is for the community would compromise the privacy of all individuals affected, but manually cleaning the data requires a substantial effort. In this paper we present an automated anonymization process which reliably removes and pseudonymizes sensitive data in verbatim transcripts, while at the same time preserving the structure and content of the data. Our results show that the process exhibits little to no leakage of sensitive information when applied to more than 300 hearing transcripts.
+ 2024.nllp-1.9
+ itani-etal-2024-automated
+
+
+ Towards an Automated Pointwise Evaluation Metric for Generated Long-Form Legal Summaries
+ Shao MinTanThomson Reuters Labs
+ QuentinGrailThomson Reuters
+ LeeQuarteyThomson Reuters
+ 129-142
+ Long-form abstractive summarization is a task that has particular importance in the legal domain. Automated evaluation metrics are important for the development of text generation models, but existing research on the evaluation of generated summaries has focused mainly on short summaries. We introduce an automated evaluation methodology for generated long-form legal summaries, which involves breaking each summary into individual points, comparing the points in a human-written and machine-generated summary, and calculating a recall and precision score for the latter. The method is designed to be particularly suited for the complexities of legal text, and is also fully interpretable. We also create and release a small meta-dataset for the benchmarking of evaluation methods, focusing on long-form legal summarization. Our evaluation metric corresponds better with human evaluation compared to existing metrics which were not developed for legal data.
+ 2024.nllp-1.10
+ tan-etal-2024-towards-automated
+
+
+ Enhancing Contract Negotiations with LLM-Based Legal Document Comparison
+ SavinayNarendraJP Morgan Chase & Co.
+ KaushalShettyJP Morgan Chase
+ AdwaitRatnaparkhiJPMorganChase
+ 143-153
+ We present a large language model (LLM) based approach for comparing legal contracts with their corresponding template documents. Legal professionals use commonly observed deviations between templates and contracts to help with contract negotiations, and also to refine the template documents. Our comparison approach, based on the well-studied natural language inference (NLI) task, first splits a template into key concepts and then uses LLMs to decide if the concepts are entailed by the contract document. We also repeat this procedure in the opposite direction: contract clauses are tested for entailment against the template clause to see if they contain additional information. The non-entailed concepts are labelled, organized and filtered by frequency, and placed into a clause library, which is used to suggest changes to the template documents. We first show that our LLM-based approach outperforms all previous work on a publicly available dataset designed for NLI in the legal domain. We then apply it to a private real-world legal dataset, achieving an accuracy of 96.46%. Our approach is the first in the literature to produce a natural language comparison between legal contracts and their template documents.
+ 2024.nllp-1.11
+ narendra-etal-2024-enhancing
+
+
+ Attributed Question Answering for Preconditions in the Dutch Law
+ FeliciaRedelaarLeiden University/TNO
+ RomyVan DrieTNO
+ SuzanVerberneLIACS, Leiden University
+ MaaikeDe BoerTNO
+ 154-165
+ In this paper, we address the problem of answering questions about preconditions in the law, e.g. “When can the court terminate the guardianship of a natural person?”. When answering legal questions, it is important to attribute the relevant part of the law; we therefore not only generate answers but also references to law articles. We implement a retrieval augmented generation (RAG) pipeline for long-form answers based on the Dutch law, using several state-of-the-art retrievers and generators. For evaluating our pipeline, we create a dataset containing legal QA pairs with attributions. Our experiments show promising results on our extended version for the automatic evaluation metrics from the Automatic LLMs’ Citation Evaluation (ALCE) Framework and the G-EVAL Framework. Our findings indicate that RAG has significant potential in complex, citation-heavy domains like law, as it helps laymen understand legal preconditions and rights by generating high-quality answers with accurate attributions.
+ 2024.nllp-1.12
+ redelaar-etal-2024-attributed
+
+
+ Algorithm for Automatic Legislative Text Consolidation
+ MatiasEtcheverryDoctrine
+ ThibaudReal-del-SarteDoctrine
+ PaulineChavallardDoctrine
+ 166-175
+ This study introduces a method for automating the consolidation process in a legal context, a time-consuming task traditionally performed by legal professionals. We present a generative approach that processes legislative texts to automatically apply amendments. Our method employs a lightweight quantized generative model, fine-tuned with LoRA, to generate accurate and reliable amended texts. To the authors’ knowledge, this is the first time generative models have been used for legislative text consolidation. Our dataset is publicly available on HuggingFace. Experimental results demonstrate a significant improvement in efficiency, offering faster updates to legal documents. A fully automated pipeline of legislative text consolidation can be completed in a few hours, with a success rate of more than 63% on a difficult bill.
+ 2024.nllp-1.13
+ etcheverry-etal-2024-algorithm
+
+
+ Measuring the Groundedness of Legal Question-Answering Systems
+ DietrichTrautmannCenter for Information and Language Processing, Ludwig Maximilian University of Munich
+ NataliaOstapukThomson Reuters
+ QuentinGrailThomson Reuters
+ AdrianPolThomson Reuters
+ GuglielmoBonifaziThomson Reuters
+ ShangGaoThomson Reuters
+ MartinGajekThomson Reuters
+ 176-186
+ In high-stakes domains like legal question-answering, the accuracy and trustworthiness of generative AI systems are of paramount importance. This work presents a comprehensive benchmark of various methods to assess the groundedness of AI-generated responses, aiming to significantly enhance their reliability. Our experiments include similarity-based metrics and natural language inference models to evaluate whether responses are well-founded in the given contexts. We also explore different prompting strategies for large language models to improve the detection of ungrounded responses. We validated the effectiveness of these methods using a newly created grounding classification corpus, designed specifically for legal queries and corresponding responses from retrieval-augmented prompting, focusing on their alignment with source material. Our results indicate potential in groundedness classification of generated responses, with the best method achieving a macro-F1 score of 0.8. Additionally, we evaluated the methods in terms of their latency to determine their suitability for real-world applications, as this step typically follows the generation process. This capability is essential for processes that may trigger additional manual verification or automated response regeneration. In summary, this study demonstrates the potential of various detection methods to improve the trustworthiness of generative AI in legal settings.
+ 2024.nllp-1.14
+ trautmann-etal-2024-measuring
+
+
+ Transductive Legal Judgment Prediction Combining BERT Embeddings with Delaunay-Based GNNs
+ HugoAttaliLIPN, Universite Sorbonne Nord
+ NadiTomehLIPN-CNRS, Université Sorbonne Paris Nord
+ 187-193
+ This paper presents a novel approach to legal judgment prediction by combining BERT embeddings with a Delaunay-based Graph Neural Network (GNN). Unlike inductive methods that classify legal documents independently, our transductive approach models the entire document set as a graph, capturing both contextual and relational information. This method significantly improves classification accuracy by enabling effective label propagation across connected documents. Evaluated on the Swiss-Judgment-Prediction (SJP) dataset, our model outperforms established baselines, including larger models with cross-lingual training and data augmentation techniques, while maintaining efficiency with minimal computational overhead.
+ 2024.nllp-1.15
+ attali-tomeh-2024-transductive
+
+
+ Cross Examine: An Ensemble-based approach to leverage Large Language Models for Legal Text Analytics
+ SauravChowdhuryIndian Institute of Technology, Jodhpur
+ LipikaDeyAshoka University
+ SuyogJoshiAshoka University
+ 194-204
+ Legal documents are complex in nature, describing a course of argumentative reasoning that is followed to settle a case. Churning through large volumes of legal documents is a daily requirement for a large number of professionals who need access to the information embedded in them. Natural language processing methods that help in document summarization with key information components, insight extraction, and question answering play a crucial role in legal text processing. Most existing document analysis systems use supervised machine learning, which requires large volumes of annotated training data for every different application and is expensive to build. In this paper we propose a legal text analytics pipeline using Large Language Models (LLMs), which can work with little or no training data. For document summarization, we propose an iterative pipeline using retrieval-augmented generation to ensure that the generated text remains contextually relevant. For question answering, we propose a novel ontology-driven ensemble approach, similar to cross-examination, that exploits questioning and verification principles. A knowledge graph, created with the extracted information, stores the key entities and relationships reflecting the repository content structure. A new dataset is created with Indian court documents related to bail applications for cases filed under the Protection of Children from Sexual Offences (POCSO) Act, 2012, an Indian law to protect children from sexual abuse and offences. Analysis of the insights extracted from the answers reveals patterns of crime and the social conditions leading to those crimes, which are important inputs for social scientists as well as the legal system.
+ 2024.nllp-1.16
+ chowdhury-etal-2024-cross
+
+
+ LLMs to the Rescue: Explaining DSA Statements of Reason with Platform’s Terms of Services
+ MarcoAspromonteAlma AI, Alma Mater Studiorum, University of Bologna
+ AndreaFerrarisUnibo
+ FedericoGalliAlma AI, Alma Mater Studiorum, University of Bologna
+ GiuseppeContissaAlma AI, Alma Mater Studiorum, University of Bologna
+ 205-215
+ The Digital Services Act (DSA) requires online platforms in the EU to provide “statements of reason” (SoRs) when restricting user content, but their effectiveness in ensuring transparency is still debated due to vague and complex terms of service (ToS). This paper explores the use of NLP techniques, specifically multi-agent systems based on large language models (LLMs), to clarify SoRs by linking them to relevant ToS sections. Analysing SoRs from platforms like Booking.com, Reddit, and LinkedIn, our findings show that LLMs can enhance the interpretability of content moderation decisions, improving user understanding and engagement with DSA requirements.
+ 2024.nllp-1.17
+ aspromonte-etal-2024-llms
+
+
+ BLT: Can Large Language Models Handle Basic Legal Text?
+ AndrewBlair-StanekUniversity of Maryland Law School
+ NilsHolzenbergerTélécom Paris
+ BenjaminVan DurmeJohns Hopkins University / Microsoft
+ 216-232
+ We find that the best publicly available LLMs like GPT-4 and Claude currently perform poorly on basic legal text handling. This motivates the creation of a benchmark consisting of examples that lawyers and paralegals would expect LLMs to handle zero-shot, such as looking up the text at a line of a witness deposition or at a subsection of a contract. LLMs’ poor performance on this benchmark casts into doubt their reliability as-is for legal practice. However, fine-tuning on our training set brings even a small model to near-perfect performance. This benchmark will be useful for fine-tuning LLMs for downstream legal tasks, as well as for tracking LLMs’ reliability as-is for basic legal tasks.
+ 2024.nllp-1.18
+ blair-stanek-etal-2024-blt
+
+
+ Multi-Property Multi-Label Documents Metadata Recommendation based on Encoder Embeddings
+ NasredineChenikiPublications Office of the European Union
+ VidasDaudaraviciusEuropean Commission Joint Research Centre
+ AbdelfettahFeliachiPublications Office of the European Union
+ DidierHardyPublications Office of the European Union
+ Marc WilhelmKüsterPublications Office of the European Union
+ 233-242
+ The task of document classification, particularly multi-label classification, presents a significant challenge due to the complexity of assigning multiple relevant labels to each document. This complexity is further amplified in multi-property multi-label classification tasks, where documents must be categorized across various sets of labels. In this research, we introduce an innovative encoder embedding-driven approach to multi-property multi-label document classification that leverages semantic-text similarity and the reuse of pre-existing annotated data to enhance the efficiency and accuracy of the document annotation process. Our method requires only a single model for text similarity, eliminating the need for multiple property-specific classifiers and thereby reducing computational demands and simplifying deployment. We evaluate our approach through a prototype deployed for daily operations, which demonstrates superior performance over existing classification systems. Our contributions include improved accuracy without additional training, increased efficiency, and demonstrated effectiveness in practical applications. The results of our study indicate the potential of our approach to be applied across various domains requiring multi-property multi-label document classification, offering a scalable and adaptable solution for metadata annotation tasks.
+ 2024.nllp-1.19
+ cheniki-etal-2024-multi
+
+
+ Comparative Study of Explainability Methods for Legal Outcome Prediction
+ IevaStaliunaiteUniversity of Cambridge
+ JosefValvodaUniversity of Cambridge
+ KenSatohNational Institute of Informatics
+ 243-258
+ This paper investigates explainability in Natural Legal Language Processing (NLLP). We study the task of legal outcome prediction of the European Court of Human Rights cases in a ternary classification setup, where a language model is fine-tuned to predict whether an article has been claimed and violated (positive outcome), claimed but not violated (negative outcome) or not claimed at all (null outcome). Specifically, we experiment with three popular NLP explainability methods. Correlating the attribution scores of input-level methods (Integrated Gradients and Contrastive Explanations) with rationales from court rulings, we show that the correlations are very weak, with absolute values of Spearman and Kendall correlation coefficients ranging between 0.003 and 0.094. Furthermore, we use a concept-level interpretability method (Concept Erasure) with human expert annotations of legal reasoning, to show that obscuring legal concepts from the model representation has an insignificant effect on model performance (at most a decline of 0.26 F1). Therefore, our results indicate that automated legal outcome prediction models are not reliably grounded in legal reasoning.
+ 2024.nllp-1.20
+ staliunaite-etal-2024-comparative
+
+
+ Bonafide at LegalLens 2024 Shared Task: Using Lightweight DeBERTa Based Encoder For Legal Violation Detection and Resolution
+ ShikhaBordiaIndividual Contributor
+ 259-266
+ In this work, we present two systems—Named Entity Recognition (NER) and Natural Language Inference (NLI)—for detecting legal violations within unstructured textual data and for associating these violations with potentially affected individuals, respectively. Both systems are lightweight DeBERTa-based encoders that outperform the LLM baselines. The proposed NER system achieved an F1 score of 60.01% on Subtask A of the LegalLens challenge, which focuses on identifying violations. The proposed NLI system achieved an F1 score of 84.73% on Subtask B of the LegalLens challenge, which focuses on resolving these violations by matching them with pre-existing legal complaints of class action cases. Our NER system ranked sixth and our NLI system ranked fifth on the LegalLens leaderboard. We release the trained models and inference scripts.
+ 2024.nllp-1.21
+ bordia-2024-bonafide
+
+
+ LAR-ECHR: A New Legal Argument Reasoning Task and Dataset for Cases of the European Court of Human Rights
+ OdysseasChlapanisDepartment of Informatics, Athens University of Economics and Business & Archimedes Unit, Athena Research Center
+ DimitrisGalanisInstitute for Language and Speech Processing, Athena Research Center
+ IonAndroutsopoulosAthens University of Economics and Business
+ 267-279
+ We present Legal Argument Reasoning (LAR), a novel task designed to evaluate the legal reasoning capabilities of Large Language Models (LLMs). The task requires selecting the correct next statement (from multiple choice options) in a chain of legal arguments from court proceedings, given the facts of the case. We constructed a dataset (LAR-ECHR) for this task using cases from the European Court of Human Rights (ECHR). We evaluated seven general-purpose LLMs on LAR-ECHR and found that (a) the ranking of the models is aligned with that of LegalBench, an established US-based legal reasoning benchmark, even though LAR-ECHR is based on EU law, (b) LAR-ECHR distinguishes top models more clearly, compared to LegalBench, (c) even the best model (GPT-4o) obtains 75.8% accuracy on LAR-ECHR, indicating significant potential for further model improvement. The process followed to construct LAR-ECHR can be replicated with cases from other legal systems.
+ 2024.nllp-1.22
+ chlapanis-etal-2024-lar
+
+
+ Gaps or Hallucinations? Scrutinizing Machine-Generated Legal Analysis for Fine-grained Text Evaluations
+ AbeHouJohns Hopkins University
+ WilliamJurayjJohns Hopkins University
+ NilsHolzenbergerTélécom Paris, Institut Polytechnique de Paris
+ AndrewBlair-StanekUniversity of Maryland Law School / Johns Hopkins University
+ BenjaminVan DurmeJohns Hopkins University / Microsoft / HLTCOE
+ 280-302
+ Large Language Models (LLMs) show promise as a writing aid for professionals performing legal analyses. However, LLMs can often hallucinate in this setting, in ways difficult to recognize by non-professionals and existing text evaluation metrics. In this work, we pose the question: when can machine-generated legal analysis be evaluated as acceptable? We introduce the neutral notion of gaps – as opposed to hallucinations in a strict erroneous sense – to refer to the difference between human-written and machine-generated legal analysis. Gaps do not always equate to invalid generation. Working with legal experts, we consider the CLERC generation task proposed in Hou et al. (2024b), leading to a taxonomy, a fine-grained detector for predicting gap categories, and an annotated dataset for automatic evaluation. Our best detector achieves 67% F1 score and 80% precision on the test set. Employing this detector as an automated metric on legal analysis generated by SOTA LLMs, we find around 80% contain hallucinations of different kinds.
+ 2024.nllp-1.24
+ hou-etal-2024-gaps
+
+
+ Classify First, and Then Extract: Prompt Chaining Technique for Information Extraction
+ AliceKwakUniversity of Arizona
+ ClaytonMorrisonUniversity of Arizona
+ DerekBambauerUniversity of Florida
+ MihaiSurdeanuUniversity of Arizona
+ 303-317
+ This work presents a new task-aware prompt design and example retrieval approach for information extraction (IE) using a prompt chaining technique. Our approach divides IE tasks into two steps: (1) text classification to understand what information (e.g., entity or event types) is contained in the underlying text, and (2) information extraction for the identified types. Initially, we use a large language model (LLM) in a few-shot setting to classify the contained information. The classification output is used to select the relevant prompt and retrieve the examples relevant to the input text. Finally, we ask an LLM to do the information extraction with the generated prompt. By evaluating our approach on legal IE tasks with two different LLMs, we demonstrate that the prompt chaining technique improves the LLM’s overall performance in a few-shot setting when compared to the baseline, in which examples from all possible classes are included in the prompt. Our approach can be used in a low-resource setting as it does not require a large amount of training data. Also, it can be easily adapted to many different IE tasks by simply adjusting the prompts. Lastly, it provides a cost benefit by reducing the number of tokens in the prompt.
+ 2024.nllp-1.25
+ kwak-etal-2024-classify
+
+
+ Augmenting Legal Decision Support Systems with LLM-based NLI for Analyzing Social Media Evidence
+ Ram Mohan RaoKadiyalaN/A
+ SiddarthaPullakhandamUniversity of Wisconsin, Milwaukee
+ KanwalMehreenTraversaal.ai
+ SubhasyaTippareddyUniversity of South Florida
+ AshaySrivastavaUniversity of Maryland
+ 318-325
+ This paper presents our system description and error analysis for our entry in the NLLP 2024 shared task on Legal Natural Language Inference (L-NLI). The task required classifying the relationship between an online media review and a legal complaint as entailed, contradicted, or neutral, indicating any association between the review and the complaint. Our system emerged as the winning submission, outperforming other entries by a substantial margin and demonstrating the effectiveness of our approach in legal text analysis. We provide a detailed analysis of the strengths and limitations of each model and approach tested, along with a thorough error analysis and suggestions for future improvements. This paper aims to contribute to the growing field of legal NLP by offering insights into advanced techniques for natural language inference in legal contexts, making it accessible to both experts and newcomers in the field.
+ 2024.nllp-1.26
+ kadiyala-etal-2024-augmenting
+
+
+ Empowering Air Travelers: A Chatbot for Canadian Air Passenger Rights
+ MaksymTaranukhinDalhousie University
+ SahithyaRaviThe University of British Columbia, Vancouver
+ GaborLukacsAir Passenger Rights
+ EvangelosMiliosDalhousie University
+ VeredShwartzUniversity of British Columbia
+ 326-335
+ The Canadian air travel sector has seen a significant increase in flight delays, cancellations, and other issues concerning passenger rights. Recognizing this demand, we present a chatbot to assist passengers and educate them about their rights. Our system breaks a complex user input into simple queries which are used to retrieve information from a collection of documents detailing air travel regulations. The most relevant passages from these documents are presented along with links to the original documents and the generated queries, enabling users to dissect and leverage the information for their unique circumstances. The system successfully overcomes two predominant challenges: understanding complex user inputs, and delivering accurate answers, free of hallucinations, that passengers can rely on for making informed decisions. A user study comparing the chatbot to a Google search demonstrated the chatbot’s usefulness and ease of use. Beyond the primary goal of providing accurate and timely information to air passengers regarding their rights, we hope that this system will also enable further research exploring the tradeoff between the user-friendly conversational interface of chatbots and the accuracy of retrieval systems.
+ 2024.nllp-1.27
+ taranukhin-etal-2024-empowering
+
+
+ Enhancing Legal Violation Identification with LLMs and Deep Learning Techniques: Achievements in the LegalLens 2024 Competition
+ NguyenTan MinhVNU University of Engineering and Technology
+ DuyNgoc MaiVNU University of Engineering and Technology
+ LeXuan BachVNU University of Engineering and Technology
+ NguyenHuu DungVNU University of Engineering and Technology
+ PhamCong MinhVNU University of Engineering and Technology
+ Ha ThanhNguyenNational Institute of Informatics
+ Thi Hai YenVuongUniversity of Engineering and Technology, Vietnam national university Hanoi
+ 336-345
+ LegalLens is a competition organized to encourage advancements in automatically detecting legal violations. This paper presents our solutions for its two tasks: Legal Named Entity Recognition (L-NER) and Legal Natural Language Inference (L-NLI). Our approach involves fine-tuning BERT-based models, designing methods based on data characteristics, and a novel prompting template for data augmentation using LLMs. As a result, we secured first place in L-NER and third place in L-NLI among thirty-six participants. We also perform error analysis to provide valuable insights and pave the way for future enhancements in legal NLP. Our implementation is available at https://github.com/lxbach10012004/legal-lens/tree/main
+ 2024.nllp-1.28
+ tan-minh-etal-2024-enhancing
+
+
+ LegalLens 2024 Shared Task: Masala-chai Submission
+ KhalidRajanGeorgian
+ RoyalSequieraGeorgian
+ 346-354
+ In this paper, we present the masala-chai team’s participation in the LegalLens 2024 shared task and detail our approach to predicting legal entities and performing natural language inference (NLI) in the legal domain. We experimented with various transformer-based models, including BERT, RoBERTa, Llama 3.1, and GPT-4o. Our results show that state-of-the-art models like GPT-4o underperformed in NER and NLI tasks, even when using advanced techniques such as bootstrapping and prompt optimization. The best performance in NER (accuracy: 0.806, F1 macro: 0.701) was achieved with a fine-tuned RoBERTa model, while the highest NLI results (accuracy: 0.825, F1 macro: 0.833) came from a fine-tuned Llama 3.1 8B model. Notably, RoBERTa, despite having significantly fewer parameters than Llama 3.1 8B, delivered comparable results. We discuss key findings and insights from our experiments and provide our results and code for reproducibility and further analysis at https://github.com/rosequ/masala-chai
+ 2024.nllp-1.30
+ rajan-sequiera-2024-legallens
+
+
+ Semantists at LegalLens-2024: Data-efficient Training of LLM’s for Legal Violation Identification
+ KanagasabaiRajaramanInstitute for Infocomm Research, A*STAR
+ HariramVeeramaniUCLA
+ 355-360
+ In this paper, we describe our system for the LegalLens-2024 Shared Task on automatically identifying legal violations from unstructured text sources. We participate in Subtask B, called Legal Natural Language Inference (L-NLI), which aims to predict the relationship between a given premise summarizing a class action complaint and a hypothesis from an online media text, indicating any association between the review and the complaint. This task is challenging as it provides only limited labelled data. In our work, we adopt LLM-based methods and explore various data-efficient learning approaches for maximizing performance. In the end, our best model employed an ensemble of LLMs fine-tuned on the task-specific data, achieved a Macro F1 score of 78.5% on test data, and ranked 2nd among all teams’ submissions.
+ 2024.nllp-1.31
+ rajaraman-veeramani-2024-semantists
+
+
+ LegalLens Shared Task 2024: Legal Violation Identification in Unstructured Text
+ BenHagagBar Ilan University
+ GilSemoTAU
+ DorBernsohnHUJI
+ LiavHarpazDarrow
+ PashootanVaezipoorGeorgian
+ RohitSahaGeorgian
+ KyrylTruskovskyiScoreinforce Inc.
+ GerasimosSpanakisMaastricht University
+ 361-370
+ This paper presents the results of the LegalLens Shared Task, focusing on detecting legal violations within text in the wild across two sub-tasks: LegalLens-NER for identifying legal violation entities and LegalLens-NLI for associating these violations with relevant legal contexts and affected individuals. Using an enhanced LegalLens dataset covering labor, privacy, and consumer protection domains, 38 teams participated in the task. Our analysis reveals that while a mix of approaches was used, the top-performing teams in both tasks consistently relied on fine-tuning pre-trained language models, outperforming legal-specific models and few-shot methods. The top-performing team achieved a 7.11% improvement in NER over the baseline, while NLI saw a more modest improvement of 5.7%. Despite these gains, the complexity of legal texts leaves room for further advancements.
+ 2024.nllp-1.33
+ hagag-etal-2024-legallens
+
+
+ DeBERTa Beats Behemoths: A Comparative Analysis of Fine-Tuning, Prompting, and PEFT Approaches on LegalLensNER
+ Hanh Thi HongTranLa Rochelle University
+ NishanChatterjeeInstitut Jožef Stefan, University of La Rochelle
+ SenjaPollakJožef Stefan Institute
+ AntoineDoucetUniversity of La Rochelle
+ 371-380
+ This paper summarizes the participation of our team (Flawless Lawgic) in the legal named entity recognition (L-NER) task at LegalLens 2024: Detecting Legal Violations. Given possible unstructured texts (e.g., online media texts), we aim to identify legal violations by extracting legal entities such as “violation”, “violation by”, “violation on”, and “law”. This system-description paper discusses our approaches to address the task, empirically highlighting the performances of fine-tuning models from the Transformers family (e.g., RoBERTa and DeBERTa) against open-sourced LLMs (e.g., Llama, Mistral) with different tuning settings (e.g., LoRA, Supervised Fine-Tuning (SFT) and prompting strategies). Our best results, with a weighted F1 of 0.705 on the test set, show a 30-percentage-point increase in F1 over the baseline and rank 2nd on the leaderboard, a marginal gap of only 0.4 percentage points below the top solution. Our solutions are available at github.com/honghanhh/lner.
+ 2024.nllp-1.34
+ tran-etal-2024-deberta
+
+
+ LexSumm and LexT5: Benchmarking and Modeling Legal Summarization Tasks in English
+ SantoshT.y.s.sTechnical University of Munich
+ CorneliusWeissTechnical University Munich
+ MatthiasGrabmairTechnical University of Munich
+ 381-403
+ In the evolving NLP landscape, benchmarks serve as yardsticks for gauging progress. However, existing Legal NLP benchmarks only focus on predictive tasks, overlooking generative tasks. This work curates LexSumm, a benchmark designed for evaluating legal summarization tasks in English. It comprises eight English legal summarization datasets, from diverse jurisdictions, such as the US, UK, EU and India. Additionally, we release LexT5, a legal-oriented sequence-to-sequence model, addressing the limitations of existing BERT-style encoder-only models in the legal domain. We assess its capabilities through zero-shot probing on LegalLAMA and fine-tuning on LexSumm. Our analysis reveals abstraction and faithfulness errors even in summaries generated by zero-shot LLMs, indicating opportunities for further improvements. The LexSumm benchmark and LexT5 model are available at https://github.com/TUMLegalTech/LexSumm-LexT5.
+ 2024.nllp-1.35
+ t-y-s-s-etal-2024-lexsumm
+
+
+ Towards Supporting Legal Argumentation with NLP: Is More Data Really All You Need?
+ SantoshT.y.s.sTechnical University of Munich
+ KevinAshleyUniversity of Pittsburgh
+ KatieAtkinsonUniversity of Liverpool
+ MatthiasGrabmairTechnical University of Munich
+ 404-421
+ Modeling legal reasoning and argumentation justifying decisions in cases has always been central to AI & Law, yet contemporary developments in legal NLP have increasingly focused on statistically classifying legal conclusions from text. While conceptually “simpler”, these approaches often fall short in providing usable justifications connecting to appropriate legal concepts. This paper reviews both traditional symbolic works in AI & Law and recent advances in legal NLP, and distills possibilities of integrating expert-informed knowledge to strike a balance between scalability and explanation in symbolic vs. data-driven approaches. We identify open challenges and discuss the potential of modern NLP models and methods that integrate conceptual legal knowledge.
+ 2024.nllp-1.36
+ t-y-s-s-etal-2024-towards-supporting
+
+
+
diff --git a/data/xml/2024.nlp4dh.xml b/data/xml/2024.nlp4dh.xml
new file mode 100644
index 0000000000..34b82e804b
--- /dev/null
+++ b/data/xml/2024.nlp4dh.xml
@@ -0,0 +1,559 @@
+
+
+
+
+ Proceedings of the 4th International Conference on Natural Language Processing for Digital Humanities
+ MikaHämäläinen
+ EmilyÖhman
+ SoMiyagawa
+ KhalidAlnajjar
+ YuriBizzoni
+ Association for Computational Linguistics
+ Miami, USA
+ November
+ 2024
+ 2024.nlp4dh-1
+ nlp4dh
+
+
+ 2024.nlp4dh-1.0
+ nlp4dh-2024-1
+
+
+ Text Length and the Function of Intentionality: A Case Study of Contrastive Subreddits
+ Emily SofiOhman
+ AatuLiimatta
+ 1–8
+ Text length is of central concern in natural language processing (NLP) tasks, yet it is very much under-researched. In this paper, we use social media data, specifically Reddit, to explore the function of text length and intentionality by contrasting subreddits of the same topic where one is considered more serious/professional/academic and the other more relaxed/beginner/layperson. We hypothesize that word choices are more deliberate and intentional in the more in-depth and professional subreddits, with texts subsequently becoming longer as a function of this intentionality. We argue that this has deep implications for many applied NLP tasks such as emotion and sentiment analysis, fake news and disinformation detection, and other modeling tasks focused on social media and similar platforms where users interact with each other via the medium of text.
+ 2024.nlp4dh-1.1
+ ohman-liimatta-2024-text
+
+
+ Tracing the Genealogies of Ideas with Sentence Embeddings
+ LucianLi
+ 9–16
+ Detecting intellectual influence in unstructured text is an important problem for a wide range of fields, including intellectual history, social science, and bibliometrics. Previous studies in computational social science and digital humanities have attempted to resolve this through a range of dictionary-, embedding-, and language-model-based methods. I introduce an approach which leverages a sentence embedding index to efficiently search for similar ideas in a large historical corpus. This method remains robust in conditions of high OCR error found in real mass digitized historical corpora that disrupt previously published methods, while also capturing paraphrase and indirect influence. I evaluate this method on a large corpus of 250,000 nonfiction texts from the 19th century, and find that discovered influence is in line with the history of science literature. By expanding the scope of our search for influence and the origins of ideas beyond traditional structured corpora and canonical works and figures, we can get a more nuanced perspective on influence and idea dissemination that can encompass epistemically marginalized groups.
+ 2024.nlp4dh-1.2
+ li-2024-tracing
+
+
+ Evaluating Computational Representations of Character: An Austen Character Similarity Benchmark
+ FuningYang
+ Carolyn JaneAnderson
+ 17–30
+ Several systems have been developed to extract information about characters to aid computational analysis of English literature. We propose character similarity grouping as a holistic evaluation task for these pipelines. We present AustenAlike, a benchmark suite of character similarities in Jane Austen’s novels. Our benchmark draws on three notions of character similarity: a structurally defined notion of similarity; a socially defined notion of similarity; and an expert defined set extracted from literary criticism. We use AustenAlike to evaluate character features extracted using two pipelines, BookNLP and FanfictionNLP. We build character representations from four kinds of features and compare them to the three AustenAlike benchmarks and to GPT-4 similarity rankings. We find that though computational representations capture some broad similarities based on shared social and narrative roles, the expert pairings in our third benchmark are challenging for all systems, highlighting the subtler aspects of similarity noted by human readers.
+ 2024.nlp4dh-1.3
+ yang-anderson-2024-evaluating
+
+
+ Investigating Expert-in-the-Loop LLM Discourse Patterns for Ancient Intertextual Analysis
+ RayUmphrey
+ JesseRoberts
+ LindseyRoberts
+ 31–40
+ This study explores the potential of large language models (LLMs) for identifying and examining intertextual relationships within biblical, koine Greek texts. By evaluating the performance of LLMs on various intertextuality scenarios, the study demonstrates that these models can detect direct quotations, allusions, and echoes between texts. The LLM’s ability to generate novel intertextual observations and connections highlights its potential to uncover new insights. However, the model also struggles with long query passages and the inclusion of false intertextual dependencies, emphasizing the importance of expert evaluation. The expert-in-the-loop methodology presented offers a scalable approach for research into the complex web of intertextuality within and beyond the biblical corpus.
+ 2024.nlp4dh-1.4
+ umphrey-etal-2024-investigating
+
+
+ Extracting Relations from Ecclesiastical Cultural Heritage Texts
+ GiuliaCruciani
+ 41–50
+ Motivated by the increasing volume of data and the necessity of getting valuable insights, this research describes the process of extracting entities and relations from Italian texts in the context of ecclesiastical cultural heritage data. Named Entity Recognition (NER) and Relation Extraction (RE) are paramount tasks in Natural Language Processing. This paper presents a traditional methodology based on a two-step procedure: firstly, a custom model for Named Entity Recognition extracts entities from data, and then, a multi-input neural network model is trained to perform Relation Classification as a multi-label classification problem. Data are provided by IDS&Unitelm (technological partner of the IT Services and National Office for Ecclesiastical Cultural Heritage and Religious Buildings of CEI, the Italian Episcopal Conference) and concern biographical texts of 9,982 entities of type person, which can be accessed through the online portal BeWeb. This approach aims to enhance the organization and accessibility of ecclesiastical cultural heritage data, offering deeper insights into historical biographical records.
+ 2024.nlp4dh-1.5
+ cruciani-2024-extracting
+
+
+ Constructing a Sentiment-Annotated Corpus of Austrian Historical Newspapers: Challenges, Tools, and Annotator Experience
+ LucijaKrusic
+ 51–62
+ This study presents the development of a sentiment-annotated corpus of historical newspaper texts in Austrian German, addressing a gap in annotated corpora for Natural Language Processing in the field of Digital Humanities. Three annotators categorised 1005 sentences from two 19th-century periodicals into four sentiment categories: positive, negative, neutral, and mixed. The annotators, Masters and PhD students in Linguistics and Digital Humanities, are considered semi-experts and have received substantial training during this annotation study. Three tools were used and compared in the annotation process (Google Sheets, Google Forms, and Doccano), resulting in a gold standard corpus. The analysis revealed a fair to moderate inter-rater agreement (Fleiss’ kappa = 0.405) and an average percentage agreement of 45.7% for full consensus and 92.5% for majority vote. As majority vote is needed for the creation of a gold standard corpus, these results are considered sufficient, and the annotations reliable. The study also introduced comprehensive guidelines for sentiment annotation, which were essential to overcome the challenges posed by historical language and context. The annotators’ experience was assessed through a combination of standardised usability tests (NASA-TLX and UEQ-S) and a detailed custom-made user experience questionnaire, which provided qualitative insights into the difficulties and usability of the tools used. The questionnaire is an additional resource that can be used for usability and user experience assessments in future annotation studies. The findings demonstrate the effectiveness of semi-expert annotators and dedicated tools in producing reliable annotations and provide valuable resources, including the annotated corpus, guidelines, and a user experience questionnaire, for future sentiment analysis and annotation of Austrian historical texts. The sentiment-annotated corpus will be used as the gold standard for fine-tuning and evaluating machine learning models for sentiment analysis of Austrian historical newspapers with the topic of migration and minorities in a subsequent study.
+ 2024.nlp4dh-1.6
+ krusic-2024-constructing
+
+
+ It is a Truth Individually Acknowledged: Cross-references On Demand
+ PiperVasicek
+ CourtniByun
+ KevinSeppi
+ 63–74
+ Cross-references link source passages of text to other passages that elucidate the source passage in some way and can deepen human understanding. Despite their usefulness, however, good cross-references are hard to find, and extensive sets of cross-references only exist for the few most highly studied books such as the Bible, for which scholars have been collecting cross-references for hundreds of years. Therefore, we propose a new task: generate cross-references for user-selected text on demand. We define a metric, coverage, to evaluate task performance. We adapt several models to generate cross-references, including an Anchor Words topic model, SBERT SentenceTransformers, and ChatGPT, and evaluate their coverage in both English and German on existing cross-reference datasets. While ChatGPT outperforms other models on these datasets, this is likely due to data contamination. We hand-evaluate performance on the well-known works of Jane Austen and a lesser-known science fiction series, Sons of the Starfarers by Joe Vasicek, finding that ChatGPT does not perform as well on these works; sentence embeddings perform best. We experiment with newer LLMs and large context windows, and suggest that future work should focus on deploying cross-references on demand with readers to determine their effectiveness in the wild.
+ 2024.nlp4dh-1.7
+ vasicek-etal-2024-truth
+
+
+ Extracting position titles from unstructured historical job advertisements
+ KlaraVenglarova
+ RavenAdam
+ GeorgVogeler
+ 75–84
+ This paper explores the automated extraction of job titles from unstructured historical job advertisements, using a corpus of digitized German-language newspapers from 1850-1950. The study addresses the challenges of working with unstructured, OCR-processed historical data, contrasting with contemporary approaches that often use structured, digitally-born datasets when dealing with this text type. We compare four extraction methods: a dictionary-based approach, a rule-based approach, a named entity recognition (NER) model, and a text-generation method. The NER approach, trained on manually annotated data, achieved the highest F1 score (0.944 for a transformer model trained on GPU, 0.884 for a model trained on CPU), demonstrating its flexibility and ability to correctly identify job titles. The text-generation approach performs similarly (0.920). However, the rule-based (0.69) and dictionary-based (0.632) methods reach relatively high F1 scores as well, while offering the advantage of not requiring extensive labeling of training data. The results highlight the complexities of extracting meaningful job titles from historical texts, with implications for further research into labor market trends and occupational history.
+ 2024.nlp4dh-1.8
+ venglarova-etal-2024-extracting
+
+
+ Language Resources From Prominent Born-Digital Humanities Texts are Still Needed in the Age of LLMs
+ NatalieHervieux
+ PeiranYao
+ SusanBrown
+ DenilsonBarbosa
+ 85–104
+ The digital humanities (DH) community fundamentally embraces the use of computerized tools for the study and creation of knowledge related to language, history, culture, and human values, in which natural language plays a prominent role. Many successful DH tools rely heavily on Natural Language Processing methods, and several efforts exist within the DH community to promote the use of newer and better tools. Nevertheless, most NLP research is driven by web corpora that are noticeably different from texts commonly found in DH artifacts, which tend to use richer language and refer to rarer entities. Thus, the near-human performance achieved by state-of-the-art NLP tools on web texts might not be achievable on DH texts. We introduce a dataset carefully created by computer scientists and digital humanists intended to serve as a reference point for the development and evaluation of NLP tools. The dataset is a subset of a born-digital textbase resulting from a prominent and ongoing experiment in digital literary history, containing thousands of multi-sentence excerpts that are suited for information extraction tasks. We fully describe the dataset and show that its language is demonstrably different from the corpora normally used in training language resources in the NLP community.
+ 2024.nlp4dh-1.9
+ hervieux-etal-2024-language
+
+
+ NLP for Digital Humanities: Processing Chronological Text Corpora
+ AdamPawłowski
+ TomaszWalkowiak
+ 105–112
+ The paper focuses on the integration of Natural Language Processing (NLP) techniques to analyze extensive chronological text corpora. This research underscores the synergy between humanistic inquiry and computational methods, especially in the processing and analysis of sequential textual data known as lexical series. A reference workflow for chronological corpus analysis is introduced, outlining the methodologies applicable to the ChronoPress corpus, a data set that encompasses 22 years of Polish press from 1945 to 1966. The study showcases the potential of this approach in uncovering cultural and historical patterns through the analysis of lexical series. The findings highlight both the challenges and opportunities present in leveraging lexical series analysis within Digital Humanities, emphasizing the necessity for advanced data filtering and anomaly detection algorithms to effectively manage the vast and intricate datasets characteristic of this field.
+ 2024.nlp4dh-1.10
+ pawlowski-walkowiak-2024-nlp
+
+
+ A Multi-task Framework with Enhanced Hierarchical Attention for Sentiment Analysis on Classical Chinese Poetry: Utilizing Information from Short Lines
+ QuanqiDu
+ VeroniqueHoste
+ 113–122
+ Classical Chinese poetry has a long history, dating back to the 11th century BC. By investigating the sentiment expressed in the poetry, we can gain more insights into the emotional life and historical development of ancient Chinese culture. To help improve sentiment analysis performance in the field of classical Chinese poetry, we propose to utilize the unique information from the individual short lines that compose the poem, and introduce a multi-task framework with hierarchical attention enhanced with short line sentiment labels. Specifically, the multi-task framework comprises sentiment analysis for both the overall poem and the short lines, while the hierarchical attention consists of word- and sentence-level attention, with the latter enhanced with additional information from short line sentiments. Our experimental results demonstrate that our approach, leveraging more fine-grained information from short lines, outperforms the state-of-the-art, achieving an accuracy score of 72.88% and an F1-macro score of 71.05%.
+ 2024.nlp4dh-1.11
+ du-hoste-2024-multi
+
+
+ Exploring Similarity Measures and Intertextuality in Vedic Sanskrit Literature
+ SoMiyagawa
+ YukiKyogoku
+ YuzukiTsukagoshi
+ KyokoAmano
+ 123–131
+ This paper examines semantic similarity and intertextuality in selected texts from the Vedic Sanskrit corpus, specifically the Maitrāyaṇī Saṃhitā (MS) and Kāṭhaka-Saṃhitā (KS). Three computational methods are employed: Word2Vec for word embeddings, the stylo package for stylometric analysis, and TRACER for text reuse detection. By comparing various sections of the texts at different granularities, patterns of similarity and structural alignment are uncovered, providing insights into textual relationships and chronology. Word embeddings capture semantic similarities, while stylometric analysis reveals clusters and components that differentiate the texts. TRACER identifies parallel passages, indicating probable instances of text reuse. The computational analysis corroborates previous philological studies, suggesting a shared period of composition between MS.1.9 and MS.1.7. This research highlights the potential of computational methods in studying ancient Sanskrit literature, complementing traditional approaches. The agreement among the methods strengthens the validity of the findings, and the visualizations offer a nuanced understanding of textual connections. The study demonstrates that smaller chunk sizes are more effective for detecting intertextual parallels, showcasing the power of these techniques in unraveling the complexities of ancient texts.
+ 2024.nlp4dh-1.12
+ miyagawa-etal-2024-exploring
+
+
+ Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction
+ LauraManrique-Gomez
+ TonyMontes
+ ArturoRodriguez Herrera
+ RubenManrique
+ 132–139
+ This paper presents two significant contributions: First, it introduces a novel dataset of 19th-century Latin American newspaper texts, addressing a critical gap in specialized corpora for historical and linguistic analysis in this region. Second, it develops a flexible framework that utilizes a Large Language Model for OCR error correction and linguistic surface form detection in digitized corpora. This semi-automated framework is adaptable to various contexts and datasets and is applied to the newly created dataset.
+ 2024.nlp4dh-1.13
+ manrique-gomez-etal-2024-historical
+
+
+ Canonical Status and Literary Influence: A Comparative Study of Danish Novels from the Modern Breakthrough (1870–1900)
+ PascaleFeldkamp
+ AlieLassche
+ JanKostkan
+ MártonKardos
+ KennethEnevoldsen
+ KatrineBaunvig
+ KristofferNielbo
+ 140–155
+ We examine the relationship between the canonization of Danish novels and their textual innovation and influence, taking the Danish Modern Breakthrough era (1870–1900) as a case study. We evaluate whether canonical novels introduced a significant textual novelty in their time, and explore their influence on the overall literary trend of the period. By analyzing the positions of canonical versus non-canonical novels in semantic space, we seek to better understand the link between a novel’s canonical status and its literary impact. Additionally, we examine the overall diversification of Modern Breakthrough novels during this significant period of rising literary readership. We find that canonical novels stand out from both the historical novel genre and non-canonical novels of the period. Our findings on diversification within and across groups indicate that the novels now regarded as canonical served as literary trendsetters of their time.
+ 2024.nlp4dh-1.14
+ feldkamp-etal-2024-canonical
+
+
+ Deciphering psycho-social effects of Eating Disorder: Analysis of Reddit Posts using Large Language Model(LLM)s and Topic Modeling
+ MediniChopra
+ AninditaChatterjee
+ LipikaDey
+ Partha PratimDas
+ 156–164
+ Eating disorders (ED) are a global health concern as they manifest in increasing numbers across all sections of society. Social network platforms have emerged as a dependable source of information about the disease, its effect, and its prevalence among different sections. This work lays the foundation for large-scale analysis of social media data using large language models (LLMs). We show that using LLMs can drastically reduce the time and resource requirements for garnering insights from large data repositories. With respect to ED, this work focuses on understanding its psychological impacts on both patients and those who live in their proximity. Social scientists can utilize the proposed approach to design more focused studies with better representative groups.
+ 2024.nlp4dh-1.15
+ chopra-etal-2024-deciphering
+
+
+ Topic-Aware Causal Intervention for Counterfactual Detection
+ Thong ThanhNguyen
+ Truc-MyNguyen
+ 165–176
+ Counterfactual statements, which describe events that did not or cannot take place, are beneficial to numerous NLP applications. Hence, we consider the problem of counterfactual detection (CFD) and seek to enhance CFD models. Previous models are reliant on clue phrases to predict counterfactuality, so they suffer from a significant performance drop when clue phrase hints do not exist during testing. Moreover, these models tend to predict non-counterfactuals over counterfactuals. To address these issues, we propose to integrate a neural topic model into the CFD model to capture the global semantics of the input statement. We further causally intervene on the hidden representations of the CFD model to balance the effect of the class labels. Extensive experiments show that our approach outperforms previous state-of-the-art CFD and bias-resolving methods in both the CFD and other bias-sensitive tasks.
+ 2024.nlp4dh-1.16
+ nguyen-nguyen-2024-topic
+
+
+ UD for German Poetry
+ StefanieDipper
+ RonjaLaarmann-Quante
+ 177–188
+ This article deals with the syntactic analysis of German-language poetry from different centuries. We use Universal Dependencies (UD) as our syntactic framework. We discuss particular challenges of the poems in terms of tokenization, sentence boundary recognition and special syntactic constructions. Our annotated corpus currently consists of 20 poems with a total of 2,162 tokens, which originate from the PoeTree.de corpus. We present some statistics on our annotations and also evaluate the automatic UD annotation from PoeTree.de using our annotations.
+ 2024.nlp4dh-1.17
+ dipper-laarmann-quante-2024-ud
+
+
+ Molyé: A Corpus-based Approach to Language Contact in Colonial France
+ RasulDent
+ JulietteJanes
+ ThibaultClerice
+ PedroOrtiz Suarez
+ BenoîtSagot
+ 189–199
+ Whether or not several Creole languages which developed during the early modern period can be considered genetic descendants of European languages has been the subject of intense debate. This is in large part due to the absence of evidence of intermediate forms. This work introduces a new open corpus, the Molyé corpus, which combines stereotypical representations of three kinds of language variation in Europe with early attestations of French-based Creole languages across a period of 400 years. It is intended to facilitate future research on the continuity between contact situations in Europe and Creolophone (former) colonies.
+ 2024.nlp4dh-1.18
+ dent-etal-2024-molye
+
+
+ Vector Poetics: Parallel Couplet Detection in Classical Chinese Poetry
+ MaciejKurzynski
+ XiaotongXu
+ YuFeng
+ 200–208
+ This paper explores computational approaches for detecting parallelism in classical Chinese poetry, a rhetorical device where two verses mirror each other in syntax, meaning, tone, and rhythm. We experiment with five classification methods: (1) verb position matching, (2) integrated semantic, syntactic, and word-segmentation analysis, (3) difference-based character embeddings, (4) structured examples (inner/outer couplets), and (5) GPT-guided classification. We use a manually annotated dataset, containing 6,125 pentasyllabic couplets, to evaluate performance. The results indicate that parallelism detection poses a significant challenge even for powerful LLMs such as GPT-4o, with the highest F1 score below 0.72. Nevertheless, each method contributes valuable insights into the art of parallelism in Chinese poetry, suggesting a new understanding of parallelism as a verbal expression of principal components in a culturally defined vector space.
+ 2024.nlp4dh-1.19
+ kurzynski-etal-2024-vector
+
+
+ Adapting Measures of Literality for Use with Historical Language Data
+ AdamRoussel
+ 209–215
+ This paper concerns the adaptation of two existing computational measures for estimating the literality of expressions, enabling their use in scenarios where data is scarce, as is usually the case with historical language data. Being able to determine an expression’s literality via statistical means could support a range of linguistic annotation tasks, such as those relating to metaphor, metonymy, and idiomatic expressions; however, making this judgment is especially difficult for modern annotators of historical and ancient texts. Therefore, we re-implement these measures using smaller corpora and count-based vectors more suited to these amounts of training data. The adapted measures are evaluated against an existing data set of particle verbs annotated with degrees of literality. The results were inconclusive, yielding low correlations between 0.05 and 0.10 (Spearman’s ρ). Further work is needed to determine which measures and types of data correspond to which aspects of literality.
+ 2024.nlp4dh-1.20
+ roussel-2024-adapting
+
+
+ Improving Latin Dependency Parsing by Combining Treebanks and Predictions
+ Hanna-Mari KristiinaKupari
+ ErikHenriksson
+ VeronikaLaippala
+ JennaKanerva
+ 216–228
+ This paper introduces new models designed to improve the morpho-syntactic parsing of the five largest Latin treebanks in the Universal Dependencies (UD) framework. First, using two state-of-the-art parsers, Trankit and Stanza, along with our custom UD tagger, we train new models on the five treebanks both individually and by combining them into novel merged datasets. We also test the models on the CIRCSE test set. In an additional experiment, we evaluate whether this set can be accurately tagged using the novel LASLA corpus (https://github.com/CIRCSE/LASLA). Second, we aim to improve the results by combining the predictions of different models through an atomic morphological feature voting system. The results of our two main experiments demonstrate significant improvements, particularly for the smaller treebanks, with LAS scores increasing by 16.10 and 11.85 percentage points for UDante and Perseus, respectively (Gamba and Zeman, 2023a). Additionally, the voting system for morphological features (FEATS) brings improvements, especially for the smaller Latin treebanks: 3.15 percentage points for Perseus and 2.47 for CIRCSE. Tagging the CIRCSE set with our custom model using the LASLA model improves POS by 6.71 and FEATS by 11.04 percentage points, respectively, compared to our best-performing UD PROIEL model. Our results show that larger datasets and ensemble predictions can significantly improve performance.
+ 2024.nlp4dh-1.21
+ kupari-etal-2024-improving
+
+
+ From N-grams to Pre-trained Multilingual Models For Language Identification
+ Thapelo AndrewSindane
+ VukosiMarivate
+ 229–239
+ In this paper, we investigate the use of N-gram models and large pre-trained multilingual models for Language Identification (LID) across 11 South African languages. For N-gram models, this study shows that careful data size selection remains crucial for establishing frequency distributions that efficiently model each target language, thus improving language ranking. For pre-trained multilingual models, we conduct extensive experiments covering a diverse set of massively pre-trained multilingual (PLM) models – mBERT, RemBERT, XLM-r – and Afri-centric multilingual models – AfriBERTa, Afro-XLMr, AfroLM, and Serengeti. We further compare these models with available large-scale Language Identification tools – Compact Language Detector v3 (CLD V3), AfroLID, GlotLID, and OpenLID – to highlight the importance of focused LID. From these comparisons, we show that Serengeti is, on average, the superior model across all approaches, from N-grams to Transformers. Moreover, we propose a lightweight BERT-based LID model (za_BERT_lid) trained on the NHCLT + Vukzenzele corpus, which performs on par with our best-performing Afri-centric models.
+ 2024.nlp4dh-1.22
+ sindane-marivate-2024-n
+
+
+ Visualising Changes in Semantic Neighbourhoods of English Noun Compounds over Time
+ MalakRassem
+ MyrtoTsigkouli
+ Chris W.Jenkins
+ FilipMiletić
+ SabineSchulte im Walde
+ 240–246
+ This paper provides a framework and tool set for computing and visualising dynamic, time-specific semantic neighbourhoods of English noun-noun compounds and their constituents over time. Our framework not only identifies salient vector-space dimensions and neighbours in notoriously sparse data; it also specifically brings together changes in meaning aspects and degrees of (non-)compositionality.
+ 2024.nlp4dh-1.23
+ rassem-etal-2024-visualising
+
+
+ SEFLAG: Systematic Evaluation Framework for NLP Models and Datasets in Latin and Ancient Greek
+ KonstantinSchulz
+ FlorianDeichsler
+ 247–258
+ Literary scholars of Latin and Ancient Greek increasingly use natural language processing for their work, but many models and datasets are hard to use due to a lack of sustainable research data management. This paper introduces the Systematic Evaluation Framework for natural language processing models and datasets in Latin and Ancient Greek (SEFLAG), which consistently assesses language resources using common criteria, such as specific evaluation metrics, metadata and risk analysis. The framework, a work in progress in its initial phase, currently covers lemmatization and named entity recognition for both languages, with plans for adding dependency parsing and other tasks. For increased transparency and sustainability, thorough documentation is included, as well as an integration into the HuggingFace ecosystem. The combination of these efforts is designed to support researchers in their search for suitable models.
+ 2024.nlp4dh-1.24
+ schulz-deichsler-2024-seflag
+
+
+ A Two-Model Approach for Humour Style Recognition
+ Mary OgbukaKenneth
+ FoaadKhosmood
+ AbbasEdalat
+ 259–274
+ Humour, a fundamental aspect of human communication, manifests itself in various styles that significantly impact social interactions and mental health. Recognising different humour styles poses challenges due to the lack of established datasets and machine learning (ML) models. To address this gap, we present a new text dataset for humour style recognition, comprising 1463 instances across four styles (self-enhancing, self-deprecating, affiliative, and aggressive) and non-humorous text, with lengths ranging from 4 to 229 words. Our research employs various computational methods, including classic machine learning classifiers, text embedding models, and DistilBERT, to establish baseline performance. Additionally, we propose a two-model approach to enhance humour style recognition, particularly in distinguishing between affiliative and aggressive styles. Our method demonstrates an 11.61% improvement in F1-score for affiliative humour classification, with consistent improvements across the 14 models tested. Our findings contribute to the computational analysis of humour in text, offering new tools for studying humour in literature, social media, and other textual sources.
+ 2024.nlp4dh-1.25
+ kenneth-etal-2024-two
+
+
+ N-gram-Based Preprocessing for Sandhi Reversion in Vedic Sanskrit
+ YuzukiTsukagoshi
+ IkkiOhmukai
+ 275–279
+ This study aims to address the challenges posed by sandhi in Vedic Sanskrit, a phenomenon that complicates the computational analysis of Sanskrit texts. By focusing on sandhi reversion, the research seeks to improve the accuracy of processing Vedic Sanskrit, an older layer of the language. Sandhi, a phonological phenomenon, poses challenges for text processing in Sanskrit due to the fusion of word boundaries or the sound change around word boundaries. In this research, we developed a transformer-based model with a novel n-gram preprocessing strategy to improve the accuracy of sandhi reversion for Vedic. We created character-based n-gram texts of varying lengths (n = 2, 3, 4, 5, 6) from the Rigveda, the oldest Vedic text, and trained models on these texts to perform machine translation from post-sandhi to pre-sandhi forms. In the results, we found that the model trained with 5-gram text achieved the highest accuracy. This success is likely due to the 5-gram’s ability to capture the maximum phonemic context in which Vedic sandhi occurs, making it more effective for the task. These findings suggest that by leveraging the inherent characteristics of phonological changes in language, even simple preprocessing methods like n-gram segmentation can significantly improve the accuracy of complex linguistic tasks.
+ 2024.nlp4dh-1.26
+ tsukagoshi-ohmukai-2024-n
+
+
+ Enhancing Swedish Parliamentary Data: Annotation, Accessibility, and Application in Digital Humanities
+ Shafqat MumtazVirk
+ ClaesOhlsson
+ NinaTahmasebi
+ HenrikBjörck
+ LeifRunefelt
+ 280–288
+ The Swedish bicameral parliament data presents a valuable textual resource that is of interest to many researchers and scholars. The parliamentary texts offer many avenues for research, including the study of how various affairs were run by governments over time. The Parliament proceedings are available in textual format, but in their original form, they are noisy and unstructured and thus hard to explore and investigate. In this paper, we report the transformation of the raw bicameral parliament data (1867-1970) into a structured lexical resource annotated with various word and document level attributes. The annotated data is then made searchable through two modern corpus infrastructure components which provide a wide array of corpus exploration, visualization, and comparison options. To demonstrate the practical utility of this resource, we present a case study examining the transformation of the concept of ‘market’ over time from a tangible physical entity to an abstract idea.
+ 2024.nlp4dh-1.27
+ virk-etal-2024-enhancing
+
+
+ Evaluating Open-Source LLMs in Low-Resource Languages: Insights from Latvian High School Exams
+ RobertsDarģis
+ GuntisBārzdiņš
+ IngunaSkadiņa
+ BaibaSaulite
+ 289–293
+ The latest large language models (LLMs) have significantly advanced natural language processing (NLP) capabilities across various tasks. However, their performance in low-resource languages, such as Latvian with 1.5 million native speakers, remains substantially underexplored due to both limited training data and the absence of comprehensive evaluation benchmarks. This study addresses this gap by conducting a systematic assessment of prominent open-source LLMs on natural language understanding (NLU) and natural language generation (NLG) tasks in Latvian. We utilize standardized high school centralized graduation exams as a benchmark dataset, offering relatable and diverse evaluation scenarios that encompass multiple-choice questions and complex text analysis tasks. Our experimental setup involves testing models from the leading LLM families, including Llama, Qwen, Gemma, and Mistral, with OpenAI’s GPT-4 serving as a performance reference. The results reveal that certain open-source models demonstrate competitive performance in NLU tasks, narrowing the gap with GPT-4. However, all models exhibit notable deficiencies in NLG tasks, specifically in generating coherent and contextually appropriate text analyses, highlighting persistent challenges in NLG for low-resource languages. These findings contribute to efforts to develop robust multilingual benchmarks and improve LLM performance in diverse linguistic contexts.
+ 2024.nlp4dh-1.28
+ dargis-etal-2024-evaluating
+
+
+ Computational Methods for the Analysis of Complementizer Variability in Language and Literature: The Case of Hebrew “she-” and “ki”
+ AviShmidman
+ AynatRubinstein
+ 294–307
+ We demonstrate a computational method for analyzing complementizer variability within language and literature, focusing on Hebrew as a test case. The primary complementizers in Hebrew are “she-” and “ki”. We first run a large-scale corpus analysis to determine the relative preference for one or the other of these complementizers given the preceding verb. On top of this foundation, we leverage clustering methods to measure the degree of interchangeability between the complementizers for each verb. The resulting tables, which provide this information for all common complement-taking verbs in Hebrew, constitute a first-of-its-kind lexical resource, which we provide to the NLP community. Building on these resources, we demonstrate a computational method to analyze literary works for unusual and unexpected complementizer usages deserving of literary analysis.
+ 2024.nlp4dh-1.29
+ shmidman-rubinstein-2024-computational
+
+
+ From Discrete to Continuous Classes: A Situational Analysis of Multilingual Web Registers with LLM Annotations
+ ErikHenriksson
+ AmandaMyntti
+ SaaraHellström
+ SelcenErten-Johansson
+ AnniEskelinen
+ LiinaRepo
+ VeronikaLaippala
+ 308–318
+ In corpus linguistics, registers–language varieties suited to different contexts–have traditionally been defined by their situations of use, yet recent studies reveal significant situational variation within registers. Previous quantitative studies, however, have been limited to English, leaving this variation in other languages largely unexplored. To address this gap, we apply a quantitative situational analysis to a large multilingual web register corpus, using large language models (LLMs) to annotate texts in English, Finnish, French, Swedish, and Turkish for 23 situational parameters. Using clustering techniques, we identify six situational text types, such as “Advice”, “Opinion” and “Marketing”, each characterized by distinct situational features. We explore the relationship between these text types and traditional register categories, finding partial alignment, though no register maps perfectly onto a single cluster. These results support the quantitative approach to situational analysis and are consistent with earlier findings for English. Cross-linguistic comparisons show that language accounts for only a small part of situational variation within registers, suggesting registers are situationally similar across languages. This study demonstrates the utility of LLMs in multilingual register analysis and deepens our understanding of situational variation within registers.
+ 2024.nlp4dh-1.30
+ henriksson-etal-2024-discrete
+
+
+ Testing and Adapting the Representational Abilities of Large Language Models on Folktales in Low-Resource Languages
+ J. A.Meaney
+ BeatriceAlex
+ WilliamLamb
+ 319–324
+ Folktales are a rich resource of knowledge about the society and culture of a civilisation. Digital folklore research aims to use automated techniques to better understand these folktales, and it relies on abstract representations of the textual data. Although a number of large language models (LLMs) claim to be able to represent low-resource languages such as Irish and Gaelic, we present two classification tasks to explore how useful these representations are, and three adaptations to improve the performance of these models. We find that adapting the models to work with longer sequences, and continuing pre-training on the domain of folktales, improves classification performance, although these findings are tempered by the impressive performance of a baseline SVM with non-contextual features.
+ 2024.nlp4dh-1.31
+ meaney-etal-2024-testing
+
+
+ Examining Language Modeling Assumptions Using an Annotated Literary Dialect Corpus
+ CraigMessner
+ ThomasLippincott
+ 325–330
+ We present a dataset of 19th century American literary orthovariant tokens with a novel layer of human-annotated dialect group tags designed to serve as the basis for computational experiments exploring literarily meaningful orthographic variation. We perform an initial broad set of experiments over this dataset using both token (BERT) and character (CANINE)-level contextual language models. We find indications that the “dialect effect” produced by intentional orthographic variation employs multiple linguistic channels, and that these channels can be surfaced to varying degrees given particular language modelling assumptions. Specifically, we find evidence showing that the choice of tokenization scheme meaningfully impacts the type of orthographic information a model is able to surface.
+ 2024.nlp4dh-1.32
+ messner-lippincott-2024-examining
+
+
+ Evaluating Language Models in Location Referring Expression Extraction from Early Modern and Contemporary Japanese Texts
+ AyukiKatayama
+ YusukeSakai
+ ShoheiHigashiyama
+ HirokiOuchi
+ AyanoTakeuchi
+ RyoBando
+ YutaHashimoto
+ ToshinobuOgiso
+ TaroWatanabe
+ 331–338
+ Automatic extraction of geographic information, including Location Referring Expressions (LREs), can aid humanities research in analyzing large collections of historical texts. In this study, to investigate how accurately pretrained Transformer language models (LMs) can extract LREs from historical texts, we evaluate two representative types of LMs, namely a masked language model and a causal language model, using early modern and contemporary Japanese datasets. Our experimental results demonstrate the potential of contemporary LMs for historical texts, but also suggest the need for further model enhancement, such as pretraining on historical texts.
+ 2024.nlp4dh-1.33
+ katayama-etal-2024-evaluating
+
+
+ Evaluating LLM Performance in Character Analysis: A Study of Artificial Beings in Recent Korean Science Fiction
+ WooriJang
+ SeohyonJung
+ 339–351
+ Literary works present diverse and complex character behaviors, often implicit or intentionally obscured, making character analysis an inherently challenging task. This study explores LLMs’ capability to identify and interpret behaviors of artificial beings in 11 award-winning contemporary Korean science fiction short stories. Focusing on artificial beings as a distinct class of characters, rather than on conventional human characters, adds to the multi-layered complexity of analysis. We compared two LLMs, Claude 3.5 Sonnet and GPT-4o, with human experts using a custom eight-label system and a unique agreement metric developed to capture the cognitive intricacies of literary interpretation. Human inter-annotator agreement was around 50%, confirming the subjectivity of literary comprehension. LLMs differed from humans in selected text spans but demonstrated high agreement in label assignment for correctly identified spans. LLMs notably excelled at discerning ‘actions’ as semantic units rather than isolated grammatical components. This study reaffirms literary interpretation’s multifaceted nature while expanding the boundaries of NLP, contributing to discussions about AI’s capacity to understand and interpret creative works.
+ 2024.nlp4dh-1.34
+ jang-jung-2024-evaluating
+
+
+ Text vs. Transcription: A Study of Differences Between the Writing and Speeches of U.S. Presidents
+ MinaRajaei Moghadam
+ MosabRezaei
+ GülşatAygen
+ RevaFreedman
+ 352–361
+ Even after many years of research, the question of the differences between spoken and written text remains open. This paper aims to study syntactic features that can serve as distinguishing factors. To do so, we focus on the transcribed speeches and written books of United States presidents. We conducted two experiments to analyze high-level syntactic features. In the first experiment, we examine these features while controlling for the effect of sentence length. In the second experiment, we compare the high-level syntactic features with low-level ones. The results indicate that adding high-level syntactic features enhances model performance, particularly in longer sentences. Moreover, the importance of prepositional phrases in a sentence increases with sentence length. We also find that these longer sentences with more prepositional phrases are more likely to appear in speeches than in written books by U.S. presidents.
+ 2024.nlp4dh-1.35
+ rajaei-moghadam-etal-2024-text
+
+
+ Mitigating Biases to Embrace Diversity: A Comprehensive Annotation Benchmark for Toxic Language
+ XinmengHou
+ 362–376
+ This study introduces a prescriptive annotation benchmark grounded in humanities research to ensure consistent, unbiased labeling of offensive language, particularly for casual and non-mainstream language uses. We contribute two newly annotated datasets that achieve higher inter-annotator agreement between human and large language model (LLM) annotations compared to original datasets based on descriptive instructions. Our experiments show that LLMs can serve as effective alternatives when professional annotators are unavailable. Moreover, smaller models fine-tuned on multi-source LLM-annotated data outperform models trained on larger, single-source human-annotated datasets. These findings highlight the value of structured guidelines in reducing subjective variability, maintaining performance with limited data, and embracing language diversity. Content Warning: This article only analyzes offensive language for academic purposes. Discretion is advised.
+ 2024.nlp4dh-1.36
+ hou-2024-mitigating
+
+
+ Classification of Buddhist Verses: The Efficacy and Limitations of Transformer-Based Models
+ NikitaNeveditsin
+ AmbujaSalgaonkar
+ PawanLingras
+ VijayMago
+ 377–385
+ This study assesses the ability of machine learning to classify verses from Buddhist texts into two categories: Therigatha and Theragatha, attributed to female and male authors, respectively. It highlights the difficulties in data preprocessing and the use of Transformer-based models on Devanagari script due to limited vocabulary, demonstrating that simple statistical models can be equally effective. The research suggests areas for future exploration, provides the dataset for further study, and acknowledges existing limitations and challenges.
+ 2024.nlp4dh-1.37
+ neveditsin-etal-2024-classification
+
+
+ Intersecting Register and Genre: Understanding the Contents of Web-Crawled Corpora
+ AmandaMyntti
+ LiinaRepo
+ ElianFreyermuth
+ AnttiKanner
+ VeronikaLaippala
+ ErikHenriksson
+ 386–397
+ Web-scale corpora present valuable research opportunities but often lack detailed metadata, making them challenging to use in linguistics and social sciences. This study tackles this problem by exploring automatic methods to classify web corpora into specific categories, focusing on text registers such as Interactive Discussion and literary genres such as Politics and Social Sciences. We train two machine learning models to classify documents from the large web-crawled OSCAR dataset: a register classifier using the multilingual, manually annotated CORE corpus, and a genre classifier using a dataset based on Kindle US&UK. Fine-tuned from XLM-R Large, the register and genre classifiers achieved F1-scores of 0.74 and 0.70, respectively. Our analysis includes evaluating the distribution of the predicted text classes and examining the intersection of genre-register pairs using topic modelling. The results show expected combinations between certain registers and genres, such as the Lyrical register often aligning with the Literature & Fiction genre. However, most registers, such as Interactive Discussion, are divided across multiple genres, like Engineering & Transportation and Politics & Social Sciences, depending on the discussion topic. This enriched metadata provides valuable insights and supports new ways of studying digital cultural heritage.
+ 2024.nlp4dh-1.38
+ myntti-etal-2024-intersecting
+
+
+ Sui Generis: Large Language Models for Authorship Attribution and Verification in Latin
+ SvetlanaGorovaia
+ GlebSchmidt
+ Ivan P.Yamshchikov
+ 398–412
+ This paper evaluates the performance of Large Language Models (LLMs) in authorship attribution and authorship verification tasks for Latin texts of the Patristic Era. The study showcases that LLMs can be robust in zero-shot authorship verification even on short texts without sophisticated feature engineering. Yet, the models can also be easily “misled” by semantics. The experiments also demonstrate that steering the model’s authorship analysis and decision-making is challenging, unlike what is reported in the studies dealing with high-resource modern languages. Although LLMs prove to be able to beat, under certain circumstances, the traditional baselines, obtaining a nuanced and truly explainable decision requires at best a lot of experimentation.
+ 2024.nlp4dh-1.39
+ gorovaia-etal-2024-sui
+
+
+ Enhancing Neural Machine Translation for Ainu-Japanese: A Comprehensive Study on the Impact of Domain and Dialect Integration
+ RyoIgarashi
+ SoMiyagawa
+ 413–422
+ Neural Machine Translation (NMT) has revolutionized language translation, yet significant challenges persist for low-resource languages, particularly those with high dialectal variation and limited standardization. This comprehensive study focuses on the Ainu language, a critically endangered indigenous language of northern Japan, which epitomizes these challenges. We address the limitations of previous research through two primary strategies: (1) extensive corpus expansion encompassing diverse domains and dialects, and (2) development of innovative methods to incorporate dialect and domain information directly into the translation process. Our approach yielded substantial improvements in translation quality, with BLEU scores increasing from 32.90 to 39.06 (+6.16) for Japanese → Ainu and from 10.45 to 31.83 (+21.38) for Ainu → Japanese. Through rigorous experimentation and analysis, we demonstrate the crucial importance of integrating linguistic variation information in NMT systems for languages characterized by high diversity and limited resources. Our findings have broad implications for improving machine translation for other low-resource languages, potentially advancing preservation and revitalization efforts for endangered languages worldwide.
+ 2024.nlp4dh-1.40
+ igarashi-miyagawa-2024-enhancing
+
+
+ Exploring Large Language Models for Qualitative Data Analysis
+ TimFischer
+ ChrisBiemann
+ 423–437
+ This paper explores the potential of Large Language Models (LLMs) to enhance qualitative data analysis (QDA) workflows within the open-source QDA platform developed at our university. We identify several opportunities within a typical QDA workflow where AI assistance can boost researcher productivity and translate these opportunities into corresponding NLP tasks: document classification, information extraction, span classification, and text generation. A benchmark tailored to these QDA activities is constructed, utilizing English and German datasets that align with relevant use cases. Focusing on efficiency and accessibility, we evaluate the performance of three prominent open-source LLMs - Llama 3.1, Gemma 2, and Mistral NeMo - on this benchmark. Our findings reveal the promise of LLM integration for streamlining QDA workflows, particularly for English-language projects. Consequently, we have implemented the LLM Assistant as an opt-in feature within our platform and report the implementation details. With this, we hope to further democratize access to AI capabilities for qualitative data analysis.
+ 2024.nlp4dh-1.41
+ fischer-biemann-2024-exploring
+
+
+ Cross-Dialectal Transfer and Zero-Shot Learning for Armenian Varieties: A Comparative Analysis of RNNs, Transformers and LLMs
+ ChahanVidal-Gorène
+ NadiTomeh
+ VictoriaKhurshudyan
+ 438–449
+ This paper evaluates lemmatization, POS-tagging, and morphological analysis for four Armenian varieties: Classical Armenian, Modern Eastern Armenian, Modern Western Armenian, and the under-documented Getashen dialect. It compares traditional RNN models, multilingual models like mDeBERTa, and large language models (ChatGPT) using supervised, transfer learning, and zero/few-shot learning approaches. The study finds that RNN models are particularly strong in POS-tagging, while large language models demonstrate high adaptability, especially in handling previously unseen dialect variations. The research highlights the value of cross-variational and in-context learning for enhancing NLP performance in low-resource languages, offering crucial insights into model transferability and supporting the preservation of endangered dialects.
+ 2024.nlp4dh-1.42
+ vidal-gorene-etal-2024-cross
+
+
+ Increasing the Difficulty of Automatically Generated Questions via Reinforcement Learning with Synthetic Preference for Cost-Effective Cultural Heritage Dataset Generation
+ WilliamThorne
+ AmbroseRobinson
+ BohuaPeng
+ ChenghuaLin
+ DianaMaynard
+ 450–462
+ As the cultural heritage sector increasingly adopts technologies like Retrieval-Augmented Generation (RAG) to provide more personalised search experiences and enable conversations with collections data, the demand for specialised evaluation datasets has grown. While end-to-end system testing is essential, it’s equally important to assess individual components. We target the final answering task, which is well suited to Machine Reading Comprehension (MRC). Although existing MRC datasets address general domains, they lack the specificity needed for cultural heritage information. Unfortunately, the manual creation of such datasets is prohibitively expensive for most heritage institutions. This paper presents a cost-effective approach for generating domain-specific MRC datasets with increased difficulty using Reinforcement Learning from Human Feedback (RLHF) with synthetic preference data. Our method leverages the performance of existing question-answering models on a subset of SQuAD to create a difficulty metric, assuming that more challenging questions are answered correctly less frequently. This research contributes: (1) A methodology for increasing question difficulty using PPO and synthetic data; (2) Empirical evidence of the method’s effectiveness, including human evaluation; (3) An in-depth error analysis and study of emergent phenomena; and (4) An open-source codebase and set of three llama-2-chat adapters for reproducibility and adaptation.
+ 2024.nlp4dh-1.43
+ thorne-etal-2024-increasing
+
+
+ Assessing Large Language Models in Translating Coptic and Ancient Greek Ostraca
+ Audric-CharlesWannaz
+ SoMiyagawa
+ 463–471
+ The advent of Large Language Models (LLMs) substantially raised the quality and lowered the cost of Machine Translation (MT). Can scholars working with ancient languages draw benefits from this new technology? More specifically, can current MT facilitate multilingual digital papyrology? To answer this question, we evaluate 9 LLMs on the task of translating 4 Coptic and 4 Ancient Greek ostraca into English, using 6 NLP metrics. We argue that some models have already reached a level of performance apt to assist human experts. As can be expected from the difference in training corpus size, all models seem to perform better with Ancient Greek than with Coptic, where hallucinations are markedly more common. In the Coptic texts, the specialised Coptic Translator (CT) competes closely with Claude 3 Opus for the rank of most promising tool, while Claude 3 Opus and GPT-4o compete for the same position in the Ancient Greek texts. We argue that MT now substantially heightens the incentive to work on multilingual corpora. This could have a positive and long-lasting effect on Classics and Egyptology and help reduce the historical bias in translation availability. In closing, we reflect upon the need to meet AI-generated translations with an adequate critical stance.
+ 2024.nlp4dh-1.44
+ wannaz-miyagawa-2024-assessing
+
+
+ The Social Lives of Literary Characters: Combining citizen science and language models to understand narrative social networks
+ AndrewPiper
+ MichaelXu
+ DerekRuths
+ 472–482
+ Characters and their interactions are central to the fabric of narratives, playing a crucial role in developing readers’ social cognition. In this paper, we introduce a novel annotation framework that distinguishes between five types of character interactions, including bilateral and unilateral classifications. Leveraging the crowd-sourcing framework of citizen science, we collect a large dataset of manual annotations (N=13,395). Using this data, we explore how genre and audience factors influence social network structures in a sample of contemporary books. Our findings demonstrate that fictional narratives tend to favor more embodied interactions and exhibit denser and less modular social networks. Our work not only enhances the understanding of narrative social networks but also showcases the potential of integrating citizen science with NLP methodologies for large-scale narrative analysis.
+ 2024.nlp4dh-1.45
+ piper-etal-2024-social
+
+
+ Multi-word expressions in biomedical abstracts and their plain English adaptations
+ SergeiBagdasarov
+ ElkeTeich
+ 483–488
+ This study analyzes the use of multi-word expressions (MWEs), prefabricated sequences of words (e.g. in this case, this means that, healthcare service, follow up) in biomedical abstracts and their plain language adaptations. While English academic writing became highly specialized and complex from the late 19th century onwards, recent decades have seen a rising demand for a lay-friendly language in scientific content, especially in the health domain, to bridge a communication gap between experts and laypersons. Based on previous research showing that MWEs are easier to process than non-formulaic word sequences of comparable length, we hypothesize that they can potentially be used to create a more reader-friendly language. Our preliminary results suggest some significant differences between complex and plain abstracts when it comes to the usage patterns and informational load of MWEs.
+ 2024.nlp4dh-1.46
+ bagdasarov-teich-2024-multi
+
+
+ Assessing the Performance of ChatGPT-4, Fine-tuned BERT and Traditional ML Models on Moroccan Arabic Sentiment Analysis
+ MohamedHannani
+ AbdelhadiSoudi
+ KristofVan Laerhoven
+ 489–498
+ Large Language Models (LLMs) have demonstrated impressive capabilities in various natural language processing tasks across different languages. However, their performance in low-resource languages and dialects, such as Moroccan Arabic (MA), requires further investigation. This study evaluates the performance of ChatGPT-4, different fine-tuned BERT models, FastText as text representation, and traditional machine learning models on MA sentiment analysis. Experiments were conducted on two open-source MA datasets, the X (Twitter) Moroccan Arabic Corpus (MAC) and the Moroccan Arabic YouTube Corpus (MYC), to assess their capabilities on sentiment text classification. We compare the performance of fully fine-tuned and pre-trained Arabic BERT-based models with ChatGPT-4 in zero-shot settings.
+ 2024.nlp4dh-1.47
+ hannani-etal-2024-assessing
+
+
+ Analyzing Pokémon and Mario Streamers’ Twitch Chat with LLM-based User Embeddings
+ MikaHämäläinen
+ JackRueter
+ KhalidAlnajjar
+ 499–503
+ We present a novel digital humanities method for representing our Twitch chatters as user embeddings created by a large language model (LLM). We cluster these embeddings automatically using affinity propagation and further narrow this clustering down through manual analysis. We analyze the chat of one stream by each Twitch streamer: SmallAnt, DougDoug and PointCrow. Our findings suggest that each streamer has their own type of chatters; however, two categories emerge for all of the streamers: supportive viewers and emoji and reaction senders. Repetitive message spamming is a shared chatter category for two of the streamers.
+ 2024.nlp4dh-1.48
+ hamalainen-etal-2024-analyzing
+
+
+ Corpus Development Based on Conflict Structures in the Security Field and LLM Bias Verification
+ KeitoInoshita
+ 504–512
+ This study investigates the presence of biases in large language models (LLMs), specifically focusing on how these models process and reflect inter-state conflict structures. Previous research has often lacked the standardized datasets necessary for a thorough and consistent evaluation of biases in this context. Without such datasets, it is challenging to accurately assess the impact of these biases on critical applications. To address this gap, we developed a diverse and high-quality corpus using a four-phase process. This process included generating texts based on international conflict-related keywords, enhancing emotional diversity to capture a broad spectrum of sentiments, validating the coherence and connections between texts, and conducting final quality assurance through human reviewers who are experts in natural language processing. Our analysis, conducted using this newly developed corpus, revealed subtle but significant negative biases in LLMs, particularly towards Eastern bloc countries such as Russia and China. These biases have the potential to influence decision-making processes in fields like national security and international relations, where accurate, unbiased information is crucial. The findings underscore the importance of evaluating and mitigating these biases to ensure the reliability and fairness of LLMs when applied in sensitive areas.
+ 2024.nlp4dh-1.49
+ inoshita-2024-corpus
+
+
+ Generating Interpretations of Policy Announcements
+ AndreasMarfurt
+ AshleyThornton
+ DavidSylvan
+ JamesHenderson
+ 513–520
+ Recent advances in language modeling have focused on (potentially multiple-choice) question answering, open-ended generation, or math and coding problems. We look at a more nuanced task: the interpretation of statements of political actors. To this end, we present a dataset of policy announcements and corresponding annotated interpretations, on the topic of US foreign policy relations with Russia from 1993 to 2016. We analyze the performance of finetuning standard sequence-to-sequence models of varying sizes on predicting the annotated interpretations and compare them to few-shot prompted large language models. We find that 1) model size is not the main factor for success on this task, 2) finetuning smaller models provides both quantitatively and qualitatively superior results to in-context learning with large language models, but 3) large language models pick up the annotation format and approximate the category distribution with just a few in-context examples.
+ 2024.nlp4dh-1.50
+ marfurt-etal-2024-generating
+
+
+ Order Up! Micromanaging Inconsistencies in ChatGPT-4o Text Analyses
+ ErkkiMervaala
+ IlonaKousa
+ 521–535
+ Large language model (LLM) applications have taken the world by storm in the past two years, and the academic sphere has not been an exception. One common, cumbersome task that researchers have attempted to automatise is text annotation and, to an extent, analysis. Popular LLMs such as ChatGPT have been examined as a research assistant and as an analysis tool, and several discrepancies regarding both transparency and the generative content have been uncovered. Our research approaches the usability and trustworthiness of ChatGPT for text analysis from the point of view of an “out-of-the-box” zero-shot or few-shot setting, focusing on how the context window and mixed text types affect the analyses generated. Results from our testing indicate that both the types of texts and the ordering of different kinds of texts affect the ChatGPT analysis, but also that context-building is less likely to cause analysis deterioration when analysing similar texts. Though some of these issues are at the core of how LLMs function, many of these caveats can be addressed by transparent research planning.
+ 2024.nlp4dh-1.51
+ mervaala-kousa-2024-order
+
+
+ CIPHE: A Framework for Document Cluster Interpretation and Precision from Human Exploration
+ AntonEklund
+ MonaForsman
+ FrankDrewes
+ 536–548
+ Document clustering models serve unique application purposes, which turns model quality into a property that depends on the needs of the individual investigator. We propose a framework, Cluster Interpretation and Precision from Human Exploration (CIPHE), for collecting and quantifying human interpretations of cluster samples. CIPHE tasks survey participants to explore actual document texts from cluster samples and records their perceptions. It also includes a novel inclusion task that is used to calculate the cluster precision in an indirect manner. A case study on news clusters shows that CIPHE reveals which clusters have multiple interpretation angles, aiding the investigator in their exploration.
+ 2024.nlp4dh-1.52
+ eklund-etal-2024-ciphe
+
+
+ Empowering Teachers with Usability-Oriented LLM-Based Tools for Digital Pedagogy
+ Melany VanessaMacias
+ LevKharlashkin
+ Leo EinariHuovinen
+ MikaHämäläinen
+ 549–557
+ We present our work on two LLM-based tools that utilize artificial intelligence and creative technology to improve education. The first tool is a Moodle AI plugin, which helps teachers manage their course content more efficiently using AI-driven analysis, content generation, and an interactive chatbot. The second is a curriculum planning tool that provides insight into the sustainability, work-life relevance, and workload of each course. Both tools share, among other goals, the aim of integrating the UN Sustainable Development Goals (SDGs) into teaching. We describe the usability-focused and user-centric approach we have embraced when developing these tools.
+ 2024.nlp4dh-1.53
+ macias-etal-2024-empowering
+
+
+
diff --git a/data/xml/2024.nlp4pi.xml b/data/xml/2024.nlp4pi.xml
new file mode 100644
index 0000000000..0091779d65
--- /dev/null
+++ b/data/xml/2024.nlp4pi.xml
@@ -0,0 +1,292 @@
+
+
+
+
+ Proceedings of the Third Workshop on NLP for Positive Impact
+ DarynaDementieva
+ OanaIgnat
+ ZhijingJin
+ RadaMihalcea
+ GiorgioPiatti
+ JoelTetreault
+ StevenWilson
+ JieyuZhao
+ Association for Computational Linguistics
+ Miami, Florida, USA
+ November
+ 2024
+ 2024.nlp4pi-1
+ nlp4pi
+
+
+ 2024.nlp4pi-1.0
+ nlp4pi-2024-1
+
+
+ What is the social benefit of hate speech detection research? A Systematic Review
+ SidneyWongUniversity of Canterbury
+ 1-12
+ While NLP research into hate speech detection has grown exponentially in the last three decades, there has been minimal uptake or engagement from policy makers and non-profit organisations. We argue that the absence of ethical frameworks has contributed to this rift between current practice and best practice. By adopting appropriate ethical frameworks, NLP researchers may enable the social impact potential of hate speech research. This position paper is informed by a review of forty-eight hate speech detection systems associated with thirty-seven publications from different venues.
+ 2024.nlp4pi-1.1
+ wong-2024-social
+
+
+ Multilingual Fact-Checking using LLMs
+ AryanSinghalUniversity of California, Santa Barbara
+ ThomasLaw
+ CobyKassner
+ AyushmanGupta
+ EvanDuan
+ AviralDamle
+ RyanLi
+ 13-31
+ Due to the recent rise in digital misinformation, there has been great interest shown in using LLMs for fact-checking and claim verification. In this paper, we answer the question: Do LLMs know multilingual facts and can they use this knowledge for effective fact-checking? To this end, we create a benchmark by filtering multilingual claims from the X-fact dataset and evaluating the multilingual fact-checking capabilities of five LLMs across five diverse languages: Spanish, Italian, Portuguese, Turkish, and Tamil on our benchmark. We employ three different prompting techniques: Zero-Shot, English Chain-of-Thought, and Cross-Lingual Prompting, using both greedy and self-consistency decoding. We extensively analyze our results and find that GPT-4o achieves the highest accuracy, but zero-shot prompting with self-consistency was the most effective overall. We also show that techniques like Chain-of-Thought and Cross-Lingual Prompting, which are designed to improve reasoning abilities, do not necessarily improve the fact-checking abilities of LLMs. Interestingly, we find a strong negative correlation between model accuracy and the amount of internet content for a given language. This suggests that LLMs are better at fact-checking from knowledge in low-resource languages. We hope that this study will encourage more work on multilingual fact-checking using LLMs.
+ 2024.nlp4pi-1.2
+ singhal-etal-2024-multilingual
+
+
+ Transferring Fairness using Multi-Task Learning with Limited Demographic Information
+ CarlosAguirreJohns Hopkins University
+ MarkDredzeDepartment of Computer Science, Whiting School of Engineering
+ 32-49
+ Training supervised machine learning systems with a fairness loss can improve prediction fairness across different demographic groups. However, doing so requires demographic annotations for training data, without which we cannot produce debiased classifiers for most tasks. Drawing inspiration from transfer learning methods, we investigate whether we can utilize demographic data from a related task to improve the fairness of a target task. We adapt a single-task fairness loss to a multi-task setting to exploit demographic labels from a related task in debiasing a target task, and demonstrate that demographic fairness objectives transfer fairness within a multi-task framework. Additionally, we show that this approach enables intersectional fairness by transferring between two datasets with different single-axis demographics. We explore different data domains to show how our loss can improve fairness across domains and tasks.
+ 2024.nlp4pi-1.3
+ aguirre-dredze-2024-transferring
+
+
+ Selecting Shots for Demographic Fairness in Few-Shot Learning with Large Language Models
+ CarlosAguirreJohns Hopkins University
+ KuleenSasse
+ IsabelCacholaDepartment of Computer Science, Whiting School of Engineering
+ MarkDredzeDepartment of Computer Science, Whiting School of Engineering
+ 50-67
+ Recently, work in NLP has shifted to few-shot (in-context) learning, with large language models (LLMs) performing well across a range of tasks. However, while fairness evaluations have become a standard for supervised methods, little is known about the fairness of LLMs as prediction systems. Further, common standard methods for fairness involve access to model weights or are applied during finetuning, which are not applicable in few-shot learning. Do LLMs exhibit prediction biases when used for standard NLP tasks? In this work, we analyze the effect of shots, which directly affect the performance of models, on the fairness of LLMs as NLP classification systems. We consider how different shot selection strategies, both existing and new demographically sensitive methods, affect model fairness across three standard fairness datasets. We find that overall the performance of LLMs is not indicative of their fairness, and there is not a single method that fits all scenarios. In light of these facts, we discuss how future work can include LLM fairness in evaluations.
+ 2024.nlp4pi-1.4
+ aguirre-etal-2024-selecting
+
+
+ Covert Bias: The Severity of Social Views’ Unalignment in Language Models Towards Implicit and Explicit Opinion
+ AbeerAldayelKing Saud University
+ AreejAlokailiKing Saud University
+ RehabAlahmadi
+ 68-77
+ While various approaches have recently been studied for bias identification, little is known about how implicit language that does not explicitly convey a viewpoint affects bias amplification in large language models. To examine the severity of bias toward a view, we evaluated the performance of two downstream tasks where the implicit and explicit knowledge of social groups were used. First, we present a stress test evaluation by using a biased model in edge cases of excessive bias scenarios. Then, we evaluate how LLMs calibrate linguistically in response to both implicit and explicit opinions when they are aligned with conflicting viewpoints. Our findings reveal a discrepancy in LLM performance in identifying implicit and explicit opinions, with a general tendency of bias toward explicit opinions of opposing stances. Moreover, the bias-aligned models generate more cautious responses using uncertainty phrases compared to the unaligned (zero-shot) base models. The direct, incautious responses of the unaligned models suggest a need for further refinement of decisiveness by incorporating uncertainty markers to enhance their reliability, especially on socially nuanced topics with high subjectivity.
+ 2024.nlp4pi-1.6
+ aldayel-etal-2024-covert
+
+
+ PG-Story: Taxonomy, Dataset, and Evaluation for Ensuring Child-Safe Content for Story Generation
+ AliciaTsaiUniversity of California Berkeley
+ ShereenOrabyAmazon Alexa AI
+ AnjaliNarayan-ChenAmazon
+ AlessandraCervoneAmazon
+ SpandanaGellaAmazon
+ ApurvVermaBloomberg
+ TagyoungChungAmazon
+ JingHuangAmazon Alexa AI
+ NanyunPengUniversity of California, Los Angeles
+ 78-97
+ Creating children’s stories through text generation is a creative task that requires stories to be both entertaining and suitable for young audiences. However, since current story generation systems often rely on pre-trained language models fine-tuned with limited story data, they may not always prioritize child-friendliness. This can lead to the unintended generation of stories containing problematic elements such as violence, profanity, and biases. Regrettably, despite the significance of these concerns, there is a lack of clear guidelines and benchmark datasets for ensuring content safety for children. In this paper, we introduce a taxonomy specifically tailored to assess content safety in text, with a strong emphasis on children’s well-being. We present PG-Story, a dataset that includes detailed annotations for both sentence-level and discourse-level safety. We demonstrate the potential of identifying unsafe content through self-diagnosis and employing controllable generation techniques during the decoding phase to minimize unsafe elements in generated stories.
+ 2024.nlp4pi-1.7
+ tsai-etal-2024-pg
+
+
+ Towards Explainable Multi-Label Text Classification: A Multi-Task Rationalisation Framework for Identifying Indicators of Forced Labour
+ ErickGuzman
+ ViktorSchlegelImperial College London
+ RizaBatista-NavarroUniversity of Manchester
+ 98-112
+ The importance of rationales, or natural language explanations, lies in their capacity to bridge the gap between machine predictions and human understanding, by providing human-readable insights into why a text classifier makes specific decisions. This paper presents a novel multi-task rationalisation approach tailored to enhancing the explainability of multi-label text classifiers to identify indicators of forced labour. Our framework integrates a rationale extraction task with the classification objective and allows the inclusion of human explanations during training. We conduct extensive experiments using transformer-based models on a dataset consisting of 2,800 news articles, each annotated with labels and human-generated explanations. Our findings reveal a statistically significant difference between the best-performing architecture leveraging human rationales during training and variants using only labels. Specifically, the supervised model demonstrates a 10% improvement in predictive performance measured by the weighted F1 score, a 15% increase in the agreement between human and machine-generated rationales, and a 4% improvement in the generated rationales’ comprehensiveness. These results hold promising implications for addressing complex human rights issues with greater transparency and accountability using advanced NLP techniques.
+ 2024.nlp4pi-1.8
+ guzman-etal-2024-towards
+
+
+ All Models are Wrong, But Some are Deadly: Inconsistencies in Emotion Detection in Suicide-related Tweets
+ AnnikaSchoeneInstitute for Experiential AI Northeastern University
+ ResmiRamachandranpillaiInstitute for Experiential AI and Linköping University
+ TomoLazovichU.S. Census Bureau
+ RicardoBaeza-YatesNortheastern University, Universitat Pompeu Fabra and Universidad de Chile
+ 113-122
+ Recent work in psychology has shown that people who experience mental health challenges are more likely to express their thoughts, emotions, and feelings on social media than to share them with a clinical professional. Distinguishing suicide-related content, such as suicide mentioned in a humorous context, from genuine expressions of suicidal ideation is essential to better understanding context and risk. In this paper, we give a first insight into, and analysis of, the differences between emotion labels annotated by humans and labels predicted by three fine-tuned language models (LMs) for suicide-related content. We find that (i) there is little agreement between LMs and humans for emotion labels of suicide-related Tweets and (ii) individual LMs predict similar emotion labels for all suicide-related categories. Our findings lead us to question the credibility and usefulness of such methods in high-risk scenarios such as suicide ideation detection.
+ 2024.nlp4pi-1.9
+ schoene-etal-2024-models
+
+
+ Efficient Aspect-Based Summarization of Climate Change Reports with Small Language Models
+ IacopoGhinassiQueen Mary University of London
+ LeonardoCatalanoUniversity of Pisa
+ TommasoColellaUniversity of Pisa
+ 123-139
+ The use of Natural Language Processing (NLP) for helping decision-makers with Climate Change action has recently been highlighted as a use case aligning with a broader drive towards NLP technologies for social good. In this context, Aspect-Based Summarization (ABS) systems that extract and summarize relevant information are particularly useful as they provide stakeholders with a convenient way of finding relevant information in expert-curated reports. In this work, we release a new dataset for ABS of Climate Change reports and we employ different Large Language Models (LLMs) and so-called Small Language Models (SLMs) to tackle this problem in an unsupervised way. Considering the problem at hand, we also show that SLMs are not significantly worse for the problem while leading to a reduced carbon footprint; we do so by applying, for the first time, an existing framework that considers both energy efficiency and task performance to the evaluation of zero-shot generative models for ABS. Overall, our results show that modern language models, both big and small, can effectively tackle ABS for Climate Change reports, but more research is needed when the problem is framed as Retrieval Augmented Generation (RAG); our work and dataset will help foster efforts in this direction.
+ 2024.nlp4pi-1.10
+ ghinassi-etal-2024-efficient
+
+
+ An NLP Case Study on Predicting the Before and After of the Ukraine–Russia and Hamas–Israel Conflicts
+ JordanMiner
+ JohnOrtegaNortheastern University, Columbia University and New York University
+ 140-151
+ We propose a method to predict toxicity and other textual attributes through the use of natural language processing (NLP) techniques for two recent events: the Ukraine-Russia and Hamas-Israel conflicts. This article provides a basis for exploration in future conflicts with hopes to mitigate risk through the analysis of social media before and after a conflict begins. Our work compiles several datasets from Twitter and Reddit for both conflicts, separated into before and after periods, with the aim of predicting a future state of social media for avoidance. More specifically, we show that: (1) there is a noticeable difference in social media discussion leading up to and following a conflict and (2) social media discourse on platforms like Twitter and Reddit is useful in identifying future conflicts before they arise. Our results show that through the use of advanced NLP techniques (both supervised and unsupervised), toxicity and other attributes of language before and after a conflict are predictable with a low error of nearly 1.2 percent for both conflicts.
+ 2024.nlp4pi-1.14
+ miner-ortega-2024-nlp
+
+
+ Exploring the Jungle of Bias: Political Bias Attribution in Language Models via Dependency Analysis
+ DavidJennyETHZ - ETH Zurich
+ YannBilleterZHAW - Zürcher Hochschule für Angewandte Wissenschaften
+ BernhardSchölkopfELLIS Institute and Max Planck Institute for Intelligent Systems
+ ZhijingJinDepartment of Computer Science, University of Toronto
+ 152-178
+ The rapid advancement of Large Language Models (LLMs) has sparked intense debate regarding the prevalence of bias in these models and its mitigation. Yet, as exemplified by both results on debiasing methods in the literature and reports of alignment-related defects from the wider community, bias remains a poorly understood topic despite its practical relevance. To enhance the understanding of the internal causes of bias, we analyse LLM bias through the lens of causal fairness analysis, which enables us to both comprehend the origins of bias and reason about its downstream consequences and mitigation. To operationalize this framework, we propose a prompt-based method for the extraction of confounding and mediating attributes which contribute to the LLM decision process. By applying Activity Dependency Networks (ADNs), we then analyse how these attributes influence an LLM’s decision process. We apply our method to LLM ratings of argument quality in political debates. We find that the observed disparate treatment can at least in part be attributed to confounding and mediating attributes and model misalignment, and discuss the consequences of our findings for human-AI alignment and bias mitigation.
+ 2024.nlp4pi-1.15
+ jenny-etal-2024-exploring
+
+
+ AgriLLM: Harnessing Transformers for Farmer Queries
+ KrishDidwania
+ PratinavSethArya.ai
+ AdityaKasliwal
+ AmitAgarwalWells Fargo
+ 179-187
+ Agriculture, vital for global sustenance, necessitates innovative solutions due to a lack of organized domain experts, particularly in developing countries where many farmers are impoverished and cannot afford expert consulting. Initiatives like Farmers Helpline play a crucial role in such countries, yet challenges such as high operational costs persist. Automating query resolution can alleviate the burden on traditional call centers, providing farmers with immediate and contextually relevant information. The integration of Agriculture and Artificial Intelligence (AI) offers a transformative opportunity to empower farmers and bridge information gaps. Language models like transformers, the rising stars of AI, possess remarkable language understanding capabilities, making them ideal for addressing information gaps in agriculture. This work explores and demonstrates the transformative potential of Large Language Models (LLMs) in automating query resolution for agricultural farmers, leveraging their expertise in deciphering natural language and understanding context. Using a subset of a vast dataset of real-world farmer queries collected in India, our study focuses on approximately 4 million queries from the state of Tamil Nadu, spanning various sectors, seasonal crops, and query types.
+ 2024.nlp4pi-1.16
+ didwania-etal-2024-agrillm
+
+
+ SciTechBaitRO: ClickBait Detection for Romanian Science and Technology News
+ Raluca-AndreeaGînga
+ AnaUbanUniversitatea Bucuresti
+ 188-201
+ In this paper, we introduce a new annotated corpus of clickbait news in a low-resource language - Romanian, and a rarely covered domain - science and technology news: SciTechBaitRO. To our knowledge, it is one of the first and the largest corpus (almost 11,000 examples) of annotated clickbait texts for the Romanian language, and the first to focus on the sci-tech domain. We evaluate the possibility of automatically detecting clickbait through a series of data analysis and machine learning experiments with varied features and models, including a range of linguistic features, classical machine learning models, deep learning and pre-trained models. We compare the performance of models using different kinds of features, and show that the best results are given by the BERT models, with results of up to 89% F1 score. We additionally evaluate the models in a cross-domain setting for news belonging to other categories (i.e. politics, sports, entertainment) and demonstrate their capacity to generalize by detecting out-of-domain clickbait news with high F1 scores.
+ 2024.nlp4pi-1.17
+ ginga-uban-2024-scitechbaitro
+
+
+ Investigating Ableism in LLMs through Multi-turn Conversation
+ GuojunWuUniversity of Zurich
+ SarahEblingUniversity of Zurich
+ 202-210
+ To reveal ableism (i.e., bias against persons with disabilities) in large language models (LLMs), we introduce a novel approach involving multi-turn conversations, enabling a comparative assessment. Initially, we prompt the LLM to generate short biographies, followed by a request to incorporate information about a disability. Finally, we employ several methods to identify the top words that distinguish the disability-integrated biographies from those without. This comparative setting helps us uncover how LLMs handle disability-related information and reveal underlying biases. We observe that LLMs tend to highlight disabilities in a manner that can be perceived as patronizing or as implying that overcoming challenges is unexpected due to the disability.
+ 2024.nlp4pi-1.18
+ wu-ebling-2024-investigating
+
+
+ Eliciting Uncertainty in Chain-of-Thought to Mitigate Bias against Forecasting Harmful User Behaviors
+ AnthonySiciliaNortheastern University
+ MaliheAlikhaniNortheastern University
+ 211-223
+ Conversation forecasting tasks a model with predicting the outcome of an unfolding conversation. For instance, it can be applied in social media moderation to predict harmful user behaviors before they occur, allowing for preventative interventions. While large language models (LLMs) have recently been proposed as an effective tool for conversation forecasting, it’s unclear what biases they may have, especially against forecasting the (potentially harmful) outcomes we request them to predict during moderation. This paper explores to what extent model uncertainty can be used as a tool to mitigate potential biases. Specifically, we ask three primary research questions: 1) how does LLM forecasting accuracy change when we ask models to represent their uncertainty; 2) how does LLM bias change when we ask models to represent their uncertainty; 3) how can we use uncertainty representations to reduce or completely mitigate biases without many training data points. We address these questions for 5 open-source language models tested on 2 datasets designed to evaluate conversation forecasting for social media moderation.
+ 2024.nlp4pi-1.19
+ sicilia-alikhani-2024-eliciting
+
+
+ Inferring Mental Burnout Discourse Across Reddit Communities
+ NazaninSabri
+ AnhPham
+ IshitaKakkar
+ MaiElSheriefNortheastern University
+ 224-231
+ Mental burnout refers to a psychological syndrome induced by chronic stress that negatively impacts the emotional and physical well-being of individuals. From the occupational context to personal hobbies, burnout is pervasive across domains and therefore affects the morale and productivity of society as a whole. Currently, no linguistic resources are available for the analysis or detection of burnout language. We address this gap by introducing a dataset annotated for burnout language. Given that social media is a platform for sharing life experiences and mental health struggles, our work examines the manifestation of burnout language in Reddit posts. We introduce a contextual word sense disambiguation approach to identify the specific meaning or context in which the word “burnout” is used, distinguishing between its application in mental health (e.g., job-related stress leading to burnout) and non-mental health contexts (e.g., engine burnout in a mechanical context). We create a dataset of 2,330 manually labeled Reddit posts for this task and annotate the reason the poster associates with their burnout (e.g., professional, personal, non-traditional). We train machine learning models on this dataset, achieving a minimum F1 score of 0.84 across the different tasks. We make our dataset of annotated Reddit post IDs publicly available to help advance future research in this field.
+ 2024.nlp4pi-1.21
+ sabri-etal-2024-inferring
+
+
+ Decoding Ableism in Large Language Models: An Intersectional Approach
+ RongLiUniversity of Zurich
+ AshwiniKamarajUniversity of Zurich
+ JingMaUniversity of Zurich
+ SarahEblingUniversity of Zurich
+ 232-249
+ With the pervasive use of large language models (LLMs) across various domains, addressing the inherent ableist biases within these models requires more attention and resolution. This paper examines ableism in three LLMs (GPT-3.5, GPT-4, and Llama 3) by analyzing the intersection of disability with two additional social categories: gender and social class. Utilizing two task-specific prompts, we generated and analyzed text outputs with two metrics, VADER and regard, to evaluate sentiment and social perception biases within the responses. Our results indicate a marked improvement in bias mitigation from GPT-3.5 to GPT-4, with the latter demonstrating more positive sentiments overall, while Llama 3 showed comparatively weaker performance. Additionally, our findings underscore the complexity of intersectional biases: These biases are shaped by the combined effects of disability, gender, and class, which alter the expression and perception of ableism in LLM outputs. This research highlights the necessity for more nuanced and inclusive bias mitigation strategies in AI development, contributing to the ongoing dialogue on ethical AI practices.
+ 2024.nlp4pi-1.22
+ li-etal-2024-decoding
+
+
+ Explainable Identification of Hate Speech towards Islam using Graph Neural Networks
+ Azmine ToushikWasi
+ 250-257
+ Islamophobic language on online platforms fosters intolerance, making detection and elimination crucial for promoting harmony. Traditional hate speech detection models rely on NLP techniques like tokenization, part-of-speech tagging, and encoder-decoder models. However, Graph Neural Networks (GNNs), with their ability to utilize relationships between data points, offer more effective detection and greater explainability. In this work, we represent speeches as nodes and connect them with edges based on their context and similarity to develop the graph. This study introduces a novel paradigm using GNNs to identify and explain hate speech towards Islam. Our model leverages GNNs to understand the context and patterns of hate speech by connecting texts via pretrained NLP-generated word embeddings, achieving state-of-the-art performance and enhancing detection accuracy while providing valuable explanations. This highlights the potential of GNNs in combating online hate speech and fostering a safer, more inclusive online environment.
+ 2024.nlp4pi-1.23
+ wasi-2024-explainable
+
+
+ From Text to Maps: LLM-Driven Extraction and Geotagging of Epidemiological Data
+ KarlynHarrodOak Ridge National Laboratory
+ PrabinBhandariGeorge Mason University
+ AntoniosAnastasopoulosAthena Research Center and George Mason University
+ 258-270
+ Epidemiological datasets are essential for public health analysis and decision-making, yet they remain scarce and often difficult to compile due to inconsistent data formats, language barriers, and evolving political boundaries. Traditional methods of creating such datasets involve extensive manual effort and are prone to errors in accurate location extraction. To address these challenges, we propose utilizing large language models (LLMs) to automate the extraction and geotagging of epidemiological data from textual documents. Our approach significantly reduces the manual effort required, limiting human intervention to validating a subset of records against text snippets and verifying the geotagging reasoning, as opposed to reviewing multiple entire documents manually to extract, clean, and geotag. Additionally, the LLMs identify information often overlooked by human annotators, further enhancing the dataset’s completeness. Our findings demonstrate that LLMs can be effectively used to semi-automate the extraction and geotagging of epidemiological data, offering several key advantages: (1) comprehensive information extraction with minimal risk of missing critical details; (2) minimal human intervention; (3) higher-resolution data with more precise geotagging; and (4) significantly reduced resource demands compared to traditional methods.
+ 2024.nlp4pi-1.24
+ harrod-etal-2024-text
+
+
+ Crafting Tomorrow’s Headlines: Neural News Generation and Detection in English, Turkish, Hungarian, and Persian
+ CemÜyük
+ DanicaRovóTechnische Universität München
+ ShaghayeghKolli
+ RabiaVarolTechnische Universität München
+ GeorgGrohTechnical University Munich
+ DarynaDementieva
+ 271-307
+ In the era dominated by information overload and its facilitation with Large Language Models (LLMs), the prevalence of misinformation poses a significant threat to public discourse and societal well-being. A critical concern at present involves the identification of machine-generated news. In this work, we take a significant step by introducing a benchmark dataset designed for neural news detection in four languages: English, Turkish, Hungarian, and Persian. The dataset incorporates outputs from multiple multilingual generators (in both zero-shot and fine-tuned setups) such as BloomZ, LLaMa-2, Mistral, Mixtral, and GPT-4. Next, we experiment with a variety of classifiers, ranging from those based on linguistic features to advanced Transformer-based models and LLM prompting. We present the detection results, aiming to delve into the interpretability and robustness of machine-generated text detectors across all target languages.
+ 2024.nlp4pi-1.25
+ uyuk-etal-2024-crafting
+
+
+ Reference-Based Metrics Are Biased Against Blind and Low-Vision Users’ Image Description Preferences
+ RheaKapur
+ ElisaKreissUniversity of California, Los Angeles
+ 308-314
+ Image description generation models are sophisticated Vision-Language Models which promise to make visual content, such as images, non-visually accessible through linguistic descriptions. While these systems can benefit all, their primary motivation tends to lie in allowing blind and low-vision (BLV) users access to increasingly visual (online) discourse. Well-defined evaluation methods are crucial for steering model development into socially useful directions. In this work, we show that the most popular evaluation metrics (reference-based metrics) are biased against BLV users and therefore potentially stifle useful model development. Reference-based metrics assign quality scores based on the similarity to human-generated ground-truth descriptions and are widely accepted as neutrally representing the needs of all users. However, we find that these metrics are more strongly correlated with sighted participant ratings than BLV ratings, and we explore factors which appear to mediate this finding: description length, the image’s context of appearance, and the number of reference descriptions available. These findings suggest that there is a need for developing evaluation methods that are established based on specific downstream user groups, and they highlight the importance of reflecting on emerging biases against minorities in the development of general-purpose automatic metrics.
+ 2024.nlp4pi-1.26
+ kapur-kreiss-2024-reference
+
+
+ MultiClimate: Multimodal Stance Detection on Climate Change Videos
+ JiawenWang
+ LongfeiZuo
+ SiyaoPengLudwig-Maximilians-Universität München
+ BarbaraPlankLudwig-Maximilians-Universität München and IT University of Copenhagen
+ 315-326
+ Climate change (CC) has attracted increasing attention in NLP in recent years. However, detecting the stance on CC in multimodal data is understudied and remains challenging due to a lack of reliable datasets. To improve the understanding of public opinions and communication strategies, this paper presents MultiClimate, the first open-source manually-annotated stance detection dataset with 100 CC-related YouTube videos and 4,209 frame-transcript pairs. We deploy state-of-the-art vision and language models, as well as multimodal models for MultiClimate stance detection. Results show that text-only BERT significantly outperforms image-only ResNet50 and ViT. Combining both modalities achieves state-of-the-art, 0.747/0.749 in accuracy/F1. Our 100M-sized fusion models also beat CLIP and BLIP, as well as the much larger 9B-sized multimodal IDEFICS and text-only Llama3 and Gemma2, indicating that multimodal stance detection remains challenging for large language models. Our code, dataset, as well as supplementary materials, are available at https://github.com/werywjw/MultiClimate.
+ 2024.nlp4pi-1.27
+ wang-etal-2024-multiclimate
+
+
+ AAVENUE: Detecting LLM Biases on NLU Tasks in AAVE via a Novel Benchmark
+ AbhayGupta
+ EceYurtseven
+ PhilipMengAlgoverse AI Research
+ KevinZhuAlgoverse AI Research
+ 327-333
+ Detecting biases in natural language understanding (NLU) for African American Vernacular English (AAVE) is crucial to developing inclusive natural language processing (NLP) systems. To address dialect-induced performance discrepancies, we introduce AAVENUE (AAVE Natural Language Understanding Evaluation), a benchmark for evaluating large language model (LLM) performance on NLU tasks in AAVE and Standard American English (SAE). AAVENUE builds upon and extends existing benchmarks like VALUE, replacing deterministic syntactic and morphological transformations with a more flexible methodology leveraging LLM-based translation with few-shot prompting, improving performance across our evaluation metrics when translating key tasks from the GLUE and SuperGLUE benchmarks. We compare AAVENUE and VALUE translations using five popular LLMs and a comprehensive set of metrics including fluency, BARTScore, quality, coherence, and understandability. Additionally, we recruit fluent AAVE speakers to validate our translations for authenticity. Our evaluations reveal that LLMs consistently perform better on SAE tasks than AAVE-translated versions, underscoring inherent biases and highlighting the need for more inclusive NLP models.
+ 2024.nlp4pi-1.28
+ gupta-etal-2024-aavenue
+
+
+ DiversityMedQA: A Benchmark for Assessing Demographic Biases in Medical Diagnosis using Large Language Models
+ RajatRawatAlgoverse Coding Academy LLC
+ HudsonMcBride
+ DhiyaanNirmalAlgoverse Coding Academy
+ RajarshiGhosh
+ JongMoon
+ DhruvAlamuri
+ KevinZhuAlgoverse AI Research
+ 334-348
+ As large language models (LLMs) gain traction in healthcare, concerns about their susceptibility to demographic biases are growing. We introduce DiversityMedQA, a novel benchmark designed to assess LLM responses to medical queries across diverse patient demographics, such as gender and ethnicity. By perturbing questions from the MedQA dataset, which comprises medical board exam questions, we created a benchmark that captures the nuanced differences in medical diagnosis across varying patient profiles. To ensure that our perturbations did not alter the clinical outcomes, we implemented a filtering strategy to validate each perturbation, so that any performance discrepancies would be indicative of bias. Our findings reveal notable discrepancies in model performance when tested against these demographic variations. By releasing DiversityMedQA, we provide a resource for evaluating and mitigating demographic bias in LLM medical diagnoses.
+ 2024.nlp4pi-1.29
+ rawat-etal-2024-diversitymedqa
+
+
+ Improving Industrial Safety by Auto-Generating Case-specific Preventive Recommendations
+ SangameshwarPatilIndian Institute of Technology, Madras and Tata Consultancy Services Limited, India
+ SumitKoundanyaTata Consultancy Services Limited, India
+ ShubhamKumbharTata Consultancy Services Limited, India
+ AlokKumarTata Consultancy Services Limited, India
+ 349-353
+ In this paper, we propose a novel application to improve industrial safety by generating preventive recommendations using LLMs. Using a dataset of 275 incidents representing 11 different incident types sampled from real-life OSHA incidents, we compare three different LLMs to evaluate the quality of the preventive recommendations they generate. We also show that LLMs are not a panacea for the preventive recommendation generation task. They have limitations and can produce responses that are incorrect or irrelevant. We found that about 65% of the output from the Vicuna model was not acceptable even at the level of basic readability and other sanity checks. Mistral and Phi-3 are better than Vicuna, but not all of their recommendations are of similar quality. We find that for a given safety incident case, the generated recommendations can be categorized as specific, generic, or irrelevant. This helps us to better quantify and compare the performance of the models. This paper is among the first works on the preventive recommendation generation problem. We believe it will pave the way for the use of NLP to positively impact industrial safety.
+ 2024.nlp4pi-1.30
+ patil-etal-2024-improving
+
+
+
diff --git a/data/xml/2024.nlp4science.xml b/data/xml/2024.nlp4science.xml
new file mode 100644
index 0000000000..f784c0446d
--- /dev/null
+++ b/data/xml/2024.nlp4science.xml
@@ -0,0 +1,252 @@
+
+
+
+
+ Proceedings of the 1st Workshop on NLP for Science (NLP4Science)
+ LotemPeled-Cohen
+ NitayCalderon
+ ShirLissak
+ RoiReichart
+ Association for Computational Linguistics
+ Miami, FL, USA
+ November
+ 2024
+ 2024.nlp4science-1
+ nlp4science
+
+
+ 2024.nlp4science-1.0
+ nlp4science-2024-1
+
+
+ TokenSHAP: Interpreting Large Language Models with Monte Carlo Shapley Value Estimation
+ MiriamHorovicz
+ RoniGoldshmidt
+ 1-8
+ As large language models (LLMs) become increasingly prevalent in critical applications, the need for interpretable AI has grown. We introduce TokenSHAP, a novel method for interpreting LLMs by attributing importance to individual tokens or substrings within input prompts. This approach adapts Shapley values from cooperative game theory to natural language processing, offering a rigorous framework for understanding how different parts of an input contribute to a model’s response. TokenSHAP leverages Monte Carlo sampling for computational efficiency, providing interpretable, quantitative measures of token importance. We demonstrate its efficacy across diverse prompts and LLM architectures, showing consistent improvements over existing baselines in alignment with human judgments, faithfulness to model behavior, and consistency. Our method’s ability to capture nuanced interactions between tokens provides valuable insights into LLM behavior, enhancing model transparency, improving prompt engineering, and aiding in the development of more reliable AI systems. TokenSHAP represents a significant step towards the necessary interpretability for responsible AI deployment, contributing to the broader goal of creating more transparent, accountable, and trustworthy AI systems. Open-source code: https://github.com/ronigold/TokenSHAP
+ 2024.nlp4science-1.1
+ horovicz-goldshmidt-2024-tokenshap
+
+
+ Prediction of CRISPR On-Target Effects via Deep Learning
+ CondyBao
+ FuxiaoLiu
+ 9-15
+ Since the advent of CRISPR-Cas9, a groundbreaking gene-editing technology that enables precise genomic modifications via a short RNA guide sequence, there has been a marked increase in the accessibility and application of this technology across various fields. The success of CRISPR-Cas9 has spurred further investment and led to the discovery of additional CRISPR systems, including CRISPR-Cas13. Distinct from Cas9, which targets DNA, Cas13 targets RNA, offering unique advantages for gene modulation. We focus on Cas13d, a variant known for its collateral activity where it non-specifically cleaves adjacent RNA molecules upon activation, a feature critical to its function. We introduce DeepFM-Crispr, a novel deep learning model developed to predict the on-target efficiency and evaluate the off-target effects of Cas13d. This model harnesses a large language model to generate comprehensive representations rich in evolutionary and structural data, thereby enhancing predictions of RNA secondary structures and overall sgRNA efficacy. A transformer-based architecture processes these inputs to produce a predictive efficacy score. Comparative experiments show that DeepFM-Crispr not only surpasses traditional models but also outperforms recent state-of-the-art deep learning methods in terms of prediction accuracy and reliability.
+ 2024.nlp4science-1.2
+ bao-liu-2024-prediction
+
+
+ What an Elegant Bridge: Multilingual LLMs are Biased Similarly in Different Languages
+ ViktorMihaylov
+ AleksandarShtedritski
+ 16-23
+ This paper investigates biases of Large Language Models (LLMs) through the lens of grammatical gender. Drawing inspiration from seminal works in psycholinguistics, particularly the study of gender’s influence on language perception, we leverage multilingual LLMs to revisit and expand upon the foundational experiments of Boroditsky (2003). Employing LLMs as a novel method for examining psycholinguistic biases related to grammatical gender, we prompt a model to describe nouns with adjectives in various languages, focusing specifically on languages with grammatical gender. In particular, we look at adjective co-occurrences across gender and languages, and train a binary classifier to predict grammatical gender given adjectives an LLM uses to describe a noun. Surprisingly, we find that a simple classifier can not only predict noun gender above chance but also exhibit cross-language transferability. We show that while LLMs may describe words differently in different languages, they are biased similarly.
+ 2024.nlp4science-1.3
+ mihaylov-shtedritski-2024-elegant
+
+
+ PsychoLex: Unveiling the Psychological Mind of Large Language Models
+ MohammadAbbasiIran University of Science and Technology Tehran, University of Tehran
+ FarnazMirnezami
+ HassanNaderi
+ 24-35
+ This paper explores the intersection of psychology and artificial intelligence through the development and evaluation of specialized Large Language Models (LLMs). We introduce PsychoLex, a suite of resources designed to enhance LLMs’ proficiency in psychological tasks in both Persian and English. Key contributions include the PsychoLexQA dataset for instructional content and the PsychoLexEval dataset for rigorous evaluation of LLMs in complex psychological scenarios. Additionally, we present the PsychoLexLLaMA model, optimized specifically for psychological applications, demonstrating superior performance compared to general-purpose models. The findings underscore the potential of tailored LLMs for advancing psychological research and applications, while also highlighting areas for further refinement. This research offers a foundational step towards integrating LLMs into specialized psychological domains, with implications for future advancements in AI-driven psychological practice.
+ 2024.nlp4science-1.4
+ abbasi-etal-2024-psycholex
+
+
+ Two-Stage Graph-Augmented Summarization of Scientific Documents
+ RezvanehRezapourDrexel University
+ YubinGeUniversity of Illinois at Urbana-Champaign
+ KanyaoHanWalmart Global Tech and University of Illinois at Urbana-Champaign
+ RayJeongUniversity of Illinois, Urbana-Champaign
+ JanaDiesnerTechnische Universität München
+ 36-46
+ Automatic text summarization helps to digest the vast and ever-growing amount of scientific publications. While transformer-based solutions like BERT and SciBERT have advanced scientific summarization, lengthy documents pose a challenge due to the token limits of these models. To address this issue, we introduce and evaluate a two-stage model that combines an extract-then-compress framework. Our model incorporates a “graph-augmented extraction module” to select order-based salient sentences and an “abstractive compression module” to generate concise summaries. Additionally, we introduce the *BioConSumm* dataset, which focuses on biodiversity conservation, to support underrepresented domains and explore domain-specific summarization strategies. Out of the tested models, our model achieves the highest ROUGE-2 and ROUGE-L scores on our newly created dataset (*BioConSumm*) and on the *SUMPUBMED* dataset, which serves as a benchmark in the field of biomedicine.
+ 2024.nlp4science-1.5
+ rezapour-etal-2024-two
+
+
+ GCD-TM: Graph-Driven Community Detection for Topic Modelling in Psychiatry Texts
+ AnusuyaKrishnanUnited Arab Emirates University
+ IsaiasGhebrehiwet
+ 47-57
+ Psychiatry texts provide critical insights into patient mental states and therapeutic interactions. These texts are essential for understanding psychiatric conditions, treatment dynamics, and patient responses. However, the complex and diverse nature of psychiatric communications poses significant challenges for traditional topic modeling methods. The intricate language, subtle psychological nuances, and varying lengths of text segments make it difficult to extract coherent and meaningful topics. Conventional approaches often struggle to capture the depth and overlap of themes present in these texts. In this study, we present a novel approach to topic modeling that addresses these limitations by reformulating the problem as a community detection task within a graph constructed from the text corpus. Our methodology includes lemmatization for data standardization, TF-IDF vectorization to create a term-document matrix, and cosine similarity computation to produce a similarity matrix. This matrix is then binarized to form a graph, on which community detection is performed using the Louvain method. The detected communities are subsequently analyzed with Latent Dirichlet Allocation (LDA) to extract topics. Our approach outperforms traditional topic modeling methods, offering more accurate and interpretable topic extraction with improved coherence and lower perplexity.
+ 2024.nlp4science-1.6
+ krishnan-ghebrehiwet-2024-gcd
+
+
+ SCITUNE: Aligning Large Language Models with Human-Curated Scientific Multimodal Instructions
+ SameeraHorawalavithanaPacific Northwest National Laboratory
+ SaiMunikotiPacific Northwest National Laboratory
+ IanStewartPacific Northwest National Laboratory
+ HenryKvingePacific Northwest National Laboratory
+ KarlPazdernikNorth Carolina State University, Pacific Northwest National Laboratory and Deep Football
+ 58-72
+ Instruction finetuning is a popular paradigm to align large language models (LLMs) with human intent. Despite its popularity, this idea is less explored for aligning existing foundation models with scientific disciplines, concepts, and goals. In this work, we present SciTune as a tuning framework to improve the ability of LLMs to follow multimodal instructions generated from scientific publications. To test our methodology, we train a large multimodal model, LLaMA-SciTune, that connects a vision encoder and an LLM for science-focused visual and language understanding. LLaMA-SciTune significantly outperforms state-of-the-art models on generated figure types and captions in the SciCap and VisText benchmarks. In comparison to the models that are finetuned with synthetic data only, LLaMA-SciTune surpasses human performance on average and in many sub-categories on the ScienceQA benchmark. Our results demonstrate that human-generated scientific multimodal instructions remain highly valuable in tuning LLMs to perform well on science tasks, despite their lower volume and relative scarcity compared to synthetic data.
+ 2024.nlp4science-1.7
+ horawalavithana-etal-2024-scitune
+
+
+ RACER: An LLM-powered Methodology for Scalable Analysis of Semi-structured Mental Health Interviews
+ Satpreet HarcharanSinghHarvard University
+ KevinJiangNA
+ KanchanBhasinNA
+ AshutoshSabharwal
+ NidalMoukaddamNA
+ AnkitPatelBaylor College of Medicine and Rice University
+ 73-98
+ Semi-structured interviews (SSIs) are a commonly employed data-collection method in healthcare research, offering in-depth qualitative insights into subject experiences. Despite their value, manual analysis of SSIs is notoriously time-consuming and labor-intensive, in part due to the difficulty of extracting and categorizing emotional responses, and challenges in scaling human evaluation for large populations. In this study, we develop RACER, a Large Language Model (LLM) based expert-guided automated pipeline that efficiently converts raw interview transcripts into insightful domain-relevant themes and sub-themes. We used RACER to analyze SSIs conducted with 93 healthcare professionals and trainees to assess the broad personal and professional mental health impacts of the COVID-19 crisis. RACER achieves moderately high agreement with two human evaluators (72%), which approaches the human inter-rater agreement (77%). Interestingly, LLMs and humans struggle with similar content involving nuanced emotional, ambivalent/dialectical, and psychological statements. Our study highlights the opportunities and challenges in using LLMs to improve research efficiency and opens new avenues for scalable analysis of SSIs in healthcare research.
+ 2024.nlp4science-1.8
+ singh-etal-2024-racer
+
+
+ Soft Measures for Extracting Causal Collective Intelligence
+ MaryamBerijanianMichigan State University
+ SpencerDork
+ KuldeepSingh
+ MichaelMillikan
+ AshlinRiggs
+ AadarshSwaminathan
+ SarahGibbsUniversity of South Alabama
+ ScottFriedmanSIFT
+ NathanBrugnoneTwo Six Technologies
+ 99-116
+ Understanding and modeling collective intelligence is essential for addressing complex social systems. Directed graphs called fuzzy cognitive maps (FCMs) offer a powerful tool for encoding causal mental models, but extracting high-integrity FCMs from text is challenging. This study presents an approach using large language models (LLMs) to automate FCM extraction. We introduce novel graph-based similarity measures and evaluate them by correlating their outputs with human judgments through the Elo rating system. Results show positive correlations with human evaluations, but even the best-performing measure exhibits limitations in capturing FCM nuances. Fine-tuning LLMs improves performance, but existing measures still fall short. This study highlights the need for soft similarity measures tailored to FCM extraction, advancing collective intelligence modeling with NLP.
+ 2024.nlp4science-1.9
+ berijanian-etal-2024-soft
+
+
+ Hypothesis Generation with Large Language Models
+ YangqiaoyuZhouUniversity of Chicago
+ HaokunLiuUniversity of Chicago
+ TejesSrivastavaUniversity of Chicago
+ HongyuanMeiToyota Technological Institute at Chicago
+ ChenhaoTanUniversity of Chicago
+ 117-139
+ Effective generation of novel hypotheses is instrumental to scientific progress. So far, researchers have been the main powerhouse behind hypothesis generation through painstaking data analysis and thinking (also known as the Eureka moment). In this paper, we examine the potential of large language models (LLMs) to generate hypotheses. We focus on hypothesis generation based on data (i.e., labeled examples). To enable LLMs to handle long contexts, we generate initial hypotheses from a small number of examples and then update them iteratively to improve the quality of hypotheses. Inspired by multi-armed bandits, we design a reward function to inform the exploitation-exploration tradeoff in the update process. Our algorithm is able to generate hypotheses that enable much better predictive performance than few-shot prompting in classification tasks, improving accuracy by 31.7% on a synthetic dataset and by 13.9%, 3.3%, and 24.9% on three real-world datasets. We also outperform supervised learning by 12.1% and 11.6% on two challenging real-world datasets. Furthermore, we find that the generated hypotheses not only corroborate human-verified theories but also uncover new insights for the tasks.
+ 2024.nlp4science-1.10
+ zhou-etal-2024-hypothesis
+
+
+ Dreaming with ChatGPT: Unraveling the Challenges of LLMs Dream Generation
+ HarelBerger
+ HadarKingNA
+ OmerDavidNA
+ 140-147
+ Large Language Models (LLMs), such as ChatGPT, are used daily for different human-like text generation tasks. This motivates us to ask: Can an LLM generate human dreams? In this research, we explore this new avenue through the lens of ChatGPT and its ability to generate valid dreams. We have three main findings: (i) ChatGPT-4o, the new version of ChatGPT, generated all requested dreams. (ii) Generated dreams meet key psychological criteria of dreams. We hope our work will set the stage for developing a new task of dream generation for LLMs. This task can help psychologists evaluate patients’ dreams based on their demographic factors.
+ 2024.nlp4science-1.11
+ berger-etal-2024-dreaming
+
+
+ LLMs and NLP for Generalized Learning in AI-Enhanced Educational Videos and Powering Curated Videos with Generative Intelligence
+ NainaChaturvedi
+ 148-154
+ The rapid advancement of Large Language Models (LLMs) and Natural Language Processing (NLP) technologies has opened new frontiers in educational content creation and consumption. This paper explores the intersection of these technologies with instructional videos in computer science education, addressing the crucial aspect of generalization in NLP models within an educational context. With 78% of computer science students utilizing YouTube to supplement traditional learning materials, there’s a clear demand for high-quality video content. However, the challenge of finding appropriate resources has led 73% of students to prefer curated video libraries. We propose a novel approach that leverages LLMs and NLP techniques to revolutionize this space, focusing on the ability of these models to generalize across diverse educational content and contexts. Our research utilizes the cubits.ai platform, developed at Princeton University, to demonstrate how generative AI, powered by advanced LLMs, can transform standard video playlists into interactive, AI-enhanced learning experiences. We present a framework for creating AI-generated video summaries, on-demand questions, and in-depth topic explorations, all while considering the challenges posed by LLMs trained on vast, often opaque datasets. Our approach not only enhances student engagement but also provides a unique opportunity to study how well these models generalize across different educational topics and student needs. Drawing insights from computer science courses at Princeton and Rutgers Universities, we highlight the transformative potential of AI-enhanced videos in promoting active learning, particularly in large classes. This research contributes to the ongoing dialogue about generalization in NLP while simultaneously demonstrating practical applications in educational technology. By bridging these domains, we aim to establish a shared platform for state-of-the-art generalization testing in NLP within an educational framework. Our findings not only demonstrate how educators can enhance existing video playlists using AI but also provide insights into the challenges and opportunities of using LLMs in educational settings. This work serves as a cornerstone for catalyzing research on generalization in the NLP community, particularly focusing on the application and evaluation of LLMs in adaptive, personalized learning environments.
+ 2024.nlp4science-1.12
+ chaturvedi-2024-llms
+
+
+ The Moral Foundations Weibo Corpus
+ RenjieCao
+ MiaoyanHu
+ JiahanWei
+ BahaIhnainiWenzhou Kean University
+ 155-165
+ Moral sentiments expressed in natural language significantly influence both online and offline environments, shaping behavioral styles and interaction patterns, including social media self-presentation, cyberbullying, adherence to social norms, and ethical decision-making. To effectively measure moral sentiments in natural language processing texts, it is crucial to utilize large, annotated datasets that provide nuanced understanding for accurate analysis and model training. However, existing corpora, while valuable, often face linguistic limitations. To address this gap in the Chinese language domain, we introduce the Moral Foundation Weibo Corpus. This corpus consists of 25,671 Chinese comments on Weibo, encompassing six diverse topic areas. Each comment is manually annotated by at least three systematically trained annotators based on ten moral categories derived from a grounded theory of morality. To assess annotator reliability, we present the kappa test results, a gold standard for measuring consistency. Additionally, we apply several of the latest large language models to supplement the manual annotations, conducting analytical experiments to compare their performance and report baseline results for moral sentiment classification.
+ 2024.nlp4science-1.13
+ cao-etal-2024-moral
+
+
+ Why So Serious: Humor and its Association with Treatment Measurements Process and Outcome
+ MatanKenigsbuchNA
+ NatalieShapira
+ 166-174
+ Humor is an important social construct with various roles in human communication, yet clinicians remain divided on its appropriateness and effectiveness. Despite its importance, empirical research on humor in psychotherapy is limited. This study explores the theoretical concept of “humor” by examining the operational variable of “laughs” within psychotherapy. Method: We analyzed transcriptions from 872 psychotherapy sessions involving 68 clients treated by 59 therapists. Clients self-reported their symptoms and state of well-being before each session, while both clients and therapists provided self-reports on their therapeutic alliance after each session. Through text analysis, we extracted the number of laughs and words for each session. We investigated the within-client associations between laughs and symptoms, well-being, therapeutic alliance, and clients’ number of words. Results: We found concurrent session-level associations between laughs and well-being, symptoms, and the number of words. However, no significant associations were observed between laughs and the therapeutic alliance, either from the perspective of the therapist or the client.
+ 2024.nlp4science-1.14
+ kenigsbuch-shapira-2024-serious
+
+
+ Learning the Bitter Lesson: Empirical Evidence from 20 Years of CVPR Proceedings
+ MojtabaYousefi
+ JackCollins
+ 175-187
+ This study examines the alignment of Conference on Computer Vision and Pattern Recognition (CVPR) research with the principles of the “bitter lesson” proposed by Rich Sutton. We analyze two decades of CVPR abstracts and titles using large language models (LLMs) to assess the field’s embrace of these principles. Our methodology leverages state-of-the-art natural language processing techniques to systematically evaluate the evolution of research approaches in computer vision. The results reveal significant trends in the adoption of general-purpose learning algorithms and the utilization of increased computational resources. We discuss the implications of these findings for the future direction of computer vision research and its potential impact on broader artificial intelligence development. This work contributes to the ongoing dialogue about the most effective strategies for advancing machine learning and computer vision, offering insights that may guide future research priorities and methodologies in the field.
+ 2024.nlp4science-1.15
+ yousefi-collins-2024-learning
+
+
+ Personalized-ABA: Personalized Treatment Plan Generation for Applied Behavior Analysis using Natural Language Processing
+ AmanKumarTheraDriver
+ MareikoAu
+ RajSemlawatNA
+ MalavicaSridhar
+ HiteshGurnaniNA
+ 188-196
+ Autism Spectrum Disorder (ASD) is a neurological and developmental disability that affects how an individual learns, communicates, and interacts with others. Applied Behavior Analysis (ABA) is a gold standard therapy for children and adults suffering from ASD to improve their learning, social, and communication skills. Today, 1 in 36 children is diagnosed with ASD, with expectations that this rate will only continue to rise. The supply of certified ABA providers is alarmingly insufficient to meet the needs of children with ASD. In fact, waitlists to receive ABA therapy in the United States exceed 10 months in most states. Clinicians or Board Certified Behavior Analysts (BCBAs) are now experiencing intense bottlenecks around diagnostic evaluations and developing treatment plans quickly enough to support timely access to care. Over the past few years, Artificial Intelligence has changed the way industries operate by offering powerful ways to process, analyze, generate, and predict data. In this paper, we have addressed the problem of both time and supply restrictions faced by ABA providers by proposing a novel method for personalized treatment plan generation and program prediction by leveraging the capabilities of Deep Learning and Large Language Models (LLMs). Additionally, we have introduced two separate models for behavior program prediction (F1-Score: 0.671) and skill acquisition program prediction (Rouge-1 Score: 0.476), which will help ABA providers in treatment plan implementation. Results are promising: an AI-generated treatment plan demonstrates a high similarity (Average Similarity Score: 0.915) to the original treatment plan written by a BCBA. Finally, as we partnered with a multi-state ABA provider in building this product, we ran a single-blind study that concluded that BCBAs prefer an AI-generated treatment plan 65 percent of the time compared to a BCBA-generated one.
+ 2024.nlp4science-1.16
+ kumar-etal-2024-personalized
+
+
+ Exploring Scientific Hypothesis Generation with Mamba
+ MiaosenChai
+ EmilyHerronOak Ridge National Laboratory
+ ErickCervantes
+ TirthankarGhosalOak Ridge National Laboratory
+ 197-207
+ Generating scientifically grounded hypotheses is a challenging frontier task for generative AI models in science. The difficulty arises from the inherent subjectivity of the task and the extensive knowledge of prior work required to assess the validity of a generated hypothesis. Large Language Models (LLMs), trained on vast datasets from diverse sources, have shown a strong ability to utilize the knowledge embedded in their training data. Recent research has explored using transformer-based models for scientific hypothesis generation, leveraging their advanced capabilities. However, these models often require a significant number of parameters to manage long sequences, which can be a limitation. State Space Models, such as Mamba, offer an alternative by effectively handling very long sequences with fewer parameters than transformers. In this work, we investigate the use of Mamba for scientific hypothesis generation. Our preliminary findings indicate that Mamba achieves performance similar to transformer-based models of comparable size on a higher-order complex task like hypothesis generation. We have made our code available here: https://github.com/fglx-c/Exploring-Scientific-Hypothesis-Generation-with-Mamba
+ 2024.nlp4science-1.17
+ chai-etal-2024-exploring
+
+
+ Benchmarking Automated Theorem Proving with Large Language Models
+ VanessaLamaOak Ridge National Laboratory
+ CatherineMa
+ TirthankarGhosalOak Ridge National Laboratory
+ 208-218
+ Theorem proving presents a significant challenge for large language models (LLMs) due to the requirement for formal proofs to be rigorously checked by proof assistants, such as Lean, eliminating any margin for error or hallucination. While existing LLM-based theorem provers attempt to operate autonomously, they often struggle with novel and complex theorems where human insights are essential. Lean Copilot is a novel framework that integrates LLM inference into the Lean proof assistant environment. In this work, we benchmark the performance of several LLMs, including general-purpose and math-specific models, for theorem proving using the Lean Copilot framework. Our initial investigation suggests that a general-purpose large model like LLaMa-70B still has an edge over smaller math-specific models for the task under consideration. We provide useful insights into the performance of the different LLMs we chose for the task.
+ 2024.nlp4science-1.18
+ lama-etal-2024-benchmarking
+
+
+ The Grid: A semi-automated tool to support expert-driven modeling
+ AllegraA. Beal CohenNA
+ MariaAlexeevaUniversity of Arizona
+ KeithAlcockUniversity of Arizona
+ MihaiSurdeanuUniversity of Arizona
+ 219-229
+ When building models of human behavior, we often struggle to find data that capture important factors at the right level of granularity. In these cases, we must rely on expert knowledge to build models. To help partially automate the organization of expert knowledge for modeling, we combine natural language processing (NLP) and machine learning (ML) methods in a tool called the Grid. The Grid helps users organize textual knowledge into clickable cells along two dimensions using iterative, collaborative clustering. We conduct a user study to explore participants’ reactions to the Grid, as well as to investigate whether its clustering feature helps participants organize a corpus of expert knowledge. We find that participants using the Grid’s clustering feature appeared to work more efficiently than those without it, but written feedback about the clustering was critical. We conclude that the general design of the Grid was positively received and that some of the user challenges can likely be mitigated through the use of LLMs.
+ 2024.nlp4science-1.19
+ a-beal-cohen-etal-2024-grid
+
+
+ Categorical Syllogisms Revisited: A Review of the Logical Reasoning Abilities of LLMs for Analyzing Categorical Syllogisms
+ ShiZong
+ JimmyLinUniversity of Waterloo
+ 230-239
+ A huge number of benchmarks have been proposed to evaluate how large language models (LLMs) behave on logical inference tasks. However, it remains an open question how to properly evaluate this ability. In this paper, we provide a systematic overview of prior works on the logical reasoning ability of LLMs for analyzing categorical syllogisms. We first investigate all the possible variations for categorical syllogisms from a purely logical perspective and then examine the underlying configurations (i.e., mood and figure) tested by existing datasets. Our results indicate that compared to template-based synthetic datasets, crowdsourcing approaches normally sacrifice the coverage of configurations (i.e., mood and figure) of categorical syllogisms for more language variations, thus bringing challenges to fully testing LLMs under different situations. We then summarize the findings and observations for the performance of LLMs to infer the validity of syllogisms from the current literature. The error rate breakdown analyses suggest that the interpretation of quantifiers seems to be the current bottleneck that limits the performance of the LLMs and is thus worth more attention. Finally, we discuss several points that might be worth considering when researchers plan to release categorical syllogism datasets. We hope our work will provide a timely review of the current literature regarding categorical syllogisms, and motivate more interdisciplinary research between communities, specifically computational linguists and logicians.
+ 2024.nlp4science-1.20
+ zong-lin-2024-categorical
+
+
+ Individuation in Neural Models with and without Visual Grounding
+ AlexeyTikhonovInworld AI
+ LisaBylininaUtrecht University
+ IvanYamshchikovTechnical University of Applied Sciences Würzburg-Schweinfurt and ISEG, University of Lisbon
+ 240-248
+ We show differences between a language-and-vision model CLIP and two text-only models — FastText and SBERT — when it comes to the encoding of individuation information. We study latent representations that CLIP provides for substrates, granular aggregates, and various numbers of objects. We demonstrate that CLIP embeddings capture quantitative differences in individuation better than models trained on text-only data. Moreover, the individuation hierarchy we deduce from the CLIP embeddings agrees with the hierarchies proposed in linguistics and cognitive science.
+ 2024.nlp4science-1.21
+ tikhonov-etal-2024-individuation
+
+
+ CogErgLLM: Exploring Large Language Model Systems Design Perspective Using Cognitive Ergonomics
+ Azmine ToushikWasi
+ MstIslam
+ 249-258
+ Integrating cognitive ergonomics with LLMs is crucial for improving safety, reliability, and user satisfaction in human-AI interactions. Current LLM designs often lack this integration, resulting in systems that may not fully align with human cognitive capabilities and limitations. This oversight exacerbates biases in LLM outputs and leads to suboptimal user experiences due to inconsistent application of user-centered design principles. Researchers are increasingly leveraging NLP, particularly LLMs, to model and understand human behavior across social sciences, psychology, psychiatry, health, and neuroscience. Our position paper explores the need to integrate cognitive ergonomics into LLM design, providing a comprehensive framework and practical guidelines for ethical development. By addressing these challenges, we aim to advance safer, more reliable, and ethically sound human-AI interactions.
+ 2024.nlp4science-1.22
+ wasi-islam-2024-cogergllm
+
+
+
diff --git a/data/xml/2024.sicon.xml b/data/xml/2024.sicon.xml
new file mode 100644
index 0000000000..0b96aa43b7
--- /dev/null
+++ b/data/xml/2024.sicon.xml
@@ -0,0 +1,148 @@
+
+
+
+
+ Proceedings of the Second Workshop on Social Influence in Conversations (SICon 2024)
+ JamesHale
+ KushalChawla
+ MuskanGarg
+ Association for Computational Linguistics
+ Miami, Florida, USA
+ November
+ 2024
+ 2024.sicon-1
+ sicon
+
+
+ 2024.sicon-1.0
+ sicon-2024-1
+
+
+ Observing the Southern US Culture of Honor Using Large-Scale Social Media Analysis
+ JuhoKimUniversity of Toronto
+ MichaelGuerzhoyUniversity of Toronto
+ 1-8
+ A culture of honor refers to a social system where individuals’ status, reputation, and esteem play a central role in governing interpersonal relations. Past works have associated this concept with the United States (US) South and linked it to various traits such as higher sensitivity to insult, a higher value on reputation, and a tendency to react violently to insults. In this paper, we hypothesize and confirm that internet users from the US South, where a culture of honor is more prevalent, are more likely to display a trait predicted by their belonging to a culture of honor. Specifically, we test the hypothesis that US Southerners are more likely to retaliate against personal attacks by personally attacking back. We leverage OpenAI’s GPT-3.5 API both to geolocate internet users and to automatically detect whether users are insulting each other. We validate the use of GPT-3.5 by measuring its performance on manually-labeled subsets of the data. Our work demonstrates the potential of formulating a hypothesis based on a conceptual framework, operationalizing it in a way that is amenable to large-scale LLM-aided analysis, manually validating the use of the LLM, and drawing a conclusion.
+ 2024.sicon-1.1
+ kim-guerzhoy-2024-observing
+
+
+ Should We Respect LLMs? A Cross-Lingual Study on the Influence of Prompt Politeness on LLM Performance
+ ZiqiYinWaseda University
+ HaoWangWaseda University
+ KaitoHorioWaseda University
+ DaisukeKawaharaWaseda University
+ SatoshiSekineRIKEN AIP, NII LLMC
+ 9-35
+ We investigate the impact of politeness levels in prompts on the performance of large language models (LLMs). Polite language in human communications often garners more compliance and effectiveness, while rudeness can cause aversion, impacting response quality. We consider that LLMs mirror human communication traits, suggesting they align with human cultural norms. We assess the impact of politeness in prompts on LLMs across English, Chinese, and Japanese tasks. We observed that impolite prompts often result in poor performance, but overly polite language does not guarantee better outcomes. The optimal politeness level differs according to the language. This phenomenon suggests that LLMs not only reflect human behavior but are also influenced by language, particularly in different cultural contexts. Our findings highlight the need to factor in politeness for cross-cultural natural language processing and LLM usage.
+ 2024.sicon-1.2
+ yin-etal-2024-respect
+
+
+ Personality Differences Drive Conversational Dynamics: A High-Dimensional NLP Approach
+ JuliaFisherStanford University
+ NilamRamStanford University
+ 36-45
+ This paper investigates how the topical flow of dyadic conversations emerges over time and how differences in interlocutors’ personality traits contribute to this topical flow. Leveraging text embeddings, we map the trajectories of conversations between strangers into a high-dimensional space. Using nonlinear projections and clustering, we then identify when each interlocutor enters and exits various topics. Differences in conversational flow are quantified via topic entropy, a summary measure of the “spread” of topics covered during a conversation, and linguistic alignment, a time-varying measure of the cosine similarity between interlocutors’ embeddings. Our findings suggest that interlocutors with a larger difference in the personality dimension of openness influence each other to spend more time discussing a wider range of topics and that interlocutors with a larger difference in extraversion experience a larger decrease in linguistic alignment throughout their conversation. We also examine how participants’ affect (emotion) changes from before to after a conversation, finding that a larger difference in extraversion predicts a larger difference in affect change and that a greater topic entropy predicts a larger affect increase. This work demonstrates how communication research can be advanced through the use of high-dimensional NLP methods and identifies personality difference as an important driver of social influence.
+ 2024.sicon-1.3
+ fisher-ram-2024-personality
+
+
+ RecomMind: Movie Recommendation Dialogue with Seeker’s Internal State
+ TakashiKodamaResearch and Development Center for LLMs, National Institute of Informatics
+ HirokazuKiyomaruResearch and Development Center for LLMs, National Institute of Informatics
+ Yin JouHuangKyoto University
+ SadaoKurohashiResearch and Development Center for LLMs, National Institute of Informatics, Kyoto University
+ 46-63
+ Humans pay careful attention to the interlocutor’s internal state in dialogues. For example, in recommendation dialogues, we make recommendations while estimating the seeker’s internal state, such as his/her level of knowledge and interest. Since there are no existing annotated resources for the analysis and experiment, we constructed RecomMind, a movie recommendation dialogue dataset with annotations of the seeker’s internal state at the entity level. Each entity has a first-person label annotated by the seeker and a second-person label annotated by the recommender. Our analysis based on RecomMind reveals that the success of recommendations is enhanced when recommenders mention entities that seekers do not know but are interested in. We also propose a response generation framework that explicitly considers the seeker’s internal state, utilizing the chain-of-thought prompting. The human evaluation results show that our proposed method outperforms the baseline method in both consistency and the success of recommendations.
+ 2024.sicon-1.4
+ kodama-etal-2024-recommind
+
+
+ Redefining Proactivity for Information Seeking Dialogue
+ Jing YangLeeNanyang Technological University
+ SeokhwanKimGoogle Cloud AI
+ KartikMehtaAmazon AGI
+ Jiun-YuKaoAmazon AGI
+ Yu-HsiangLinAmazon AGI, Meta
+ ArpitGuptaAmazon AGI
+ 64-84
+
+ 2024.sicon-1.5
+ lee-etal-2024-redefining
+
+
+ Leveraging Large Language Models for Code-Mixed Data Augmentation in Sentiment Analysis
+ LindaZengThe Harker School
+ 85-101
+ Code-mixing (CM), where speakers blend languages within a single expression, is prevalent in multilingual societies but poses challenges for natural language processing due to its complexity and limited data. We propose using a large language model to generate synthetic CM data, which is then used to enhance the performance of task-specific models for CM sentiment analysis. Our results show that in Spanish-English, synthetic data improved the F1 score by 9.32%, outperforming previous augmentation techniques. However, in Malayalam-English, synthetic data only helped when the baseline was low; with strong natural data, additional synthetic data offered little benefit. Human evaluation confirmed that this approach is a simple, cost-effective way to generate natural-sounding CM sentences, particularly beneficial for low baselines. Our findings suggest that few-shot prompting of large language models is a promising method for CM data augmentation and has a significant impact on improving sentiment analysis, an important element in the development of social influence systems.
+ 2024.sicon-1.6
+ zeng-2024-leveraging
+
+
+ Balancing Transparency and Accuracy: A Comparative Analysis of Rule-Based and Deep Learning Models in Political Bias Classification
+ ManuelMartinezUniversity of Florida
+ SonjaSchmer-GalunderUniversity of Florida
+ ZoeyLiuUniversity of Florida
+ SangpilYoumUniversity of Florida
+ ChathuriJayaweeraUniversity of Florida
+ BonnieDorrUniversity of Florida
+ 102-115
+ The unchecked spread of digital information, combined with increasing political polarization and the tendency of individuals to isolate themselves from opposing political viewpoints, has driven researchers to develop systems for automatically detecting political bias in media. This trend has been further fueled by discussions on social media. We explore methods for categorizing bias in US news articles, comparing rule-based and deep learning approaches. The study highlights the sensitivity of modern self-learning systems to unconstrained data ingestion, while reconsidering the strengths of traditional rule-based systems. Applying both models to left-leaning (CNN) and right-leaning (FOX) News articles, we assess their effectiveness on data beyond the original training and test sets. This analysis highlights each model’s accuracy, offers a framework for exploring deep-learning explainability, and sheds light on political bias in US news media. We contrast the opaque architecture of a deep learning model with the transparency of a linguistically informed rule-based model, showing that the rule-based model performs consistently across different data conditions and offers greater transparency, whereas the deep learning model is dependent on the training set and struggles with unseen data.
+ 2024.sicon-1.7
+ martinez-etal-2024-balancing
+
+
+ “So, are you a different person today?” Analyzing Bias in Questions during Parole Hearings
+ WassilikiSiskouCluster of Excellence “The Politics of Inequality”, University of Konstanz, University of Passau
+ IngridEspinozaCluster of Excellence “The Politics of Inequality”, University of Konstanz
+ 116-128
+ During Parole Suitability Hearings, commissioners need to evaluate whether an inmate’s risk of reoffending has decreased sufficiently to justify their release from prison before completing their full sentence. The conversation between the commissioners and the inmate is the key element of such hearings and is largely driven by question-and-answer patterns which can be influenced by the commissioner’s questioning behavior. To our knowledge, no previous study has investigated the relationship between the types of questions asked during parole hearings and potentially biased outcomes. We address this gap by analysing commissioners’ questioning behavior during Californian parole hearings. We test ChatGPT-4o’s capability of annotating questions automatically and achieve a high F1-score of 0.91 without prior training. By analysing all questions posed directly by commissioners to inmates, we tested for potential biases in question types across multiple demographic variables. The results show minimal bias in questioning behavior toward inmates asking for parole.
+ 2024.sicon-1.8
+ siskou-espinoza-2024-different
+
+
+ Principles for AI-Assisted Social Influence and Their Application to Social Mediation
+ IanPereraFlorida Institute for Human and Machine Cognition
+ AlexMemoryJohns Hopkins University Applied Physics Laboratory
+ VeraKazakovaFlorida Institute for Human and Machine Cognition
+ BonnieDorrUniversity of Florida
+ BrodieMatherFlorida Institute for Human and Machine Cognition
+ RitwikBoseJohns Hopkins University Applied Physics Laboratory
+ ArashMahyariFlorida Institute for Human and Machine Cognition
+ CoreyLofdahlLeidos, Inc.
+ MackBlackburnLeidos, Inc.
+ ArchnaBhatiaFlorida Institute for Human and Machine Cognition
+ BrandonPattersonFlorida Institute for Human and Machine Cognition
+ PeterPirolliFlorida Institute for Human and Machine Cognition
+ 129-140
+ Successful social influence, whether at individual or community levels, requires expertise and care in several dimensions of communication: understanding of emotions, beliefs, and values; transparency; and context-aware behavior shaping. Based on our experience in identifying mediation needs in social media and engaging with moderators and users, we developed a set of principles that we believe social influence systems should adhere to in order to ensure ethical operation, effectiveness, widespread adoption, and trust by users on both sides of the influence engagement. We demonstrate these principles in D-ESC: Dialogue Assistant for Engaging in Social-Cybermediation, in the context of AI-assisted social media mediation, a newer paradigm of automatic moderation that responds to unique and changing communities while engendering and maintaining trust in users, moderators, and platform-holders. Through this case study, we identify opportunities for our principles to guide future systems toward greater positive social change.
+ 2024.sicon-1.9
+ perera-etal-2024-principles
+
+
+ EHDChat: A Knowledge-Grounded, Empathy-Enhanced Language Model for Healthcare Interactions
+ ShenghanWuInstitute of Data Science, National University of Singapore
+ WynneHsuInstitute of Data Science, National University of Singapore
+ Mong LiLeeInstitute of Data Science, National University of Singapore
+ 141-151
+ Large Language Models (LLMs) excel at a range of tasks but often struggle with issues like hallucination and inadequate empathy support. To address hallucinations, we ground our dialogues in medical knowledge sourced from external repositories such as Disease Ontology and DrugBank. To improve empathy support, we develop the Empathetic Healthcare Dialogues dataset, which utilizes multiple dialogue strategies in each response. This dataset is then used to fine-tune an LLM, and we introduce a lightweight, adaptable method called Strategy Combination Guidance to enhance the emotional support capabilities of the fine-tuned model, named EHDChat. Our evaluations show that EHDChat significantly outperforms existing models in providing emotional support and medical accuracy, demonstrating the effectiveness of our approach in enhancing empathetic and informed AI interactions in healthcare.
+ 2024.sicon-1.10
+ wu-etal-2024-ehdchat
+
+
+ Domain-Expanded ASTE: Rethinking Generalization in Aspect Sentiment Triplet Extraction
+ Yew KenChiaSingapore University of Technology and Design, DAMO Academy, Alibaba Group, Singapore
+ HuiChenSingapore University of Technology and Design
+ GuizhenChenDAMO Academy, Alibaba Group, Nanyang Technological University
+ WeiHanSingapore University of Technology and Design
+ SharifahAljuniedDAMO Academy, Alibaba Group, Singapore
+ SoujanyaPoriaSingapore University of Technology and Design
+ LidongBingDAMO Academy, Alibaba Group, Singapore
+ 152-165
+ Aspect Sentiment Triplet Extraction (ASTE) is a challenging task in sentiment analysis, aiming to provide fine-grained insights into human sentiments. However, existing benchmarks are limited to two domains and do not evaluate model performance on unseen domains, raising concerns about the generalization of proposed methods. Furthermore, it remains unclear if large language models (LLMs) can effectively handle complex sentiment tasks like ASTE. In this work, we address the issue of generalization in ASTE from both a benchmarking and modeling perspective. We introduce a domain-expanded benchmark by annotating samples from diverse domains, enabling evaluation of models in both in-domain and out-of-domain settings. Additionally, we propose CASE, a simple and effective decoding strategy that enhances trustworthiness and performance of LLMs in ASTE. Through comprehensive experiments involving multiple tasks, settings, and models, we demonstrate that CASE can serve as a general decoding strategy for complex sentiment tasks. By expanding the scope of evaluation and providing a more reliable decoding strategy, we aim to inspire the research community to reevaluate the generalizability of benchmarks and models for ASTE. Our code, data, and models are available at https://github.com/DAMO-NLP-SG/domain-expanded-aste.
+ 2024.sicon-1.11
+ chia-etal-2024-domain
+
+
+
diff --git a/data/xml/2024.tsar.xml b/data/xml/2024.tsar.xml
new file mode 100644
index 0000000000..7e2824228f
--- /dev/null
+++ b/data/xml/2024.tsar.xml
@@ -0,0 +1,151 @@
+
+
+
+
+ Proceedings of the Third Workshop on Text Simplification, Accessibility and Readability (TSAR 2024)
+ HoracioSaggion
+ MarcosZampieri
+ MatthewShardlow
+ Association for Computational Linguistics
+ Miami, Florida, USA
+ November
+ 2024
+ 2024.tsar-1
+ tsar
+
+
+ 2024.tsar-1.0
+ tsar-2024-1
+
+
+ MultiLS: An End-to-End Lexical Simplification Framework
+ KaiNorthGeorge Mason University
+ TharinduRanasingheLancaster University
+ MatthewShardlowManchester Metropolitan University
+ MarcosZampieriGeorge Mason University
+ 1-11
+ Lexical Simplification (LS) automatically replaces difficult-to-read words with easier alternatives while preserving a sentence’s original meaning. Several datasets exist for LS, and each of them specializes in one or two sub-tasks within the LS pipeline. However, to date, no single LS dataset has been developed that covers all LS sub-tasks. We present MultiLS, the first LS framework that allows for the creation of a multi-task LS dataset. We also present MultiLS-PT, the first dataset created using the MultiLS framework. We demonstrate the potential of MultiLS-PT by carrying out all LS sub-tasks of (1) lexical complexity prediction (LCP), (2) substitute generation, and (3) substitute ranking for Portuguese.
+ 2024.tsar-1.1
+ north-etal-2024-multils
+
+
+ OtoBERT: Identifying Suffixed Verbal Forms in Modern Hebrew Literature
+ AviShmidmanBar Ilan University and DICTA
+ ShaltielShmidmanDicta
+ 12-19
+ We provide a solution for a specific morphological obstacle which often makes Hebrew literature difficult to parse for the younger generation. The morphologically-rich nature of the Hebrew language allows pronominal direct objects to be realized as bound morphemes, suffixed to the verb. Although such suffixes are often utilized in Biblical Hebrew, their use has all but disappeared in modern Hebrew. Nevertheless, authors of modern Hebrew literature, in their search for literary flair, do make use of such forms. These unusual forms are notorious for alienating young readers from Hebrew literature, especially because these rare suffixed forms are often orthographically identical to common Hebrew words with different meanings. Upon encountering such words, readers naturally select the usual analysis of the word; yet, upon completing the sentence, they find themselves confounded. Young readers end up feeling “tricked”, and this in turn contributes to their alienation from the text. In order to address this challenge, we pretrained a new BERT model specifically geared to identify such forms, so that they may be automatically simplified and/or flagged. We release this new BERT model to the public for unrestricted use.
+ 2024.tsar-1.2
+ shmidman-shmidman-2024-otobert
+
+
+ CompLex-ZH: A New Dataset for Lexical Complexity Prediction in Mandarin and Cantonese
+ LeQiuThe Hong Kong Polytechnic University
+ ShanyueGuoThe Hong Kong Polytechnic University
+ Tak-SumWongDepartment of Chinese and Bilingual Studies
+ EmmanueleChersoniHong Kong Polytechnic University
+ JohnLeeCity University of Hong Kong
+ Chu-RenHuangThe Hong Kong Polytechnic Universiy
+ 20-26
+ The prediction of lexical complexity in context is assuming increasing relevance in Natural Language Processing research, since identifying complex words is often the first step of text simplification pipelines. To the best of our knowledge, though, datasets annotated with complex words are available only for English and for a limited number of Western languages. In our paper, we introduce CompLex-ZH, a dataset including words annotated with complexity scores in sentential contexts for Chinese. Our data include sentences in Mandarin and Cantonese, which were selected from a variety of sources and textual genres. We provide a first evaluation with baselines combining hand-crafted and language-model-based features.
+ 2024.tsar-1.3
+ qiu-etal-2024-complex
+
+
+ Images Speak Volumes: User-Centric Assessment of Image Generation for Accessible Communication
+ MiriamAnschützTechnical University of Munich
+ TringaSylajTechnical University of Munich
+ GeorgGrohTUM
+ 27-40
+ Explanatory images play a pivotal role in accessible and easy-to-read (E2R) texts. However, the images available in online databases are not tailored toward the respective texts, and the creation of customized images is expensive. In this large-scale study, we investigated whether text-to-image generation models can close this gap by providing customizable images quickly and easily. We benchmarked seven image generation models, four open-source and three closed-source, and provide an extensive evaluation of the resulting images. In addition, we performed a user study with people from the E2R target group to examine whether the images met their requirements. We find that some of the models show remarkable performance, but none of the models are ready to be used at a larger scale without human supervision. Our research is an important step toward facilitating the creation of accessible information for E2R creators and tailoring accessible images to the target group’s needs.
+ 2024.tsar-1.4
+ anschutz-etal-2024-images
+
+
+ Cochrane-auto: An Aligned Dataset for the Simplification of Biomedical Abstracts
+ JanBakkerUniversity of Amsterdam
+ JaapKampsUniversity of Amsterdam
+ 41-51
+ The most reliable and up-to-date information on health questions is found in the biomedical literature, but it is often inaccessible due to complex, jargon-heavy language. Domain-specific scientific text simplification holds the promise of making this literature accessible to a lay audience. Therefore, we create Cochrane-auto: a large corpus of pairs of aligned sentences, paragraphs, and abstracts from biomedical abstracts and lay summaries. Experiments demonstrate that a plan-guided simplification system trained on Cochrane-auto is able to outperform a strong baseline trained on unaligned abstracts and lay summaries. More generally, our freely available corpus, complementing Newsela-auto and Wiki-auto, facilitates text simplification research beyond the sentence level and direct lexical and grammatical revisions.
+ 2024.tsar-1.5
+ bakker-kamps-2024-cochrane
+
+
+ Considering Human Interaction and Variability in Automatic Text Simplification
+ JeniaKimHU University of Applied Sciences Utrecht
+ StefanLeijnenHU University of Applied Sciences Utrecht
+ LisaBeinbornUnviersity of Goettingen
+ 52-60
+ Research into automatic text simplification aims to promote access to information for all members of society. To facilitate generalizability, simplification research often abstracts away from specific use cases, and targets a prototypical reader and an underspecified content creator. In this paper, we consider a real-world use case – simplification technology for use in Dutch municipalities – and identify the needs of the content creators and the target audiences in this use case. The stakeholders envision a system that (a) assists the human writer without taking over the task; (b) can provide diverse alternative outputs, tailored for specific target audiences; and (c) can explain and motivate the suggestions that it outputs. These requirements call for technology that is characterized by modularity, explainability, and variability. We believe that these are important research directions that require further exploration.
+ 2024.tsar-1.6
+ kim-etal-2024-considering
+
+
+ Society of Medical Simplifiers
+ ChenLyuUniversity of Warwick
+ GabrielePergolaUniversity of Warwick
+ 61-68
+ Medical text simplification is crucial for making complex biomedical literature more accessible to non-experts. Traditional methods struggle with the specialized terms and jargon of medical texts, lacking the flexibility to adapt the simplification process dynamically. In contrast, recent advancements in large language models (LLMs) present unique opportunities by offering enhanced control over text simplification through iterative refinement and collaboration between specialized agents. In this work, we introduce the Society of Medical Simplifiers, a novel LLM-based framework inspired by the “Society of Mind” (SOM) philosophy. Our approach leverages the strengths of LLMs by assigning five distinct roles, i.e., Layperson, Simplifier, Medical Expert, Language Clarifier, and Redundancy Checker, organized into interaction loops. This structure allows the agents to progressively improve text simplification while maintaining the complexity and accuracy of the original content. Evaluations on the Cochrane text simplification dataset demonstrate that our framework is on par with or outperforms state-of-the-art methods, achieving superior readability and content preservation through controlled simplification processes.
+ 2024.tsar-1.7
+ lyu-pergola-2024-society
+
+
+ Difficult for Whom? A Study of Japanese Lexical Complexity
+ AdamNohejlNara Institute of Science and Technology
+ AkioHayakawaUniversitat Pompeu Fabra
+ YusukeIdeNara Institute of Science and Technology
+ TaroWatanabeNara Institute of Science and Technology
+ 69-81
+ The tasks of lexical complexity prediction (LCP) and complex word identification (CWI) commonly presuppose that difficult-to-understand words are shared by the target population. Meanwhile, personalization methods have also been proposed to adapt models to individual needs. We verify that a recent Japanese LCP dataset is representative of its target population by partially replicating the annotation. Through a second reannotation, we show that native Chinese speakers perceive the complexity differently due to Sino-Japanese vocabulary. To explore the possibilities of personalization, we compare competitive baselines trained on the group mean ratings and individual ratings in terms of performance for an individual. We show that the model trained on a group mean performs similarly to an individual model in the CWI task, while achieving good LCP performance for an individual is difficult. We also experiment with adapting a finetuned BERT model, which results only in marginal improvements across all settings.
+ 2024.tsar-1.8
+ nohejl-etal-2024-difficult
+
+
+ Lexical Complexity Prediction and Lexical Simplification for Catalan and Spanish: Resource Creation, Quality Assessment, and Ethical Considerations
+ HoracioSaggionUniversitat Pompeu Fabra
+ StefanBottUniversitat Pompeu Fabra
+ SandraSzaszUniversitat Pompeu Fabra
+ NelsonPérezInstituto Tecnológico de Costa Rica
+ SaúlCalderónIntituto Tecnológico de Costa Rica
+ MartínSolísInstituto Tecnológico de Costa Rica
+ 82-94
+ Automatic lexical simplification is the task of substituting lexical items that may be unfamiliar and difficult to understand with easier and more common words. This paper presents the description and analysis of two novel datasets for lexical simplification in Spanish and Catalan. The Catalan dataset represents the first of its kind, and the Spanish dataset constitutes a substantial addition to the sparse data on automatic lexical simplification available for Spanish. Specifically, it is the first dataset for Spanish which includes scalar ratings of the understanding difficulty of lexical items. In addition, we present a detailed analysis aiming at assessing the appropriateness and ethical dimensions of the data for the lexical simplification task.
+ 2024.tsar-1.9
+ saggion-etal-2024-lexical
+
+
+ SciGisPy: a Novel Metric for Biomedical Text Simplification via Gist Inference Score
+ ChenLyuUniversity of Warwick
+ GabrielePergolaUniversity of Warwick
+ 95-106
+ Biomedical literature is often written in highly specialized language, posing significant comprehension challenges for non-experts. Automatic text simplification (ATS) offers a solution by making such texts more accessible while preserving critical information. However, evaluating ATS for biomedical texts is still challenging due to the limitations of existing evaluation metrics. General-domain metrics like SARI, BLEU, and ROUGE focus on surface-level text features, and readability metrics like FKGL and ARI fail to account for domain-specific terminology or assess how well the simplified text conveys core meanings (gist). To address this, we introduce SciGisPy, a novel evaluation metric inspired by Gist Inference Score (GIS) from Fuzzy-Trace Theory (FTT). SciGisPy measures how well a simplified text facilitates the formation of abstract inferences (gist) necessary for comprehension, especially in the biomedical domain. We revise GIS for this purpose by introducing domain-specific enhancements, including semantic chunking, Information Content (IC) theory, and specialized embeddings, while removing unsuitable indexes. Our experimental evaluation on the Cochrane biomedical text simplification dataset demonstrates that SciGisPy outperforms the original GIS formulation, with a significant increase in correctly identified simplified texts (84% versus 44.8%). The results and a thorough ablation study confirm that SciGisPy better captures the essential meaning of biomedical content, outperforming existing approaches.
+ 2024.tsar-1.10
+ lyu-pergola-2024-scigispy
+
+
+ EASSE-DE & EASSE-multi: Easier Automatic Sentence Simplification Evaluation for German & Multiple Languages
+ ReginaStoddenComputational Linguistics Department, Heinrich Heine University Düsseldorf
+ 107-116
+ In this work, we propose EASSE-multi, a framework for easier automatic sentence simplification evaluation for languages other than English. Compared to the original EASSE framework, EASSE-multi does not focus only on English. It contains tokenizers and versions of text simplification evaluation metrics which are suitable for multiple languages. In this paper, we exemplify the usage of EASSE-multi for German TS, resulting in EASSE-DE. Further, we compare text simplification results when evaluating with different language or tokenization settings of the metrics. Based on this, we formulate recommendations on how to make the evaluation of (German) TS models more transparent and better comparable. Additionally, we present a benchmark on German TS evaluated with EASSE-DE and make its resources (i.e., test sets, system outputs, and evaluation reports) available. The code of EASSE-multi and its German specialisation (EASSE-DE) can be found at https://github.com/rstodden/easse-multi and https://github.com/rstodden/easse-de.
+ 2024.tsar-1.11
+ stodden-2024-easse
+
+
+ Evaluating the Simplification of Brazilian Legal Rulings in LLMs Using Readability Scores as a Target
+ Antonio FlavioPaulaUniversidade Federal de Goiás
+ CelsoCamilo-JuniorInstitute of Informatics, Federal University of Goiás
+ 117-125
+ Legal documents are often characterized by complex language, including jargon and technical terms, making them challenging for Natural Language Processing (NLP) applications. We apply the readability-controlled text modification task with an emphasis on legal text simplification. Additionally, our work explores an evaluation based on the comparison of word complexity in the documents using the Zipf scale, demonstrating the models’ ability to simplify text according to the target readability scores, while also identifying a limit to this capability. Our results with Llama-3 and Sabiá-2 show that while the complexity score decreases with higher readability targets, there is a trade-off with reduced semantic similarity.
+ 2024.tsar-1.12
+ paula-camilo-junior-2024-evaluating
+
+
+ Measuring and Modifying the Readability of English Texts with GPT-4
+ SeanTrottUC San Diego
+ PamelaRivièreUC San Diego
+ 126-134
+ The success of Large Language Models (LLMs) in other domains has raised the question of whether LLMs can reliably assess and manipulate the readability of text. We approach this question empirically. First, using a published corpus of 4,724 English text excerpts, we find that readability estimates produced “zero-shot” from GPT-4 Turbo and GPT-4o mini exhibit relatively high correlation with human judgments (r = 0.76 and r = 0.74, respectively), out-performing estimates derived from traditional readability formulas and various psycholinguistic indices. Then, in a pre-registered human experiment (N = 59), we ask whether Turbo can reliably make text easier or harder to read. We find evidence to support this hypothesis, though considerable variance in human judgments remains unexplained. We conclude by discussing the limitations of this approach, including limited scope, as well as the validity of the “readability” construct and its dependence on context, audience, and goal.
+ 2024.tsar-1.13
+ trott-riviere-2024-measuring
+
+
+
diff --git a/data/xml/2024.wat.xml b/data/xml/2024.wat.xml
new file mode 100644
index 0000000000..041515d9d2
--- /dev/null
+++ b/data/xml/2024.wat.xml
@@ -0,0 +1,80 @@
+
+
+
+
+ Proceedings of the Eleventh Workshop on Asian Translation (WAT 2024)
+ ToshiakiNakazawa
+ IsaoGoto
+ Association for Computational Linguistics
+ Miami, Florida, USA
+ November
+ 2024
+ 2024.wat-1
+ wat
+
+
+ 2024.wat-1.0
+ wat-2024-1
+
+
+ Creative and Context-Aware Translation of East Asian Idioms with GPT-4
+ KenanTangUC Santa Barbara
+ PeiyangSongCalifornia Institute of Technology
+ YaoQinUC Santa Barbara
+ XifengYanUC Santa Barbara
+ 1-21
+ As a type of figurative language, an East Asian idiom condenses rich cultural background into only a few characters. Translating such idioms is challenging for human translators, who often resort to choosing a context-aware translation from an existing list of candidates. However, compiling a dictionary of candidate translations demands much time and creativity even for expert translators. To alleviate this burden, we evaluate whether GPT-4 can help generate high-quality translations. Based on automatic evaluations of faithfulness and creativity, we first identify Pareto-optimal prompting strategies that can outperform translation engines from Google and DeepL. Then, at a low cost, our context-aware approach achieves far more high-quality translations per idiom than the human baseline. We open-source all code and data to facilitate further research.
+ 2024.wat-1.1
+ 2024.wat-1.1.SupplementaryMaterial.txt
+ 2024.wat-1.1.SupplementaryMaterial.zip
+ tang-etal-2024-creative-context
+
+
+ An Empirical Study of Multilingual Vocabulary for Neural Machine Translation Models
+ KenjiImamuraNational Institute of Information and Communications Technology
+ MasaoUtiyamaNICT
+ 22-35
+ In this paper, we discuss multilingual vocabulary for neural machine translation models. A multilingual vocabulary should yield highly accurate machine translations regardless of the language, and it is preferable that tokenized strings rarely contain out-of-vocabulary (OOV) tokens and that token sequences are short. We examine the characteristics of various multilingual vocabularies via tokenization and translation experiments. We also present our recommended vocabulary and tokenizer.
+ 2024.wat-1.2
+ 2024.wat-1.2.SupplementaryMaterial.zip
+ 2024.wat-1.2.SupplementaryMaterial.txt
+ imamura-utiyama-2024-empirical
+
+
+ Machine Translation Of Marathi Dialects: A Case Study Of Kadodi
+ RajDabreNICT
+ MaryDabreIndependent
+ TeresaPereiraSt. Gonsalo Garcia College, Vasai
+ 36-44
+ While Marathi is considered a low- to middle-resource language, its 42 dialects have mostly been ignored, mainly because these dialects are mostly spoken and rarely written, making them extremely low-resource. In this paper, we explore the machine translation (MT) of Kadodi, also known as Samvedi, which is a dialect of Marathi. We first discuss the Kadodi dialect, highlighting its differences from the standard dialect, and then present a manually curated dataset called Suman, consisting of a trilingual Kadodi-Marathi-English dictionary of 949 entries and 942 simple sentence triples and idioms created by native Kadodi speakers. We then evaluate three existing large language models (LLMs) supporting Marathi, namely Gemma-2-9b, Sarvam-2b-0.5 and LLaMa-3.1-8b, in few-shot prompting style to determine their efficacy for translation involving Kadodi. We observe that these models exhibit rather lackluster performance in handling Kadodi even for simple sentences, indicating a dire situation.
+ 2024.wat-1.3
+ 2024.wat-1.3.SupplementaryMaterial.txt
+ dabre-etal-2024-machine
+
+
+ Are Large Language Models State-of-the-art Quality Estimators for Machine Translation of User-generated Content?
+ ShenbinQianUniversity of Surrey
+ ConstantinOrasanUniversity of Surrey
+ DipteshKanojiaUniversity of Surrey
+ FélixDo CarmoUniversity of Surrey
+ 45-55
+ This paper investigates whether large language models (LLMs) are state-of-the-art quality estimators for machine translation of user-generated content (UGC) that contains emotional expressions, without the use of reference translations. To achieve this, we employ an existing emotion-related dataset with human-annotated errors and calculate quality evaluation scores based on the Multi-dimensional Quality Metrics. We compare the accuracy of several LLMs with that of our fine-tuned baseline models, under in-context learning and parameter-efficient fine-tuning (PEFT) scenarios. We find that PEFT of LLMs leads to better performance in score prediction with human interpretable explanations than fine-tuned models. However, a manual analysis of LLM outputs reveals that they still have problems such as refusal to reply to a prompt and unstable output while evaluating machine translation of UGC.
+ 2024.wat-1.4
+ 2024.wat-1.4.SupplementaryMaterial.txt
+ qian-etal-2024-large-language
+
+
+ AI-Tutor: Interactive Learning of Ancient Knowledge from Low-Resource Languages
+ SiddharthaDalalColumbia University
+ RahulAdityaColumbia University
+ VethavikashiniChithrra RaghuramColumbia University
+ PrahladKoratamaddiColumbia University
+ 56-66
+ Many low-resource languages, such as Prakrit, present significant linguistic complexities and have limited modern-day resources. These languages often have multiple derivatives; for example, Prakrit, a language in widespread use for roughly 500 years beginning around 2500 years ago, includes Pali and Gandhari, which encompass a vast body of Buddhist literature, as well as Ardhamagadhi, rich in Jain literature. Despite these challenges, these languages are invaluable for their historical, religious, and cultural insights, which are sought by non-language experts and others. To help non-language experts explore and understand the deep knowledge within these ancient texts, we propose a novel approach: translating multiple dialects of the parent language into a contemporary language and then enabling users to interact with the system in their native language, including English, Hindi, French and German, through a question-and-answer interface built on Large Language Models. We demonstrate the effectiveness of this novel AI-Tutor system by focusing on Ardhamagadhi and Pali.
+ 2024.wat-1.5
+ 2024.wat-1.5.SupplementaryMaterial.txt
+ 2024.wat-1.5.SupplementaryMaterial.zip
+ dalal-etal-2024-ai
+
+
+
diff --git a/data/xml/2024.wikinlp.xml b/data/xml/2024.wikinlp.xml
new file mode 100644
index 0000000000..948a32694e
--- /dev/null
+++ b/data/xml/2024.wikinlp.xml
@@ -0,0 +1,154 @@
+
+
+
+
+ Proceedings of the First Workshop on Advancing Natural Language Processing for Wikipedia
+ LucieLucie-Aimée
+ AngelaFan
+ TajuddeenGwadabe
+ IsaacJohnson
+ FabioPetroni
+ Danielvan Strien
+ Association for Computational Linguistics
+ Miami, Florida, USA
+ November
+ 2024
+ 2024.wikinlp-1
+ wikinlp
+
+
+ 2024.wikinlp-1.0
+ wikinlp-2024-1
+
+
+ BordIRlines: A Dataset for Evaluating Cross-lingual Retrieval Augmented Generation
+ BryanLi
+ SamarHaiderUniversity of Pennsylvania
+ FionaLuo
+ AdwaitAgashe
+ ChrisCallison-BurchAllen Institute for Artificial Intelligence and University of Pennsylvania
+ 1-13
+ Large language models excel at creative generation but continue to struggle with the issues of hallucination and bias. While retrieval-augmented generation (RAG) provides a framework for grounding LLMs’ responses in accurate and up-to-date information, it still raises the question of bias: which sources should be selected for inclusion in the context? And how should their importance be weighted? In this paper, we study the challenge of cross-lingual RAG and present a dataset to investigate the robustness of existing systems at answering queries about geopolitical disputes, which exist at the intersection of linguistic, cultural, and political boundaries. Our dataset is sourced from Wikipedia pages containing information relevant to the given queries and we investigate the impact of including additional context, as well as the composition of this context in terms of language and source, on an LLM’s response. Our results show that existing RAG systems continue to be challenged by cross-lingual use cases and suffer from a lack of consistency when they are provided with competing information in multiple languages. We present case studies to illustrate these issues and outline steps for future research to address these challenges.
+ 2024.wikinlp-1.3
+ li-etal-2024-bordirlines
+
+
+ Multi-Label Field Classification for Scientific Documents using Expert and Crowd-sourced Knowledge
+ RebeccaGellesGeorgetown University
+ JamesDunhamGeorgetown University
+ 14-20
+ Taxonomies of scientific research seek to describe complex domains of activity that are overlapping and dynamic. We address this challenge by combining knowledge curated by the Wikipedia community with the input of subject-matter experts to identify, define, and validate a system of 1,110 granular fields of study for use in multi-label classification of scientific publications. The result is capable of categorizing research across subfields of artificial intelligence, computer security, semiconductors, genetics, virology, immunology, neuroscience, biotechnology, and bioinformatics. We then develop and evaluate a solution for zero-shot classification of publications in terms of these fields.
+ 2024.wikinlp-1.7
+ gelles-dunham-2024-multi
+
+
+ Uncovering Differences in Persuasive Language in Russian versus English Wikipedia
+ BryanLi
+ AlekseyPanasyukAir Force Research Laboratory
+ ChrisCallison-BurchAllen Institute for Artificial Intelligence and University of Pennsylvania
+ 21-35
+ We study how differences in persuasive language across Wikipedia articles, written in either English or Russian, can uncover each culture’s distinct perspective on different subjects. We develop a large language model (LLM) powered system to identify instances of persuasive language in multilingual texts. Instead of directly prompting LLMs to detect persuasion, which is subjective and difficult, we propose to reframe the task to instead ask high-level questions (HLQs) which capture different persuasive aspects. Importantly, these HLQs are authored by LLMs themselves. LLMs over-generate a large set of HLQs, which are subsequently filtered to a small set aligned with human labels for the original task. We then apply our approach to a large-scale, bilingual dataset of Wikipedia articles (88K total), using a two-stage identify-then-extract prompting strategy to find instances of persuasion. We quantify the amount of persuasion per article, and explore the differences in persuasion through several experiments on the paired articles. Notably, we generate rankings of articles by persuasion in both languages. These rankings match our intuitions on culturally-salient subjects: Russian Wikipedia highlights subjects on Ukraine, while English Wikipedia highlights the Middle East. Grouping subjects into larger topics, we find that politically-related events contain more persuasion than others. We further demonstrate that HLQs obtain similar performance when posed in either English or Russian. Our methodology enables cross-lingual, cross-cultural understanding at scale, and we release our code, prompts, and data.
+ 2024.wikinlp-1.8
+ li-etal-2024-uncovering
+
+
+ Retrieval Evaluation for Long-Form and Knowledge-Intensive Image–Text Article Composition
+ Jheng-HongYang
+ CarlosLassanceNA
+ RafaelRezendeNaver Labs Europe
+ KrishnaSrinivasanResearch, Google
+ StéphaneClinchantNaver Labs Europe
+ JimmyLinUniversity of Waterloo
+ 36-45
+ This paper examines the integration of images into Wikipedia articles by evaluating image–text retrieval tasks in multimedia content creation, focusing on developing retrieval-augmented tools to enhance the creation of high-quality multimedia articles. Despite ongoing research, the interplay between text and visuals, such as photos and diagrams, remains underexplored, limiting support for real-world applications. We introduce AToMiC, a dataset for long-form, knowledge-intensive image–text retrieval, detailing its task design, evaluation protocols, and relevance criteria. Our findings show that a hybrid approach combining a sparse retriever with a dense retriever achieves satisfactory effectiveness, with nDCG@10 scores around 0.4 for Image Suggestion and Image Promotion tasks, providing insights into the challenges of retrieval evaluation in an image–text interleaved article composition context. The AToMiC dataset is available at https://github.com/TREC-AToMiC/AToMiC.
+ 2024.wikinlp-1.9
+ yang-etal-2024-retrieval
+
+
+ WikiBias as an Extrapolation Corpus for Bias Detection
+ K.Salas-JimenezUniversidad Nacional Autónoma de México
+ FranciscoLopez-Ponce
+ Sergio-LuisOjeda-Trueba
+ GemmaBel-EnguixUniversidad Nacional Autónoma de México
+ 46-52
+ This paper explores whether it is possible to train a machine learning model using Wikipedia data to detect subjectivity in sentences and generalize effectively to other domains. To achieve this, we performed experiments with the WikiBias corpus, the BABE corpus, and the CheckThat! dataset. Various classical ML models were tested, including Logistic Regression, SVC, and SVR, with features such as Sentence Transformers similarity, probabilistic sentiment measures, and biased lexicons. Pre-trained models like DistilRoBERTa, as well as large language models like Gemma and GPT-4, were also tested on the same classification task.
+ 2024.wikinlp-1.10
+ salas-jimenez-etal-2024-wikibias
+
+
+ HOAXPEDIA: A Unified Wikipedia Hoax Articles Dataset
+ HsuvasBorkakotyCardiff University
+ LuisEspinosa-AnkeCardiff University and AMPLYFI
+ 53-66
+ Hoaxes are a recognised form of disinformation created deliberately, with potentially serious implications for the credibility of reference knowledge resources such as Wikipedia. What makes detecting Wikipedia hoaxes hard is that they are often written according to the official style guidelines. In this work, we first provide a systematic analysis of similarities and discrepancies between legitimate and hoax Wikipedia articles, and introduce HOAXPEDIA, a collection of 311 hoax articles (from existing literature and official Wikipedia lists), together with semantically similar legitimate articles, which together form a binary text classification dataset aimed at fostering research in automated hoax detection. We report results after analyzing several language models, hoax-to-legit ratios, and the amount of text classifiers are exposed to (the full article vs. the article’s definition alone). Our results suggest that detecting deceitful content in Wikipedia based on content alone is hard but feasible. We complement our analysis with a study of the differences in edit-history distributions, and find that this feature yields better classification results than content.
+ 2024.wikinlp-1.11
+ borkakoty-espinosa-anke-2024-hoaxpedia
+
+
+ The Rise of AI-Generated Content in Wikipedia
+ CrestonBrooks
+ SamuelEggert
+ DenisPeskoff
+ 67-79
+ The rise of AI-generated content in popular information sources raises significant concerns about accountability, accuracy, and bias amplification. Beyond directly impacting consumers, the widespread presence of this content poses questions for the long-term viability of training language models on vast internet sweeps. We use GPTZero, a proprietary AI detector, and Binoculars, an open-source alternative, to establish lower bounds on the presence of AI-generated content in recently created Wikipedia pages. Both detectors reveal a marked increase in AI-generated content in recent pages compared to those from before the release of GPT-3.5. With thresholds calibrated to achieve a 1% false positive rate on pre-GPT-3.5 articles, detectors flag over 5% of newly created English Wikipedia articles as AI-generated, with lower percentages for German, French, and Italian articles. Flagged Wikipedia articles are typically of lower quality and are often self-promotional or partial towards a specific viewpoint on controversial topics.
+ 2024.wikinlp-1.12
+ brooks-etal-2024-rise
+
+
+ Embedded Topic Models Enhanced by Wikification
+ TakashiShibuyaUniversity of Tsukuba and Sony AI
+ TakehitoUtsuroUniversity of Tsukuba
+ 80-90
+ Topic modeling analyzes a collection of documents to learn meaningful patterns of words. However, previous topic models consider only the spelling of words and do not take into account their polysemy. In this study, we incorporate Wikipedia knowledge into a neural topic model to make it aware of named entities. We evaluate our method on two datasets: 1) news articles from the New York Times and 2) the AIDA-CoNLL dataset. Our experiments show that our method improves the generalizability of neural topic models. Moreover, we analyze frequent words in each topic and the temporal dependencies between topics to demonstrate that our entity-aware topic models can capture the time-series development of topics well.
+ 2024.wikinlp-1.13
+ shibuya-utsuro-2024-embedded
+
+
+ Wikimedia data for AI: a review of Wikimedia datasets for NLP tasks and AI-assisted editing
+ IsaacJohnsonWikimedia
+ Lucie-AiméeKaffeeHugging Face
+ MiriamRediWikimedia Foundation
+ 91-101
+ Wikimedia content is used extensively by the AI community and within the language modeling community in particular. In this paper, we provide a review of the different ways in which Wikimedia data is curated to use in NLP tasks across pre-training, post-training, and model evaluations. We point to opportunities for greater use of Wikimedia content but also identify ways in which the language modeling community could better center the needs of Wikimedia editors. In particular, we call for incorporating additional sources of Wikimedia data, a greater focus on benchmarks for LLMs that encode Wikimedia principles, and greater multilingualism in Wikimedia-derived datasets.
+ 2024.wikinlp-1.14
+ johnson-etal-2024-wikimedia
+
+
+ Blocks Architecture (BloArk): Efficient, Cost-Effective, and Incremental Dataset Architecture for Wikipedia Revision History
+ LingxiLiUniversity of Massachusetts at Amherst
+ ZonghaiYaoUniversity of Massachusetts at Amherst
+ SunjaeKwon
+ HongYuColumbia University
+ 102-111
+ Wikipedia (Wiki) is one of the most widely used and publicly available resources for natural language processing (NLP) applications. Wikipedia Revision History (WikiRevHist) shows the order in which edits were made to any Wiki page since its first modification. While the most up-to-date Wiki has been widely used as a training source, WikiRevHist can also be a valuable resource for NLP applications. However, few tools are available to process WikiRevHist without substantial computing resources, additional customization, and extra time spent adapting others’ work. Therefore, we report Blocks Architecture (BloArk), an efficiency-focused data processing architecture that reduces running time, computing resource requirements, and repeated work in processing the WikiRevHist dataset. BloArk consists of three parts in its infrastructure: blocks, segments, and warehouses. On top of that, we build the core data processing pipeline: builder and modifier. The BloArk builder transforms the original WikiRevHist dataset from XML syntax into JSON Lines (JSONL) format to improve concurrency and storage efficiency. The BloArk modifier applies incremental modifications to previously built warehouses, improving the utilization of existing databases and reducing the cost of reusing others’ work. As a result, BloArk can scale up easily both in processing Wikipedia Revision History and in incrementally modifying existing datasets for downstream NLP use cases. The source code, documentation, and example usages are publicly available online and open-sourced under the GPL-2.0 license.
+ 2024.wikinlp-1.16
+ li-etal-2024-blocks
+
+
+ ARMADA: Attribute-Based Multimodal Data Augmentation
+ XiaomengJin
+ JeonghwanKim
+ YuZhouStanford University
+ Kuan-HaoHuangTexas A&M University
+ Te-LinWuUniversity of California, Los Angeles
+ NanyunPengUniversity of California, Los Angeles
+ HengJiUniversity of Illinois, Urbana-Champaign
+ 112-125
+ In Multimodal Language Models (MLMs), the cost of manually annotating high-quality image-text pair data for fine-tuning and alignment is extremely high. While existing multimodal data augmentation frameworks propose ways to augment image-text pairs, they either suffer from semantic inconsistency between texts and images, or generate unrealistic images, creating a knowledge gap with real-world examples. To address these issues, we propose Attribute-based Multimodal Data Augmentation (ARMADA), a novel multimodal data augmentation method via knowledge-guided manipulation of visual attributes of the mentioned entities. Specifically, we extract entities and their visual attributes from the original text data, then search for alternative values for the visual attributes under the guidance of knowledge bases (KBs) and large language models (LLMs). We then utilize an image-editing model to edit the images with the extracted attributes. ARMADA is a novel multimodal data generation framework that: (i) extracts knowledge-grounded attributes from symbolic KBs for semantically consistent yet distinctive image-text pair generation, (ii) generates visually similar images of disparate categories using neighboring entities in the KB hierarchy, and (iii) uses the commonsense knowledge of LLMs to modulate auxiliary visual attributes such as backgrounds for more robust representation of original entities. Our empirical results over four downstream tasks demonstrate the efficacy of our framework in producing high-quality data and enhancing model performance. This also highlights the need to leverage external knowledge proxies for enhanced interpretability and real-world grounding.
+ 2024.wikinlp-1.17
+ jin-etal-2024-armada
+
+
+ Summarization-Based Document IDs for Generative Retrieval with Language Models
+ AlanLiYale University
+ DanielCheng
+ PhillipKeungUniversity of Washington
+ JungoKasaiToyota Technological Institute at Chicago
+ NoahSmithUniversity of Washington and Allen Institute for Artificial Intelligence
+ 126-135
+ Generative retrieval (Wang et al., 2022; Tay et al., 2022) is a popular approach for end-to-end document retrieval that directly generates document identifiers given an input query. We introduce summarization-based document IDs, in which each document’s ID is composed of an extractive summary or abstractive keyphrases generated by a language model, rather than an integer ID sequence or bags of n-grams as proposed in past work. We find that abstractive, content-based IDs (ACID) and an ID based on the first 30 tokens are very effective in direct comparisons with previous approaches to ID creation. We show that using ACID improves top-10 and top-20 recall by 15.6% and 14.4% (relative) respectively versus the cluster-based integer ID baseline on the MSMARCO 100k retrieval task, and 9.8% and 9.9% respectively on the Wikipedia-based NQ 100k retrieval task. Our results demonstrate the effectiveness of human-readable, natural-language IDs created through summarization for generative retrieval. We also observed that extractive IDs outperformed abstractive IDs on Wikipedia articles in NQ but not the snippets in MSMARCO, which suggests that document characteristics affect generative retrieval performance.
+ 2024.wikinlp-1.18
+ li-etal-2024-summarization
+
+
+
diff --git a/data/xml/2024.winlp.xml b/data/xml/2024.winlp.xml
new file mode 100644
index 0000000000..a13dc4a8c9
--- /dev/null
+++ b/data/xml/2024.winlp.xml
@@ -0,0 +1,19 @@
+
+
+
+
+ Proceedings of the Eighth Widening NLP Workshop
+ AlfredoGomez
+ Association for Computational Linguistics
+ Miami, Florida, United States
+ November
+ 2024
+ 2024.winlp-1
+ winlp
+
+
+ 2024.winlp-1.0
+ winlp-2024-1
+
+
+
diff --git a/data/xml/2024.wmt.xml b/data/xml/2024.wmt.xml
new file mode 100644
index 0000000000..7bbd040740
--- /dev/null
+++ b/data/xml/2024.wmt.xml
@@ -0,0 +1,1640 @@
+
+
+
+
+ Proceedings of the Ninth Conference on Machine Translation
+ BarryHaddow
+ TomKocmi
+ PhilippKoehn
+ ChristofMonz
+ Association for Computational Linguistics
+ Miami, Florida, USA
+ November
+ 2024
+ 2024.wmt-1
+ wmt
+
+
+ 2024.wmt-1.0
+ wmt-2024-1
+
+
+ Findings of the WMT24 General Machine Translation Shared Task: The LLM Era Is Here but MT Is Not Solved Yet
+ TomKocmiCohere
+ EleftheriosAvramidisGerman Research Center for Artificial Intelligence (DFKI)
+ RachelBawdenInria
+ OndřejBojarCharles University, MFF UFAL
+ AntonDvorkovichYandex
+ ChristianFedermannMicrosoft
+ MarkFishelUniversity of Tartu
+ MarkusFreitagGoogle Research
+ ThammeGowdaMicrosoft
+ RomanGrundkiewiczMicrosoft Research
+ BarryHaddowUniversity of Edinburgh
+ MarzenaKarpinskaUniversity of Massachusetts Amherst
+ PhilippKoehnJohns Hopkins University
+ BenjaminMarieThe Kaitchup
+ ChristofMonzUniversity of Amsterdam
+ KentonMurrayJohns Hopkins University
+ MasaakiNagataNTT Corporation
+ MartinPopelCharles University, Faculty of Mathematics and Physics, UFAL
+ MajaPopovićADAPT, Dublin City University
+ MariyaShmatovaDubformer
+ SteinthórSteingrímssonThe Árni Magnússon Institute for Icelandic Studies
+ VilémZouharETH Zurich, Charles University
+ 1-46
+ This overview paper presents the results of the General Machine Translation Task organised as part of the 2024 Conference on Machine Translation (WMT). In the general MT task, participants were asked to build machine translation systems for any of 11 language pairs, to be evaluated on test sets consisting of three to five different domains. In addition to participating systems, we collected translations from 8 different large language models (LLMs) and 4 online translation providers. We evaluate system outputs with professional human annotators using a new protocol called Error Span Annotations (ESA).
+ 2024.wmt-1.1
+ kocmi-etal-2024-findings
+
+
+ Are LLMs Breaking MT Metrics? Results of the WMT24 Metrics Shared Task
+ MarkusFreitagGoogle Research
+ NitikaMathurThe University of Melbourne
+ DanielDeutschGoogle
+ Chi-KiuLoNational Research Council of Canada
+ EleftheriosAvramidisGerman Research Center for Artificial Intelligence (DFKI)
+ RicardoReiUnbabel/INESC-ID
+ BrianThompsonAmazon
+ FredericBlainTilburg University
+ TomKocmiCohere
+ JiayiWangUniversity College London
+ David IfeoluwaAdelaniMcGill University / MILA
+ MariannaBuchicchioUnbabel
+ ChrysoulaZervaInstituto de Instituto de Telecomunicações, Instituto Superior Técnico, University of Lisbon
+ AlonLavieUnbabel/Carnegie Mellon University
+ 47-81
+ The WMT24 Metrics Shared Task evaluated the performance of automatic metrics for machine translation (MT), with a major focus on LLM-based translations that were generated as part of the WMT24 General MT Shared Task. As LLMs become increasingly popular in MT, it is crucial to determine whether existing evaluation metrics can accurately assess the output of these systems. To provide a robust benchmark for this evaluation, human assessments were collected using Multidimensional Quality Metrics (MQM), continuing the practice from recent years. Furthermore, building on the success of the previous year, a challenge set subtask was included, requiring participants to design contrastive test suites that specifically target a metric’s ability to identify and penalize different types of translation errors. Finally, the meta-evaluation procedure was refined to better reflect real-world usage of MT metrics, focusing on pairwise accuracy at both the system and segment levels. We present an extensive analysis of how well metrics perform on three language pairs: English to Spanish (Latin America), Japanese to Chinese, and English to German. The results strongly confirm those reported last year: fine-tuned neural metrics continue to perform well, even when used to evaluate LLM-based translation systems.
+ 2024.wmt-1.2
+ freitag-etal-2024-llms
+
+
+ Findings of the Quality Estimation Shared Task at WMT 2024: Are LLMs Closing the Gap in QE?
+ ChrysoulaZervaInstituto de Instituto de Telecomunicações, Instituto Superior Técnico, University of Lisbon
+ FredericBlainTilburg University
+ José G.C. De SouzaUnbabel
+ DipteshKanojiaUniversity of Surrey
+ SourabhDeoghareIIT Bombay
+ Nuno M.GuerreiroInstituto de Telecomunicacoes, University of Lisbon
+ GiuseppeAttanasioInstituto de Telecomunicacoes
+ RicardoReiUnbabel/INESC-ID
+ ConstantinOrasanUniversity of Surrey
+ MatteoNegriFondazione Bruno Kessler
+ MarcoTurchiZoom Video Communications
+ RajenChatterjeeApple Inc.
+ PushpakBhattacharyyaIndian Institute of Technology Bombay and Patna
+ MarkusFreitagGoogle Research
+ AndréMartinsUnbabel, Instituto de Telecomunicacoes
+ 82-109
+ We report the results of the WMT 2024 shared task on Quality Estimation, in which the challenge is to predict the quality of the output of neural machine translation systems at the word and sentence levels, without access to reference translations. In this edition, we expanded our scope to assess the potential for quality estimates to help in the correction of translated outputs, hence including an automated post-editing (APE) direction. We publish new test sets with human annotations that target two directions: providing new Multidimensional Quality Metrics (MQM) annotations for three multi-domain language pairs (English to German, Spanish and Hindi) and extending the annotations on Indic languages providing direct assessments and post edits for translation from English into Hindi, Gujarati, Tamil and Telugu. We also perform a detailed analysis of the behaviour of different models with respect to different phenomena including gender bias, idiomatic language, and numerical and entity perturbations. We received submissions based both on traditional, encoder-based approaches as well as large language model (LLM) based ones.
+ 2024.wmt-1.3
+ zerva-etal-2024-findings
+
+
+ Findings of the WMT 2024 Shared Task of the Open Language Data Initiative
+ JeanMaillardMeta AI
+ LaurieBurchellUniversity of Edinburgh
+ AntoniosAnastasopoulosGeorge Mason University
+ ChristianFedermannMicrosoft
+ PhilippKoehnJohns Hopkins University
+ SkylerWangFAIR, Meta
+ 110-117
+ We present the results of the WMT 2024 shared task of the Open Language Data Initiative. Participants were invited to contribute to the FLORES+ and MT Seed multilingual datasets, two foundational open resources that facilitate the organic expansion of language technology’s reach. We accepted ten submissions covering 16 languages, which extended the range of languages included in the datasets and improved the quality of existing data.
+ 2024.wmt-1.4
+ maillard-etal-2024-findings
+
+
+ Results of the WAT/WMT 2024 Shared Task on Patent Translation
+ ShoheiHigashiyamaNational Institute of Information and Communications Technology
+ 118-123
+ This paper presents the results of the patent translation shared task at the 11th Workshop on Asian Translation and 9th Conference on Machine Translation. Two teams participated in this task, and their submitted translation results for one or more of the six language directions were automatically and manually evaluated. The evaluation results demonstrate the strong performance of large language model-based systems from both participants.
+ 2024.wmt-1.5
+ higashiyama-2024-results
+
+
+ Findings of the WMT 2024 Biomedical Translation Shared Task: Test Sets on Abstract Level
+ MarianaNevesGerman Federal Institute for Risk Assessment
+ CristianGrozeaFraunhofer Institute FOKUS
+ PhilippeThomasGerman Research Center for Artificial Intelligence (DFKI)
+ RolandRollerGerman Research Center for Artificial Intelligence (DFKI)
+ RachelBawdenInria
+ AurélieNévéolUniversité Paris-Saclay, CNRS, LISN
+ SteffenCastleGerman Research Center for Artificial Intelligence (DFKI)
+ VanessaBonatoDept. of Linguistic and Literary Studies University of Padua
+ Giorgio MariaDi NunzioDept. of Linguistic and Literary Studies University of Padua
+ FedericaVezzaniDept. of Linguistic and Literary Studies University of Padua
+ MaikaVicente NavarroLeica Biosystems
+ LanaYeganovaNCBI/NLM/NIH
+ AntonioJimeno YepesRMIT University
+ 124-138
+ We present the results of the ninth edition of the Biomedical Translation Task at WMT’24. We released test sets for six language pairs, namely, French, German, Italian, Portuguese, Russian, and Spanish, from and into English. Each test set consists of 50 abstracts from PubMed. Differently from previous years, we did not split abstracts into sentences. We received submissions from five teams, covering almost all language directions. We used a baseline/comparison system based on Llama 3.1 and share the source code at https://github.com/cgrozea/wmt24biomed-ref.
+ 2024.wmt-1.6
+ neves-etal-2024-findings
+
+
+ MSLC24 Submissions to the General Machine Translation Task
+ SamuelLarkinNational Research Council Canada
+ Chi-KiuLoNational Research Council of Canada
+ RebeccaKnowlesNational Research Council Canada
+ 139-146
+ The MSLC (Metric Score Landscape Challenge) submissions for English-German, English-Spanish, and Japanese-Chinese are constrained systems built using Transformer models for the purpose of better evaluating metric performance in the WMT24 Metrics Task. They are intended to be representative of the performance of systems that can be built relatively simply using constrained data and with minimal modifications to the translation training pipeline.
+ 2024.wmt-1.7
+ larkin-etal-2024-mslc24
+
+
+ IOL Research Machine Translation Systems for WMT24 General Machine Translation Shared Task
+ WenboZhangTransn
+ 147-154
+ This paper illustrates the submission system of the IOL Research team for the WMT24 General Machine Translation shared task. We submitted translations for all translation directions in the general machine translation task. According to the official track categorization, our system qualifies as an open system due to the utilization of open-source resources in developing our machine translation model. With the growing prevalence of large language models (LLMs) as a conventional approach for managing diverse NLP tasks, we have developed our machine translation system by leveraging the capabilities of LLMs. Overall, we first performed continued pretraining using the open-source LLMs with tens of billions of parameters to enhance the model’s multilingual capabilities. Subsequently, we employed open-source Large Language Models, equipped with hundreds of billions of parameters, to generate synthetic data. This data was then blended with a modest quantity of additional open-source data for precise supervised fine-tuning. In the final stage, we also used ensemble learning to improve translation quality. Based on the official automated evaluation metrics, our system excelled by securing the top position in 8 out of the total 11 translation directions, spanning both open and constrained system categories.
+ 2024.wmt-1.8
+ zhang-2024-iol
+
+
+ Choose the Final Translation from NMT and LLM Hypotheses Using MBR Decoding: HW-TSC’s Submission to the WMT24 General MT Shared Task
+ ZhanglinWuHuawei Technologies Co., Ltd.
+ DaimengWeiHuawei Technologies Co., Ltd.
+ ZongyaoLiHuawei Translation Services Center
+ HengchaoShangHuawei Technologies Co., Ltd.
+ JiaxinGuoHuawei Translation Services Center
+ ShaojunLiHuawei Technologies Co., Ltd.
+ ZhiqiangRaoHuawei Translation Service Center, Beijing, China
+ YuanchangLuoHuawei Translation Services Center
+ NingXieHuaweiTechnologiesCo.,Ltd.
+ HaoYangHuawei Co. Ltd
+ 155-164
+ This paper presents the submission of Huawei Translate Services Center (HW-TSC) to the WMT24 general machine translation (MT) shared task, where we participate in the English to Chinese (en→zh) language pair. Similar to previous years’ work, we use training strategies such as regularized dropout, bidirectional training, data diversification, forward translation, back translation, alternated training, curriculum learning, and transductive ensemble learning to train the neural machine translation (NMT) model based on the deep Transformer-big architecture. The difference is that we also use continued pre-training, supervised fine-tuning, and contrastive preference optimization to train the large language model (LLM) based MT model. By using minimum Bayes risk (MBR) decoding to select the final translation from multiple hypotheses for the NMT and LLM-based MT models, our submission receives competitive results in the final evaluation.
+ 2024.wmt-1.9
+ wu-etal-2024-choose
+
+
+ CycleGN: A Cycle Consistent Approach for Neural Machine Translation
+ SörenDreanoDublin City University
+ DerekMolloyDublin City University
+ NoelMurphyDublin City University
+ 165-175
+ CycleGN is a fully self-supervised Neural Machine Translation framework relying on the Transformer architecture that does not require parallel data. Its approach is similar to a Discriminator-less CycleGAN, hence the “non-adversarial” name, specifically tailored for non-parallel text datasets. The foundational concept of our research posits that, in an ideal scenario, retro-translations of generated translations should revert to the original source sentences. Consequently, a pair of models can be trained using only a Cycle Consistency Loss (CCL), with one model translating in one direction and the second model in the opposite direction. In the context of this research, two sub-categories of non-parallel datasets are introduced. A “permuted” dataset is defined as a parallel dataset wherein the sentences of one language have been systematically rearranged. This results in a non-parallel corpus where it is guaranteed that each sentence has a corresponding translation located at an unspecified index within the dataset. A “non-intersecting” dataset is a non-parallel dataset for which it is guaranteed that no sentence has an exact translation. Masked Language Modeling (MLM) is a pre-training strategy implemented in BERT, where a specified proportion of the input tokens are substituted with a special mask token. The objective of the neural network under this paradigm is to accurately reconstruct the original sentence from this degraded input. In inference mode, Transformers are able to generate sentences without labels. Thus, the first step is to generate pseudo-labels in inference, which are then used as labels during training. However, the models consistently converge towards a trivial solution in which the input, the generated pseudo-labels and the output are identical, achieving an optimal outcome on the CCL function, registering a value of zero. CycleGN demonstrates how MLM pre-training can be leveraged to move away from this trivial path and perform actual text translation. As a contribution to the WMT24 challenge, this study explores the efficacy of the CycleGN architectural framework in learning translation tasks across eleven language pairs under the permuted condition and four under the non-intersecting condition. Moreover, two additional language pairs from the previous WMT edition were trained, and the evaluations demonstrate the robust adaptability of CycleGN in learning translation tasks.
+ 2024.wmt-1.10
+ dreano-etal-2024-cyclegn
+
+
+ UvA-MT’s Participation in the WMT24 General Translation Shared Task
+ ShaomuTanUniversity of Amsterdam
+ DavidStapUniversity of Amsterdam
+ SethAycockUniversity of Amsterdam
+ ChristofMonzUniversity of Amsterdam
+ DiWuUniversity of Amsterdam
+ 176-184
+ Fine-tuning Large Language Models (FT-LLMs) with parallel data has emerged as a promising paradigm in recent machine translation research. In this paper, we explore the effectiveness of FT-LLMs and compare them to traditional encoder-decoder Neural Machine Translation (NMT) systems under the WMT24 general MT shared task for English to Chinese direction. We implement several techniques, including Quality Estimation (QE) data filtering, supervised fine-tuning, and post-editing that integrate NMT systems with LLMs. We demonstrate that fine-tuning LLaMA2 on a high-quality but relatively small bitext dataset (100K) yields COMET results comparable to much smaller encoder-decoder NMT systems trained on over 22 million bitexts. However, this approach largely underperforms on surface-level metrics like BLEU and ChrF. We further control the data quality using the COMET-based quality estimation method. Our experiments show that 1) filtering low COMET scores largely improves encoder-decoder systems, but 2) no clear gains are observed for LLMs when further refining the fine-tuning set. Finally, we show that combining NMT systems with LLMs via post-editing generally yields the best performance for the WMT24 official test set.
+ 2024.wmt-1.11
+ tan-etal-2024-uva
+
+
+ Tower v2: Unbabel-IST 2024 Submission for the General MT Shared Task
+ RicardoReiUnbabel/INESC-ID
+ JosePombalUnbabel
+ Nuno M.GuerreiroInstituto de Telecomunicacoes, University of Lisbon
+ JoãoAlvesUnbabel
+ Pedro HenriqueMartinsUnbabel
+ PatrickFernandesCarnegie Mellon University, Instituto de Telecomunicações
+ HelenaWuUniversity of Lisbon
+ TaniaVazUnbabel
+ DuarteAlvesInstituto Superior Técnico / IT
+ AminFarajianUnbabel
+ SwetaAgrawalInstituto de Telecomunicações
+ AntonioFarinhasInstituto de Telecomunicacoes, IST
+ José G.C. De SouzaUnbabel
+ AndréMartinsUnbabel, Instituto de Telecomunicacoes
+ 185-204
+ In this work, we present Tower v2, an improved iteration of the state-of-the-art open-weight Tower models, and the backbone of our submission to the WMT24 General Translation shared task. Tower v2 introduces key improvements including expanded language coverage, enhanced data quality, and increased model capacity up to 70B parameters. Our final submission combines these advancements with quality-aware decoding strategies, selecting translations based on multiple translation quality signals. The resulting system demonstrates significant improvement over previous versions, outperforming closed commercial systems like GPT-4o, Claude 3.5, and DeepL even at a smaller 7B scale.
+ 2024.wmt-1.12
+ rei-etal-2024-tower
+
+
+ TSU HITS’s Submissions to the WMT 2024 General Machine Translation Shared Task
+ VladimirMynkaHigher IT School of Tomsk State University
+ NikolayMikhaylovskiyNTR Labs / Higher IT School of Tomsk State University
+ 205-209
+ This paper describes the TSU HITS team’s submission system for the WMT’24 general translation task. We focused on exploring the capabilities of discrete diffusion models for the English-to-{Russian, German, Czech, Spanish} translation tasks in the constrained track. Our submission system consists of a set of discrete diffusion models for each language pair. The main advance is using a separate length regression model to determine the length of the output sequence more precisely.
+ 2024.wmt-1.13
+ mynka-mikhaylovskiy-2024-tsu
+
+
+ Document-level Translation with LLM Reranking: Team-J at WMT 2024 General Translation Task
+ KeitoKudoTohoku University / RIKEN Center for AIP
+ HiroyukiDeguchiNara Institute of Science and Technology
+ MakotoMorishitaFuture Corporation
+ RyoFujiiFuture Corporation
+ TakumiItoLangsmith Inc. / Tohoku University
+ ShintaroOzakiNara Institute of Science and Technology
+ KokiNatsumiNAIST
+ KaiSatoTohoku university
+ KazukiYanoTohoku University
+ RyosukeTakahashiTohoku University
+ SubaruKimuraTohoku University
+ TomomasaHaraTohoku University
+ YusukeSakaiNara Institute of Science and Technology
+ JunSuzukiTohoku University / RIKEN Center for AIP
+ 210-226
+ We participated in the constrained track for English-Japanese and Japanese-Chinese translations at the WMT 2024 General Machine Translation Task. Our approach was to generate a large number of sentence-level translation candidates and select the most probable translation using minimum Bayes risk (MBR) decoding and document-level large language model (LLM) re-ranking. We first generated hundreds of translation candidates from multiple translation models and retained the top 30 candidates using MBR decoding. In addition, we continually pre-trained LLMs on the target language corpora to leverage document-level information. We utilized LLMs to select the most probable sentence sequentially in context from the beginning of the document.
+ 2024.wmt-1.14
+ kudo-etal-2024-document
+
+
+ DLUT and GTCOM’s Neural Machine Translation Systems for WMT24
+ HaoZongGlobal Tone Communication Technology Co., Ltd
+ ChaoBeiGlobal Tone Communication Technology Co.,Ltd.
+ HuanLiuDalian University of Technology
+ ConghuYuanGlobal Tone Communication Technology Co., Ltd
+ WentaoChenGlobal Tone Communication Technology Co., Ltd
+ DegenHuangDalian University of Technology
+ 227-231
+ This paper presents the submission from Global Tone Communication Co., Ltd. and Dalian University of Technology for the WMT24 shared general Machine Translation (MT) task at the Conference on Empirical Methods in Natural Language Processing (EMNLP). Our participation encompasses two language pairs: English to Japanese and Japanese to Chinese. The systems are developed without particular constraints or requirements, facilitating extensive research in machine translation. We emphasize back-translation, utilize multilingual translation models, and apply fine-tuning strategies to improve performance. Additionally, we integrate both human-generated and machine-generated data to fine-tune our models, leading to enhanced translation accuracy. The automatic evaluation results indicate that our system ranks first in terms of BLEU score for the Japanese to Chinese translation.
+ 2024.wmt-1.15
+ zong-etal-2024-dlut
+
+
+ CUNI at WMT24 General Translation Task: LLMs, (Q)LoRA, CPO and Model Merging
+ MiroslavHrabalCharles University
+ JosefJonCharles University
+ MartinPopelCharles University, Faculty of Mathematics and Physics, UFAL
+ NamLuuCharles University
+ DanilSeminMFF UK
+ OndřejBojarCharles University, MFF UFAL
+ 232-246
+ This paper presents the contributions of Charles University teams to the WMT24 General Translation task (English to Czech, German and Russian, and Czech to Ukrainian), and the WMT24 Translation into Low-Resource Languages of Spain task. Our most elaborate submission, CUNI-MH for en2cs, is the result of fine-tuning Mistral 7B v0.1 for translation using a three-stage process: supervised fine-tuning using QLoRA, Contrastive Preference Optimization, and merging of model checkpoints. We also describe the CUNI-GA, CUNI-Transformer and CUNI-DocTransformer submissions, which are based on our systems from the previous year. Our en2ru system, CUNI-DS, uses a similar first stage as CUNI-MH (QLoRA for en2cs) and follows with transfer to en2ru. For en2de (CUNI-NL), we experimented with an LLM-based speech translation system, to translate without the speech input. For the Translation into Low-Resource Languages of Spain task, we performed QLoRA fine-tuning of a large LLM on a small amount of synthetic (backtranslated) data.
+ 2024.wmt-1.16
+ hrabal-etal-2024-cuni
+
+
+ From General LLM to Translation: How We Dramatically Improve Translation Quality Using Human Evaluation Data for LLM Finetuning
+ DenisElshinYandex LLC
+ NikolayKarpachevYandex LLC
+ BorisGruzdevYandex LLC
+ IlyaGolovanovYandex LLC
+ GeorgyIvanovYandex LLC
+ AlexanderAntonovYandex LLC
+ NickolaySkachkovYandex LLC
+ EkaterinaLatypovaYandex LLC
+ VladimirLaynerYandex LLC
+ EkaterinaEnikeevaYandex LLC
+ DmitryPopovYandex LLC
+ AntonChekashevYandex LLC
+ VladislavNegodinYandex LLC
+ VeraFrantsuzovaYandex LLC
+ AlexanderChernyshevYandex LLC
+ KirillDenisovYandex LLC
+ 247-252
+ In this paper, we present the methodology employed by the NLP team at Yandex LLC for participating in the WMT 2024 General MT Translation track, focusing on English-to-Russian translation. Our approach involves training a YandexGPT LLM-based model for translation tasks using a multi-stage process to ensure high-quality and contextually accurate translations. Initially, we utilize a pre-trained model, trained on a large corpus of high-quality monolingual texts in various languages, crawled from a range of open sources, not limited to English and Russian. This extensive pre-training allows the model to capture a broad spectrum of linguistic nuances and structures. Following this, the model is fine-tuned on a substantial parallel corpus of high-quality texts collected from diverse open sources, including websites, books, and subtitles. These texts are meticulously aligned at both the sentence and paragraph levels to enhance the model’s contextual understanding and translation accuracy. In the subsequent stage, we employ p-tuning on an internal high-quality corpus of paragraph-aligned data. This step ensures that the model is finely adjusted to handle complex paragraph-level translations with greater fluency and coherence. Next, we apply the Contrastive Pretraining Objective (CPO) method, as described in the CPO paper, using a human-annotated translation corpus. This stage focuses on refining the model’s performance based on metrics evaluated at the paragraph level, emphasizing both the accuracy of the translation and the fluency of the resulting texts. The CPO method helps the model to better distinguish between subtle contextual differences, thereby improving translation quality. In the final stage, we address the importance of preserving the content structure in translations, which is crucial for the General MT test set. To achieve this, we introduce a synthetic corpus based on web pages and video subtitles, and use it during HE markup fine-tune training. This encourages the model to maintain the original text’s tag structure, ensuring that the translated output retains the structural integrity of the source web pages and provides a seamless user experience. Our multi-stage approach, combining extensive pre-training, targeted fine-tuning, advanced p-tuning, and structure-preserving techniques, ensures that our model delivers high-quality, fluent, and structurally consistent translations suitable for practical applications and competitive benchmarks.
+ 2024.wmt-1.17
+ elshin-etal-2024-general
+
+
+ Cogs in a Machine, Doing What They’re Meant to Do – the AMI Submission to the WMT24 General Translation Task
+ AtliJasonarsonThe Árni Magnússon Institute
+ HinrikHafsteinssonUniversity of Iceland
+ BjarkiÁrmannssonThe Árni Magnússon Institute
+ SteinthórSteingrímssonThe Árni Magnússon Institute for Icelandic Studies
+ 253-262
+ This paper presents the submission of the Arni Magnusson Institute’s team to the WMT24 General translation task. We work on the English→Icelandic translation direction. Our system comprises four translation models and a grammar correction model. For training our systems, we carefully curate our datasets, aggressively filtering out sentence pairs that may detrimentally affect the quality of our systems’ output. Some of our data are collected from human translations and some are synthetically generated. Part of the synthetic data is generated using an LLM, and we find that it increases the translation capability of our system significantly.
+ 2024.wmt-1.18
+ jasonarson-etal-2024-cogs
+
+
+ IKUN for WMT24 General MT Task: LLMs Are Here for Multilingual Machine Translation
+ BaohaoLiaoUniversity of Amsterdam
+ ChristianHeroldeBay Inc.
+ ShahramKhadivieBay
+ ChristofMonzUniversity of Amsterdam
+ 263-269
+ This paper introduces two multilingual systems, IKUN and IKUN-C, developed for the general machine translation task in WMT24. IKUN and IKUN-C represent an open system and a constrained system, respectively, built on Llama-3-8b and Mistral-7B-v0.3. Both systems are designed to handle all 11 language directions using a single model. According to automatic evaluation metrics, IKUN-C achieved 6 first-place and 3 second-place finishes among all constrained systems, while IKUN secured 1 first-place and 2 second-place finishes across both open and constrained systems. These encouraging results suggest that large language models (LLMs) are nearing the level of proficiency required for effective multilingual machine translation. The systems are based on a two-stage approach: first, continuous pre-training on monolingual data in 10 languages, followed by fine-tuning on high-quality parallel data for 11 language directions. The primary difference between IKUN and IKUN-C lies in their monolingual pre-training strategy. IKUN-C is pre-trained using constrained monolingual data, whereas IKUN leverages monolingual data from the OSCAR dataset. In the second phase, both systems are fine-tuned on parallel data sourced from NTREX, Flores, and WMT16-23 for all 11 language pairs.
+ 2024.wmt-1.19
+ liao-etal-2024-ikun
+
+
+ NTTSU at WMT2024 General Translation Task
+ MinatoKondoUniversity of Tsukuba
+ RyoFukudaNTT Communication Science Laboratories
+ XiaotianWangUniversity of Tsukuba
+ KatsukiChousaNTT
+ MasatoNishimuraUniversity of Tsukuba
+ KoseiBumaUniversity of Tsukuba
+ TakatomoKanoNTT
+ TakehitoUtsuroUniversity of Tsukuba
+ 270-279
+ The NTTSU team’s submission leverages several large language models developed through a training procedure that includes continual pre-training and supervised fine-tuning. For paragraph-level translation, we generated synthetic paragraph-aligned data and utilized this data for training. In the task of translating Japanese to Chinese, we particularly focused on speech domain translation. Specifically, we built Whisper models for Japanese automatic speech recognition (ASR), using the YODAS dataset for Whisper training. Since this data contained many noisy data pairs, we combined the Whisper outputs using ROVER to polish the transcriptions. Furthermore, to enhance the robustness of the translation model against errors in the transcriptions, we performed data augmentation by forward translation from audio, using both the ASR and base translation models. To select the best translation from multiple hypotheses of the models, we applied Minimum Bayes Risk decoding + reranking, incorporating scores such as COMET-QE, COMET, and cosine similarity by LaBSE.
+ 2024.wmt-1.20
+ kondo-etal-2024-nttsu
+
+
+ SCIR-MT’s Submission for WMT24 General Machine Translation Task
+ BaohangLiHarbin Institute of Technology
+ ZekaiYeHarbin Institute of Technology
+ YichongHuangHarbin Institute of Technology
+ XiaochengFengHarbin Institute of Technology,SCIR lab
+ BingQinHarbin Institute of Technology
+ 280-285
+ This paper introduces the submission of the SCIR research center of Harbin Institute of Technology to the constrained track of the WMT24 machine translation evaluation task for English to Czech. Our approach involved a rigorous process of cleaning and deduplicating both monolingual and bilingual data, followed by a three-stage model training recipe. During the testing phase, we used the beam search decoding method to generate a large number of candidate translations. Furthermore, we employed COMET-MBR decoding to identify optimal translations.
+ 2024.wmt-1.21
+ li-etal-2024-scir
+
+
+ AIST AIRC Systems for the WMT 2024 Shared Tasks
+ MatissRiktersAIST
+ MakotoMiwaToyota Technological Institute
+ 286-291
+ At WMT 2024 AIST AIRC participated in the General Machine Translation shared task and the Biomedical Translation task. We trained constrained track models for translation between English, German, and Japanese. Before training the final models, we first filtered the parallel data, then performed iterative back-translation as well as parallel data distillation. We experimented with training baseline Transformer models, Mega models, and fine-tuning open-source T5 and Gemma model checkpoints using the filtered parallel data. Our primary submissions contain translations from ensembles of two Mega model checkpoints and our contrastive submissions are generated by our fine-tuned T5 model checkpoints.
+ 2024.wmt-1.22
+ rikters-miwa-2024-aist
+
+
+ Occiglot at WMT24: European Open-source Large Language Models Evaluated on Translation
+ EleftheriosAvramidisGerman Research Center for Artificial Intelligence (DFKI)
+ AnnikaGrützner-ZahnGerman Research Center for Artificial Intelligence (DFKI)
+ ManuelBrackDFKI, TU Darmstadt
+ PatrickSchramowskiTU Darmstadt
+ PedroOrtiz SuarezCommon Crawl Foundation
+ MalteOstendorffGerman Research Center for Artificial Intelligence
+ FabioBarthDFKI
+ ShushenManakhimovaGerman Research Center for Artificial Intelligence (DFKI)
+ VivienMacketanzGerman Research Center for Artificial Intelligence (DFKI)
+ GeorgRehmDFKI
+ KristianKerstingTU Darmstadt
+ 292-298
+ This document describes the submission of the very first version of the Occiglot open-source large language model to the General MT Shared Task of the Ninth Conference on Machine Translation (WMT24). Occiglot is an open-source, community-based LLM based on Mistral-7B, which went through language-specific continual pre-training and subsequent instruction tuning, including instructions relevant to machine translation. We examine the automatic metric scores for translating the WMT24 test set and provide a detailed linguistically-motivated analysis. Although Occiglot performs worse than many of the other system submissions, we observe that it performs better than Mistral-7B, on which it is based, indicating the positive effect of the language-specific continual pre-training and instruction tuning. We see the submission of this very early version of the model as a motivation to unite community forces and pursue future LLM research on the translation task.
+ 2024.wmt-1.23
+ avramidis-etal-2024-occiglot
+
+
+ CoST of breaking the LLMs
+ AnanyaMukherjeeInternational Institute of Information Technology Hyderabad
+ SaumitraYadavInternational Institute of Information Technology, Hyderabad
+ ManishShrivastavaInternational Institute of Information Technology Hyderabad
+ 299-306
+ This paper presents an evaluation of 16 machine translation systems submitted to the Shared Task of the Ninth Conference on Machine Translation (WMT24) for the English-Hindi (en-hi) language pair using our Complex Structures Test (CoST) suite. Aligning with this year’s test suite sub-task theme, “Help us break LLMs”, we curated a comprehensive test suite encompassing diverse datasets across various categories, including autobiography, poetry, legal, conversation, play, narration, technical, and mixed genres. Our evaluation reveals that all the systems struggle significantly with the archaic style of text, such as legal and technical writing, and with text with a creative twist, such as the conversation and poetry datasets, highlighting their weaknesses in handling complex linguistic structures and stylistic nuances inherent in these text types. Our evaluation identifies the strengths and limitations of the submitted models, pointing to specific areas where further research and development are needed to enhance their performance. Our test suite is available at https://github.com/AnanyaCoder/CoST-WMT-24-Test-Suite-Task.
+ 2024.wmt-1.24
+ mukherjee-etal-2024-cost
+
+
+ WMT24 Test Suite: Gender Resolution in Speaker-Listener Dialogue Roles
+ HillaryDawkinsNational Research Council Canada
+ IsarNejadgholiNational Research Council Canada
+ Chi-KiuLoNational Research Council of Canada
+ 307-326
+ We assess the difficulty of gender resolution in literary-style dialogue settings and the influence of gender stereotypes. Instances of the test suite contain spoken dialogue interleaved with external meta-context about the characters and the manner of speaking. We find that character and manner stereotypes outside of the dialogue significantly impact the gender agreement of referents within the dialogue.
+ 2024.wmt-1.25
+ dawkins-etal-2024-wmt24
+
+
+ The GenderQueer Test Suite
+ Steinunn RutFriidhriksdóttirUniversity of Iceland
+ 327-340
+ This paper introduces the GenderQueer Test Suite, an evaluation set for assessing machine translation (MT) systems’ capabilities in handling gender-diverse and queer-inclusive content, focusing on English to Icelandic translation. The suite evaluates MT systems on various aspects of gender-inclusive translation, including pronoun and adjective agreement, LGBTQIA+ terminology, and the impact of explicit gender specifications. The 17 MT systems submitted to the WMT24 English-Icelandic track were evaluated. Key findings reveal significant performance differences between large language model-based systems (LLMs) and lightweight models in handling context for gender agreement. Challenges in translating the singular “they” were widespread, while most systems performed relatively well in translating LGBTQIA+ terminology. Accuracy in adjective gender agreement is quite low, with some models struggling particularly with the feminine form. This evaluation set contributes to the ongoing discussion about inclusive language in MT and natural language processing. By providing a tool for assessing MT systems’ handling of gender-diverse content, it aims to enhance the inclusivity of language technology. The methodology and evaluation scripts are made available for adaptation to other languages, promoting further research in this area.
+ 2024.wmt-1.26
+ friidhriksdottir-2024-genderqueer
+
+
+ Domain Dynamics: Evaluating Large Language Models in English-Hindi Translation
+ SohamBhattacharjeeIndian Institute of Technology Patna
+ BabanGainIndian Institute of Technology, Patna
+ AsifEkbalIIT Patna
+ 341-354
+ Large Language Models (LLMs) have demonstrated impressive capabilities in machine translation, leveraging extensive pre-training on vast amounts of data. However, this generalist training often overlooks domain-specific nuances, leading to potential difficulties when translating specialized texts. In this study, we present a multi-domain test suite, collated from previously published datasets, designed to challenge and evaluate the translation abilities of LLMs. The test suite encompasses diverse domains such as judicial, education, literature (specifically religious texts), and noisy user-generated content from online product reviews and forums like Reddit. Each domain consists of approximately 250-300 sentences, carefully curated and randomized in the final compilation. This English-to-Hindi dataset aims to evaluate and expose the limitations of LLM-based translation systems, offering valuable insights into areas requiring further research and development. We have submitted the dataset to the WMT24 “Break the LLM” subtask. In this paper, we present our findings. We have made the code and the dataset publicly available at https://github.com/sohamb37/wmt24-test-suite.
+ 2024.wmt-1.27
+ bhattacharjee-etal-2024-domain
+
+
+ Investigating the Linguistic Performance of Large Language Models in Machine Translation
+ ShushenManakhimovaGerman Research Center for Artificial Intelligence (DFKI)
+ VivienMacketanzGerman Research Center for Artificial Intelligence (DFKI)
+ EleftheriosAvramidisGerman Research Center for Artificial Intelligence (DFKI)
+ EkaterinaLapshinova-KoltunskiUniversity of Hildesheim
+ SergeiBagdasarovSaarland University
+ SebastianMöllerQuality and Usability Lab, TU Berlin
+ 355-371
+ This paper summarizes the results of our test suite evaluation on 39 machine translation systems submitted to the Shared Task of the Ninth Conference of Machine Translation (WMT24). It offers a fine-grained linguistic evaluation of machine translation outputs for English–German and English–Russian, resulting from significant manual linguistic effort. Based on our results, LLMs are inferior to NMT in English–German, both in overall scores and when translating specific linguistic phenomena, such as punctuation, complex future verb tenses, and stripping. LLMs show quite competitive performance in English-Russian, although top-performing systems might struggle with some cases of named entities and terminology, function words, mediopassive voice, and semantic roles. Additionally, some LLMs generate very verbose or empty outputs, posing challenges to the evaluation process.
+ 2024.wmt-1.28
+ manakhimova-etal-2024-investigating
+
+
+ IsoChronoMeter: A Simple and Effective Isochronic Translation Evaluation Metric
+ NikolaiRozanovImperial College London
+ VikentiyPankovRask AI
+ DmitriiMukhutdinovRask AI
+ DimaVypirailenkoRask Ai
+ 372-379
+ Machine translation (MT) has come a long way and is readily employed in production systems to serve millions of users daily. With the recent advances in generative AI, a new form of translation is becoming possible - video dubbing. This work motivates the importance of isochronic translation, especially in the context of automatic dubbing, and introduces ‘IsoChronoMeter’ (ICM). ICM is a simple yet effective metric to measure the isochrony of translations in a scalable and resource-efficient way, without the need for gold data, based on state-of-the-art text-to-speech (TTS) duration predictors. Using ICM, we demonstrate the shortcomings of state-of-the-art translation systems and show the need for new methods. We release the code at https://github.com/braskai/isochronometer.
+ 2024.wmt-1.29
+ rozanov-etal-2024-isochronometer
+
+
+ A Test Suite of Prompt Injection Attacks for LLM-based Machine Translation
+ Antonio ValerioMiceli BaroneThe University of Edinburgh
+ ZhifanSunTechnische Universität Darmstadt
+ 380-450
+ LLM-based NLP systems typically work by embedding their input data into prompt templates which contain instructions and/or in-context examples, creating queries which are submitted to an LLM, and then parsing the LLM response in order to generate the system outputs. Prompt Injection Attacks (PIAs) are a type of subversion of these systems where a malicious user crafts special inputs which interfere with the prompt templates, causing the LLM to respond in ways unintended by the system designer. Recently, Sun and Miceli-Barone (2024) proposed a class of PIAs against LLM-based machine translation. Specifically, the task is to translate questions from the TruthfulQA test suite, where an adversarial prompt is prepended to the questions, instructing the system to ignore the translation instruction and answer the questions instead. In this test suite, we extend this approach to all the language pairs of the WMT 2024 General Machine Translation task. Moreover, we include additional attack formats beyond the one originally studied.
+ 2024.wmt-1.30
+ miceli-barone-sun-2024-test
+
+
+ Killing Two Flies with One Stone: An Attempt to Break LLMs Using English-Icelandic Idioms and Proper Names
+ BjarkiÁrmannssonThe Árni Magnússon Institute for Icelandic Studies
+ HinrikHafsteinssonUniversity of Iceland
+ AtliJasonarsonThe Árni Magnússon Institute
+ SteinthorSteingrimssonThe Arni Magnusson Institute for Icelandic Studies
+ 451-458
+ The submission of the Árni Magnússon Institute’s team to the WMT24 test suite subtask focuses on idiomatic expressions and proper names for the English→Icelandic translation direction. Intuitively and empirically, idioms and proper names are known to be a significant challenge for neural translation models. We create two different test suites. The first evaluates the competency of MT systems in translating common English idiomatic expressions, as well as testing whether systems can distinguish between those expressions and the same phrases when used in a literal context. The second test suite consists of place names that should be translated into their Icelandic exonyms (and correctly inflected) and pairs of Icelandic names that share a surface form between the male and female variants, so that incorrect translations impact meaning as well as readability. The scores reported are relatively low, especially for idiomatic expressions and place names, and indicate considerable room for improvement.
+ 2024.wmt-1.31
+ armannsson-etal-2024-killing
+
+
+ MetaMetrics-MT: Tuning Meta-Metrics for Machine Translation via Human Preference Calibration
+ DavidAnugrahaUniversity of Toronto
+ GarryKuwantoBoston University
+ LuckySusantoUniversitas Indonesia
+ Derry TantiWijayaBoston University
+ GentaWinataCapital One AI Foundations
+ 459-469
+ We present MetaMetrics-MT, an innovative metric designed to evaluate machine translation (MT) tasks by aligning closely with human preferences through Bayesian optimization with Gaussian Processes. MetaMetrics-MT enhances existing MT metrics by optimizing their correlation with human judgments. Our experiments on the WMT24 metric shared task dataset demonstrate that MetaMetrics-MT outperforms all existing baselines, setting a new benchmark for state-of-the-art performance in the reference-based setting. Furthermore, it achieves comparable results to leading metrics in the reference-free setting, offering greater efficiency.
+ 2024.wmt-1.32
+ anugraha-etal-2024-metametrics
+
+
+ chrF-S: Semantics Is All You Need
+ AnanyaMukherjeeInternational Institute of Information Technology Hyderabad
+ ManishShrivastavaInternational Institute of Information Technology Hyderabad
+ 470-474
+ Machine translation (MT) evaluation metrics like BLEU and chrF++ are widely used reference-based metrics that do not require training and are language-independent. However, these metrics primarily focus on n-gram matching and often overlook semantic depth and contextual understanding. To address this gap, we introduce chrF-S (Semantic chrF++), an enhanced metric that integrates sentence embeddings to evaluate translation quality more comprehensively. By combining traditional character and word n-gram analysis with semantic information derived from embeddings, chrF-S captures both syntactic accuracy and sentence-level semantics. This paper presents our contributions to the WMT24 shared metrics task, showcasing our participation and the development of chrF-S. We also demonstrate that, according to preliminary results on the leaderboard, our metric performs on par with other supervised and LLM-based metrics. By merging semantic insights with n-gram precision, chrF-S offers a significant enhancement in the assessment of machine-generated translations, advancing the field of MT evaluation. Our code and data will be made available at https://github.com/AnanyaCoder/chrF-S.
+ 2024.wmt-1.33
+ mukherjee-shrivastava-2024-chrf
+
+
+ MSLC24: Further Challenges for Metrics on a Wide Landscape of Translation Quality
+ RebeccaKnowlesNational Research Council Canada
+ SamuelLarkinNational Research Council Canada
+ Chi-KiuLoNational Research Council of Canada
+ 475-491
+ In this second edition of the Metric Score Landscape Challenge (MSLC), we examine how automatic metrics for machine translation perform on a wide variety of machine translation output, ranging from very low quality systems to the types of high-quality systems submitted to the General MT shared task at WMT. We also explore metric results on specific types of data, such as empty strings, wrong- or mixed-language text, and more. We raise several alarms about inconsistencies in metric scores, some of which can be resolved by increasingly explicit instructions for metric use, while others highlight technical flaws.
+ 2024.wmt-1.34
+ knowles-etal-2024-mslc24
+
+
+ MetricX-24: The Google Submission to the WMT 2024 Metrics Shared Task
+ JurajJuraskaGoogle
+ DanielDeutschGoogle
+ MaraFinkelsteinGoogle
+ MarkusFreitagGoogle Research
+ 492-504
+ In this paper, we present the MetricX-24 submissions to the WMT24 Metrics Shared Task and provide details on the improvements we made over the previous version of MetricX. Our primary submission is a hybrid reference-based/-free metric, which can score a translation irrespective of whether it is given the source segment, the reference, or both. The metric is trained on previous WMT data in a two-stage fashion, first on the DA ratings only, then on a mixture of MQM and DA ratings. The training set in both stages is augmented with synthetic examples that we created to make the metric more robust to several common failure modes, such as fluent but unrelated translation, or undertranslation. We demonstrate the benefits of the individual modifications via an ablation study, and show a significant performance increase over MetricX-23 on the WMT23 MQM ratings, as well as our new synthetic challenge set.
+ 2024.wmt-1.35
+ juraska-etal-2024-metricx
+
+
+ Evaluating WMT 2024 Metrics Shared Task Submissions on AfriMTE (the African Challenge Set)
+ JiayiWangUniversity College London
+ David IfeoluwaAdelaniMcGill University / MILA
+ PontusStenetorpUniversity College London
+ 505-516
+ The AfriMTE challenge set from WMT 2024 Metrics Shared Task aims to evaluate the capabilities of evaluation metrics for machine translation on low-resource African languages, which primarily assesses cross-lingual transfer learning and generalization of machine translation metrics across a wide range of under-resourced languages. In this paper, we analyze the submissions to WMT 2024 Metrics Shared Task. Our findings indicate that language-specific adaptation, cross-lingual transfer learning, and larger language model sizes contribute significantly to improved metric performance. Moreover, supervised models with relatively moderate sizes demonstrate robust performance, when augmented with specific language adaptation for low-resource African languages. Finally, submissions show promising results for language pairs including Darija-French, English-Egyptian Arabic, and English-Swahili. However, significant challenges persist for extremely low-resource languages such as English-Luo and English-Twi, highlighting areas for future research and improvement in machine translation metrics for African languages.
+ 2024.wmt-1.36
+ wang-etal-2024-evaluating
+
+
+ Machine Translation Metrics Are Better in Evaluating Linguistic Errors on LLMs than on Encoder-Decoder Systems
+ EleftheriosAvramidisGerman Research Center for Artificial Intelligence (DFKI)
+ ShushenManakhimovaGerman Research Center for Artificial Intelligence (DFKI)
+ VivienMacketanzGerman Research Center for Artificial Intelligence (DFKI)
+ SebastianMöllerQuality and Usability Lab, TU Berlin
+ 517-528
+ This year’s MT metrics challenge set submission by DFKI expands on previous years’ linguistically motivated challenge sets. It includes 137,000 items extracted from 100 MT systems for the two language directions (English to German, English to Russian), covering more than 100 linguistically motivated phenomena organized into 14 linguistic categories. The metrics with the statistically significant best performance in our linguistically motivated analysis are MetricX-24-Hybrid and MetricX-24 for English to German, and MetricX-24 for English to Russian. Metametrics and XCOMET are in the next ranking positions in both language pairs. Metrics are more accurate in detecting linguistic errors in translations by large language models (LLMs) than in translations based on the encoder-decoder neural machine translation (NMT) architecture. Some of the most difficult phenomena for the metrics to score are the transitive past progressive, multiple connectors, and the ditransitive simple future I for English to German, and pseudogapping, contact clauses, and cleft sentences for English to Russian. Despite its overall low performance, the LLM-based metric Gemba performs best in scoring German negation errors.
+ 2024.wmt-1.37
+ avramidis-etal-2024-machine
+
+
+ TMU-HIT’s Submission for the WMT24 Quality Estimation Shared Task: Is GPT-4 a Good Evaluator for Machine Translation?
+ AyakoSatoTokyo Metropolitan University
+ KyotaroNakajimaTMU
+ HwichanKimTokyo Metropolitan University
+ ZhousiChenHitotsubashi University
+ MamoruKomachiHitotsubashi University
+ 529-534
+ In machine translation quality estimation (QE), translation quality is evaluated automatically without the need for reference translations. This paper describes our contribution to the sentence-level subtask of Task 1 at the Ninth Conference on Machine Translation (WMT24), which predicts quality scores for neural MT outputs without reference translations. We fine-tune GPT-4o mini, a large-scale language model (LLM), with limited data for QE. We report results for the direct assessment (DA) method for four language pairs: English-Gujarati (En-Gu), English-Hindi (En-Hi), English-Tamil (En-Ta), and English-Telugu (En-Te). Experiments under zero-shot, few-shot prompting, and fine-tuning settings revealed significantly low performance in the zero-shot setting, while fine-tuning achieved accuracy comparable to last year’s best scores. Our system demonstrated the effectiveness of this approach for low-resource language QE, securing 1st place in both En-Gu and En-Hi, and 4th place in En-Ta and En-Te.
+ 2024.wmt-1.38
+ sato-etal-2024-tmu
+
+
+ HW-TSC 2024 Submission for the Quality Estimation Shared Task
+ WeiqiaoShanHW-TSC
+ MingZhuHW-TSC
+ YuangLiHuawei
+ MengyaoPiaoHW-TSC
+ XiaofengZhaoHW-TSC
+ ChangSuHW-TSC
+ MinZhangHW-TSC
+ HaoYangHW-TSC
+ YanfeiJiangHW-TSC
+ 535-540
+ Quality estimation (QE) is a crucial technique for evaluating the quality of machine translations without the need for reference translations. This paper focuses on Huawei Translation Services Center’s (HW-TSC’s) submission to the sentence-level QE shared task, named LLMs-enhanced-CrossQE. Our system builds upon the CrossQE architecture from last year’s submission, which consists of a multilingual base model and a task-specific downstream layer. The model input is a concatenation of the source and the translated sentences. To enhance performance, we fine-tuned and ensembled multiple base models, including XLM-R, InfoXLM, RemBERT, and CometKiwi. Specifically, we employed two pseudo-data generation methods: 1) a diverse pseudo-data generation method based on the corruption-based data augmentation technique introduced last year, and 2) a pseudo-data generation method that simulates machine translation errors using large language models (LLMs). Our results demonstrate that the system achieves outstanding performance on sentence-level QE test sets.
+ 2024.wmt-1.39
+ shan-etal-2024-hw
+
+
+ HW-TSC’s Participation in the WMT 2024 QEAPE Task
+ JiaweiYuXiamen university
+ XiaofengZhaoHuawei Technologies Co Ltd
+ MinZhangHuawei
+ ZhaoYanqingHuawei
+ YuangLiHuawei
+ SuChangHuawei TSC
+ XiaosongQiaoHuawei Technologies Co Ltd
+ MaMiaomiaoHuawei TSC
+ HaoYangHuawei Co. Ltd
+ 541-546
+ The paper presents the submission by HW-TSC to the WMT 2024 Quality-informed Automatic Post Editing (QEAPE) shared task for the English-Hindi (En-Hi) and English-Tamil (En-Ta) language pairs. We use an LLM for En-Hi and a Transformer for En-Ta. For the LLM, we first continually pre-train Llama3, and then use real APE data for supervised fine-tuning (SFT) of the pre-trained LLM. As for the Transformer for En-Ta, we first pre-train a Machine Translation (MT) model utilizing MT data collected from the web. Then, we fine-tune the model employing real APE data. We also use data augmentation to enhance our model. Specifically, we incorporate candidate translations obtained from an external Machine Translation (MT) system. Given that APE systems tend to exhibit ‘over-correction’, we employ a sentence-level Quality Estimation (QE) system to select the final output, deciding between the original translation and the corresponding output generated by the APE model. Our experiments demonstrate that pre-trained MT models are effective when fine-tuned with an APE corpus of limited size, and that performance can be further improved with external MT augmentation. Our approach improves the HTER by -15.99 points and -0.47 points on En-Hi and En-Ta, respectively.
+ 2024.wmt-1.40
+ yu-etal-2024-hw
+
+
+ Expanding the FLORES+ Multilingual Benchmark with Translations for Aragonese, Aranese, Asturian, and Valencian
+ Juan AntonioPerez-OrtizDepartament de Llenguatges i Sistemes Informatics, Universitat d’Alacant
+ FelipeSánchez-MartínezUniversitat d’Alacant
+ Víctor M.Sánchez-CartagenaUniversitat d’Alacant
+ MiquelEsplà-GomisUniversitat d’Alacant
+ AaronGaliano JimenezUniversitat d’Alacant
+ AntoniOliverUniversitat Oberta de Catalunya
+ ClaudiAventín-BoyaUniversitat Oberta de Catalunya
+ AlejandroPardosUniversidad de Zaragoza
+ CristinaValdésAcademia de la Llingua Asturiana / Universidad de Oviedo
+ Jusèp LoísSans SocasauInstitut d’Estudis Aranesi – Acadèmia Aranesa dera Lengua Occitana
+ Juan PabloMartínezAcademia Aragonesa de la Lengua / Universidad de Zaragoza
+ 547-555
+ In this paper, we describe the process of creating the FLORES+ datasets for several Romance languages spoken in Spain, namely Aragonese, Aranese, Asturian, and Valencian. The Aragonese and Aranese datasets are entirely new additions to the FLORES+ multilingual benchmark. An initial version of the Asturian dataset was already available in FLORES+, and our work focused on a thorough revision. Similarly, FLORES+ included a Catalan dataset, which we adapted to the Valencian variety spoken in the Valencian Community. The development of the Aragonese, Aranese, and revised Asturian FLORES+ datasets was undertaken as part of a WMT24 shared task on translation into low-resource languages of Spain.
+ 2024.wmt-1.41
+ perez-ortiz-etal-2024-expanding
+
+
+ The Bangla/Bengali Seed Dataset Submission to the WMT24 Open Language Data Initiative Shared Task
+ FirozAhmedUniversity of Florida
+ NitinVenkateswaranUniversity of Florida
+ SarahMoellerUniversity of Florida
+ 556-566
+ We contribute a seed dataset for the Bangla/Bengali language as part of the WMT24 Open Language Data Initiative shared task. We validate the quality of the dataset against a mined and automatically aligned dataset (NLLBv1) and two other existing datasets of crowdsourced manual translations. The validation is performed by investigating the performance of state-of-the-art translation models fine-tuned on the different datasets after controlling for training set size. Machine translation models fine-tuned on our dataset outperform models tuned on the other datasets in both translation directions (English-Bangla and Bangla-English). These results confirm the quality of our dataset. We hope our dataset will support machine translation for the Bangla/Bengali community and related low-resource languages.
+ 2024.wmt-1.42
+ ahmed-etal-2024-bangla
+
+
+ A High-quality Seed Dataset for Italian Machine Translation
+ EdoardoFerranteConseggio pe-o patrimonio linguistico ligure
+ 567-569
+ This paper describes the submission of a high-quality translation of the OLDI Seed dataset into Italian for the WMT 2024 Open Language Data Initiative shared task. The base of this submission is a previous version of an Italian OLDI Seed dataset released by Haberland et al. (2024) via machine translation and partial post-editing. This data was subsequently reviewed in its entirety by two native speakers of Italian, who carried out extensive post-editing with particular attention to the idiomatic translation of named entities.
+ 2024.wmt-1.43
+ ferrante-2024-high
+
+
+ Correcting FLORES Evaluation Dataset for Four African Languages
+ IdrisAbdulmuminUniversity of Pretoria
+ SthembisoMkhwanaziCSIR
+ MahlatseMbooiCouncil for Scientific and Industrial Research
+ Shamsuddeen HassanMuhammadBayero University, Kano
+ Ibrahim SaidAhmadNortheastern University
+ NeoPutiniUniversity of KwaZulu-Natal
+ MiehleketoMathebulaUniversity of Pretoria
+ MatimbaShingangeUniversity of Pretoria
+ TajuddeenGwadabeMasakhane Research Foundation
+ VukosiMarivateUniversity of Pretoria, Lelapa AI
+ 570-578
+ This paper describes the corrections made to the FLORES evaluation (dev and devtest) dataset for four African languages, namely Hausa, Northern Sotho (Sepedi), Xitsonga, and isiZulu. The original dataset, though groundbreaking in its coverage of low-resource languages, exhibited various inconsistencies and inaccuracies in the reviewed languages that could potentially hinder the integrity of the evaluation of downstream tasks in natural language processing (NLP), especially machine translation. Through a meticulous review process by native speakers, several corrections were identified and implemented, improving the dataset’s overall quality and reliability. For each language, we provide a concise summary of the errors encountered and corrected and also present some statistical analysis that measures the difference between the existing and corrected datasets. We believe that our corrections enhance the linguistic accuracy and reliability of the data and, thereby, contribute to a more effective evaluation of NLP tasks involving the four African languages. Finally, we recommend that future translation efforts, particularly in low-resource languages, prioritize the active involvement of native speakers at every stage of the process to ensure linguistic accuracy and cultural relevance.
+ 2024.wmt-1.44
+ abdulmumin-etal-2024-correcting
+
+
+ Expanding FLORES+ Benchmark for More Low-Resource Settings: Portuguese-Emakhuwa Machine Translation Evaluation
+ Felermino Dario MarioAliLurio University
+ HenriqueLopes CardosoUniversity of Porto
+ RuiSousa-SilvaUniversity of Porto - Faculty of Arts and Humanities
+ 579-592
+ As part of the Open Language Data Initiative shared tasks, we have expanded the FLORES+ evaluation set to include Emakhuwa, a low-resource language widely spoken in Mozambique. We translated the dev and devtest sets from Portuguese into Emakhuwa, and we detail the translation process and quality assurance measures used. Our methodology involved various quality checks, including post-editing and adequacy assessments. The resulting datasets consist of multiple reference sentences for each source. We present baseline results from training a Neural Machine Translation system and fine-tuning existing multilingual translation models. Our findings suggest that spelling inconsistencies remain a challenge in Emakhuwa. Additionally, the baseline models underperformed on this evaluation set, underscoring the necessity for further research to enhance machine translation quality for Emakhuwa. The data is publicly available at https://huggingface.co/datasets/LIACC/Emakhuwa-FLORES.
+ 2024.wmt-1.45
+ ali-etal-2024-expanding
+
+
+ Enhancing Tuvan Language Resources through the FLORES Dataset
+ AliKuzhugettyvan.ru
+ AiranaMongushAlgebras AI
+ Nachyn-EnkhedorzhuOorzhakwww.tyvan.ru
+ 593-599
+ FLORES is a benchmark dataset designed for evaluating machine translation systems, particularly for low-resource languages. This paper, conducted as a part of the Open Language Data Initiative (OLDI) shared task, presents our contribution to expanding the FLORES dataset with high-quality translations from Russian to Tuvan, an endangered Turkic language. Our approach combined the linguistic expertise of native speakers to ensure both accuracy and cultural relevance in the translations. This project represents a significant step forward in supporting Tuvan as a low-resource language in the realm of natural language processing (NLP) and machine translation (MT).
+ 2024.wmt-1.46
+ kuzhuget-etal-2024-enhancing
+
+
+ Machine Translation Evaluation Benchmark for Wu Chinese: Workflow and Analysis
+ HongjianYuUniversity of Washington
+ YimingShiEast China Normal University
+ ZheruiZhouShanghai Normal University
+ ChristopherHaberlandUniversity of Washington
+ 600-605
+ We introduce a FLORES+ dataset as an evaluation benchmark for modern Wu Chinese machine translation models and showcase its compatibility with existing Wu data. Wu Chinese is mutually unintelligible with other Sinitic languages such as Mandarin and Yue (Cantonese), but uses a set of Hanzi (Chinese characters) that profoundly overlaps with the others. The population of Wu speakers is the second largest among languages in China, but the language has been suffering from a significant drop in usage, especially among the younger generations. We identify Wu Chinese as a textually low-resource language and address challenges for its machine translation models. Our contributions include: (1) an open-source, manually translated dataset, (2) full documentation of the process of dataset creation and validation experiments, (3) preliminary tools for Wu Chinese normalization and segmentation, and (4) benefits and limitations of our dataset, as well as implications for other low-resource languages.
+ 2024.wmt-1.47
+ yu-etal-2024-machine
+
+
+ Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak
+ MukhammadsaidMamasaidovTahrirchi
+ AbrorShopulatovTahrirchi
+ 606-613
+ This study presents several contributions for the Karakalpak language: a FLORES+ devtest dataset translated to Karakalpak, parallel corpora for Uzbek-Karakalpak, Russian-Karakalpak and English-Karakalpak of 100,000 pairs each and open-sourced fine-tuned neural models for translation across these languages. Our experiments compare different model variants and training approaches, demonstrating improvements over existing baselines. This work, conducted as part of the Open Language Data Initiative (OLDI) shared task, aims to advance machine translation capabilities for Karakalpak and contribute to expanding linguistic diversity in NLP technologies.
+ 2024.wmt-1.48
+ mamasaidov-shopulatov-2024-open
+
+
+ FLORES+ Translation and Machine Translation Evaluation for the Erzya Language
+ IsaiGordeevÉcole Polytechnique
+ SergeyKuldinIndependent Researcher
+ DavidDaleMeta AI
+ 614-623
+ This paper introduces a translation of the FLORES+ dataset into the endangered Erzya language, with the goal of evaluating machine translation between this language and any of the other 200 languages already included in FLORES+. This translation was carried out as a part of the Open Language Data shared task at WMT24. We also present a benchmark of existing translation models based on this dataset, and a new translation model that achieves state-of-the-art quality for translation into Erzya from Russian and English.
+ 2024.wmt-1.49
+ gordeev-etal-2024-flores
+
+
+ Spanish Corpus and Provenance with Computer-Aided Translation for the WMT24 OLDI Shared Task
+ JoseColsUniversity of Washington
+ 624-635
+ This paper presents the Seed-CAT submission to the WMT24 Open Language Data Initiative shared task. We detail our data collection method, which involves a computer-aided translation tool developed explicitly for translating Seed corpora. We release a professionally translated Spanish corpus and a provenance dataset documenting the translation process. The quality of the data was validated on the FLORES+ benchmark with English-Spanish neural machine translation models, achieving an average chrF++ score of 34.9.
+ 2024.wmt-1.50
+ cols-2024-spanish
+
+
+ Efficient Terminology Integration for LLM-based Translation in Specialized Domains
+ SejoonKimYonsei University, PwC Korea
+ MingiSungYonsei University
+ JeonghwanLeeYonsei University
+ HyunkukLimYonsei University
+ JorgeGimenez PerezKorea University
+ 636-642
+ Traditional machine translation methods typically involve training models directly on large parallel corpora, with limited emphasis on specialized terminology. However, in specialized fields such as the patent, finance, and biomedical domains, terminology is crucial for translation, with many terms that should not be translated according to the semantics of the sentence but should instead follow agreed-upon conventions. In this paper, we introduce a methodology that efficiently trains models with a smaller amount of data while preserving the accuracy of terminology translation. The terminology extraction model generates a glossary from existing training datasets and further refines the LLM by instructing it to effectively incorporate these terms into translations. We achieve this through a systematic process of term extraction and glossary creation using the Trie Tree algorithm, followed by data reconstruction to teach the LLM how to integrate these specialized terms. This methodology enhances the model’s ability to handle specialized terminology and ensures high-quality translations, particularly in fields where term consistency is crucial. Our approach has demonstrated exceptional performance, achieving the highest translation score among participants in the WMT patent task to date, showcasing its effectiveness and broad applicability in specialized translation domains where general methods often fall short.
+ 2024.wmt-1.51
+ kim-etal-2024-efficient
+
+
+ Rakuten’s Participation in WMT 2024 Patent Translation Task
+ OhnmarHtunRakuten Institute of Technology-Singapore, Rakuten Asia Pte.Ltd.
+ AlbertoPoncelasRakuten Institute of Technology
+ 643-646
+ This paper introduces our machine translation system (team sakura), developed for the 2024 WMT Patent Translation Task. Our system focuses on translations between Japanese-English, Japanese-Korean, and Japanese-Chinese. As large language models have shown good results for various natural language processing tasks, we have adopted the RakutenAI-7B-chat model, which has demonstrated effectiveness in English and Japanese. We fine-tune this model with patent-domain parallel texts and translate using multiple prompts.
+ 2024.wmt-1.52
+ htun-poncelas-2024-rakutens
+
+
+ The SETU-ADAPT Submission for WMT 24 Biomedical Shared Task
+ AntonioCastaldoUniversity of Naples “L’Orientale”
+ MariaZafarSouth East Technological University
+ PrashanthNayakKantanAI
+ RejwanulHaqueSouth East Technological University
+ AndyWayDublin City University
+ JohannaMontiUniversity of Naples “L’Orientale”
+ 647-653
+ This system description paper presents SETU-ADAPT’s submission to the WMT 2024 Biomedical Shared Task, where we participated for the language pairs English-to-French and English-to-German. Our approach focused on fine-tuning Large Language Models, using in-domain and synthetic data, employing different data augmentation and data retrieval strategies. We introduce a novel MT framework, involving three autonomous agents: a Translator Agent, an Evaluator Agent and a Reviewer Agent. We present our findings and report the quality of the outputs.
+ 2024.wmt-1.53
+ castaldo-etal-2024-setu
+
+
+ Findings of WMT 2024 Shared Task on Low-Resource Indic Languages Translation
+ ParthaPakrayNational Institute of Technology Silchar
+ SantanuPalWipro
+ AdvaithaVetagiriNational Institute of Technology Silchar
+ ReddiKrishnaNational Institute of Technology Silchar
+ Arnab KumarMajiNorth-Eastern Hill University
+ SandeepDashAssistant Professor
+ LeninLaitonjamIIT Guwahati, NIT Mizoram
+ LyngdohSarahNorth-Eastern Hill University
+ RiyankaMannaAmrita Vishwa Vidyapeetham Amaravati
+ 654-668
+ This paper presents the results of the low-resource Indic language translation task, organized in conjunction with the Ninth Conference on Machine Translation (WMT) 2024. In this edition, participants were challenged to develop machine translation models for four distinct language pairs: English-Assamese, English-Mizo, English-Khasi, and English-Manipuri. The task utilized the enriched IndicNE-Corp1.0 dataset, which includes an extensive collection of parallel and monolingual corpora for northeastern Indic languages. The evaluation was conducted through a comprehensive suite of automatic metrics—BLEU, TER, RIBES, METEOR, and ChrF—supplemented by meticulous human assessment to measure the translation systems’ performance and accuracy. This initiative aims to drive advancements in low-resource machine translation and make a substantial contribution to the growing body of knowledge in this dynamic field.
+ 2024.wmt-1.54
+ pakray-etal-2024-findings
+
+
+ Findings of WMT 2024’s MultiIndic22MT Shared Task for Machine Translation of 22 Indian Languages
+ RajDabreNICT
+ AnoopKunchukuttanMicrosoft AI and Research
+ 669-676
+ This paper presents the findings of the WMT 2024’s MultiIndic22MT Shared Task, focusing on Machine Translation (MT) of 22 Indian Languages. In this task, we challenged participants with building MT systems which could translate between any or all of 22 Indian languages in the 8th schedule of the Indian constitution and English. For evaluation, we focused on automatic metrics, namely, chrF, chrF++ and BLEU.
+ 2024.wmt-1.55
+ dabre-kunchukuttan-2024-findings
+
+
+ Findings of WMT2024 English-to-Low Resource Multimodal Translation Task
+ ShantipriyaParidaSilo AI
+ OndřejBojarCharles University, MFF UFAL
+ IdrisAbdulmuminUniversity of Pretoria
+ Shamsuddeen HassanMuhammadBayero University, Kano
+ Ibrahim SaidAhmadNortheastern University
+ 677-683
+ This paper presents the results of the English-to-Low Resource Multimodal Translation shared task from the Ninth Conference on Machine Translation (WMT2024). This year, 7 teams submitted their translation results for automatic and human evaluation.
+ 2024.wmt-1.56
+ parida-etal-2024-findings
+
+
+ Findings of the WMT 2024 Shared Task Translation into Low-Resource Languages of Spain: Blending Rule-Based and Neural Systems
+ FelipeSánchez-MartínezUniversitat d’Alacant
+ Juan AntonioPerez-OrtizDepartament de Llenguatges i Sistemes Informatics, Universitat d’Alacant
+ AaronGaliano JimenezUniversitat d’Alacant
+ AntoniOliverUniversitat Oberta de Catalunya
+ 684-698
+ This paper presents the results of the Ninth Conference on Machine Translation (WMT24) Shared Task “Translation into Low-Resource Languages of Spain”. The task focused on the development of machine translation systems for three language pairs: Spanish-Aragonese, Spanish-Aranese, and Spanish-Asturian. A total of 17 teams participated in the shared task, with 87 submissions overall. The baseline system for all language pairs was Apertium, a rule-based machine translation system that still performs competitively, even in an era dominated by more advanced non-symbolic approaches. We report and discuss the results of the submitted systems, highlighting the strengths of both neural and rule-based approaches.
+ 2024.wmt-1.57
+ sanchez-martinez-etal-2024-findings
+
+
+ Findings of the WMT 2024 Shared Task on Discourse-Level Literary Translation
+ LongyueWangTencent AI Lab
+ SiyouLiuUniversity of Macau
+ ChenyangLyuMBZUAI
+ WenxiangJiaoTencent AI Lab
+ XingWangTencent
+ JiahaoXuTencent AI Lab
+ ZhaopengTuTencent AI Lab
+ YanGuChina Literature Ltd.
+ WeiyuChenChina Literature Ltd.
+ MinghaoWuMonash University
+ LitingZhouDublin City University
+ PhilippKoehnJohns Hopkins University
+ AndyWayADAPT, Dublin City University
+ YulinYuanChinese Language and Literature, Peking University.
+ 699-700
+ Translating literary works has perennially stood as an elusive dream in machine translation (MT), a journey steeped in intricate challenges. To foster progress in this domain, we hold a new shared task at WMT 2024, the second edition of the Discourse-Level Literary Translation task. First, we (Tencent AI Lab and China Literature Ltd.) release a copyrighted, document-level Chinese-English web novel corpus. Furthermore, we put forth industry-endorsed criteria to guide the human evaluation process. This year, we received a total of 10 submissions from 5 academia and industry teams. We employ both automatic and human evaluations to measure the performance of the submitted systems. The official ranking of the systems is based on the overall human judgments. In addition, our extensive analysis reveals a series of interesting findings on literary and discourse-aware MT. We release data, system outputs, and the leaderboard at https://www2.statmt.org/wmt24/literary-translation-task.html.
+ 2024.wmt-1.58
+ wang-etal-2024-findings
+
+
+ Findings of the WMT 2024 Shared Task on Chat Translation
+ WafaaMohammedUniversity of Amsterdam
+ SwetaAgrawalInstituto de Telecomunicações
+ AminFarajianUnbabel
+ VeraCabarrãoUnbabel
+ BryanEikemaUniversity of Amsterdam
+ Ana CFarinhaUnbabel
+ José G.C. De SouzaUnbabel
+ 701-714
+ This paper presents the findings from the third edition of the Chat Translation Shared Task. As with previous editions, the task involved translating bilingual customer support conversations, specifically focusing on the impact of conversation context on translation quality and evaluation. We also include two new language pairs, English-Korean and English-Dutch, in addition to the set of language pairs from previous editions: English-German, English-French, and English-Brazilian Portuguese. We received 22 primary submissions and 32 contrastive submissions from eight teams, with each language pair having participation from at least three teams. We evaluated the systems comprehensively using both automatic metrics and human judgments via a direct assessment framework. The official rankings for each language pair were determined based on human evaluation scores, considering performance in both translation directions (agent and customer). Our analysis shows that while the systems excelled at translating individual turns, there is room for improvement in overall conversation-level translation quality.
+ 2024.wmt-1.59
+ mohammed-etal-2024-findings
+
+
+ Findings of the WMT 2024 Shared Task on Non-Repetitive Translation
+ KazutakaKinugawaNHK Science & Technology Research Laboratories
+ HideyaMinoNHK Science & Technology Research Laboratories
+ IsaoGotoEhime University
+ NaotoShiraiNHK Science & Technology Research Laboratories
+ 715-727
+ The repetition of words in an English sentence can create a monotonous or awkward impression. In such cases, repetition should be avoided appropriately. To evaluate the performance of machine translation (MT) systems in avoiding such repetition and outputting more polished translations, we presented the shared task of controlling the lexical choice of MT systems. From Japanese–English parallel news articles, we collected several hundred sentence pairs in which the source sentences containing repeated words were translated in a style that avoided repetition. Participants were required to encourage the MT system to output tokens in a non-repetitive manner while maintaining translation quality. We conducted human and automatic evaluations of systems submitted by two teams based on an encoder-decoder Transformer and a large language model, respectively. From the experimental results and analysis, we report a series of findings on this task.
+ 2024.wmt-1.60
+ kinugawa-etal-2024-findings
+
+
+ A3-108 Controlling Token Generation in Low Resource Machine Translation Systems
+ SaumitraYadavInternational Institute of Information Technology, Hyderabad
+ AnanyaMukherjeeInternational Institute of Information Technology Hyderabad
+ ManishShrivastavaInternational Institute of Information Technology Hyderabad
+ 728-734
+ Translating for languages with limited resources poses a persistent challenge due to the scarcity of high-quality training data. To enhance translation accuracy, we explored controlled generation mechanisms, focusing on the importance of control tokens. In our experiments, during training we encoded the target sentence length as a control token added to the source sentence, treating it as an additional feature of the source sentence. We developed various NMT models using the transformer architecture and conducted experiments across 8 language directions (English ↔ Assamese, Manipuri, Khasi, and Mizo), exploring four variations of length encoding mechanisms. Through comparative analysis against the baseline model, we submitted two systems for each language direction. We report our findings in this work.
+ 2024.wmt-1.61
+ yadav-etal-2024-a3
+
+
+ Samsung R&D Institute Philippines @ WMT 2024 Indic MT Task
+ Matthew TheodoreRoqueSamsung Research Philippines (SRPH)
+ Carlos RafaelCatalanSamsung Research Philippines (SRPH)
+ Dan JohnVelascoSamsung Research Philippines (SRPH)
+ Manuel AntonioRufinoSamsung Research Philippines (SRPH)
+ Jan Christian BlaiseCruzSamsung Research Philippines (SRPH)
+ 735-741
+ This paper presents the methodology developed by the Samsung R&D Institute Philippines (SRPH) Language Intelligence Team (LIT) for the WMT 2024 Shared Task on Low-Resource Indic Language Translation. We trained standard sequence-to-sequence Transformer models from scratch for both English-to-Indic and Indic-to-English translation directions. Additionally, we explored data augmentation through backtranslation and the application of noisy channel reranking to improve translation quality. A multilingual model trained across all language pairs was also investigated. Our results demonstrate the effectiveness of the multilingual model, with significant performance improvements observed in most language pairs, highlighting the potential of shared language representations in low-resource translation scenarios.
+ 2024.wmt-1.62
+ roque-etal-2024-samsung
+
+
+ DLUT-NLP Machine Translation Systems for WMT24 Low-Resource Indic Language Translation
+ ChenfeiJuDalian University of Technology
+ JunpengLiuDalian University of Technology
+ KaiyuHuangBeijing Jiaotong University
+ DegenHuangDalian University of Technology
+ 742-746
+ This paper describes the submission systems of the DLUT-NLP team for the WMT24 low-resource Indic language translation shared task. We participated in the translation task for four language pairs: en-as, en-mz, en-kha, and en-mni.
+ 2024.wmt-1.63
+ ju-etal-2024-dlut
+
+
+ SRIB-NMT’s Submission to the Indic MT Shared Task in WMT 2024
+ PranamyaPatilSamsung Research
+ RaghavendraHrSamsung Research
+ AdityaRaghuwanshiSamsung Research
+ KushalVermaSamsung Research
+ 747-750
+ In the context of the Indic Low Resource Machine Translation (MT) challenge at WMT-24, we participated in four language pairs: English-Assamese (en-as), English-Mizo (en-mz), English-Khasi (en-kh), and English-Manipuri (en-mn). To address these tasks, we employed a transformer-based sequence-to-sequence architecture (Vaswani et al., 2017). In the PRIMARY system, which did not utilize external data, we first pretrained language models (low resource languages) using available monolingual data before finetuning them on small parallel datasets for translation. For the CONTRASTIVE submission approach, we utilized pretrained translation models like IndicTrans2 (Gala et al., 2023) and applied LoRA fine-tuning (Hu et al., 2021) to adapt them to smaller, low-resource languages, aiming to leverage cross-lingual language transfer capabilities (Conneau and Lample, 2019). These approaches resulted in significant improvements in SacreBLEU scores (Post, 2018) for low-resource languages.
+ 2024.wmt-1.64
+ patil-etal-2024-srib
+
+
+ MTNLP-IIITH: Machine Translation for Low-Resource Indic Languages
+ AbhinavP MInternational Institute of Information Technology
+ KetakiShetyeInternational Institute of Information Technology
+ ParameswariKrishnamurthyInternational Institute of Information Technology
+ 751-755
+ Machine Translation for low-resource languages presents significant challenges, primarily due to limited data availability. We build two systems: a baseline model and a primary model. For the baseline model, we first fine-tune the mBART model (mbart-large-50-many-to-many-mmt) for the language pairs English-Khasi, Khasi-English, English-Manipuri, and Manipuri-English. We then augment the dataset by back-translating from Indic languages to English. To enhance data quality, we fine-tune the LaBSE model specifically for Khasi and Manipuri, generating sentence embeddings and applying a cosine similarity threshold of 0.84 to filter out low-quality back-translations. The filtered data is combined with the original training data and used to further fine-tune the mBART model, creating our primary model. The results show that the primary model slightly outperforms the baseline model, with the best performance achieved by the English-to-Khasi (en-kh) primary model, which recorded a BLEU score of 0.0492, a chrF score of 0.3316, and a METEOR score of 0.2589 (on a scale of 0 to 1), with similar results for other language pairs.
+ 2024.wmt-1.65
+ p-m-etal-2024-mtnlp
+
+
+ Exploration of the CycleGN Framework for Low-Resource Languages
+ SörenDreanoDublin City University
+ DerekMolloyDublin City University
+ NoelMurphyDublin City University
+ 756-761
+ CycleGN is a Neural Machine Translation framework relying on the Transformer architecture. The foundational concept of our research posits that in an ideal scenario, retro-translations of generated translations should revert to the original source sentences. Consequently, a pair of models can be trained using a Cycle Consistency Loss only, with one model translating in one direction and the second model in the opposite direction.
+ 2024.wmt-1.66
+ dreano-etal-2024-exploration
+
+
+ The SETU-ADAPT Submissions to the WMT24 Low-Resource Indic Language Translation Task
+ NehaGajakosADAPT Centre
+ PrashanthNayakDublin City University
+ RejwanulHaqueSouth East Technological University
+ AndyWayADAPT Centre
+ 762-769
+ This paper presents SETU-ADAPT’s submissions to the WMT 2024 Low-Resource Indic Language Translation task. We participated in the unconstrained segment of the task, focusing on the Assamese-to-English and English-to-Assamese language pairs. Our approach involves leveraging Large Language Models (LLMs) as the baseline systems for all our MT tasks. Furthermore, we applied various strategies to improve the baseline systems. In our first approach, we fine-tuned LLMs using all the data provided by the task organisers. Our second approach explores in-context learning by focusing on few-shot prompting. In our final approach, we explore an efficient data extraction technique based on a fuzzy match-based similarity measure for fine-tuning. We evaluated our systems using BLEU, chrF, WER, and COMET. The experimental results showed that our strategies can effectively improve the quality of translations in low-resource scenarios.
+ 2024.wmt-1.67
+ gajakos-etal-2024-setu
+
+
+ SPRING Lab IITM’s Submission to Low Resource Indic Language Translation Shared Task
+ AdvaitJoglekarIndian Institute of Technology Madras
+ Hamees Ul HasanSayedIndian Institute of Technology Madras
+ SrinivasanUmeshIndian Institute of Technology Madras
+ 770-774
+ We develop a robust translation model for four low-resource Indic languages: Khasi, Mizo, Manipuri, and Assamese. Our approach includes a comprehensive pipeline from data collection and preprocessing to training and evaluation, leveraging data from WMT task datasets, BPCC, PMIndia, and OpenLanguageData. To address the scarcity of bilingual data, we use back-translation techniques on monolingual datasets for Mizo and Khasi, significantly expanding our training corpus. We fine-tune the pre-trained NLLB 3.3B model for Assamese, Mizo, and Manipuri, achieving improved performance over the baseline. For Khasi, which is not supported by the NLLB model, we introduce special tokens and train the model on our Khasi corpus. Our training involves masked language modelling, followed by fine-tuning for English-to-Indic and Indic-to-English translations.
+ 2024.wmt-1.68
+ joglekar-etal-2024-spring
+
+
+ Machine Translation Advancements of Low-Resource Indian Languages by Transfer Learning
+ BinWeiHuawei Translation Services Center
+ ZhengJiaweiHuawei Translation Services Center
+ ZongyaoLiHuawei Translation Services Center
+ ZhanglinWuHuawei Technologies Co., Ltd.
+ JiaxinGuoHuawei Translation Services Center
+ DaimengWeiHuawei Technologies Co., Ltd.
+ ZhiqiangRaoHuawei Translation Service Center, Beijing, China
+ ShaojunLiHuawei Technologies Co., Ltd.
+ YuanchangLuoHuawei Translation Services Center
+ HengchaoShangHuawei Technologies Co., Ltd.
+ JinlongYangHuawei Technologies Co., Ltd
+ YuhaoXieHW-TSC
+ HaoYangHuawei Co. Ltd
+ 775-780
+ This paper introduces the submission by Huawei Translation Center (HW-TSC) to the WMT24 Indian Languages Machine Translation (MT) Shared Task. To develop a reliable machine translation system for low-resource Indian languages, we employed two distinct knowledge transfer strategies, taking into account the characteristics of the language scripts and the support available from existing open-source models for Indian languages. For Assamese (as) and Manipuri (mn), we fine-tuned the existing IndicTrans2 open-source model to enable bidirectional translation between English and these languages. For Khasi (kh) and Mizo (mz), we trained a multilingual model as the baseline using bilingual data from these four language pairs, as well as additional Bengali data, which shares the same language family. This was followed by fine-tuning to achieve bidirectional translation between English and Khasi, as well as English and Mizo. Our transfer learning experiments produced significant results: 23.5 BLEU for en→as, 31.8 BLEU for en→mn, 36.2 BLEU for as→en, and 47.9 BLEU for mn→en on their respective test sets. Similarly, the multilingual model transfer learning experiments yielded impressive outcomes, achieving 19.7 BLEU for en→kh, 32.8 BLEU for en→mz, 16.1 BLEU for kh→en, and 33.9 BLEU for mz→en on their respective test sets. These results not only highlight the effectiveness of transfer learning techniques for low-resource languages but also contribute to advancing machine translation capabilities for low-resource Indian languages.
+ 2024.wmt-1.69
+ wei-etal-2024-machine
+
+
+ NLIP_Lab-IITH Low-Resource MT System for WMT24 Indic MT Shared Task
+ PramitSahooIndian Institute of Technology Hyderabad
+ MaharajBrahmaIndian Institute of Technology Hyderabad
+ Maunendra SankarDesarkarIIT Hyderabad
+ 781-787
+ In this paper, we describe our system for the WMT 2024 shared task of Low-Resource Indic Language Translation. We consider eng↔{as, kha, lus, mni} as participating language pairs. In this shared task, we explore the fine-tuning of a pre-trained model motivated by the pre-training objective of aligning embeddings more closely via alignment augmentation (Lin et al., 2020) for 22 scheduled Indian languages. Our primary system is based on language-specific fine-tuning of a pre-trained model. We achieve chrF2 scores of 50.6, 42.3, 54.9, and 66.3 on the official public test set for eng→as, eng→kha, eng→lus, and eng→mni, respectively. We also explore multilingual training with/without language grouping and layer-freezing.
+ 2024.wmt-1.70
+ sahoo-etal-2024-nlip
+
+
+ Yes-MT’s Submission to the Low-Resource Indic Language Translation Shared Task in WMT 2024
+ YashBhaskarIIIT Hyderabad
+ ParameswariKrishnamurthyAssistant Professor, IIIT Hyderabad
+ 788-792
+ This paper presents the systems submitted by the Yes-MT team for the Low-Resource Indic Language Translation Shared Task at WMT 2024, focusing on translating between English and the Assamese, Mizo, Khasi, and Manipuri languages. The experiments explored various approaches, including fine-tuning pre-trained models like mT5 and IndicBart in both multilingual and monolingual settings, LoRA fine-tuning of IndicTrans2, zero-shot and few-shot prompting with large language models (LLMs) like Llama 3 and Mixtral 8x7b, LoRA supervised fine-tuning of Llama 3, and training Transformers from scratch. The results were evaluated on the WMT23 Low-Resource Indic Language Translation Shared Task’s test data using SacreBLEU and chrF; they highlight the challenges of low-resource translation and show the potential of LLMs for these tasks, particularly with fine-tuning.
+ 2024.wmt-1.71
+ bhaskar-krishnamurthy-2024-yes
+
+
+ System Description of BV-SLP for Sindhi-English Machine Translation in MultiIndic22MT 2024 Shared Task
+ NisheethJoshiBanasthali Vidyapith
+ PragyaKatyayanUniversity of Petroleum and Energy Studies
+ PalakAroraBanasthali Vidyapith
+ BhartiNathaniBanasthali Vidyapith
+ 793-796
+ This paper presents our machine translation system developed for the WAT2024 MultiIndic22MT shared task. We built our systems for the Sindhi-English language pair, developing two MT systems. The first was our baseline system, in which Sindhi was translated into English. In the second system, we used Hindi as a pivot for the translation of text. In both cases, we identified named entities and translated them into English as a preprocessing step. Once this was done, the standard NMT process was followed to train and generate MT outputs for the task. The systems were tested on the hidden dataset of the shared task.
+ 2024.wmt-1.72
+ joshi-etal-2024-system
+
+
+ WMT24 System Description for the MultiIndic22MT Shared Task on Manipuri Language
+ Ningthoujam JustwantSinghNational Institute Of Technology, Silchar
+ Kshetrimayum BoynaoSinghNational Institute of Technology Silchar
+ Ningthoujam AvichandraSinghNational Institute of Technology Silchar
+ SanjitaPhijamNational Institute of Technology Silchar
+ Thoudam DorenSinghNational Institute of Technology Silchar
+ 797-803
+ This paper presents a Transformer-based Neural Machine Translation (NMT) system developed by the Centre for Natural Language Processing and the Department of Computer Science and Engineering at the National Institute of Technology Silchar, India (NITS-CNLP) for the MultiIndic22MT 2024 Shared Task. The system focused on the English-Manipuri language pair for the WMT24 shared task. The proposed system achieves a BLEU score of 6.4, a chrF score of 28.6, and a chrF++ score of 26.6 on the Indic-Conv public test set, and a BLEU score of 8.1, a chrF score of 32.1, and a chrF++ score of 29.4 on the Indic-Gen public test set for English-to-Manipuri translation.
+ 2024.wmt-1.73
+ singh-etal-2024-wmt24
+
+
+ NLIP-Lab-IITH Multilingual MT System for WAT24 MT Shared Task
+ MaharajBrahmaIndian Institute of Technology Hyderabad
+ PramitSahooIndian Institute of Technology Hyderabad
+ Maunendra SankarDesarkarIIT Hyderabad
+ 804-809
+ This paper describes NLIP Lab’s multilingual machine translation system for the WAT24 shared task on multilingual Indic MT for 22 scheduled languages belonging to 4 language families. We explore pre-training for Indic languages using alignment agreement objectives. We utilize bilingual dictionaries to substitute words from source sentences. Furthermore, we fine-tuned language-direction-specific multilingual translation models using small and high-quality seed data. Our primary submission is a 243M-parameter multilingual translation model covering 22 Indic languages. On the IN22-Gen benchmark, we achieved an average chrF++ score of 46.80 and a BLEU score of 18.19 for the En-Indic direction; in the Indic-En direction, we achieved an average chrF++ score of 56.34 and a BLEU score of 30.82. On the IN22-Conv benchmark, we achieved an average chrF++ score of 43.43 and a BLEU score of 16.58 in the En-Indic direction, and in the Indic-En direction, an average of 52.44 and 29.77 for chrF++ and BLEU, respectively. Our model is competitive with IndicTrans v1 (a 474M-parameter model).
+ 2024.wmt-1.74
+ brahma-etal-2024-nlip
+
+
+ DCU ADAPT at WMT24: English to Low-resource Multi-Modal Translation Task
+ SamiHaqDublin City University
+ RudaliHuidromADAPT Research Centre, Dublin City University
+ SheilaCastilhoDublin City University
+ 810-814
+ This paper presents the system description of “DCU_NMT’s” submission to the WMT-WAT24 English-to-Low-Resource Multimodal Translation Task. We participated in the English-to-Hindi track, developing both text-only and multimodal neural machine translation (NMT) systems. The text-only systems were trained from scratch on constrained data and augmented with back-translated data. For the multimodal approach, we implemented a context-aware transformer model that integrates visual features as additional contextual information. Specifically, image descriptions generated by an image captioning model were encoded using BERT and concatenated with the textual input. The results indicate that our multimodal system, trained solely on limited data, showed improvements over the text-only baseline in both the challenge and evaluation sets, suggesting the potential benefits of incorporating visual information.
+ 2024.wmt-1.75
+ haq-etal-2024-dcu
+
+
+ English-to-Low-Resource Translation: A Multimodal Approach for Hindi, Malayalam, Bengali, and Hausa
+ AliHatamiUniversity of Galway
+ ShubhankerBanerjeeUniversity of Galway
+ MihaelArcanLua Health
+ BharathiChakravarthiUniversity of Galway
+ PaulBuitelaarUniversity of Galway
+ JohnMccraeUniversity of Galway
+ 815-822
+ Multimodal machine translation leverages multiple data modalities to enhance translation quality, particularly for low-resource languages. This paper presents a multimodal model that integrates visual information with textual data to improve translation accuracy from English to Hindi, Malayalam, Bengali, and Hausa. The approach employs a gated fusion mechanism to effectively combine the outputs of textual and visual encoders, enabling more nuanced translations that consider both language and contextual visual cues. The performance of the multimodal model was evaluated against a text-only machine translation model using BLEU, chrF2, and TER. Experimental results demonstrate that the multimodal approach consistently outperforms the text-only baseline, highlighting the potential of integrating visual information in low-resource language translation tasks.
+ 2024.wmt-1.76
+ hatami-etal-2024-english
+
+
+ OdiaGenAI’s Participation in WMT2024 English-to-Low Resource Multimodal Translation Task
+ ShantipriyaParidaSilo AI
+ ShashikantaSahooGovernment College of Engineering Kalahandi, India
+ SambitSekharOdia Generative AI
+ UpendraJenaCreanovation Technologies Pvt Ltd., India
+ SushovanJenaIIT Mandi, India
+ KusumLataSharda University, India
+ 823-828
+ This paper describes the submission of team “ODIAGEN” to the WMT 2024 English-to-Low-Resource Multimodal Translation Task. We participated in two of the tasks, i.e., Text-only Translation and Multimodal Translation. For Text-only Translation, we trained the Mistral-7B model for translation from English into multiple languages (Hindi, Bengali, Malayalam, and Hausa). For Multimodal Translation (using both image and text), we trained the PaliGemma-3B model for English-to-Hindi translation.
+ 2024.wmt-1.77
+ parida-etal-2024-odiagenais
+
+
+ Arewa NLP’s Participation at WMT24
+ MahmoudAhmadFederal University of Technology Babura (FUTB)
+ AuwalKhalidBayero University Kano
+ LukmanAliyuArewa Data Science
+ BabangidaSaniArewa Data Science
+ MariyaAbdullahiBayero University Kano
+ 829-832
+ This paper presents the work of our team, “ArewaNLP,” for the WMT 2024 shared task. The paper describes the system submitted to the Ninth Conference on Machine Translation (WMT24). We participated in the English-Hausa text-only translation task. We fine-tuned the OPUS-MT-en-ha transformer model and our submission achieved competitive results in this task. We achieve BLEU scores of 27.76, 40.31, and 5.85 on the Development, Evaluation, and Challenge test sets, respectively.
+ 2024.wmt-1.78
+ ahmad-etal-2024-arewa
+
+
+ Multimodal Machine Translation for Low-Resource Indic Languages: A Chain-of-Thought Approach Using Large Language Models
+ PawanRajpootSelf
+ NagarajBhatSelf
+ AshishShrivastavaSelf
+ 833-838
+ This paper presents the approach and results of team v036 in the English-to-Low-Resource Multi-Modal Translation Task at the Ninth Conference on Machine Translation (WMT24). Our team tackled the challenge of translating English source text to low-resource Indic languages, specifically Hindi, Malayalam, and Bengali, while leveraging the visual context provided alongside the text data. We used InternVL2 to extract the image context, along with knowledge distillation from larger LLMs, to train a small language model on the translation task. During the current shared task phase, we submitted our best models for this task and achieved rank 3 overall on the Hindi, Bengali, and Malayalam datasets. We also open-source our models on Hugging Face.
+ 2024.wmt-1.79
+ rajpoot-etal-2024-multimodal
+
+
+ Chitranuvad: Adapting Multi-lingual LLMs for Multimodal Translation
+ ShaharukhKhanKrutrim AI
+ AyushTarunKrutrim AI
+ AliFarazKrutrim AI
+ PalashKambleKrutrim AI
+ VivekDahiyaKrutrim AI
+ PraveenPokalaKrutrim AI
+ AshishKulkarniKrutrim
+ ChandraKhatriKrutrim AI
+ AbhinavRaviKrutrim AI
+ ShubhamAgarwalKrutrim AI
+ 839-851
+ In this work, we provide the system description of our submission as part of the English-to-Lowres Multimodal Translation Task at the Workshop on Asian Translation (WAT2024). We introduce Chitranuvad, a multimodal model that effectively integrates a multilingual LLM and a vision module for multimodal translation. Our method uses a ViT image encoder to extract visual representations as visual token embeddings, which are projected to the LLM space by an adapter layer, and generates translations in an autoregressive fashion. We participated in all three tracks (Image Captioning, Text-only, and Multimodal Translation tasks) for Indic languages (i.e., English translation to Hindi, Bengali, and Malayalam) and achieved SOTA results for Hindi in all of them on the Challenge set, while remaining competitive for the other languages in the shared task.
+ 2024.wmt-1.80
+ khan-etal-2024-chitranuvad
+
+
+ Brotherhood at WMT 2024: Leveraging LLM-Generated Contextual Conversations for Cross-Lingual Image Captioning
+ SiddharthBetalaIndian Institute of Technology Madras
+ IshanChokshiIndian Institute of Technology Madras
+ 852-861
+ In this paper, we describe our system under the team name Brotherhood for the English-to-Lowres Multi-Modal Translation Task. We participate in the multi-modal translation tasks for English-Hindi, English-Hausa, English-Bengali, and English-Malayalam language pairs. We present a method leveraging multi-modal Large Language Models (LLMs), specifically GPT-4o and Claude 3.5 Sonnet, to enhance cross-lingual image captioning without traditional training or fine-tuning. Our approach utilizes instruction-tuned prompting to generate rich, contextual conversations about cropped images, using their English captions as additional context. These synthetic conversations are then translated into the target languages. Finally, we employ a weighted prompting strategy, balancing the original English caption with the translated conversation to generate captions in the target language. This method achieved competitive results, scoring 37.90 BLEU on the English-Hindi Challenge Set and ranking first and second for English-Hausa on the Challenge and Evaluation Leaderboards, respectively. We conduct additional experiments on a subset of 250 images, exploring the trade-offs between BLEU scores and semantic similarity across various weighting schemes.
+ 2024.wmt-1.81
+ betala-chokshi-2024-brotherhood
+
+
+ TIM-UNIGE Translation into Low-Resource Languages of Spain for WMT24
+ JonathanMutalUnige
+ LucíaOrmaecheaUniversité de Genève
+ 862-870
+ We present the results of our constrained submission to the WMT 2024 shared task, which focuses on translating from Spanish into two low-resource languages of Spain: Aranese (spa-arn) and Aragonese (spa-arg). Our system integrates real and synthetic data generated by large language models (e.g., BLOOMZ) and rule-based Apertium translation systems. Built upon the pre-trained NLLB system, our translation model utilizes a multistage approach, progressively refining the initial model through the sequential use of different datasets, starting with large-scale synthetic or crawled data and advancing to smaller, high-quality parallel corpora. This approach resulted in BLEU scores of 30.1 for Spanish to Aranese and 61.9 for Spanish to Aragonese.
+ 2024.wmt-1.82
+ mutal-ormaechea-2024-tim
+
+
+ TAN-IBE Participation in the Shared Task: Translation into Low-Resource Languages of Spain
+ AntoniOliverUniversitat Oberta de Catalunya
+ 871-877
+ This paper describes the systems presented by the TAN-IBE team to the WMT24 shared task on Translation into Low-Resource Languages of Spain. The aim of this joint task was to train systems for Spanish-Asturian, Spanish-Aragonese, and Spanish-Aranese. Our team presented systems for all three language pairs and two types of submission: for Spanish-Aragonese and Spanish-Aranese we participated with constrained submissions, and for Spanish-Asturian with an open submission.
+ 2024.wmt-1.83
+ oliver-2024-tan
+
+
+ Enhaced Apertium System: Translation into Low-Resource Languages of Spain Spanish–Asturian
+ SofíaGarcíaimaxin|software
+ 878-884
+ We present the Spanish–Asturian Apertium translation system, which has been enhanced and refined by our team of linguists for the WMT24 shared task on Translation into Low-Resource Languages of Spain, under the closed submission. While our system did not rank among the top 10 in terms of results, we believe that Apertium’s translations are of a commendable standard and demonstrate competitiveness with respect to the other systems.
+ 2024.wmt-1.84
+ garcia-2024-enhaced
+
+
+ Universitat d’Alacant’s Submission to the WMT 2024 Shared Task on Translation into Low-Resource Languages of Spain
+ AaronGaliano JimenezUniversitat d’Alacant
+ Víctor M.Sánchez-CartagenaUniversitat d’Alacant
+ Juan AntonioPerez-OrtizDepartament de Llenguatges i Sistemes Informatics, Universitat d’Alacant
+ FelipeSánchez-MartínezUniversitat d’Alacant
+ 885-891
+ This paper describes the submissions of the Transducens group of the Universitat d’Alacant to the WMT 2024 Shared Task on Translation into Low-Resource Languages of Spain; in particular, the task focuses on the translation from Spanish into Aragonese, Aranese and Asturian. Our submissions use parallel and monolingual data to fine-tune the NLLB-1.3B model and to investigate the effectiveness of synthetic corpora and transfer-learning between related languages such as Catalan, Galician and Valencian. We also present a many-to-many multilingual neural machine translation model focused on the Romance languages of Spain.
+ 2024.wmt-1.85
+ galiano-jimenez-etal-2024-universitat
+
+
+ Samsung R&D Institute Philippines @ WMT 2024 Low-resource Languages of Spain Shared Task
+ Dan JohnVelascoSamsung Research Philippines (SRPH)
+ Manuel AntonioRufinoSamsung Research Philippines (SRPH)
+ Jan Christian BlaiseCruzSamsung Research Philippines (SRPH)
+ 892-900
+ This paper details the submission of Samsung R&D Institute Philippines (SRPH) Language Intelligence Team (LIT) to the WMT 2024 Low-resource Languages of Spain shared task. We trained translation models for Spanish to Aragonese, Spanish to Aranese/Occitan, and Spanish to Asturian using a standard sequence-to-sequence Transformer architecture, augmenting it with a noisy-channel reranking strategy to select better outputs during decoding. For Spanish to Asturian translation, our method reaches comparable BLEU scores to a strong commercial baseline translation system using only constrained data, backtranslations, noisy channel reranking, and a shared vocabulary spanning all four languages.
+ 2024.wmt-1.86
+ velasco-etal-2024-samsung
+
+
+ Back to the Stats: Rescuing Low Resource Neural Machine Translation with Statistical Methods
+ MenanVelayuthanUniversity of Moratuwa
+ DilithJayakodyUniversity of Moratuwa
+ NisansaDe SilvaUniversity of Moratuwa
+ AlokaFernandoUniversity of Moratuwa
+ SurangikaRanathungaMassey University
+ 901-907
+ This paper describes our submission to the WMT24 shared task for Low-Resource Languages of Spain in the Constrained task category. Due to the lack of deep learning-based data filtration methods for these languages, we propose a purely statistical-based, two-stage pipeline for data filtration. In the primary stage, we begin by removing spaces and punctuation from the source sentences (Spanish) and deduplicating them. We then filter out sentence pairs with inconsistent language predictions by the language identification model, followed by the removal of pairs with anomalous sentence length and word count ratios, using the development set statistics as the threshold. In the secondary stage, for corpora of significant size, we employ a Jensen Shannon divergence-based method to curate training data of the desired size. Our filtered data allowed us to complete a two-step training process in under 3 hours, with GPU power consumption kept below 1 kWh, making our system both economical and eco-friendly. The source code, training data, and best models are available on the project’s GitHub page.
+ 2024.wmt-1.87
+ velayuthan-etal-2024-back
+
+
+ Hybrid Distillation from RBMT and NMT: Helsinki-NLP’s Submission to the Shared Task on Translation into Low-Resource Languages of Spain
+ OnaDe GibertUniversity of Helsinki
+ MikkoAulamoUniversity of Helsinki
+ YvesScherrerUniversity of Oslo
+ JörgTiedemannUniversity of Helsinki
+ 908-917
+ The Helsinki-NLP team participated in the 2024 Shared Task on Translation into Low-Resource languages of Spain with four multilingual systems covering all language pairs. The task consists in developing Machine Translation (MT) models to translate from Spanish into Aragonese, Aranese and Asturian. Our models leverage known approaches for multilingual MT, namely, data filtering, fine-tuning, data tagging, and distillation. We use distillation to merge the knowledge from neural and rule-based systems and explore the trade-offs between translation quality and computational efficiency. We demonstrate that our distilled models can achieve competitive results while significantly reducing computational costs. Our best models ranked 4th, 5th, and 2nd in the open submission track for Spanish–Aragonese, Spanish–Aranese, and Spanish–Asturian, respectively. We release our code and data publicly at https://github.com/Helsinki-NLP/lowres-spain-st.
+ 2024.wmt-1.88
+ de-gibert-etal-2024-hybrid
+
+
+ Robustness of Fine-Tuned LLMs for Machine Translation with Varying Noise Levels: Insights for Asturian, Aragonese and Aranese
+ MartinBärUniversity of the Basque Country
+ ElisaForcada RodríguezUniversity of the Basque Country
+ MariaGarcia-AbadilloUniversity of the Basque Country
+ 918-924
+ We present the LCT-LAP proposal for the shared task on Translation into Low-Resource Languages of Spain at WMT24 within the constrained submission category. Our work harnesses encoder-decoder models pretrained on higher-resource Iberian languages to facilitate MT model training for Asturian, Aranese and Aragonese. Furthermore, we explore the robustness of these models when fine-tuned on datasets with varying levels of alignment noise. We fine-tuned a Spanish-Galician model using Asturian data filtered by BLEU score thresholds of 5, 15, 30 and 60, identifying BLEU 15 as the most effective. This threshold was then applied to the Aranese and Aragonese datasets. Our findings indicate that filtering the corpora reduces computational costs and improves performance compared to using nearly raw data or data filtered with language identification. However, it still falls short of the performance achieved by the rule-based system Apertium in Aranese and Aragonese.
+ 2024.wmt-1.89
+ bar-etal-2024-robustness
+
+
+ Training and Fine-Tuning NMT Models for Low-Resource Languages Using Apertium-Based Synthetic Corpora
+ AleixSantBarcelona Supercomputing Center
+ DanielBardancaCITIUS
+ José RamomPichel CamposCITIUS
+ FrancescaDe Luca FornaciariBSC Barcelona Supercomputing Center
+ CarlosEscolanoUniversitat Politècnica de Catalunya, Barcelona Supercomputing Center
+ JavierGarcia GilabertBarcelona Super Computing Center
+ PabloGamalloCITIUS, University of Santiago de Compostela
+ AudreyMashBSC
+ XixianLiaoBarcelona Supercomputing Center
+ MaiteMeleroBSC
+ 925-933
+ In this paper, we present the two strategies employed for the WMT24 Shared Task on Translation into Low-Resource Languages of Spain. We participated in the language pairs of Spanish-to-Aragonese, Spanish-to-Aranese, and Spanish-to-Asturian, developing neural-based translation systems and moving away from rule-based approaches for these language directions. To create these models, two distinct strategies were employed. The first strategy involved a thorough cleaning process and curation of the limited provided data, followed by fine-tuning the multilingual NLLB-200-600M model (Constrained Submission). The other strategy involved training a transformer from scratch using a vast amount of synthetic data (Open Submission). Both approaches relied on generated synthetic data and resulted in high ChrF and BLEU scores. However, given the characteristics of the task, the strategy used in the Constrained Submission resulted in higher scores that surpassed the baselines across the three translation directions, whereas the strategy employed in the Open Submission yielded slightly lower scores than the highest baseline.
+ 2024.wmt-1.90
+ sant-etal-2024-training
+
+
+ Vicomtech@WMT 2024: Shared Task on Translation into Low-Resource Languages of Spain
+ DavidPonceVicomtech
+ HarritxuGeteVicomtech
+ ThierryEtchegoyhenVicomtech
+ 934-942
+ We describe Vicomtech’s participation in the WMT 2024 Shared Task on translation into low-resource languages of Spain. We addressed all three languages of the task, namely Aragonese, Aranese and Asturian, in both constrained and open settings. Our work mainly centred on exploiting different types of corpora via data filtering, selection and combination methods, along with synthetic data generated with translation models based on rules, neural sequence-to-sequence or large language models. We improved or matched the best baselines in all three language pairs and present complementary results on additional test sets.
+ 2024.wmt-1.91
+ ponce-etal-2024-vicomtech
+
+
+ SJTU System Description for the WMT24 Low-Resource Languages of Spain Task
+ TianxiangHuShanghai Jiao Tong University
+ HaoxiangSunShanghai Jiao Tong University
+ RuizeGaoSJTU
+ JialongTangInstitute of Software, Chinese Academy of Sciences
+ PeiZhangAlibaba-inc
+ BaosongYangAlibaba Damo Academy, Alibaba Inc.
+ RuiWangShanghai Jiao Tong University
+ 943-948
+ We participate in the translation task on Spanish to Aragonese, Spanish to Aranese and Spanish to Asturian. Initially, we conduct preliminary experiments to assess the basic translation capabilities of various models and evaluate the impact of fine-tuning with different data types. We then choose to fine-tune the Qwen2-0.5B model using a forward synthesized pseudo-corpus from the Apertium translation system to replicate its fundamental performance. Building on this distillation model, we explore three optimization strategies across the three language directions: (1) Assembling the provided FLORES+ dev sets into a 5-shot format translation training dataset and performing few-shot fine-tuning to enhance model performance. (2) Utilizing the FLORES+ dev sets as training data and applying the Contrastive Preference Optimization (CPO) strategy for further refinement. (3) Retrieving the 20 most similar translation examples from the FLORES+ dev sets using the BM25 algorithm and performing 20-shot translations with the Claude 3.5-sonnet model. After evaluating these strategies, we select the best-performing approach for each language pair as our submission result.
+ 2024.wmt-1.92
+ hu-etal-2024-sjtu
+
+
+ Multilingual Transfer and Domain Adaptation for Low-Resource Languages of Spain
+ YuanchangLuoHuawei Translation Services Center
+ ZhanglinWuHuawei Technologies Co., Ltd.
+ DaimengWeiHuawei Technologies Co., Ltd.
+ HengchaoShangHuawei Technologies Co., Ltd.
+ ZongyaoLiHuawei Translation Services Center
+ JiaxinGuoHuawei Translation Services Center
+ ZhiqiangRaoHuawei Translation Service Center, Beijing, China
+ ShaojunLiHuawei Technologies Co., Ltd.
+ JinlongYangHuawei Technologies Co., Ltd
+ YuhaoXieHW-TSC
+ ZhengJiaweiHuawei Translation Services Center
+ BinWeiHuawei Translation Services Center
+ HaoYangHuawei Co. Ltd
+ 949-954
+ This article describes the submission of Huawei Translation Service Center (HW-TSC) to the WMT 2024 shared task on Translation into Low-Resource Languages of Spain. We participated in three translation tasks: Spanish to Aragonese (es2arg), Spanish to Aranese (es2arn), and Spanish to Asturian (es2ast). For these three tasks, we applied training strategies such as multilingual transfer, regularized dropout, forward translation and back-translation, LaBSE denoising, and transduction ensemble learning to a neural machine translation (NMT) model based on the deep Transformer-big architecture. With these enhancement strategies, our submission achieved a competitive result in the final evaluation.
+ 2024.wmt-1.93
+ luo-etal-2024-multilingual
+
+
+ TRIBBLE - TRanslating IBerian languages Based on Limited E-resources
+ IgorKuzminUniversitat Pompeu Fabra
+ PiotrPrzybyłaUniversitat Pompeu Fabra
+ EuanMcgillUniversitat Pompeu Fabra
+ HoracioSaggionUniversitat Pompeu Fabra
+ 955-959
+ In this short overview paper, we describe our system submission for the language pairs Spanish to Aragonese (spa-arg), Spanish to Aranese (spa-arn), and Spanish to Asturian (spa-ast). We train a unified model for all language pairs in the constrained scenario. We take the distilled NLLB-200 model with 600M parameters and extend its special tokens with two language control tokens that denote the target languages (arn_Latn, arg_Latn), as Asturian was already present in the NLLB-200 model. We adapt the model by training on a special regime of data augmentation with both monolingual and bilingual training data for the language pairs in this challenge.
+ 2024.wmt-1.94
+ kuzmin-etal-2024-tribble
+
+
+ CloudSheep System for WMT24 Discourse-Level Literary Translation
+ LisaLiuUniversity of California, San Diego
+ RyanLiuUniversity of California, San Diego
+ AngelaTsaiUniversity of California, San Diego
+ JingboShangUniversity of California, San Diego
+ 960-966
+ This paper describes the CloudSheep translation system for the WMT24 Discourse-Level Literary Translation shared task. We participated in the Chinese-English direction on the unconstrained track. Our approach to the task used a pipeline of different tools in order to maximize the translation accuracy and flow of the text by combining the strengths of each tool. In particular, our focus was to translate names consistently and idioms correctly. To achieve consistent names throughout a text, a custom name dictionary was generated for each text, containing person and place names, along with their translations. A common honorific dictionary was applied for consistency with titles, especially in historical or cultivation novels. The names were found and translated with GPT-3.5-turbo. To achieve accurate and concise translations of idioms, which are often translated literally and verbosely, we integrated the CC-CEDICT library to provide official definitions. Then, we used GPT-4 to pick the best dictionary definition that fit the context and rephrase it to fit grammatically within a sentence. For the translation of non-name and non-idiom terms, we used Google Translate. We compared our approach’s performance with Google Translate as a baseline using BLEU, chrF, and COMET, as well as A/B testing.
+ 2024.wmt-1.95
+ liu-etal-2024-cloudsheep
+
+
+ Final Submission of SJTULoveFiction to Literary Task
+ HaoxiangSunShanghai Jiao Tong University
+ TianxiangHuShanghai Jiao Tong University
+ RuizeGaoSJTU
+ JialongTangInstitute of Software, Chinese Academy of Sciences
+ PeiZhangAlibaba-inc
+ BaosongYangAlibaba Damo Academy, Alibaba Inc.
+ RuiWangShanghai Jiao Tong University
+ 967-972
+ This paper describes the Shanghai Jiao Tong University (SJTU LoveFiction) Discourse-Level Literary Translation systems for the WMT24 shared task. We participate in the literary translation task on Chinese→English, Chinese→German, and Chinese→Russian in the unconstrained track. Please refer to our paper for details.
+ 2024.wmt-1.96
+ sun-etal-2024-final
+
+
+ Context-aware and Style-related Incremental Decoding Framework for Discourse-Level Literary Translation
+ YuanchangLuoHuawei Translation Services Center
+ JiaxinGuoHuawei Translation Services Center
+ DaimengWeiHuawei Technologies Co., Ltd.
+ HengchaoShangHuawei Technologies Co., Ltd.
+ ZongyaoLiHuawei Translation Services Center
+ ZhanglinWuHuawei Technologies Co., Ltd.
+ ZhiqiangRaoHuawei Translation Service Center, Beijing, China
+ ShaojunLiHuawei Technologies Co., Ltd.
+ JinlongYangHuawei Technologies Co., Ltd
+ HaoYangHuawei Co. Ltd
+ 973-979
+ This report outlines our approach for the WMT24 Discourse-Level Literary Translation Task, focusing on the Chinese-English language pair in the Constrained Track. Translating literary texts poses significant challenges due to the nuanced meanings, idiomatic expressions, and intricate narrative structures inherent in such works. To address these challenges, we leveraged the Chinese-Llama2 model, specifically enhanced for this task through a combination of Continual Pre-training (CPT) and Supervised Fine-Tuning (SFT). Our methodology includes a novel Incremental Decoding framework, which ensures that each sentence is translated with consideration of its broader context, maintaining coherence and consistency throughout the text. This approach allows the model to capture long-range dependencies and stylistic elements, producing translations that faithfully preserve the original literary quality. Our experiments demonstrate significant improvements in both sentence-level and document-level BLEU scores, underscoring the effectiveness of our proposed framework in addressing the complexities of document-level literary translation.
+ 2024.wmt-1.97
+ luo-etal-2024-context
+
+
+ NovelTrans: System for WMT24 Discourse-Level Literary Translation
+ YuchenLiuNLP2CT, University of Macau
+ YutongYaoUniversity of Macau
+ RunzheZhanUniversity of Macau
+ YuchuLinDeepTranx
+ Derek F.WongUniversity of Macau
+ 980-986
+ This paper describes our submission system, NovelTrans, from NLP²CT and DeepTranx for the WMT24 Discourse-Level Literary Translation Task in Chinese-English, Chinese-German, and Chinese-Russian language pairs under unconstrained conditions. For our primary system, three translations are done by GPT-4o using three different settings of additional information and a terminology table generated by online models. The final result is composed of sentences that have the highest xCOMET score compared with the corresponding sentences in other results. Our system achieved an xCOMET score of 79.14, which is higher than performing a direct chapter-level translation on our dataset.
+ 2024.wmt-1.98
+ liu-etal-2024-noveltrans
+
+
+ LinChance-NTU for Unconstrained WMT2024 Literary Translation
+ KechenLiJiangsu Linchance Technology Co., Ltd.
+ YaotianTaoJiangsu Linchance Technology Co., Ltd.
+ HongyiHuangJiangsu Linchance Technology Co., Ltd.
+ TianboJiNantong University
+ 987-992
+ The rapid growth of deep learning has spurred significant advancements across industries, particularly in machine translation through large language models (LLMs). However, translating literary texts still presents challenges, including cross-cultural nuances, complex language structures, metaphorical expressions, and cultural differences. To address these issues, this study utilizes the Llama and Phi models using both LoRA and full-parameter techniques, alongside a prompt-based translation system. Full-parameter tuning of the Llama-3-Chinese-8B-Instruct model was unsuccessful due to memory constraints. In terms of the WMT task, the fully fine-tuned Phi 3 model was selected for submission due to its more natural and fluent translations. Nonetheless, results showed that LoRA and the prompt-based system significantly improved the Llama3 model’s performance, surpassing other models in BLEU and ROUGE evaluations.
+ 2024.wmt-1.99
+ li-etal-2024-linchance
+
+
+ Improving Context Usage for Translating Bilingual Customer Support Chat with Large Language Models
+ JosePombalUnbabel
+ SwetaAgrawalInstituto de Telecomunicações
+ AndréMartinsUnbabel, Instituto de Telecomunicacoes
+ 993-1003
+ This paper describes Unbabel+IT’s submission to the Chat Shared Task held at the Workshop of Machine Translation 2024. The task focuses on translating customer support chats between agents and customers communicating in different languages. We present two strategies for adapting state-of-the-art language models to better utilize contextual information when translating such conversations. Our training strategy involves finetuning the model on chat datasets with context-augmented instructions, resulting in a specialized model, TOWERCHAT. For inference, we propose a novel quality-aware decoding approach that leverages a context-aware metric, CONTEXTCOMET, to select the optimal translation from a pool of candidates. We evaluate our proposed approach on the official shared task datasets for ten language pairs, showing that our submission consistently outperforms baselines on all language pairs, and competing systems on 8 out of 10, across multiple automated metrics. Remarkably, TOWERCHAT outperforms our contrastive submission based on the much larger TOWER-V2-70B model while being 10× smaller. According to human evaluation, our system outperforms all other systems and baselines across all language pairs. These results underscore the importance of context-aware training and inference in handling complex bilingual dialogues.
+ 2024.wmt-1.100
+ pombal-etal-2024-improving
+
+
+ Optimising LLM-Driven Machine Translation with Context-Aware Sliding Windows
+ XinyeYangThe University of Sheffield
+ YidaMuThe University of Sheffield
+ KalinaBontchevaThe University of Sheffield
+ XingyiSongUniversity of Sheffield
+ 1004-1010
+ This paper describes SheffieldGATE’s submission to the WMT 2024 Chat Shared Translation Task. We participate in three language pairs: English-German, English-Dutch, and English-Portuguese (Brazil). In this work, we introduce a context-aware sliding window decoding method to track dependencies between chat messages. We fine-tune a large pre-trained language model based on the training data provided by the shared task. Our experiments (i) compare the model performance between multilingual and bilingual fine-tuning and (ii) assess the impact of different window sizes. Our experimental results demonstrate that utilising contextual information yields superior performance in document-level translation compared to translating documents as isolated text segments, and that models fine-tuned with multilingual data perform better than those fine-tuned with bilingual data.
+ 2024.wmt-1.101
+ yang-etal-2024-optimising
+
+
+ Context-Aware LLM Translation System Using Conversation Summarization and Dialogue History
+ MingiSungYonsei University
+ SeungminLeeYonsei University
+ JiwonKimYonsei University
+ SejoonKimYonsei University, PwC Korea
+ 1011-1015
+ Translating conversational text, particularly in customer support contexts, presents unique challenges due to its informal and unstructured nature. We propose a context-aware LLM translation system that leverages conversation summarization and dialogue history to enhance translation quality for the English-Korean language pair. Our approach incorporates the two most recent dialogues as raw data and a summary of earlier conversations to manage context length effectively. We demonstrate that this method significantly improves translation accuracy, maintaining coherence and consistency across conversations. This system offers a practical solution for customer support translation tasks, addressing the complexities of conversational text.
+ 2024.wmt-1.102
+ sung-etal-2024-context
+
+
+ Enhancing Translation Quality: A Comparative Study of Fine-Tuning and Prompt Engineering in Dialog-Oriented Machine Translation Systems. Insights from the MULTITAN-GML Team
+ LichaoZhuParis Cité University
+ MariaZiminaCLILLAC-ARP, Paris Diderot
+ BehnooshNamdarzadehUniversité Paris Cité
+ NicolasBallierUniversité de Paris
+ Jean-BaptisteYunèsUniversité Paris Cité
+ 1016-1022
+ For this shared task, we used several machine translation engines to produce translations (en ⇔ fr), fine-tuning a dialog-oriented NMT engine and having NMT baseline translations post-edited with prompt engineering. Our objectives are to test the effectiveness of a fine-tuning strategy with the help of a robust NMT model, to outline a from-translation-to-post-editing pipeline, and to evaluate the strong and weak points of NMT systems.
+ 2024.wmt-1.103
+ zhu-etal-2024-enhancing
+
+
+ The SETU-ADAPT Submissions to WMT 2024 Chat Translation Tasks
+ MariaZafarSouth East Technological University
+ AntonioCastaldoUniversity of Naples “L’Orientale”
+ PrashanthNayakKantanAI
+ RejwanulHaqueSouth East Technological University
+ AndyWayDublin City University
+ 1023-1030
+ This paper presents the SETU-ADAPT submissions to the WMT24 Chat Translation Task. Large language models (LLMs) currently provide state-of-the-art solutions to many natural language processing (NLP) problems, including machine translation (MT). For the WMT24 Chat Translation Task, we leveraged LLMs for their MT capabilities. In order to adapt the LLMs to a specific domain of interest, we explored different fine-tuning and prompting strategies. We also employed efficient data retrieval methods to curate the data used for fine-tuning. We carried out experiments for two language pairs: German-to-English and French-to-English. Our MT models were evaluated using three metrics: BLEU, chrF and COMET. In this paper we describe our experiments, including training setups, results and findings.
+ 2024.wmt-1.104
+ zafar-etal-2024-setu-adapt
+
+
+ Exploring the Traditional NMT Model and Large Language Model for Chat Translation
+ JinlongYangHuawei Technologies Co., Ltd
+ HengchaoShangHuawei Technologies Co., Ltd.
+ DaimengWeiHuawei Technologies Co., Ltd.
+ JiaxinGuoHuawei Translation Services Center
+ ZongyaoLiHuawei Translation Services Center
+ ZhanglinWuHuawei Technologies Co., Ltd.
+ ZhiqiangRaoHuawei Translation Service Center, Beijing, China
+ ShaojunLiHuawei Technologies Co., Ltd.
+ YuhaoXieHW-TSC
+ YuanchangLuoHuawei Translation Services Center
+ ZhengJiaweiHuawei Translation Services Center
+ BinWeiHuawei Translation Services Center
+ HaoYangHuawei Co. Ltd
+ 1031-1037
+ This paper describes the submissions of Huawei Translation Services Center (HW-TSC) to the WMT24 chat translation shared task on the bidirectional English↔German (en-de) language pair. The experiments involved fine-tuning models using chat data and exploring various strategies, including Minimum Bayes Risk (MBR) decoding and self-training. The results show significant performance improvements in certain directions, with the MBR self-training method achieving the best results. The paper also discusses the challenges and potential avenues for further research in the field of chat translation.
+ 2024.wmt-1.105
+ yang-etal-2024-exploring-traditional
+
+
+ Graph Representations for Machine Translation in Dialogue Settings
+ LeaKrauseVrije Universiteit Amsterdam
+ SeleneBaez SantamariaVrije Universiteit Amsterdam
+ Jan-ChristophKaloUniversity of Amsterdam
+ 1038-1046
+ In this paper, we present our approach to the WMT24 Chat Task, addressing the challenge of translating chat conversations. Chat conversations are characterised by their informal, ungrammatical nature and strong reliance on context, posing significant challenges for machine translation systems. To address these challenges, we augment large language models with explicit memory mechanisms designed to enhance coherence and consistency across dialogues. Specifically, we employ graph representations to capture and utilise dialogue context, leveraging concept connectivity as a compressed memory. Our approach ranked second for Dutch and French, and third for Portuguese and German, based on COMET-22 scores and human evaluation.
+ 2024.wmt-1.106
+ krause-etal-2024-graph
+
+
+ Reducing Redundancy in Japanese-to-English Translation: A Multi-Pipeline Approach for Translating Repeated Elements in Japanese
+ QiaoWangWaseda University
+ YixuanHuangGraduate School of International Culture and Communication Studies Waseda University
+ ZhengYuanKing’s College London
+ 1047-1055
+ This paper presents a multi-pipeline Japanese-to-English machine translation (MT) system designed to address the challenge of translating repeated elements from Japanese into fluent and lexically diverse English. The system is developed as part of the Non-Repetitive Translation Task at WMT24, which focuses on minimizing redundancy while maintaining high translation quality. Our approach utilizes MeCab, the de facto NLP tool for Japanese, for the identification of repeated elements, and Claude Sonnet 3.5, a large language model (LLM), for translation and proofreading. The system effectively accomplishes the shared task by identifying and translating in a diversified manner 89.79% of the 470 repeated instances in the testing dataset, and achieving an average translation quality score of 4.60 out of 5, significantly surpassing the baseline score of 3.88. Analysis also revealed the challenges encountered, particularly in identifying standalone noun-suffix elements and occasional cases of consistent translations or mistranslations.
+ 2024.wmt-1.107
+ wang-etal-2024-reducing
+
+
+ SYSTRAN @ WMT24 Non-Repetitive Translation Task
+ MarkoAvilaCHAPSVISION
+ JosepCregoCHAPSVISION
+ 1056-1062
+ Many contemporary NLP systems rely on neural decoders for text generation, which demonstrate an impressive ability to generate text approaching human fluency levels. However, in the case of neural machine translation networks, they often grapple with the production of repetitive content, also known as repetitive diction or word repetition, an aspect they weren’t explicitly trained to address. While not inherently negative, this repetition can make writing seem monotonous or awkward if not used intentionally for emphasis or stylistic purposes. This paper presents our submission to the WMT 2024 Non-Repetitive Translation Task, for which we adopt a repetition penalty method applied at learning inspired by the principles of label smoothing. No additional work is needed at inference time. We modify the ground-truth distribution to steer the model towards discouraging repetitions. Experiments show the ability of the proposed methods in reducing repetitions within neural machine translation engines, without compromising efficiency or translation quality.
+ 2024.wmt-1.108
+ avila-crego-2024-systran
+
+
+ Mitigating Metric Bias in Minimum Bayes Risk Decoding
+ GezaKovacsGoogle
+ DanielDeutschGoogle
+ MarkusFreitagGoogle Research
+ 1063-1094
+ While Minimum Bayes Risk (MBR) decoding using metrics such as COMET or MetricX has outperformed traditional decoding methods such as greedy or beam search, it introduces a challenge we refer to as metric bias. As MBR decoding aims to produce translations that score highly according to a specific utility metric, this very process makes it impossible to use the same metric for both decoding and evaluation, as any improvement might simply be due to reward hacking rather than reflecting real quality improvements. In this work we demonstrate that compared to human ratings, neural metrics not only overestimate the quality of MBR decoding when the same metric is used as the utility metric, but they also overestimate the quality of MBR/QE decoding with other neural utility metrics as well. We also show that the metric bias issue can be mitigated by using an ensemble of utility metrics during MBR decoding: human evaluations show that MBR decoding using an ensemble of utility metrics outperforms a single utility metric.
+ 2024.wmt-1.109
+ kovacs-etal-2024-mitigating
+
+
+ Beyond Human-Only: Evaluating Human-Machine Collaboration for Collecting High-Quality Translation Data
+ ZhongtaoLiuGoogle
+ ParkerRileyGoogle Translate
+ DanielDeutschGoogle
+ AlisonLuiGoogle Translate
+ MengmengNiuGoogle Translate
+ ApurvaShahGoogle Inc
+ MarkusFreitagGoogle Research
+ 1095-1106
+ Collecting high-quality translations is crucial for the development and evaluation of machine translation systems. However, traditional human-only approaches are costly and slow. This study presents a comprehensive investigation of 11 approaches for acquiring translation data, including human-only, machine-only, and hybrid approaches. Our findings demonstrate that human-machine collaboration can match or even exceed the quality of human-only translations, while being more cost-efficient. Error analysis reveals the complementary strengths between human and machine contributions, highlighting the effectiveness of collaborative methods. Cost analysis further demonstrates the economic benefits of human-machine collaboration methods, with some approaches achieving top-tier quality at around 60% of the cost of traditional methods. We release a publicly available dataset containing nearly 18,000 segments of varying translation quality with corresponding human ratings to facilitate future research.
+ 2024.wmt-1.110
+ liu-etal-2024-beyond-human
+
+
+ How Effective Are State Space Models for Machine Translation?
+ HugoPitorroTechnical University of Munich
+ PavloVasylenkoSapienza University of Rome
+ MarcosTrevisoInstituto de Telecomunicacoes
+ AndréMartinsUnbabel, Instituto de Telecomunicacoes
+ 1107-1124
+ Transformers are the current architecture of choice for NLP, but their attention layers do not scale well to long contexts. Recent works propose to replace attention with linear recurrent layers - this is the case for state space models, which enjoy efficient training and inference. However, it remains unclear whether these models are competitive with transformers in machine translation (MT). In this paper, we provide a rigorous and comprehensive experimental comparison between transformers and linear recurrent models for MT. Concretely, we experiment with RetNet, Mamba, and hybrid versions of Mamba which incorporate attention mechanisms. Our findings demonstrate that Mamba is highly competitive with transformers on sentence and paragraph-level datasets, where in the latter both models benefit from shifting the training distribution towards longer sequences. Further analysis shows that integrating attention into Mamba improves translation quality, robustness to sequence length extrapolation, and the ability to recall named entities.
+ 2024.wmt-1.111
+ pitorro-etal-2024-effective
+
+
+ Evaluation and Large-scale Training for Contextual Machine Translation
+ MattPostMicrosoft
+ MarcinJunczys-DowmuntMicrosoft
+ 1125-1139
+ Despite the fact that context is known to be vital for resolving a range of translation ambiguities, most traditional machine translation systems continue to be trained and to operate at the sentence level. A common explanation is the lack of document-level annotations for existing training data. This work investigates whether having such annotations would be helpful for training traditional MT systems at scale. We build large-scale, state-of-the-art contextual MT systems into German, French, and Russian, fixing the datasets while comparing the effect of sourcing contextual training samples from both parallel and back-translated data. We then evaluate these contextual models across a range of contextual test sets from the literature, where we find that (a) document annotations from both mined parallel and back-translated monolingual data are helpful, but that the best contextual MT systems do not draw contextual samples from the parallel data. We also make two points related to evaluation: (b) contrastive score-based metrics on challenge sets are not discriminative; instead, models must be tested directly on their ability to generate correct outputs, and (c) standard corpus-level metrics such as COMET work best in settings that are dense in contextual phenomena.
+ 2024.wmt-1.112
+ post-junczys-dowmunt-2024-evaluation
+
+
+ A Multi-task Learning Framework for Evaluating Machine Translation of Emotion-loaded User-generated Content
+ ShenbinQianUniversity of Surrey
+ ConstantinOrasanUniversity of Surrey
+ DipteshKanojiaUniversity of Surrey
+ FélixDo CarmoUniversity of Surrey
+ 1140-1154
+ Machine translation (MT) of user-generated content (UGC) poses unique challenges, including handling slang, emotion, and literary devices like irony and sarcasm. Evaluating the quality of these translations is challenging as current metrics do not focus on these ubiquitous features of UGC. To address this issue, we utilize an existing emotion-related dataset that includes emotion labels and human-annotated translation errors based on Multi-dimensional Quality Metrics. We extend it with sentence-level evaluation scores and word-level labels, leading to a dataset suitable for sentence- and word-level translation evaluation and emotion classification, in a multi-task setting. We propose a new architecture to perform these tasks concurrently, with a novel combined loss function, which integrates different loss heuristics, like the Nash and Aligned losses. Our evaluation compares existing fine-tuning and multi-task learning approaches, assessing generalization with ablative experiments over multiple datasets. Our approach achieves state-of-the-art performance and we present a comprehensive analysis for MT evaluation of UGC.
+ 2024.wmt-1.113
+ qian-etal-2024-multi
+
+
+ On Instruction-Finetuning Neural Machine Translation Models
+ VikasRaunakMicrosoft
+ RomanGrundkiewiczMicrosoft Research
+ MarcinJunczys-DowmuntMicrosoft
+ 1155-1166
+ In this work, we introduce instruction finetuning for Neural Machine Translation (NMT) models, which distills instruction following capabilities from Large Language Models (LLMs) into orders-of-magnitude smaller NMT models. Our instruction-finetuning recipe for NMT models enables customization of translations for a limited but disparate set of translation-specific tasks. We show that NMT models are capable of following multiple instructions simultaneously and demonstrate capabilities of zero-shot composition of instructions. We also show that through instruction finetuning, traditionally disparate tasks such as formality-controlled machine translation, multi-domain adaptation as well as multi-modal translations can be tackled jointly by a single instruction finetuned NMT model, at a performance level comparable to LLMs such as GPT-3.5-Turbo. To the best of our knowledge, our work is among the first to demonstrate the instruction-following capabilities of traditional NMT models, which allows for faster, cheaper and more efficient serving of customized translations.
+ 2024.wmt-1.114
+ raunak-etal-2024-instruction
+
+
+ Benchmarking Visually-Situated Translation of Text in Natural Images
+ ElizabethSaleskyJohns Hopkins University
+ PhilippKoehnJohns Hopkins University
+ MattPostMicrosoft
+ 1167-1182
+ We introduce a benchmark, Vistra, for visually-situated translation of English text in natural images into four target languages. We describe the dataset construction and composition. We benchmark open-source and commercial OCR and MT models on Vistra, and present both quantitative results and a taxonomy of common OCR error classes with their effect on downstream MT. Finally, we assess direct image-to-text translation with a multimodal LLM, and show that it is able, in some cases but not yet consistently, to disambiguate possible translations with visual context. We show that this is an unsolved and challenging task even for strong commercial models. We hope that the creation and release of this benchmark, which is the first of its kind for these language pairs, will encourage further research in this direction.
+ 2024.wmt-1.115
+ salesky-etal-2024-benchmarking
+
+
+ Analysing Translation Artifacts: A Comparative Study of LLMs, NMTs, and Human Translations
+ FedorSizovSaarland University
+ CristinaEspaña-BonetDFKI GmbH
+ JosefVan GenabithDFKI
+ RoyXieDuke University
+ KoelDutta ChowdhurySaarland Informatics Campus,Saarland University
+ 1183-1199
+ Translated texts exhibit a range of characteristics that make them appear distinct from texts originally written in the same target language. With the rise of Large Language Models (LLMs), which are designed for a wide range of language generation and understanding tasks, there has been significant interest in their application to Machine Translation. While several studies have focused on improving translation quality through fine-tuning or few-shot prompting techniques, there has been limited exploration of how LLM-generated translations qualitatively differ from those produced by Neural Machine Translation (NMT) models, and human translations. Our study employs explainability methods such as Leave-One-Out (LOO) and Integrated Gradients (IG) to analyze the lexical features distinguishing human translations from those produced by LLMs and NMT systems. Specifically, we apply a two-stage approach: first, classifying texts based on their origin – whether they are original or translations – and second, extracting significant lexical features (highly attributed input words) using post-hoc interpretability methods. Our analysis shows that different methods of feature extraction vary in their effectiveness, with LOO being generally better at pinpointing critical input words and IG capturing a broader range of important words. Finally, our results show that while LLMs and NMT systems can produce translations of a good quality, they still differ from texts originally written by native speakers. Specifically, we find that while some LLMs often align closely with human translations, traditional NMT systems exhibit distinct characteristics, particularly in their use of certain linguistic features.
+ 2024.wmt-1.116
+ sizov-etal-2024-analysing
+
+
+ How Grammatical Features Impact Machine Translation: A New Test Suite for Chinese-English MT Evaluation
+ HuachengSongThe Hong Kong Polytechnic University
+ YiLiShanghai International Studies University
+ YiwenWuNanyang Technological University
+ YuLiuNanyang Technological University
+ JingxiaLinNanyang Technological University
+ HongzhiXuShanghai International Studies University
+ 1200-1221
+ Machine translation (MT) evaluation has evolved toward a trend of fine-grained granularity, enabling a more precise diagnosis of hidden flaws and weaknesses of MT systems from various perspectives. This paper examines how MT systems are potentially affected by certain grammatical features, offering insights into the challenges these features pose and suggesting possible directions for improvement. We develop a new test suite by extracting 7,848 sentences from a multi-domain Chinese-English parallel corpus. All the Chinese text was further annotated with 43 grammatical features using a semi-automatic method. This test suite was subsequently used to evaluate eight state-of-the-art MT systems according to six different automatic evaluation metrics. The results reveal intriguing patterns of MT performance associated with different domains and various grammatical features, highlighting the test suite’s effectiveness. The test suite is publicly available and will serve as an important benchmark for evaluating and diagnosing Chinese-English MT systems.
+ 2024.wmt-1.117
+ song-etal-2024-grammatical
+
+
+ Improving Statistical Significance in Human Evaluation of Automatic Metrics via Soft Pairwise Accuracy
+ BrianThompsonAmazon
+ NitikaMathurThe University of Melbourne
+ DanielDeutschGoogle
+ HudaKhayrallahMicrosoft
+ 1222-1234
+ Selecting an automatic metric that best emulates human annotators is often non-trivial, because there is no clear definition of “best emulates.” A meta-metric is required to compare the human judgments to the automatic metric scores, and metric rankings depend on the choice of meta-metric. We propose Soft Pairwise Accuracy (SPA), a new meta-metric that builds on Pairwise Accuracy (PA) but incorporates the statistical significance of both the human judgments and the metric scores. We show that SPA is more stable than PA with respect to changes in the number of systems/segments used for evaluation. We also show that PA can only assign a small set of distinct output values to metrics, and this results in many metrics being artificially assigned the exact same PA score. We demonstrate that SPA fixes this issue. Finally, we show that SPA is more discriminative than PA, producing more statistically significant comparisons between metrics. SPA was selected as the official system-level metric for the 2024 WMT Metrics Shared Task.
+ 2024.wmt-1.118
+ thompson-etal-2024-improving
+
+
+ Speech Is More than Words: Do Speech-to-Text Translation Systems Leverage Prosody?
+ IoannisTsiamasPolytechnic University of Catalonia (UPC)
+ MatthiasSperberApple
+ AndrewFinchApple Inc.
+ SarthakGargApple
+ 1235-1257
+ The prosody of a spoken utterance, including features like stress, intonation and rhythm, can significantly affect the underlying semantics, and as a consequence can also affect its textual translation. Nevertheless, prosody is rarely studied within the context of speech-to-text translation (S2TT) systems. In particular, end-to-end (E2E) systems have been proposed as well-suited for prosody-aware translation because they have direct access to the speech signal when making translation decisions, but the understanding of whether this is successful in practice is still limited. A main challenge is the difficulty of evaluating prosody awareness in translation. To address this challenge, we introduce an evaluation methodology and a focused benchmark (named ContraProSt) aimed at capturing a wide range of prosodic phenomena. Our methodology uses large language models and controllable text-to-speech (TTS) to generate contrastive examples. Through experiments in translating English speech into German, Spanish, and Japanese, we find that (a) S2TT models possess some internal representation of prosody, but the prosody signal is often not strong enough to affect the translations, (b) E2E systems outperform cascades of speech recognition and text translation systems, confirming their theoretical advantage in this regard, and (c) certain cascaded systems also capture prosodic information in the translation, but only to a lesser extent, depending on the particulars of the transcript’s surface form.
+ 2024.wmt-1.119
+ tsiamas-etal-2024-speech
+
+
+ Cultural Adaptation of Menus: A Fine-Grained Approach
+ ZhongheZhangUniversity of Edinburgh
+ XiaoyuHeUniversity of Edinburgh
+ VivekIyerThe University of Edinburgh
+ AlexandraBirchUniversity of Edinburgh
+ 1258-1271
+ Machine Translation of Culture-Specific Items (CSIs) poses significant challenges. Recent work on CSI translation has shown some success using Large Language Models (LLMs) to adapt to different languages and cultures; however, a deeper analysis is needed to examine the benefits and pitfalls of each method. In this paper, we introduce the ChineseMenuCSI dataset, the largest Chinese-English menu corpus, annotated with CSI vs. non-CSI labels and a fine-grained test set. We define three levels of CSI figurativeness for a more nuanced analysis and develop a novel methodology for automatic CSI identification, which outperforms GPT-based prompts in most categories. Importantly, we are the first to integrate human translation theories into LLM-driven translation processes, significantly improving translation accuracy, with COMET scores increasing by up to 7 points. The code and dataset are available at https://github.com/Henry8772/ChineseMenuCSI.
+ 2024.wmt-1.120
+ zhang-etal-2024-cultural
+
+
+ Pitfalls and Outlooks in Using COMET
+ VilémZouharETH Zurich, Charles University
+ PinzhenChenUniversity of Edinburgh
+ Tsz KinLamThe University of Edinburgh
+ NikitaMogheUniversity of Edinburgh
+ BarryHaddowUniversity of Edinburgh
+ 1272-1288
+ The COMET metric has blazed a trail in the machine translation community, given its strong correlation with human judgements of translation quality. Its success stems from being a modified pre-trained multilingual model finetuned for quality assessment. However, being a machine learning model, it also gives rise to a new set of pitfalls that may not be widely known. We investigate these unexpected behaviours from three aspects: 1) technical: obsolete software versions and compute precision; 2) data: empty content, language mismatch, and translationese at test time as well as distribution and domain biases in training; 3) usage and reporting: multi-reference support and model referencing in the literature. All of these problems imply that COMET scores are not comparable between papers or even technical setups, and we put forward our perspective on fixing each issue. Furthermore, we release the sacreCOMET package that can generate a signature for the software and model configuration as well as an appropriate citation. The goal of this work is to help the community make more sound use of the COMET metric.
+ 2024.wmt-1.121
+ zouhar-etal-2024-pitfalls
+
+
+ Post-edits Are Preferences Too
+ NathanielBergerHeidelberg University
+ StefanRiezlerHeidelberg University
+ MiriamExelSAP SE
+ MatthiasHuckSAP SE
+ 1289-1300
+ Preference Optimization (PO) is currently one of the state-of-the-art techniques for fine-tuning large language models (LLMs) on pairwise preference feedback from human annotators. However, in machine translation, this sort of feedback can be difficult to solicit. Additionally, Kreutzer et al. (2018) have shown that, for machine translation, pairwise preferences are less reliable than other forms of human feedback, such as 5-point ratings. We examine post-edits to see if they can be a source of reliable human preferences by construction. In PO, a human annotator is shown sequences $s_1$ and $s_2$ and asked for a preference judgment, while for post-editing, editors create $s_1$ and know that it should be better than $s_2$. We attempt to use these implicit preferences for PO and show that it helps the model move towards post-edit-like hypotheses and away from machine-translation-like hypotheses. Furthermore, we show that the best results are obtained by pre-training the model with supervised fine-tuning (SFT) on post-edits in order to promote post-edit-like hypotheses to the top output ranks.
+ 2024.wmt-1.122
+ berger-etal-2024-post
+
+
+ Translating Step-by-Step: Decomposing the Translation Process for Improved Translation Quality of Long-Form Texts
+ EleftheriaBriakouGoogle
+ JiamingLuoGoogle
+ ColinCherryGoogle
+ MarkusFreitagGoogle Research
+ 1301-1317
+ In this paper, we present a step-by-step approach to long-form text translation, drawing on established processes in translation studies. Instead of viewing machine translation as a single, monolithic task, we propose a framework that engages language models in a multi-turn interaction, encompassing pre-translation research, drafting, refining, and proofreading, resulting in progressively improved translations. Extensive automatic evaluations using Gemini 1.5 Pro across ten language pairs show that translating step-by-step yields large translation quality improvements over conventional zero-shot prompting approaches and earlier human-like baseline strategies, resulting in state-of-the-art results on WMT 2024.
+ 2024.wmt-1.123
+ briakou-etal-2024-translating
+
+
+ Scaling Laws of Decoder-Only Models on the Multilingual Machine Translation Task
+ GaëtanCaillautLingua Custodia
+ MariamNakhléLingua Custodia
+ RaheelQaderLingua Custodia
+ JingshuLiuLingua Custodia
+ Jean-GabrielBarthélemyLingua Custodia
+ 1318-1331
+ Recent studies have showcased remarkable capabilities of decoder-only models in many NLP tasks, including translation. Yet, the machine translation field has been largely dominated by encoder-decoder models based on the Transformer architecture. As a consequence, scaling laws of encoder-decoder models for neural machine translation have already been well studied, but decoder-only models have received less attention. This work explores the scaling laws of decoder-only models on the multilingual and multidomain translation task. We trained a collection of six decoder-only models, ranging from 70M to 7B parameters, on a sentence-level, multilingual (8 languages) and multidomain (9 domains) dataset. We conducted a series of experiments showing that the loss of decoder-only models can be estimated using a scaling law similar to the one discovered for large language models, but we also show that this scaling law struggles to generalize to very large models or to a different data distribution. We also study different scaling methods and show that scaling the depth and the width of a model leads to similar test-loss improvements, but with a different impact on the model’s efficiency.
+ 2024.wmt-1.124
+ caillaut-etal-2024-scaling
+
+
+ Shortcomings of LLMs for Low-Resource Translation: Retrieval and Understanding Are Both the Problem
+ SaraCourtThe Ohio State University
+ MichaElsnerThe Ohio State University
+ 1332-1354
+ This work investigates the in-context learning abilities of pretrained large language models (LLMs) when instructed to translate text from a low-resource language into a high-resource language as part of an automated machine translation pipeline. We conduct a set of experiments translating Southern Quechua to Spanish and examine the informativity of various types of information retrieved from a constrained database of digitized pedagogical materials (dictionaries and grammar lessons) and parallel corpora. Using both automatic and human evaluation of model output, we conduct ablation studies that manipulate (1) context type (morpheme translations, grammar descriptions, and corpus examples), (2) retrieval methods (automated vs. manual), and (3) model type. Our results suggest that even relatively small LLMs are capable of utilizing prompt context for zero-shot low-resource translation when provided a minimally sufficient amount of relevant linguistic information. However, the variable effects of prompt type, retrieval method, model type, and language community-specific factors highlight the limitations of using even the best LLMs as translation systems for the majority of the world’s 7,000+ languages and their speakers.
+ 2024.wmt-1.125
+ court-elsner-2024-shortcomings
+
+
+ Introducing the NewsPaLM MBR and QE Dataset: LLM-Generated High-Quality Parallel Data Outperforms Traditional Web-Crawled Data
+ MaraFinkelsteinGoogle
+ DavidVilarGoogle
+ MarkusFreitagGoogle Research
+ 1355-1372
+ Recent research in neural machine translation (NMT) has shown that training on high-quality machine-generated data can outperform training on human-generated data. This work accompanies the first-ever release of an LLM-generated, MBR-decoded and QE-reranked dataset with both sentence-level and multi-sentence examples. We perform extensive experiments to demonstrate the quality of our dataset in terms of its downstream impact on NMT model performance. We find that training from scratch on our (machine-generated) dataset outperforms training on the (web-crawled) WMT’23 training dataset (which is 300 times larger), and also outperforms training on the top-quality subset of the WMT’23 training dataset. We also find that performing self-distillation by finetuning the LLM which generated this dataset outperforms the LLM’s strong few-shot baseline. These findings corroborate the quality of our dataset, and demonstrate the value of high-quality machine-generated data in improving performance of NMT models.
+ 2024.wmt-1.126
+ finkelstein-etal-2024-introducing
+
+
+ Is Preference Alignment Always the Best Option to Enhance LLM-Based Translation? An Empirical Analysis
+ HippolyteGisserot-BoukhlefMICS-CentraleSupelec/Artefact
+ RicardoReiUnbabel/INESC-ID
+ EmmanuelMalherbeArtefact
+ CélineHudelotMICS-CentraleSupelec
+ PierreColomboL2S CentraleSupelec
+ Nuno M.GuerreiroInstituto de Telecomunicacoes, University of Lisbon
+ 1373-1392
+ Neural metrics for machine translation (MT) evaluation have become increasingly prominent due to their superior correlation with human judgments compared to traditional lexical metrics. Researchers have therefore utilized neural metrics through quality-informed decoding strategies, achieving better results than likelihood-based methods. With the rise of Large Language Models (LLMs), preference-based alignment techniques have gained attention for their potential to enhance translation quality by optimizing model weights directly on preferences induced by quality estimators. This study focuses on Contrastive Preference Optimization (CPO) and conducts extensive experiments to evaluate the impact of preference-based alignment on translation quality. Our findings indicate that while CPO consistently outperforms Supervised Fine-Tuning (SFT) on high-quality data with regard to the alignment metric, it may lead to instability across downstream evaluation metrics, particularly between neural and lexical ones. Additionally, we demonstrate that relying solely on the base model for generating candidate translations achieves performance comparable to using multiple external systems, while ensuring better consistency across downstream metrics.
+ 2024.wmt-1.127
+ gisserot-boukhlef-etal-2024-preference
+
+
+ Quality or Quantity? On Data Scale and Diversity in Adapting Large Language Models for Low-Resource Translation
+ VivekIyerThe University of Edinburgh
+ BhavitvyaMalikUniversity of Edinburgh
+ PavelStepachevThe University of Edinburgh
+ PinzhenChenUniversity of Edinburgh
+ BarryHaddowUniversity of Edinburgh
+ AlexandraBirchUniversity of Edinburgh
+ 1393-1409
+ Despite the recent popularity of Large Language Models (LLMs) in Machine Translation (MT), their performance in low-resource languages (LRLs) still lags significantly behind Neural Machine Translation (NMT) models. In this work, we explore what it would take to adapt LLMs for the low-resource setting. Particularly, we re-examine the role of two factors: a) the importance and application of parallel data, and b) diversity in Supervised Fine-Tuning (SFT). Recently, parallel data has seen reduced use in adapting LLMs for MT, while data diversity has been embraced to promote transfer across languages and tasks. However, for low-resource LLM-MT, we show that the opposite is true for both considerations: a) parallel data is critical during both pre-training and SFT; b) diversity tends to cause interference instead of transfer. Our experiments with three LLMs across two low-resourced language groups—Indigenous American and North-East Indian—reveal consistent trends, underscoring the generalizability of our findings. We believe these insights will be valuable for scaling to massively multilingual LLM-MT models that can effectively serve LRLs.
+ 2024.wmt-1.128
+ iyer-etal-2024-quality
+
+
+ Efficient Technical Term Translation: A Knowledge Distillation Approach for Parenthetical Terminology Translation
+ MyungJiyoonModulabs
+ JihyeonParkModulabs
+ JungkiSonModulabs
+ KyungroLeeModulabs
+ JoohyungHanModulabs
+ 1410-1427
+ This paper addresses the challenge of accurately translating technical terms, which are crucial for clear communication in specialized fields. We introduce the Parenthetical Terminology Translation (PTT) task, designed to mitigate potential inaccuracies by displaying the original term in parentheses alongside its translation. To implement this approach, we generated a representative PTT dataset using a collaborative approach with large language models and applied knowledge distillation to fine-tune traditional Neural Machine Translation (NMT) models and small-sized Large Language Models (sLMs). Additionally, we developed a novel evaluation metric to assess both overall translation accuracy and the correct parenthetical presentation of terms. Our findings indicate that sLMs did not consistently outperform NMT models, with fine-tuning proving more effective than few-shot prompting, particularly in models with continued pre-training in the target language. These insights contribute to the advancement of more reliable terminology translation methodologies.
+ 2024.wmt-1.129
+ jiyoon-etal-2024-efficient
+
+
+ Assessing the Role of Imagery in Multimodal Machine Translation
+ NicholasKashani MotlaghOhio State University
+ JimDavisOhio State University
+ JeremyGwinnupAir Force Research Laboratory
+ GrantErdmannAir Force Research Laboratory
+ TimAndersonAir Force Research Laboratory
+ 1428-1439
+ In Multimodal Machine Translation (MMT), the use of visual data has shown only marginal improvements compared to text-only models. Previously, the CoMMuTE dataset and associated metric were proposed to score models on tasks where the imagery is necessary to disambiguate between two possible translations for each ambiguous source sentence. In this work, we introduce new metrics within the CoMMuTE domain to provide deeper insights into image-aware translation models. Our proposed metrics differ from the previous CoMMuTE scoring method by 1) assessing the impact of multiple images on individual translations and 2) evaluating a model’s ability to jointly select each translation for each image context. Our results challenge the conventional views of poor visual comprehension capabilities of MMT models and show that models can indeed meaningfully interpret visual information, though they may not leverage it sufficiently in the final decision.
+ 2024.wmt-1.130
+ kashani-motlagh-etal-2024-assessing
+
+
+ Error Span Annotation: A Balanced Approach for Human Evaluation of Machine Translation
+ TomKocmiCohere
+ VilémZouharETH Zurich, Charles University
+ EleftheriosAvramidisGerman Research Center for Artificial Intelligence (DFKI)
+ RomanGrundkiewiczMicrosoft Research
+ MarzenaKarpinskaUniversity of Massachusetts Amherst
+ MajaPopovićADAPT, Dublin City University
+ MrinmayaSachanETH Zurich
+ MariyaShmatovaDubformer
+ 1440-1453
+ High-quality Machine Translation (MT) evaluation relies heavily on human judgments. Comprehensive error classification methods, such as Multidimensional Quality Metrics (MQM), are expensive as they are time-consuming and can only be done by experts, whose availability may be limited, especially for low-resource languages. On the other hand, just assigning overall scores, like Direct Assessment (DA), is simpler and faster and can be done by translators of any level, but is less reliable. In this paper, we introduce Error Span Annotation (ESA), a human evaluation protocol which combines the continuous rating of DA with the high-level error severity span marking of MQM. We validate ESA by comparing it to MQM and DA for 12 MT systems and one human reference translation (English to German) from WMT23. The results show that ESA offers faster and cheaper annotations than MQM at the same quality level, without the requirement of expensive MQM experts.
+ 2024.wmt-1.131
+ kocmi-etal-2024-error
+
+
+ Neural Methods for Aligning Large-Scale Parallel Corpora from the Web for South and East Asian Languages
+ PhilippKoehnJohns Hopkins University
+ 1454-1466
+ We introduce neural methods and a toxicity filtering step to the hierarchical web mining approach of Paracrawl (Bañón et al., 2020), showing large improvements. We apply these methods to web-scale parallel corpus mining for 9 South and East Asian national languages, creating training resources for machine translation that yield better translation quality for most of these languages than existing publicly available datasets in OPUS. Our methods also generally lead to better results than the global mining approach of Schwenk et al. (2021).
+ 2024.wmt-1.132
+ koehn-2024-neural
+
+
+ Plug, Play, and Fuse: Zero-Shot Joint Decoding via Word-Level Re-ranking across Diverse Vocabularies
+ SaiKoneruKarlsruhe Institute of Technology
+ MatthiasHuckSAP SE
+ MiriamExelSAP SE
+ JanNiehuesKarlsruhe Institute of Technology
+ 1467-1481
+ Recent advancements in NLP have resulted in models with specialized strengths, such as processing multimodal inputs or excelling in specific domains. However, real-world tasks, like multimodal translation, often require a combination of these strengths, such as handling both translation and image processing. While individual translation and vision models are powerful, they typically lack the ability to perform both tasks in a single system. Combining these models poses challenges, particularly due to differences in their vocabularies, which limit the effectiveness of traditional ensemble methods to post-generation techniques like N-best list re-ranking. In this work, we propose a novel zero-shot ensembling strategy that allows for the integration of different models during the decoding phase without the need for additional training. Our approach re-ranks beams during decoding by combining scores at the word level, using heuristics to predict when a word is completed. We demonstrate the effectiveness of this method in machine translation scenarios, showing that it enables the generation of translations that are both speech- and image-aware while also improving overall translation quality.
+ 2024.wmt-1.133
+ koneru-etal-2024-plug
+
+
+
diff --git a/data/xml/2024.wnu.xml b/data/xml/2024.wnu.xml
new file mode 100644
index 0000000000..65827f7924
--- /dev/null
+++ b/data/xml/2024.wnu.xml
@@ -0,0 +1,113 @@
+
+
+
+
+ Proceedings of the 6th Workshop on Narrative Understanding
+ Yash KumarLal
+ ElizabethClark
+ MohitIyyer
+ SnigdhaChaturvedi
+ AnnelieseBrei
+ Khyathi RaghaviChandu
+ Association for Computational Linguistics
+ Miami, Florida, USA
+ November
+ 2024
+ 2024.wnu-1
+ wnu
+
+
+ 2024.wnu-1.0
+ wnu-2024-1
+
+
+ Narration as Functions: from Events to Narratives
+ JunboHuangUniversity of Hamburg
+ RicardoUsbeckLeuphana University Lüneburg
+ 1-7
+ Identifying events from text has a long past in narrative analysis, but a short history in Natural Language Processing (NLP). In this position paper, a question is asked: given the telling of a sequence of real-world events by a news narrator, what do NLP event extraction models capture, and what do they miss? Insights from critical discourse analysis (CDA) and from a series of movements in literary criticism motivate us to model the narrated logic in news narratives. As a result, a computational framework is proposed to model the function of news narration, which shapes the narrated world, consumed by news narratees. As a simplification, we represent the causal logic between events depicted in the narrated world.
+ 2024.wnu-1.1
+ huang-usbeck-2024-narration
+
+
+ How to tame your plotline: A framework for goal-driven interactive fairy tale generation
+ MarinaErmolaevaSaluteDevices
+ AnastasiaShakhmatovaSaluteDevices
+ AlinaNepomnyashchikhSaluteDevices
+ AlenaFenogenovaSaluteDevices
+ 8-31
+ Automatic storytelling is a difficult NLP task that poses a challenge even for state-of-the-art large language models. This paper proposes a pipeline for interactive fairy tale generation in a mixed-initiative setting. Our approach introduces a story goal as a stopping condition, imposes minimal structure on the narrative in the form of a simple emotional arc, and controls the transition between the stages of the story via system prompt engineering. The resulting framework reconciles creating a structured and complete short-form narrative with retaining player agency and allowing users to influence the storyline through their input. We evaluate our approach with several proprietary and open-source language models and examine its transferability to different languages, specifically English and Russian.
+ 2024.wnu-1.2
+ ermolaeva-etal-2024-tame
+
+
+ Understanding Transmedia Storytelling: Reception and Narrative Comprehension in Bill Willingham’s Fables Franchise
+ VictoriaLagrangeKennesaw State University
+ 32-36
+ This study explores the reception and understanding of the transmedia ensemble surrounding Bill Willingham’s Fables (2002-2015), a comic series reimagining fairytale characters in a modern setting. Fables expands its narrative across multiple media, including spin-off comics, a novel, and the video game The Wolf Among Us. This research investigates key questions: Can we identify a distinct group of transmedia consumers? What elements of the narrative sustain interest across media? A survey of 58 participants reveals that while most enter the franchise through the comic series, a significant number are introduced via the video game. The findings indicate that Fables fans are highly engaged transmedia consumers, with a majority exploring several parts of the franchise in pursuit of further narrative exploration. This study offers insights into how transmedia narratives are consumed, emphasizing the role of familiar story elements in encouraging cross-media engagement.
+ 2024.wnu-1.3
+ lagrange-2024-understanding
+
+
+ Using Large Language Models for Understanding Narrative Discourse
+ AndrewPiperMcGill University
+ SunyamBaggaMcGill University
+ 37-46
+ In this study, we explore the application of large language models (LLMs) to analyze narrative discourse within the framework established by the field of narratology. We develop a set of elementary narrative features derived from prior theoretical work that focus on core dimensions of narrative, including time, setting, and perspective. Through experiments with GPT-4 and fine-tuned open-source models like Llama3, we demonstrate the models’ ability to annotate narrative passages with reasonable levels of agreement with human annotators. Leveraging a dataset of human-annotated passages spanning 18 distinct narrative and non-narrative genres, our work provides empirical support for the deictic theory of narrative communication. This theory posits that a fundamental function of storytelling is the focalization of attention on distant human experiences to facilitate social coordination. We conclude with a discussion of the possibilities for LLM-driven narrative discourse understanding.
+ 2024.wnu-1.4
+ piper-bagga-2024-using
+
+
+ Is It Safe to Tell Your Story? Towards Achieving Privacy for Sensitive Narratives
+ MohammadShokriGraduate Center, City University of New York
+ AllisonBishopProof Trading - City College, City University of New York
+ Sarah ItaLevitanHunter College (CUNY)
+ 47-54
+ Evolving tools for narrative analysis present an opportunity to identify common structure in stories that are socially important to tell, such as stories of survival from domestic abuse. A greater structural understanding of such stories could lead to stronger protections against de-anonymization, as well as future tools to help survivors navigate the complex trade-offs inherent in trying to tell their stories safely. In this work we explore narrative patterns within a small set of domestic violence stories, identifying many similarities. We then propose a method to assess the safety of sharing a story based on a distance feature vector.
+ 2024.wnu-1.7
+ shokri-etal-2024-safe
+
+
+ Annotating Mystery Novels: Guidelines and Adaptations
+ NuetteHeynsNorth West University
+ MennoVan ZaanenSouth African Centre for Digital Language Resources
+ 55-66
+ To understand how stories are structured, we would like to be able to analyze the architecture of narratives. This article reviews and compares existing annotation guidelines for scene- and narrative-level annotation. We propose new guidelines, based on existing ones, and show how these can be effectively extended from general-purpose to specialized contexts, such as mystery novels, which feature unique narrative elements like red herrings and plot twists. This provides a controlled environment for examining genre-specific event structuring. Additionally, we present a newly annotated genre-specific dataset of mystery novels, offering valuable resources for training and evaluating models in narrative understanding. This study aims to enhance annotation practices and advance the development of computational models for narrative analysis.
+ 2024.wnu-1.9
+ heyns-van-zaanen-2024-annotating
+
+
+ Causal Micro-Narratives
+ MouradHeddayaUniversity of Chicago
+ QingchengZengNorthwestern University
+ AlexanderZentefisYale University
+ RobVoigtNorthwestern University
+ ChenhaoTanUniversity of Chicago
+ 67-84
+ We present a novel approach to classify causal micro-narratives from text. These narratives are sentence-level explanations of the cause(s) and/or effect(s) of a target subject. The approach requires only a subject-specific ontology of causes and effects, and we demonstrate it with an application to inflation narratives. Using a human-annotated dataset spanning historical and contemporary US news articles for training, we evaluate several large language models (LLMs) on this multi-label classification task. The best-performing model—a fine-tuned Llama 3.1 8B—achieves F1 scores of 0.87 on narrative detection and 0.71 on narrative classification. Comprehensive error analysis reveals challenges arising from linguistic ambiguity and highlights how model errors often mirror human annotator disagreements. This research establishes a framework for extracting causal micro-narratives from real-world data, with wide-ranging applications to social science research.
+ 2024.wnu-1.12
+ heddaya-etal-2024-causal
+
+
+ Media Framing through the Lens of Event-Centric Narratives
+ RohanDasUniversity of Colorado Boulder
+ AdityaChandraUniversity of Colorado Boulder
+ I-TaLeePurdue University
+ Maria LeonorPachecoUniversity of Colorado Boulder
+ 85-98
+ From a communications perspective, a frame defines the packaging of the language used in such a way as to encourage certain interpretations and to discourage others. For example, a news article can frame immigration as either a boost or a drain on the economy, and thus communicate very different interpretations of the same phenomenon. In this work, we argue that to explain framing devices we have to look at the way narratives are constructed. As a first step in this direction, we propose a framework that extracts events and their relations to other events, and groups them into high-level narratives that help explain frames in news articles. We show that our framework can be used to analyze framing in U.S. news for two different domains: immigration and gun control.
+ 2024.wnu-1.15
+ das-etal-2024-media
+
+
+ BERT-based Annotation of Oral Texts Elicited via Multilingual Assessment Instrument for Narratives
+ TimoBaumannOstbayerische Technische Hochschule Regensburg
+ KorbinianEllerOstbayerische Technische Hochschule Regensburg
+ NataliaGagarinaLeibniz-Zentrum Allgemeine Sprachwissenschaft (ZAS), Berlin
+ 99-104
+ We investigate how NLP can help annotate the structure and complexity of oral narrative texts elicited via the Multilingual Assessment Instrument for Narratives (MAIN). MAIN is a theory-based tool designed to evaluate the narrative abilities of children who are learning one or more languages from birth or early in their development. It provides a standardized way to measure how well children can comprehend and produce stories across different languages, as well as referential norms for children between 3 and 12 years old. MAIN has been adapted to over ninety languages and is used in over 65 countries. The MAIN analysis focuses on story structure and story complexity, which are typically evaluated manually based on scoring sheets. We here investigate the automation of this process using BERT-based classification, which already yields promising results.
+ 2024.wnu-1.16
+ baumann-etal-2024-bert
+
+
+
diff --git a/data/yaml/venues/customnlp4u.yaml b/data/yaml/venues/customnlp4u.yaml
new file mode 100644
index 0000000000..bdd0e0d648
--- /dev/null
+++ b/data/yaml/venues/customnlp4u.yaml
@@ -0,0 +1,3 @@
+acronym: CustomNLP4U
+name: 'The 1st Workshop on Customizable NLP: Progress and Challenges in Customizing
+ NLP for a Domain, Application, Group, or Individual (CustomNLP4U)'
diff --git a/data/yaml/venues/futured.yaml b/data/yaml/venues/futured.yaml
new file mode 100644
index 0000000000..3fe585b53f
--- /dev/null
+++ b/data/yaml/venues/futured.yaml
@@ -0,0 +1,2 @@
+acronym: FuturED
+name: Workshop on the Future of Event Detection
diff --git a/data/yaml/venues/nlp4science.yaml b/data/yaml/venues/nlp4science.yaml
new file mode 100644
index 0000000000..901bf68a40
--- /dev/null
+++ b/data/yaml/venues/nlp4science.yaml
@@ -0,0 +1,2 @@
+acronym: NLP4Science
+name: The 1st Workshop on NLP for Science
diff --git a/data/yaml/venues/wikinlp.yaml b/data/yaml/venues/wikinlp.yaml
new file mode 100644
index 0000000000..91a7978d4e
--- /dev/null
+++ b/data/yaml/venues/wikinlp.yaml
@@ -0,0 +1,2 @@
+acronym: WikiNLP
+name: The First Workshop on Advancing Natural Language Processing for Wikipedia