This repository contains the dataset and baseline code for the MAUD supplementary extraction task, as described in the appendix of "MAUD: An Expert-Annotated Legal NLP Dataset for Merger Agreement Understanding".
For the main MAUD dataset and baselines, see github.com/TheAtticusProject/maud.
Bugs: The baselines reported in the paper underperform because of a training bug. See #1.
pip install torch transformers tensorboard pandas scikit-learn tqdm
During the first run, feature caching and evaluation require a lot of CPU memory (>= 150 GB) and save about 25 GB of files to disk. This memory requirement can be reduced, at the expense of speed, by lowering the --threads count in run_maud.sh.
Training uses around 22 GB of GPU memory.
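Since the first run is resource-hungry, it may help to check the machine before launching. The following is a minimal sketch, not part of this repository: the function names are hypothetical, the thresholds are the CPU memory and disk figures quoted above, and it assumes a Linux host with /proc/meminfo. It does not check GPU memory.

```shell
#!/usr/bin/env bash
# Illustrative pre-flight check (hypothetical helper, not shipped with the repo).
# Thresholds are taken from the requirements stated in this README.
REQUIRED_RAM_GB=150
REQUIRED_DISK_GB=25

# Succeeds when AVAILABLE ($1) is at least REQUIRED ($2); both are whole numbers.
meets_requirement() {
    [ "$1" -ge "$2" ]
}

# Total system RAM in GiB, rounded down (Linux only: reads /proc/meminfo).
total_ram_gb() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' /proc/meminfo
}

# Free space in GiB, rounded down, on the filesystem holding the current directory.
free_disk_gb() {
    df -Pk . | awk 'NR == 2 { printf "%d\n", $4 / 1024 / 1024 }'
}

# Prints a warning for each unmet requirement; returns non-zero if any is unmet.
check_resources() {
    local ram disk status=0
    ram=$(total_ram_gb)
    disk=$(free_disk_gb)
    if ! meets_requirement "$ram" "$REQUIRED_RAM_GB"; then
        echo "Warning: ${ram} GB RAM found, ${REQUIRED_RAM_GB} GB suggested;" \
             "consider lowering --threads in run_maud.sh" >&2
        status=1
    fi
    if ! meets_requirement "$disk" "$REQUIRED_DISK_GB"; then
        echo "Warning: ${disk} GB free disk found, ${REQUIRED_DISK_GB} GB suggested" >&2
        status=1
    fi
    return "$status"
}
```

After sourcing these functions, one possible usage is `check_resources && ./run_maud.sh`, which only starts the run when both checks pass.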
./run_maud.sh
./run_maud_best_hp.sh