project.yml

title: Affiliation Extraction and Parsing from OpenALEX Preprints

description: |
  ## Design
  This project uses spaCy to extract and parse affiliations from preprints in PDF format. The pipeline consists of several components:

  - **PDF-to-text conversion**: Extracts plain text blocks from PDF files using [pyMuPDF](https://pymupdf.readthedocs.io/en/latest/).
  - **Text categorization**: Predicts text blocks containing affiliations using a binary text categorization model so they can be extracted separately.
  - **Named entity recognition**: Parses the extracted text blocks to identify named entities like organizations, people, and locations.
  - **Relation extraction**: Builds a graph of affiliations and their relationships to authors and institutions.

  For a full overview of the project, including possible next steps, see [the report](https://docs.google.com/document/d/1kdqEFBh0IolQUq-xybDajOPejoLRKy1cF4b7LycTv60/edit?usp=sharing).

  ## Setup
  To use the workflow in this project, you need to start by installing spaCy, ideally in a new virtual environment:
  ```sh
  pip install spacy
  ```
  Then, before running any of the workflows, make sure to install the dependencies by running the following command:
  ```sh
  weasel run dependencies:install
  ```
  This will ensure you have the latest copies of pretrained models as well as the `thinc-apple-ops` package if you are running on an M-series Mac, which significantly speeds up training times.

  ## Data
  Preprint data was sourced from [sul-dlss-labs/preprints-evaluation-dataset](https://github.com/sul-dlss-labs/preprints-evaluation-dataset). You can pull down a copy of the spreadsheet listing all preprints with:
  ```sh
  weasel assets
  ```
  This command will check the CSV file's checksum to ensure you are using the same copy of the data as the project was developed with. To fetch the actual PDF files and their metadata and convert them to plaintext, run:
  ```sh
  weasel run preprints:download
  weasel run preprints:clean
  ```
  Note that **you need to be on Stanford VPN** to fetch files from SDR.
  
  ## Annotating
  Annotated training datasets are stored in the [datasets/](datasets/) directory. They have also been pre-exported in the binary format used by spaCy in the [corpus/](corpus/) directory and are already configured as training data in the `config.cfg` files in the `configs/[ner|textcat]` directory.

  If you have a local copy of [prodigy](https://prodi.gy/), you can use it to annotate more data for training. To recreate the dataset for text categorization, run:
  ```sh
  weasel run dataset:textcat:create
  ```
  Then, you can annotate the data using:
  ```sh
  weasel run annotate:textcat
  ```
  To recreate the dataset for named entity recognition, run:
  ```sh
  weasel run dataset:ner:create
  ```
  And then annotate the data using:
  ```sh
  weasel run annotate:ner
  ```
  The annotation tasks will open in your browser, and you can use the Prodigy UI to annotate more data.

  ## Training
  You can train the text categorization model using:
  ```sh
  weasel run train_textcat
  ```
  And the named entity recognition model using:
  ```sh
  weasel run train_ner
  ```
  These commands are parameterized by the `embedding` and `transformer_model_name` variables in the project.yml file. You can choose to use spaCy's built-in token-to-vector embedding layer (`tok2vec`) or a transformer-based model (`transformer`). If you choose to use a transformer-based model, you can specify the model name (`transformer_model_name`) from the Hugging Face model hub.

  If you change these parameters and re-run the training command, spaCy saves the results in the `training/` directory. The best-scoring run and the last run are always saved to `training/[ner|textcat]/model-best` and `training/[ner|textcat]/model-last`, respectively.

  You can further customize the training configuration by editing the `config_[embedding].cfg` file in the `configs/[ner|textcat]` directory. If you do this, it's usually worth it to run:
  ```sh
  spacy debug config configs/[ner|textcat]/config_[embedding].cfg
  ```
  This will check the configuration file for errors and print out a summary of the settings.

  ## Visualizing
  There are several interfaces built with [Streamlit](https://streamlit.io/) to help debug the various parts of the pipeline. You can view these with:
  ```sh
  streamlit run scripts/visualize.py
  ```
  This will open a browser window with the Streamlit interface, allowing you to preview different data and model parameters.

  ## API
  There is an example API built with [FastAPI](https://github.com/fastapi) that processes a PDF file sent as form data and returns predicted affiliations as JSON. You can start a development server with:
  ```sh
  fastapi dev scripts/api.py
  ```
  Then, POST a PDF file using:
  ```sh
  curl -F file=@assets/preprints/pdf/W2901173781.pdf "http://localhost:8000/analyze"
  ```


vars:
  embedding: "tok2vec" # tok2vec, transformer
  config_file: "config_${embedding}.cfg"
  transformer_model_name: "roberta-base" # any huggingface model name
  gpu_id: -1 # -1 for CPU, 0 for GPU

directories:
  - assets
  - training
  - results

assets:
  - dest: "assets/preprints.csv"
    git:
      repo: https://github.com/sul-dlss-labs/preprints-evaluation-dataset.git
      branch: main
      path: records.csv
    checksum: 68138588160e22998f6e6921f4e705d8 # md5
    description: List of preprint documents in the training dataset

workflows:
  prepare:
    - preprints:download
    - preprints:clean
    - dataset:textcat:create

commands:
  - name: dependencies:install
    help: Install dependencies
    script:
      - pip install -r requirements.txt
      - python -m spacy download en_core_web_lg # for NER and to compare with -trf
      - python -m spacy download en_core_web_trf # for NER and to compare with -lg
      - /bin/bash -c "if [[ $(arch) == 'arm64' ]]; then pip install thinc-apple-ops; fi" # for M-series macs

  - name: preprints:download
    help: Download preprint PDFs and metadata (**Needs Stanford VPN**)
    outputs:
      - assets/preprints/pdf
    script:
      - python scripts/download_preprints.py assets/preprints.csv assets/preprints/pdf
      - python scripts/download_metadata.py assets/preprints.csv assets/preprints/json

  - name: preprints:clean
    help: Extract plain text from preprint PDFs
    deps:
      - assets/preprints/pdf
    outputs:
      - assets/preprints/txt
    script:
      - python scripts/clean_preprints.py assets/preprints/pdf assets/preprints/txt

  - name: dataset:textcat:create
    help: Create a dataset for annotating text categorization training data
    deps:
      - assets/preprints/txt
    outputs:
      - assets/preprints_textcat.jsonl
    script:
      - python scripts/create_textcat_dataset.py assets/preprints_textcat.jsonl

  - name: annotate:textcat
    help: Annotate binary training data for text categorization
    deps:
      - assets/preprints_textcat.jsonl
    script:
      - python -m prodigy textcat.manual preprints_textcat assets/preprints_textcat.jsonl --label AFFILIATION

  - name: dataset:ner:create
    help: Create a dataset for annotating NER data
    deps:
      - assets/preprints/txt
    outputs:
      - assets/preprints.jsonl
    script:
      - python scripts/create_ner_dataset.py assets/preprints.jsonl

  - name: annotate:ner
    help: Annotate training data for NER by correcting and updating an existing model
    deps:
      - assets/preprints.jsonl
    script:
      - python -m prodigy ner.correct preprints_ner en_core_web_trf assets/preprints.jsonl --label PERSON,ORG,GPE --update

  - name: train_textcat
    help: Train spaCy text categorization pipeline for affiliation extraction
    deps:
      - "configs/textcat/${vars.config_file}"
      - corpus/textcat/train.spacy
      - corpus/textcat/dev.spacy
    script:
      - "python -m spacy train configs/textcat/${vars.config_file} --output training/textcat --gpu-id ${vars.gpu_id} --vars.transformer_model_name ${vars.transformer_model_name}"

  - name: train_ner
    help: Train spaCy NER pipeline for affiliation parsing
    deps:
      - "configs/ner/${vars.config_file}"
      - corpus/ner/train.spacy
      - corpus/ner/dev.spacy
    script:
      - "python -m spacy train configs/ner/${vars.config_file} --output training/ner --gpu-id ${vars.gpu_id} --vars.transformer_model_name ${vars.transformer_model_name}"