Paperoni

Paperoni is Mila's tool to collect publications from our researchers and generate HTML or reports from them.

Install

First clone the repo, then:

pip install -e .

Configuration

Create a YAML configuration file named config.yaml in the directory where you want to put the data with the following content:

paperoni:
  paths:
    database: papers.db
    history: history
    cache: cache
    requests_cache: requests-cache
    permanent_requests_cache: permanent-requests-cache
  institution_patterns:
    - pattern: ".*\\buniversit(y|é)\\b.*"
      category: "academia"

All paths are relative to the directory containing the configuration file. Institution patterns are regular expressions used to recognize affiliations when parsing PDFs (along with other heuristics).
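
To sanity-check an institution pattern before adding it to the configuration, it can be tried against a few affiliation strings. The sketch below is an illustration only: it assumes case-insensitive matching against the whole string, which may differ from how Paperoni applies the patterns internally, and the `categorize` helper and sample affiliations are hypothetical.

```python
import re

# Same pattern as in config.yaml, after YAML unescaping ("\\b" becomes \b).
PATTERNS = [
    (r".*\buniversit(y|é)\b.*", "academia"),
]


def categorize(affiliation):
    """Return the category of the first matching pattern, or None.

    Assumption: case-insensitive full match; Paperoni may match differently.
    """
    for pattern, category in PATTERNS:
        if re.fullmatch(pattern, affiliation, flags=re.IGNORECASE):
            return category
    return None


# Hypothetical affiliation strings, for illustration only.
for sample in ["McGill University", "Université de Montréal", "Mila"]:
    print(sample, "->", categorize(sample))
```

Under these assumptions the first two samples fall into the academia category and the last one matches no pattern.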

Make sure to set the $GIFNOC_FILE environment variable to the path to that file.
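
As a rough illustration of how the relative paths resolve, the sketch below reads $GIFNOC_FILE and joins each entry under paths with the directory containing the configuration file. The `resolve_paths` helper and the `/data/paperoni` location are hypothetical, and this is not Paperoni's actual loading code.

```python
import os
from pathlib import Path


def resolve_paths(config_file, paths):
    """Resolve each relative path against the directory of config_file."""
    base = Path(config_file).expanduser().absolute().parent
    return {name: base / relative for name, relative in paths.items()}


# Hypothetical fallback location; in practice GIFNOC_FILE is set in the shell.
config_file = os.environ.get("GIFNOC_FILE", "/data/paperoni/config.yaml")
resolved = resolve_paths(config_file, {"database": "papers.db", "cache": "cache"})
for name, path in resolved.items():
    print(f"{name}: {path}")
```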

Start the web app

To start the web app on port 8888, execute the following command:

grizzlaxy -m paperoni.webapp --port 8888

You can also add this section to the configuration file (same file as the paperoni config):

grizzlaxy:
  module: paperoni.webapp
  port: 8888

With that section in place, you only need to run grizzlaxy or grizzlaxy --config config-file.yaml.
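
The grizzlaxy section mirrors the command-line flags shown earlier. The sketch below spells out that correspondence for the two keys that appear in this README (module and port); the `grizzlaxy_command` helper is hypothetical, and other grizzlaxy options are not covered.

```python
import shlex


def grizzlaxy_command(section):
    """Build the command line equivalent to a `grizzlaxy:` config section.

    Only the two keys shown in this README (module, port) are handled.
    """
    return ["grizzlaxy", "-m", section["module"], "--port", str(section["port"])]


section = {"module": "paperoni.webapp", "port": 8888}
print(shlex.join(grizzlaxy_command(section)))
```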

Once papers are in the system, the app can be used to validate them or perform searches. There are some steps to follow in order to populate the database:

Add researchers

Go to http://127.0.0.1:8888/author-institution
Enter the researcher's name, role at the institution, and start date (the end date can be left blank), then click Add/Edit.
You can edit a row by clicking on it, changing e.g. the end date, and clicking Add/Edit.
Then, add IDs on Semantic Scholar: click on the number in the Semantic Scholar IDs column, which will open a new window.
This will query Semantic Scholar with the researcher's name. Each box represents a different Semantic Scholar ID. Select:
- Yes if the listed papers are indeed from the researcher. This ID will be scraped for this researcher.
- No if the listed papers are not from the researcher. This ID will not be scraped.

Ignore OpenReview IDs for now; they might not work properly yet.

Scrape

Scraping currently needs to be done on the command line:

# Scrape from semantic_scholar
paperoni acquire semantic_scholar

# Get more information for the scraped papers
# E.g. download from arxiv and analyze author list to find affiliations
# It can be wise to use --limit to avoid hitting rate limits
paperoni acquire refine --limit 500

# Merge entries for the same paper; paperoni acquire does not do it automatically
paperoni merge paper_link

# Merge entries based on paper name
paperoni merge paper_name

Other merging functions are author_link and author_name for authors (not papers) and venue_link for venues.
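
If you run these steps regularly, they can be chained in a small script. The sketch below is one possible way to do it, not part of Paperoni: it runs the commands above in order through `subprocess.run` and stops at the first failure. The `run_pipeline` helper and its `runner` parameter are hypothetical and exist only to make the sequence explicit and easy to dry-run.

```python
import subprocess

# Commands taken from the steps above, in the order they should run.
STEPS = [
    ["paperoni", "acquire", "semantic_scholar"],
    ["paperoni", "acquire", "refine", "--limit", "500"],
    ["paperoni", "merge", "paper_link"],
    ["paperoni", "merge", "paper_name"],
]


def run_pipeline(steps=STEPS, runner=subprocess.run):
    """Run each step in order; check=True raises on the first failing command."""
    for command in steps:
        print("Running:", " ".join(command))
        runner(command, check=True)
```

Calling `run_pipeline()` requires paperoni to be installed and $GIFNOC_FILE to be set. Passing a different `runner` (for example one that only records the commands) gives a dry run.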

Validate

Go to http://127.0.0.1:8888/validation to validate papers. Click "Yes" if the paper should be in the collection and "No" if it should not, according to your criteria (for example, it comes from a homonym of the researcher, is in the wrong field, or is not a paper at all; it depends on your use case).
