Dataset snapshot release v0.2.0

@mattbierbaum mattbierbaum released this 29 Apr 23:18

Release of the arXiv public dataset libraries, with the ability to gather and process:

  1. arXiv metadata provided by OAI
  2. PDFs downloaded from S3
  3. Full plain text generated by pdftotext
  4. Internal co-citation network
  5. Parsed author lines (v0.2.0)

The binaries available are:

  • arxiv-metadata-hash-abstracts-v0.2.0-2019-03-01.json.gz
    Full metadata downloaded from (1) with hashed abstracts in place of the abstract text.
  • internal-references-v0.2.0-2019-03-01.json.gz
    Snapshot of the internal co-citation network at the time of release generated with (4).
  • authors-parsed-v0.2.0-2019-03-01.json.gz
    Parsed author lines at the time of release, generated by (5).
  • manifest-index-v0.2.0-2019-03-01.json.gz
    A detailed file-level manifest: a dictionary mapping the tarpdf files in the S3 manifest to the arXiv
    file paths they contain. It can be used to target a subset of the arXiv bulk download, as discussed in
    this issue.
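
As a rough illustration of how the manifest index might be used to target a subset of the bulk download, here is a minimal Python sketch. It assumes the decompressed file is a single JSON object whose keys are tar file names and whose values are lists of the paths inside each tar; that layout is inferred from the description above and has not been checked against the file. The helper names `load_gzipped_json` and `tars_containing` are hypothetical and are not part of the released libraries.

```python
import gzip
import json


def load_gzipped_json(path):
    """Read a gzip-compressed JSON file into a Python object."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return json.load(handle)


def tars_containing(manifest, fragment):
    """Return the sorted tar names that list at least one path containing `fragment`.

    Assumption: `manifest` maps each tar file name to a list of the
    file paths stored in that tar.
    """
    return sorted(
        tar_name
        for tar_name, paths in manifest.items()
        if any(fragment in path for path in paths)
    )
```

Under those assumptions, calling `load_gzipped_json("manifest-index-v0.2.0-2019-03-01.json.gz")` and passing the result to `tars_containing` with a path fragment of interest would give the list of tar files worth fetching, rather than the whole bulk download. If the file turns out to be structured differently (for example, one JSON record per line), the loader would need adjusting.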