papergraph is a rust library and binary to build and manage a citation graph of Semantic Scholar, focused on AI/ML papers (for now). Data is stored in a postgres database with a Hasura GraphQL backend (schema) on top for easy graph queries. It comes with Jupyter notebooks that show you how to analyze and visualize the data.
Live version at https://papergraph.dbz.dev
Thanks to @ArtirKel for the useful feedback and ideas.
The folllowing notebooks work out of the box using a publicly available API endpoint for the data. You can run them locally, or in the cloud via Google Colab. Please read the caveats about the public endpoint below!
- Simple Analysis |
- Example to query the citation graph for a specific paper and analyze it with pandas
- Advanced Graph Analysis with networkx |
- Uses networkx to build and visualize the graph structure
- Finding landmark papers - Papers with a large citations may be considered landmark papers. The ideas in such papers often form the foundation for incremental improvements. Given some arbitrary paper you're interested in, you may want to know which landmark papers you should study for the required background knowledge.
- Reference research - When writing a paper, you don't want to miss prior work. Looking through the citation graph for a related paper can help you find potentially interesting papers to read and cite.
- Graph Analysis - Run sophisticated graph algorithms on the dataset to gain insights
The database is publicly available at http://34.107.246.233/v1/graphql
, so please be gentle with your queries! This is running on a small postgres server that I'm paying for, so please don't overload it with automated scripts. Be nice :) As long as you're running queries by hand through notebooks everything should be fine.
If you want to do lots of queries you should clone this repo and build the database yourself locally or in the cloud. Instructions for this are below. If you are running Kubernetes, you can also use the scripts in deploy/
.
TODO. See this issue
Requirements:
- Docker
If you want to build the database from scratch, you must download the full S2 research corpus. The total compressed size is currently around ~120GB.
Clone the repo
git clone https://github.com/dennybritz/papergraph
cd papergraph
aws s3 sync --no-sign-request s3://ai2-s2-research-public/open-corpus/2020-04-10/ data/s2-research-corpus
Start up an empty postgres database server and create the schema
export DATABASE_URL=postgres://papergraph:papergraph@postgres:5432/papergraph
export RUST_LOG=info
# Run the postgres docker container
docker-compose up postgres
# Setup the datase and run migrations
docker run --rm --network papergraph_default \
-e DATABASE_URL \
dennybritz/papergraph \
diesel database setup
Now that we have a postgres server with the right database schema running, we need to insert the data:
# Assuming you downloaded the data into /data
# as shown in the AWS command above
DATA_PATH=data/s2-research-corpus/s2-corpus-017.gz
# Repeat this for all files you want to insert
# This will take a while. On my laptop, each file takes around 1min.
docker run --rm -it --network papergraph_default \
-e DATABASE_URL -e RUST_LOG \
-v `pwd`/${DATA_PATH}:/data/${DATA_PATH} \
dennybritz/papergraph \
papergraph insert -d /data/${DATA_PATH}
Now that have seeded the database, we can also start Hasura to serve the graphql API. Stop the postgres docker process with ctrl+c
and run
docker-compose up
You should now be able to access the API via http://localhost:8080
.
papergraph
is updated when new data snapshots become available. This typically happens once a month. This means it will not contain all the latest papers.
Generating postgres database dumps
pg_dump -h localhost -p 15432 -F tar -U papergraph papergraph > pg_dump.tar
Build docker image
docker build -t dennybritz/papergraph .
Export graphql schema
gq http://34.107.246.233/v1/graphql --introspect > hasura/schema.graphql