LDA in arXiv

Applying LDA (Latent Dirichlet Allocation) to the arXiv full-text article database.

step 1: download

Follow the arXiv documentation on Kaggle to get bulk access to the PDFs from the cloud storage.

# Download all the Computer Science (cs) PDF source files
gsutil cp -r gs://arxiv-dataset/arxiv/cs/pdf  ./a_local_directory/
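If the download has to be scripted, the same command can be assembled from Python. This is a minimal sketch: the helper name `gsutil_copy_command` is hypothetical, and the idea that other categories follow the same `arxiv/<category>/pdf` layout as `cs` is an assumption to check against the Kaggle documentation.

```python
import shlex


def gsutil_copy_command(category="cs", dest="./a_local_directory/"):
    """Build the gsutil command that copies one category's PDFs to `dest`.

    Assumption: every category uses the same bucket layout as `cs`.
    """
    source = f"gs://arxiv-dataset/arxiv/{category}/pdf"
    return ["gsutil", "cp", "-r", source, dest]


if __name__ == "__main__":
    # Print the command instead of running it, so it can be reviewed first.
    print(shlex.join(gsutil_copy_command()))
```

The returned list can be passed to `subprocess.run(..., check=True)` once `gsutil` is installed and on the PATH.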

step 2: set-up GROBID

According to its own docs, "GROBID is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured XML/TEI encoded documents".

To work with GROBID we need to:

Install and set up the GROBID Java REST service (preferably on a high-memory server).
Install the GROBID Python client, to access the service from Python.
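Once both pieces are installed, it can help to confirm that the service answers before parsing anything. The sketch below calls GROBID's `/api/isalive` health endpoint using only the standard library. The base URL `http://localhost:8070` assumes the service runs locally on its default port, and the function names are hypothetical.

```python
import http.client
import urllib.error
import urllib.request


def isalive_url(base_url="http://localhost:8070"):
    """Return the health-check URL for a GROBID service at `base_url`."""
    return base_url.rstrip("/") + "/api/isalive"


def grobid_is_alive(base_url="http://localhost:8070", timeout=5.0):
    """Return True if the service replies 'true' to the health check."""
    try:
        with urllib.request.urlopen(isalive_url(base_url), timeout=timeout) as response:
            return response.status == 200 and response.read().strip() == b"true"
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # Connection refused, timeout, or a malformed reply: treat as not alive.
        return False
```

If the service runs on another machine, pass its address as `base_url`.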

step 3: run GROBID

To run GROBID, you first need to start the GROBID REST service by running the following commands in the GROBID directory:

# open a new screen session named 'GROBID'
screen -S GROBID

# start the GROBID service, then detach from the screen (Ctrl-A, then D)
./gradlew run

Once the server is running, you can use the GROBID Python client to parse the PDFs with the 'processFulltextDocument' command. Omitting '--output' creates the .tei.xml files alongside the PDFs:

grobid_client --input ~/your_pdf_directory processFulltextDocument
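After a large batch, some PDFs may have failed to parse. The sketch below lists the PDFs that have no TEI file next to them. The function name `pdfs_missing_tei` is hypothetical, and the two output suffixes are assumptions, since the exact file name depends on the client version; adjust `TEI_SUFFIXES` to match what your run produces.

```python
from pathlib import Path

# Assumption: the client writes "<stem>.tei.xml" or "<stem>.grobid.tei.xml".
TEI_SUFFIXES = (".tei.xml", ".grobid.tei.xml")


def pdfs_missing_tei(pdf_dir):
    """Return the PDFs under `pdf_dir` that have no TEI file beside them."""
    missing = []
    for pdf in sorted(Path(pdf_dir).rglob("*.pdf")):
        candidates = [pdf.with_name(pdf.stem + suffix) for suffix in TEI_SUFFIXES]
        if not any(candidate.exists() for candidate in candidates):
            missing.append(pdf)
    return missing
```

The returned paths can be moved to a separate directory and passed to the client again.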

step 4: search, clean and run LDA (under construction)

At this step we use a package of functions that handle the data in the .tei.xml files created with GROBID: they search for phrases containing specific terms and then run LDA on those phrases.
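The search part of this step can be sketched with the standard library alone. This is an illustration, not the code in `functions.py`: the helper names are hypothetical, the sentence splitter is a naive regular expression, and the sketch assumes body text sits in `<p>` elements under `<body>` in the TEI namespace. The sentences it returns would then be the input to the LDA step.

```python
import re
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}


def paragraphs_from_tei(tei_xml):
    """Return the body paragraphs of a TEI document as plain strings."""
    root = ET.fromstring(tei_xml)
    paragraphs = []
    for p in root.iterfind(".//tei:body//tei:p", TEI_NS):
        # Join nested text (e.g. citation refs) and collapse whitespace.
        text = " ".join("".join(p.itertext()).split())
        if text:
            paragraphs.append(text)
    return paragraphs


def sentences_with_term(paragraphs, term):
    """Return the sentences that contain `term` as a whole phrase."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    hits = []
    for paragraph in paragraphs:
        # Naive split on sentence-ending punctuation followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
            if pattern.search(sentence):
                hits.append(sentence)
    return hits
```

To process a file on disk, read it with `Path(...).read_text(encoding="utf-8")` and pass the string to `paragraphs_from_tei`.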

To run the tests, simply:

from functions import * 
test_final('your__testfile_name')


talesshift/LDA_arXiv
