Applying LDA to the arXiv full article database
Following the arXiv documentation on Kaggle, we get bulk access to the PDFs from the cloud storage:
# Download all the Computer Science (cs) PDF source files
gsutil cp -r gs://arxiv-dataset/arxiv/cs/pdf ./a_local_directory/
According to its own docs, "GROBID is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured XML/TEI encoded documents".
To work with GROBID we need to:
- Install and set up the GROBID Java REST service (preferably on a high-memory server).
- Install the GROBID Python client to access the service from Python.
To run GROBID, you first need to start the GROBID REST service by running the following commands in the GROBID directory:
# Open a new screen named 'GROBID'
screen -S GROBID
# Start the GROBID service, then detach from the screen (Ctrl-A, then D) to leave it running
./gradlew run
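Before sending any PDFs, it can help to confirm that the service is actually answering. The sketch below polls GROBID's `/api/isalive` health-check endpoint using only the standard library. The base URL `http://localhost:8070` assumes the default port on the same machine, and the helper name `grobid_is_alive` is ours, not part of GROBID.

```python
import urllib.error
import urllib.request


def grobid_is_alive(base_url="http://localhost:8070", timeout=5):
    """Return True if the GROBID service at base_url answers its health check."""
    url = base_url.rstrip("/") + "/api/isalive"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure or HTTP error:
        # treat all of them as "service not reachable".
        return False
```

Calling `grobid_is_alive()` in a loop with a short sleep is one simple way to wait for the server to finish loading its models, if your setup needs that.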
Once the server is running, you can use the GROBID Python client to parse the PDFs with the 'processFulltextDocument' command. Omitting the '--output' option creates the .tei.xml files alongside the PDFs:
grobid_client --input ~/your_pdf_directory processFulltextDocument
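After a large batch it is worth checking which PDFs did not get a TEI file, since individual documents can fail to parse. The following is a minimal sketch, assuming the output sits next to each PDF and is named either `<stem>.tei.xml` or `<stem>.grobid.tei.xml` (the exact suffix may depend on your client version). The function name `pdfs_missing_tei` is hypothetical.

```python
from pathlib import Path


def pdfs_missing_tei(pdf_dir):
    """Return the PDFs under pdf_dir that have no TEI file next to them."""
    missing = []
    for pdf in sorted(Path(pdf_dir).expanduser().rglob("*.pdf")):
        # Assumed output names; adjust if your client uses another suffix.
        candidates = [
            pdf.with_name(pdf.stem + ".tei.xml"),
            pdf.with_name(pdf.stem + ".grobid.tei.xml"),
        ]
        if not any(c.exists() for c in candidates):
            missing.append(pdf)
    return missing
```

For example, `pdfs_missing_tei("~/your_pdf_directory")` returns the list of PDFs that may need to be reprocessed or inspected by hand.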
At this step we use a package of functions that handles the data in the .tei.xml files created with GROBID, searches for phrases containing specific terms, and then runs LDA on those phrases.
To run the tests, simply run:
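To give an idea of what such a pipeline involves, here is a rough, independent sketch; it is not the code of the package itself. It assumes the TEI namespace `http://www.tei-c.org/ns/1.0`, treats a "phrase" as a sentence found by a naive regex split, and uses scikit-learn (version 1.0 or later) for LDA. All function names and default parameters are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

TEI = "{http://www.tei-c.org/ns/1.0}"  # TEI XML namespace in ElementTree form


def tei_sentences(source):
    """Split the body paragraphs of a TEI file (path or file object) into sentences."""
    root = ET.parse(source).getroot()
    sentences = []
    for body in root.iter(TEI + "body"):
        for p in body.iter(TEI + "p"):
            # Join the text of the paragraph and its inline children,
            # then collapse runs of whitespace.
            text = " ".join("".join(p.itertext()).split())
            # Naive splitter: break after '.', '!' or '?' followed by whitespace.
            sentences.extend(s for s in re.split(r"(?<=[.!?])\s+", text) if s)
    return sentences


def phrases_with_terms(sentences, terms):
    """Keep the sentences that contain at least one term (case-insensitive substring)."""
    lowered = [t.lower() for t in terms]
    return [s for s in sentences if any(t in s.lower() for t in lowered)]


def fit_lda(phrases, n_topics=5, n_top_words=8):
    """Fit LDA on the phrases and return the top words of each topic."""
    # Imported here so the parsing helpers work without scikit-learn installed.
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(phrases)  # bag-of-words matrix
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(counts)
    vocab = vectorizer.get_feature_names_out()
    return [
        [vocab[i] for i in topic.argsort()[::-1][:n_top_words]]
        for topic in lda.components_
    ]
```

A typical call chain would be `fit_lda(phrases_with_terms(tei_sentences(path), ["topic model"]))`, with the sentences from many papers collected into a single list first. Substring matching is the simplest choice here; a tokenizer or lemmatizer would give cleaner matches.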
from functions import *
test_final('your_testfile_name')