Analysis of publication domain by statistical analysis of word counts.
First, publications need to be found using the Microsoft Academic Knowledge API.
This is a responsibility of the discover package.
- Specify domains of interest within config.json:
{
...
"DOMAINS": [
"cancer",
"another_domain"
]
}
- Find your Microsoft Academic Knowledge API key here. You should copy Key 1 and use it in the next step.
- Discover relevant publications with the discover package. Request about 5 times as many papers as you actually want to download, since not all papers are downloadable.
python -m discover --api-key <your-copied-key> --count <count-of-papers>
You can find the discovery files within the data/pubs directory, named <domain_name>.json.
- Download publications as PDF files via the download module. Here you provide the actual number of papers to download.
python -m download --count <count-of-papers>
You can find the downloaded publications within the data/pubs/<domain-name> directory, named <publication-title>.pdf.
- Convert downloaded publications to TXT format via the convert module.
python -m convert
You can find the converted publications within the data/pubs/<domain-name> directory, named <publication-title>.txt.
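The three commands above can also be chained from a small script. The following is only a sketch: the helper names are hypothetical, the command-line flags are the ones shown above, and the factor of 5 is the rule of thumb from the discovery step rather than a fixed requirement.

```python
import subprocess

# Rule of thumb from the discovery step: ask for about 5x more papers than needed.
OVERSAMPLING_FACTOR = 5


def build_commands(api_key, wanted_papers, python="python"):
    """Return the discover, download and convert commands as argument lists."""
    discover_count = wanted_papers * OVERSAMPLING_FACTOR
    return [
        [python, "-m", "discover", "--api-key", api_key, "--count", str(discover_count)],
        [python, "-m", "download", "--count", str(wanted_papers)],
        [python, "-m", "convert"],
    ]


def run_pipeline(api_key, wanted_papers):
    """Run the three steps in order and stop at the first failing command."""
    for command in build_commands(api_key, wanted_papers):
        subprocess.run(command, check=True)
```

Calling run_pipeline("<your-copied-key>", 20) would discover 100 papers and then try to download 20 of them, assuming the commands are run from the repository root.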
Feasibility was roughly checked during the topic selection classes. The proposed flow is as follows:
- Specify domains to gather papers for
- Use Microsoft Academic Knowledge API to find publications for a domain
- Download found publications
- Convert PDFs to TXT
- Use TF-IDF embedding to produce paper features
- Use ANOVA or a nonparametric alternative to check which words make a difference
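To make the last two steps of the flow concrete, here is a dependency-free sketch, not the project's actual implementation. It assumes plain whitespace tokenization and a smoothed IDF formula, and all function names are hypothetical. Each word gets a one-way ANOVA F statistic computed over its TF-IDF values grouped by domain; a larger F suggests the word separates the domains better.

```python
import math
from collections import Counter


def tfidf_matrix(documents):
    """Return (vocabulary, rows); rows[i][j] is the TF-IDF of word j in document i."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    n_docs = len(tokenized)
    # Number of documents each word appears in.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        length = len(tokens) or 1  # avoid division by zero for empty documents
        rows.append([
            (counts[word] / length) * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
            for word in vocabulary
        ])
    return vocabulary, rows


def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of numbers."""
    if len(groups) < 2:
        raise ValueError("ANOVA needs at least two groups")
    values = [value for group in groups for value in group]
    grand_mean = sum(values) / len(values)
    means = [sum(group) / len(group) for group in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    if ss_within == 0:
        # No variation inside the groups: perfectly separated or constant.
        return math.inf if ss_between > 0 else 0.0
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def rank_words(documents, labels):
    """Return (word, F) pairs sorted from most to least domain-specific."""
    vocabulary, rows = tfidf_matrix(documents)
    domains = sorted(set(labels))
    scores = {}
    for j, word in enumerate(vocabulary):
        groups = [
            [row[j] for row, label in zip(rows, labels) if label == domain]
            for domain in domains
        ]
        scores[word] = anova_f(groups)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

ANOVA assumes roughly normal values with similar variance in each group, which sparse TF-IDF features often violate. That is one reason to consider a rank-based test such as Kruskal-Wallis as the nonparametric alternative.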
The flow of the API is simple:
- Select a domain by constructing a query expression with the interpret endpoint.
- Use the evaluate endpoint with the resulting query expression to find papers in the domain.
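The two-step flow could look roughly like the sketch below. The base URL, the Ocp-Apim-Subscription-Key header, the query parameter names, and the response shape are all assumptions here and should be checked against the API docs. The helper names are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: base URL of the Academic Knowledge API; verify it in the API docs.
BASE_URL = "https://api.labs.cognitive.microsoft.com/academic/v1.0"


def build_request(endpoint, params, api_key):
    """Build a GET request for the given endpoint ("interpret" or "evaluate")."""
    url = f"{BASE_URL}/{endpoint}?{urllib.parse.urlencode(params)}"
    # Assumption: the key is passed in this subscription header.
    return urllib.request.Request(url, headers={"Ocp-Apim-Subscription-Key": api_key})


def first_expression(interpret_response):
    """Extract the query expression of the top interpretation, or None."""
    interpretations = interpret_response.get("interpretations", [])
    if not interpretations:
        return None
    # Assumption: the expression is stored under rules[0].output.value.
    return interpretations[0]["rules"][0]["output"]["value"]


def call(request):
    """Send the request and decode the JSON response."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def find_papers(domain, api_key, count=10, attributes="Id,Ti"):
    """Interpret the domain name, then evaluate the resulting expression."""
    interpreted = call(build_request("interpret", {"query": domain, "count": 1}, api_key))
    expression = first_expression(interpreted)
    if expression is None:
        return []
    evaluated = call(build_request(
        "evaluate",
        {"expr": expression, "count": count, "attributes": attributes},
        api_key,
    ))
    return evaluated.get("entities", [])
```

The attributes value "Id,Ti" is only a placeholder; the paper entity attributes reference lists what can actually be requested.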
Several links may be useful:
- Interpret domain as a query expression API docs
- Interpret endpoint test
- Query evaluation API docs
- Evaluate endpoint test
- API keys
- Paper entity attributes
There is a package pdftotext for Python 2 and 3.
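A minimal sketch of a conversion step built on pdftotext might look like this. It is not necessarily how the convert module works, and the helper names are hypothetical. It relies on pdftotext.PDF, which takes a file opened in binary mode and yields one string per page.

```python
from pathlib import Path


def txt_path_for(pdf_path):
    """Return the path of the TXT file that sits next to the given PDF."""
    return Path(pdf_path).with_suffix(".txt")


def convert_pdf(pdf_path):
    """Extract the text of one PDF and write it to a sibling .txt file."""
    # Third-party package; imported here so the path helper works without it.
    import pdftotext

    pdf_path = Path(pdf_path)
    with pdf_path.open("rb") as handle:
        pages = pdftotext.PDF(handle)
        text = "\n\n".join(pages)  # one string per page
    target = txt_path_for(pdf_path)
    target.write_text(text, encoding="utf-8")
    return target


def convert_directory(directory):
    """Convert every PDF directly inside the directory; return the TXT paths."""
    return [convert_pdf(path) for path in sorted(Path(directory).glob("*.pdf"))]
```

For example, convert_directory("data/pubs/cancer") would write one TXT file per downloaded PDF in that domain directory.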
There is an implementation in Python within a package called scikit-learn. You can check it here.
There are some parameters to play with - understanding them may be key to success.
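Assuming the scikit-learn implementation in question is TfidfVectorizer, the sketch below shows some of the parameters that are likely worth experimenting with. The values are only starting points, not recommendations from this project, and build_features is a hypothetical helper.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def build_features(documents, **overrides):
    """Fit a TfidfVectorizer on the documents and return (vectorizer, features)."""
    params = {
        "lowercase": True,         # fold case before tokenizing
        "stop_words": "english",   # drop common English function words
        "ngram_range": (1, 1),     # single words only; (1, 2) adds bigrams
        "min_df": 1,               # ignore words found in fewer documents than this
        "max_df": 1.0,             # ignore words found in more than this share of documents
        "sublinear_tf": False,     # True replaces tf with 1 + log(tf)
    }
    params.update(overrides)
    vectorizer = TfidfVectorizer(**params)
    features = vectorizer.fit_transform(documents)  # sparse matrix, one row per document
    return vectorizer, features


if __name__ == "__main__":
    docs = ["The tumor cells grow quickly", "Routers forward the network packets"]
    vectorizer, features = build_features(docs)
    print(features.shape)
    print(sorted(vectorizer.vocabulary_))
```

Raising min_df and lowering max_df tends to shrink the vocabulary, which matters when a statistical test is later run once per word.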
Here you can find theoretical background.