
Retrieve the updated citations #19

Open
liuml07 opened this issue May 28, 2016 · 3 comments

liuml07 (Member) commented May 28, 2016

We retrieved the citations once and saved the results in the citations.txt file. Since then, a few new citations have appeared on Google Scholar. We don't want to re-run the program from scratch; ideally, we should retrieve the citations incrementally. Moreover, new citations may also appear during continuous crawling, and we don't want to mess that up.

#23 is a good start toward supporting this idea, as it added Google Scholar id tags to BibTeX items.

liuml07 (Member, Author) commented Sep 19, 2016

As the discussions in #23 are going well, I think this issue is in good shape to work on. Would you like to take it, @yilihong? Thanks.

@liuml07 liuml07 assigned yilihong and shiqiezi and unassigned shiqiezi Sep 19, 2016
yilihong (Collaborator) commented Sep 19, 2016

@liuml07

Yes, I can give it a try this week. I have a couple of deadlines coming up, but I can spend some time on this. I will try to minimize structural change.

One thing I will check: maybe we can first gather all of the citation_ids by looping over the HTML pages, then do a diff with the citation_ids from the bib file, and finally loop over only the remaining citation_ids?

Any thoughts?
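The gather-then-diff idea above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the helper names, the `cites=` query parameter, and the `gscholar_id` BibTeX field format are all assumptions about how the pages and bib file might be laid out.

```python
# Hypothetical sketch: collect every citation_id from the saved Scholar HTML
# pages, subtract the ids already recorded in the bib file, and return only
# the ids that still need crawling.
import re

def ids_from_html(html: str) -> set[str]:
    # Assumption: each citation link carries its Scholar id in a
    # 'cites=...' query parameter; the real markup may differ.
    return set(re.findall(r"cites=(\d+)", html))

def ids_from_bib(bibtex: str) -> set[str]:
    # Assumption: the tag added in #23 looks like 'gscholar_id = {...}'.
    return set(re.findall(r"gscholar_id\s*=\s*\{(\w+)\}", bibtex))

def new_ids(html_pages: list[str], bibtex: str) -> set[str]:
    found: set[str] = set()
    for page in html_pages:
        found |= ids_from_html(page)
    # Diff against what the bib file already contains.
    return found - ids_from_bib(bibtex)
```

For example, if the pages contain ids 111 and 222 and the bib file already records 111, `new_ids` would return only `{"222"}`, and the crawler would fetch just that one.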

liuml07 (Member, Author) commented Sep 19, 2016

@yilihong No hurry. This is a non-profit program anyway. I appreciate your contribution very much.

I think the basic idea of looping over the HTML pages should work just fine; the logic is clear. My minor concern is that we have to sleep 100 seconds between requests, so while we're looping over the HTML pages to gather all of the citation_ids, we make no real progress (downloading the BibTeX pages and/or PDF files). If we get blocked somehow during this period, we will have wasted the chance to at least get something. This period is about 1/10 of the total running time, so it's not a deal breaker.

From this perspective, building a set of gscholar ids from the citations.bib file and checking each citation on the fly doesn't seem bad either.
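The on-the-fly alternative could look something like the sketch below: build the set of known ids from the bib file once, then skip already-seen citations during the same pass that discovers them, so downloads happen throughout the run instead of only after a full listing pass. The function and parameter names are hypothetical, and `fetch` stands in for whatever download-plus-sleep step the crawler actually performs.

```python
from typing import Callable, Iterable, Iterator

def crawl_incrementally(
    citation_ids: Iterable[str],
    known_ids: set[str],
    fetch: Callable[[str], str],
) -> Iterator[tuple[str, str]]:
    """Yield (id, result) only for citations not already in the bib file.

    Ids already recorded are skipped immediately, so no request (and no
    100-second sleep) is spent on them, and real progress is made from
    the first new citation onward.
    """
    for cid in citation_ids:
        if cid in known_ids:
            continue  # already recorded in citations.bib; skip
        yield cid, fetch(cid)  # hypothetical download step
```

Used with a stub fetch, `crawl_incrementally(["a1", "b2"], {"a1"}, lambda cid: "bib:" + cid)` would download only for `"b2"`, which matches the "check each citation on the fly" idea.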
