This scraper can be detected by the server, and your IP may get blocked.
DO NOT run it on public networks, to avoid getting the network's IP blocked.
Running it on Google Colab is recommended. Feel free to make a copy of the same code, available at this link, for your own use.
- The server detects scraping activity. Continuous scraping, or scraping data for an author with more than a couple of hundred publications, throws errors. As a temporary workaround, partial results are saved to a JSON file and later reloaded so that only the remaining data is scraped. The following solutions did not work:
- Using sleep between requests.
- Using proxies (due to connection challenges).
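The save-and-resume workaround above can be sketched as follows. This is a hypothetical illustration, not the repository's actual implementation: `save_partial`, `load_partial`, and `scrape_remaining` are made-up names, and `fetch` stands in for whatever function requests one publication's details.

```python
import json
import os

def save_partial(path, data):
    """Write partial results to disk so a later run can resume."""
    with open(path, "w") as f:
        json.dump(data, f)

def load_partial(path):
    """Reload results from an earlier, interrupted run (if any)."""
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {}

def scrape_remaining(path, all_ids, fetch):
    """Scrape only the items missing from the partial-results file.

    `fetch` is a hypothetical callable that retrieves one item and
    raises when the server starts rejecting requests.
    """
    results = load_partial(path)
    for pub_id in all_ids:
        if pub_id in results:
            continue  # already scraped in an earlier run
        try:
            results[pub_id] = fetch(pub_id)
        except Exception:
            break  # server started blocking; keep what we have
    save_partial(path, results)
    return results
```

Calling `scrape_remaining` repeatedly converges on the full result set: each run keeps whatever it managed to fetch before being blocked.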
- The returned HTML sometimes omits some of the details (marked with #TODO comments in the code). As of October 2022, these include the link to the publication, the paper description, and the link to the author's photo.
- If on Google Colab, mount your Google Drive.
- Define your DATA_PATH variable.
- Set the AUTHORID variable.
- Create your author object using:
author_obj = create_author(AUTHORID)
- Scrape the data about the author using:
author_obj.scrape()
- To check whether all publication details were retrieved, inspect:
author_obj.all_publications_extracted
- If the previous step returns False, run:
author_obj.scrape()
The data will be saved as JSON files in your DATA_PATH. If you want to re-scrape an author's data from scratch, delete the JSON file named AUTHORID.json before creating the author object again.
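The steps above can be consolidated into a small retry loop. This is a sketch: `create_author`, `scrape()`, and `all_publications_extracted` come from this repository, but the `run_scrape` helper, the retry budget, and the placeholder values for DATA_PATH and AUTHORID are illustrative assumptions.

```python
# Example placeholders only -- substitute your own values.
DATA_PATH = "/content/drive/MyDrive/scholar/"  # assumed example path
AUTHORID = "AUTHOR_ID_HERE"                    # assumed example Scholar ID

def run_scrape(create_author, author_id, max_retries=5):
    """Create the author object and re-run scrape() until all
    publication details are retrieved or the retry budget runs out.

    `create_author` is passed in so this sketch stays self-contained;
    in practice it is the helper described in the steps above.
    """
    author_obj = create_author(author_id)
    author_obj.scrape()
    retries = 0
    while not author_obj.all_publications_extracted and retries < max_retries:
        author_obj.scrape()  # picks up where the partial JSON left off
        retries += 1
    return author_obj
```

A bounded retry loop is used rather than `while True`, since the server may keep blocking and the partial results are already safe on disk between runs.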