Searching for digitized books by OCLC identifier

This repository contains scripts that search HathiTrust, Google Books, and the Internet Archive for digitized books by their OCLC numbers.


Data

Formatting your OCLC numbers for searching

OCLC identifiers should be entered in a spreadsheet column called 'oclc_id'. The identifiers should not have prefixes such as "ocm", "on", or "(OCoLC)". Save your spreadsheet as a UTF-8 encoded CSV. It does not matter whether the identifiers are saved as integers or strings; the scripts automatically convert them to strings.

When your CSV is ready, place it in the same folder as the scripts on your local system.
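If your export still contains prefixes, a small helper like the sketch below (not part of this repository) can clean it and re-save it in the expected shape. The filenames are assumptions, and it assumes the export already has an 'oclc_id' column.

```python
# Hypothetical helper, not part of this repository: strips common OCLC
# prefixes and writes a UTF-8 CSV with a single 'oclc_id' column.
import pandas as pd

raw = pd.read_csv("my_catalog_export.csv", dtype=str)  # assumed input filename
oclc = raw["oclc_id"].str.strip()

# Remove prefixes like "(OCoLC)", "ocm", or "on" if present.
oclc = oclc.str.replace(r"^(\(OCoLC\)|ocm|on)\s*", "", regex=True)

pd.DataFrame({"oclc_id": oclc}).to_csv("oclc_numbers.csv", index=False, encoding="utf-8")
```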

Test data

The repository includes a folder called "test-data" with sample input and results. It can help with formatting your own CSV and with troubleshooting the scripts on your local system.

  • test.csv: A CSV with 9 items (3 items findable by OCLC number for each website). These items were selected at random.
  • hathiTrustResults_test.csv: The results from running test.csv against searchHathiTrustByOCLC.py.
  • googleBooksResults_test.csv: The results from running test.csv against searchGoogleBooksByOCLC.py.
  • internetArchiveResults_test.csv: The results from running test.csv against searchInternetArchivesByOCLC.py.

Scripts

Requirements

searchGoogleBooksByOCLC.py

Setup: This script requires a Google API key. Go to Google's APIs & Services Credentials page and register for an API key using a Google account. Then, in the same folder as this script, create a Python file called googleKey.py with the following code:

key='##########'

Be sure to add googleKey.py to your .gitignore.
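For orientation only, here is a minimal sketch of how a script could use that key to query the public Google Books Volumes API by OCLC number. The function name, variable names, and example number are assumptions; the repository's actual script may be organized differently.

```python
# Hypothetical Google Books lookup by OCLC number.
# Assumes the key is stored in googleKey.py as described above.
import requests

from googleKey import key

def search_google_books(oclc_id: str) -> dict:
    """Query the Google Books Volumes API for a single OCLC number."""
    response = requests.get(
        "https://www.googleapis.com/books/v1/volumes",
        params={"q": f"oclc:{oclc_id}", "key": key},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()

result = search_google_books("1234567")  # placeholder OCLC number
print(result.get("totalItems", 0))
```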

Search limits: The script pauses for 60 seconds after every 100 OCLC numbers because Google Books limits the number of API searches per minute. So, if you have 1,000 OCLC identifiers to search, the script will take at least 10 minutes. I'm sure there is a better solution; I just don't know what it is. There is also a daily limit on API searches, so avoid searching more than 1,000 identifiers in a 24-hour period. If you exceed the limit, you will get an error; just rerun your script the next day.
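Here is a rough sketch of that pacing pattern, pausing 60 seconds after each batch of 100 lookups. It is an illustration only (it reuses the search_google_books() sketch above and an assumed input filename), not the repository's actual code.

```python
# Hypothetical pacing loop: pause 60 seconds after every 100 lookups
# to stay under Google Books' per-minute API limit.
# Reuses the search_google_books() function sketched above.
import csv
import time

with open("oclc_numbers.csv", newline="", encoding="utf-8") as f:  # assumed input filename
    oclc_ids = [row["oclc_id"] for row in csv.DictReader(f)]

results = []
for i, oclc_id in enumerate(oclc_ids, start=1):
    results.append(search_google_books(oclc_id))
    if i % 100 == 0:
        time.sleep(60)  # wait out the per-minute quota before the next batch
```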

searchHathiTrustByOCLC.py

This script searches the oclc field in HathiTrust.
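For reference, HathiTrust exposes a public Bibliographic API that can be queried directly by OCLC number. The sketch below uses that endpoint as an illustration; the function name is an assumption, and the repository's script may query HathiTrust differently.

```python
# Hypothetical lookup against the HathiTrust Bibliographic API by OCLC number.
import requests

def search_hathitrust(oclc_id: str) -> dict:
    """Return the brief bibliographic record HathiTrust holds for an OCLC number."""
    url = f"https://catalog.hathitrust.org/api/volumes/brief/oclc/{oclc_id}.json"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()

record = search_hathitrust("1234567")  # placeholder OCLC number
print(record.get("items", []))
```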

searchInternetArchiveByOCLC.py

This script searches two metadata fields in the Internet Archive for an OCLC number: external-identifier and oclc_id.
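As an illustration, those two fields can be queried through the Internet Archive's public advancedsearch API, as in the sketch below. The query syntax, the external-identifier format, and the function name are assumptions and may differ from the repository's script.

```python
# Hypothetical Internet Archive lookup over the external-identifier and oclc_id fields.
import requests

def search_internet_archive(oclc_id: str) -> list:
    """Return Internet Archive item identifiers matching an OCLC number."""
    # Assumed query shape: OCLC numbers often appear as urn:oclc:record:<number>.
    query = f'oclc_id:{oclc_id} OR external-identifier:"urn:oclc:record:{oclc_id}"'
    response = requests.get(
        "https://archive.org/advancedsearch.php",
        params={"q": query, "fl[]": "identifier", "rows": 50, "output": "json"},
        timeout=30,
    )
    response.raise_for_status()
    docs = response.json()["response"]["docs"]
    return [doc["identifier"] for doc in docs]

print(search_internet_archive("1234567"))  # placeholder OCLC number
```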

combineMyResults.py

This script combines the CSV results generated by running the three scripts above.
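A minimal sketch of what such a combination step could look like, assuming the three result CSVs share an 'oclc_id' column and sit in the same folder. The filenames (modeled on the test-data examples) and the outer-merge strategy are assumptions, not necessarily the script's actual behavior.

```python
# Hypothetical combination of the three result CSVs on their shared 'oclc_id' column.
from functools import reduce

import pandas as pd

# Assumed filenames, modeled on the test-data examples.
files = [
    "hathiTrustResults.csv",
    "googleBooksResults.csv",
    "internetArchiveResults.csv",
]

frames = [pd.read_csv(path, dtype=str) for path in files]

# Outer merge so an OCLC number found by only one website is still kept.
combined = reduce(lambda left, right: left.merge(right, on="oclc_id", how="outer"), frames)
combined.to_csv("combinedResults.csv", index=False, encoding="utf-8")
```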