Useful console commands (harvest, WoS, Pubmed, others)
Starting rails console on a server:
bundle exec rails c -e production
sunetid='peter12345'
author=Author.find_by_sunetid(sunetid)
author.cap_profile_id
author.author_identities # gives the users alternate names data
All sources:
RAILS_ENV=production bundle exec rake harvest:all_authors # no lookback window = harvest all authors for all available times
RAILS_ENV=production bundle exec rake harvest:all_authors_update # lookback window = the default update timeframe specified in the settings.yml file for pubmed and WoS
Just Pubmed:
RAILS_ENV=production bundle exec rake pubmed:harvest_authors # no lookback window = harvest all authors for all available times
RAILS_ENV=production bundle exec rake pubmed:harvest_authors_update # lookback window = the default update timeframe specified in Settings.PUBMED.regular_harvest_timeframe
RAILS_ENV=production bundle exec rake pubmed:harvest_authors_update["52"] # lookback window = specified by Pubmed relDate parameter
Just WOS:
RAILS_ENV=production bundle exec rake wos:harvest_authors # no lookback window = harvest all authors for all available times
RAILS_ENV=production bundle exec rake wos:harvest_authors_update # lookback window = the default update timeframe specified in Settings.WOS.regular_harvest_timeframe
RAILS_ENV=production bundle exec rake wos:harvest_authors_update["26W"] # lookback window = specified by WOS load_time_span parameter
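The two lookback parameters use different units: the Pubmed relDate value is a number of days, while the WoS load_time_span value is a number of weeks followed by "W". A minimal sketch, assuming you want a single window applied to both harvesters; `lookback_options` is a hypothetical helper, not part of sul_pub. It returns a hash in the same shape as the options passed to the harvesters on the console.

```ruby
# Hypothetical helper (not part of sul_pub): build one options hash for both
# harvesters from a lookback window expressed in whole weeks.
def lookback_options(weeks)
  unless weeks.is_a?(Integer) && weeks.positive?
    raise ArgumentError, 'weeks must be a positive integer'
  end

  {
    load_time_span: "#{weeks}W", # WoS: number of weeks followed by "W"
    relDate: (weeks * 7).to_s    # Pubmed: number of days, as a string
  }
end

options = lookback_options(26)
puts options.inspect
```

Deriving both values from one number keeps the two windows consistent, which is easy to get wrong when typing them by hand.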
On the rails console:
cap_profile_id=123
author=Author.find_by_cap_profile_id(cap_profile_id);
pub_count = author.publications.count;
# set up your options, you really only need one of the below
options = {} # accept defaults as defined in config/settings.yml
options = {load_time_span: '52W', relDate: '365'} # if you want to change the default lookup harvesting time, send in options for each harvester, this is an example for 1 year (load_time_span is for WoS, relDate is for Pubmed)
options = {load_time_span: nil, relDate: nil} # if you want to harvest for all time
# to harvest just one source
WebOfScience.harvester.process_author(author, options) # for WoS only
Pubmed.harvester.process_author(author, options) # for Pubmed only
# to harvest all available sources (e.g. will do the two above automatically)
AllSources.harvester.process_author(author, options)
new_pub_count = author.publications.count;
new_pub_count - pub_count # see the number of new publications harvested
Keep in mind you may get new publications added to a profile without actually creating new publication records, rather you will get new "contribution" records. In other words, you are associating an existing publication row with this author (via the contribution model).
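The difference between new publication rows and new contribution rows can be illustrated outside the application. This is a plain-Ruby sketch with made-up data; the hashes and arrays are stand-ins for database rows and do not reflect the real schema.

```ruby
# Plain-Ruby illustration only; these structures stand in for database rows.
publications  = { 'WOS:1' => 'Paper A' }         # existing publication rows, keyed by UID
contributions = [{ author_id: 1, uid: 'WOS:1' }] # author 1 is already linked to Paper A

# Made-up harvest result for author 2: one already-known paper, one new paper.
harvested = { 'WOS:1' => 'Paper A', 'WOS:2' => 'Paper B' }

new_publications = 0
new_contributions = 0
harvested.each do |uid, title|
  unless publications.key?(uid)
    publications[uid] = title # a publication row is created only for unseen UIDs
    new_publications += 1
  end

  link = { author_id: 2, uid: uid }
  next if contributions.include?(link)

  contributions << link # the author/publication link is recorded either way
  new_contributions += 1
end

puts "new publications: #{new_publications}, new contributions: #{new_contributions}"
```

With this sample data the author gains two contributions but only one publication row is created, which is why the overall Publication count can move less than an author's own count.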
Edit the top of the script to set the author limit, how far back to harvest, and the author query. It defaults to 1000 authors, 12 weeks back, and the most recently updated authors. You can also modify this script to use the Pubmed harvester or the AllSources harvester if needed. This should be fairly rare in practice.
bundle exec rails runner -e production script/batch_wos_harvest.rb
Users can update their personal information, alternate identities and other settings on the Profiles side, and we need to be sure our database remains in sync. To do this, we run a nightly cron task (scheduled in the config/schedule.rb file) that makes an API call against their system to return any profile data that has changed, and we then update our database to match.
The cron job runs a rake task with a parameter of "1", which specifies to look back only one day:
RAILS_ENV=production bundle exec rake cap:poll[1]
If needed, you can manually run that rake task for a longer period of time (e.g. if the task has failed for a while and needs to catch up). You can also run a separate rake task for a specific individual by passing in a specific cap_profile_id. This is useful if you need someone's updated information immediately so that you can re-run a harvest for them:
RAILS_ENV=production bundle exec rake cap:poll_data_for_cap_profile_id[12345] # just print the data for debugging
RAILS_ENV=production bundle exec rake cap:poll_for_cap_profile_id[12345] # actually update our database
start_time = Time.zone.now - 1.day
Publication.where('created_at >= ?', start_time).count
Contribution.where('created_at >= ?', start_time).count
cap_profile_id=1234
status = 'new' # only find new publications (could also be 'approved' or 'denied')
author=Author.find_by_cap_profile_id(cap_profile_id);
# print out the citations
author.contributions.where("status = ?", status).order(:created_at).each {|c| puts "#{c.created_at} : #{c.publication.pub_hash[:apa_citation]}\r\n\n"};nil
If you'd like to see the exact query that will be sent to both the WoS and Pubmed harvesters for a given author to be sure it looks reasonable:
cap_profile_id = '203382'
author=Author.find_by_cap_profile_id(cap_profile_id);
puts "CAP_PROFILE_ID: #{author.cap_profile_id}";
puts "NUM PUBS: #{author.contributions.count}";
puts "PRIMARY AUTHOR NAME: #{author.cap_first_name} #{author.cap_last_name}";
author_query = WebOfScience::QueryAuthor.new(author);
puts "ALL NAMES: #{author_query.name_query.send(:names).join(";")}";
puts "INSTITUTIONS: #{author_query.name_query.send(:institutions).join(";")}";
puts "WOS (by name): #{author_query.name_query.send(:name_query)}";
puts "WOS (by orcid): #{author_query.orcid_query.send(:orcid_query)}" if author.orcidid;
pm_query = Pubmed::QueryAuthor.new(author, {});
puts "Pubmed: #{pm_query.send(:term)}";
Note there is no de-duplication against results already on a user's profile. This is just the result for a WoS query for that author.
cap_profile_id=1234
author=Author.find_by_cap_profile_id(cap_profile_id);
author_query = WebOfScience::QueryAuthor.new(author)
uids = author_query.uids; # fetch the ids
names_uids = author_query.name_query.uids # only UIDs from name search
orcid_uids = author_query.orcid_query.uids # only UIDs from ORCID search (only for users with an orcid, else you get an error)
# look at the publications
uids.each do |uid|
result = WebOfScience.queries.retrieve_by_id([uid]).next_batch.to_a.first
puts "#{uid} : #{result.pub_hash[:apa_citation]}\n\r"
end;nil
# interrogate what the query will look like
author_query.name_query.send(:name_query) # the name query that will be sent to WoS, taking into account alternate names and institutions
author_query.orcid_query.send(:orcid_query) # the orcid query that will be sent to WoS (only makes sense if the user has an orcid, else returns an error)
# with optional timespan, e.g. go back 1 year
author_query = WebOfScience::QueryAuthor.new(author,{load_time_span: '52W'})
uids_year = author_query.uids
# the authors current UIDs
uids_current = author.publications.map(&:wos_uid);
# just the new publications between the UIDs harvested above and their current list of UIDs
uids_year - uids_current
# an arbitrary name
author=Author.new(preferred_first_name:'Donald',preferred_last_name:'Duck')
author_query = WebOfScience::QueryAuthor.new(author)
author_query.name_query.send(:name_query) # the name query that will be sent to WoS, taking into account alternate names and institutions
uids = author_query.uids;
author=Author.new(preferred_first_name:'Donald',preferred_last_name:'Duck')
# WoS
author_query = WebOfScience::QueryAuthor.new(author)
author_query.name_query.send(:name_query) # the name query that will be sent to WoS, taking into account alternate names and institutions
author_query.orcid_query.send(:orcid_query) # the orcid query that will be sent to WoS (only makes sense if the user has an orcid, else returns an error)
# Pubmed
author_query = Pubmed::QueryAuthor.new(author)
author_query.send(:term)
Print out the return data from PubMed for the given pmid:
pmid='29273806'
pm_xml = Pubmed.client.fetch_records_for_pmid_list(pmid);
or from terminal:
RAILS_ENV=production bundle exec rake pubmed:publication[12345]
uid = 'WOS:000087898000028'
results = WebOfScience.queries.retrieve_by_id([uid]).next_batch.to_a;
results.each { |rec| rec.print }
puts results.first.titles["item"]
results.map(&:pub_hash)
or from terminal:
RAILS_ENV=production bundle exec rake wos:publication['WOS:000087898000028']
The WoS API does a partial string match on the DOI, so it can return many results.
doi = '10.1118/1.598623'
results = WebOfScience.queries.user_query("DO=#{doi}").next_batch.to_a;
results.each { |rec| rec.print }
results[0].uid
=> "WOS:000081515000015"
results.map(&:pub_hash)
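Because the match is partial, it can help to keep only the records whose DOI matches exactly. A minimal sketch, assuming you have already collected the UID and DOI of each returned record into plain hashes; `exact_doi_matches` is a hypothetical helper and the second candidate is made up for illustration.

```ruby
# Hypothetical stand-ins for records returned by a DO= query. The first pair is
# the example used above; the second is invented to show a partial match.
candidates = [
  { uid: 'WOS:000081515000015', doi: '10.1118/1.598623' },
  { uid: 'WOS:MADE-UP-EXAMPLE', doi: '10.1118/1.5986231' }
]

# Hypothetical helper: keep only the candidates whose DOI equals the requested
# one. DOIs are case-insensitive, so both sides are downcased before comparing.
def exact_doi_matches(candidates, doi)
  wanted = doi.strip.downcase
  candidates.select { |candidate| candidate[:doi].to_s.strip.downcase == wanted }
end

exact_doi_matches(candidates, '10.1118/1.598623').each do |candidate|
  puts candidate[:uid]
end
```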
pmid='29273806'
results = WebOfScience.queries.retrieve_by_id(["MEDLINE:#{pmid}"]).next_batch.to_a;
results.each { |rec| rec.print }
results.map(&:pub_hash)
See the pubmed query that would be sent for a harvest
sunetid='petucket'
author=Author.find_by_sunetid(sunetid);
query = Pubmed::QueryAuthor.new(author, {});nil
query.send(:term) # see the query
# fetch the pmids
pmids_from_query = query.pmids
You can also do this manually:
Pass in a name and a database to search, and get back IDs. The example below returns JSON for the specified author for either Stanford or Princeton, up to 5000 max, searching the pubmed database. Pulled from documentation at https://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=Casciotti,Karen[author] AND (princeton[AD] OR stanford[AD])&retmax=5000&retmode=json
You can also specify a reldate=XXX parameter to look back only the specified number of days.
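The manual URL can also be assembled in Ruby so that the search term is escaped correctly. A minimal sketch using only the standard library; `esearch_url` is a hypothetical helper, not part of sul_pub, and it only builds the URL without sending a request.

```ruby
require 'uri'

ESEARCH_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'

# Hypothetical helper: build an ESearch URL for the pubmed database,
# optionally limited to the last `reldate` days.
def esearch_url(term, reldate: nil, retmax: 5000)
  params = { db: 'pubmed', term: term, retmax: retmax, retmode: 'json' }
  params[:reldate] = reldate if reldate
  "#{ESEARCH_URL}?#{URI.encode_www_form(params)}"
end

term = 'Casciotti,Karen[author] AND (princeton[AD] OR stanford[AD])'
puts esearch_url(term, reldate: 365)
```

Which date field reldate is compared against is controlled by the ESearch datetype parameter; check the ESearch documentation linked above if the default does not suit your case.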
Send an arbitrary query to WoS and fetch the results. See the WoS API documentation for query syntax.
query = 'AU=("Casciotti,Karen") AND AD=("Stanford" OR "Princeton" OR "Woods Hole")'
# fetch UIDs only, not full records (fast)
uids = WebOfScience.queries.user_query_uids(query).merged_uids
# fetch first batch of full records (slower)
retriever = WebOfScience.queries.user_query(query);
results = retriever.next_batch.to_a;
puts retriever.records_found # shows total number of results returned
puts retriever.records_retrieved # show total number returned in the given batch
results.each {|result| puts result.pub_hash[:title]}; # print all of the titles
results = retriever.next_batch.to_a if retriever.next_batch? # get the next batch if available
# fetch all records at once (even slower if there are multiple pages)
results = retriever.send(:merged_records).to_a;
# look at the citations
results.each {|result| puts "#{result.pub_hash[:apa_citation]}\n\r"};nil
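To collect every batch without reaching into private methods, you can loop on next_batch? and next_batch. The sketch below runs outside the console by using `FakeRetriever`, a hypothetical stand-in that mimics only those two methods. It assumes next_batch? is also true before the first batch has been fetched; if the real retriever behaves differently, fetch the first batch before starting the loop.

```ruby
# Hypothetical stand-in for a WoS retriever. It mimics only next_batch? and
# next_batch so that the loop below can run without the application.
class FakeRetriever
  def initialize(records, batch_size)
    @batches = records.each_slice(batch_size).to_a
  end

  # True while there is at least one batch left to fetch.
  def next_batch?
    !@batches.empty?
  end

  # Returns the next batch, or an empty array when nothing is left.
  def next_batch
    @batches.shift || []
  end
end

# Drain a retriever batch by batch and return all records in order.
def all_records(retriever)
  records = []
  records.concat(retriever.next_batch.to_a) while retriever.next_batch?
  records
end

retriever = FakeRetriever.new(%w[rec1 rec2 rec3 rec4 rec5], 2)
puts all_records(retriever).size
```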
e.g. set a really specific query with additional options from the WoS API, such as load_time_span
query = 'AU=("TestUser,First") AND AD=("Stanford" OR "Nebraska")' # publications involving both Stanford and University of Nebraska
params = {database: 'WOS', load_time_span: '4W'}
query_params = WebOfScience::UserQueryRestRetriever::Query.new(params)
retriever = WebOfScience.queries.user_query(query, query_params:);
results = retriever.next_batch.to_a;
puts retriever.records_found # shows total number of results returned
puts retriever.records_retrieved # show total number returned in the given batch
results.each {|result| puts result.pub_hash[:title]} # print all of the titles
results = retriever.next_batch.to_a if retriever.next_batch? # get the next batch if available
e.g. Search for a WoS record by title using WoS API and show the first record as a pub hash
title = "ESTIMATION OF THE AVERAGE WARFARIN MAINTENANCE DOSE IN SAUDI POPULATION"
puts WebOfScience.queries.user_query("TI=\"#{title}\"").next_batch.to_a.first.pub_hash
find all the MEDLINE records
records = WebOfScienceSourceRecord.where(database: 'MEDLINE').map {|src| src.record };
record = records.sample;
record.print # view XML
record.pub_hash # data returned by sul_pub API
- inspect some random records with PMIDs
records = WebOfScienceSourceRecord.where.not(pmid: nil)
.limit(500)
.sample(25)
.map { |src| src.record };
update pub_hash for WoS provenance record
uid='WOS:000425499800006' # WOS:000393359400001
record = WebOfScienceSourceRecord.find_by_uid(uid)
pub = Publication.find_by(wos_uid: uid)
authors = pub.authors
wos_record = WebOfScience::Record.new(record: record.source_data,encoded_record: false)
pub.pub_hash = wos_record.pub_hash
pub.pubhash_needs_update = true
pub.save! # update the pub_hash with wos_record data
# add in any supplementary pubmed data if needed
processor = WebOfScience::ProcessRecords.new(authors.first,WebOfScience::Records.new(records:wos_record.source_data))
processor.send(:pubmed_additions,processor.send(:records))
update pub_hash for a PMID provenance record
pmid='29632959'
pub = Publication.find_by_pmid(pmid)
pmsr = PubmedSourceRecord.find_by_pmid(pub.pmid)
pub.pub_hash = pmsr.source_as_hash
pub.pubhash_needs_update = true
pub.save
OR
pmid='29632959'
pub = Publication.find_by_pmid(pmid)
pub.rebuild_pub_hash
pub.save
refresh a pubmed record with updated data from Pubmed and then rebuild the pub hash (useful for a pubmed provenance record that had a typo in the originally harvested data but is now fixed):
pmid='25277988'
pub = Publication.find_by_pmid(pmid)
pub.update_from_pubmed
update pubmed addition data for an older sciencewire record (useful if pmcid is missing for some reason)
pmid='25277988'
pub = Publication.find_by_pmid(pmid)
pub.send(:add_any_pubmed_data_to_hash)
pub.save
update entire pub_hash for older Sciencewire provenance record
swid='62534957'
pub = Publication.find_by_sciencewire_id(swid)
sw_source_record = SciencewireSourceRecord.find_by_sciencewire_id(pub.sciencewire_id)
pub.build_from_sciencewire_hash(sw_source_record.source_as_hash)
pub.pubhash_needs_update = true
pub.save
OR
pmid='29632959'
pub = Publication.find_by_pmid(pmid)
pub.rebuild_pub_hash
pub.save
Use case: a single author has two author rows with publications associated with each. You want to merge one author into the other, carrying over any existing publications but not duplicating them. This happens when two profiles are created initially because CAP was not able to match the physician information to the faculty information until after both profiles existed. They were "merged" on the CAP side, but the publications were not merged on the SUL-PUB side. This manifests as unexpected behavior (missing pubs, etc.). The rake task takes two cap_profile_ids and merges all of the publications from FROM_CAP_PROFILE_ID's profile into TO_CAP_PROFILE_ID's profile. It then deactivates FROM_CAP_PROFILE_ID's profile (which should now have no publications associated with it) to prevent harvesting into it. NOTE: There is no warning or confirmation, so be sure you have the IDs correct and in the correct order in the parameter list BEFORE you run the rake task. I suggest you confirm in the rails console beforehand.
RAILS_ENV=production bundle exec rake cleanup:merge_profiles[TO_CAP_PROFILE_ID,FROM_CAP_PROFILE_ID] # will merge all publications from cap_profile_id FROM into TO, without duplication
RAILS_ENV=production bundle exec rake cleanup:merge_profiles[123,456] # will merge all publications from cap_profile_id 456 into 123, without duplication
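The "without duplication" behavior amounts to a set union. A plain-Ruby sketch with made-up integer IDs standing in for publication IDs; `merge_publication_ids` is a hypothetical helper and says nothing about how the rake task is actually implemented.

```ruby
# Hypothetical helper: given the publication IDs on the TO and FROM profiles,
# return the merged list and the IDs that actually need to move.
def merge_publication_ids(to_ids, from_ids)
  {
    merged: to_ids | from_ids, # union: shared publications appear once
    moved: from_ids - to_ids   # only these are new to the TO profile
  }
end

result = merge_publication_ids([1, 2, 3], [3, 4])
puts "merged: #{result[:merged].join(',')} moved: #{result[:moved].join(',')}"
```

Comparing the size of the expected merged list with the TO profile's publication count after the task has run is a quick sanity check.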
timeframe = Time.parse('June 11, 2018') # date to go back to look for lots of publications
authors = Contribution.where('created_at > ?',timeframe).where(status:'new').distinct.pluck(:author_id);
author_info = []
authors.each do |author_id|
author = Author.find(author_id)
new_pubs_since_timeframe = author.contributions.where(status:'new').where('created_at > ?',timeframe).size
new_pubs_total = author.contributions.where(status:'new').size
author_info << {cap_profile_id: author.cap_profile_id,name:"#{author.first_name} #{author.last_name}",new_pubs_since_timeframe:new_pubs_since_timeframe,new_pubs_total:new_pubs_total}
end;
author_info.each { |author| puts "#{author[:cap_profile_id]},#{author[:name]},#{author[:new_pubs_since_timeframe]},#{author[:new_pubs_total]}"};
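Joining fields with commas by hand breaks when a name itself contains a comma. A minimal sketch using Ruby's csv library, which quotes such fields; the sample row is made up and has the same keys as the author_info hashes built above.

```ruby
require 'csv'

# Made-up row in the same shape as the author_info hashes built above.
author_info = [
  { cap_profile_id: 123, name: 'Duck, Donald', new_pubs_since_timeframe: 3, new_pubs_total: 10 }
]

columns = %i[cap_profile_id name new_pubs_since_timeframe new_pubs_total]

# Build the CSV text; fields containing commas are quoted automatically.
csv = CSV.generate do |out|
  out << columns
  author_info.each { |row| out << row.values_at(*columns) }
end

puts csv
```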
The WoS Links Client provides additional information, such as the times cited and identifiers.
The example below returns the DOI, PMID and times cited for the given WoS UIDs. If you are starting with DOIs, first look up the WoS UID using one of the queries above.
wos_uids = ["WOS:001061548400001"]
results = WebOfScience.links_client.links(wos_uids)
=> {"WOS:001061548400001"=>{"doi"=>"10.1038/s41387-023-00244-4", "pmid"=>"MEDLINE:37689792"}}
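Note that the PMID in the result above carries a "MEDLINE:" prefix. A minimal sketch for getting the bare PMID out of a result hash of that shape; `bare_pmid` is a hypothetical helper, not part of sul_pub.

```ruby
# Result hash copied from the example output above.
links = {
  'WOS:001061548400001' => { 'doi' => '10.1038/s41387-023-00244-4', 'pmid' => 'MEDLINE:37689792' }
}

# Hypothetical helper: return the PMID without its "MEDLINE:" prefix, or nil
# when the UID is missing or has no pmid entry.
def bare_pmid(links, wos_uid)
  links.dig(wos_uid, 'pmid')&.delete_prefix('MEDLINE:')
end

puts bare_pmid(links, 'WOS:001061548400001')
```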
All data via a Rake Task:
RAILS_ENV=production bundle exec rake wos:links['WOS:000081515000015']
If an author has lots of publications, possibly from a bad harvest, you can remove any and all publications for that author in the 'new' state with a rake task. Be careful, this is destructive. Note that it targets a specific provenance, so you can be more targeted. If you want to remove more than one provenance, just run it more than once:
Use case: a researcher has many new publications due to name ambiguities, because a harvest
was run using last name, first initial and this user was determined to have many publications that
do not actually belong to them. This task will remove any publications associated with their profile
in the 'new' state between the dates specified, and then remove the publications
too if they are no longer connected to any one else's profile and match the specified provenance.
Should be rare in usage and then followed up with another harvest for this profile.
# for cap_profile_id = 202714 for all publications between Jan 1 2010 and Oct 1 2019
RAILS_ENV=production bundle exec rake cleanup:remove_new_contributions[202714,'Jan 1 2010','Oct 1 2019','sciencewire']
RAILS_ENV=production bundle exec rake cleanup:remove_new_contributions[202714,'Jan 1 2010','Oct 1 2019','wos']
RAILS_ENV=production bundle exec rake cleanup:remove_new_contributions[202714,'Jan 1 2010','Oct 1 2019','pubmed']
Sometimes we get bad data from the source (e.g. Web of Science) and this results in typos or all caps in places that the user notices.
We are requested to (1) suggest a data correction to the source and (2) fix locally so the user can see an immediate impact.
To date, all bad data reported to us has come from Web of Science. To suggest a correction, you must first find the publication for the author in question and locate its 'wos_uid':
cap_profile_id=123
author=Author.find_by_cap_profile_id(cap_profile_id);
pub = author.publications.find_by(title:'SOME TITLE HERE') # or find the correct pub any other way
pub.wos_uid
- Visit the WoS search UI (will need to login via SUNET): http://apps.webofknowledge.com.laneproxy.stanford.edu/WOS_GeneralSearch_input.do?product=WOS&search_mode=GeneralSearch
- Select the search field (suggest "Accession Number", using the wos_uid from the publication)
- Find the publication in the search results and select it to get the full publication record
- Scroll down and look at the bottom of the right-hand navigation for the "Suggest a correction" link. This lets you fill in a form to enter the suggested correction.
You can then fix the local publication row in our database. Note that this should be rare because it is manually updating and fixing typos or other issues in someone's publication data. This is generally a bad idea because it doesn't change the source record (e.g. at WoS) and the pub_hash can later be easily overwritten again if it is rebuilt from the source record (though this is not supported for WOS records anyway).
To manually update the local publication record:
cap_profile_id=123
author=Author.find_by_cap_profile_id(cap_profile_id);
pub = author.publications.find_by(title:'SOME TITLE HERE') # or find the correct pub
pub.pub_hash[:title] = 'Some Title Here' # properly case the title or fix as needed
pub.pub_hash[:author].each do |author| # properly case the authors or fix as needed
author[:display_name] = author[:display_name].titlecase
author[:first_name] = author[:first_name].titlecase
author[:last_name] = author[:last_name].titlecase
author[:full_name] = author[:full_name].titlecase
author[:name] = author[:name].titlecase
end
pub.pub_hash[:journal][:name] = pub.pub_hash[:journal][:name].titleize # properly case the journal name or fix as needed
#pub.pub_hash[:other_fields] = '' # any other updates to the pub hash
pub.update_formatted_citations # update the citation
pub.save # save the pub
rec = WebOfScience::Record.new(record: pub.web_of_science_source_record.source_data) # to work with the record and see how it maps
WebOfScience::MapPubHash.new(rec) # the whole pub hash mapped
WebOfScience::MapCitation.new(rec) # part of the record
rec.pub_info # see parts of the record
WebOfScience::MapCitation.new(rec).send(:extract_pages,rec.pub_info["page"]) # extract pages
Look in the script sub-folder for various utility scripts. Run with:
cd sul_pub/current
bundle exec rails runner script/[FILENAME.rb]
RAILS_ENV=production bundle exec rake sul:publication_import_stats['1/1/2022','1/31/2022']
Or manually to fetch the numbers of active authors, numbers of authors added in the last month or in previous months:
Author.where('created_at > ?',1.month.ago).count
=> 417
Author.where('created_at > ?',1.month.ago).where(active_in_cap: true, cap_import_enabled: true).count
=> 141
Contribution.where('created_at > ?',1.month.ago).count
=> 5838
Contribution.select(:author_id).where('created_at > ?',1.month.ago).distinct.count
=> 3022
Contribution.where('created_at > ?',1.month.ago).where(status: 'approved').count
=> 2032
Contribution.where('created_at > ?',1.month.ago).where(status: 'denied').count
=> 629
Contribution.where('created_at > ?',1.month.ago).where(status: 'new').count
=> 3177
Provenance is stored in the pub_hash (publication.pub_hash[:provenance]), but not at the publication model level, making it hard to query. You can try using identifiers though, which are stored at the publication model level and are indexed.
Likely Pubmed provenance (publications with a PMID but not a sciencewire or WOS_UID):
Publication.where(sciencewire_id: nil, wos_uid: nil).where('pmid IS NOT ?', nil).size
Likely WOS provenance (publications with a WOS_UID):
Publication.where('wos_uid IS NOT ?', nil).size
Likely Sciencewire provenance (publications with a sciencewire_id):
Publication.where('sciencewire_id IS NOT ?', nil).size
CAP and Batch provenance likely have all as nil:
Publication.where(sciencewire_id: nil, wos_uid: nil, pmid: nil).size
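The identifier rules above can be summarized as one small function. This is a plain-Ruby sketch; `likely_provenance` is a hypothetical helper, and it returns every candidate rather than choosing between sciencewire and WoS when both identifiers are present, since the rules above do not say which would win.

```ruby
# Hypothetical helper: guess the likely provenance(s) of a publication from
# its three indexed identifiers, following the rules listed above.
def likely_provenance(sciencewire_id: nil, wos_uid: nil, pmid: nil)
  guesses = []
  guesses << 'sciencewire' if sciencewire_id
  guesses << 'wos' if wos_uid
  guesses << 'pubmed' if pmid && guesses.empty?    # a PMID with no other identifier
  guesses << 'cap or batch' if guesses.empty?      # no identifiers at all
  guesses
end

puts likely_provenance(pmid: '29632959').inspect
puts likely_provenance(wos_uid: 'WOS:000425499800006', pmid: '29632959').inspect
puts likely_provenance.inspect
```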
Exports authors and their publications. Specify the number of authors and the minimum number of publications each author must have to be exported. The authors are selected randomly. Their publications are exported to separate csv files by author in a sub-folder called "author_reports". Defaults are 100 authors, a minimum of 5 publications, and output file = 'tmp/random_authors.csv'. Note that since only WoS publications are output, you may get fewer publications output than the minimum specified.
RAILS_ENV=production bundle exec rake sul:author_publications_report[100,5,'tmp/random_authors.csv']