This repository has been archived by the owner on Nov 26, 2019. It is now read-only.

Write new indexing process #348

Closed
hackartisan opened this issue Feb 23, 2017 · 7 comments

@hackartisan
Contributor

ActiveFedora::Base.reindex_everything is inefficient and must be run twice. Given that it takes more than an entire workday to run even a single time, it would be worth rewriting so that we can run it only once and more efficiently.

@hackartisan
Contributor Author

hackartisan commented Feb 23, 2017

Two strategies discussed on Slack for improving efficiency are:

  1. Multithread the process, e.g.: https://github.com/avalonmediasystem/avalon/blob/develop/lib/tasks/avalon.rake#L250-L264
  2. Do not call update_index on every object. Instead, call to_solr on each object, collect the resulting documents, and write them to Solr only once a batch has built up (for example, 1000 documents). Update: Bill Dueber mentioned that he experimented with this and found 100 to be a good batch size.
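A rough sketch of the batching idea in strategy 2, for illustration only. `BatchIndexer` and `FakeSolr` are hypothetical names, not ActiveFedora classes, and the stand-in connection only assumes a client that responds to `add` (with an array of documents) and `commit`, which is the general shape of an RSolr client. In a real task each document would come from calling to_solr on a loaded object.

```ruby
# Hypothetical helper: buffers Solr documents and sends them in batches.
class BatchIndexer
  # connection: anything responding to add(array_of_docs) and commit.
  def initialize(connection, batch_size: 100)
    @connection = connection
    @batch_size = batch_size
    @buffer = []
  end

  # Queue one document; write to Solr only when the buffer is full.
  def add(doc)
    @buffer << doc
    flush if @buffer.size >= @batch_size
  end

  # Send whatever is buffered as a single request.
  def flush
    return if @buffer.empty?
    @connection.add(@buffer)
    @buffer = []
  end

  # Flush the remainder and commit once at the end of the run.
  def finish
    flush
    @connection.commit
  end
end

# Stand-in for a Solr client so the sketch runs without a server.
class FakeSolr
  attr_reader :requests, :commits

  def initialize
    @requests = []
    @commits = 0
  end

  # Record only how many documents arrived in this request.
  def add(docs)
    @requests << docs.size
  end

  def commit
    @commits += 1
  end
end

solr = FakeSolr.new
indexer = BatchIndexer.new(solr, batch_size: 100)
# Placeholder documents; a real task would use the hash returned by to_solr.
250.times { |i| indexer.add("id" => "obj-#{i}") }
indexer.finish
puts solr.requests.inspect # three requests instead of 250
```

The batch size is a parameter because the thread mentions both 1000 and 100 as candidates; the right value would have to be measured.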

@hackartisan
Contributor Author

See also this issue about 2 invocations: samvera/hydra-head#402

@hackartisan
Contributor Author

hackartisan commented Feb 24, 2017

A single baseline run on staging took nearly 10.5 hrs:

$ time bundle exec rake chf:reindex RAILS_ENV=production
reindex complete

real    625m20.498s
user    54m13.081s
sys     1m18.049s

Note that to reindex correctly you must run it twice, so a full reindex would take about 21 hrs.

@hackartisan
Contributor Author

hackartisan commented Feb 24, 2017

Benchmarking a multithreaded run with 2 threads:

$ time bundle exec rake chf:reindex[2] RAILS_ENV=production

Well that didn't work.

real    602m5.382s
user    75m10.674s
sys     1m27.205s
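
For reference, a minimal sketch of what a threaded run of this kind might look like. `each_in_threads` is a hypothetical helper, not the actual chf:reindex task, and the block body is a placeholder for loading an object and updating its Solr document. In MRI, threads like these can only overlap time spent waiting on I/O, so any gain depends on where the run actually spends its time.

```ruby
# Hypothetical helper: run the given block over ids using thread_count threads.
def each_in_threads(ids, thread_count, &block)
  queue = Queue.new
  ids.each { |id| queue << id }
  queue.close # pop returns nil once the queue is closed and drained

  workers = Array.new(thread_count) do
    Thread.new do
      # Each worker pulls ids until the queue is empty.
      while (id = queue.pop)
        block.call(id)
      end
    end
  end
  workers.each(&:join)
end

done = Queue.new # thread-safe collector for the ids that were processed
each_in_threads((1..20).map { |n| "obj-#{n}" }, 2) do |id|
  # A real task would load the object here and update its index entry.
  done << id
end
puts done.size
```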

@jrochkind
Contributor

My PRs have been merged into ActiveFedora to batch Solr adds and to index permissions objects first, which should remove the need to index twice.

Batching Solr adds speeds up reindexing significantly (roughly an order of magnitude).

I am going to put both of these fixes in locally -- not by monkey patching but by making a new class that our reindex task can use. I will add a load-time check for the ActiveFedora version expected to include these fixes, which will warn that we may not need the local code anymore.
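
A load-time check of that kind might look roughly like the sketch below. The method name, the message, and the threshold version are all placeholders: the thread does not say which ActiveFedora release contains the merged fixes, so `FIXED_UPSTREAM_IN` is deliberately set to a dummy value. In the application the installed version would presumably come from `ActiveFedora::VERSION`.

```ruby
require "rubygems" # Gem::Version

# Placeholder threshold; replace with the real release once it is known.
FIXED_UPSTREAM_IN = Gem::Version.new("999.0.0")

# Returns true (and prints a warning) when the installed version is at or past
# the threshold, meaning the local reindexing class may be redundant.
def warn_if_local_reindexer_redundant(installed, fixed_in: FIXED_UPSTREAM_IN, out: $stderr)
  return false if Gem::Version.new(installed) < fixed_in

  out.puts "ActiveFedora #{installed} may already include the reindexing fixes; " \
           "the local reindexer may no longer be needed."
  true
end

# "11.1.0" is only an example version string; with the dummy threshold above
# this call stays silent.
warn_if_local_reindexer_redundant("11.1.0")
```

Comparing with `Gem::Version` rather than plain strings avoids ordering mistakes such as "9.0.0" sorting after "10.0.0".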

@HackMasterA, does this make sense as an approach?

Concurrency may speed it up even more, but I think it may be fast enough to be workable with these changes alone. I can still try working on concurrency if we want it even faster. I'm not totally positive I'll have avoided the need to index twice; we'll have to check that.

@hackartisan
Contributor Author

Sounds great. Feel free to close this ticket without doing any concurrency; the performance gain was definitely the goal, as opposed to the strategy for getting there.

@jrochkind
Contributor

Okay, I can't close the ticket without doing that other stuff first, so I will do that.

It'll still be slow -- it takes 20 minutes just to get all the IDs out of Fedora, plus probably another 20-40 minutes to actually index. But that'll still be an order of magnitude improvement! One step at a time, we'll do that first.
